Commit

Add delimited term frequency token filter documentation (opensearch-project#5043)

* Add token filter documentation
* Add delimited term frequency token filter documentation
* Add phonetic token filter
* Table format fix
* Add script example
* Remove similarity
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
3 people authored and harshavamsi committed Oct 31, 2023
1 parent 0edd4ec commit cc07ac2
Showing 2 changed files with 339 additions and 0 deletions.

_analyzers/token-filters/delimited-term-frequency.md (275 additions)

---
layout: default
title: Delimited term frequency
parent: Token filters
nav_order: 100
---

# Delimited term frequency token filter

The `delimited_term_freq` token filter separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and the term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is its term frequency. If a token does not contain the delimiter, the token filter leaves its term frequency unchanged (the default is 1).

You can either use a preconfigured `delimited_term_freq` token filter or create a custom one.

## Preconfigured `delimited_term_freq` token filter

The preconfigured `delimited_term_freq` token filter uses `|` as the default delimiter. To analyze text with the preconfigured token filter, send the following request to the `_analyze` endpoint:

```json
POST /_analyze
{
  "text": "foo|100",
  "tokenizer": "keyword",
  "filter": ["delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}
```
{% include copy-curl.html %}

The `attributes` array specifies that you want to filter the output of the `explain` parameter to return only `termFrequency`. The response contains both the original token and the parsed output of the token filter that includes the term frequency:

```json
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo|100",
          "start_offset": 0,
          "end_offset": 7,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0,
            "termFrequency": 100
          }
        ]
      }
    ]
  }
}
```
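
If the analyzed text contains no delimiter, the filter leaves the token and its default term frequency of 1 unchanged. For example, analyzing plain `foo` with the same preconfigured filter should return `foo` with a `termFrequency` of 1:

```json
POST /_analyze
{
  "text": "foo",
  "tokenizer": "keyword",
  "filter": ["delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}
```
{% include copy-curl.html %}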

## Custom `delimited_term_freq` token filter

To configure a custom `delimited_term_freq` token filter, first specify the delimiter in the index settings (in this example, `^`):

```json
PUT /testindex
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Then analyze text with the custom token filter you created:

```json
POST /testindex/_analyze
{
  "text": "foo^3",
  "tokenizer": "keyword",
  "filter": ["my_delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}
```
{% include copy-curl.html %}

The response contains both the original token and the parsed version with the term frequency:

```json
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo^3",
          "start_offset": 0,
          "end_offset": 5,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "my_delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0,
            "termFrequency": 3
          }
        ]
      }
    ]
  }
}
```

## Combining `delimited_term_freq` with scripts

You can write Painless scripts that use the indexed term frequencies to calculate custom scores for the documents in the results.

First, create an index and provide the following mappings and settings:

```json
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "keyword_tokenizer": {
          "type": "keyword"
        }
      },
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      },
      "analyzer": {
        "custom_delimited_analyzer": {
          "tokenizer": "keyword_tokenizer",
          "filter": ["my_delimited_term_freq"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "f1": {
        "type": "keyword"
      },
      "f2": {
        "type": "text",
        "analyzer": "custom_delimited_analyzer",
        "index_options": "freqs"
      }
    }
  }
}
```
{% include copy-curl.html %}

The `test` index defines a keyword tokenizer, a delimited term frequency token filter (with `^` as the delimiter), and a custom analyzer that combines the two. The mappings specify that `f1` is a keyword field and `f2` is a text field analyzed with the custom analyzer. Additionally, setting `index_options` to `freqs` tells OpenSearch to add the term frequencies to the inverted index. You'll use the term frequencies to give documents containing repeated terms a higher score.

Next, index two documents using bulk upload:

```json
POST /_bulk?refresh=true
{"index": {"_index": "test", "_id": "doc1"}}
{"f1": "v0|100", "f2": "v1^30"}
{"index": {"_index": "test", "_id": "doc2"}}
{"f2": "v2|100"}
```
{% include copy-curl.html %}
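
You can optionally verify how the custom analyzer parses an indexed value before querying. Analyzing `v1^30` with `custom_delimited_analyzer` should return the token `v1` with a `termFrequency` of 30:

```json
POST /test/_analyze
{
  "analyzer": "custom_delimited_analyzer",
  "text": "v1^30",
  "attributes": ["termFrequency"],
  "explain": true
}
```
{% include copy-curl.html %}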

The following query searches for all documents in the index and calculates document scores as the term frequency of the term `v1` in the field `f2`:

```json
GET /test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": {
          "source": "termFreq(params.field, params.term)",
          "params": {
            "field": "f2",
            "term": "v1"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

In the response, document 1 has a score of 30 because the term frequency of the term `v1` in the field `f2` is 30. Document 2 has a score of 0 because the term `v1` does not appear in `f2`:

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 30,
    "hits": [
      {
        "_index": "test",
        "_id": "doc1",
        "_score": 30,
        "_source": {
          "f1": "v0|100",
          "f2": "v1^30"
        }
      },
      {
        "_index": "test",
        "_id": "doc2",
        "_score": 0,
        "_source": {
          "f2": "v2|100"
        }
      }
    ]
  }
}
```
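
To return only documents that contain the term while still scoring by term frequency, you can replace `match_all` with a `match` query. By default, `function_score` multiplies the query score by the script value:

```json
GET /test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "f2": "v1"
        }
      },
      "script_score": {
        "script": {
          "source": "termFreq(params.field, params.term)",
          "params": {
            "field": "f2",
            "term": "v1"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}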

## Parameters

The following table lists all parameters that the `delimited_term_freq` token filter supports.

Parameter | Required/Optional | Description
:--- | :--- | :---
`delimiter` | Optional | The delimiter used to separate tokens from term frequencies. Must be a single non-null character. Default is `|`.
