Add delimited term frequency token filter documentation #5043

kolchfa-aws · 2023-09-18T19:33:19Z

Fixes #4986

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws · 2023-09-18T19:39:43Z

@noCharger Could you please review this PR for technical accuracy?


noCharger

Thank you; this looks fantastic! Could we include an example of how this token filter works in conjunction with the `termFreq` method in a Painless script?

Also, I would check with @russcam, who is the author of the original PR.

noCharger · 2023-09-19T16:45:56Z

_analyzers/token-filters/index.md

+has_toc: false
+---
+
+# Token filters


This is amazing! I would like to double-check this list against what OpenSearch supports: https://github.com/opensearch-project/OpenSearch/tree/2a5b124ee8ef4376d62c484b6cd3ea1d98ca75d1/modules/analysis-common/src/main/java/org/opensearch/analysis/common

I checked against this list and also ran/tested all token filters to verify that they work.


kolchfa-aws · 2023-09-19T20:39:00Z

@russcam Could you review this documentation PR please?

noCharger · 2023-09-19T21:01:05Z

_analyzers/token-filters/delimited-term-frequency.md

+      },
+      "f2": {
+        "type": "text",
+        "similarity": "BM25",


My bad, this similarity setting is not really relevant to this example. I am fine with removing it.


vagimeli

Great work! Minimal comments/changes.


vagimeli · 2023-09-19T21:21:06Z

_analyzers/token-filters/delimited-term-frequency.md

+
+The following table lists all parameters the `delimited_term_freq` supports.
+
+Parameter | Required/Optional | Description


I like how you styled this heading. I'll follow the same format.

vagimeli · 2023-09-19T21:34:09Z

_analyzers/token-filters/index.md

+`trim` | [TrimFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing whitespace from each token in a stream. 
+`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. 
+`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream. 
+`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. 


Should "lowercase" be "uppercase"?

Yes, thank you!


natebower

@kolchfa-aws Great job on this. Only minimal changes. Thanks!

natebower · 2023-09-20T12:16:15Z

_analyzers/token-filters/delimited-term-frequency.md

+```
+{% include copy-curl.html %}
+
+The `attributes` array specifies that you want to filter the output of the `explain` parameter to return only `termFrequency`. The response contains both the original token and the parsed output of the token filter that includes term frequency:


Should "the" precede "term frequency"?
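For readers following this thread, here is a minimal Python sketch of the split that the quoted sentence describes: a token such as `foo|5` is separated into the token text and its term frequency. This is an illustration only, not the filter's implementation. The function name `parse_delimited_token` is hypothetical, and the `|` default delimiter, the fallback frequency of 1 for tokens without a delimiter, and the positive-integer check are assumptions made for the sketch.

```python
from typing import Tuple


def parse_delimited_token(token: str, delimiter: str = "|") -> Tuple[str, int]:
    """Split `token` at the first delimiter into (term, term frequency).

    Hypothetical helper for illustration; assumes a single-character
    delimiter and an integer frequency after it.
    """
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")

    term, sep, freq = token.partition(delimiter)
    if not sep:
        # Assumption: a token without a delimiter counts once.
        return token, 1

    value = int(freq)  # raises ValueError if the suffix is not an integer
    if value <= 0:
        raise ValueError("term frequency must be a positive integer")
    return term, value


if __name__ == "__main__":
    # Print the original token next to its parsed form.
    for raw in "foo|5 bar|2 baz".split():
        print(raw, "->", parse_delimited_token(raw))
```

The sketch returns both parts so that the original token and the parsed output can be shown side by side, which mirrors what the quoted sentence says the response contains.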


natebower · 2023-09-20T12:19:22Z

_analyzers/token-filters/delimited-term-frequency.md

+```
+{% include copy-curl.html %}
+
+In the response, document 1 has a score of 30 because the term frequency of the term `v1` in the field `f2` is 30. Document 2 has a score of 0 because the term `v1` does not appear in `f2`:


Should the first instance of "document" be capitalized?

I don't think so, because it's not the proper name of a document.
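The scoring described in the quoted sentence can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the Painless script from the PR: the document contents below are hypothetical (the PR's actual sample documents are not shown in this thread), `term_freq` and `score` are illustrative names, and the score is simply assumed to equal the stored term frequency of the term in the field.

```python
from typing import Dict


def term_freq(field_value: str, term: str, delimiter: str = "|") -> int:
    """Sum the stored frequency of `term` across whitespace-separated tokens."""
    total = 0
    for token in field_value.split():
        text, sep, freq = token.partition(delimiter)
        if text == term:
            # Assumption: a token without a delimiter counts once.
            total += int(freq) if sep else 1
    return total


def score(doc: Dict[str, str], field: str, term: str) -> int:
    """Score a document by the term frequency of `term` in `field` (0 if absent)."""
    return term_freq(doc.get(field, ""), term)


# Hypothetical documents chosen to reproduce the scores mentioned above.
docs: Dict[str, Dict[str, str]] = {
    "1": {"f1": "v0|100", "f2": "v1|30"},
    "2": {"f2": "v2|10"},
}

scores = {doc_id: score(doc, "f2", "v1") for doc_id, doc in docs.items()}

if __name__ == "__main__":
    print(scores)
```

With these made-up documents, document 1 scores 30 because `v1` carries a frequency of 30 in `f2`, and document 2 scores 0 because `v1` does not appear in its `f2` field.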


natebower · 2023-09-20T12:31:11Z

_analyzers/token-filters/index.md

+`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`,      `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
+`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
+`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
+`synonym` | N/A | Supplies a synonym list to the analysis process. The synonym list is provided using a configuration file.


"for" instead of "to"?

natebower · 2023-09-20T12:31:30Z

_analyzers/token-filters/index.md

+`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
+`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
+`synonym` | N/A | Supplies a synonym list to the analysis process. The synonym list is provided using a configuration file.
+`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, to the analysis process.


"for" instead of "to"?


…roject#5043)

* Add token filter documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Add delimited term frequency token filter documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Add phonetic token filter
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Table format fix
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Add script example
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Remove similarity
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>


kolchfa-aws added 2 commits September 18, 2023 11:27

Add token filter documentation

a72a7ef

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Add delimited term frequency token filter documentation

0ffb6bc

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws self-assigned this Sep 18, 2023

kolchfa-aws requested review from cwillum, hdhalter, Naarcha-AWS, vagimeli, ananzh, seanneumann, AMoo-Miki and natebower as code owners September 18, 2023 19:33

kolchfa-aws added the v2.10.0, release-notes (PR: Include this PR in the automated release notes), and 3 - Tech review (PR: Tech review in progress) labels Sep 18, 2023

Merge branch 'main' into delimited-token-filter

2a6b0c0

kolchfa-aws added 3 commits September 18, 2023 15:57

Add phonetic token filter

47127e9

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Merge branch 'delimited-token-filter' of https://github.com/opensearc…

cc5a66d

…h-project/documentation-website into delimited-token-filter

Table format fix

5b6de22

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

noCharger reviewed Sep 19, 2023


Add script example

7ea54a1

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

noCharger approved these changes Sep 19, 2023


Remove similarity

1641190

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

vagimeli approved these changes Sep 19, 2023


kolchfa-aws commented Sep 19, 2023

_analyzers/token-filters/index.md (outdated, resolved)

Apply suggestions from code review

cb2bbaf

Co-authored-by: Melissa Vagi <vagimeli@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws commented Sep 19, 2023

_analyzers/token-filters/index.md (outdated, resolved)

kolchfa-aws commented Sep 19, 2023

_analyzers/token-filters/index.md (outdated, resolved)

Apply suggestions from code review

7594f70

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

hdhalter added the 5 - Editorial review (PR: Editorial review in progress) label and removed the 3 - Tech review (PR: Tech review in progress) label Sep 20, 2023

natebower reviewed Sep 20, 2023


kolchfa-aws commented Sep 20, 2023

_analyzers/token-filters/index.md (outdated, resolved)

kolchfa-aws commented Sep 20, 2023

_analyzers/token-filters/index.md (outdated, resolved)

kolchfa-aws commented Sep 20, 2023

_analyzers/token-filters/delimited-term-frequency.md (outdated, resolved)

Apply suggestions from code review

e65e50f

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws added the 6 - Done but waiting to merge (PR: The work is done and ready to merge) label and removed the 5 - Editorial review (PR: Editorial review in progress) label Sep 20, 2023

kolchfa-aws merged commit e44a4e7 into main Sep 22, 2023
5 checks passed

Naarcha-AWS deleted the delimited-token-filter branch March 28, 2024 23:23
