
Allow specific characters in token_chars of edge ngram tokenizer in addition to classes #25894

Closed
edudar opened this issue Jul 25, 2017 · 11 comments · Fixed by #49250
Labels
>enhancement, help wanted, adoptme, :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


edudar commented Jul 25, 2017

I bumped into this while implementing autocomplete/typeahead functionality with highlighting.

My index settings are:

  analysis:
    tokenizer:
      autocomplete_highlight:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
        token_chars: ["letter", "digit"]
    filter:
      autocomplete_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
    analyzer:
      autocomplete_index:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
      autocomplete_search:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, stop]
      autocomplete_highlight:
        type: custom
        tokenizer: autocomplete_highlight
        filter: [standard, icu_normalizer, icu_folding]

I search on the autocomplete field and highlight on autocomplete_highlight. Everything works fine until a _ appears in the search query: icu_tokenizer keeps it, while the autocomplete_highlight tokenizer removes it, since it keeps only letters and digits. There is no way to keep just _; the only option is the full punctuation class, which brings in a whole load of additional symbols that I don't need and that have to go.
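To see the mismatch, here is a minimal reproduction against the _analyze API (a sketch; it assumes an index named my_index created with the settings above and the analysis-icu plugin installed):

    POST my_index/_analyze
    {
      "tokenizer": "icu_tokenizer",
      "text": "foo_bar"
    }

    POST my_index/_analyze
    {
      "tokenizer": "autocomplete_highlight",
      "text": "foo_bar"
    }

The first call returns foo_bar as a single token; the second splits the input at the underscore and emits f, fo, foo, b, ba, bar, so the highlight tokens no longer line up with the search tokens.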

It would be helpful to be able to specify exact characters to keep, like _.

At the moment I've implemented a char_filter that replaces _ with -, but that's suboptimal: _ is considered part of a word (same as in icu_tokenizer) and is expected to match rather than be ignored.
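For reference, that workaround is roughly the following (a sketch; underscore_to_hyphen is a hypothetical name for a standard mapping char filter, which would then be listed in the char_filter of the analyzers above):

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "underscore_to_hyphen": {
              "type": "mapping",
              "mappings": ["_ => -"]
            }
          }
        }
      }
    }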

@edudar changed the title from "Allow specific characters in token_char of edge ngram tokenizer in addition to classes" to "Allow specific characters in token_chars of edge ngram tokenizer in addition to classes" Jul 25, 2017
@nik9000 added the :Search Relevance/Analysis (How text is split into tokens) label Jul 25, 2017
@andrewthad

I would like this as well, except that I need it for the ngram tokenizer, not the edge ngram tokenizer. Basically, I have a bunch of logs that end up in Elasticsearch, and the only character I need to be sure will break up tokens is a comma. Anything else is fair game for inclusion.

@DanielBaird

I also need to specify an individual character in my edgeNGram filter's token_chars. In my situation, I need a hyphen - to be counted as an in-token character, without including brackets and whatnot.

@romseygeek
Contributor

cc @elastic/es-search-aggs


jcox commented May 7, 2018

Not being able to specify an individual character like '_' in the edge_ngram tokenizer's token_chars list is particularly painful because the standard ES tokenizer uses the Unicode Text Segmentation algorithm (UAX #29), which keeps underscores inside tokens.

It would be nice to be able to index with edge_ngram & search with the standard tokenizer. Currently, you can't if you have strings like "foo_bar".
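A quick check of the standard tokenizer's behavior (a minimal _analyze sketch):

    POST _analyze
    {
      "tokenizer": "standard",
      "text": "foo_bar"
    }

This returns foo_bar as a single token, whereas an edge_ngram tokenizer limited to letter and digit splits the same input at the underscore.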

@borstelmahlsdorf

I also need a way to specify a single char in token_chars, in my case a -.


Piersen commented Sep 18, 2018

In my case, being able to specify token chars as any character except for specified delimiters would be useful

@cbuescher added the help wanted and adoptme labels Apr 9, 2019
@cbuescher self-assigned this Apr 9, 2019

poef124 commented May 21, 2019

Is there a workaround to achieve this? I want to include certain punctuation like . and : but not all the others.

@kumropotash1

Plus one from me.
I want a hyphen - to be considered as a char.


seal256 commented Jul 25, 2019

Need this feature


zion3mx commented Sep 30, 2019

I need this feature too, together with the letter and digit classes.

@binkymilk

I need this feature too for the ngram tokenizer! For both _ and -.

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Nov 18, 2019
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to include in a token.

Closes elastic#25894
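Based on that description, the tokenizer from the original report could keep underscores along these lines (a sketch using the current edge_ngram type name; my_index is a placeholder):

    PUT my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "autocomplete_highlight": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 15,
              "token_chars": ["letter", "digit", "custom"],
              "custom_token_chars": "_"
            }
          }
        }
      }
    }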
cbuescher pushed a commit that referenced this issue Nov 20, 2019
cbuescher pushed a commit that referenced this issue Nov 20, 2019
@javanna added the Team:Search Relevance label Jul 12, 2024