
Allow specific characters in token_chars of edge ngram tokenizer in addition to classes #25894

Closed
edudar opened this issue Jul 25, 2017 · 11 comments · Fixed by #49250
Labels
>enhancement, help wanted, adoptme, :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


edudar commented Jul 25, 2017

I bumped into this while implementing autocomplete/typeahead functionality with highlighting.

My index settings are:

  analysis:
    tokenizer:
      autocomplete_highlight:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
        token_chars: ["letter", "digit"]
    filter:
      autocomplete_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
    analyzer:
      autocomplete_index:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
      autocomplete_search:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, stop]
      autocomplete_highlight:
        type: custom
        tokenizer: autocomplete_highlight
        filter: [standard, icu_normalizer, icu_folding]

I search on the autocomplete field and highlight on autocomplete_highlight. Everything works fine until a _ appears in the search query: icu_tokenizer keeps it, while the autocomplete_highlight tokenizer removes it, since it keeps only letters and digits. There is no way to keep just _; the only option is the full punctuation class, which brings in a whole load of additional symbols that I don't need and that have to go.
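To see the mismatch, here is a minimal reproduction against the _analyze API (a sketch; it assumes an index named my_index created with the settings above and the analysis-icu plugin installed):

    POST my_index/_analyze
    {
      "tokenizer": "icu_tokenizer",
      "text": "foo_bar"
    }

    POST my_index/_analyze
    {
      "tokenizer": "autocomplete_highlight",
      "text": "foo_bar"
    }

The first call returns foo_bar as a single token; the second splits the input at the underscore and emits f, fo, foo, b, ba, bar, so the highlight tokens no longer line up with the search tokens.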

It would be helpful to be able to specify exact characters to keep, like _.

At the moment I've implemented a char_filter that replaces _ with -, but that's suboptimal: _ is considered part of a word (same as in icu_tokenizer) and is expected to match rather than be ignored.
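For reference, that workaround is roughly the following (a sketch; underscore_to_hyphen is a hypothetical name for a standard mapping char filter, which would then be listed in the char_filter of the analyzers above):

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "underscore_to_hyphen": {
              "type": "mapping",
              "mappings": ["_ => -"]
            }
          }
        }
      }
    }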

@edudar changed the title from "Allow specific characters in token_char of edge ngram tokenizer in addition to classes" to "Allow specific characters in token_chars of edge ngram tokenizer in addition to classes" Jul 25, 2017
@nik9000 added the :Search Relevance/Analysis (How text is split into tokens) label Jul 25, 2017
@andrewthad

I would like this as well, except that I need it for the ngram tokenizer, not the edge ngram tokenizer. Basically, I have a bunch of logs that end up in Elasticsearch, and the only character I need to be sure will break up tokens is a comma. Anything else is fair game for inclusion.

@DanielBaird

I also need to specify an individual character in my edgeNGram filter's token_chars. In my situation, I need a hyphen - to be counted as an in-token character, without including brackets and whatnot.

@romseygeek
Contributor

cc @elastic/es-search-aggs


jcox commented May 7, 2018

Not being able to specify an individual character like '_' in the edge_ngram tokenizer's token_chars list is particularly painful because the standard ES tokenizer uses the Unicode Text Segmentation algorithm (UAX #29), which keeps underscores inside tokens.

It would be nice to be able to index with edge_ngram & search with the standard tokenizer. Currently, you can't if you have strings like "foo_bar".
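A quick check of the standard tokenizer's behavior (a minimal _analyze sketch):

    POST _analyze
    {
      "tokenizer": "standard",
      "text": "foo_bar"
    }

This returns foo_bar as a single token, whereas an edge_ngram tokenizer limited to letter and digit splits the same input at the underscore.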

@borstelmahlsdorf

I also need a way to specify a single char in token_chars, in my case a -.


Piersen commented Sep 18, 2018

In my case, being able to specify token chars as any character except for specified delimiters would be useful

@cbuescher added the help wanted and adoptme labels Apr 9, 2019
@cbuescher self-assigned this Apr 9, 2019

poef124 commented May 21, 2019

Is there a workaround to achieve this? I want to include certain punctuation like . and : but not all the others.

@kumropotash1

Plus one from me.
I want a hyphen - to be considered as a char.


seal256 commented Jul 25, 2019

Need this feature


zion3mx commented Sep 30, 2019

I need this feature too, together with the letter and digit classes.

@binkymilk

I need this feature too for the ngram tokenizer! For both _ and -.

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Nov 18, 2019
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to include in a token.

Closes elastic#25894
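Based on that description, the tokenizer from the original report could keep underscores along these lines (a sketch using the current edge_ngram type name; my_index is a placeholder):

    PUT my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "autocomplete_highlight": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 15,
              "token_chars": ["letter", "digit", "custom"],
              "custom_token_chars": "_"
            }
          }
        }
      }
    }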
cbuescher pushed a commit that referenced this issue Nov 20, 2019
cbuescher pushed a commit that referenced this issue Nov 20, 2019
@javanna added the Team:Search Relevance label Jul 12, 2024