Allow specific characters in token_chars of edge ngram tokenizer in addition to classes #25894
Comments
I would like this as well, except that I need it for the ngram tokenizer, not the edge ngram tokenizer. Basically, I have a bunch of logs that end up in Elasticsearch, and the only character I need to be sure will break up tokens is a comma. Anything else is fair game for inclusion.
I also need to specify an individual character in my `token_chars` configuration.
cc @elastic/es-search-aggs
Not being able to specify an individual character like `_` in the edge_ngram tokenizer's `token_chars` list is particularly painful, because the `standard` ES tokenizer uses the Unicode Text Segmentation algorithm, which treats the underscore as a word character. It would be nice to be able to index with edge_ngram and search with the standard tokenizer. Currently, you can't if you have strings like "foo_bar".
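For illustration, a sketch of the mismatch using the `_analyze` API (the edge_ngram settings below are hypothetical examples, not taken from this thread):

```json
POST _analyze
{
  "tokenizer": "standard",
  "text": "foo_bar"
}

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 5,
    "token_chars": ["letter", "digit"]
  },
  "text": "foo_bar"
}
```

The first call emits the single token `foo_bar`; the second splits at the underscore and emits grams like `fo`, `foo`, `ba`, `bar`, so a standard-tokenized query for `foo_bar` never matches the indexed grams.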
I also need a way to specify a single char in `token_chars`, in my case a "-".
In my case, being able to specify token chars as any character except for specified delimiters would be useful.
Is there a workaround to achieve this? I want to include certain punctuation characters like `.` and `:` but not all the others.
Plus one from me.
Need this feature.
I need this feature as well.
I need this feature too.
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows a list of predefined character classes, which might not fit every use case. For example, including the underscore "_" in a token would currently require the `punctuation` class, which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to include in a token. Closes #25894
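With that change, keeping the underscore would look something like the sketch below (index name and gram sizes are illustrative):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "_"
        }
      }
    }
  }
}
```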
I bumped into this while implementing autocomplete/typeahead functionality with highlighting.
My index settings are:
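Roughly, a minimal sketch of the kind of setup described here, assuming an `edge_ngram` tokenizer on the highlight side and the `icu_tokenizer` for search (gram sizes and analyzer names are assumptions):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_highlight": {
          "type": "custom",
          "tokenizer": "highlight_edge_ngram"
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      },
      "tokenizer": {
        "highlight_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
```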
I do the search by the `autocomplete` field and highlight on `autocomplete_highlight`. Everything works fine until I meet `_` in a search query. `icu_tokenizer` keeps it, while the `autocomplete_highlight` tokenizer removes it, as it keeps letters and digits only. Here I can't keep `_` alone, only the full `punctuation` class instead, which comes with a whole load of additional symbols that I don't need and that have to go. It would be helpful to be able to specify exact characters to keep, like `_`.
At the moment I've implemented a char_filter that replaces `_` with `-`, but that's suboptimal, as `_` is considered a part of words (same as in `icu_tokenizer`) and is expected to match rather than being ignored.
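For reference, that workaround might look like the following sketch (the filter name is invented), using the built-in `mapping` char filter:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_hyphen": {
          "type": "mapping",
          "mappings": ["_ => -"]
        }
      },
      "analyzer": {
        "autocomplete_search": {
          "type": "custom",
          "char_filter": ["underscore_to_hyphen"],
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}
```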