-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Adding a char_group tokenizer #24186
Conversation
=== Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
Are you sure that there is an overhead when using a pattern tokenizer for single char tokenization ? |
@jimczi yes, at a client's site and it's measurable even for a simple regex pattern. From experience, maintaining a plugin is a considerable overhead and this seems quite a legit and useful addition, considering the existence of the |
Fair enough. I marked this PR with the discuss label. It is also unclear to me if this tokenizer should be added to Lucene first. @jpountz may have a better answer regarding this. And thanks @synhershko for submitting your PR. |
Sorry for the late reply, I had totally missed the ping. @synhershko I think this new tokenizer could help. The main question to me is how we should allow the list of token chars to be configured. We already have a |
To @jimczi 's question, I think Elasticsearch is the right place to put this tokenizer given that Lucene does not need more than its existing CharTokenizer? |
Ping @synhershko is this still something you're interested in pursuing? |
@dakrone sure, the PR with all required changes are here - not sure what you are asking? :) |
@synhershko I think @jpountz's questions from his last comment are still unanswered, other than that I'm not sure what else needs to happen, @jpountz will likely have a better idea |
Aye, responding below
That's just a naming question - fine by me to change that.
To be honest I don't remember the actual use-case, but I do remember it was important enough to call for this PR :) I assume this was an attempt to efficiently parse incoming user data with custom separators. We all know those type of systems... The Pattern replace tokenizer would most likely help that cause but it does come with performance penalty.
Using character classes does look tempting for simplicity reasons, but in most likelihood this tokenizer will mostly be useful for it's granularity. Do note |
Maybe we could make it take eg. |
Sounds good. You want me to go ahead and apply this change? If so, what should we do if |
@synhershko Yes, something like that. I suspect your example didn't mean to be all-inclusive, but just in case, supporting |
Actually I just realized that reusing |
Oh right, that might be the original reason why I chose that naming ;) been too long. I pushed an update to the code - can you confirm structure and config look good? if so I will update tests |
/cc @elastic/es-search-aggs |
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
@elasticmachine test this please |
@synhershko I took the liberty of doing some minor changes like moving the code to analysis-common before merging this PR. Sorry for the long delay. |
=== Char Group Tokenizer The `char_group` tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the <<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable. === Configuration The `char_group` tokenizer accepts one parameter: `tokenize_on_chars`:: A string containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Also supports escaped values like `\\n` and `\\f`, and in addition `\\s` to represent whitespace, `\\d` to represent digits and `\\w` to represent letters. Defaults to an empty list. === Example output ```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2``` When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the above sentence would produce the following terms: ```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone, for, $2 ]```
* master: QA: Add xpack tests to rolling upgrade (#30795) Modify state of VerifyRepositoryResponse for bwc (#30762) Reduce CLI scripts to one-liners on Windows (#30772) Simplify number of shards setting (#30783) Replace Request#setHeaders with addHeader (#30588) [TEST] remove endless wait in RestClientTests (#30776) [Docs] Fix script-fields snippet execution (#30693) Upgrade to Lucene-7.4.0-snapshot-cc2ee23050 (#30778) [DOCS] Add SAML configuration information (#30548) [DOCS] Remove X-Pack references from SQL CLI (#30694) Make http pipelining support mandatory (#30695) [Docs] Fix typo in circuit breaker docs (#29659) [Feature] Adding a char_group tokenizer (#24186) [Docs] Fix broken cross link in documentation Test: wait for netty threads in a JUnit ClassRule (#30763) Increase the maximum number of filters that may be in the cache. (#30655) [Security] Include an empty json object in an json array when FLS filters out all fields (#30709) [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk Add more yaml tests for get alias API (#29513) Ignore empty completion input (#30713) [DOCS] fixed incorrect default [ML] Filter undefined job groups from update calendar actions (#30757) Fix docs failure on language analyzers (#30722) [Docs] Fix inconsistencies in snapshot/restore doc (#30480) Enable installing plugins from snapshots.elastic.co (#30765) Remove fedora 26, add 28 (#30683) Accept Gradle build scan agreement (#30645) Remove logging from elasticsearch-nio jar (#30761) Add Delete Repository High Level REST API (#30666)
* 6.x: [DOCS] Fixes typos in security settings Add support for indexed shape routing in geo_shape query (#30760) [DOCS] Splits auditing.asciidoc into smaller files Painless: Types Section Clean Up (#30283) [test] java tests for archive packaging (#30734) Deprecate http.pipelining setting (#30786) [DOCS] Fix more edit URLs in Stack Overview (#30704) Use correct cluster state version for node fault detection (#30810) [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (#30797) Change serialization version of doc-value fields. Add a `format` option to `docvalue_fields`. (#29639) [TEST] Don't expect acks when isolating nodes Fixes UpdateSettingsRequestStreamableTests mutate bug Revert "Add more yaml tests for get alias API (#29513)" Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (#30780) Only ack cluster state updates successfully applied on all nodes (#30672) Replace Request#setHeaders with addHeader (#30588) [TEST] remove endless wait in RestClientTests (#30776) QA: Add xpack tests to rolling upgrade (#30795) Add support for search templates to the high-level REST client. (#30473) Reduce CLI scripts to one-liners on Windows (#30772) Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561) Add more yaml tests for get alias API (#29513) [Docs] Fix script-fields snippet execution (#30693) Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182) Remove assert statements from field caps documentation. (#30601) Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181) Add support for field capabilities to the high-level REST client. (#29664) [DOCS] Add SAML configuration information (#30548) [DOCS] Remove X-Pack references from SQL CLI (#30694) [Docs] Fix typo in circuit breaker docs (#29659) [Feature] Adding a char_group tokenizer (#24186) Increase the maximum number of filters that may be in the cache. (#30655) [Docs] Fix broken cross link in documentation Test: wait for netty threads in a JUnit ClassRule (#30763) [Security] Include an empty json object in an json array when FLS filters out all fields (#30709) [DOCS] fixed incorrect default [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk Enable installing plugins from snapshots.elastic.co (#30765) Ignore empty completion input (#30713) Fix docs failure on language analyzers (#30722) [Docs] Fix inconsistencies in snapshot/restore doc (#30480) Add Delete Repository High Level REST API (#30666) Reduce CLI scripts to one-liners (#30759)
Char Group Tokenizer
The
char_group
tokenizer breaks text into terms whenever it encountersa
character which is in a defined set. It is mostly useful for cases where
a simple
custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer,
pattern
tokenizer>>is not acceptable.
Configuration
The
char_group
tokenizer accepts one parameter:tokenize_on_chars
::A string containing a list of characters to tokenize the string on.
Whenever a character
from this list is encountered, a new token is started. Also supports
escaped values like
\\n
and\\f
,and in addition
\\s
to represent whitespace,\\d
to representdigits and
\\w
to represent letters.Defaults to an empty list.
Example output
The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2
When the configuration
\\s-:<>
is used fortokenize_on_chars
, theabove sentence would produce the following terms: