Support analyzer for keyword type #18064
Comments
…lastic#19028 This is the same as what Lucene does for its analysis factories, and we have tests that make sure that the Elasticsearch factories are in sync with Lucene's. This is a first step to move forward on elastic#9978 and elastic#18064.
Most of the work needed to implement this feature has been merged into Lucene and will be available in 6.2, where Analyzer got a new method for this. Note that it would NOT work for the path tokenization use-case mentioned above, since it has a restriction that it can only generate a single token, so such use-cases would have to be handled differently, eg. using an ingest processor. I am wondering if we should use a different property name, eg.:

```json
"my_field": {
  "type": "keyword",
  "normalizer": "standard"
}
```

This would avoid potential confusion about what happens with analyzers that generate multiple tokens, and make it clearer that only normalization is applied.
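For reference, the `normalizer` property described in the commit message below is configured through index analysis settings. A hedged sketch of what such a setup can look like (the index name, normalizer name, and exact mapping shape are illustrative and vary across Elasticsearch versions):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "city": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```

Only per-character filters such as `lowercase` and `asciifolding` make sense here, since the whole value stays a single token.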
That would complicate the process but I guess we have to live with that. At least, we have a workaround.
Totally agree.
Instead of calling it a "normalizer", I'd call it by its name
I think I agree with that. I initially thought that maybe integration with https://issues.apache.org/jira/browse/LUCENE-7355 would make sense, but maybe we should just apply a list of token filters manually, this would probably be simpler.
Yeah, I think it's a much simpler approach than involving a query parser here. No need for one IMO. Also please note that order matters in the
What about character filters? They can also be useful here. My initial thought was to keep it as
Hi guys, great to see you have an enhancement for this requirement! Any idea how I can support case-insensitive search on a `keyword` type field (which I also use for aggregations) in v5.0? In ES 2.3 I used: But that does not seem to work without enabling fielddata in ES 5. Any workaround I can use for now?
You can use an ingest pipeline to lowercase your field.
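The ingest workaround suggested above can be sketched with the `lowercase` ingest processor; the pipeline name and field name here are illustrative:

```json
PUT _ingest/pipeline/lowercase-city
{
  "description": "Lowercase the city field before indexing",
  "processors": [
    { "lowercase": { "field": "city" } }
  ]
}
```

Documents indexed with `?pipeline=lowercase-city` then have the field lowercased before it reaches the mapping, at the cost of also altering the stored `_source` (unlike a normalizer).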
This adds a new `normalizer` property to `keyword` fields that pre-processes the field value prior to indexing, but without altering the `_source`. Note that only the normalization components that work on a per-character basis are applied, so for instance stemming filters will be ignored while lowercasing or ascii folding will be applied. Closes elastic#18064
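To illustrate the "per-character basis" restriction in the commit message above, here is a minimal Python sketch (not Elasticsearch code; the function name is hypothetical) of what a normalizer built from `lowercase` and `asciifolding` does: the value stays one token, and only character-level transforms apply.

```python
import unicodedata

def normalize_keyword(value: str) -> str:
    """Sketch of a keyword normalizer: no tokenization, no stemming,
    only per-character transforms (ASCII folding, then lowercasing)."""
    # ASCII-fold: decompose accented characters, then drop combining marks.
    folded = "".join(
        ch for ch in unicodedata.normalize("NFKD", value)
        if not unicodedata.combining(ch)
    )
    return folded.lower()

print(normalize_keyword("San Francisco"))  # san francisco
print(normalize_keyword("Köln"))           # koln
```

Because every input character maps independently, all casing variants of a value collapse to the same indexed term, which is exactly what a terms aggregation needs.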
I understand from this thread that the ability to sort case-insensitively has been added. But how? Is there documentation or an example available?
@wgerlach : I've added an example of a lowercase/asciifolding normalizer on the Elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5
Currently, using the path_hierarchy tokenizer means we can't aggregate on the field. That means we would have to set `"fielddata": true`, which comes with a memory cost. This was discussed but not solved in elastic/elasticsearch#18064. Perhaps fielddata would be OK for file paths like ours, but this seems safer.
Thanks a million!
Sometimes you want to analyze text to make it consistent when running aggregations on top of it.

For example, let's say I have a `city` field mapped as a `keyword`. This field can contain `San Francisco`, `SAN FRANCISCO`, `San francisco`... If I build a terms aggregation on top of it, I will end up with one bucket per casing variant.

I'd like to be able to analyze this text before it gets indexed. Of course I could use a `text` field instead and set `fielddata: true`, but that would not create doc values for this field.

I can imagine that we allow an analyzer at index time for this field. We can restrict its usage if we wish and only allow analyzers which use tokenizers like `lowercase`, `keyword`, `path`, but I would let the user decide. If we allow setting `analyzer: simple` for example, the different casings would collapse together in my aggregation.

The same applies to the path tokenizer. Let's say I'm building a dir tree. Applying a path tokenizer would help me generate an aggregation with one bucket per level of the path hierarchy.
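To make the path use-case concrete, here is a small Python sketch (function name and example path are illustrative, not Elasticsearch code) of the token stream a `path_hierarchy`-style tokenizer emits: one token per prefix of the path. Because it produces multiple tokens, this cannot be expressed as a single-token normalizer, which is why the thread suggests handling it differently.

```python
def path_hierarchy(path: str, delimiter: str = "/") -> list[str]:
    """Emit one token per prefix of the path, mimicking a
    path_hierarchy-style tokenizer."""
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i + 1]) for i in range(len(parts))]

print(path_hierarchy("/var/log/nginx"))
# ['/var', '/var/log', '/var/log/nginx']
```

Aggregating on these tokens yields a bucket for every directory level as well as for the full path.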