This repository has been archived by the owner on Apr 5, 2021. It is now read-only.

Update autocomplete ES filters and use whitespace tokenizer #9

Merged
merged 1 commit into from
Sep 30, 2016

Conversation


@kynetiv kynetiv commented Sep 15, 2016

This PR updates the Elasticsearch config for the autocomplete type on both the index and search analyzers. The main differences are:

  • Lower the edge_ngram min_gram to 1.
    • This picks up terms that are significant but very short (el paso, la verne) or that resemble stop words (A & M).
  • Remove stop words filter.
    • The existing autocomplete query uses the common terms query, which already drops common words (including stop words) dynamically according to the supplied parameters; see this post for more on the common terms query.
  • Switch to whitespace tokenizer.
    • The current standard tokenizer drops some special characters, so a query term containing one (in the context of our common query, which uses an and operator for low-frequency terms) will fail to match, since the character is missing from every indexed document. With the whitespace tokenizer for both indexing and searching, we split only on whitespace and keep terms joined by -, &, and similar characters.
  • Add word_delimiter filter.
    • To recover the subwords the whitespace tokenizer would otherwise lose (for example California-Berkeley), we add the word_delimiter filter, which splits on special characters again. This only helps when the preserve_original flag is set, so that the original hyphenated or otherwise joined term is indexed as well. That covers cases like the Berkeley example, where the words can be matched independently or exactly as they appear in the name, and it also helps when one of the joined words is a common (high-frequency) term.
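
A sketch of what the resulting analysis settings might look like, assuming names like autocomplete_ngram and autocomplete_index that are illustrative only (the actual filter names, max_gram, and analyzer wiring live in this repo's ES mapping, not shown here):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        },
        "split_joined_words": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "split_joined_words", "autocomplete_ngram"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "split_joined_words"]
        }
      }
    }
  }
}
```

With these settings, whitespace tokenization keeps California-Berkeley as one token, word_delimiter with preserve_original emits California-Berkeley, California, and Berkeley, and there is no stop filter because the common terms query already demotes high-frequency words at query time.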

@kynetiv kynetiv changed the title [WIP] update autocomplete ES filters and use whitespace tokenizer Update autocomplete ES filters and use whitespace tokenizer Sep 30, 2016

kynetiv commented Sep 30, 2016

This is good to go.

@kynetiv kynetiv merged commit cec09af into dev Sep 30, 2016
@kynetiv kynetiv deleted the update-autocomplete-analyzers branch September 30, 2016 19:35