Faster wildcard search #48852

Closed
jpountz opened this issue Nov 4, 2019 · 22 comments · Fixed by #49993
Labels
Dependency:SIEM >feature :Search/Search Search-related issues that do not fall into other categories Top Ask v7.9.0 v8.0.0-alpha1

Comments

@jpountz
Contributor

jpountz commented Nov 4, 2019

While wildcard queries are often a symptom that content hasn't been properly analyzed, there are also cases when users really need the flexibility of wildcard queries. Could we leverage ngrams to make this kind of query faster, similarly to how we made prefix queries faster on text fields by indexing edge ngrams?

I tried to think about how this could work and here is an idea:

  • Null bytes would be used as start/end markers in order to make suffix queries possible (*foobar) and prefix queries faster (foobar*).
  • Content would be indexed with an ngram tokenizer that has a fixed gram size, e.g. 5 (this could be configurable). It would be used to return a good approximation of the matches of the wildcard query.
  • Doc values would store the original value and could be used for a two-phase verification.
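As an illustration of the indexing scheme above, here is a minimal Python sketch (the `ngrams` helper name and the gram size of 5 are assumptions for the example; this is not the actual implementation):

```python
def ngrams(value, gram_size=5):
    # Wrap the value in null-byte start/end markers, then emit every
    # fixed-size gram of the marked value.
    marked = "\0" + value + "\0"
    return [marked[i:i + gram_size] for i in range(len(marked) - gram_size + 1)]

print(ngrams("foobar"))  # ['\x00foob', 'fooba', 'oobar', 'obar\x00']
```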

Here are some example queries and how they would run internally with this setup:

| Query | Approximation | Two-phase verification | Note |
|---|---|---|---|
| `*foobar*` | `fooba` AND `oobar` | Check positions | Like a phrase query |
| `foobar*` | `\0foob` AND `fooba` AND `oobar` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foobar` | `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foo*` | `*foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `foo*` | `\0foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `*foo` | `*foo\0` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `foobar` | `\0foob` AND `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle terms like Lucene's `NGramPhraseQuery` |
| `*foobar*quux*` | `fooba` AND `oobar` | Check doc values | We don't want to run a wildcard query on `*quux*`; the approximation we have might already be selective enough |
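To make the table concrete, here is a rough sketch of how a single-segment wildcard pattern could be turned into its approximation terms (`approximation_terms` is a hypothetical name; patterns with inner wildcards like `*foobar*quux*` would need extra handling not shown here):

```python
def approximation_terms(pattern, gram_size=5):
    # Keep the literal core; a null-byte marker encodes anchoring at the
    # start or end of the value (i.e. no leading/trailing '*').
    core = pattern.strip("*")
    term = (("" if pattern.startswith("*") else "\0")
            + core
            + ("" if pattern.endswith("*") else "\0"))
    if len(term) < gram_size:
        return []  # too short: fall back to a wildcard query on the grams
    return [term[i:i + gram_size] for i in range(len(term) - gram_size + 1)]

print(approximation_terms("*foobar"))  # ['fooba', 'oobar', 'obar\x00']
```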

Notes:

  • This might be quite space-intensive on long fields.

Questions:

  • Are there other ways it could work?
  • What is a good default gram size?
  • How much more space does it require compared to keyword indexing?
  • Should it be an option on text/keyword fields or a new field? This way of indexing allows running exact queries too, so we wouldn't need to index the content as a keyword in addition to these ngrams.
  • Should we index positions? Whenever we check positions above, we could also check doc values instead.
  • Should we use BINARY or SORTED doc values? BINARY feels more appropriate for the access pattern described above, but compression would be terrible.
@jpountz jpountz added >feature :Search/Search Search-related issues that do not fall into other categories labels Nov 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@jimczi
Contributor

jimczi commented Nov 14, 2019

Are there other ways it could work?

I think we can try to extend the optimization to handle leading wildcards efficiently. It seems, for example, that this solution would allow transforming any query of the form *foo* into a simple prefix query foo*? Simple leading wildcard queries like *foo are more problematic if the ending part is too short. Maybe we could make the ngram size a factor of this? 3 seems like a good compromise: we'd need to check more positions, but we'd avoid the degraded case more often.

@jpountz
Contributor Author

jpountz commented Nov 14, 2019

Oh I like this idea. Maybe instead of changing the gram size, we could index suffixes with multiple gram sizes. For instance under my first proposal, foobar would be indexed as ["\0foob", "fooba", "oobar", "obar\0"], but maybe we could actually do ["\0foob", "fooba", "oobar", "obar\0", "bar\0", "ar\0", "r\0"]. Then we could always run *foo* as foo* internally? So this would give this updated table

| Query | Approximation | Two-phase verification | Note |
|---|---|---|---|
| `*foobar*` | `fooba` AND `oobar` | Check positions | Like a phrase query |
| `foobar*` | `\0foob` AND `fooba` AND `oobar` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foobar` | `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foo*` | `foo*` | Always returns true | Need a trailing wildcard query because the substring is shorter than the gram size |
| `foo*` | `\0foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `*foo` | `foo\0` | Always returns true | Term query |
| `foobar` | `\0foob` AND `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle terms like Lucene's `NGramPhraseQuery` |
| `*foobar*quux*` | `fooba` AND `oobar`, or maybe `fooba` AND `oobar` AND `quux*` | Check doc values, or maybe check positions à la `MultiPhraseQuery`? | The right approach probably depends on the number of expansions of substrings that are shorter than the gram size |
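The suffix-indexing idea above could be sketched roughly like this (again with an assumed helper name; it extends the fixed-size grams with the shorter suffixes, down to one character plus the end marker, assuming the value is longer than the gram size):

```python
def ngrams_with_suffixes(value, gram_size=5):
    marked = "\0" + value + "\0"
    # Fixed-size grams, as in the original proposal.
    grams = [marked[i:i + gram_size] for i in range(len(marked) - gram_size + 1)]
    # Shorter suffixes ending at the end marker, so that *foo* can run
    # internally as the prefix query foo*.
    grams += [marked[i:] for i in range(len(marked) - gram_size + 1, len(marked) - 1)]
    return grams

print(ngrams_with_suffixes("foobar"))
# ['\x00foob', 'fooba', 'oobar', 'obar\x00', 'bar\x00', 'ar\x00', 'r\x00']
```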

@markharwood
Contributor

What is a good default gram size?

Given this is intended to be supportive of EQL can we do some analysis of existing rules out there?

@jpountz
Contributor Author

jpountz commented Nov 28, 2019

+1 it would be useful to know the distribution of the number of chars between wildcards

@markharwood
Contributor

markharwood commented Nov 28, 2019

I had a scan of EQL rules; they fell into these groups:

| Pattern wildcard type | Number of patterns | Average pattern length |
|---|---|---|
| left and right (`*foo*`) | 482 | 22 |
| left and middle (`*fo*o`) | 107 | 56 |
| middle and right (`f*oo*`) | 15 | 56 |
| middle and middle (`f*o*o`) | 11 | 68 |
| left only (`*foo`) | 296 | 31 |
| right only (`foo*`) | 68 | 18 |
| middle only (`f*oo`) | 38 | 54 |

@markharwood
Contributor

markharwood commented Nov 29, 2019

similarly to how we made prefix queries faster on text fields

So would this be an index_wildcards addition to the index_prefixes and index_phrases options for text fields?

@jpountz
Contributor Author

jpountz commented Nov 29, 2019

This is a good question. One difference between this way of indexing and indexing prefixes or shingles is that it can also help find exact matches. And I think that a number of users would rather save some space by indexing only the ngrams, instead of indexing both the original keyword and its ngrams. So I'm leaning towards exposing a different field that only indexes ngrams?

@markharwood
Contributor

markharwood commented Dec 2, 2019

So I'm leaning towards exposing a different field that only indexes ngrams?

Fair enough. Will there be a specialised new query type for this too? Would we reject traditional token-based queries like term, terms and match? Also, will we support highlighting?

@jpountz
Contributor Author

jpountz commented Dec 2, 2019

I'm viewing this as a specialized keyword field for wildcard search, so I think it shouldn't have a specialized query type but reuse the existing ones based on the approach outlined above. And like keywords, it wouldn't support highlighting.

@markharwood
Contributor

Would interval queries be efficient here for checking the sequences of these expressions?
I've sketched out an approach that doesn't verify positions here, but if I simply swapped in an Interval query for the Boolean query used in my example, would that have all the logic for efficient two-phase operation? It would avoid having big doc-value fields and string matching for the verification step.

@jpountz
Contributor Author

jpountz commented Dec 3, 2019

I think we can't escape from having to run wildcards against doc values in some cases. For instance, if the query looks like *a*b* then we need to find all ngrams that start with an a that are followed by an ngram that starts with a b, which creates combinatorial explosion issues if we want to verify matches using positions. For a query like that, I think the only realistic way to verify that there is a match is by running the wildcard against doc values.
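A crude illustration of that two-phase flow, using `fnmatchcase` as a stand-in for the wildcard matcher and plain substring containment as a stand-in for the ngram approximation (both are simplifications for the example, not the proposed implementation):

```python
from fnmatch import fnmatchcase

def two_phase_search(doc_values, pattern):
    # Phase 1 (approximation): a candidate must contain every literal
    # segment of the pattern, in any order -- cheap but over-inclusive.
    segments = [s for s in pattern.split("*") if s]
    candidates = [v for v in doc_values if all(s in v for s in segments)]
    # Phase 2 (verification): run the full wildcard against the stored
    # doc value, rejecting candidates where the segments appear out of order.
    return [v for v in candidates if fnmatchcase(v, pattern)]

values = ["cab", "abc", "ba", "xxaxxbxx"]
print(two_phase_search(values, "*a*b*"))  # ['cab', 'abc', 'xxaxxbxx']
```

Note how `ba` survives phase 1 (it contains both `a` and `b`) but is rejected by phase 2, since the pattern requires `a` before `b`.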

@markharwood
Contributor

OK - I'll code up the DV approach for all cases for now and we can optimise later for cases with longer patterns if we think intervals would be a big improvement.

@jpountz
Contributor Author

jpountz commented Dec 3, 2019

This sounds good to me.

I'd like us to try both approaches to see how they compare, but my gut feeling is that the storage overhead of positions is going to be huge, and probably a barrier to adoption for many users. So only indexing ngrams with IndexOptions.DOCS and then checking with doc values is likely going to be a better trade-off. I might be surprised by data though. :)

@markharwood
Contributor

For the verification phase we may have to consider some of the flags we see in regex queries to do with "greedy" evaluation.
I tried to match a "food is good" document with a f*od query using Regex.simpleMatch and it failed, whereas a food*od query matched OK.
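For reference, a wildcard matcher that relies on a regex engine's backtracking does match that case. This is just an illustrative sketch, not the Regex.simpleMatch or Lucene WildcardQuery implementation:

```python
import re

def wildcard_match(value, pattern):
    # Translate * -> .* and ? -> . ; the regex engine backtracks, so it
    # retries later occurrences of "od" instead of committing to the
    # first one inside "food" the way a greedy single-pass scan would.
    regex = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c)
                    for c in pattern)
    return re.fullmatch(regex, value, re.DOTALL) is not None

print(wildcard_match("food is good", "f*od"))  # True
```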

@jpountz
Contributor Author

jpountz commented Dec 9, 2019

Can we reuse the same logic as WildcardQuery? It would be a better experience if a wildcard query on a keyword or this new type would return the exact same results.

@markharwood
Contributor

markharwood commented Jan 9, 2020

Is there mileage in thinking of this feature more broadly as a form of index-backed regex with its own query type?
There's extra functionality in Interval queries like ORs or max gap that we could expose if we move beyond wildcard syntax and include some things from regex e.g. the | OR symbol to give us patterns like [elastic search|elasticsearch] error.

I guess we could add the field type for now and dream up new query syntax later. It might make us want to reconsider the name of the field type in the first instance though.

@jpountz
Contributor Author

jpountz commented Jan 9, 2020

I was thinking that this field would be a good fit for regexp or fuzzy queries indeed. Why would we create a custom query type when we could reuse the existing regexp and fuzzy queries? MappedFieldType has factory methods for both of these queries, so fields are free to implement them however they want.

@markharwood
Contributor

Why would you create a custom query type

I was thinking that if we opt for the positions-based implementation, the regex functionality we could offer might be constrained by what Interval queries can support. This would create a divergence in functionality between "real" regex and what this new field type could support.

@jpountz
Contributor Author

jpountz commented Jan 9, 2020

OK, let's focus on the positions vs. doc-values discussion first then. I'll write some thoughts about this on the PR.

@lucasklaassen

I just have one question... Why can I not find public documentation ANYWHERE on this new field?

@cbuescher
Member

I just have one question... Why can I not find public documentation ANYWHERE on this new field?

That's a good one, and a difficult one to answer, because the documentation is right here. Hope that helps.
