Faster wildcard search #48852

Closed
jpountz opened this issue Nov 4, 2019 · 22 comments · Fixed by #49993
Labels
Dependency:SIEM >feature :Search/Search Search-related issues that do not fall into other categories Top Ask v7.9.0 v8.0.0-alpha1

Comments

@jpountz
Contributor

jpountz commented Nov 4, 2019

While wildcard queries are often a symptom that content hasn't been properly analyzed, there are also cases when users really need the flexibility of wildcard queries. Could we leverage ngrams to make this kind of query faster, similarly to how we made prefix queries faster on text fields by indexing edge ngrams?

I tried to think about how this could work and here is an idea:

  • Null bytes would be used as start/end markers in order to make suffix queries possible (*foobar) and prefix queries faster (foobar*).
  • Content would be indexed with an ngram tokenizer that has a fixed gram size, e.g. 5 (this could be configurable). It would be used to return a good approximation of the matches of the wildcard query.
  • Doc values would store the original value and could be used for a two-phase verification.
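As an illustration of the indexing scheme above, here is a minimal Python sketch (the `ngrams` helper name and the gram size of 5 are assumptions for the example; this is not the actual implementation):

```python
def ngrams(value, gram_size=5):
    # Wrap the value in null-byte start/end markers, then emit every
    # fixed-size gram of the marked value.
    marked = "\0" + value + "\0"
    return [marked[i:i + gram_size] for i in range(len(marked) - gram_size + 1)]

print(ngrams("foobar"))  # ['\x00foob', 'fooba', 'oobar', 'obar\x00']
```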

Here are some example queries and how they would run internally with this setup:

| Query | Approximation | Two-phase verification | Note |
|---|---|---|---|
| `*foobar*` | `fooba` AND `oobar` | Check positions | Like a phrase query |
| `foobar*` | `\0foob` AND `fooba` AND `oobar` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foobar` | `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foo*` | `*foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `foo*` | `\0foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `*foo` | `*foo\0` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `foobar` | `\0foob` AND `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle terms like Lucene's `NGramPhraseQuery` |
| `*foobar*quux*` | `fooba` AND `oobar` | Check doc values | We don't want to run a wildcard query on `*quux*`; the approximation we have might already be selective enough |
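To make the table concrete, here is a rough sketch of how a single-segment wildcard pattern could be turned into its approximation terms (`approximation_terms` is a hypothetical name; patterns with inner wildcards like `*foobar*quux*` would need extra handling not shown here):

```python
def approximation_terms(pattern, gram_size=5):
    # Keep the literal core; a null-byte marker encodes anchoring at the
    # start or end of the value (i.e. no leading/trailing '*').
    core = pattern.strip("*")
    term = (("" if pattern.startswith("*") else "\0")
            + core
            + ("" if pattern.endswith("*") else "\0"))
    if len(term) < gram_size:
        return []  # too short: fall back to a wildcard query on the grams
    return [term[i:i + gram_size] for i in range(len(term) - gram_size + 1)]

print(approximation_terms("*foobar"))  # ['fooba', 'oobar', 'obar\x00']
```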

Notes:

  • This might be quite space-intensive on long fields.

Questions:

  • Are there other ways it could work?
  • What is a good default gram size?
  • How much more space does it require compared to keyword indexing?
  • Should it be an option on text/keyword fields or a new field? This way of indexing allows running exact queries too, so we wouldn't need to index the content as a keyword in addition to these ngrams.
  • Should we index positions? Whenever we check positions above, we could also check doc values instead.
  • Should we use BINARY or SORTED doc values? BINARY feels more appropriate for the access pattern described above, but compression would be terrible.
@jpountz jpountz added >feature :Search/Search Search-related issues that do not fall into other categories labels Nov 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@jimczi
Contributor

jimczi commented Nov 14, 2019

Are there other ways it could work?

I think we can try to extend the optimization to handle leading wildcards efficiently. It seems, for example, that this solution would allow transforming any query of the form *foo* into a simple prefix query foo*? Simple leading wildcard queries like *foo are more problematic if the ending part is too short. Maybe we could make the ngram size a factor of this? 3 seems like a good compromise: we'd need to check more positions, but we'd avoid the degraded case more often.

@jpountz
Contributor Author

jpountz commented Nov 14, 2019

Oh I like this idea. Maybe instead of changing the gram size, we could index suffixes with multiple gram sizes. For instance under my first proposal, foobar would be indexed as ["\0foob", "fooba", "oobar", "obar\0"], but maybe we could actually do ["\0foob", "fooba", "oobar", "obar\0", "bar\0", "ar\0", "r\0"]. Then we could always run *foo* as foo* internally? So this would give this updated table

| Query | Approximation | Two-phase verification | Note |
|---|---|---|---|
| `*foobar*` | `fooba` AND `oobar` | Check positions | Like a phrase query |
| `foobar*` | `\0foob` AND `fooba` AND `oobar` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foobar` | `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle term like Lucene's `NGramPhraseQuery` |
| `*foo*` | `foo*` | Always returns true | Need a trailing wildcard query because the substring is shorter than the gram size |
| `foo*` | `\0foo*` | Always returns true | Need a wildcard query because the substring is shorter than the gram size |
| `*foo` | `foo\0` | Always returns true | Term query |
| `foobar` | `\0foob` AND `fooba` AND `oobar` AND `obar\0` | Check positions | We could skip the middle terms like Lucene's `NGramPhraseQuery` |
| `*foobar*quux*` | `fooba` AND `oobar`, or maybe `fooba` AND `oobar` AND `quux*` | Check doc values, or maybe check positions à la `MultiPhraseQuery`? | The right approach probably depends on the number of expansions of substrings that are shorter than the gram size |
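The suffix-indexing idea above could be sketched roughly like this (again with an assumed helper name; it extends the fixed-size grams with the shorter suffixes, down to one character plus the end marker, assuming the value is longer than the gram size):

```python
def ngrams_with_suffixes(value, gram_size=5):
    marked = "\0" + value + "\0"
    # Fixed-size grams, as in the original proposal.
    grams = [marked[i:i + gram_size] for i in range(len(marked) - gram_size + 1)]
    # Shorter suffixes ending at the end marker, so that *foo* can run
    # internally as the prefix query foo*.
    grams += [marked[i:] for i in range(len(marked) - gram_size + 1, len(marked) - 1)]
    return grams

print(ngrams_with_suffixes("foobar"))
# ['\x00foob', 'fooba', 'oobar', 'obar\x00', 'bar\x00', 'ar\x00', 'r\x00']
```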

@markharwood
Contributor

What is a good default gram size?

Given this is intended to be supportive of EQL can we do some analysis of existing rules out there?

@jpountz
Contributor Author

jpountz commented Nov 28, 2019

+1 it would be useful to know the distribution of the number of chars between wildcards

@markharwood
Contributor

markharwood commented Nov 28, 2019

I had a scan of EQL rules; they fell into these groups:

| Pattern wildcard type | Number of patterns | Average pattern length |
|---|---|---|
| left and right (`*foo*`) | 482 | 22 |
| left and middle (`*fo*o`) | 107 | 56 |
| middle and right (`f*oo*`) | 15 | 56 |
| middle and middle (`f*o*o`) | 11 | 68 |
| left only (`*foo`) | 296 | 31 |
| right only (`foo*`) | 68 | 18 |
| middle only (`f*oo`) | 38 | 54 |

@markharwood
Contributor

markharwood commented Nov 29, 2019

similarly to how we made prefix queries faster on text fields

So would this be an index_wildcards addition to the index_prefixes and index_phrases options for text fields?

@jpountz
Contributor Author

jpountz commented Nov 29, 2019

This is a good question. One difference between this way of indexing and indexing prefixes or shingles is that it can also help find exact matches. And I think that a number of users would rather save some space by indexing only the ngrams, instead of indexing both the original keyword and its ngrams. So I'm leaning towards exposing a different field that only indexes ngrams?

@markharwood
Contributor

markharwood commented Dec 2, 2019

So I'm leaning towards exposing a different field that only indexes ngrams?

Fair enough. Will there be a specialised new query type for this too? Would we reject traditional token-based queries like term, terms and match? Also, will we support highlighting?

@jpountz
Contributor Author

jpountz commented Dec 2, 2019

I'm viewing this as a specialized keyword field for wildcard search, so I think it shouldn't have a specialized query type but reuse the existing ones based on the approach outlined above. And like keywords, it wouldn't support highlighting.

@markharwood
Contributor

Would interval queries be efficient here for checking the sequences of these expressions?
I've sketched out an approach that doesn't verify positions here, but if I simply swapped in an Interval query for the Boolean query used in my example, would that have all the logic for efficient two-phase operation? It would avoid having big doc-value fields and string matching for the verification step.

@jpountz
Contributor Author

jpountz commented Dec 3, 2019

I think we can't escape from having to run wildcards against doc values in some cases. For instance, if the query looks like *a*b* then we need to find all ngrams that start with an a that are followed by an ngram that starts with a b, which creates combinatorial explosion issues if we want to verify matches using positions. For a query like that, I think the only realistic way to verify that there is a match is by running the wildcard against doc values.
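A crude illustration of that two-phase flow, using `fnmatchcase` as a stand-in for the wildcard matcher and plain substring containment as a stand-in for the ngram approximation (both are simplifications for the example, not the proposed implementation):

```python
from fnmatch import fnmatchcase

def two_phase_search(doc_values, pattern):
    # Phase 1 (approximation): a candidate must contain every literal
    # segment of the pattern, in any order -- cheap but over-inclusive.
    segments = [s for s in pattern.split("*") if s]
    candidates = [v for v in doc_values if all(s in v for s in segments)]
    # Phase 2 (verification): run the full wildcard against the stored
    # doc value, rejecting candidates where the segments appear out of order.
    return [v for v in candidates if fnmatchcase(v, pattern)]

values = ["cab", "abc", "ba", "xxaxxbxx"]
print(two_phase_search(values, "*a*b*"))  # ['cab', 'abc', 'xxaxxbxx']
```

Note how `ba` survives phase 1 (it contains both `a` and `b`) but is rejected by phase 2, since the pattern requires `a` before `b`.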

@markharwood
Contributor

OK - I'll code up the DV approach for all cases for now and we can optimise later for cases with longer patterns if we think intervals would be a big improvement.

@jpountz
Contributor Author

jpountz commented Dec 3, 2019

This sounds good to me.

I'd like us to try both approaches to see how they compare, but my gut feeling is that the storage overhead of positions is going to be huge, and probably a barrier to adoption for many users. So only indexing ngrams with IndexOptions.DOCS and then checking with doc values is likely going to be a better trade-off. I might be surprised by data though. :)

@markharwood
Contributor

For the verification phase we may have to consider some of the flags we see in regex queries to do with "greedy" evaluation.
I tried to match a "food is good" document with a f*od query using Regex.simpleMatch and it failed, whereas a food*od query matched OK.
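For reference, a wildcard matcher that relies on a regex engine's backtracking does match that case. This is just an illustrative sketch, not the Regex.simpleMatch or Lucene WildcardQuery implementation:

```python
import re

def wildcard_match(value, pattern):
    # Translate * -> .* and ? -> . ; the regex engine backtracks, so it
    # retries later occurrences of "od" instead of committing to the
    # first one inside "food" the way a greedy single-pass scan would.
    regex = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c)
                    for c in pattern)
    return re.fullmatch(regex, value, re.DOTALL) is not None

print(wildcard_match("food is good", "f*od"))  # True
```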

@jpountz
Contributor Author

jpountz commented Dec 9, 2019

Can we reuse the same logic as WildcardQuery? It would be a better experience if a wildcard query on a keyword or this new type would return the exact same results.

@markharwood
Contributor

markharwood commented Jan 9, 2020

Is there mileage in thinking of this feature more broadly as a form of index-backed regex with its own query type?
There's extra functionality in Interval queries like ORs or max gap that we could expose if we move beyond wildcard syntax and include some things from regex e.g. the | OR symbol to give us patterns like [elastic search|elasticsearch] error.

I guess we could add the field type for now and dream up new query syntax later. It might make us want to reconsider the name of the field type in the first instance though.

@jpountz
Contributor Author

jpountz commented Jan 9, 2020

I was thinking that this field would be a good fit for regexp or fuzzy queries indeed. Why would we create a custom query type when we could reuse the existing regexp and fuzzy queries? MappedFieldType has factory methods for both of these queries, so fields are free to implement them however they want.

@markharwood
Contributor

Why would you create a custom query type

I was thinking that if we opt for the positions-based implementation, the regex functionality we could offer might be constrained by what Interval queries can support. This would create a divergence in functionality between "real" regex and what this new field type could support.

@jpountz
Contributor Author

jpountz commented Jan 9, 2020

OK, let's focus on the positions vs. doc-values discussion first then. I'll write some thoughts about this on the PR.

@lucasklaassen

I just have one question... Why can I not find public documentation ANYWHERE on this new field?

@cbuescher
Member

I just have one question... Why can I not find public documentation ANYWHERE on this new field?

That's a good one, and a difficult one to answer, because the documentation is right here. Hope that helps.
