[Feature] Add NGRAM bloom filter index to speed up like queries. #10733
Comments
@compasses have you researched tokenbf_v1 in ClickHouse? It's also useful for simple full-text search. I'd like to collaborate.
Yes, tokenbf works for exact matches based on tokenization. It may be useful for English, but for Chinese I think you need other tools to do the tokenization. BTW, I think tokenbf is worth adding as well. Our code will be ready soon.
@compasses great! Looking forward to your PR.
…o speed up like query (#11579) This PR implements the new bloom filter index: the NGram bloom filter index proposed in #10733. The new index can improve LIKE query performance greatly; in some of our test cases it gives an order-of-magnitude improvement. For how to use it, see the docs in this PR. The index relies on `enable_function_pushdown`; you need to set it to `true` for the index to take effect on LIKE queries.
Search before asking
Description
To speed up LIKE queries we pushed the LIKE function down to the storage layer in PR #10355, which gives a 2x~3x performance gain, whether vectorized or not. But we want to go the extra mile and make it even faster with less resource overhead. Based on that, we are going to implement a new index for LIKE queries.
We have researched several solutions, such as pg_trgm from PostgreSQL, ngrambf from ClickHouse, and FST from Elasticsearch. Since Doris already has a bloom filter index, and considering complexity, functional scope, and compatibility, we chose to follow the ClickHouse approach:
ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)
The input column string is split into n-grams (the first parameter is the n-gram size) and stored in a bloom filter. During a query, the LIKE pattern is also split into n-grams, and the bloom filter lookups are used to skip granules that cannot match. Doris will follow the same scheme.
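To make the mechanism concrete, here is a minimal Python sketch of the idea, assuming a standalone per-granule filter; the names (`NgramBloomFilter`, `may_match_like`) and hash choices are illustrative only and are not the actual Doris or ClickHouse implementation. Values in a granule are split into character n-grams and inserted into the filter; at query time the LIKE pattern is split the same way, and the granule can be skipped only when some pattern n-gram is definitely absent.

```python
# Conceptual sketch of an n-gram bloom filter used to skip granules for LIKE.
import hashlib

def ngrams(s: str, n: int):
    """All character n-grams of s (empty set if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

class NgramBloomFilter:
    def __init__(self, n: int, size_in_bits: int, num_hashes: int):
        self.n = n
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits)   # one byte per bit, for simplicity

    def _positions(self, gram: str):
        # Derive num_hashes positions from seeded MD5 digests (illustrative only).
        for seed in range(self.num_hashes):
            h = hashlib.md5(f"{seed}:{gram}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add_value(self, value: str):
        # Index side: insert every n-gram of the stored string.
        for gram in ngrams(value, self.n):
            for pos in self._positions(gram):
                self.bits[pos] = 1

    def may_match_like(self, pattern: str) -> bool:
        # Query side: if any n-gram of the pattern is definitely absent,
        # no row in this granule can satisfy LIKE '%pattern%'.
        for gram in ngrams(pattern, self.n):
            if not all(self.bits[pos] for pos in self._positions(gram)):
                return False   # granule can be skipped
        return True            # all n-grams possibly present; granule must be read

# Usage: one filter per granule, built at write time, probed at query time.
bf = NgramBloomFilter(n=3, size_in_bits=2048, num_hashes=2)
for row in ["hello world", "doris ngram index", "bloom filter"]:
    bf.add_value(row)
print(bf.may_match_like("ngram"))   # True  -> granule must be scanned
print(bf.may_match_like("zzzzz"))   # False (barring a false positive) -> skipped
```

The filter can only rule granules out, never confirm a match, so surviving granules are still read and the LIKE predicate re-evaluated; false positives cost extra reads but never wrong results.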
That's all, thanks.
Use case
No response
Related issues
No response
Are you willing to submit PR?
Code of Conduct