Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Add NGRAM bloom filter index to speed up like queries. #10733

Closed
3 tasks done
compasses opened this issue Jul 10, 2022 · 3 comments · Fixed by #11579
Closed
3 tasks done

[Feature] Add NGRAM bloom filter index to speed up like queries. #10733

compasses opened this issue Jul 10, 2022 · 3 comments · Fixed by #11579
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@compasses
Copy link
Contributor

compasses commented Jul 10, 2022

Search before asking

  • I had searched in the issues and found no similar issues.

Description

To speed up like queries we have pushed the like function to storage layer in PR #10355 , which can get 2x~3x performance gain, no matter vectorized or not. But we want to go the extra mile, and make it more faster and less resource overhead. Base on that, we are going to implement a new index for like queries.

We have researched several solutions such as pg_trgm from postgresql、ngrambf from clickhouse and FST from elasticsearch. Since Doris have bloom filter index already, in consideration of complexity、function scope and compatibility. Finally, we will choose the way as clickhouse did ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): the input column string is split into n-grams (first parameter – n-gram size), and then stored in a bloom filter. During query, the like pattern will also be split to n-grams and generate a bloom filter to do the filter, use the bloom filter to skip granule.

For doris here is the details:

  1. Reuse the exist bloom filter index read/write process, and the storage layer will be unaffected.
  2. Add a new kind of bloom filter index.
  3. Add new type of algorithm: NGRAM_BLOOM_FILTER, which will extract gram and calculate the bloom filter.
  4. For the new algorithm the HashStrategy will follow the clickhouse
  5. Query will support index filter pages for like queries , if exist the ngram bloom filter, which based on the [Optimize] Improve performance like/not like filter through pushdown function to storage engine #10355
  6. Support add index for history data:alter table example_db.table3 add index idx_ngrambf(username) using NGRAM_BF(3, 256) comment 'username ngram_bf index'.

image

That's all, thanks.

Use case

No response

Related issues

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@compasses compasses added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 10, 2022
@xiaokang
Copy link
Contributor

@compasses have you researched tokenbf_v1 in clickhouse? It's also useful for simple fulltext search. I'd like to collaborate
with you and take tokenbf_v1.

@compasses
Copy link
Contributor Author

@compasses have you researched tokenbf_v1 in clickhouse? It's also useful for simple fulltext search. I'd like to collaborate with you and take tokenbf_v1.

Yes, for tokenbf it's work for exact match based on tokenization, maybe it's useful for english, but for Chinese I think you need other tools to do tokenization.

BTW, I think it's worth to add tokenbf. Our code will be ready soon.

@xiaokang
Copy link
Contributor

@compasses great! look forward for you pr.

morningman pushed a commit that referenced this issue Dec 28, 2022
…o speed up like query (#11579)

This PR implement  the new bloom filter index: NGram bloom filter index, which was proposed in  #10733.
The new index can improve the like query performance greatly, from our some test case , can  get order of magnitude  improve.
For how to use it you can check the docs in this PR, and the index based on the ```enable_function_pushdown```,
you need set it to ```true```, to make the index work for like query.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
2 participants