[Feature] Add NGRAM bloom filter index to speed up like queries. #10733

compasses · 2022-07-10T06:12:35Z

Search before asking

I have searched the issues and found no similar issues.

Description

To speed up like queries, we pushed the like function down to the storage layer in PR #10355, which yields a 2x to 3x performance gain whether execution is vectorized or not. We want to go the extra mile and make it faster with less resource overhead. To that end, we are going to implement a new index for like queries.

We have researched several solutions, such as pg_trgm from PostgreSQL, ngrambf from ClickHouse, and FST from Elasticsearch. Since Doris already has a bloom filter index, and considering complexity, functional scope, and compatibility, we will follow ClickHouse's approach, ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): the input column string is split into n-grams (the first parameter is the n-gram size), which are then stored in a bloom filter. At query time, the like pattern is also split into n-grams to build a bloom filter, which is then used to skip granules.
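To make the idea concrete, here is a minimal Python sketch of an n-gram bloom filter. It is an illustration only, not the Doris or ClickHouse implementation: the class name `NgramBloomFilter`, the method names, and the use of salted BLAKE2b as the hash are all assumptions made for this example. It also tests the bits of each pattern n-gram directly instead of building a second bloom filter from the pattern, which gives the same answer.

```python
import hashlib


def ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous n-character substring of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramBloomFilter:
    """Hypothetical stand-in for an ngrambf_v1-style filter over one granule."""

    def __init__(self, n: int, size_bytes: int, num_hashes: int, seed: int = 0):
        self.n = n
        self.num_bits = size_bytes * 8
        self.num_hashes = num_hashes
        self.seed = seed
        self.bits = bytearray(size_bytes)

    def _positions(self, gram: str):
        # Assumption: salted BLAKE2b replaces the real hash strategy.
        for i in range(self.num_hashes):
            salt = (self.seed + i).to_bytes(8, "little")
            digest = hashlib.blake2b(
                gram.encode("utf-8"), digest_size=8, salt=salt
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value: str) -> None:
        """Insert every n-gram of a column value into the filter."""
        for gram in ngrams(value, self.n):
            for pos in self._positions(gram):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, literal: str) -> bool:
        """False means no value in the granule can contain the literal."""
        grams = ngrams(literal, self.n)
        if not grams:
            return True  # literal shorter than n: the filter cannot decide
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for gram in grams
            for pos in self._positions(gram)
        )


granule = ["alice_smith", "bob_jones", "carol_white"]
bf = NgramBloomFilter(n=3, size_bytes=256, num_hashes=2)
for value in granule:
    bf.add(value)

print(bf.may_contain("jones"))  # True: a bloom filter has no false negatives
```

A `False` result lets the reader skip the granule entirely. A `True` result only means the granule has to be scanned, since bloom filters can report false positives.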

For Doris, here are the details:

Reuse the existing bloom filter index read/write process, so the storage layer will be unaffected.
Add a new kind of bloom filter index.
Add a new type of algorithm, NGRAM_BLOOM_FILTER, which extracts n-grams and computes the bloom filter.
For the new algorithm, the HashStrategy will follow ClickHouse.
Queries will use the index to filter pages for like queries if an ngram bloom filter exists. This builds on #10355, "[Optimize] Improve performance like/not like filter through pushdown function to storage engine".
Support adding the index to historical data: alter table example_db.table3 add index idx_ngrambf(username) using NGRAM_BF(3, 256) comment 'username ngram_bf index'.
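The page-filtering step in the list above can be sketched as follows. This is a hedged illustration, not Doris code: the function names `like_literals`, `pattern_ngrams`, and `pages_to_read` are hypothetical, a plain Python set of n-grams stands in for each page's bloom filter (so there are no false positives here), and escape sequences in the like pattern are ignored.

```python
def like_literals(pattern: str) -> list[str]:
    """Split a like pattern into literal runs separated by % or _ wildcards."""
    runs, current = [], []
    for ch in pattern:
        if ch in "%_":
            if current:
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


def pattern_ngrams(pattern: str, n: int) -> set[str]:
    """Collect the n-grams of every literal run that is at least n long."""
    grams = set()
    for run in like_literals(pattern):
        grams.update(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


def pages_to_read(page_indexes: list[set[str]], pattern: str, n: int) -> list[int]:
    """Return the pages whose index contains all n-grams of the pattern."""
    needed = pattern_ngrams(pattern, n)
    # If the pattern yields no n-gram, `needed` is empty and no page is skipped.
    return [i for i, index in enumerate(page_indexes) if needed <= index]


# Assumption: three pages of a username column, indexed with n = 3.
pages = [["alice", "alina"], ["robert", "roberta"], ["malice", "bob"]]
n = 3
indexes = [
    {value[i:i + n] for value in page for i in range(len(value) - n + 1)}
    for page in pages
]

print(pages_to_read(indexes, "%lic%", n))    # [0, 2]
print(pages_to_read(indexes, "rob%ta", n))   # [1]
print(pages_to_read(indexes, "%ob_rt%", n))  # [0, 1, 2]
```

The last call shows the limit of the approach: when every literal run in the pattern is shorter than n, the index cannot prune anything and all pages are read.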

That's all, thanks.

Use case

No response

Related issues

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct


xiaokang · 2022-07-24T07:13:09Z

@compasses have you researched tokenbf_v1 in ClickHouse? It's also useful for simple full-text search. I'd like to collaborate with you and take on tokenbf_v1.

compasses · 2022-07-24T15:05:17Z

> @compasses have you researched tokenbf_v1 in clickhouse? It's also useful for simple fulltext search. I'd like to collaborate with you and take tokenbf_v1.

Yes. tokenbf works for exact matches based on tokenization. It may be useful for English, but for Chinese I think you need other tools to do the tokenization.

BTW, I think it's worth adding tokenbf. Our code will be ready soon.

xiaokang · 2022-07-25T13:48:06Z

@compasses great! Looking forward to your PR.

…o speed up like query (#11579) This PR implements the new bloom filter index, the NGram bloom filter index, which was proposed in #10733. The new index can greatly improve like query performance; in some of our test cases it gives an order-of-magnitude improvement. For usage, see the docs in this PR. The index depends on `enable_function_pushdown`, which you need to set to `true` for the index to work for like queries.

compasses added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 10, 2022

compasses mentioned this issue Aug 7, 2022

[Feature](NGram BloomFilter Index) add new ngram bloom filter index to speed up like query #11579

Merged

5 tasks

morningman closed this as completed in #11579 Dec 28, 2022

xiaofan-luan mentioned this issue Jan 13, 2024

[Feature]: NGram filter support milvus-io/milvus#29962

Open

1 task
