
fix: re-enable (xor) bloom filter index #7870

Merged · 7 commits · Sep 27, 2022

Conversation

dantengsky
Member

@dantengsky dantengsky commented Sep 24, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Since the xor filter has been merged in PR #7860, this PR replaces the vanilla bloom filter with it.

  • legacy index data of the v1 bloom filter will be ignored,
    but it will still be removed when the table data is dropped
  • the xor index is created under the path "_i_b_v2",
    so that other types of indexes can be put in separate "dir"s
  • the bloom filter index is enabled for string-type columns only
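The versioned-directory layout described above can be sketched as follows (a minimal illustration with hypothetical names like `index_path` and `BloomIndexVersion`; this is not Databend's actual code, and the v1 location shown is an assumption):

```rust
// Hypothetical sketch of a version-aware index location scheme, so each
// index type lives under its own "dir" and can evolve independently.

#[derive(Debug, Clone, Copy, PartialEq)]
enum BloomIndexVersion {
    V1, // legacy vanilla bloom filter: ignored on read, removed on table drop
    V2, // xor-filter-based index
}

/// Return the directory prefix for a given index version.
fn index_path(version: BloomIndexVersion) -> &'static str {
    match version {
        // hypothetical legacy location, for illustration only
        BloomIndexVersion::V1 => "_i",
        // the new xor index dir named in this PR
        BloomIndexVersion::V2 => "_i_b_v2",
    }
}

fn main() {
    assert_eq!(index_path(BloomIndexVersion::V2), "_i_b_v2");
    println!("v2 index dir: {}", index_path(BloomIndexVersion::V2));
}
```

Keeping the version in the path means a reader can decide which decoder to use without opening the file, and legacy v1 data can simply be skipped.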

Fixes #issue

@mergify mergify bot added the pr-bugfix this PR patches a bug in codebase label Sep 24, 2022
@BohuTANG
Member

BohuTANG commented Sep 25, 2022

Cool.
The bloom filter should carry a type tag so that we can try the binary fuse filter (in the xorfilter crate) in the future:
Binary Fuse Filters: Fast and Smaller Than Xor Filters

Fuse8 disadvantages (from the crate docs):

 A Fuse8 filter can only tolerate a few duplicates in a given data set. So make sure to supply a hasher that is capable of generating unique digests (within the allowed tolerance of duplicates), and when supplying the digests directly via populate_keys() and build_keys(), make sure they don't have more than a few duplicates.
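One way to work around that limitation is to deduplicate digests before handing them to the filter builder. A std-only sketch (no xorfilter dependency; `dedup_digests` is an illustrative helper, not part of any crate):

```rust
use std::collections::HashSet;

/// Drop duplicate digests while preserving first-seen order, so that a
/// duplicate-sensitive builder (e.g. Fuse8's build_keys) only ever sees
/// unique values.
fn dedup_digests(digests: &[u64]) -> Vec<u64> {
    let mut seen = HashSet::new();
    digests
        .iter()
        .copied()
        .filter(|d| seen.insert(*d)) // insert() returns false for repeats
        .collect()
}

fn main() {
    let digests = vec![1, 2, 2, 3, 3, 3];
    let unique = dedup_digests(&digests);
    assert_eq!(unique, vec![1, 2, 3]);
    println!("{} digests -> {} unique", digests.len(), unique.len());
}
```

The extra pass costs O(n) time and memory, which is one reason to prefer a filter that tolerates duplicates natively when the key stream is not known to be unique.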

@BohuTANG
Member

From the test, XOR8 is better than FUSE8 for our applications; FUSE8 would need to check for duplicate keys first (as HyperLogLog does).
Test Result:

fuse8: u64 bitmap encoding:1130544 bytes, raw:8000000 bytes, ratio:0.141318
fuse8: bool bitmap encoding:1130544 bytes, raw:1000000 bytes, ratio:1.130544
fuse8: string encoding:118832 bytes, raw:3000000 bytes, ratio:0.039610665
fuse8: same string encoding:118832 bytes, raw:3000000 bytes, ratio:0.039610665
xor8: u64 bitmap encoding:1230069 bytes, raw:8000000 bytes, ratio:0.15375863
xor8: bool bitmap encoding:61 bytes, raw:1000000 bytes, ratio:0.000061
xor8: string encoding:123067 bytes, raw:3000000 bytes, ratio:0.041022334
xor8: same string encoding:61 bytes, raw:3000000 bytes, ratio:0.000020333333
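The ratios above are simply encoded bytes divided by raw bytes; recomputing a couple of rows confirms how they were derived:

```rust
/// Compression ratio: encoded size over raw size.
fn ratio(encoded_bytes: u64, raw_bytes: u64) -> f64 {
    encoded_bytes as f64 / raw_bytes as f64
}

fn main() {
    // fuse8 u64 bitmap: 1130544 / 8000000 = 0.141318
    assert!((ratio(1_130_544, 8_000_000) - 0.141318).abs() < 1e-6);
    // xor8 same-string column collapses to 61 bytes: 61 / 3000000
    assert!((ratio(61, 3_000_000) - 0.000020333333).abs() < 1e-9);
    println!("fuse8 u64 ratio: {}", ratio(1_130_544, 8_000_000));
}
```

Note how xor8 degenerates to a tiny constant size (61 bytes) when every value is identical, while fuse8 reports the same size for the unique-string and same-string cases.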

BohuTANG@b45793c

@dantengsky dantengsky marked this pull request as ready for review September 25, 2022 09:17
@dantengsky dantengsky marked this pull request as draft September 25, 2022 10:52
@dantengsky dantengsky force-pushed the feat-renable-xor-bloom-index branch from 183c789 to 08ea640 Compare September 25, 2022 11:36
@dantengsky dantengsky marked this pull request as ready for review September 25, 2022 15:51
@BohuTANG
Member

I will benchmark the xor bloom index loading after #7867 and #7857.

@BohuTANG
Member

BohuTANG commented Sep 27, 2022

Test result:

Gen data(unique):

create table t10b as select number as c1, cast(rand() as string) as c2 from numbers(1000000000);

main branch:

mysql> select * from t10b where c2='0.4091273956277217';
Connection id:    9
Current database: bloom

+-----------+--------------------+
| c1        | c2                 |
+-----------+--------------------+
| 125000007 | 0.4091273956277217 |
+-----------+--------------------+
1 row in set (1 min 13.72 sec)
Read 1000000000 rows, 24.47 GiB in 73.697 sec., 13.57 million rows/sec., 340.05 MiB/sec.

This branch:

mysql> select * from t10b where c2='0.4091273956277217';
Connection id:    9
Current database: bloom

+-----------+--------------------+
| c1        | c2                 |
+-----------+--------------------+
| 125000007 | 0.4091273956277217 |
+-----------+--------------------+
1 row in set (41.11 sec)
Read 6000000 rows, 157.95 MiB in 41.090 sec., 146.02 thousand rows/sec., 3.84 MiB/sec.

The XOR filter path is ~2X faster and scans far less data (it only needs to read the xor filter index plus the partitions that survive pruning):

mysql> explain select * from t10b where c2='0.4091273956277217';
+-----------------------------------------------------------------------------+
| explain                                                                     |
+-----------------------------------------------------------------------------+
| TableScan                                                                   |
| ├── table: default.bloom.t10b                                               |
| ├── read rows: 6000000                                                      |
| ├── read bytes: 126715443                                                   |
| ├── partitions total: 1000                                                  |
| ├── partitions scanned: 6                                                   |
| └── push downs: [filters: [(c2 = '0.4091273956277217')], limit: NONE]       |
+-----------------------------------------------------------------------------+
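A back-of-envelope check of the pruning effect shown in the EXPLAIN output: 6 of 1000 partitions survive the xor-filter check, so with 1e9 rows split evenly across partitions the scan touches 6 million rows instead of a billion:

```rust
fn main() {
    let total_rows: u64 = 1_000_000_000; // rows in t10b
    let total_partitions: u64 = 1_000;   // "partitions total" from EXPLAIN
    let scanned_partitions: u64 = 6;     // "partitions scanned" from EXPLAIN

    // Assuming an even split, each partition holds 1M rows.
    let rows_per_partition = total_rows / total_partitions;
    let rows_scanned = scanned_partitions * rows_per_partition;

    assert_eq!(rows_scanned, 6_000_000); // matches "read rows: 6000000"
    println!("rows scanned: {rows_scanned}");
}
```

The six surviving partitions are xor-filter false positives plus the one true match; the remaining 994 are skipped without touching their data.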

Combined with #7893, point queries should get even better.

Data size: (screenshot omitted)

XOR index size (this is the worst case, because all my keys are unique): (screenshot omitted)

@dantengsky dantengsky force-pushed the feat-renable-xor-bloom-index branch from f1b7841 to deeb3b5 Compare September 27, 2022 01:46
Member

@BohuTANG left a comment


👍
