
fix: re-enable (xor) bloom filter index #7870

Merged · 7 commits · Sep 27, 2022

Conversation

dantengsky
Member

@dantengsky dantengsky commented Sep 24, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Since the xor filter has been merged in PR #7860, this PR replaces the vanilla bloom filter with it.

  • legacy index data of the v1 bloom filter will be ignored,
    but it will still be removed when the table data is dropped
  • the xor index is created under the path "_i_b_v2",
    so that other types of indexes can be put in separate "dir"s
  • the bloom filter index is enabled for string-type columns only
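The versioned-directory layout described above can be sketched as follows (a minimal illustration with hypothetical names like `index_path` and `BloomIndexVersion`; this is not Databend's actual code, and the v1 location shown is an assumption):

```rust
// Hypothetical sketch of a version-aware index location scheme, so each
// index type lives under its own "dir" and can evolve independently.

#[derive(Debug, Clone, Copy, PartialEq)]
enum BloomIndexVersion {
    V1, // legacy vanilla bloom filter: ignored on read, removed on table drop
    V2, // xor-filter-based index
}

/// Return the directory prefix for a given index version.
fn index_path(version: BloomIndexVersion) -> &'static str {
    match version {
        // hypothetical legacy location, for illustration only
        BloomIndexVersion::V1 => "_i",
        // the new xor index dir named in this PR
        BloomIndexVersion::V2 => "_i_b_v2",
    }
}

fn main() {
    assert_eq!(index_path(BloomIndexVersion::V2), "_i_b_v2");
    println!("v2 index dir: {}", index_path(BloomIndexVersion::V2));
}
```

Keeping the version in the path means a reader can decide which decoder to use without opening the file, and legacy v1 data can simply be skipped.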

Fixes #issue

@mergify mergify bot added the pr-bugfix this PR patches a bug in codebase label Sep 24, 2022
@BohuTANG
Member

BohuTANG commented Sep 25, 2022

Cool.
The bloom filter should carry a type tag so that we can try the binary fuse filter (in the xorfilter crate) in the future:
Binary Fuse Filters: Fast and Smaller Than Xor Filters

Fuse8 disadvantages (from the crate docs):

 A Fuse8 filter can only tolerate a few duplicates in a given data set. So make sure to supply a hasher that is capable of generating unique digests (within the allowed tolerance of duplicates), and when supplying the digests directly via populate_keys() and build_keys(), make sure they don't have more than a few duplicates.
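One way to work around that limitation is to deduplicate digests before handing them to the filter builder. A std-only sketch (no xorfilter dependency; `dedup_digests` is an illustrative helper, not part of any crate):

```rust
use std::collections::HashSet;

/// Drop duplicate digests while preserving first-seen order, so that a
/// duplicate-sensitive builder (e.g. Fuse8's build_keys) only ever sees
/// unique values.
fn dedup_digests(digests: &[u64]) -> Vec<u64> {
    let mut seen = HashSet::new();
    digests
        .iter()
        .copied()
        .filter(|d| seen.insert(*d)) // insert() returns false for repeats
        .collect()
}

fn main() {
    let digests = vec![1, 2, 2, 3, 3, 3];
    let unique = dedup_digests(&digests);
    assert_eq!(unique, vec![1, 2, 3]);
    println!("{} digests -> {} unique", digests.len(), unique.len());
}
```

The extra pass costs O(n) time and memory, which is one reason to prefer a filter that tolerates duplicates natively when the key stream is not known to be unique.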

@BohuTANG
Member

From the test, XOR8 is better than FUSE8 for our applications; FUSE8 would need to check for duplicate keys first (as HyperLogLog does).
Test Result:

fuse8: u64 bitmap encoding:1130544 bytes, raw:8000000 bytes, ratio:0.141318
fuse8: bool bitmap encoding:1130544 bytes, raw:1000000 bytes, ratio:1.130544
fuse8: string encoding:118832 bytes, raw:3000000 bytes, ratio:0.039610665
fuse8: same string encoding:118832 bytes, raw:3000000 bytes, ratio:0.039610665
xor8: u64 bitmap encoding:1230069 bytes, raw:8000000 bytes, ratio:0.15375863
xor8: bool bitmap encoding:61 bytes, raw:1000000 bytes, ratio:0.000061
xor8: string encoding:123067 bytes, raw:3000000 bytes, ratio:0.041022334
xor8: same string encoding:61 bytes, raw:3000000 bytes, ratio:0.000020333333
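The ratios above are simply encoded bytes divided by raw bytes; recomputing a couple of rows confirms how they were derived:

```rust
/// Compression ratio: encoded size over raw size.
fn ratio(encoded_bytes: u64, raw_bytes: u64) -> f64 {
    encoded_bytes as f64 / raw_bytes as f64
}

fn main() {
    // fuse8 u64 bitmap: 1130544 / 8000000 = 0.141318
    assert!((ratio(1_130_544, 8_000_000) - 0.141318).abs() < 1e-6);
    // xor8 same-string column collapses to 61 bytes: 61 / 3000000
    assert!((ratio(61, 3_000_000) - 0.000020333333).abs() < 1e-9);
    println!("fuse8 u64 ratio: {}", ratio(1_130_544, 8_000_000));
}
```

Note how xor8 degenerates to a tiny constant size (61 bytes) when every value is identical, while fuse8 reports the same size for the unique-string and same-string cases.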

BohuTANG@b45793c

@dantengsky dantengsky marked this pull request as ready for review September 25, 2022 09:17
@dantengsky dantengsky marked this pull request as draft September 25, 2022 10:52
@dantengsky dantengsky force-pushed the feat-renable-xor-bloom-index branch from 183c789 to 08ea640 Compare September 25, 2022 11:36
@dantengsky dantengsky marked this pull request as ready for review September 25, 2022 15:51
@BohuTANG
Member

I will benchmark the xor bloom index loading after #7867 and #7857.

@BohuTANG
Member

BohuTANG commented Sep 27, 2022

Test result:

Gen data(unique):

create table t10b as select number as c1, cast(rand() as string) as c2 from numbers(1000000000);

main branch:

mysql> select * from t10b where c2='0.4091273956277217';
Connection id:    9
Current database: bloom

+-----------+--------------------+
| c1        | c2                 |
+-----------+--------------------+
| 125000007 | 0.4091273956277217 |
+-----------+--------------------+
1 row in set (1 min 13.72 sec)
Read 1000000000 rows, 24.47 GiB in 73.697 sec., 13.57 million rows/sec., 340.05 MiB/sec.

This branch:

mysql> select * from t10b where c2='0.4091273956277217';
Connection id:    9
Current database: bloom

+-----------+--------------------+
| c1        | c2                 |
+-----------+--------------------+
| 125000007 | 0.4091273956277217 |
+-----------+--------------------+
1 row in set (41.11 sec)
Read 6000000 rows, 157.95 MiB in 41.090 sec., 146.02 thousand rows/sec., 3.84 MiB/sec.

The XOR filter path is ~2X faster and scans far less data (it only needs to read the xor filter index plus the partitions that survive pruning):

mysql> explain select * from t10b where c2='0.4091273956277217';
+-----------------------------------------------------------------------------+
| explain                                                                     |
+-----------------------------------------------------------------------------+
| TableScan                                                                   |
| ├── table: default.bloom.t10b                                               |
| ├── read rows: 6000000                                                      |
| ├── read bytes: 126715443                                                   |
| ├── partitions total: 1000                                                  |
| ├── partitions scanned: 6                                                   |
| └── push downs: [filters: [(c2 = '0.4091273956277217')], limit: NONE]       |
+-----------------------------------------------------------------------------+
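A back-of-envelope check of the pruning effect shown in the EXPLAIN output: 6 of 1000 partitions survive the xor-filter check, so with 1e9 rows split evenly across partitions the scan touches 6 million rows instead of a billion:

```rust
fn main() {
    let total_rows: u64 = 1_000_000_000; // rows in t10b
    let total_partitions: u64 = 1_000;   // "partitions total" from EXPLAIN
    let scanned_partitions: u64 = 6;     // "partitions scanned" from EXPLAIN

    // Assuming an even split, each partition holds 1M rows.
    let rows_per_partition = total_rows / total_partitions;
    let rows_scanned = scanned_partitions * rows_per_partition;

    assert_eq!(rows_scanned, 6_000_000); // matches "read rows: 6000000"
    println!("rows scanned: {rows_scanned}");
}
```

The six surviving partitions are xor-filter false positives plus the one true match; the remaining 994 are skipped without touching their data.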

Combined with #7893, point queries should get even better.

Data size: (screenshot omitted)

XOR index size (this is the worst case, because all my keys are unique): (screenshot omitted)

@dantengsky dantengsky force-pushed the feat-renable-xor-bloom-index branch from f1b7841 to deeb3b5 Compare September 27, 2022 01:46
Member

@BohuTANG left a comment


👍
