feat: enable bloom filter index #6639

Merged

dantengsky (Member) commented Jul 15, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Enable the bloom filter index at the block level. (Thanks @junli1026!)

  • For fields of primitive types, a bloom filter index is built during insertion (or rebuilt during mutations).
    For each block, an index file is generated in the index path (prefixed with _i/).
  • The bloom index is used when a point query is detected (currently only the binary operator "=").
    • Only the index of the column that is used is loaded (and cached if the table cache is enabled).
    • The default maximum size of cached index data is 1G.
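
To make the mechanism above concrete, here is a minimal sketch of a per-block bloom filter and of block pruning for an "=" predicate. It is not the code in this PR: the names `BloomFilter`, `build_block_index`, and `prune_blocks`, the SHA-256 based hashing, and the in-memory layout are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Toy bloom filter sized for `capacity` distinct values (hypothetical helper)."""

    def __init__(self, capacity: int, fpr: float = 0.01):
        capacity = max(1, capacity)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def build_block_index(block: dict) -> dict:
    """Build one filter per column, sized by the block's row count."""
    index = {}
    for column, values in block.items():
        bloom = BloomFilter(capacity=len(values))
        for value in values:
            bloom.add(value)
        index[column] = bloom
    return index


def prune_blocks(indexes: list, column: str, literal: str) -> list:
    """Return positions of blocks that may hold rows where `column = literal`."""
    # Only the filter of the queried column is consulted.
    return [i for i, index in enumerate(indexes) if index[column].might_contain(literal)]


blocks = [
    {"c1": [str(n) for n in range(0, 1000)]},
    {"c1": [str(n) for n in range(1000, 2000)]},
]
indexes = [build_block_index(block) for block in blocks]
candidates = prune_blocks(indexes, "c1", "1500")
```

A bloom filter never yields false negatives, so the block that really holds "1500" is always among `candidates`; other blocks are skipped unless they hit a false positive.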

Performance

  • Read. Performance improves as expected when the bloom filter index can be utilized.
  • Write. No significant impact on write performance was found.
  • Index size. For a block of 1M rows, about 1.5MB of index per column (on-disk file size), with:
    • the false positive rate set to 1%
    • the number of distinct values taken to be the number of rows in the given block (could be optimised later)
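
As a rough cross-check of the index size, the textbook bloom filter sizing formula can be evaluated for 1M distinct values at a 1% false positive rate. This is a sketch under the assumption that a classic bloom filter is used; `bloom_size` is a hypothetical helper, and the actual on-disk format may add overhead or round sizes differently.

```python
import math


def bloom_size(n_distinct: int, fpr: float):
    """Return (bits, hash_count) from the textbook bloom filter formulas."""
    bits = math.ceil(-n_distinct * math.log(fpr) / math.log(2) ** 2)
    hash_count = round(bits / n_distinct * math.log(2))
    return bits, hash_count


bits, hash_count = bloom_size(1_000_000, 0.01)
size_mb = bits / 8 / 1e6
print(f"{size_mb:.2f} MB per column, {hash_count} hash functions")
```

The formula gives about 1.2 MB per column, the same order of magnitude as the roughly 1.5MB measured on disk.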

Test scenario

  • Standalone deployment, with Local FS
  • Table of 10B rows:
    create table t10b as select cast(number as string) as c1, cast(rand() as string) as c2 from numbers(10000000000)

Read:

  • No Table Meta and Index Cache

without bloom filter index

mysql> select * from t10b where c2 = "0.7826850382733147";
Empty set (1 min 42.89 sec)
Read 10000000000 rows, 411.26 GiB in 102.846 sec., 97.23 million rows/sec., 4.00 GiB/sec.

with bloom filter index

mysql> select * from t10b where c2 = "0.7826850382733147";
Empty set (5.50 sec)
Read 106000000 rows, 4.36 GiB in 1.201 sec., 88.28 million rows/sec., 3.63 GiB/sec.

  • Table Meta and Index are fully cached (a table of 1B rows is used in this case, so that the index can be fully cached)

without bloom filter index

mysql> select * from t1b where c2 = "0.78268503827331471";
Empty set (10.04 sec)
Read 1000000000 rows, 40.19 GiB in 10.028 sec., 99.72 million rows/sec., 4.01 GiB/sec.

with bloom filter index

mysql> select * from t1b where c2 = "0.78268503827331471";
Empty set (1.11 sec)
Read 7666666 rows, 315.42 MiB in 0.090 sec., 85.24 million rows/sec., 3.42 GiB/sec.
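
The read results can be summarised with a little arithmetic. The sketch below only reuses the figures printed in the query outputs above; the dictionary keys are labels made up for this illustration.

```python
# (rows read with index, total rows, seconds without index, seconds with index)
runs = {
    "t10b, no cache": (106_000_000, 10_000_000_000, 102.89, 5.50),
    "t1b, fully cached": (7_666_666, 1_000_000_000, 10.04, 1.11),
}

summary = {}
for name, (rows_read, total_rows, secs_without, secs_with) in runs.items():
    scanned = rows_read / total_rows    # fraction of rows still read
    speedup = secs_without / secs_with  # wall-clock improvement
    summary[name] = (scanned, speedup)
    print(f"{name}: scanned {scanned:.2%} of rows, {speedup:.1f}x faster")
```

Both queries return an empty set, so every block that is still read is a false positive. The scanned fractions (about 1.06% and 0.77%) are in line with the 1% false positive rate, though this is an observation from the numbers rather than a claim made in the PR.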

Write:

without bloom filter index

mysql> create table t10b_no_idx as select cast(number as string) as c1, cast(rand() as string) as c2 from numbers(10000000000);
Query OK, 0 rows affected (30 min 21.37 sec)

with bloom filter index

mysql> create table t10b as select cast(number as string) as c1, cast(rand() as string) as c2 from numbers(10000000000);
Query OK, 0 rows affected (31 min 35.24 sec)
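
For reference, the write overhead implied by the two timings above can be computed as follows. This is a small sketch using only the reported durations; `to_seconds` is a hypothetical helper.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a 'M min S sec' duration to seconds."""
    return minutes * 60 + seconds


without_index = to_seconds(30, 21.37)
with_index = to_seconds(31, 35.24)
extra = with_index - without_index
overhead = extra / without_index
print(f"extra time: {extra:.2f} s ({overhead:.1%})")
```

That is roughly 74 seconds, or about 4%, on a single run each, which is consistent with the statement that write performance is not significantly affected.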

Storage:

For this test scenario, index / data ~= 10%

Fixes #issue


@dantengsky dantengsky changed the title feat : enable bloom filter index feat: enable bloom filter index Jul 15, 2022
@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Jul 15, 2022
@dantengsky dantengsky force-pushed the feat-bloom-switch-to-external-index branch from cd2b416 to 0f346c0 Compare July 15, 2022 03:29
@dantengsky dantengsky force-pushed the feat-bloom-switch-to-external-index branch from 561dc0f to 72a43d7 Compare July 25, 2022 14:19
common/settings/src/lib.rs (outdated, resolved)
@BohuTANG BohuTANG mentioned this pull request Aug 3, 2022
55 tasks
@dantengsky dantengsky force-pushed the feat-bloom-switch-to-external-index branch from aa0c1fd to c3415d7 Compare August 3, 2022 06:17
@dantengsky
Member Author

@youngsofun the bloom filter index is not populated for ResultTables, since it does not seem suitable there. If anything else should be covered, please let me know.

@flaneur2020 the field index_size of table system.tables will no longer be all NULLs; for tables of the fuse engine, the value will be equal to or larger than 0. Hope this will not break things.

@dantengsky dantengsky marked this pull request as ready for review August 3, 2022 06:26
@dantengsky dantengsky requested review from zhyass and youngsofun August 3, 2022 06:27
@BohuTANG BohuTANG requested review from zhang2014 and sundy-li August 3, 2022 06:40
@Xuanwo
Member

Xuanwo commented Aug 3, 2022

Looks great so far!

@zhang2014
Member

How to set bloom filter false positive?

@dantengsky
Member Author

How to set bloom filter false positive?

It is hard-coded as 1% for now.

@BohuTANG
Member

BohuTANG commented Aug 3, 2022

Expected: statement query must get result equal to expected
Message: 
 Expected:
enable_async_insert 0 0 SESSION Whether the client open async insert mode, default value: 0 UInt64
enable_bloom_filter_index 0 0 SESSION Enable bloom filter index (if applicable for the underlying table engine) by setting this variable to 1, default value: 0	UInt64
enable_new_processor_framework 1 1 SESSION Enable new processor framework if value != 0, default value: 1 UInt64
enable_planner_v2 1 0 SESSION Enable planner v2 by setting this variable to 1, default value: 0 UInt64
 Actual:
enable_async_insert 0 0 SESSION Whether the client open async insert mode, default value: 0 UInt64
enable_new_processor_framework 1 1 SESSION Enable new processor framework if value != 0, default value: 1 UInt64
enable_planner_v2 1 0 SESSION Enable planner v2 by setting this variable to 1, default value: 0 UInt64
 Statement:
Parsed Statement
    at_line: 32,
    s_type: Statement: query, type: TTTTTT, query_type: TTTTTT, retry: False,
    suite_name: base/06_show/06_0003_show_settings_v2,
    text:
        SHOW SETTINGS LIKE 'enable%';
    results: [(<re.Match object; span=(0, 4), match='----'>, 39, 'enable_async_insert 0 0 SESSION Whether the client open async insert mode, default value: 0 UInt64\nenable_bloom_filter_index 0 0 SESSION Enable bloom filter index (if applicable for the underlying table engine) by setting this variable to 1, default value: 0\tUInt64\nenable_new_processor_framework 1 1 SESSION Enable new processor framework if value != 0, default value: 1 UInt64\nenable_planner_v2 1 0 SESSION Enable planner v2 by setting this variable to 1, default value: 0 UInt64')],
    runs_on: {'mysql'},
 Start Line: 39, Result Label: 

@mergify mergify bot merged commit 71b9327 into databendlabs:main Aug 3, 2022