Allow fractional bits/key in BloomFilterPolicy #6092

pdillinger · 2019-11-26T18:26:26Z

Summary: There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.

This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.

Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.

Test Plan: unit tests included

Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10. TODO: more description Test Plan: unit tests included.

facebook-github-bot

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2019-11-26T20:24:18Z

@pdillinger has updated the pull request. Re-import the pull request

facebook-github-bot

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

siying · 2019-11-26T20:58:07Z

table/block_based/filter_policy.cc

+    : mode_(mode) {
+  // Sanitize bits_per_key
+  if (bits_per_key < 1.0) {
+    bits_per_key = 1.0;


Any reason for this sanitizing? I know it doesn't make sense to use lower than 1.0. It doesn't seem to make sense to use 1.0 either. In that case, should we just leave it as it is?

1.0 is OK if you consider a 63% FP rate better than nothing. I do.

The big thing is getting rid of values like negatives, infinities, or NaN that can gum up the conversion. And I can see these arising in client code due to subtle logical errors, etc. I think it's better to keep the value within a generous realm of "reasonable" than to always trust the client. Note: this plays into PR #6088 because that makes it more likely that client code might compute rather than hard-code bits_per_key values.

If 1.0 is OK, then why 0.99 is not? 0.98 is not? I think it's more like a code readability problem. When 1.0 is hard coded here, readers might assume something special to it. I don't have objection though, as long as code comment explains it clearly.

siying · 2019-11-26T20:59:17Z

table/block_based/filter_policy.cc

+
+  // Includes a nudge toward rounding up, to ensure on all platforms
+  // that doubles specified with three decimal digits after the decimal
+  // point are interpreted accurately.


No objection to the code, but just a question: do we need the result to be exactly the same for every platform?

No, we don't need that in any schema compatibility sense. It's just a judgment call on what kind of surprises we want to avoid.

siying · 2019-11-26T21:03:52Z

table/block_based/filter_policy.cc

+  // For better or worse, this is a rounding up of a nudged rounding up,
+  // e.g. 7.4999999999999 will round up to 8, but that provides more
+  // predictability against small arithmetic errors in floating point.
+  whole_bits_per_key_ = (millibits_per_key_ + 500) / 1000;


What's the harm of not having whole_bits_per_key_ and just use millibits_per_key_ for legacy filters too?

I could do that. I just thought it somewhat safer not to unlock new behaviors in the legacy filters. Forward-compatibility should work but that's more manual effort to test. This way, format_version alone can back out of potential issues.

We are not breaking back compatibility by doing it, right? Integer value should still generate the same value no matter what.

facebook-github-bot · 2019-11-26T21:57:56Z

@pdillinger has updated the pull request. Re-import the pull request

facebook-github-bot

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2019-11-27T06:09:08Z

@pdillinger merged this pull request in 57f3032.

* upstream/master: (572 commits) Work around weird unused errors with Mingw (facebook#6075) Support options.max_open_files = -1 with periodic_compaction_seconds (facebook#6090) Fix HISTORY.md for 6.6.0 (facebook#6096) Expose and elaborate FilterBuildingContext (facebook#6088) Fix compilation under MSVC VS2015 (facebook#6081) Add shared library for musl-libc (facebook#3143) Refactor and clean up the code that reads a blob from a file (facebook#6093) Allow fractional bits/key in BloomFilterPolicy (facebook#6092) Refactor blob file creation logic (facebook#6066) Use lowercase for shlwapi.lib rpcrt4.lib (facebook#6076) Fix naming of library on PPC64LE (facebook#6080) Small improvements to Docker build for RocksJava (facebook#6079) Remove unused/undefined ImmutableCFOptions() (facebook#6086) Update 3rd-party libraries used by RocksJava (facebook#6084) Make default value of options.ttl to be 30 days when it is supported. (facebook#6073) Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (facebook#6072) Update HISTORY.md for forward compatibility (facebook#6085) Support ttl in Universal Compaction (facebook#6071) Fix the constness issues around autovector::iterator_impl's dereference operators (facebook#6057) Support options.ttl with options.max_open_files = -1 (facebook#6060) ...

Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10. This change not only requires refining the logic for choosing the best num_probes for a given bits/key setting, it revealed a flaw in that logic. As bits/key gets higher, the best num_probes for a cache-local Bloom filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a standard Bloom filter. For example, at 16 bits per key, the best num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%). This change fixes and refines that logic (for the format_version=5 Bloom filter only, just in case) based on empirical tests to find accuracy inflection points between each num_probes. Although bits_per_key is now specified as a double, the new Bloom filter converts/rounds this to "millibits / key" for predictable/precise internal computations. Just in case of unforeseen compatibility issues, we round to the nearest whole number bits / key for the legacy Bloom filter, so as not to unlock new behaviors for it. Pull Request resolved: facebook#6092 Test Plan: unit tests included Differential Revision: D18711313 Pulled By: pdillinger fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a

facebook-github-bot added the CLA Signed label Nov 26, 2019

pdillinger requested a review from siying November 26, 2019 18:46

facebook-github-bot reviewed Nov 26, 2019

View reviewed changes

pdillinger added 2 commits November 26, 2019 12:20

Signed/unsigned in options_test

3d4edd4

Top out at 24 probes in new Bloom filter

c3bae40

facebook-github-bot reviewed Nov 26, 2019

View reviewed changes

siying reviewed Nov 26, 2019

View reviewed changes

siying approved these changes Nov 26, 2019

View reviewed changes

Use cmath constants instead of division by 0.0

4000413

facebook-github-bot reviewed Nov 26, 2019

View reviewed changes

facebook-github-bot closed this in 57f3032 Nov 26, 2019

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 27, 2019

Add explicit (leftover from facebook#6092)

93a5737

facebook-github-bot added the Merged label Nov 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow fractional bits/key in BloomFilterPolicy #6092

Allow fractional bits/key in BloomFilterPolicy #6092

pdillinger commented Nov 26, 2019 •

edited

Loading

facebook-github-bot left a comment

facebook-github-bot commented Nov 26, 2019

facebook-github-bot left a comment

siying Nov 26, 2019

pdillinger Nov 26, 2019

siying Nov 26, 2019

siying Nov 26, 2019

pdillinger Nov 26, 2019

siying Nov 26, 2019

pdillinger Nov 26, 2019

siying Nov 26, 2019

facebook-github-bot commented Nov 26, 2019

facebook-github-bot left a comment

facebook-github-bot commented Nov 27, 2019

Allow fractional bits/key in BloomFilterPolicy #6092

Allow fractional bits/key in BloomFilterPolicy #6092

Conversation

pdillinger commented Nov 26, 2019 • edited Loading

facebook-github-bot left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Nov 26, 2019

facebook-github-bot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

facebook-github-bot commented Nov 26, 2019

facebook-github-bot left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Nov 27, 2019

pdillinger commented Nov 26, 2019 •

edited

Loading