Allow fractional bits/key in BloomFilterPolicy #6092
Conversation
Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10.

TODO: more description

Test Plan: unit tests included.
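For illustration (not part of the PR itself), a minimal sketch of how the fractional setting could be wired up, assuming the post-PR NewBloomFilterPolicy(double) overload and the usual block-based table options:

    // Sketch only: assumes the double overload of NewBloomFilterPolicy
    // introduced by this PR and standard block-based table plumbing.
    #include <memory>
    #include "rocksdb/filter_policy.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    int main() {
      rocksdb::BlockBasedTableOptions table_options;
      table_options.format_version = 5;  // new cache-local Bloom filter
      // Fractional bits/key: ~same accuracy as legacy bits_per_key=10.
      table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(9.55));

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      // ... open the DB with `options` as usual ...
      return 0;
    }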
@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pdillinger has updated the pull request.
@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
    : mode_(mode) {
  // Sanitize bits_per_key
  if (bits_per_key < 1.0) {
    bits_per_key = 1.0;
Any reason for this sanitizing? I know it doesn't make sense to use a value lower than 1.0, but it doesn't seem to make sense to use 1.0 either. In that case, should we just leave the value as it is?
1.0 is OK if you consider a 63% FP rate better than nothing. I do.
The big thing is getting rid of values like negatives, infinities, or NaN that can gum up the conversion. And I can see these arising in client code due to subtle logical errors, etc. I think it's better to keep the value within a generous realm of "reasonable" than to always trust the client. Note: this plays into PR #6088 because that makes it more likely that client code might compute rather than hard-code bits_per_key values.
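One way to make the clamp also catch NaN is to negate the in-range test, since every ordered comparison against NaN is false. A sketch (the upper bound here is an illustrative assumption, not taken from the PR):

    // Illustrative sanitization sketch; the 100.0 upper bound is assumed
    // for the example, not taken from the PR.
    if (bits_per_key < 1.0) {
      bits_per_key = 1.0;    // negatives, tiny values, -infinity clamp up
    } else if (!(bits_per_key < 100.0)) {
      bits_per_key = 100.0;  // catches +infinity and NaN as well
    }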
If 1.0 is OK, then why is 0.99 not? Or 0.98? I think it's more of a code readability problem: when 1.0 is hard-coded here, readers might assume there is something special about it. I don't have an objection though, as long as a code comment explains it clearly.
  // Includes a nudge toward rounding up, to ensure on all platforms
  // that doubles specified with three decimal digits after the decimal
  // point are interpreted accurately.
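The statement this comment precedes isn't shown in the hunk above; a hedged reconstruction of the conversion it describes (the exact nudge constant is an assumption for illustration):

    // Reconstructed for illustration; the exact nudge constant is assumed.
    millibits_per_key_ = static_cast<int>(bits_per_key * 1000.0 + 0.500001);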
No objection to the code, but just a question: do we need the result to be exactly the same for every platform?
No, we don't need that in any schema compatibility sense. It's just a judgment call on what kind of surprises we want to avoid.
  // For better or worse, this is a rounding up of a nudged rounding up,
  // e.g. 7.4999999999999 will round up to 8, but that provides more
  // predictability against small arithmetic errors in floating point.
  whole_bits_per_key_ = (millibits_per_key_ + 500) / 1000;
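To make the double rounding concrete, a small standalone sketch (the millibit conversion mirrors the assumed nudge above; variable names are local to the example):

    // Demonstrates the nudged rounding discussed above; standalone sketch.
    #include <cstdio>

    int main() {
      const double inputs[] = {9.55, 7.4999999999999, 10.0};
      for (double bits_per_key : inputs) {
        int millibits = static_cast<int>(bits_per_key * 1000.0 + 0.500001);
        int whole_bits = (millibits + 500) / 1000;  // round to nearest
        printf("%.14g -> %d millibits -> %d whole bits\n",
               bits_per_key, millibits, whole_bits);
      }
      // Expected output:
      //   9.55 -> 9550 millibits -> 10 whole bits
      //   7.4999999999999 -> 7500 millibits -> 8 whole bits
      //   10 -> 10000 millibits -> 10 whole bits
      return 0;
    }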
What's the harm of not having whole_bits_per_key_ and just using millibits_per_key_ for legacy filters too?
I could do that. I just thought it somewhat safer not to unlock new behaviors in the legacy filters. Forward compatibility should work, but that's more manual effort to test. This way, format_version alone can back out of potential issues.
We are not breaking backward compatibility by doing it, right? An integer value should still generate the same result no matter what.
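(Worked check, using the conversion sketched earlier: bits_per_key=10 gives 10000 millibits, and (10000 + 500) / 1000 = 10, so integer settings produce the same whole bits/key as before.)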
@pdillinger has updated the pull request.
@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pdillinger merged this pull request in 57f3032.
* upstream/master: (572 commits)
  Work around weird unused errors with Mingw (facebook#6075)
  Support options.max_open_files = -1 with periodic_compaction_seconds (facebook#6090)
  Fix HISTORY.md for 6.6.0 (facebook#6096)
  Expose and elaborate FilterBuildingContext (facebook#6088)
  Fix compilation under MSVC VS2015 (facebook#6081)
  Add shared library for musl-libc (facebook#3143)
  Refactor and clean up the code that reads a blob from a file (facebook#6093)
  Allow fractional bits/key in BloomFilterPolicy (facebook#6092)
  Refactor blob file creation logic (facebook#6066)
  Use lowercase for shlwapi.lib rpcrt4.lib (facebook#6076)
  Fix naming of library on PPC64LE (facebook#6080)
  Small improvements to Docker build for RocksJava (facebook#6079)
  Remove unused/undefined ImmutableCFOptions() (facebook#6086)
  Update 3rd-party libraries used by RocksJava (facebook#6084)
  Make default value of options.ttl to be 30 days when it is supported. (facebook#6073)
  Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (facebook#6072)
  Update HISTORY.md for forward compatibility (facebook#6085)
  Support ttl in Universal Compaction (facebook#6071)
  Fix the constness issues around autovector::iterator_impl's dereference operators (facebook#6057)
  Support options.ttl with options.max_open_files = -1 (facebook#6060)
  ...
Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10.

This change not only requires refining the logic for choosing the best num_probes for a given bits/key setting, it revealed a flaw in that logic. As bits/key gets higher, the best num_probes for a cache-local Bloom filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a standard Bloom filter. For example, at 16 bits per key, the best num_probes is 9 (FP rate = 0.0843%), not 11 (FP rate = 0.0884%). This change fixes and refines that logic (for the format_version=5 Bloom filter only, just in case) based on empirical tests to find accuracy inflection points between each num_probes.

Although bits_per_key is now specified as a double, the new Bloom filter converts/rounds this to "millibits / key" for predictable/precise internal computations. Just in case of unforeseen compatibility issues, we round to the nearest whole number bits / key for the legacy Bloom filter, so as not to unlock new behaviors for it.

Pull Request resolved: facebook#6092

Test Plan: unit tests included

Differential Revision: D18711313

Pulled By: pdillinger

fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
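As background on the 0.69 factor above: for a classic (non-blocked) Bloom filter, the theoretically optimal probe count is k = ln(2) * bits/key, about 11 at 16 bits/key, whereas the PR's empirical optimum for the cache-local filter is 9. A small sketch of the classic formulas for comparison (the FP rates quoted in the summary are empirical blocked-filter numbers and differ from these):

    // Classic Bloom filter theory, for comparison with the PR's empirical
    // numbers; this is NOT the cache-local filter's FP model.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double bpk = 16.0;  // bits per key
      printf("theoretical optimum k = ln(2) * %.0f = %.2f\n",
             bpk, std::log(2.0) * bpk);  // ~11.09
      for (int k = 8; k <= 12; ++k) {
        // Classic FP rate approximation: (1 - e^(-k/bpk))^k.
        double fp = std::pow(1.0 - std::exp(-k / bpk), k);
        printf("k=%2d -> classic FP rate ~= %.4f%%\n", k, fp * 100.0);
      }
      return 0;
    }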