Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow fractional bits/key in BloomFilterPolicy #6092

Closed
wants to merge 4 commits into from

Conversation

pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Nov 26, 2019

Summary: There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.

This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.

Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.

Test Plan: unit tests included

Summary: There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.

TODO: more description

Test Plan: unit tests included.
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

: mode_(mode) {
// Sanitize bits_per_key
if (bits_per_key < 1.0) {
bits_per_key = 1.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this sanitizing? I know it doesn't make sense to use lower than 1.0. It doesn't seem to make sense to use 1.0 either. In that case, should we just leave it as it is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.0 is OK if you consider a 63% FP rate better than nothing. I do.

The big thing is getting rid of values like negatives, infinities, or NaN that can gum up the conversion. And I can see these arising in client code due to subtle logical errors, etc. I think it's better to keep the value within a generous realm of "reasonable" than to always trust the client. Note: this plays into PR #6088 because that makes it more likely that client code might compute rather than hard-code bits_per_key values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If 1.0 is OK, then why 0.99 is not? 0.98 is not? I think it's more like a code readability problem. When 1.0 is hard coded here, readers might assume something special to it. I don't have objection though, as long as code comment explains it clearly.


// Includes a nudge toward rounding up, to ensure on all platforms
// that doubles specified with three decimal digits after the decimal
// point are interpreted accurately.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection to the code, but just a question: do we need the result to be exactly the same for every platform?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't need that in any schema compatibility sense. It's just a judgment call on what kind of surprises we want to avoid.

// For better or worse, this is a rounding up of a nudged rounding up,
// e.g. 7.4999999999999 will round up to 8, but that provides more
// predictability against small arithmetic errors in floating point.
whole_bits_per_key_ = (millibits_per_key_ + 500) / 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the harm of not having whole_bits_per_key_ and just use millibits_per_key_ for legacy filters too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do that. I just thought it somewhat safer not to unlock new behaviors in the legacy filters. Forward-compatibility should work but that's more manual effort to test. This way, format_version alone can back out of potential issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not breaking back compatibility by doing it, right? Integer value should still generate the same value no matter what.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 27, 2019
@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in 57f3032.

dvdplm added a commit to dvdplm/rocksdb that referenced this pull request Nov 27, 2019
* upstream/master: (572 commits)
  Work around weird unused errors with Mingw (facebook#6075)
  Support options.max_open_files = -1 with periodic_compaction_seconds (facebook#6090)
  Fix HISTORY.md for 6.6.0 (facebook#6096)
  Expose and elaborate FilterBuildingContext (facebook#6088)
  Fix compilation under MSVC VS2015 (facebook#6081)
  Add shared library for musl-libc (facebook#3143)
  Refactor and clean up the code that reads a blob from a file (facebook#6093)
  Allow fractional bits/key in BloomFilterPolicy (facebook#6092)
  Refactor blob file creation logic (facebook#6066)
  Use lowercase for shlwapi.lib rpcrt4.lib (facebook#6076)
  Fix naming of library on PPC64LE (facebook#6080)
  Small improvements to Docker build for RocksJava (facebook#6079)
  Remove unused/undefined ImmutableCFOptions() (facebook#6086)
  Update 3rd-party libraries used by RocksJava (facebook#6084)
  Make default value of options.ttl to be 30 days when it is supported. (facebook#6073)
  Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (facebook#6072)
  Update HISTORY.md for forward compatibility (facebook#6085)
  Support ttl in Universal Compaction (facebook#6071)
  Fix the constness issues around autovector::iterator_impl's dereference operators (facebook#6057)
  Support options.ttl with options.max_open_files = -1 (facebook#6060)
  ...
wolfkdy pushed a commit to wolfkdy/rocksdb that referenced this pull request Dec 22, 2019
Summary:
There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.

This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.

Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.
Pull Request resolved: facebook#6092

Test Plan: unit tests included

Differential Revision: D18711313

Pulled By: pdillinger

fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants