Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Guice instead of custom bloom filter implementation #300

Merged
merged 3 commits into from
Jan 30, 2020

Conversation

hennekey
Copy link
Contributor

@hennekey hennekey commented Jan 15, 2020

This uses the avaialable code in Guice rather than a custom
implementation. It also provides a performance increase (as demonstrated
by the unit tests)

In service of #299

This uses the avaialable code in Guice rather than a custom
implementation. It also provides a performance increase (as demonstrated
by the unit tests)
Copy link
Collaborator

@ato ato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing Heritrix's custom BloomFilter with Guava's seems good to me if it has better performance.

The use of setAccessible() to match the existing interface makes me a little uncomfortable but at least its only used for the tests and logging and could be removed in future if it breaks on a Guava upgrade.

@anjackson, you guys use BloomUriUniqFilter right? Does this look ok to you?

@ato ato requested a review from anjackson January 20, 2020 05:15
@hennekey
Copy link
Contributor Author

I know the use of setAccessible is regrettable, but I wanted the tests to run against this. They all passed for me locally, I'll see why they failed on Travis.

By using assertEquals and seting the expected and actual values, the
failure messages become a bit more useful.
Otherwise this is a count of how many times this add method is called,
not how many times an element was noted as being actually added.
@ato ato merged commit b45f774 into internetarchive:master Jan 30, 2020
@hennekey hennekey deleted the fix-299 branch February 13, 2020 14:30
@@ -141,45 +80,18 @@ public BloomFilter64bit( final long n, final int d, boolean roundUp) {
* @param roundUp if true, round bit size up to next-nearest-power-of-2
*/
public BloomFilter64bit(final long n, final int d, Random weightsGenerator, boolean roundUp ) {
delegate = com.google.common.hash.BloomFilter.create(Funnels.unencodedCharsFunnel(), Ints.saturatedCast(n), 0.0000003);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding this correctly, it looks like d is unused and we have a hard-coded false-positive rate of roughly one in three million, rather than one in 2^d. We are currently using Heritrix for broad crawls with billions of URLs, so we'd want to be able to tune this appropriately. Can we just use 2^d rather than hard-coding the value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure. When I searched through the code I did not see any varying parameters being passed. The only use I found was this: https://github.com/internetarchive/heritrix3/blob/master/engine/src/main/java/org/archive/crawler/util/BloomUriUniqFilter.java#L89

Copy link
Collaborator

@anjackson anjackson Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I understand! This would usually be called from the Spring crawler beans rather than directly in the code, and those beans files are not in the core H3 repository. (Using Spring XML to configure H3 is very flexible but it makes all this stuff very difficult).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anjackson
Copy link
Collaborator

anjackson commented Mar 4, 2020

Well, looks like my review/note didn't get submitted? Anyway, I believe this merged change hard-codes the false-positive rate, which is a significant change in behaviour for this class and should at least be noted somewhere.

The I think the rate of ~ one in 3 million is fixed rather than in proportion to the expected expectedInsertions, so will happen quite frequently with multi-billion URL crawl.

@ato ato changed the title Use Guice instead of custom implementation Use Guice instead of custom bloom filter implementation Mar 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants