Use Guice instead of custom bloom filter implementation #300

hennekey · 2020-01-15T23:15:19Z

This uses the avaialable code in Guice rather than a custom
implementation. It also provides a performance increase (as demonstrated
by the unit tests)

In service of #299

This uses the avaialable code in Guice rather than a custom implementation. It also provides a performance increase (as demonstrated by the unit tests)

ato

Replacing Heritrix's custom BloomFilter with Guava's seems good to me if it has better performance.

The use of setAccessible() to match the existing interface makes me a little uncomfortable but at least its only used for the tests and logging and could be removed in future if it breaks on a Guava upgrade.

@anjackson, you guys use BloomUriUniqFilter right? Does this look ok to you?

hennekey · 2020-01-22T21:58:43Z

I know the use of setAccessible is regrettable, but I wanted the tests to run against this. They all passed for me locally, I'll see why they failed on Travis.

By using assertEquals and seting the expected and actual values, the failure messages become a bit more useful.

Otherwise this is a count of how many times this add method is called, not how many times an element was noted as being actually added.

anjackson · 2020-01-23T11:19:09Z

commons/src/main/java/org/archive/util/BloomFilter64bit.java

@@ -141,45 +80,18 @@ public BloomFilter64bit( final long n, final int d, boolean roundUp) {
     * @param roundUp if true, round bit size up to next-nearest-power-of-2
     */
    public BloomFilter64bit(final long n, final int d, Random weightsGenerator, boolean roundUp ) {
+        delegate = com.google.common.hash.BloomFilter.create(Funnels.unencodedCharsFunnel(), Ints.saturatedCast(n), 0.0000003);


If I'm understanding this correctly, it looks like d is unused and we have a hard-coded false-positive rate of roughly one in three million, rather than one in 2^d. We are currently using Heritrix for broad crawls with billions of URLs, so we'd want to be able to tune this appropriately. Can we just use 2^d rather than hard-coding the value?

For sure. When I searched through the code I did not see any varying parameters being passed. The only use I found was this: https://github.com/internetarchive/heritrix3/blob/master/engine/src/main/java/org/archive/crawler/util/BloomUriUniqFilter.java#L89

Ah! I understand! This would usually be called from the Spring crawler beans rather than directly in the code, and those beans files are not in the core H3 repository. (Using Spring XML to configure H3 is very flexible but it makes all this stuff very difficult).

anjackson · 2020-03-04T13:29:11Z

Well, looks like my review/note didn't get submitted? Anyway, I believe this merged change hard-codes the false-positive rate, which is a significant change in behaviour for this class and should at least be noted somewhere.

The I think the rate of ~ one in 3 million is fixed rather than in proportion to the expected expectedInsertions, so will happen quite frequently with multi-billion URL crawl.

Use Guice instead of custom implementation

080380b

This uses the avaialable code in Guice rather than a custom implementation. It also provides a performance increase (as demonstrated by the unit tests)

ato approved these changes Jan 20, 2020

View reviewed changes

ato requested a review from anjackson January 20, 2020 05:15

hennekey added 2 commits January 22, 2020 17:13

Fix assertions

33458f1

By using assertEquals and seting the expected and actual values, the failure messages become a bit more useful.

Increment the count only when the filter notes it

9cb9563

Otherwise this is a count of how many times this add method is called, not how many times an element was noted as being actually added.

ato merged commit b45f774 into internetarchive:master Jan 30, 2020

hennekey deleted the fix-299 branch February 13, 2020 14:30

hennekey mentioned this pull request Feb 28, 2020

BloomFilter64bitTest is slow #299

Closed

anjackson reviewed Mar 4, 2020

View reviewed changes

ato changed the title ~~Use Guice instead of custom implementation~~ Use Guice instead of custom bloom filter implementation Mar 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Guice instead of custom bloom filter implementation #300

Use Guice instead of custom bloom filter implementation #300

hennekey commented Jan 15, 2020 •

edited

Loading

ato left a comment

hennekey commented Jan 22, 2020

anjackson Jan 23, 2020

hennekey Mar 4, 2020

anjackson Mar 4, 2020 •

edited

Loading

hennekey Mar 4, 2020

anjackson commented Mar 4, 2020 •

edited

Loading

Use Guice instead of custom bloom filter implementation #300

Use Guice instead of custom bloom filter implementation #300

Conversation

hennekey commented Jan 15, 2020 • edited Loading

ato left a comment

Choose a reason for hiding this comment

hennekey commented Jan 22, 2020

anjackson Jan 23, 2020

Choose a reason for hiding this comment

hennekey Mar 4, 2020

Choose a reason for hiding this comment

anjackson Mar 4, 2020 • edited Loading

Choose a reason for hiding this comment

hennekey Mar 4, 2020

Choose a reason for hiding this comment

anjackson commented Mar 4, 2020 • edited Loading

hennekey commented Jan 15, 2020 •

edited

Loading

anjackson Mar 4, 2020 •

edited

Loading

anjackson commented Mar 4, 2020 •

edited

Loading