-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Guice instead of custom bloom filter implementation #300
Conversation
This uses the avaialable code in Guice rather than a custom implementation. It also provides a performance increase (as demonstrated by the unit tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing Heritrix's custom BloomFilter with Guava's seems good to me if it has better performance.
The use of setAccessible() to match the existing interface makes me a little uncomfortable but at least its only used for the tests and logging and could be removed in future if it breaks on a Guava upgrade.
@anjackson, you guys use BloomUriUniqFilter right? Does this look ok to you?
I know the use of |
By using assertEquals and seting the expected and actual values, the failure messages become a bit more useful.
Otherwise this is a count of how many times this add method is called, not how many times an element was noted as being actually added.
@@ -141,45 +80,18 @@ public BloomFilter64bit( final long n, final int d, boolean roundUp) { | |||
* @param roundUp if true, round bit size up to next-nearest-power-of-2 | |||
*/ | |||
public BloomFilter64bit(final long n, final int d, Random weightsGenerator, boolean roundUp ) { | |||
delegate = com.google.common.hash.BloomFilter.create(Funnels.unencodedCharsFunnel(), Ints.saturatedCast(n), 0.0000003); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding this correctly, it looks like d
is unused and we have a hard-coded false-positive rate of roughly one in three million, rather than one in 2^d. We are currently using Heritrix for broad crawls with billions of URLs, so we'd want to be able to tune this appropriately. Can we just use 2^d rather than hard-coding the value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure. When I searched through the code I did not see any varying parameters being passed. The only use I found was this: https://github.com/internetarchive/heritrix3/blob/master/engine/src/main/java/org/archive/crawler/util/BloomUriUniqFilter.java#L89
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I understand! This would usually be called from the Spring crawler beans rather than directly in the code, and those beans files are not in the core H3 repository. (Using Spring XML to configure H3 is very flexible but it makes all this stuff very difficult).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, looks like my review/note didn't get submitted? Anyway, I believe this merged change hard-codes the false-positive rate, which is a significant change in behaviour for this class and should at least be noted somewhere. The I think the rate of ~ one in 3 million is fixed rather than in proportion to the expected |
This uses the avaialable code in Guice rather than a custom
implementation. It also provides a performance increase (as demonstrated
by the unit tests)
In service of #299