Describe the bug
Hey there, thank you for creating this library! I like the work you've put into the documentation. I'm a newbie when it comes to Julia and LSH so apologies if I get things wrong.
I tried running jaccard() on hash vectors after using MinHash with the default Base.hash hashing method, and received puzzling results, sometimes with a similarity value above 1, which should be impossible. I think this is due to summing large hash values, which causes an overflow.
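To illustrate the suspected overflow, here is a minimal sketch. It assumes the vector method computes something like a sum-of-minima over sum-of-maxima ratio; `weighted_jaccard_sketch` is a hypothetical stand-in and not the library's actual code. With `UInt64` inputs, `sum` wraps around modulo 2^64, so the denominator can end up smaller than the numerator:

```julia
# Hypothetical stand-in for a sum-based (weighted) Jaccard.
# This is an assumption for illustration, not the library's implementation.
weighted_jaccard_sketch(x, y) = sum(min.(x, y)) / sum(max.(x, y))

# Two made-up "hash vectors" with large UInt64 entries.
x = UInt64[0x8000000000000000, 0x7000000000000000]
y = UInt64[0x8000000000000000, 0x8000000000000001]

numerator   = sum(min.(x, y))  # fits in a UInt64, no wraparound
denominator = sum(max.(x, y))  # exceeds typemax(UInt64) and wraps around

overflowed = weighted_jaccard_sketch(x, y)  # far above 1

# Converting each element to Float64 before summing avoids the wraparound
# (at the cost of floating-point rounding for very large values).
safe = sum(Float64, min.(x, y)) / sum(Float64, max.(x, y))
```

Even without the overflow, a sum-based ratio treats the hash values as magnitudes, so it would not be 0 for disjoint sets of hashes.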
This is probably not the intended use of the function - collision_probability seems like the right function. I was drawn to do this because explanations of MinHash I've seen say that the Jaccard index for hashes should approximate the value for the original shingles. I suspect other learners may be led to do this too.
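For reference, the usual way MinHash signatures are compared is by the fraction of positions where the two signatures agree, which approximates the Jaccard index of the original sets. A minimal sketch follows; `signature_similarity` is a hypothetical helper name and is not part of this library, and the signature values are made up:

```julia
# Hypothetical helper: estimate Jaccard similarity from two MinHash signatures
# of equal length, as the fraction of positions where the hashes match.
function signature_similarity(a::AbstractVector, b::AbstractVector)::Float64
    length(a) == length(b) ||
        throw(DimensionMismatch("signatures must have the same length"))
    return count(a .== b) / length(a)
end

# Made-up signatures: positions 1 and 3 agree, positions 2 and 4 differ.
sig_a = UInt64[3, 7, 11, 42]
sig_b = UInt64[3, 8, 11, 99]

estimate = signature_similarity(sig_a, sig_b)
```

This uses only equality comparisons, so the size of the hash values cannot cause an overflow.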
Expected behavior
Because there is no overlap between the sets, the Jaccard index for their minhashes should always be 0. A naïve implementation that does not use arithmetic operations seems to get it right, e.g.:
function jaccard(x::AbstractVector, y::AbstractVector)::Float64
    length(x ∩ y) / length(x ∪ y)
end
Or perhaps something that might clarify which function to use!
inkybutton changed the title from "jaccard() on vectors is inaccurate, yielding similarity >1 sometimes" to "jaccard() on hash vectors is inaccurate" on Dec 7, 2022
To Reproduce
This results in similarity >0, sometimes >1.