-
-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53
Comments
This is technically "works as intended" as it's mentioned in the README. However, it is highly misleading as it is not what Jaro-Winkler proposed and has strongly negative side-effect of giving all strings with common prefix >=10 a perfect similarity. It's also not documented: https://docs.rs/strsim/0.10.0/strsim/fn.jaro_winkler.html |
Implementation of Jaro-Winkler similarity in the dguo/strsim-rs crate is wrong, causing strings with common prefix >=10 to all be considered perfect matches Using Jaro instead from the same crate fixes this issue Benefit of favoring long prefixes exists for matching common names But not for typo detection Hence use of Jaro instead of Jaro-Winkler is acceptable Confidence threshold adjusted so that `bar` is still suggested for `baz` since Jaro is strictly < Jaro-Winkler such an adjustment is expected. This is acceptable. While exact suggestions may change, the net change will be positive Suggestions are purely decorative and should thus not breaking change Fixes clap-rs#4660 Also see rapidfuzz/strsim-rs#53
Implementation of Jaro-Winkler similarity in the dguo/strsim-rs crate is wrong, causing strings with common prefix >=10 to all be considered perfect matches Using Jaro instead from the same crate fixes this issue Benefit of favoring long prefixes exists for matching common names But not for typo detection Hence use of Jaro instead of Jaro-Winkler is acceptable Confidence threshold adjusted so that `bar` is still suggested for `baz` since Jaro is strictly < Jaro-Winkler such an adjustment is expected. This is acceptable. While exact suggestions may change, the net change will be positive Suggestions are purely decorative and should thus not breaking change Fixes clap-rs#4660 Also see rapidfuzz/strsim-rs#53
Where is that? I looked through the README and didn't see anything that a layman about these algorithms, like me, would understand about that. |
@epage here is where it's mentioned: It sounds like the author modified the algorithm in a well-intentioned manner, thinking that their modification was an improvement. It would have been better to add a third similarity and call it In the code you can find the modification, it's commented otherwise I may not have been able to figure it out so quickly: It wouldn't be hard to undo the |
Based on how the function is named I also agree this should be considered a bug. I created a Pyo3 extensions for polars using this library and discovered that it was behaving incorrectly due to the fact that a lot of the strings I'm comparing have a common prefix length >=4. For now I'm going to use eddie, but can switch back to strsim once this issue is fixed. |
I apologize for the bug. I don't quite remember why I didn't limit the prefix, but @corneliusroemer is probably right that it was a (naive) attempt at an improvement. I do remember considering defaulting to a limit of 4 but making it configurable. In any case, Thanks to @maxbachmann for submitting a fix! We'll get a release out with it soon. |
It seems that the algorithm implemented here and called "Jaro-Winkler" is in fact not the usual "Jaro-Winkler" metric due to a design decision or misunderstanding/bug of the original intention of the algorithm.
The original Jaro-Winkler limits the prefix to <=4 - for a good reason: otherwise strings with identical common prefixes longer than 10 go above 1 (if one uses 0.1 for p).
The implementation here uses an arbitrary prefix length and instead hard-clips similarities to 1 when they are above.
Here is wikipedia on this topic: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_similarity
This causes issues like here, where
AlignmentScorr
is supposedly equal in distance toAlignmentScore
andAlignmentStart
- which it obviously isn't, see: clap-rs/clap#4660.I feel that an algorithm should be implemented the way it was intended by the author. It could be called
Jaro-Winkler-Guo
after the author of this crate @dguo but there should definitely be a normalJaro-Winkler
without modification.Unfortunately this crate, despite very heavily used by many, seems to be unmaintained. It would be nice if it could be maintained again, given that it's relied on by many important crates in the rust ecosystem. Maybe @dguo can start a call for a co-maintainer?
The text was updated successfully, but these errors were encountered: