Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53

Closed
corneliusroemer opened this issue Jan 23, 2023 · 5 comments
Labels

Comments

@corneliusroemer
Copy link

It seems that the algorithm implemented here and called "Jaro-Winkler" is in fact not the usual "Jaro-Winkler" metric due to a design decision or misunderstanding/bug of the original intention of the algorithm.

The original Jaro-Winkler limits the prefix to <=4 - for a good reason: otherwise strings with identical common prefixes longer than 10 go above 1 (if one uses 0.1 for p).

The implementation here uses an arbitrary prefix length and instead hard-clips similarities to 1 when they are above.

Here is wikipedia on this topic: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_similarity

This causes issues like here, where AlignmentScorr is supposedly equal in distance to AlignmentScore and AlignmentStart - which it obviously isn't, see: clap-rs/clap#4660.

I feel that an algorithm should be implemented the way it was intended by the author. It could be called Jaro-Winkler-Guo after the author of this crate @dguo but there should definitely be a normal Jaro-Winkler without modification.

Unfortunately this crate, despite very heavily used by many, seems to be unmaintained. It would be nice if it could be maintained again, given that it's relied on by many important crates in the rust ecosystem. Maybe @dguo can start a call for a co-maintainer?

@corneliusroemer
Copy link
Author

This is technically "works as intended" as it's mentioned in the README. However, it is highly misleading as it is not what Jaro-Winkler proposed and has strongly negative side-effect of giving all strings with common prefix >=10 a perfect similarity.

It's also not documented: https://docs.rs/strsim/0.10.0/strsim/fn.jaro_winkler.html

corneliusroemer added a commit to corneliusroemer/clap that referenced this issue Jan 23, 2023
Implementation of Jaro-Winkler similarity in the dguo/strsim-rs crate
is wrong, causing strings with common prefix >=10
to all be considered perfect matches

Using Jaro instead from the same crate fixes this issue
Benefit of favoring long prefixes exists for matching common names
But not for typo detection
Hence use of Jaro instead of Jaro-Winkler is acceptable

Confidence threshold adjusted so that `bar` is still suggested for `baz`
since Jaro is strictly < Jaro-Winkler
such an adjustment is expected. This is acceptable.
While exact suggestions may change, the net change will be positive
Suggestions are purely decorative and should thus not breaking change

Fixes clap-rs#4660
Also see rapidfuzz/strsim-rs#53
corneliusroemer added a commit to corneliusroemer/clap that referenced this issue Jan 23, 2023
Implementation of Jaro-Winkler similarity in the dguo/strsim-rs crate
is wrong, causing strings with common prefix >=10
to all be considered perfect matches

Using Jaro instead from the same crate fixes this issue
Benefit of favoring long prefixes exists for matching common names
But not for typo detection
Hence use of Jaro instead of Jaro-Winkler is acceptable

Confidence threshold adjusted so that `bar` is still suggested for `baz`
since Jaro is strictly < Jaro-Winkler
such an adjustment is expected. This is acceptable.
While exact suggestions may change, the net change will be positive
Suggestions are purely decorative and should thus not breaking change

Fixes clap-rs#4660
Also see rapidfuzz/strsim-rs#53
@epage
Copy link

epage commented Jan 23, 2023

This is technically "works as intended" as it's mentioned in the README.

Where is that? I looked through the README and didn't see anything that a layman about these algorithms, like me, would understand about that.

@corneliusroemer
Copy link
Author

@epage here is where it's mentioned:

https://github.com/dguo/strsim-rs/blame/65eac453cbd10ba4e13273002c843e95c81ae93f/README.md?plain=1#L13

It sounds like the author modified the algorithm in a well-intentioned manner, thinking that their modification was an improvement. It would have been better to add a third similarity and call it Jaro-Winkler-Guo ;)

In the code you can find the modification, it's commented otherwise I may not have been able to figure it out so quickly:
https://github.com/dguo/strsim-rs/blob/65eac453cbd10ba4e13273002c843e95c81ae93f/src/lib.rs#L164-L166

It wouldn't be hard to undo the -Guo modification, but given there hasn't been any activity on the repo, it may require a fork. At which point I wonder if it's worth starting fresh.

@wtcross
Copy link

wtcross commented Nov 16, 2023

Based on how the function is named I also agree this should be considered a bug. I created a Pyo3 extensions for polars using this library and discovered that it was behaving incorrectly due to the fact that a lot of the strings I'm comparing have a common prefix length >=4.

For now I'm going to use eddie, but can switch back to strsim once this issue is fixed.

@dguo
Copy link
Member

dguo commented Jan 4, 2024

I apologize for the bug. I don't quite remember why I didn't limit the prefix, but @corneliusroemer is probably right that it was a (naive) attempt at an improvement. I do remember considering defaulting to a limit of 4 but making it configurable. In any case, Jaro-Winkler-Guo has no reason to exist. 😝

Thanks to @maxbachmann for submitting a fix! We'll get a release out with it soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants