BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53

corneliusroemer · 2023-01-23T17:31:18Z

It seems that the algorithm implemented here and called "Jaro-Winkler" is in fact not the usual "Jaro-Winkler" metric due to a design decision or misunderstanding/bug of the original intention of the algorithm.

The original Jaro-Winkler limits the prefix to <=4 - for a good reason: otherwise strings with identical common prefixes longer than 10 go above 1 (if one uses 0.1 for p).

The implementation here uses an arbitrary prefix length and instead hard-clips similarities to 1 when they are above.

Here is wikipedia on this topic: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_similarity

This causes issues like here, where AlignmentScorr is supposedly equal in distance to AlignmentScore and AlignmentStart - which it obviously isn't, see: clap-rs/clap#4660.

I feel that an algorithm should be implemented the way it was intended by the author. It could be called Jaro-Winkler-Guo after the author of this crate @dguo but there should definitely be a normal Jaro-Winkler without modification.

Unfortunately this crate, despite very heavily used by many, seems to be unmaintained. It would be nice if it could be maintained again, given that it's relied on by many important crates in the rust ecosystem. Maybe @dguo can start a call for a co-maintainer?

The text was updated successfully, but these errors were encountered:

corneliusroemer · 2023-01-23T21:35:17Z

This is technically "works as intended" as it's mentioned in the README. However, it is highly misleading as it is not what Jaro-Winkler proposed and has strongly negative side-effect of giving all strings with common prefix >=10 a perfect similarity.

It's also not documented: https://docs.rs/strsim/0.10.0/strsim/fn.jaro_winkler.html

Implementation of Jaro-Winkler similarity in the dguo/strsim-rs crate is wrong, causing strings with common prefix >=10 to all be considered perfect matches Using Jaro instead from the same crate fixes this issue Benefit of favoring long prefixes exists for matching common names But not for typo detection Hence use of Jaro instead of Jaro-Winkler is acceptable Confidence threshold adjusted so that `bar` is still suggested for `baz` since Jaro is strictly < Jaro-Winkler such an adjustment is expected. This is acceptable. While exact suggestions may change, the net change will be positive Suggestions are purely decorative and should thus not breaking change Fixes clap-rs#4660 Also see rapidfuzz/strsim-rs#53

epage · 2023-01-23T21:52:35Z

This is technically "works as intended" as it's mentioned in the README.

Where is that? I looked through the README and didn't see anything that a layman about these algorithms, like me, would understand about that.

corneliusroemer · 2023-01-23T22:48:26Z

@epage here is where it's mentioned:

https://github.com/dguo/strsim-rs/blame/65eac453cbd10ba4e13273002c843e95c81ae93f/README.md?plain=1#L13

It sounds like the author modified the algorithm in a well-intentioned manner, thinking that their modification was an improvement. It would have been better to add a third similarity and call it Jaro-Winkler-Guo ;)

In the code you can find the modification, it's commented otherwise I may not have been able to figure it out so quickly:
https://github.com/dguo/strsim-rs/blob/65eac453cbd10ba4e13273002c843e95c81ae93f/src/lib.rs#L164-L166

It wouldn't be hard to undo the -Guo modification, but given there hasn't been any activity on the repo, it may require a fork. At which point I wonder if it's worth starting fresh.

wtcross · 2023-11-16T13:40:29Z

Based on how the function is named I also agree this should be considered a bug. I created a Pyo3 extensions for polars using this library and discovered that it was behaving incorrectly due to the fact that a lot of the strings I'm comparing have a common prefix length >=4.

For now I'm going to use eddie, but can switch back to strsim once this issue is fixed.

dguo · 2024-01-04T04:19:25Z

I apologize for the bug. I don't quite remember why I didn't limit the prefix, but @corneliusroemer is probably right that it was a (naive) attempt at an improvement. I do remember considering defaulting to a limit of 4 but making it configurable. In any case, Jaro-Winkler-Guo has no reason to exist. 😝

Thanks to @maxbachmann for submitting a fix! We'll get a release out with it soon.

corneliusroemer mentioned this issue Jan 23, 2023

Wrong Did you mean suggestion: for alignmentScorr clap suggests alignmentStart rather than alignmentScore clap-rs/clap#4660

Closed

2 tasks

corneliusroemer mentioned this issue Jan 23, 2023

fix(suggestions): Replace wrong Jaro-Winkler clap-rs/clap#4668

Merged

maxbachmann mentioned this issue Dec 31, 2023

limit common prefix in jaro-winkler #58

Merged

maxbachmann added the bug label Jan 1, 2024

maxbachmann closed this as completed Jan 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53

corneliusroemer commented Jan 23, 2023

corneliusroemer commented Jan 23, 2023

epage commented Jan 23, 2023

corneliusroemer commented Jan 23, 2023

wtcross commented Nov 16, 2023 •

edited

Loading

dguo commented Jan 4, 2024

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4 #53

Comments

corneliusroemer commented Jan 23, 2023

corneliusroemer commented Jan 23, 2023

epage commented Jan 23, 2023

corneliusroemer commented Jan 23, 2023

wtcross commented Nov 16, 2023 • edited Loading

dguo commented Jan 4, 2024

wtcross commented Nov 16, 2023 •

edited

Loading