Feat: Implement additional fuzzy string similarities #51
base: main
Conversation
@@ -91,3 +91,131 @@ def levenshtein_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue:
    lenmax = ibis.greatest(s1.length(), s2.length())
    ldist = s1.levenshtein(s2)
    return (lenmax - ldist) / lenmax


def token_set_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue:
    """The ratio of the intersection of the token sets of two strings to the union of the token sets.
This is the same as, or very close to, the already implemented mismo.sets.jaccard(), isn't it? If there are some slight differences, why are they needed? I would like to see if we can unify the APIs somewhat.
At the very least, I want to make it so all these functions operate on set-like (ir.Arrays) or bag-like (ir.Maps) data (see mismo/vector/), and leave the tokenization as a prep step. Inevitably someone is gonna want to tokenize their strings slightly differently than we have here.
But I think advertising those set-based functions could be very worth it, I imagine a lot of people are going to miss them.
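For concreteness, a rough sketch of that unified shape (the tokens() helper here is hypothetical, and I'm assuming mismo.sets.jaccard() accepts two array expressions):

import ibis
import ibis.expr.types as ir
import mismo

def tokens(s: ir.StringValue) -> ir.ArrayValue:
    # hypothetical prep step: lowercase, then split on runs of whitespace
    return s.lower().re_split(r"\s+").unique()

s1 = ibis.literal("123 Main St")
s2 = ibis.literal("main st 123")
sim = mismo.sets.jaccard(tokens(s1), tokens(s2))  # same token set -> 1.0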
> But I think advertising those set-based functions could be very worth it, I imagine a lot of people are going to miss them.
Some time ago I thought about suggesting that we port (some of) the functions here and make them readily available inside mismo, but I have not had the time to think it through carefully since then...
> port (some of) the functions here
Brilliant, yeah, I thought of this too once but it slipped my mind. I worry a little bit that there are so many of them and it's not obvious which to use; again, we don't want to endorse methods that we don't actually stand behind. Having that huge menu works for dedupe because they actually do some clever machine learning to automatically choose a subset from this menu of options.
I think most of them can even be implemented fairly simply with the atomic functions that mismo already has, eg double_metaphone(), etc. Are there some in particular that would be a big lift to do yourself?
I think I am more tempted to tackle this problem as:
- try to keep the core, domain-agnostic utils small, but useful enough to serve as building blocks
- then, build domain-specific blockers and compares that are actually powerful. For example, with street addresses, acronym-based blocking seems useless, but extracting any integers and sorting their digits (eg 1234 and 1324 both get mapped to "1234"), or rounding to a hundreds place (eg on the same street block), seems very practical for dealing with typos in people's street numbers (see the sketch below). On the other hand, an acronym blocker seems quite useful when blocking on company names.
I think this will
- be less overwhelming for users
- be more batteries-included
- be more performant, since users will end up with a curated set of solutions to the common problems
What do you think?
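To make the street-number idea concrete, a rough pure-Python sketch (the function names are illustrative, not an existing mismo API):

import re

def sorted_digits_key(address: str) -> str:
    # "1234 Main St" and "1324 Main St" both map to "1234",
    # so digit transpositions land in the same block
    return "".join(sorted(ch for ch in address if ch.isdigit()))

def hundreds_key(address: str) -> str | None:
    # round the street number down to the hundreds place,
    # eg 1234 and 1256 (plausibly the same street block) both map to "1200"
    m = re.search(r"\d+", address)
    return str(int(m.group()) // 100 * 100) if m else None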
> Are there some in particular that would be a big lift to do yourself?
I am not a native English speaker and I am having some trouble catching the precise meaning of "a big lift to do yourself".
If you are asking whether doing it myself would result in some kind of advantage, I do not believe so. I actually worked on the implementation of the predicate functions in dedupe (with some limited improvements), but that was essentially a war against string performance in CPython. As mismo shifts the heavy load onto the backend, I believe I have nothing special to add.
Instead, if you are asking whether I could work on it in another PR: yes, I could, but these days are quite full for me and then I am leaving for holidays. I could start working on it at the end of August/beginning of September.
I'm in favour of porting some of these functions too. One thing I've found when trying to evaluate the similarity between records is that without a good set of comparison functions available, I tend to go for a simple set of thresholds on a single metric (e.g. levenshtein distance), which is simple to implement but not the most intuitive when it comes to defining a sensible set of levels. I can see how having a set of predicate functions could certainly be useful there.
In terms of my availability, I may have some time over the next few weeks to implement some of these, depending on how other projects and holidays go.
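For example, the thresholding pattern I mean looks roughly like this (a sketch using the levenshtein_ratio from this diff; the cutoff values are arbitrary):

import ibis

ratio = levenshtein_ratio(ibis.literal("smith"), ibis.literal("smyth"))
level = (
    ibis.case()
    .when(ratio >= 0.9, "near-exact")  # bucket the score into discrete levels
    .when(ratio >= 0.7, "close")
    .else_("different")
    .end()
)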
> But I think advertising those set-based functions could be very worth it, I imagine a lot of people are going to miss them.
I'd actually forgotten about the jaccard function when writing this, and I agree that after some pre-processing to extract words and convert to lower-case, these would be equivalent. Another reason I'm going off the idea of this PR!
> Are there some in particular that would be a big lift to do yourself?
Sorry, I should be using simpler English! I meant "would it be hard for you (as in a general user of mismo, not you in particular) to just write these functions themselves, externally to mismo, or is there a large extra benefit to the function being written in actual mismo?"
For example, as we discovered in the PRs and benchmarks, using pypostal to do address parsing is difficult. I think it makes sense for this to be contained in mismo, since it would be easy for a user to implement it incorrectly/poorly if they did it themselves.
On the other hand, someone could easily implement dedupe's sortedAcronym() blocker predicate themselves, without needing this to be built into mismo:
def sorted_acronym(s: ir.StringValue) -> ir.ArrayValue:
    tokens = s.re_split(r"\s+")
    # or perhaps we include some mismo.tokenize(s) util that has various common
    # options, like trimming leading and trailing whitespace, etc
    return tokens.map(lambda token: token[0].upper()).sort()
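(Assuming the whitespace tokenization above, this maps e.g. "puget sound naval shipyard" to ["N", "P", "S", "S"].)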
> I'd actually forgotten about the jaccard function when writing this
This is a signal that the docs need some updates. Where do you think would have been the best place for you?
- in the text section of the reference?
- in a how-to guide on working with text data? (doesn't exist yet, but it probably should)
- in an example?
> pre-processing to extract words and convert to lower-case
I think these are common enough and easy enough to mess up (people probably want s.re_split("\s+"), not s.split(" ")) that it makes sense to include them. I'll try to make a PR for this soon.
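In plain-Python terms, the pitfall is:

import re

"12  Main St".split(" ")         # ['12', '', 'Main', 'St'] -- the double space yields an empty token
re.split(r"\s+", "12  Main St")  # ['12', 'Main', 'St']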
@NickCrews, I think it would be worth including it in some how-to guide. Perhaps a subsection under Comparing, or probably a distinct document, to provide an overview of the various text comparison methods. Given that some are designed to compare single strings (e.g. levenshtein) and others operate on arrays of tokenized text (jaccard, tf-idf, etc.), it might be worth explaining how and when these different methods could be used to define similarity measures between text-based fields.
Back from holidays.
> Would it be hard for you (as in a general user of mismo, not you in particular) to just write these functions themselves, externally to mismo, or is there a large extra benefit to the function being written in actual mismo?
Most of those functions are just one-liners and I expect the average Python user could easily implement them. The less trivial part is knowing that the ideas behind those simple functions can be a good choice for defining blockers. As you noted, this is probably more of a documentation issue than something we should implement in mismo (in the future we may think about a page in the documentation that lists some of the ideas behind those predicate functions, to give mismo users some examples of what they might need for their specific problems).
Keeping mismo domain-agnostic and as light as possible, yet providing powerful domain-specific blockers, seems the way to go.
I really appreciate the work you've done here, it looks like quite a good implementation! But I want to slow down and check that these are actually things we want to support.
def token_sort_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue:
    """The levenshtein ratio of two strings after tokenizing and sorting the tokens.
Sorry, this might sound harsh. I am genuinely very appreciative of all the effort that you have been putting into mismo, and I don't want to discourage you. With that caveat...
I am tempted not to support this: it is so similar to the other methods, and not hard for someone to implement themselves. Can you point to examples in the record linkage literature where this method is used and gives good results? Have you personally found that you get noticeably better results with it? I worry that if we include this, then it becomes implicitly endorsed as a good way to do things, and I'm not sure that's true. I don't really trust the rationale that "it's in fuzzywuzzy, so it must be useful".
Thanks for the feedback @NickCrews, I think I'm leaning towards your opinion that we shouldn't implement these just because they're implemented elsewhere. If people want to use them, then it's simple enough to wrap the existing libraries in a UDF.
From a brief literature review, I've seen a few papers that reference the use of Jaro-Winkler as a similarity measure, but again, is it worth re-inventing the wheel when other libraries already implement this?
I'm happy to park this for the time being and make use of rapidfuzz if I decide to use these other measures.
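For reference, a minimal sketch of that UDF approach (assuming a recent ibis with @ibis.udf.scalar.python; the /100 rescales rapidfuzz's 0-100 score to the 0-1 interval used here):

import ibis
from rapidfuzz import fuzz

@ibis.udf.scalar.python
def token_sort_ratio_udf(s1: str, s2: str) -> float:
    # executes row-by-row in Python, so it forgoes backend vectorization
    return fuzz.token_sort_ratio(s1, s2) / 100

The per-row execution is the trade-off, which ties back to the benchmarking discussion in this PR.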
Very much still a work-in-progress, but I think it would be useful to implement some additional string similarity measures that could be more useful than the levenshtein ratio in specific use cases. For example, the token_set_ratio implemented in fuzzywuzzy can be more useful when comparing addresses that may contain duplicate tokens or where the order is not important. This PR contains implementations of the main comparison functions from fuzzywuzzy. These are currently failing as they do not produce the expected values.
In addition, I've created a benchmark test suite similar to the one @lmores created for address parsing. This aims to benchmark the performance of the levenshtein ratio implementation against the one in rapidfuzz, a library that provides both scalar and vectorized methods. Initial tests are fairly promising in terms of runtime performance, but this makes me wonder if the cache() is unfairly favouring the native-ibis implementation.
Since I'm not convinced these implementations are giving the correct values (i.e. those returned by rapidfuzz, once scaled to the correct interval), I've added some property-based tests to check that the values returned by the implementations here are the same as the rapidfuzz ones. This should strictly be a comparison with a pre-defined tolerance, but I'd first like to convince myself it's worth re-implementing these over wrapping the rapidfuzz functions in a UDF.
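A minimal sketch of what such a property-based test can look like (assuming hypothesis is available; the import path for levenshtein_ratio is illustrative; rapidfuzz's Levenshtein.normalized_similarity matches the (lenmax - ldist) / lenmax definition above):

import ibis
from hypothesis import given, strategies as st
from rapidfuzz.distance import Levenshtein

from mismo.text import levenshtein_ratio  # illustrative import path

@given(st.text(min_size=1), st.text(min_size=1))
def test_levenshtein_ratio_matches_rapidfuzz(a: str, b: str):
    # rapidfuzz computes 1 - dist / max(len(a), len(b)), the same formula
    expected = Levenshtein.normalized_similarity(a, b)
    result = levenshtein_ratio(ibis.literal(a), ibis.literal(b)).execute()
    assert abs(result - expected) < 1e-9  # should really be a pre-defined tolerance

(min_size=1 sidesteps the division by zero when both strings are empty.)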