Feat: Implement additional fuzzy string similarities #51
base: main
Conversation
@@ -91,3 +91,131 @@ def levenshtein_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue:
    lenmax = ibis.greatest(s1.length(), s2.length())
    ldist = s1.levenshtein(s2)
    return (lenmax - ldist) / lenmax


def token_set_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue:
    """The ratio of the intersection of the token sets of two strings to the union of the token sets.
This is the same as, or very close to, the already implemented mismo.sets.jaccard(), isn't it? If there are some slight differences, why are they needed? I would like to see if we can unify the APIs somewhat.
At the very least, I want to make it so all these functions operate on set-like (ir.Arrays) or bag-like (ir.Maps) data (see mismo/vector/), and leave the tokenization as a prep step. Inevitably someone is gonna want to tokenize their strings slightly differently than we have here.
But I think advertising those set-based functions could be very worth it, I imagine a lot of people are going to miss them.
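For concreteness, a rough sketch of that unified shape (the tokens() helper here is hypothetical, and I'm assuming mismo.sets.jaccard() accepts two array expressions):

import ibis
import ibis.expr.types as ir
import mismo

def tokens(s: ir.StringValue) -> ir.ArrayValue:
    # hypothetical prep step: lowercase, then split on runs of whitespace
    return s.lower().re_split(r"\s+").unique()

s1 = ibis.literal("123 Main St")
s2 = ibis.literal("main st 123")
sim = mismo.sets.jaccard(tokens(s1), tokens(s2))  # same token set -> 1.0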
> But I think advertising those set-based functions could be very worth it, I imagine a lot of people are going to miss them.
Some time ago I thought about suggesting that we port (some of) the functions here and make them readily available inside mismo, but I have not had the time to think it through carefully since then...
> port (some of) the functions here
Brilliant, yeah, I thought of this too once but it slipped my mind. I worry a little bit that there are so many of them and it's not obvious which to use; again, we don't want to endorse methods that we don't actually stand behind. Having that huge menu works for dedupe because they actually do some clever machine learning to automatically choose a subset from this menu of options.
I think most of them can even be implemented fairly simply with the atomic functions that mismo already has, eg double_metaphone(), etc. Are there some in particular that would be a big lift to do yourself?
I think I am more tempted to tackle this problem as:
- try to keep the core, domain-agnostic utils small, but useful enough to serve as building blocks
- then, build domain-specific blockers and compares that are actually powerful. For example, with street addresses, acronym-based blocking seems useless, but extracting any integers and sorting their digits (eg 1234 and 1324 both get mapped to "1234"), or rounding to a hundreds place (eg on the same street block), seems very practical for dealing with typos in people's street numbers (see the sketch below). On the other hand, an acronym blocker seems quite useful when blocking on company names.
I think this will
- be less overwhelming for users
- be more batteries-included
- be more performant, since users will end up with a curated set of solutions to the common problems
What do you think?
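To make the street-number idea concrete, a rough pure-Python sketch (the function names are illustrative, not an existing mismo API):

import re

def sorted_digits_key(address: str) -> str:
    # "1234 Main St" and "1324 Main St" both map to "1234",
    # so digit transpositions land in the same block
    return "".join(sorted(ch for ch in address if ch.isdigit()))

def hundreds_key(address: str) -> str | None:
    # round the street number down to the hundreds place,
    # eg 1234 and 1256 (plausibly the same street block) both map to "1200"
    m = re.search(r"\d+", address)
    return str(int(m.group()) // 100 * 100) if m else None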
> Are there some in particular that would be a big lift to do yourself?
I am not a native English speaker and I am having some trouble catching the precise meaning of "a big lift to do yourself".
If you are asking whether doing it myself would result in some kind of advantage, I do not believe so. I actually worked on the implementation of the predicate functions in dedupe (with some limited improvements), but that was essentially a war against string performance in CPython. As mismo shifts the heavy load onto the backend, I believe I have nothing special to add.
Instead, if you are asking whether I could work on it in another PR: yes, I could, but these days are quite full for me and then I am leaving for holidays. I could start working on it at the end of August/beginning of September.
I'm in favour of porting some of these functions too. One thing I've found when trying to evaluate the similarity between records is that without a good set of comparison functions available, I tend to go for a simple set of thresholds on a single metric (e.g. levenshtein distance), which is simple to implement but not the most intuitive when it comes to defining a sensible set of levels. I can see how having a set of predicate functions could certainly be useful there.
In terms of my availability, I may have some time over the next few weeks to implement some of these, depending on how other projects and holidays go.
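For example, the thresholding pattern I mean looks roughly like this (a sketch using the levenshtein_ratio from this diff; the cutoff values are arbitrary):

import ibis

ratio = levenshtein_ratio(ibis.literal("smith"), ibis.literal("smyth"))
level = (
    ibis.case()
    .when(ratio >= 0.9, "near-exact")  # bucket the score into discrete levels
    .when(ratio >= 0.7, "close")
    .else_("different")
    .end()
)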
> But I think advertising those set-based functions could be very worth it, I imagine a lot of people are going to miss them.
I'd actually forgotten about the jaccard function when writing this, and I agree that after some pre-processing to extract words and convert to lower-case, these would be equivalent. Another reason I'm going off the idea of this PR!
> Are there some in particular that would be a big lift to do yourself?
Sorry, I should be using simpler English! I meant "would it be hard for you (as in a general user of mismo, not you in particular) to just write these functions themselves, externally to mismo, or is there a large extra benefit to the function being written in actual mismo?"
For example, as we discovered in the PRs and benchmarks, using pypostal to do address parsing is difficult. I think it makes sense for this to be contained in mismo, since it would be easy for a user to implement it incorrectly/poorly if they did it themselves.
On the other hand, someone could easily implement dedupe's sortedAcronym() blocker predicate themselves, without needing this to be built into mismo:
def sorted_acronym(s: ir.StringValue) -> ir.ArrayValue:
    tokens = s.re_split(r"\s+")
    # or perhaps we include some mismo.tokenize(s) util that has various common
    # options, like trimming leading and trailing whitespace, etc
    return tokens.map(lambda token: token[0].upper()).sort()
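(Assuming the whitespace tokenization above, this maps e.g. "puget sound naval shipyard" to ["N", "P", "S", "S"].)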
> I'd actually forgotten about the jaccard function when writing this
This is a signal that the docs need some updates. Where do you think would have been the best place for you?
- in the text section of the reference?
- in a how-to guide on working with text data? (doesn't exist yet, but it probably should)
- in an example?
> pre-processing to extract words and convert to lower-case
I think these are common enough and easy enough to mess up (people probably want s.re_split("\s+"), not s.split(" ")) that it makes sense to include them. I'll try to make a PR for this soon.
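In plain-Python terms, the pitfall is:

import re

"12  Main St".split(" ")         # ['12', '', 'Main', 'St'] -- the double space yields an empty token
re.split(r"\s+", "12  Main St")  # ['12', 'Main', 'St']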
@NickCrews, I think it would be worth including it in some how-to guide. Perhaps a subsection under Comparing, or probably a distinct document, to provide an overview of the various text comparison methods. Given that some are designed to compare single strings (e.g. levenshtein) and others operate on arrays of tokenized text (jaccard, tf-idf, etc.), it might be worth explaining how and when these different methods could be used to define similarity measures between text-based fields.
Back from holidays.
> Would it be hard for you (as in a general user of mismo, not you in particular) to just write these functions themselves, externally to mismo, or is there a large extra benefit to the function being written in actual mismo?
Most of those functions are just one-liners and I expect the average Python user could easily implement them. The less trivial part is knowing that the ideas behind those simple functions can be a good choice for defining blockers. As you noted, this is probably more of a documentation issue than something we should implement in mismo (in the future we may think about a page in the documentation that lists some of the ideas behind those predicate functions, to give mismo users some examples of what they might need for their specific problems).
Keeping mismo domain-agnostic and as light as possible, yet providing powerful domain-specific blockers, seems the way to go.
I really appreciate the work you've done here, it looks like quite a good implementation! But I want to slow down and check that these are actually things we want to support.
def token_sort_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue:
    """The levenshtein ratio of two strings after tokenizing and sorting the tokens.
Sorry, this might sound harsh. I am genuinely very appreciative of all the effort that you have been putting into mismo, and I don't want to discourage you. With that caveat...
I am tempted not to support this: it is so similar to the other methods, and not hard for someone to implement themselves. Can you point to examples in the record linkage literature where this method is used and gives good results? Have you personally found that you get noticeably better results with it? I worry that if we include this, then it becomes implicitly endorsed as a good way to do things, and I'm not sure that's true. I don't really trust the rationale that "it's in fuzzywuzzy, so it must be useful".
Thanks for the feedback @NickCrews, I think I'm leaning towards your opinion that we shouldn't implement these just because they're implemented elsewhere. If people want to use them, then it's simple enough to wrap the existing libraries in a UDF.
From a brief literature review, I've seen a few papers that reference the use of Jaro-Winkler as a similarity measure, but again, is it worth re-inventing the wheel when other libraries already implement this?
I'm happy to park this for the time being and make use of rapidfuzz if I decide to use these other measures.
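For reference, a minimal sketch of that UDF approach (assuming a recent ibis with @ibis.udf.scalar.python; the /100 rescales rapidfuzz's 0-100 score to the 0-1 interval used here):

import ibis
from rapidfuzz import fuzz

@ibis.udf.scalar.python
def token_sort_ratio_udf(s1: str, s2: str) -> float:
    # executes row-by-row in Python, so it forgoes backend vectorization
    return fuzz.token_sort_ratio(s1, s2) / 100

The per-row execution is the trade-off, which ties back to the benchmarking discussion in this PR.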
Very much still a work-in-progress, but I think it would be useful to implement some additional string similarity measures that could be more useful than the levenshtein ratio in specific use cases. For example, the token_set_ratio implemented in fuzzywuzzy can be more useful when comparing addresses that may contain duplicate tokens or where the order is not important. This PR contains implementations of the main comparison functions from fuzzywuzzy. These are currently failing as they do not produce the expected values.
In addition, I've created a benchmark test suite similar to the one @lmores created for address parsing. This aims to benchmark the performance of the levenshtein ratio implementation against the one in rapidfuzz, a library that provides both scalar and vectorized methods. Initial tests are fairly promising in terms of runtime performance, but this makes me wonder if the cache() is unfairly favouring the native-ibis implementation.
Since I'm not convinced these implementations are giving the correct values (i.e. those returned by rapidfuzz, once scaled to the correct interval), I've added some property-based tests to check that the values returned by the implementations here are the same as the rapidfuzz ones. This should strictly be a comparison with a pre-defined tolerance, but I'd first like to convince myself it's worth re-implementing these over wrapping the rapidfuzz functions in a UDF.
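A minimal sketch of what such a property-based test can look like (assuming hypothesis is available; the import path for levenshtein_ratio is illustrative; rapidfuzz's Levenshtein.normalized_similarity matches the (lenmax - ldist) / lenmax definition above):

import ibis
from hypothesis import given, strategies as st
from rapidfuzz.distance import Levenshtein

from mismo.text import levenshtein_ratio  # illustrative import path

@given(st.text(min_size=1), st.text(min_size=1))
def test_levenshtein_ratio_matches_rapidfuzz(a: str, b: str):
    # rapidfuzz computes 1 - dist / max(len(a), len(b)), the same formula
    expected = Levenshtein.normalized_similarity(a, b)
    result = levenshtein_ratio(ibis.literal(a), ibis.literal(b)).execute()
    assert abs(result - expected) < 1e-9  # should really be a pre-defined tolerance

(min_size=1 sidesteps the division by zero when both strings are empty.)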