This reimplements the Damerau-Levenshtein distance based on the paper
"Linear space string correction algorithm using the Damerau-Levenshtein distance"
by Chunchun Zhao and Sartaj Sahni.

Currently this improves the following things:
There are a couple of open decisions:
hashmap
Right now this uses a normal HashMap internally. In Rapidfuzz I use a custom hashmap implementation instead. It is based on the hashmap implementation used inside CPython, which uses open addressing. However, it removes a couple of features:
This has a couple of advantages / disadvantages:
+ significantly faster, especially for ASCII. It is not really clear how the two implementations compare if you try to create more hash collisions.
+ little code: in a small example, switching the hashmap in Damerau-Levenshtein made the binary around 40% smaller
- more code to maintain
- has a custom hash function which assumes that there can't be hash collisions. This works fine for basic types like integers or chars, but the assumption would break if a user wants to compare e.g. lists of strings.
- the hashmap could be smaller if nothing is inlined and it is used in multiple places
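To make the trade-off above easier to discuss, here is a minimal sketch of an open-addressing map with CPython-style probing (`i = 5*i + perturb + 1`, with `perturb` shifted right by 5 each step). It is not the Rapidfuzz code and not the code in this PR. `ProbeMap`, `Slot` and `EMPTY` are hypothetical names, and the choices to use the key itself as its hash, to mark free slots with a sentinel value, and to leave out deletion are assumptions made for the illustration.

```rust
/// Sentinel marking a free slot. Assumption: callers never store this value.
const EMPTY: isize = -1;

#[derive(Clone, Copy)]
struct Slot {
    key: u32,
    value: isize,
}

/// Hypothetical open-addressing map from integer-like keys (e.g. `char as u32`)
/// to row indices. There is no deletion, so no tombstones are needed.
struct ProbeMap {
    slots: Vec<Slot>, // length is always a power of two
    used: usize,
}

impl ProbeMap {
    fn new() -> Self {
        ProbeMap { slots: vec![Slot { key: 0, value: EMPTY }; 8], used: 0 }
    }

    /// Returns the index of the slot holding `key`, or of the free slot
    /// where `key` would be inserted.
    fn find(slots: &[Slot], key: u32) -> usize {
        let mask = slots.len() - 1;
        // The key is its own hash. This is only sound for integer-like keys.
        let mut perturb = key as usize;
        let mut i = perturb & mask;
        loop {
            let slot = slots[i];
            if slot.value == EMPTY || slot.key == key {
                return i;
            }
            // Once `perturb` reaches 0 this recurrence visits every slot,
            // so the loop terminates as long as the table is never full.
            i = i.wrapping_mul(5).wrapping_add(perturb).wrapping_add(1) & mask;
            perturb >>= 5;
        }
    }

    /// Returns the stored value, or `EMPTY` if the key is absent.
    fn get(&self, key: u32) -> isize {
        self.slots[Self::find(&self.slots, key)].value
    }

    fn insert(&mut self, key: u32, value: isize) {
        assert!(value != EMPTY, "EMPTY is reserved as the free-slot marker");
        let mut i = Self::find(&self.slots, key);
        if self.slots[i].value == EMPTY {
            // Keep the load factor below 2/3 so probing always finds a free slot.
            if (self.used + 1) * 3 >= self.slots.len() * 2 {
                self.grow();
                i = Self::find(&self.slots, key);
            }
            self.used += 1;
        }
        self.slots[i] = Slot { key, value };
    }

    /// Doubles the table and re-inserts every occupied slot.
    fn grow(&mut self) {
        let mut bigger = vec![Slot { key: 0, value: EMPTY }; self.slots.len() * 2];
        for slot in self.slots.iter().filter(|s| s.value != EMPTY) {
            let i = Self::find(&bigger, slot.key);
            bigger[i] = *slot;
        }
        self.slots = bigger;
    }
}
```

Because the key doubles as the hash, a type like a list of strings has no obvious mapping into `u32` here, which is the limitation from the list above.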
interface
We should probably think about how to make the interface for generic functions as consistent as possible. Right now a lot of them have different interfaces.
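As a starting point for that discussion, here is one possible convention, purely as a sketch: every generic function takes two `IntoIterator` arguments whose items can be compared with `==`. The function names below are hypothetical and are not existing API. They only show that the same signature shape works for strings, vectors and slices.

```rust
/// Hypothetical example: Hamming distance over any two item sequences.
/// Returns `None` if the sequences have different lengths.
pub fn generic_hamming_sketch<A, B>(a: A, b: B) -> Option<usize>
where
    A: IntoIterator,
    B: IntoIterator,
    A::Item: PartialEq<B::Item>,
{
    let (mut a, mut b) = (a.into_iter(), b.into_iter());
    let mut dist = 0;
    loop {
        match (a.next(), b.next()) {
            (Some(x), Some(y)) => {
                if x != y {
                    dist += 1;
                }
            }
            (None, None) => return Some(dist),
            _ => return None, // one sequence ended before the other
        }
    }
}

/// Hypothetical example with the same signature shape:
/// length of the common prefix of two item sequences.
pub fn generic_common_prefix_sketch<A, B>(a: A, b: B) -> usize
where
    A: IntoIterator,
    B: IntoIterator,
    A::Item: PartialEq<B::Item>,
{
    a.into_iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

Functions that need a hashmap internally would additionally need a hashing bound on the items, which ties this question to the hashmap decision above.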