min_similarity & max_distance does not work in levsim #2541
@BitMindLab please fill in the issue template fully: what you're trying, what you're expecting, what you're seeing instead.
The doc-comment clearly does not match the code behavior, which gains no efficiency benefit from the specification of `max_distance`. Either the code could be changed to match the comment's intent, or the comment could be updated to not make false claims about what the code does.
Whether the code does something more efficient or not is up to the code. I proposed (much) faster early-out variants in the past, but IIRC @Witiko didn't think it was worth it. The only problem I see is that the documentation is unclear about whether supplying `max_distance` actually helps. So if the user knows their maximum distance threshold, they should always supply it: with the current implementation, supplying the parameter has no real effect, but that may change in the future. @Witiko @gojomo @BitMindLab @mpenkov can you clarify the docstring wording? Or even better, implement the early-out optimization more intelligently (no need to compute the full Levenshtein distance once it's known to be above the `max_distance` threshold)?
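The early-out idea can be illustrated with a small sketch (illustrative code, not Gensim's implementation; the function name and the "return `max_distance + 1` when over the bound" convention are my own). Because each cell of a dynamic-programming row is derived from neighboring cells by adding 0 or 1, row minima never decrease, so once every cell of the current row exceeds the bound we can stop:

```python
def bounded_levenshtein(s, t, max_distance):
    """Levenshtein distance of s and t, or max_distance + 1 if the
    true distance exceeds max_distance (early-out)."""
    if abs(len(s) - len(t)) > max_distance:
        return max_distance + 1  # the length gap alone exceeds the bound
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (cs != ct)))     # substitution
        if min(curr) > max_distance:
            # Row minima are non-decreasing, so the final distance
            # cannot come back under the bound: stop early.
            return max_distance + 1
        prev = curr
    return prev[-1]
```

The gain grows with string length: for long, dissimilar strings most of the O(|s|·|t|) table is never computed.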
@piskvorky As discussed in #2016 (comment), replacing the Levenshtein distance computation with a stub gave only about a 2× overall speed-up in my measurements, so an early-out optimization cannot save more than that factor on those inputs.
That's the one, thanks for digging it up :) I assume the "2×" also depends on the particular inputs, no? I.e. the longer the strings, the more costly it is to compute the exact Levenshtein distance (especially compared to an early-out with a low `max_distance`). Replacing naive pair-wise computations with a better algo sounds good too. Which algo do you have in mind? Some trie / automaton? How would that work?
The average time complexity of computing the edit distance for all pairs using a pairwise algorithm is O(M²N²), where M is the dictionary size and N is the average word length. In fact, a slightly tighter bound exists with respect to N, since a subquadratic algorithm was recently discovered, see here. However, since N is roughly constant for a language and M ≫ N, the actual average time complexity is O(M²), i.e. there is virtually no difference between computing the pointwise edit distance and calling a stub, as seen in the measurements. I am currently unaware of any algorithm for computing the edit distance in bulk, but I am looking. We can gain a little speed-up by optimizing the python-Levenshtein library, but the improvements will be insignificant.
The improvements of early-out over computing the full Levenshtein distance are substantial. I've seen 10x+ gains (again, depending on string length: the longer the two inputs, the bigger the gain; two super large inputs are enough to halt your computation, so you cannot average "word length N" over squares like that). Gensim is not an academic library; performance on real inputs and constant factors matter a lot. If the "normal" words you tried on are all equally small, that's great, but is it true for all supported use-cases / languages?
But that's not the API I see in the code here. Finding the small set of all words from a fixed static lexicon that are within a given distance of a dynamic query word is a well-studied problem. Tries and automata are the typical solution. It can be done extremely fast. I've done this multiple times before myself (for spelling correction, incl. with a custom edit distance to give different edit weights to typos, character accent marks "ř" vs "r", keyboard layouts "y" vs "z" etc). It should be fun to optimize this Gensim module too. But probably not any time soon: no spare time now, no itch to self-scratch.
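The trie technique can be sketched as follows (an illustrative pure-Python sketch of the well-known algorithm, not code from Gensim or from my past projects): walk the trie depth-first, carrying one row of the Levenshtein DP table per node, and prune any subtree whose row minimum already exceeds the allowed distance. Because all lexicon words sharing a prefix share that prefix's DP rows, work is amortized across the whole lexicon.

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that terminate a lexicon word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, query, max_dist):
    """All (word, distance) pairs within Levenshtein distance max_dist of query."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from the empty prefix
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_dist)
    return results

def _walk(node, ch, query, prev_row, results, max_dist):
    # Extend the DP table by one row for the trie edge labeled `ch`.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(row[j - 1] + 1,                           # insertion
                       prev_row[j] + 1,                          # deletion
                       prev_row[j - 1] + (query[j - 1] != ch)))  # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:  # otherwise no extension can get back under the bound
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, results, max_dist)
```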
In the benchmark, we used the dictionary of the Google News dataset (English).
This is true.
Here are some references:
Regrettably, I will have little time to work on this during July and August.
@piskvorky Two years later and I just came across a nice-and-simple solution, which should reduce the complexity of our kNN algorithm over the Levenshtein distance from O(M²) to O(M · log M), where M is the number of indexed words. The solution would use the ball tree from scikit-learn, a structure for indexing metric spaces that reduces the number of comparisons from M to log M when the tree is balanced, compared to our current naïve brute-force approach. There exist better algorithms specifically for kNN over the Levenshtein distance, some of which I list in #2541 (comment). However, their implementations in Python are elusive. The ball tree is implemented in a Gensim dependency and its use is straightforward. We may wish to pursue further speed-up using approximate kNN in the future, but this seems like a major improvement over the current situation, where the Levenshtein distance is virtually useless with any larger-than-toy dictionary.
Sure, if that works. I vaguely remember trying ball-tree and it didn't work well, but that was in a different context. I'd have to "load up" this thread into my memory to remember what it's about, but I'm sure you already did that :-) Let me know when you get some numbers!
After playing with it for a bit, I was disappointed to discover that sklearn's BallTree assumes Euclidean space under the hood and uses centroids to represent intermediate nodes, which rules out non-vector metric spaces such as strings over the Levenshtein distance. I found a number of metric tree implementations in Python. However, I only managed to double the speed, which is arguably insufficient to make kNN search over the Levenshtein distance practically useful. I shared my measurements in #3146.
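For context, a vantage-point tree is one of the metric-tree structures that does work for strings, since it only needs the distance function and the triangle inequality; a minimal illustrative sketch (not the implementation benchmarked above, and with a naive non-random vantage-point choice):

```python
def levenshtein(s, t):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

class VPTree:
    """Vantage-point tree: split points by distance to a vantage point;
    the triangle inequality lets range queries skip whole subtrees."""
    def __init__(self, points, dist=levenshtein):
        self.dist = dist
        self.vantage = points[0]  # naive choice; usually picked at random
        rest = points[1:]
        self.mu = self.inner = self.outer = None
        if rest:
            dists = [dist(self.vantage, p) for p in rest]
            self.mu = sorted(dists)[len(dists) // 2]  # median radius
            inner = [p for p, d in zip(rest, dists) if d < self.mu]
            outer = [p for p, d in zip(rest, dists) if d >= self.mu]
            self.inner = VPTree(inner, dist) if inner else None
            self.outer = VPTree(outer, dist) if outer else None

    def range_query(self, q, r, out=None):
        """All (point, distance) pairs with dist(q, point) <= r."""
        if out is None:
            out = []
        d = self.dist(q, self.vantage)
        if d <= r:
            out.append((self.vantage, d))
        if self.mu is not None:
            # Inner ball can contain a hit only if d - r < mu,
            # outer shell only if d + r >= mu (triangle inequality).
            if self.inner is not None and d - r < self.mu:
                self.inner.range_query(q, r, out)
            if self.outer is not None and d + r >= self.mu:
                self.outer.range_query(q, r, out)
        return out
```

The modest observed speed-ups make sense: with small integer distances, many queries satisfy both pruning conditions and descend into both subtrees.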
Pity. I don't remember the particulars of the problem you're solving here, but searching for "similar strings to a given string", where similar = "Damerau-Levenshtein within a given distance among a known, static set of strings", can be done extremely fast using tries and DAWG automata. I see that as a more viable approach than fancy clustering over metric spaces. Why not do that, can you remind me?
I was curious how much speed we could gain with metric indexing, which is a nice general technique. I also tried the DAWG automata as implemented by lexpy, with the following results: we get a 2x speed-up when finding strings within edit distance 2 (i.e. roughly the same as the VP-tree) and up to a 10x speed-up when finding strings within edit distance 1. We would need a ca 200x speed-up to be in the same ballpark as vector-space kNN search over word embeddings.
Regarding "10x speed-up compared to brute-forcing each comparison": I'd expect the speed-up to depend on the number of answers, i.e. the words in the lexicon that are actually within distance N. Any fixed number is suspicious; we're probably talking about different algos here. @Witiko do you have a standardized benchmark that you use for this? With some realistic size, filled with realistic words. If the problem is easy to eval, I'll give it a try.
OK I'll check #3146, to see how the overall problem is formulated, what is the goal, what's being measured. |
I am getting more favorable results for DAWG compared to VP-tree with the bigger text8 corpus (100 MB):
This is arguably quite nice. I added the implementation to #3146; we can merge when ready.
See my comment there – I got several times faster results by using an automaton, in pure Python (on my mid-2014 MBP). |
For future reference – if we only look for words within a small maximum edit distance, a deletion-neighborhood index in the style of FastSS applies. This should be pretty fast; I just tried TinyFastSS and on text8 (253,854 terms in the dictionary) I'm getting thousands of queries per second. Pure Python, a single module of a hundred lines of code. Large potential for further optimizations too, given the simplicity of the algo. EDIT: FastSS is indeed faster; implemented in #3146.
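The FastSS idea is simple enough to sketch in a few lines (an illustrative sketch only; TinyFastSS differs in its details and its verification step): index every dictionary word under all variants obtainable by deleting up to k characters, then at query time generate the query's deletion variants, look them up, and verify the candidates with an exact distance check. Any two words within edit distance k are guaranteed to share at least one deletion variant.

```python
from itertools import combinations

def deletions(word, k):
    """All strings obtainable from word by deleting at most k characters."""
    variants = {word}
    for n in range(1, min(k, len(word)) + 1):
        for drop in combinations(range(len(word)), n):
            variants.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return variants

def levenshtein(s, t):
    """Plain dynamic-programming Levenshtein distance, used for verification."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def build_index(words, k):
    """Map each deletion variant to the set of dictionary words producing it."""
    index = {}
    for w in words:
        for v in deletions(w, k):
            index.setdefault(v, set()).add(w)
    return index

def query(index, word, k):
    """Dictionary words within edit distance k of word."""
    candidates = set()
    for v in deletions(word, k):
        candidates |= index.get(v, set())
    # Sharing a deletion variant is necessary but not sufficient: verify.
    return {w for w in candidates if levenshtein(w, word) <= k}
```

The index grows combinatorially in k, which is why the approach fits small fixed distances (1 or 2) so well.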
Expected behavior
As the doc says, supplying `max_distance` should make the computation more efficient (faster):
https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/levenshtein.py#L32-L37

Problem description
However, `max_distance` does not actually speed up the process:
https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/levenshtein.py#L48-L51