min_similarity & max_distance does not work in levsim #2541
@BitMindLab please fill in the issue template fully: what you're trying, what you're expecting, what you're seeing instead.
The doc-comment clearly does not match the code behavior, which gains no efficiency benefit from the specification of `max_distance`. Either the code could be changed to match the comment's intent, or the comment could be updated to not make false claims about what the code does.
Whether the code does something more efficient or not is up to the code. I proposed (much) faster early-out variants in the past, but IIRC @Witiko didn't think it was worth it. The only problem I see is that the documentation is unclear about whether supplying `max_distance` actually helps. So if the user knows their maximum distance threshold, they should always supply it: with the current implementation, supplying the parameter has no real effect, but that may change in the future. @Witiko @gojomo @BitMindLab @mpenkov can you clarify the docstring wording? Or even better, implement the early-out optimization more intelligently (no need to compute the full Levenshtein distance once it's known to be above the `max_distance` threshold)?
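The early-out idea can be illustrated with a small sketch (illustrative code, not Gensim's implementation; the function name and the "return `max_distance + 1` when over the bound" convention are my own). Because each cell of a dynamic-programming row is derived from neighboring cells by adding 0 or 1, row minima never decrease, so once every cell of the current row exceeds the bound we can stop:

```python
def bounded_levenshtein(s, t, max_distance):
    """Levenshtein distance of s and t, or max_distance + 1 if the
    true distance exceeds max_distance (early-out)."""
    if abs(len(s) - len(t)) > max_distance:
        return max_distance + 1  # the length gap alone exceeds the bound
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (cs != ct)))     # substitution
        if min(curr) > max_distance:
            # Row minima are non-decreasing, so the final distance
            # cannot come back under the bound: stop early.
            return max_distance + 1
        prev = curr
    return prev[-1]
```

The gain grows with string length: for long, dissimilar strings most of the O(|s|·|t|) table is never computed.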
@piskvorky As discussed in #2016 (comment), replacing the Levenshtein distance computation with a stub gave only about a 2× overall speed-up in my measurements, so an early-out optimization cannot save more than that factor on those inputs.
That's the one, thanks for digging it up :) I assume the "2×" also depends on the particular inputs, no? I.e. the longer the strings, the more costly it is to compute the exact Levenshtein distance (especially compared to an early-out with a low `max_distance`). Replacing naive pair-wise computations with a better algo sounds good too. Which algo do you have in mind? Some trie / automaton? How would that work?
The average time complexity of computing the edit distance for all pairs using a pairwise algorithm is O(M²N²), where M is the dictionary size and N is the average word length. In fact, a slightly tighter bound exists with respect to N, since a subquadratic algorithm was recently discovered, see here. However, since N is roughly constant for a language and M ≫ N, the actual average time complexity is O(M²), i.e. there is virtually no difference between computing the pointwise edit distance and calling a stub, as seen in the measurements. I am currently unaware of any algorithm for computing the edit distance in bulk, but I am looking. We can gain a little speed-up by optimizing the python-Levenshtein library, but the improvements will be insignificant.
The improvements of early-out over computing the full Levenshtein distance are substantial. I've seen 10x+ gains (again, depending on string length: the longer the two inputs, the bigger the gain; two super large inputs are enough to halt your computation, so you cannot average "word length N" over squares like that). Gensim is not an academic library; performance on real inputs and constant factors matter a lot. If the "normal" words you tried on are all equally small, that's great, but is it true for all supported use-cases / languages?
But that's not the API I see in the code here. Finding the small set of all words from a fixed static lexicon that are within a given distance of a dynamic query word is a well-studied problem. Tries and automata are the typical solution. It can be done extremely fast. I've done this multiple times before myself (for spelling correction, incl. with a custom edit distance to give different edit weights to typos, character accent marks "ř" vs "r", keyboard layouts "y" vs "z" etc). It should be fun to optimize this Gensim module too. But probably not any time soon: no spare time now, no itch to self-scratch.
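The trie technique can be sketched as follows (an illustrative pure-Python sketch of the well-known algorithm, not code from Gensim or from my past projects): walk the trie depth-first, carrying one row of the Levenshtein DP table per node, and prune any subtree whose row minimum already exceeds the allowed distance. Because all lexicon words sharing a prefix share that prefix's DP rows, work is amortized across the whole lexicon.

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that terminate a lexicon word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, query, max_dist):
    """All (word, distance) pairs within Levenshtein distance max_dist of query."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from the empty prefix
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_dist)
    return results

def _walk(node, ch, query, prev_row, results, max_dist):
    # Extend the DP table by one row for the trie edge labeled `ch`.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(row[j - 1] + 1,                           # insertion
                       prev_row[j] + 1,                          # deletion
                       prev_row[j - 1] + (query[j - 1] != ch)))  # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:  # otherwise no extension can get back under the bound
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, results, max_dist)
```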
In the benchmark, we used the dictionary of the Google News dataset (English).
This is true.
Here are some references:
Regrettably, I will have little time to work on this during July and August.
@piskvorky Two years later and I just came across a nice-and-simple solution, which should reduce the complexity of our kNN algorithm over the Levenshtein distance from O(M²) to O(M · log M), where M is the number of indexed words. The solution would use the ball tree from scikit-learn, a structure for indexing metric spaces that reduces the number of comparisons from M to log M when the tree is balanced, compared to our current naïve brute-force approach. There exist better algorithms specifically for kNN over the Levenshtein distance, some of which I list in #2541 (comment). However, their implementations in Python are elusive. The ball tree is implemented in a Gensim dependency and its use is straightforward. We may wish to pursue further speed-up using approximate kNN in the future, but this seems like a major improvement over the current situation, where the Levenshtein distance is virtually useless with any larger-than-toy dictionary.
Sure, if that works. I vaguely remember trying ball-tree and it didn't work well, but that was in a different context. I'd have to "load up" this thread into my memory to remember what it's about, but I'm sure you already did that :-) Let me know when you get some numbers!
After playing with it for a bit, I was disappointed to discover that sklearn's BallTree assumes Euclidean space under the hood and uses centroids to represent intermediate nodes, which rules out non-vector metric spaces such as strings over the Levenshtein distance. I found a number of metric tree implementations in Python. However, I only managed to double the speed, which is arguably insufficient to make kNN search over the Levenshtein distance practically useful. I shared my measurements in #3146.
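For context, a vantage-point tree is one of the metric-tree structures that does work for strings, since it only needs the distance function and the triangle inequality; a minimal illustrative sketch (not the implementation benchmarked above, and with a naive non-random vantage-point choice):

```python
def levenshtein(s, t):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

class VPTree:
    """Vantage-point tree: split points by distance to a vantage point;
    the triangle inequality lets range queries skip whole subtrees."""
    def __init__(self, points, dist=levenshtein):
        self.dist = dist
        self.vantage = points[0]  # naive choice; usually picked at random
        rest = points[1:]
        self.mu = self.inner = self.outer = None
        if rest:
            dists = [dist(self.vantage, p) for p in rest]
            self.mu = sorted(dists)[len(dists) // 2]  # median radius
            inner = [p for p, d in zip(rest, dists) if d < self.mu]
            outer = [p for p, d in zip(rest, dists) if d >= self.mu]
            self.inner = VPTree(inner, dist) if inner else None
            self.outer = VPTree(outer, dist) if outer else None

    def range_query(self, q, r, out=None):
        """All (point, distance) pairs with dist(q, point) <= r."""
        if out is None:
            out = []
        d = self.dist(q, self.vantage)
        if d <= r:
            out.append((self.vantage, d))
        if self.mu is not None:
            # Inner ball can contain a hit only if d - r < mu,
            # outer shell only if d + r >= mu (triangle inequality).
            if self.inner is not None and d - r < self.mu:
                self.inner.range_query(q, r, out)
            if self.outer is not None and d + r >= self.mu:
                self.outer.range_query(q, r, out)
        return out
```

The modest observed speed-ups make sense: with small integer distances, many queries satisfy both pruning conditions and descend into both subtrees.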
Pity. I don't remember the particulars of the problem you're solving here, but searching for "similar strings to a given string", where similar = "Damerau-Levenshtein within a given distance among a known, static set of strings", can be done extremely fast using tries and DAWG automata. I see that as a more viable approach than fancy clustering over metric spaces. Why not do that, can you remind me?
I was curious how much speed we could gain with metric indexing, which is a nice general technique. I also tried the DAWG automata as implemented by lexpy, with the following results: we get a 2x speed-up when finding strings within edit distance 2 (i.e. roughly the same as the VP-tree) and up to a 10x speed-up when finding strings within edit distance 1. We would need a ca 200x speed-up to be in the same ballpark as vector-space kNN search over word embeddings.
Regarding "10x speed-up compared to brute-forcing each comparison": I'd expect the speed-up to depend on the number of answers, i.e. the words in the lexicon that are actually within distance N. Any fixed number is suspicious; we're probably talking about different algos here. @Witiko do you have a standardized benchmark that you use for this? With some realistic size, filled with realistic words. If the problem is easy to eval, I'll give it a try.
OK I'll check #3146, to see how the overall problem is formulated, what is the goal, what's being measured. |
I am getting more favorable results for DAWG compared to VP-tree with the bigger text8 corpus (100 MB):
This is arguably quite nice. I added the implementation to #3146; we can merge when ready.
See my comment there – I got several times faster results by using an automaton, in pure Python (on my mid-2014 MBP). |
For future reference – if we only look for words within a small maximum edit distance, a deletion-neighborhood index in the style of FastSS applies. This should be pretty fast; I just tried TinyFastSS and on text8 (253,854 terms in the dictionary) I'm getting thousands of queries per second. Pure Python, a single module of a hundred lines of code. Large potential for further optimizations too, given the simplicity of the algo. EDIT: FastSS is indeed faster; implemented in #3146.
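The FastSS idea is simple enough to sketch in a few lines (an illustrative sketch only; TinyFastSS differs in its details and its verification step): index every dictionary word under all variants obtainable by deleting up to k characters, then at query time generate the query's deletion variants, look them up, and verify the candidates with an exact distance check. Any two words within edit distance k are guaranteed to share at least one deletion variant.

```python
from itertools import combinations

def deletions(word, k):
    """All strings obtainable from word by deleting at most k characters."""
    variants = {word}
    for n in range(1, min(k, len(word)) + 1):
        for drop in combinations(range(len(word)), n):
            variants.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return variants

def levenshtein(s, t):
    """Plain dynamic-programming Levenshtein distance, used for verification."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def build_index(words, k):
    """Map each deletion variant to the set of dictionary words producing it."""
    index = {}
    for w in words:
        for v in deletions(w, k):
            index.setdefault(v, set()).add(w)
    return index

def query(index, word, k):
    """Dictionary words within edit distance k of word."""
    candidates = set()
    for v in deletions(word, k):
        candidates |= index.get(v, set())
    # Sharing a deletion variant is necessary but not sufficient: verify.
    return {w for w in candidates if levenshtein(w, word) <= k}
```

The index grows combinatorially in k, which is why the approach fits small fixed distances (1 or 2) so well.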
Expected behavior
As the doc says, supplying `max_distance` should make the computation more efficient (faster):
https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/levenshtein.py#L32-L37

Problem description
However, `max_distance` does not actually speed up the process:
https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/levenshtein.py#L48-L51