Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

min_similarity & max_distance does not work in levsim #2541

Closed
BitMindLab opened this issue Jun 28, 2019 · 19 comments · Fixed by #3146
Closed

min_similarity & max_distance does not work in levsim #2541

BitMindLab opened this issue Jun 28, 2019 · 19 comments · Fixed by #3146
Labels
good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest help wanted

Comments

@BitMindLab
Copy link

BitMindLab commented Jun 28, 2019

Expecting

As doc says, max_distance is a more efficient way, more quickly.
https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/levenshtein.py#L32-L37

Problem description

However, max_distance does not speed up the process

https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/levenshtein.py#L48-L51

@piskvorky
Copy link
Owner

@BitMindLab please fill in the issue template fully: what you're trying, what you're expecting, what you're seeing instead.

@piskvorky piskvorky added the need info Not enough information for reproduce an issue, need more info from author label Jun 28, 2019
@BitMindLab BitMindLab changed the title min_similarity does not work in levsim min_similarity & max_distance does not work in levsim Jul 2, 2019
@gojomo
Copy link
Collaborator

gojomo commented Jul 12, 2019

The doc-comment clearly does not match the code behavior, which gains no efficiency benefit from the specification of max_distance - it just sometimes returns a different value.

Either the code could be changed to match the comment intent, or the comment could be updated to not make false claims about what the code does.

@gojomo gojomo removed the need info Not enough information for reproduce an issue, need more info from author label Jul 12, 2019
@piskvorky
Copy link
Owner

piskvorky commented Jul 12, 2019

Whether the code does something more efficient or not is up to the code. I proposed (much) faster early-out variants in the past, but IIRC @Witiko didn't think it was worth it.

The only problem I see is that the documentation is unclear whether supplying max_distance must speed up the computation, or simply may speed it up. The intent is may: no guarantees, except that it's not slower.

So if the user knows their max distance threshold, they should always supply it. With the current implementation, supplying the parameter has no real effect, but that may change in the future.

@Witiko @gojomo @BitMindLab @mpenkov can you clarify the docstring wording? Or even better, implement the early-out optimization more intelligently (no need to compute the full Levenshtein distance once it's known to be above the max_distance threshold for certain).

@Witiko
Copy link
Contributor

Witiko commented Jul 12, 2019

@piskvorky As discussed in #2016 (comment), replacing the levsim function with a stub (i.e. Levenshtein distance is not even computed) leads to less than 2× speed improvement. My conclusion was that rather than optimize the pointwise levsim function, which will lead to virtually no speed-up, we should be computing the Levenshtein distance between all word pairs in bulk.

@piskvorky
Copy link
Owner

piskvorky commented Jul 13, 2019

That's the one, thanks for digging it up :)

I assume "2×" also depends on the particular inputs, no? I.e. the longer the strings => the more costly to compute exact Levenshtein (especially compared to a low max_distance threshold… for example, with max_distance=N, we can exit early by simply checking abs(len(string1) - len(string2)) > N; compare to O(len(string1) * len(string2)) of full Levenshtein).

Replacing naive pair-wise computations by a better algo sounds good too. Which algo do you have in mind? Some trie / automaton? How would that work?

@Witiko
Copy link
Contributor

Witiko commented Jul 13, 2019

The average time complexity of computing the edit distance for all pairs using a pairwise algorithm is O(M²N²), where M is dictionary size and N is average word length. In fact, a bit tighter bound exists with respect to N, since a subquadratic algorithm was recently discovered, see here.

However, since N is constant for a language and M >> N, the actual average time complexity is O(M²), i.e. there is virtually no difference between computing pointwise edit distance and calling a stub, as seen in the measurements.

I am currently unaware of any algorithm for computing the edit distance in bulk, but I am looking. We can gain a little speed-up by optimizing the python-Levenshtein library, but the improvements will be insignificant.

@piskvorky
Copy link
Owner

piskvorky commented Jul 13, 2019

The improvements of early-out over using full Levenshtein distance are substantial. I've seen 10x+, gains (again, depending on string length: the longer the two inputs, the bigger the gain; two super large inputs are enough to halt your computation, you cannot take "average word length N" over squares like that).

Gensim is not an academic library, performance on real inputs and constant factors matter a lot.

If the "normal" words you tried on are all equally small, that's great, but is it true for all supported use-cases / languages?

The average time complexity of computing the edit distance for all pairs using a pairwise algorithm is O(M²N²)

But that's not the API I see in the TermSimilarityIndex module. The API claims it's looking for Get most similar terms for a given term.:

https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/termsim.py#L34

Finding a small set of all words from a fixed static lexicon that are within a given distance from a dynamic query word is a well-studied problem. Trie and automata are the typical solution. It can be done extremely fast.

I've done this multiple times before myself (for spelling correction, incl. with custom edit distance to give different edit weights to typos, character accent marks "ř" vs "r", keyboard layouts "y" vs "z" etc). It should be fun to optimize this Gensim module too. But probably not any time soon, no spare time now, no itch to self-scratch.

@Witiko
Copy link
Contributor

Witiko commented Jul 13, 2019

If the "normal" words you tried on all equally small, that's great, but is it true for all supported use-cases / languages?

In the benchmark, we used the dictionary of the Google News dataset (English).
We would see a larger speed-up for languages with longer words such as German.

The API claims it's looking for Get most similar terms for a given term.:

This is true. In the SparseTermSimilarityMatrix class, we eventually call most_similar for every term, which is equivalent to computing edit distance between all word pairs in our current implementation. However, since we are only interested in the k nearest neighbors, our problem is easier.

Finding a small set of all words from a fixed static lexicon that are within a given distance from a dynamic query word is a well-studied problem. Trie and automata are the typical solution. It can be done extremely fast.

I've done this multiple times before myself (for spelling correction, incl. with custom edit distance fnc to give different edit weights to typos, character accent marks "ř" vs "r", keyboard layouts "y" vs "z" etc). It should be fun to optimize this Gensim module too. But probably not any time soon, no spare time now, no itch to self-scratch.

Here are some references:

  1. Johnson, N., 2007. Nick's Blog. Damn Cool Algorithms, Part 1: BK-Trees. [URL]
  2. Johnson, N., 2010. Nick's Blog. Damn Cool Algorithms: Levenshtein Automata. [URL]
  3. Boytsov, L., 2011. Journal of Experimental Algorithmics. Indexing methods for approximate dictionary searching: Comparative analysis. [URL]

Regrettably, I will have little time to work on this during July and August.
Perhaps some other contributor will take up this fun project in the meantime.

@mpenkov mpenkov added good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest help wanted labels Sep 28, 2019
@Witiko
Copy link
Contributor

Witiko commented May 15, 2021

@piskvorky Two years later and I just came across a nice-and-simple solution, which should reduce the complexity of our kNN algorithm over the Levenshtein distance from O(M²) to O(M * logM), where M is the number of indexed words. The solution would use the ball tree from scikit-learn, which is a structure for indexing metric spaces that reduces the number of comparisons from M to logM when the tree is balanced compared to our current naïve brute-force approach.

There exist better algorithms specifically for kNN over the Levenshtein distance, some of which I list in #2541 (comment). However, their implementations in Python are elusive. The ball tree is implemented in a Gensim dependency and the use is straightforward. We may wish to pursue further speed-up using approximate kNN in the future, but this seems as a major improvement over the current situation, where the Levenshtein distance is virtually useless with any larger-than-toy dictionary.

@piskvorky
Copy link
Owner

Sure, if that works. I vaguely remember trying ball-tree and it didn't work well, but that was in a different context. I'd have to "load up" this thread into my memory to remember what it's about, but I'm sure you already did that :-)

Let me know when you get some numbers!

@Witiko
Copy link
Contributor

Witiko commented May 15, 2021

After playing with it for a bit, I was disappointed to discover that sklearn's BallTree assumes Euclidean space under the hood and uses centroids to represent intermediate nodes, which rules out non-vector metric spaces such as strings over the Levenshtein distance. I found a number of metric tree implementations in Python. However, I only managed to double the speed, which is arguably insufficient to make kNN search over the Levenshtein distance practically useful. I shared my measurements to #3146.

@piskvorky
Copy link
Owner

piskvorky commented May 15, 2021

Pity. I don't remember the particulars of the problem you're solving here, but searching for "similar strings to a given string", where similar = "Damerau-Levenshtein within a given distance among a known, static set of strings", can be done extremely fast, using trie and DAWG automata. I see that as a more viable approach than fancy clustering over metric spaces.

Why not do that, can you remind me?

@Witiko
Copy link
Contributor

Witiko commented May 15, 2021

I was curious how much speed we could gain with metric indexing, which is a nice general technique. I also tried the DAWG automata as implemented by lexpy with the following results: we get 2x speed-up when finding strings within edit distance 2 (i.e. roughly the same as VT-Tree) and up to 10x speed-up when finding strings within edit distance 1. We would need ca 200x speed-up to be in the same ballpark as vector space kNN search of word embeddings.

@piskvorky
Copy link
Owner

piskvorky commented May 15, 2021

10x speed-up compared to brute-forcing each comparison query_string x lexicon_string? That doesn't sound right.

I'd expect the speed up to depend on the number of answers = words within the lexicon that are actually within distance N. Any fixed number is suspicious. We're probably talking about different algos here.

@Witiko do you have a standardized benchmark that you use for this? With some realistic size, filled with realistic words. If the problem is easy to eval, I'll give it a try.

@Witiko
Copy link
Contributor

Witiko commented May 15, 2021

This speed-up is with regards to the small benchmark I set up at #3146. For the DAWG, I used the DAWG.search_within_distance() method exposed by lexpy.

@piskvorky
Copy link
Owner

piskvorky commented May 15, 2021

OK I'll check #3146, to see how the overall problem is formulated, what is the goal, what's being measured.

@Witiko
Copy link
Contributor

Witiko commented May 15, 2021

I am getting more favorable results for DAWG compared to VP-Tree with the bigger text8 corpus (100MB):

  • ca 0.8 it/s with the current brute-force kNN
  • ca 1.6 it/s with VP-Tree kNN
  • ca 8 it/s with DAWG searching for words within edit distance 2
  • ca 80 it/s with DAWG searching for words within edit distance 1
  • ca 500 it/s with WordEmbeddingSimilarityIndex

This is arguably quite nice. I added the implementation to #3146, we can merge when ready.

@piskvorky
Copy link
Owner

piskvorky commented May 16, 2021

See my comment there – I got several times faster results by using an automaton, in pure Python (on my mid-2014 MBP).

@piskvorky
Copy link
Owner

piskvorky commented May 16, 2021

For future reference – if we only look for k=1 (not larger k), we can probably bypass automata altogether, and generate the candidate words directly instead, to look up in the dictionary exactly.

This should be pretty fast; I just tried TinyFastSS and on text8 (253,854 terms in the dictionary) I'm getting thousands of queries per second. Pure Python, a single module of a hundred lines of code. Large potential for further optimizations too, given the simplicity of the algo.

EDIT: FastSS indeed faster; implemented in #3146.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants