Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional argument in_memory to FastSS __init__ and open to store index in dictionary #1

Closed
wants to merge 1 commit into from

Conversation

rominf
Copy link

@rominf rominf commented May 6, 2019

On dictionary of 375 words query on index, opened with in_memory,
runs a third faster.

…index in dictionary

On dictionary of 375 words query on index, opened with `in_memory`,
runs a third faster.
@piskvorky
Copy link

piskvorky commented May 16, 2021

Since this package is in the public domain, I cleaned it up a bit, got rid of the DB stuff (in-memory + pickle is both simpler and faster), optimized it and added it to Gensim: piskvorky/gensim#3146.

Thanks @fujimotos @rominf for your great work! It helped us with our indexing.

@fujimotos
Copy link
Owner

@piskvorky For my part, it's perfectly fine to embed TinyFastSS that way,

I placed this program into public domain, mostly to make it easier to
embed it into any larger programs.

@piskvorky
Copy link

piskvorky commented May 20, 2021

FYI, an optimized version of in-memory TinyFastSS is now included in Gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/similarities/fastss.pyx

Query benchmarks against a 180,000 index of English words, max_dist=2:

  • common short word ("two"): 31x faster
  • mid-tier word ("koala"): 51x faster
  • rare long word ("registrable"): 8x faster

@fujimotos
Copy link
Owner

@piskvorky For that matter, I recommend you to check out polyleven.

Use polyleven.levenshtin() as a drop-in replacement of ceditdist(), and you'll
probably be able to archive even more speedup...

import polyleven

def editdist(s1, s2, max_dist):
    return polyleven.levenshtein(s1, s2, max_dist)

@piskvorky
Copy link

piskvorky commented May 20, 2021

That seems slower actually:

timeit fastss.editdist('uživatel', 'živočich')
310 ns ± 9.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

timeit polyleven.levenshtein('uživatel', 'živočich')
444 ns ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

But for us, not having an external dependency was more important than a few percent here and there. That was the main goal. Which is why your TinyFastSS helped so much ❤️

EDIT: I checked the polyleven source code, that MBLEVEN_MATRIX approach is cool! It's probably indeed faster in some cases.

@maxbachmann
Copy link

maxbachmann commented May 21, 2021

@piskvorky interesting I just posted here: piskvorky/gensim#3146, since I could not reproduce your differences.

Using your example strings I get the following results on my machine:

polyleven.levenshtein: 249 ns
Levenshtein.distance: 174 ns
rapidfuzz.string_metric.levenshtein: 147 ns
fastss.editdist: 222 ns
empty function: 67 ns

As a comparision I used the following empty function:

def func(a, b):
    pass

At least for these small strings a big part of the runtime is the function call overhead + parsing of the arguments.

@piskvorky
Copy link

piskvorky commented May 21, 2021

Microbenchmarking like that doesn't make much sense IMO, except to get a rough order-of-magnitude idea. Check out the full benchmark in the PR you link to, that should be more realistic – at least for the kind of application we're aiming at in Gensim: NLP, relatively short strings, definitely not random.

@rominf rominf closed this Feb 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants