
speeding up similarity queries #5

Closed
piskvorky opened this issue Feb 27, 2011 · 3 comments

Comments

@piskvorky (Owner)

Currently, if one is only interested in the top-n most similar documents, the MatrixSimilarity and SparseMatrixSimilarity classes compute all similarities, sort them all, and then clip the result to the top n.

The sorting is actually slower than the matrix multiplication itself:
http://groups.google.com/group/gensim/browse_thread/thread/f6b839ceaa16c834
so, for a start, speed up the post-processing (sorting) part.
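For concreteness, the pattern described above can be sketched like this (the `top_n_current` helper is hypothetical, written here just to illustrate the pattern; the real classes first produce the full similarity vector via a matrix multiplication):

```python
import numpy

def top_n_current(sims, n=10):
    """Current pattern: compute ALL similarities, sort them all, clip to top n.

    `sims` is a dense 1-d numpy array of similarity scores, one per document.
    """
    # form (index, score) pairs for every document, dropping explicit zeroes
    pairs = [(i, s) for i, s in enumerate(sims) if abs(s) > 1e-10]
    # full sort costs O(len(sims) * log(len(sims))) even though only n results are kept
    pairs.sort(key=lambda item: -item[1])
    return pairs[:n]

sims = numpy.random.rand(1000)
best = top_n_current(sims)
```

The full sort is the part this issue proposes to optimize away.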

@piskvorky (Owner, Author)

I compared three functions. All take an array of input values s and return a sparse list (no explicit zeroes) of its ten greatest elements as (index, s[index]) pairs:

  1. current approach: form a list of non-zero (index, s[index]) 2-tuples, sort it by s[index], return the top 10
  2. only keep track of the ten largest values: form the list as above, but use heapq.nlargest to avoid sorting it all; asymptotically faster
  3. use numpy.argsort(s) to avoid forming the tuples explicitly; filter out zeroes only at the end

timeit results:

| len(s)  | 1) full sort | 2) heapq.nlargest | 3) numpy.argsort |
|---------|--------------|-------------------|------------------|
| 1,000   | 3.46 ms      | 3.41 ms           | 99.1 µs          |
| 10,000  | 43.7 ms      | 34 ms             | 916 µs           |
| 100,000 | 502 ms       | 337 ms            | 11.2 ms          |
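The numbers above can be reproduced with a small `timeit` harness; this sketch times just one variant (approach 3), and the absolute figures will of course vary by machine and NumPy version:

```python
import timeit

import numpy

s = numpy.random.rand(10000)

def s3():
    # approach 3: numpy.argsort, forming (index, value) tuples only for the winners
    result = []
    for index in numpy.argsort(s)[::-1]:
        if abs(s[index]) > 1e-10:
            result.append((index, s[index]))
            if len(result) == 10:
                break
    return result

# best-of-3 runs, averaged over 100 calls each
per_call = min(timeit.repeat(s3, number=100, repeat=3)) / 100
print("len(s)=%d: %.1f us per call" % (len(s), per_call * 1e6))
```

Swap in the other two functions (defined in the next comment) to fill out the rest of the table.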

@piskvorky (Owner, Author)

```python
import heapq

import numpy

s = numpy.random.rand(10000)

def s1():
    # 1) current approach: build all non-zero (index, sim) pairs, sort everything
    l = [(index, sim) for index, sim in enumerate(s) if abs(sim) > 1e-10]
    return sorted(l, key=lambda item: -item[1])[:10]

def s2():
    # 2) same pairs, but keep only the ten largest with a heap instead of a full sort
    l = [(index, sim) for index, sim in enumerate(s) if abs(sim) > 1e-10]
    return heapq.nlargest(10, l, key=lambda item: item[1])

def s3():
    # 3) let numpy sort the indices; only materialize tuples for the top hits
    result = []
    for index in numpy.argsort(s)[::-1]:
        if abs(s[index]) > 1e-10:
            result.append((index, s[index]))
            if len(result) == 10:
                break
    return result
```
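As a side note (not part of the original discussion): NumPy later gained `argpartition` (1.8+, released after this issue was filed), which selects the n largest in O(len(s)) time and avoids even the full argsort. A hedged sketch of that variant:

```python
import numpy

def top_n_argpartition(s, n=10):
    # argpartition does a linear-time partial sort; assumes n < len(s)
    top = numpy.argpartition(-s, n)[:n]   # indices of the n largest, in no order
    top = top[numpy.argsort(-s[top])]     # now order just those n
    return [(int(i), float(s[i])) for i in top if abs(s[i]) > 1e-10]
```

It returns the same (index, value) pairs as s3 above, replacing the O(len(s) log len(s)) argsort with a partition plus a sort of only n elements.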

@Dieterbe (Contributor)

> Actually the sorting is slower than the matrix multiplication

About 5-6 times slower in my case. But note that I have very short documents (on average 30 tokens or fewer; I haven't counted), so the dramatic difference I see probably doesn't apply to everyone. I'd guess that with documents of hundreds of tokens, the similarity calculations will outweigh the sorting.

Either way, this sorting optimisation works very well for me (http://groups.google.com/group/gensim/msg/ae8e4d58d0ead1fb).
Now let's tackle the actual similarity calculations!
