
speeding up similarity queries #5

Closed
piskvorky opened this issue Feb 27, 2011 · 3 comments

Comments

@piskvorky (Owner)

Currently, if one is only interested in the top-n most similar documents, the MatrixSimilarity and SparseMatrixSimilarity classes compute all similarities, sort them all, and then clip the result to the top n.

The sorting is actually slower than the matrix multiplication itself:
http://groups.google.com/group/gensim/browse_thread/thread/f6b839ceaa16c834
so, for a start, speed up the post-processing (sorting) part.
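For concreteness, the pattern described above can be sketched like this (the `top_n_current` helper is hypothetical, written here just to illustrate the pattern; the real classes first produce the full similarity vector via a matrix multiplication):

```python
import numpy

def top_n_current(sims, n=10):
    """Current pattern: compute ALL similarities, sort them all, clip to top n.

    `sims` is a dense 1-d numpy array of similarity scores, one per document.
    """
    # form (index, score) pairs for every document, dropping explicit zeroes
    pairs = [(i, s) for i, s in enumerate(sims) if abs(s) > 1e-10]
    # full sort costs O(len(sims) * log(len(sims))) even though only n results are kept
    pairs.sort(key=lambda item: -item[1])
    return pairs[:n]

sims = numpy.random.rand(1000)
best = top_n_current(sims)
```

The full sort is the part this issue proposes to optimize away.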

@piskvorky (Owner, Author)

I compared three functions. All take an array of input values s and return a sparse list (no explicit zeroes) of its ten greatest elements as (index, s[index]) pairs:

  1. current approach: form a list of non-zero (index, s[index]) 2-tuples, sort it by s[index], return the top 10
  2. only keep track of the ten largest values: form the list as above, but use heapq.nlargest to avoid sorting it all; asymptotically faster
  3. use numpy.argsort(s) to avoid forming the tuples explicitly; filter out zeroes only at the end

timeit results:

| len(s)  | 1) full sort | 2) heapq.nlargest | 3) numpy.argsort |
|---------|--------------|-------------------|------------------|
| 1,000   | 3.46 ms      | 3.41 ms           | 99.1 µs          |
| 10,000  | 43.7 ms      | 34 ms             | 916 µs           |
| 100,000 | 502 ms       | 337 ms            | 11.2 ms          |
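The numbers above can be reproduced with a small `timeit` harness; this sketch times just one variant (approach 3), and the absolute figures will of course vary by machine and NumPy version:

```python
import timeit

import numpy

s = numpy.random.rand(10000)

def s3():
    # approach 3: numpy.argsort, forming (index, value) tuples only for the winners
    result = []
    for index in numpy.argsort(s)[::-1]:
        if abs(s[index]) > 1e-10:
            result.append((index, s[index]))
            if len(result) == 10:
                break
    return result

# best-of-3 runs, averaged over 100 calls each
per_call = min(timeit.repeat(s3, number=100, repeat=3)) / 100
print("len(s)=%d: %.1f us per call" % (len(s), per_call * 1e6))
```

Swap in the other two functions (defined in the next comment) to fill out the rest of the table.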

@piskvorky (Owner, Author)

```python
import heapq

import numpy

s = numpy.random.rand(10000)

def s1():
    # 1) current approach: build all non-zero (index, sim) pairs, sort everything
    l = [(index, sim) for index, sim in enumerate(s) if abs(sim) > 1e-10]
    return sorted(l, key=lambda item: -item[1])[:10]

def s2():
    # 2) same pairs, but keep only the ten largest with a heap instead of a full sort
    l = [(index, sim) for index, sim in enumerate(s) if abs(sim) > 1e-10]
    return heapq.nlargest(10, l, key=lambda item: item[1])

def s3():
    # 3) let numpy sort the indices; only materialize tuples for the top hits
    result = []
    for index in numpy.argsort(s)[::-1]:
        if abs(s[index]) > 1e-10:
            result.append((index, s[index]))
            if len(result) == 10:
                break
    return result
```
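As a side note (not part of the original discussion): NumPy later gained `argpartition` (1.8+, released after this issue was filed), which selects the n largest in O(len(s)) time and avoids even the full argsort. A hedged sketch of that variant:

```python
import numpy

def top_n_argpartition(s, n=10):
    # argpartition does a linear-time partial sort; assumes n < len(s)
    top = numpy.argpartition(-s, n)[:n]   # indices of the n largest, in no order
    top = top[numpy.argsort(-s[top])]     # now order just those n
    return [(int(i), float(s[i])) for i in top if abs(s[i]) > 1e-10]
```

It returns the same (index, value) pairs as s3 above, replacing the O(len(s) log len(s)) argsort with a partition plus a sort of only n elements.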

@Dieterbe (Contributor)

> Actually the sorting is slower than the matrix multiplication

About 5-6 times slower in my case. But note that I have very short documents (on average 30 tokens or fewer; I haven't counted), so the dramatic difference I see probably doesn't apply to everyone. I'd guess that with documents of hundreds of tokens, the similarity calculations will outweigh the sorting.

Either way, this sorting optimisation works very well for me (http://groups.google.com/group/gensim/msg/ae8e4d58d0ead1fb).
Now let's tackle the actual similarity calculations!
