Support multiple `most_similar()` queries in one call #2987

gojomo · 2020-10-20T18:54:23Z

SpaCy's most_similar (https://spacy.io/api/vectors#most_similar) accepts multiple queries at a time, and further may then break them into batches. In so doing, the expensive dot call at the heart of the calculation can work on larger chunks of data at a time, and visit each row of a large source array just once for multiple results - potentially a noticeable speedup.
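A minimal NumPy sketch of the batching idea, not taken from spaCy or Gensim: the function name `most_similar_batch` and its `topn`/`batch_size` parameters are hypothetical, and it assumes dense, non-zero vectors held in RAM. Each batch of queries is scored with a single matrix product, so the large array is traversed once per batch rather than once per query.

```python
import numpy as np


def most_similar_batch(vectors, queries, topn=10, batch_size=1024):
    """Return (indices, scores) of the `topn` rows of `vectors` that are
    most cosine-similar to each row of `queries`.

    Hypothetical helper for illustration; assumes no all-zero rows.
    """
    # Unit-normalize both sides so a dot product equals cosine similarity.
    vecs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    qs = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    topn = min(topn, vecs.shape[0])

    all_idx, all_scores = [], []
    for start in range(0, qs.shape[0], batch_size):
        # One matrix product scores a whole batch against every row.
        sims = qs[start:start + batch_size] @ vecs.T  # shape (batch, n_rows)
        # argpartition picks the top-n per row without a full sort ...
        part = np.argpartition(-sims, topn - 1, axis=1)[:, :topn]
        part_scores = np.take_along_axis(sims, part, axis=1)
        # ... and only those n candidates are then sorted by score.
        order = np.argsort(-part_scores, axis=1)
        all_idx.append(np.take_along_axis(part, order, axis=1))
        all_scores.append(np.take_along_axis(part_scores, order, axis=1))
    return np.concatenate(all_idx), np.concatenate(all_scores)
```

The `batch_size` cap bounds the size of the intermediate similarity matrix, which otherwise grows with the number of queries times the number of rows.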

Gensim could consider upgrading most_similar() to offer the same batch efficiency.

(Thought inspired by #2986's hopes-for-certain optimizations.)


piskvorky · 2020-10-20T19:10:01Z

Our similarity classes in docsim already do this. Unifying the code path of most_similar() with standard similarity is preferable to further divergence.

But more generally, where speed matters, our integration with approximate NN search (Annoy, NMSLIB, possibly others) is the best ROI.

gojomo · 2020-10-20T22:31:25Z

ANN introduces enough complications, in preparation/deployment/caveats, that I believe its support should be qualified as an "advanced, if needed" option, rather than a thing anyone can/should drop in "for speed". And so, anything that puts off the need for those extra steps and imprecise results for a larger group of users is potentially valuable. (#2883 has a recent example of someone who was overcomplicating things with Annoy prematurely.)

I also suspect for many users with in-RAM datasets, batching/amortizing full, precise calculations may offer a speedup that's competitive with approximate indexing, without the extra indexing costs or result imprecision, though tests could prove that wrong.

most_similar() could definitely use some API improvement for clarity/flexibility - but to the extent I understand the interfaces in gensim.interfaces and gensim.similarities.docsim, I find the most_similar() approach already clearer & more flexible. For example:

- the main similarity lookup in SimilarityABC seems to be overloaded onto __getitem__ ([]) lookup - which doesn't strike me as idiomatic/intuitive Python, or adequately explicit, and creates issues when (as has been common in the 2Vec world) it is important for the same model to also offer a true simple 1-item lookup
- overloading []-lookup also isn't amenable to the kinds of optional parameters that can be useful for a most-similars search - like topn or normalize, which can only be set indirectly, globally, on the Similarity instance
- looking at MatrixSimilarity.get_similarities() or SimilarityABC.__getitem__(), their major concern seems to be efficiency on sparse arrays - whereas other domains are almost exclusively dense. Mixing the codepaths could thus complicate both for no net benefit.
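To make the API point concrete, here is a toy sketch of the alternative shape being argued for. `DenseIndex` is a hypothetical class, not part of Gensim; it only illustrates keeping `[]` for plain 1-item lookup while the similarity search is an explicitly named method that takes `topn` and `normalize` per call.

```python
import numpy as np


class DenseIndex:
    """Hypothetical toy index for illustration; not a gensim class."""

    def __init__(self, vectors, keys):
        self.keys = list(keys)
        self.key_to_row = {key: row for row, key in enumerate(self.keys)}
        self.vectors = np.asarray(vectors, dtype=float)
        # Pre-normalized copy so searches reduce to one dot product.
        norms = np.linalg.norm(self.vectors, axis=1, keepdims=True)
        self.unit = self.vectors / norms

    def __getitem__(self, key):
        # [] stays a simple lookup: key -> stored vector.
        return self.vectors[self.key_to_row[key]]

    def most_similar(self, query, topn=10, normalize=True):
        # Search options are explicit per-call arguments,
        # not global settings on the instance.
        q = np.asarray(query, dtype=float)
        if normalize:
            q = q / np.linalg.norm(q)
        sims = self.unit @ q
        best = np.argsort(-sims)[:topn]
        return [(self.keys[i], float(sims[i])) for i in best]
```

With this shape, `index["a"]` returns a vector and `index.most_similar(vec, topn=2)` returns ranked `(key, score)` pairs, so the two operations cannot be confused.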

So: unified conventions in method-names/parameters make sense, but it may make sense for implementations to still diverge, and it may make sense for docsim classes to move more in the direction of the explicitness elsewhere.

piskvorky · 2020-10-21T09:12:35Z

Similarity handles both sparse and dense arrays. MatrixSimilarity is for dense arrays only. Both Similarity and MatrixSimilarity allow both single item & multiple item (batch) queries.

Re. []: "Unifying" could imply adding a named function, yes. I forget whether there's one already or not, but adding an alias is trivial.

There's no functional difference: most_similar and docsim.Similarity do exactly the same thing conceptually. docsim is more general and efficient; most_similar is more tightly tied to its specific "word embedding" use-case.

piskvorky added the feature, performance, and wishlist labels on Oct 20, 2020.
