Support multiple most_similar() queries in one call #2987

Open
gojomo opened this issue Oct 20, 2020 · 3 comments
Labels
feature: Issue described a new feature · performance: Issue related to performance (in HW meaning) · wishlist: Feature request

Comments

@gojomo
Collaborator

gojomo commented Oct 20, 2020

spaCy's most_similar (https://spacy.io/api/vectors#most_similar) accepts multiple queries at a time, and further may then break them into batches. In so doing, the expensive dot call at the heart of the calculation can work on larger chunks of data at a time, and visit each row of a large source array just once for multiple results - potentially a noticeable speedup.

Gensim could consider upgrading most_similar() to offer the same batch efficiency.
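
A minimal NumPy sketch of the batching idea, for illustration only. The function name `batch_most_similar` and its parameters are hypothetical; this is neither spaCy's nor Gensim's implementation. It assumes dense, non-zero vectors and `topn` no larger than the number of stored rows.

```python
import numpy as np

def batch_most_similar(vectors, queries, topn=3):
    """Hypothetical helper: top-n cosine neighbours for several queries at once.

    vectors: (n_vectors, dim) array of stored vectors.
    queries: (n_queries, dim) array of query vectors.
    Returns (indices, scores), each of shape (n_queries, topn),
    sorted by descending similarity within each row.
    """
    # Unit-normalize both sides so a dot product equals cosine similarity.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)

    # One matrix product covers every query, instead of one
    # matrix-vector product per query.
    sims = unit_queries @ unit_vectors.T  # shape: (n_queries, n_vectors)

    # argpartition finds the topn best columns per row without a full sort...
    top = np.argpartition(-sims, topn - 1, axis=1)[:, :topn]
    top_scores = np.take_along_axis(sims, top, axis=1)

    # ...then only those topn candidates are sorted.
    order = np.argsort(-top_scores, axis=1)
    indices = np.take_along_axis(top, order, axis=1)
    scores = np.take_along_axis(top_scores, order, axis=1)
    return indices, scores
```

For very large arrays or many queries, the intermediate `sims` matrix can get big, so the queries would presumably be processed in fixed-size chunks.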

(Thought inspired by #2986's hopes-for-certain optimizations.)

@piskvorky
Owner

piskvorky commented Oct 20, 2020

Our similarity classes in docsim already do this. Unifying the code path of most_similar() with standard similarity is preferable to further divergence.

But more generally, where speed matters, our integration with approximate NN search (Annoy, NMSLIB, possibly others) is the best ROI.

@piskvorky added the feature, performance, and wishlist labels Oct 20, 2020
@gojomo
Collaborator Author

gojomo commented Oct 20, 2020

ANN introduces enough complications, in preparation/deployment/caveats, that I believe its support should be qualified as an "advanced, if needed" option, rather than a thing anyone can/should drop in "for speed". And so, anything that puts off the need for those extra steps and imprecise results for a larger group of users is potentially valuable. (#2883 has a recent example of someone who was overcomplicating things with Annoy prematurely.)

I also suspect for many users with in-RAM datasets, batching/amortizing full, precise calculations may offer a speedup that's competitive with approximate indexing, without the extra indexing costs or result imprecision, though tests could prove that wrong.
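
One way such a test could be set up is sketched below, using NumPy only. It compares exact per-query dot products against a single batched product over the same in-RAM array. The array sizes, the `float32` dtype, and all function names are assumptions chosen for illustration. It makes no comparison against any ANN index, and no timings are claimed here.

```python
import timeit
import numpy as np

def make_unit_vectors(n_rows, dim, seed=0):
    """Random unit-length float32 vectors (a stand-in for real embeddings)."""
    rng = np.random.default_rng(seed)
    vecs = rng.standard_normal((n_rows, dim)).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def one_at_a_time(vectors, queries):
    """One matrix-vector product per query: scans `vectors` once per query."""
    return np.stack([vectors @ q for q in queries])

def batched(vectors, queries):
    """A single matrix-matrix product covering all queries."""
    return queries @ vectors.T

if __name__ == "__main__":
    vectors = make_unit_vectors(100_000, 100)
    queries = vectors[:64]

    # Both paths compute the same exact similarities.
    assert np.allclose(one_at_a_time(vectors, queries),
                       batched(vectors, queries), atol=1e-4)

    for fn in (one_at_a_time, batched):
        seconds = timeit.timeit(lambda: fn(vectors, queries), number=5)
        print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Any difference will depend on the BLAS build, array shapes, and batch size, so the numbers are only meaningful for the machine and data they were measured on.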

most_similar() could definitely use some API-improvement for clarity/flexibility - but to the extent I understand the interfaces in gensim.interfaces and gensim.similarities.docsim, I find the most_similar() approach already clearer & more-flexible. For example:

  • the main similarity lookup in SimilarityABC seems to be overloaded onto __getitem__ ([]-lookup), which doesn't strike me as idiomatic/intuitive Python, or adequately explicit, and creates issues when (as has been common in the 2Vec world) it is important for the same model to also offer a true simple 1-item lookup
  • overloading []-lookup also isn't amenable to the kinds of optional parameters that can be useful for a most-similars search – like topn or normalize, which can only be set indirectly, globally, on the Similarity instance
  • looking at MatrixSimilarity.get_similarities() or SimilarityABC.__getitem__(), their major concern seems to be efficiency on sparse arrays - whereas other domains are almost exclusively dense. Mixing the codepaths could thus complicate both for no net benefit.

So: unified conventions in method-names/parameters make sense, but it may make sense for implementations to still diverge, and it may make sense for docsim classes to move more in the direction of the explicitness elsewhere.
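
The contrast drawn in the bullets above can be sketched with a toy class. Everything here is hypothetical: `KeyedDenseVectors` is not a Gensim class, and its signatures are not Gensim's. It only shows `[]` reserved for plain 1-item lookup, with the similarity search as an explicitly named method that takes `topn` and `normalize` per call.

```python
import numpy as np

class KeyedDenseVectors:
    """Hypothetical toy container, not a Gensim class."""

    def __init__(self, keys, vectors):
        self.keys = list(keys)
        self.index = {key: i for i, key in enumerate(self.keys)}
        self.vectors = np.asarray(vectors, dtype=float)

    def __getitem__(self, key):
        """Plain 1-item lookup: return the stored vector for `key`."""
        return self.vectors[self.index[key]]

    def most_similar(self, queries, topn=10, normalize=True):
        """Explicit batch search with per-call options.

        queries: one vector or a 2D array of query vectors.
        Returns one list of (key, score) pairs per query, best first.
        """
        queries = np.atleast_2d(np.asarray(queries, dtype=float))
        targets = self.vectors
        if normalize:
            # Cosine similarity; with normalize=False this is a raw dot product.
            targets = targets / np.linalg.norm(targets, axis=1, keepdims=True)
            queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        sims = queries @ targets.T  # all queries in one product
        topn = min(topn, len(self.keys))
        best = np.argsort(-sims, axis=1)[:, :topn]
        return [[(self.keys[j], float(sims[i, j])) for j in row]
                for i, row in enumerate(best)]


kv = KeyedDenseVectors(["a", "b", "c"], [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
vector_a = kv["a"]  # unambiguous single-item lookup
neighbours = kv.most_similar([kv["a"], kv["c"]], topn=2)  # two queries, one call
```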

@piskvorky
Owner

piskvorky commented Oct 21, 2020

Similarity handles both sparse and dense arrays. MatrixSimilarity is for dense arrays only. Both Similarity and MatrixSimilarity allow both single item & multiple item (batch) queries.

Re. []: "Unifying" could imply adding a named function, yes. I forget whether there's one already or not, but adding an alias is trivial.

There's no functional difference: most_similar and docsim.Similarity do exactly the same thing conceptually. docsim is more general and efficient; most_similar is more tightly tied to its specific "word embedding" use-case.
