doc2vec most_similar using clip_start: result indexes are not from clip_start #601

Closed
voloviky opened this issue Feb 3, 2016 · 1 comment
Labels: bug, difficulty easy

voloviky commented Feb 3, 2016

gensim 0.12.1, doc2vec.
When clip_start and clip_end are passed to the most_similar function, the returned keys are counted from the start of the dataset rather than from clip_start.
To reproduce, use:
clip_start=5, clip_end=10, topn=5.
The results should be items with keys 5, 6, 7, 8, 9.
Instead, it returns items with keys 0, 1, 2, 3, 4.

Look at this part of code in doc2vec.py most_similar function:

dists = dot(self.doctag_syn0norm[clip_start:clip_end], mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_docs), reverse=True)
# ignore (don't return) docs from the input
result = [(self._key_index(sim), float(dists[sim])) for sim in best if sim not in all_docs]
return result[:topn]

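The problem is that self._key_index(sim) is called with sim, which is an element of best, that is, a position within the clipped slice dists, not an index into the full doctag_syn0norm array. best does not take clip_start into account.
Changing the result line as follows solves the issue: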
result = [(self._key_index(sim + clip_start), float(dists[sim])) for sim in best if sim not in all_docs]
Please review
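
The offset problem can be reproduced without gensim. The following is a minimal sketch using only NumPy; `top_indexes`, `apply_offset`, and the random `vectors` array are hypothetical names invented for this illustration and are not part of gensim's API. It assumes that the similarity step is a plain dot product over a slice, as in the snippet quoted above.

```python
import numpy as np

def top_indexes(vectors, query, topn, clip_start=0, clip_end=None, apply_offset=True):
    """Return indexes into the full `vectors` array of the rows most similar
    to `query`, considering only rows clip_start:clip_end."""
    dists = vectors[clip_start:clip_end] @ query
    # argsort yields positions relative to the slice: 0 .. len(dists) - 1
    best = np.argsort(-dists)[:topn]
    # Without the offset, slice positions are mistaken for full-array indexes.
    offset = clip_start if apply_offset else 0
    return [int(i) + offset for i in best]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-length rows
query = vectors[7]  # row 7 lies inside the clipped range 5..9

buggy = top_indexes(vectors, query, topn=5, clip_start=5, clip_end=10, apply_offset=False)
fixed = top_indexes(vectors, query, topn=5, clip_start=5, clip_end=10, apply_offset=True)

print(sorted(buggy))  # [0, 1, 2, 3, 4]
print(sorted(fixed))  # [5, 6, 7, 8, 9]
```

With the offset applied, the best match is row 7 itself, as expected. Judging only from the quoted snippet, the `sim not in all_docs` filter appears to compare a slice-relative position as well, so it may need the same `+ clip_start` adjustment; this is an inference, not something confirmed in this thread.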

@ajaanbaahu

I have found the same issue and agree with @voloviky. Any idea when this can be fixed?

tmylk added the bug and difficulty easy labels Oct 6, 2016
tmylk closed this as completed in a0443e4 Nov 9, 2016