I am working with a corpus of very short documents and noticed that the inferred vectors for the same document were very different.
from scipy.spatial.distance import pdist, squareform

testdoc = "This is a small sample document."
vectors = [d2vmod.infer_vector(testdoc) for _ in range(5)]
squareform(pdist(vectors, "cosine"))
More training steps make things worse in this case:
Note: This is more extreme than what I'm seeing with more domain-specific sample documents, where the inferred vectors start to become more consistent after about 5000 steps.
I believe this is happening because the learning rate decays extremely rapidly:
https://github.com/RaRe-Technologies/gensim/blob/8b810918d59781116794a6679999afdc76b857ef/gensim/models/doc2vec.py#L565
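To make the effect concrete, here is a minimal sketch that compares two learning-rate schedules. It does not call gensim. The update in `rapid_decay` is an assumption about the form of the linked line (the remaining gap to `min_alpha` divided by the number of steps left), and the function names and the starting values `alpha=0.1`, `min_alpha=0.0001` are chosen only for illustration.

```python
def rapid_decay(alpha, min_alpha, steps):
    """Return the alpha used at each step under the assumed update.

    Assumption: each step divides the remaining gap to min_alpha by the
    number of steps left. This may not match the linked gensim line exactly.
    """
    used = []
    for i in range(steps):
        used.append(alpha)
        alpha = ((alpha - min_alpha) / (steps - i)) + min_alpha
    return used


def linear_decay(alpha, min_alpha, steps):
    """Return the alpha used at each step under a plain linear schedule."""
    used = []
    # Equal decrements so that the last step uses min_alpha.
    decrement = (alpha - min_alpha) / max(steps - 1, 1)
    for _ in range(steps):
        used.append(alpha)
        alpha -= decrement
    return used


if __name__ == "__main__":
    for steps in (20, 100, 1000):
        rapid = rapid_decay(0.1, 0.0001, steps)
        linear = linear_decay(0.1, 0.0001, steps)
        # Compare the learning rate used at the second step.
        print(steps, rapid[1], linear[1])
```

Under that assumed update, the second step already runs at a small fraction of the starting rate, and the drop gets steeper as `steps` grows, whereas the linear schedule spreads the decay evenly over all steps.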
Notice that `alpha` is very close to `min_alpha` after the first step, and this is exaggerated even more when the number of steps is larger.
When I change Doc2Vec to have a linear decay in learning rate
I get much better results. With 20 steps, we get pairwise cosine distances of
, with 100 we get
, and with 1000 steps: