-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support pretrained word2vec model when train doc2vec #2703
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution. Left you some comments.
Please also add unit tests for your new functionality.
Let me know when you're ready for another review.
@@ -1,177 +1,61 @@ | |||
gensim – Topic Modelling in Python | |||
doc2vec in gensim – support pretrained word2vec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of replacing the top-level README.md file, you should put this documentation somewhere else. Ideally, it should be in a tutorial or a howto.
See https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.html#sphx-glr-auto-examples-howtos-run-doc-py for more info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my first time to do PR on github, I already followed your advice on word2vec.py and doc2vec.py, I also know that I should not change the top README.md file, but I really don't how and where to write the document, thank you for more advice!
[OpenBLAS]: http://xianyi.github.io/OpenBLAS/ | ||
[source tar.gz]: http://pypi.python.org/pypi/gensim | ||
[documentation]: http://radimrehurek.com/gensim/install.html | ||
This is a forked gensim version, which edits the default doc2vec model to support pretrained word2vec during training doc2vec. It forked from gensim 3.8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write your documentation so that it's useful from the point of view of the reader.
"This is a forked gensim version" is not relevant to the user. Furthermore, it becomes misleading the moment we actually merge this PR.
[documentation]: http://radimrehurek.com/gensim/install.html | ||
This is a forked gensim version, which edits the default doc2vec model to support pretrained word2vec during training doc2vec. It forked from gensim 3.8. | ||
|
||
The default doc2vec model in gensim does't support pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation(2016)](https://arxiv.org/abs/1607.05368), using pretrained word2vec model usually gets better results in NLP tasks. The author also released a [forked gensim verstion](https://github.com/jhlau/gensim) to perform pretrained embeddings, but it is from a very old gensim version, which can't be used in gensim 3.8(the latest gensim version when I release this fork). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also irrelevant. This kind of information is good inside the PR, as motivation and background (it may already be there).
> pretrained_emb = "word2vec_pretrained.txt" # This is a pretrained word2vec model of C text format | ||
> | ||
> model = gensim.models.doc2vec.Doc2Vec( | ||
corpus_train, # This is the documents corpus to be trained which should meet gensim's format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent please.
I don't think an added I also think people who've read the Lau & Baldwin paper will have unrealistic expectations about the benefits of this approach, because the paper is a bit confused & contradictory in its analysis. (My comments on a prior issue outline some of the reasons.) Unfortunately, the deeply misguided #1777 refactoring made experiments in this direction more difficult, by breaking two previous ways users could hook-into and modify a model's vocabulary-initialization. (The I also hope that a side-effect of re-factoring (& #1777-rollback) work I'm exploring in #2698 will eventually be to serve these kinds of needs in a more flexible & logical way. So, I'd not want this integrated until that work is further along. (Some of the classes this PR modifies, like the '_Trainables', should be going away.) |
OK, so it sounds like the best action at the moment is to close this PR and wait for a more appropriate time to work on it. @gojomo Do you agree? |
Ok, you can close this PR.
At 2019-12-29 21:39:17, "Michael Penkov" <notifications@github.com> wrote:
OK, so it sounds like the best action at the moment is to close this PR and wait for a more appropriate time to work on it. @gojomo Do you agree?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
The default doc2vec model in gensim does't support pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation(2016), using pretrained word2vec model usually gets better results in NLP tasks. The author also released a forked gensim verstion to perform pretrained embeddings, but it is from a very old gensim version, which can't be used in gensim 3.8.
Now I edited two files to support pretrained word embeddings for doc2vec.