Doc2vec to wikipedia #654
Thanks, I've hoped for a notebook like this for the project for a while! I doubt the inclusion or exclusion of the "List of…" etc articles will make a big difference either way, and as far as I can tell, the 'Document Embeddings with Paragraph Vectors' paper didn't mention such article filtering. So I'd keep things simple, and maybe test different article subsets later. Parameter thoughts:
Regarding results:
Hope this helps!
Regarding computation resources, what exactly do you need, @isohyt? We could provide access to our dev servers, if that helps (and if @tmylk greenlights the need).
@gojomo, thank you for your helpful and insightful comments; I will try everything you proposed. @piskvorky, I'm very happy about your offer. I'd like to run an IPython notebook on your dev servers. However, I'm a little worried, because this is the first time I'll be running a program on a remote server...
We can do that, using SSH port tunnelling (@tmylk can help you set this up).
@isohyt – the original paper didn't report DM results for comparison, so I wouldn't say that's strictly necessary for showing how to reproduce the paper's experiment. But it would be interesting, along with how other parameter variations ( |
Sorry for the late update to this tutorial.
FYI, a beginner doc2vec tutorial is at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I've finished training two d2v models, DBOW and DM, on Wikipedia.
This is great stuff! I notice in your notebook, your DM model keeps the default number of iterations (5), while the DBOW uses a full 10 like the paper. Also, the If possible, I'd suggest instead iteratively discovering the |
Thanks, @gojomo. In the meantime, I will check for the optimal min_count. If you want to keep the preprocessing code in the ipynb, I'd like to rewrite it more clearly.
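A minimal sketch of one way to check `min_count` offline, rather than watching log output. It assumes the goal is to keep the surviving vocabulary at or below some target size; that target, the helper names, and the toy corpus are all hypothetical and not taken from the notebook. It uses only the standard library, and the resulting value would then be passed as gensim's `min_count` parameter.

```python
from collections import Counter


def vocab_size_for_min_count(word_counts, min_count):
    """Number of distinct words that occur at least `min_count` times."""
    return sum(1 for count in word_counts.values() if count >= min_count)


def smallest_min_count_for_vocab(word_counts, max_vocab_size):
    """Smallest min_count whose surviving vocabulary fits in max_vocab_size.

    Hypothetical helper: assumes the tuning goal is a vocabulary-size cap.
    """
    if max_vocab_size < 0:
        raise ValueError("max_vocab_size must be non-negative")
    min_count = 1
    # Raising min_count can only shrink the vocabulary, so this loop ends
    # once the threshold exceeds the most frequent word's count at the latest.
    while vocab_size_for_min_count(word_counts, min_count) > max_vocab_size:
        min_count += 1
    return min_count


# Toy corpus standing in for tokenized Wikipedia articles (assumption).
toy_documents = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a"],
]
toy_counts = Counter(token for doc in toy_documents for token in doc)
chosen_min_count = smallest_min_count_for_vocab(toy_counts, max_vocab_size=2)
```

On a full Wikipedia dump the counting pass is the expensive part, so it may be worth computing the counts once and reusing them while trying different targets.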
In the past, I've watched the log output and adjusted interactively, though I see why that's not ideal for robust code or demo notebooks. It looks like |
Thank you guys especially @isohyt for this thread.
I think you already know this, but the implementation can be found here: https://github.com/isohyt/gensim/blob/cb22f47f371457061b98f9390042f12b108587cf/docs/notebooks/doc2vec-wikipedia.ipynb I would really appreciate any help ;)
@rtanzifi The string "model.build_vocab(documents)" does not appear in the notebook. If you're getting such an error in your modified code, you've somehow made |
Sorry, I forgot to update this tutorial.
Please make it one file, add a note to the |
@tmylk It's ready to be merged :)
@@ -1,9 +1,23 @@
Changes
=======

* Add doc2vec tutorial using wikipedia dump. (@isohyt, #654)
Please fix merge errors. It should be just 1 line added in Changelog.
Please merge in develop to resolve the merge conflicts.
Related to Issue #629.
I ran an experiment similar to the one in Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998) and wrote documentation for it.
However, I ran into some problems. Could you help me with the following?
Problems
Todo
[ ] Evaluate Doc2Vec using triplet datasets.
Questions
Please feel free to comment if you have any ideas other than the above.
Thanks.
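For the triplet evaluation listed in the Todo, here is a rough sketch of how the scoring could work. It assumes each triplet names two related articles and one unrelated article, and counts a triplet as correct when the first article is closer (by cosine similarity) to the related one. The function names, the toy vectors, and the choice to skip triplets with missing articles are assumptions for illustration; in practice the vectors would come from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_accuracy(vectors, triplets):
    """Fraction of scored triplets where sim(anchor, similar) > sim(anchor, dissimilar).

    Triplets that mention an article without a vector are skipped (assumption).
    """
    correct = 0
    scored = 0
    for anchor, similar, dissimilar in triplets:
        if not all(key in vectors for key in (anchor, similar, dissimilar)):
            continue
        scored += 1
        sim_pos = cosine(vectors[anchor], vectors[similar])
        sim_neg = cosine(vectors[anchor], vectors[dissimilar])
        if sim_pos > sim_neg:
            correct += 1
    return correct / scored if scored else 0.0


# Toy document vectors standing in for trained Doc2Vec vectors (hypothetical).
toy_vectors = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
}
toy_triplets = [
    ("A", "B", "C"),  # A is closer to B than to C: counted as correct
    ("A", "C", "B"),  # deliberately wrong ordering: counted as incorrect
    ("A", "B", "Z"),  # "Z" has no vector: skipped
]
toy_accuracy = triplet_accuracy(toy_vectors, toy_triplets)
```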