[WIP] Add sent2vec in Gensim #1458
Conversation
gensim/models/wrappers/sent2vec.py
Outdated
    def word_vec(self, word, use_norm=False):
        """
        Accept a single word as input.
Please use Google docstring format (everywhere).
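For reference, a Google-style docstring for the method above might look like the sketch below; the lookup-table body is a hypothetical stand-in, not the wrapper's real implementation:

```python
import numpy as np

def word_vec(word, use_norm=False):
    """Return the embedding vector for a single word.

    Args:
        word (str): Input word.
        use_norm (bool): If True, return the L2-normalized vector.

    Returns:
        numpy.ndarray: The word's embedding vector.
    """
    # Hypothetical in-memory vocabulary standing in for the trained model.
    vectors = {"sentence": np.array([3.0, 4.0])}
    vec = vectors[word]
    if use_norm:
        vec = vec / np.linalg.norm(vec)
    return vec
```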
gensim/models/wrappers/sent2vec.py
Outdated
import logging
import tempfile
import os
import struct
unused import
gensim/models/wrappers/sent2vec.py
Outdated
import numpy as np
from numpy import float32 as REAL, sqrt, newaxis
from gensim import utils
from gensim.models.keyedvectors import KeyedVectors, Vocab
unused import Vocab
gensim/models/wrappers/sent2vec.py
Outdated
from six import string_types


logger = logging.getLogger(__name__)
Two blank lines before class definitions (everywhere).
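The spacing the reviewer asks for is the standard PEP 8 convention: two blank lines between module-level code and a class definition. A minimal illustration (class name is a placeholder):

```python
import logging

logger = logging.getLogger(__name__)


class Sent2Vec:
    """Placeholder class showing the two-blank-line PEP 8 spacing."""
    pass
```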
gensim/models/wrappers/sent2vec.py
Outdated
        Note that you **cannot continue training** after doing a replace. The model becomes
        effectively read-only = you can only call `most_similar`, `similarity` etc.
        """
        super(FastTextKeyedVectors, self).init_sims(replace)
undefined FastTextKeyedVectors
        cmd.append("-%s" % option)
        cmd.append(str(value))

    output = utils.check_output(args=cmd)
`output` is an unused variable.
I have a question: what's the difference between the current PR and the existing fasttext wrapper? Also, please do several things
As mentioned at https://github.com/epfml/sent2vec, the algorithm builds on FastText to create features and representations for short texts and sentences. You can say it is an extension of Word2Vec.
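Conceptually, sent2vec represents a sentence as the average of learned unigram and n-gram vectors. A minimal numpy sketch of that inference step, using made-up 2-dimensional vectors rather than a trained model:

```python
import numpy as np

# Hypothetical learned vectors for unigrams and one bigram (2-dim for brevity).
vectors = {
    "the": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "the_cat": np.array([1.0, 1.0]),  # bigram feature
}

def sentence_vector(tokens):
    # Gather unigram vectors plus vectors for adjacent-word bigrams.
    feats = [vectors[t] for t in tokens if t in vectors]
    for a, b in zip(tokens, tokens[1:]):
        key = "%s_%s" % (a, b)
        if key in vectors:
            feats.append(vectors[key])
    # The sentence embedding is the average over all collected features.
    return np.mean(feats, axis=0)

print(sentence_vector(["the", "cat"]))  # average of the three feature vectors
```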
@souravsingh But we already have Doc2Vec (aka ParagraphVectors) as an "extension of w2v for texts"; what are the advantages of this approach?
@menshikh-iv I will be conducting a benchmark between sent2vec and Doc2Vec on Wikipedia data. Maybe @martinjaggi has a better answer to your question?
@menshikh-iv @souravsingh
This algo looks great, and the results seem far superior to doc2vec. If these results can be replicated, I'm in favour of supplanting doc2vec with sent2vec in gensim. As with fastText, we ideally want a fast native implementation (not just a wrapper for C++).
I have to take the indications of results "far superior to doc2vec" with some grains of salt. For example, in the original sent2vec paper, they only train their PV-DBOW/PV-DM models on a 900-megaword Toronto Books corpus, with unclear metaparameters/metaoptimization, then reuse those models across all tasks. (And yet across all the tables, few of the models besides skip-thought are good performers when using Toronto Books as training data.) Meanwhile they evaluate sent2vec with a 2-gigaword Twitter corpus and then a 20-gigaword Wikipedia corpus, with (presumably) careful choice of their own model's parameters. How well would PV (or other algorithms!) perform given the same training data & level of meta-optimization? There isn't any evidence.

(Similarly, the SemEval results for PV are limited to "PV-DBOW that uses the model from Lau & Baldwin [2016]", a single Wikipedia-based model, which only loads in Lau's gensim fork, with unclear metaparameters/metaoptimization. That downloadable model is also suspiciously small: 1.4GB suggests less than all of Wikipedia may have been used for training.)

All that said, the innovations of sent2vec all seem useful and intuitively likely to help improve doc-vectors. As I understand the paper, the key changes seem to be: (1) n-grams are also trained; (2) the context window spans the whole sentence; (3) some words are randomly dropped out during training. These could be layered into the existing or a future unified gensim Doc2Vec model as preprocessing steps or new optional parameters. N-grams could be simulated today via preprocessing (that inserts synthesized n-grams into texts); setting a super-large window would approximate a full-document window. Drop-out might be a helpful new option even for word2vec and vanilla PV options. (There was another single-author paper, which I can't locate at the moment but was mentioned in a gensim github request, that like FastText & sent2vec was training doc-vecs as sums-of-word-vecs-with-dropout to lessen memory overhead in large corpuses.)
I have a hunch there's a meta-model out of which these are all parameterized instances. It may be useful to more precisely refer to the existing gensim implementation as "PV" so that "Doc2Vec" can be a generic umbrella name for the techniques.
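The n-gram simulation suggested above (synthesizing n-gram tokens into texts as a preprocessing step, so an ordinary word2vec/doc2vec model trains vectors for them) could be sketched as:

```python
def add_bigrams(tokens):
    # Append synthesized bigram tokens so a plain word2vec/doc2vec model
    # learns vectors for them alongside the original unigrams.
    bigrams = ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(add_bigrams(["new", "york", "city"]))
# → ['new', 'york', 'city', 'new_york', 'york_city']
```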
Agreed. A practical unbiased evaluation is a big part of the challenge here -- we definitely don't want to replace proven, optimized algorithms with a bird in the bush. Chiseling a common structure out of these related methods sounds non-trivial but potentially very beneficial (as long as these abstractions don't compromise the performance). It will also allow some sanity in maintaining all that stuff.
@gojomo Indeed our first version of the paper didn't train on Toronto Books yet, but we have fixed this. The performance is very robust over all 3 corpora (wiki, twitter, toronto), and we have published all 3 pretrained models for comparison. We are not affiliated with the SemEval 2017 organizers, so their evaluation of sent2vec is independent confirmation with zero parameter tuning, even without using an optimized tokenizer. As for PV not performing well, this seems to be a consistent picture by now, as PV is one of the standard baselines in many applications (despite the downside that inference is non-trivial, in contrast to sent2vec).
@martinjaggi But what PV metaparameters did you choose, and did you use as much effort in picking those as was used in picking the values in Table 5? I can believe your techniques have helped: ngrams & larger windows & dropout all seem like good ideas. But without more details, I can't trust the magnitudes of improvement. (And further, without other measures of overhead, for example the effect of ngram expansion and giant windows on memory and training time, it's also hard to know whether vanilla Doc2Vec might not still be preferable for some projects.) That you've done a Toronto-to-Toronto apples-to-apples (corpus) comparison helps a little, but the metaoptimization issue remains. And it seems sent2vec did really well across all evaluations when trained on the Twitter data... so why not give every other algorithm a chance to train on that data, too, for that apples-to-apples comparison?

My issue with the SemEval paper isn't one of affiliation. They downloaded one oddish pre-trained model; it's mentioned as one of the only 2 in the whole setup NOT from the algorithm's originators. The size of the file looks incomplete to me, given the description of its origin. And in other discussions, I've highlighted several areas where the Lau & Baldwin evaluation of PV seems inconsistent. (The latest SemEval paper you linked to, http://nlp.arizona.edu/SemEval-2017/pdf/SemEval001.pdf, on multilingual comparisons doesn't even seem to have sent2vec or PV-DBOW scores, so I'm not sure what its relevance is.)

While I appreciate the attempt to benchmark off-the-shelf models, trained on generic data, on a variety of other specific datasets, that's not a typical way of using PV (or many similar methods): training on exactly the text-domain in which you intend to compare, so that its specific vocabularies/meanings are learned, is more typical.
I found that other paper. It seems similar to 'Siamese CBOW', but is titled "Efficient Vector Representation for Documents Through Corruption", https://openreview.net/pdf?id=B1Igu2ogg, by Minmin Chen. (Some code, apparently based on a "-sentence-vectors" patch once released by Mikolov, is at https://github.com/mchen24/iclr2017.)
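The "documents through corruption" idea referenced above (doc-vecs as averages of word-vecs with random word dropout during training) can be sketched as follows; the vectors and dropout rate here are illustrative, not taken from the paper:

```python
import numpy as np

def doc_vector(token_vecs, dropout=0.3, rng=None):
    # Training mode (rng given): randomly drop a fraction of the words,
    # then average the survivors. Inference mode (rng=None): no corruption,
    # the document vector is the plain average of its word vectors.
    token_vecs = np.asarray(token_vecs)
    if rng is None:
        return np.mean(token_vecs, axis=0)
    keep = rng.random(len(token_vecs)) >= dropout
    if not keep.any():  # guard: if everything was dropped, keep all words
        keep[:] = True
    return np.mean(token_vecs[keep], axis=0)

vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(doc_vector(vecs))  # inference: plain average → [0.5 0.5]
```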
So do we wait until we have native FastText in Gensim before proceeding with the PR?
@gojomo Thanks a lot for the pointer! SemEval results are in their Table 14. Our reported PV results are from [1]. A very large window for CBOW is a good idea, and it's included in our code and experiments, but not enough for getting the improvements of sent2vec (hyperparams for CBOW were carefully tuned as well: dim = 600, ws = 10, ep = 5, lr = 0.07, and t = 10^-5; see section 4 of arxiv v1).

[1] Hill, Felix, Cho, Kyunghyun, and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL-HLT, February 2016.

@souravsingh Is the wrapper here now mostly compatible with the fasttext wrapper? (Both codes are extremely similar.)
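For reference, the tuned CBOW settings quoted above would map onto gensim's Word2Vec keyword arguments roughly as below. The parameter names are assumed from gensim's pre-4.0 API and this mapping is my reading of the comment, not a verified snippet from the paper's code:

```python
# Approximate mapping of the reported CBOW hyperparameters to gensim
# Word2Vec keyword arguments (pre-4.0 parameter names assumed).
cbow_params = {
    "size": 600,     # dim = 600
    "window": 10,    # ws = 10
    "iter": 5,       # ep = 5 (training epochs)
    "alpha": 0.07,   # lr = 0.07 (initial learning rate)
    "sample": 1e-5,  # t = 10^-5 (frequent-word subsampling threshold)
    "sg": 0,         # CBOW mode, not skip-gram
}
```

Usage would then be something like `Word2Vec(sentences, **cbow_params)`.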
@martinjaggi Aha, I'd overlooked SemEval's Table 14 as an "en-en" comparison. Still, it's evaluating a single underdocumented/unoptimized/suspiciously small downloaded PV-DBOW model not from the originators (or even heavy users) of the method. It also seems like participants were generally encouraged to use the STS-specific training data to prepare their models, but there's no evidence the PV-DBOW model used anything but generic Wikipedia article texts. So it looks to me like an unfair and unreliable evaluation.

Looking at Hill, Cho & Korhonen's "Learning Distributed Representations of Sentences from Unlabelled Data", it also has serious errors in evaluation. They only used 100 dimensions for their PV tests (very constrained, especially for modeling 70 million sentences), while other compared models were allowed 500-4800 dimensions. There's no mention of searching for optimal parameters other than dimension-size. But most seriously (fatally, in my opinion), they only used one epoch over the 70M sentences. I'm frankly surprised the model did anything at all with that little training. PV papers use 10-20 epochs or more.

I have further reservations about any "Toronto Books"-based evaluations: that corpus is ordered sentences from about 7,000 mostly-fiction books by unpublished authors, with the largest single category mentioned being "Romance", with 2,865 amateur romance novels included. Using this data for semantic-similarity testing among other news/non-fiction sentences seems really fishy to me.

So, sent2vec with 600+ dimensions, 3-13 epochs, other creator-tuned metaparameters, and (in some cases) much more appropriate training data is compared against PV with just 100 dimensions, 1 epoch, no apparent other tuning, and training only on a dataset of amateur fiction. That's hardly a fair comparison. Given these problems, I'm unable to draw any conclusions about PV's relative performance based on your referenced works.
It's always best to just do some benchmarking. Luckily you have all the algorithms in gensim already, so a simple benchmark should be easy to set up, as @souravsingh has suggested. Here is a convenient one, for example: https://github.com/facebookresearch/SentEval (it doesn't have STS 2017 yet, but most from previous years).
We can wait for #1482 to finish before proceeding with this PR.
@souravsingh #1482 was never finished; the current fasttext PR is #1525.
I am waiting on the FastText model at #1525 to be merged (which should be soon). Once that is done, we can inherit the class from FastText and make some fixes.
Please add tests for this wrapper.
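Such a test could follow the shape below. The `FakeSent2Vec` stand-in (and its `sent_vec` method) is hypothetical, used only so the skeleton is runnable here; a real test would load the actual wrapper and a small fixture model, but would make the same structural checks (vector shape, determinism):

```python
import unittest
import numpy as np

class FakeSent2Vec:
    """Tiny hypothetical stand-in for the sent2vec wrapper."""
    def __init__(self, dim=4):
        self.dim = dim

    def sent_vec(self, sentence):
        # Deterministic hash-seeded vectors instead of a trained model.
        rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
        return rng.standard_normal(self.dim)

class TestSent2VecWrapper(unittest.TestCase):
    def setUp(self):
        self.model = FakeSent2Vec(dim=4)

    def test_vector_shape(self):
        # Sentence vectors should have the configured dimensionality.
        vec = self.model.sent_vec("the quick brown fox")
        self.assertEqual(vec.shape, (4,))

    def test_deterministic(self):
        # The same sentence should always map to the same vector.
        s = "the quick brown fox"
        np.testing.assert_allclose(self.model.sent_vec(s), self.model.sent_vec(s))

suite = unittest.TestLoader().loadTestsFromTestCase(TestSent2VecWrapper)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```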
import logging

import numpy as np
from numpy import zeros, ones, vstack, sum as np_sum, empty, float32 as REAL
A lot of unused imports, looks like copy-paste
import numpy as np
from numpy import zeros, ones, vstack, sum as np_sum, empty, float32 as REAL

from gensim.models.word2vec import Word2Vec, train_sg_pair, train_cbow_pair
unused imports too
    """

    def initialize_word_vectors(self):
        self.wv = Sent2VecKeyedVectors()
What is `Sent2VecKeyedVectors`? I don't see this class anywhere.
What's the status here, @souravsingh?
I will revisit the issue later once I have a concrete idea on the model. Closing the issue for now.
Adds sent2vec algorithm as a wrapper.
Fixes #1376