[WIP] Add sent2vec in Gensim #1458
Conversation
gensim/models/wrappers/sent2vec.py
Outdated
    def word_vec(self, word, use_norm=False):
        """
        Accept a single word as input.
Please use Google docstring format (everywhere).
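For reference, a Google-style docstring for the method above might look like the sketch below; the lookup-table body is a hypothetical stand-in, not the wrapper's real implementation:

```python
import numpy as np

def word_vec(word, use_norm=False):
    """Return the embedding vector for a single word.

    Args:
        word (str): Input word.
        use_norm (bool): If True, return the L2-normalized vector.

    Returns:
        numpy.ndarray: The word's embedding vector.
    """
    # Hypothetical in-memory vocabulary standing in for the trained model.
    vectors = {"sentence": np.array([3.0, 4.0])}
    vec = vectors[word]
    if use_norm:
        vec = vec / np.linalg.norm(vec)
    return vec
```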
gensim/models/wrappers/sent2vec.py
Outdated
import logging
import tempfile
import os
import struct
unused import
gensim/models/wrappers/sent2vec.py
Outdated
import numpy as np
from numpy import float32 as REAL, sqrt, newaxis
from gensim import utils
from gensim.models.keyedvectors import KeyedVectors, Vocab
unused import Vocab
gensim/models/wrappers/sent2vec.py
Outdated
from six import string_types


logger = logging.getLogger(__name__)
Two blank lines before class definitions (everywhere).
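The spacing the reviewer asks for is the standard PEP 8 convention: two blank lines between module-level code and a class definition. A minimal illustration (class name is a placeholder):

```python
import logging

logger = logging.getLogger(__name__)


class Sent2Vec:
    """Placeholder class showing the two-blank-line PEP 8 spacing."""
    pass
```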
gensim/models/wrappers/sent2vec.py
Outdated
        Note that you **cannot continue training** after doing a replace. The model becomes
        effectively read-only = you can only call `most_similar`, `similarity` etc.
        """
        super(FastTextKeyedVectors, self).init_sims(replace)
undefined FastTextKeyedVectors
        cmd.append("-%s" % option)
        cmd.append(str(value))

    output = utils.check_output(args=cmd)
`output` is an unused variable.
I have a question: what's the difference between the current PR and the existing fasttext wrapper? Also, please do several things
As mentioned at https://github.com/epfml/sent2vec, the algorithm builds on FastText to create features and representations for short texts and sentences. You can say it is an extension of Word2Vec.
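Conceptually, sent2vec represents a sentence as the average of learned unigram and n-gram vectors. A minimal numpy sketch of that inference step, using made-up 2-dimensional vectors rather than a trained model:

```python
import numpy as np

# Hypothetical learned vectors for unigrams and one bigram (2-dim for brevity).
vectors = {
    "the": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "the_cat": np.array([1.0, 1.0]),  # bigram feature
}

def sentence_vector(tokens):
    # Gather unigram vectors plus vectors for adjacent-word bigrams.
    feats = [vectors[t] for t in tokens if t in vectors]
    for a, b in zip(tokens, tokens[1:]):
        key = "%s_%s" % (a, b)
        if key in vectors:
            feats.append(vectors[key])
    # The sentence embedding is the average over all collected features.
    return np.mean(feats, axis=0)

print(sentence_vector(["the", "cat"]))  # average of the three feature vectors
```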
@souravsingh But we already have Doc2Vec (aka ParagraphVectors) as an "extension of w2v for texts"; what are the advantages of this approach?
@menshikh-iv I will be conducting a benchmark between sent2vec and Doc2Vec on Wikipedia data. Maybe @martinjaggi has a better answer to your question?
@menshikh-iv @souravsingh
This algo looks great, and the results seem far superior to doc2vec. If these results can be replicated, I'm in favour of supplanting doc2vec with sent2vec in gensim. As with fastText, we ideally want a fast native implementation (not just a wrapper for C++).
I have to take the indications of results "far superior to doc2vec" with some grains of salt. For example, in the original sent2vec paper, they only train their PV-DBOW/PV-DM models on a 900-megaword Toronto Books corpus, with unclear metaparameters/metaoptimization, then reuse those models across all tasks. (And yet across all the tables, few of the models besides skip-thought are good performers when using Toronto Books as training data.) Meanwhile they evaluate sent2vec with a 2-gigaword Twitter corpus and then a 20-gigaword Wikipedia corpus, with (presumably) careful choice of their own model's parameters. How well would PV (or other algorithms!) perform given the same training data & level of meta-optimization? There isn't any evidence.

(Similarly, the SemEval results for PV are limited to "PV-DBOW that uses the model from Lau & Baldwin [2016]", a single Wikipedia-based model, which only loads in Lau's gensim fork, with unclear metaparameters/metaoptimization. That downloadable model is also suspiciously small: 1.4GB suggests less than all of Wikipedia may have been used for training.)

All that said, the innovations of sent2vec all seem useful and intuitively likely to help improve doc-vectors. As I understand the paper, the key changes seem to be: (1) n-grams are also trained; (2) the context window spans the whole sentence; (3) some words are randomly dropped out during training. These could be layered into the existing or a future unified gensim Doc2Vec model as preprocessing steps or new optional parameters. N-grams could be simulated today via preprocessing (that inserts synthesized n-grams into texts); setting a super-large window would approximate a full-document window. Drop-out might be a helpful new option even for word2vec and vanilla PV options. (There was another single-author paper, which I can't locate at the moment but was mentioned in a gensim github request, that like FastText & sent2vec was training doc-vecs as sums-of-word-vecs-with-dropout to lessen memory overhead in large corpuses.)
I have a hunch there's a meta-model out of which these are all parameterized instances. It may be useful to more precisely refer to the existing gensim implementation as "PV" so that "Doc2Vec" can be a generic umbrella name for the techniques.
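The n-gram simulation suggested above (synthesizing n-gram tokens into texts as a preprocessing step, so an ordinary word2vec/doc2vec model trains vectors for them) could be sketched as:

```python
def add_bigrams(tokens):
    # Append synthesized bigram tokens so a plain word2vec/doc2vec model
    # learns vectors for them alongside the original unigrams.
    bigrams = ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(add_bigrams(["new", "york", "city"]))
# → ['new', 'york', 'city', 'new_york', 'york_city']
```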
Agreed. A practical unbiased evaluation is a big part of the challenge here -- we definitely don't want to replace proven, optimized algorithms with a bird in the bush. Chiseling a common structure out of these related methods sounds non-trivial but potentially very beneficial (as long as these abstractions don't compromise the performance). It will also allow some sanity in maintaining all that stuff.
@gojomo Indeed our first version of the paper didn't train on Toronto Books yet, but we have fixed this. The performance is very robust over all 3 corpora (wiki, twitter, toronto), and we have published all 3 pretrained models for comparison. We are not affiliated with the SemEval 2017 organizers, so their evaluation of sent2vec is independent confirmation with zero parameter tuning, even without using an optimized tokenizer. As for PV not performing well, this seems to be a consistent picture by now, as PV is one of the standard baselines in many applications (despite the downside that inference is non-trivial, in contrast to sent2vec).
@martinjaggi But what PV metaparameters did you choose, and did you use as much effort in picking those as was used in picking the values in Table 5? I can believe your techniques have helped: ngrams & larger windows & dropout all seem like good ideas. But without more details, I can't trust the magnitudes of improvement. (And further, without other measures of overhead, for example the effect of ngram expansion and giant windows on memory and training time, it's also hard to know whether vanilla Doc2Vec might not still be preferable for some projects.) That you've done a Toronto-to-Toronto apples-to-apples (corpus) comparison helps a little, but the metaoptimization issue remains. And it seems sent2vec did really well across all evaluations when trained on the Twitter data... so why not give every other algorithm a chance to train on that data, too, for that apples-to-apples comparison?

My issue with the SemEval paper isn't one of affiliation. They downloaded one oddish pre-trained model; it's mentioned as one of the only 2 in the whole setup NOT from the algorithm's originators. The size of the file looks incomplete to me, given the description of its origin. And in other discussions, I've highlighted several areas where the Lau & Baldwin evaluation of PV seems inconsistent. (The latest SemEval paper you linked to, http://nlp.arizona.edu/SemEval-2017/pdf/SemEval001.pdf, on multilingual comparisons doesn't even seem to have sent2vec or PV-DBOW scores, so I'm not sure what its relevance is.)

While I appreciate the attempt to benchmark off-the-shelf models, trained on generic data, on a variety of other specific datasets, that's not a typical way of using PV (or many similar methods): training on exactly the text-domain in which you intend to compare, so that its specific vocabularies/meanings are learned, is more typical.
I found that other paper. It seems similar to 'Siamese CBOW', but is titled "Efficient Vector Representation for Documents Through Corruption", https://openreview.net/pdf?id=B1Igu2ogg, by Minmin Chen. (Some code, apparently based on a "-sentence-vectors" patch once released by Mikolov, is at https://github.com/mchen24/iclr2017.)
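The "documents through corruption" idea referenced above (doc-vecs as averages of word-vecs with random word dropout during training) can be sketched as follows; the vectors and dropout rate here are illustrative, not taken from the paper:

```python
import numpy as np

def doc_vector(token_vecs, dropout=0.3, rng=None):
    # Training mode (rng given): randomly drop a fraction of the words,
    # then average the survivors. Inference mode (rng=None): no corruption,
    # the document vector is the plain average of its word vectors.
    token_vecs = np.asarray(token_vecs)
    if rng is None:
        return np.mean(token_vecs, axis=0)
    keep = rng.random(len(token_vecs)) >= dropout
    if not keep.any():  # guard: if everything was dropped, keep all words
        keep[:] = True
    return np.mean(token_vecs[keep], axis=0)

vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(doc_vector(vecs))  # inference: plain average → [0.5 0.5]
```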
So do we wait until we have native FastText in Gensim before proceeding with the PR?
@gojomo Thanks a lot for the pointer! SemEval results are in their Table 14. Our reported PV results are from [1]. A very large window for CBOW is a good idea, and it's included in our code and experiments, but not enough for getting the improvements of sent2vec (hyperparams for CBOW were carefully tuned as well: dim = 600, ws = 10, ep = 5, lr = 0.07, and t = 10^-5; see section 4 of arxiv v1).

[1] Hill, Felix, Cho, Kyunghyun, and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL-HLT, February 2016.

@souravsingh Is the wrapper here now mostly compatible with the fasttext wrapper? (Both codes are extremely similar.)
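For reference, the tuned CBOW settings quoted above would map onto gensim's Word2Vec keyword arguments roughly as below. The parameter names are assumed from gensim's pre-4.0 API and this mapping is my reading of the comment, not a verified snippet from the paper's code:

```python
# Approximate mapping of the reported CBOW hyperparameters to gensim
# Word2Vec keyword arguments (pre-4.0 parameter names assumed).
cbow_params = {
    "size": 600,     # dim = 600
    "window": 10,    # ws = 10
    "iter": 5,       # ep = 5 (training epochs)
    "alpha": 0.07,   # lr = 0.07 (initial learning rate)
    "sample": 1e-5,  # t = 10^-5 (frequent-word subsampling threshold)
    "sg": 0,         # CBOW mode, not skip-gram
}
```

Usage would then be something like `Word2Vec(sentences, **cbow_params)`.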
@martinjaggi Aha, I'd overlooked SemEval's Table 14 as an "en-en" comparison. Still, it's evaluating a single underdocumented/unoptimized/suspiciously small downloaded PV-DBOW model not from the originators (or even heavy users) of the method. It also seems like participants were generally encouraged to use the STS-specific training data to prepare their models, but there's no evidence the PV-DBOW model used anything but generic Wikipedia article texts. So it looks to me like an unfair and unreliable evaluation.

Looking at Hill, Cho & Korhonen's "Learning Distributed Representations of Sentences from Unlabelled Data", it also has serious errors in evaluation. They only used 100 dimensions for their PV tests (very constrained, especially for modeling 70 million sentences), while other compared models were allowed 500-4800 dimensions. There's no mention of searching for optimal parameters other than dimension-size. But most seriously (fatally, in my opinion), they only used one epoch over the 70M sentences. I'm frankly surprised the model did anything at all with that little training. PV papers use 10-20 epochs or more.

I have further reservations about any "Toronto Books"-based evaluations: that corpus is ordered sentences from about 7,000 mostly-fiction books by unpublished authors, with the largest single category mentioned being "Romance", with 2,865 amateur romance novels included. Using this data for semantic-similarity testing among other news/non-fiction sentences seems really fishy to me.

So, sent2vec with 600+ dimensions, 3-13 epochs, other creator-tuned metaparameters, and (in some cases) much more appropriate training data is compared against PV with just 100 dimensions, 1 epoch, no apparent other tuning, and training only on a dataset of amateur fiction. That's hardly a fair comparison. Given these problems, I'm unable to draw any conclusions about PV's relative performance based on your referenced works.
It's always best to just do some benchmarking. Luckily you have all the algorithms in gensim already, so a simple benchmark should be easy to set up, as @souravsingh has suggested. Here is a convenient one, for example: https://github.com/facebookresearch/SentEval (it doesn't have STS 2017 yet, but most from previous years).
We can wait for #1482 to finish before proceeding with this PR.
@souravsingh #1482 was never finished; the current fasttext PR is #1525.
I am waiting on the FastText model at #1525 to be merged (which should be soon). Once that is done, we can inherit the class from FastText and make some fixes.
Please add tests for this wrapper.
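Such a test could follow the shape below. The `FakeSent2Vec` stand-in (and its `sent_vec` method) is hypothetical, used only so the skeleton is runnable here; a real test would load the actual wrapper and a small fixture model, but would make the same structural checks (vector shape, determinism):

```python
import unittest
import numpy as np

class FakeSent2Vec:
    """Tiny hypothetical stand-in for the sent2vec wrapper."""
    def __init__(self, dim=4):
        self.dim = dim

    def sent_vec(self, sentence):
        # Deterministic hash-seeded vectors instead of a trained model.
        rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
        return rng.standard_normal(self.dim)

class TestSent2VecWrapper(unittest.TestCase):
    def setUp(self):
        self.model = FakeSent2Vec(dim=4)

    def test_vector_shape(self):
        # Sentence vectors should have the configured dimensionality.
        vec = self.model.sent_vec("the quick brown fox")
        self.assertEqual(vec.shape, (4,))

    def test_deterministic(self):
        # The same sentence should always map to the same vector.
        s = "the quick brown fox"
        np.testing.assert_allclose(self.model.sent_vec(s), self.model.sent_vec(s))

suite = unittest.TestLoader().loadTestsFromTestCase(TestSent2VecWrapper)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```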
import logging

import numpy as np
from numpy import zeros, ones, vstack, sum as np_sum, empty, float32 as REAL
A lot of unused imports, looks like copy-paste
import numpy as np
from numpy import zeros, ones, vstack, sum as np_sum, empty, float32 as REAL

from gensim.models.word2vec import Word2Vec, train_sg_pair, train_cbow_pair
unused imports too
    """

    def initialize_word_vectors(self):
        self.wv = Sent2VecKeyedVectors()
What is `Sent2VecKeyedVectors`? I don't see this class anywhere.
What's the status here, @souravsingh?
I will revisit the issue later once I have a concrete idea on the model. Closing the issue for now.
Adds sent2vec algorithm as a wrapper.
Fixes #1376