Case insensitive option for word2vec accuracy #804
Conversation
@@ -1610,13 +1612,16 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, u
                    logger.debug("skipping line #%i with OOV words: %s" % (line_no, line.strip()))
                    continue

                original_vocab = self.vocab
                self.vocab = ok_vocab
                ignore = set(self.vocab[v].index for v in [a, b, c])  # indexes of words to ignore
                predicted = None
                # find the most likely prediction, ignoring OOV words and input words
                sims = most_similar(self, positive=[b, c], negative=[a], topn=False, restrict_vocab=restrict_vocab)
The behaviour of `restrict_vocab` in `most_similar` is a little different from its behaviour in `accuracy`. In `most_similar`, simply the first `restrict_vocab` vectors are taken for calculating similarity, whereas in `accuracy`, `ok_vocab` is created by sorting `vocab` by frequency and taking the first `restrict_vocab` words.

This could cause some unintended behaviour, especially with low values for `restrict_vocab`, since it is passed to `most_similar` from `accuracy`. The two sets of vocab used in the two methods could be significantly different, and possibly even completely disjoint.

Setting `self.vocab = ok_vocab` before the `most_similar` call should handle this, I think?
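For reference, a paraphrased sketch of the two interpretations (attribute names like `syn0norm` and `vocab` follow the old `Word2Vec` API, and `model`/`mean` are assumed to exist; treat this as illustrative, not the exact library code):

```python
import numpy as np

# 'model' is a trained Word2Vec; 'mean' is the unit-length query vector
# built from the positive/negative words (details elided).
restrict_vocab = 30000

# most_similar(): restrict_vocab simply slices the raw vector matrix,
# trusting that the first N rows are the N words of interest.
dists = np.dot(model.syn0norm[:restrict_vocab], mean)

# accuracy(): restrict_vocab selects the N most *frequent* words,
# regardless of where their vectors sit in the matrix.
ok_vocab = dict(
    sorted(model.vocab.items(), key=lambda item: -item[1].count)[:restrict_vocab])

# If the vectors aren't frequency-sorted, these two "top N" sets can
# differ arbitrarily -- in the worst case they are disjoint.
```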
Yes, I agree it's currently a subtle bug that this call to `most_similar()` passes the `restrict_vocab` argument, since it may not mean the exact same thing. (It's not always the case that the word-vectors have already been frequency-sorted.) It's only truly correct to pass the count in if the interpretation is the same.

But it's likely a big speedup, in cases where the vocabulary is large compared to `restrict_vocab`, as it prevents lots of unnecessary distance-calculations in the `dot(...)` line of `most_similar()`. (Executing `most_similar()` with a filtered `vocab` doesn't provide the same optimization.)

Perhaps we just want to document that for `restrict_vocab` to work as expected here, the vocabulary must be sorted? (Or even redefine the argument here to mean limiting searches to the first indexes, whether they're the most-frequent or not. So no extra sort-by-frequency here.)
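A rough illustration of why passing the count through is the faster option (again assuming the old `syn0norm` attribute and a precomputed query vector `mean`; a sketch, not the library's exact code):

```python
import numpy as np

# With restrict_vocab passed through, the bulk similarity computation
# touches only the first N rows of the (unit-normalized) vector matrix:
dists = np.dot(model.syn0norm[:restrict_vocab], mean)   # ~N * dim multiplications

# With only self.vocab filtered, the dot product still runs over every
# row -- the filter merely discards candidates afterwards:
dists_all = np.dot(model.syn0norm, mean)                # ~V * dim, with V >> N
```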
Hmm, yes, that's a good catch; I hadn't noticed that a filtered `vocab` doesn't prevent all those vector multiplications.

I think the latter makes the most sense, especially seeing that `sort_vocab` is true by default in the word2vec initialization. It doesn't seem sensible to sort the entire vocab just for temporary use.
@@ -1576,15 +1576,17 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, u
        Use `restrict_vocab` to ignore all questions containing a word whose frequency
        is not in the top-N most frequent words (default top 30,000).

        Use `use_lowercase` to convert all words in questions to thier lowercase form before evaluating
        the accuracy. It's useful when assuming the text preprocessing also uses lowercase. (default True).
        Use `case_insensitive` to convert all words in questions and vocab to their lowercase form before evaluating
It shouldn't make a difference in results, but in the spirit of mimicking compute-accuracy.c, unless there's a good reason not to, perhaps we should case-normalize to uppercase?
All right
@gojomo Made some changes, please have a look?
        `restrict_vocab` is an optional integer which limits the vocab to be used
        for answering questions. For example, restrict_vocab=10000 would only check
        the first 10000 word vectors in the vocabulary order. (This may be meaningful
        if you've sorted the vocabulary by descending frequency.)
Please add a comment about what happens when `case_insensitive` and `restrict_vocab` are used together.
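For illustration, one way the combined behaviour could be specified (`build_ok_vocab` is a hypothetical helper, and the exact collision rule is an assumption, not necessarily what the PR implements):

```python
def build_ok_vocab(model, restrict_vocab=30000, case_insensitive=True):
    # restrict_vocab applies first: keep only the first N words in vector
    # order (meaningful when the vocabulary is frequency-sorted).
    words = sorted(model.vocab, key=lambda w: model.vocab[w].index)[:restrict_vocab]
    if not case_insensitive:
        return dict((w, model.vocab[w]) for w in words)
    ok_vocab = {}
    for w in words:
        # Case folding then applies to that subset: on a collision such as
        # "Apple" vs "apple", the earlier (more frequent) entry wins.
        ok_vocab.setdefault(w.upper(), model.vocab[w])
    return ok_vocab
```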
The current comment no longer mentions the default of 30,000 that is in effect.
Done
If you add a changelog entry and tests, then it would look good to me.
I think the latest changes match compute-accuracy.c behavior, but it would be good to test against it, using the same sets of word2vec.c-format vectors with mixed cases. If we've matched the behavior exactly, then we should get exactly the same correct-counts for many values of threshold/restrict_vocab, and for both case_insensitive options.
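A minimal sketch of such a manual check on the gensim side, assuming the era's `Word2Vec.load_word2vec_format` loader and that `accuracy()` appends an aggregate 'total' section (the file paths are placeholders):

```python
from gensim.models import Word2Vec

# Load the same word2vec.c-format vectors that compute-accuracy.c will read.
model = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

for restrict_vocab in (30000, 100000, 300000):
    for case_insensitive in (True, False):
        sections = model.accuracy('questions-words.txt',
                                  restrict_vocab=restrict_vocab,
                                  case_insensitive=case_insensitive)
        total = sections[-1]  # the aggregate 'total' section
        # Compare these counts against compute-accuracy.c run with the
        # corresponding threshold on the same vectors and questions.
        print(restrict_vocab, case_insensitive,
              len(total['correct']), len(total['incorrect']))
```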
Yeah, that makes sense. What do you think the best approach would be: adding pretrained word2vec models to the tests and checking their accuracy against hardcoded values (obtained from running …)?
@jayantj Definitely not the latter.
I was thinking the need could be met by running a few tests manually – for example using the GoogleNews vectors, which are too big to bundle (or even download/load) during automated unit testing. I suppose a toy-sized set of word-vectors (say, just enough to get a few analogies right/wrong) could also be bundled for testing purposes, but it would have to be hand-crafted to trigger problems for any deviation from the …
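As a sketch of what such a hand-crafted toy set might look like (the words, dimensions, and file name are all made up for illustration; real test vectors would be chosen so specific analogies deliberately succeed or fail rather than drawn at random):

```python
import numpy as np

# A tiny word2vec.c text-format vector file with deliberate case
# collisions, for exercising case_insensitive behaviour.
words = ['King', 'king', 'QUEEN', 'queen', 'man', 'Woman']
dim = 4
vectors = np.random.RandomState(0).randn(len(words), dim)

with open('toy-vectors.txt', 'w') as f:
    f.write('%d %d\n' % (len(words), dim))  # header: vocab size, dimensionality
    for word, vec in zip(words, vectors):
        f.write('%s %s\n' % (word, ' '.join('%.6f' % x for x in vec)))

# Loadable with: Word2Vec.load_word2vec_format('toy-vectors.txt', binary=False)
```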
Waiting for manual verification results on the GoogleNews vectors and a small toy mixed-case dataset.
Manually verified on a midsize mixed-case corpus (Brown) for multiple values of … Are we skipping automated tests, then?
Could you link to a gist or notebook with your manual results?
Sorry for the really late response. Found a new error that had been missed previously, while training on Gutenberg. Suppose the results of …
You could actually argue … Let me know if you find any other issues with the PR.
@jayantj Could you please pull master and resolve the merge conflicts?
Thanks! Merged in 3c47319
@@ -11,6 +11,7 @@ Changes
* Implemented LsiModel.docs_processed attribute
* Added LdaMallet support. Added LdaVowpalWabbit, LdaMallet example to notebook. Added test suite for coherencemodel and aggregation.
  Added `topics` parameter to coherencemodel. Can now provide tokenized topics to calculate coherence value (@dsquareindia, #750)
* Changed `use_lowercase` option in word2vec accuracy to `case_insensitive` to account for case variations in training vocabulary (@jayantj, #714)
Please link directly to the GitHub pull request/issue page, so people can check the changes and discussions easily by visiting the direct link in their browser. The markdown syntax for links is `[link text](link url)`.

(@tmylk please start doing this consistently in the CHANGELOG, starting this release.)
Related to #714

Quick PR to get some feedback. Also, should we have tests for this? There aren't any currently, which seems fine to me, since `accuracy` is an informal evaluation metric rather than basic functionality. A couple of concerns are mentioned as comments.