Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitive option for word2vec accuracy #804

Closed
wants to merge 8 commits into from

Conversation

jayantj
Copy link
Contributor

@jayantj jayantj commented Jul 25, 2016

Related to #714

Quick PR to get some feedback. Also, should we have tests for this? There aren't any currently, which seems fine to me since accuracy is an informal evaluation metric rather than basic functionality

A couple of concerns mentioned as comments

@@ -1610,13 +1612,16 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, u
logger.debug("skipping line #%i with OOV words: %s" % (line_no, line.strip()))
continue

original_vocab = self.vocab
self.vocab = ok_vocab
ignore = set(self.vocab[v].index for v in [a, b, c]) # indexes of words to ignore
predicted = None
# find the most likely prediction, ignoring OOV words and input words
sims = most_similar(self, positive=[b, c], negative=[a], topn=False, restrict_vocab=restrict_vocab)
Copy link
Contributor Author

@jayantj jayantj Jul 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour for restrict_vocab in most_similar is a little different from its behaviour in accuracy
In accuracy, simply the first restrict_vocab vectors are taken for calculating similarity, and in accuracy, ok_vocab is created by sorting vocab by frequency and taking the first restrict_vocab words.

This could cause some unintended behaviour, especially with low values for restrict_vocab since it is passed to most_similar from accuracy. The two sets of vocab used in the two methods could be significantly different, and possibly even completely disjoint.

Setting self.vocab = ok_vocab before the most_similar call should handle this, I think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree it's currently a subtle bug that this call to most_similar() passes the restrict_vocab argument, since it may not mean the exact same thing. (It's not always the case that the word-vectors have already been frequency-sorted.) It's only truly correct to pass the count in if the interpretation is the same.

But, it's likely a big speedup, in cases where the vocabulary is large compared to the restrict_vocab, as it prevents lots of unnecessary distance-calculations in the dot(...) line of most_similar(). (Executing most_similar() with a filtered vocab doesn't provide the same optimization.)

Perhaps we just want to document that for restrict_vocab to work as expected here, the vocabulary must be sorted? (Or even redefine the argument here to mean limiting searches to the 1st indexes, whether they're the most-frequent or not. So no extra sort-by-frequency here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes that's a good catch, I hadn't noticed a filtered vocab doesn't prevent all those vector multiplications.

I think the latter makes most sense, especially seeing that sort_vocab is true by default in the word2vec initialization. Doesn't seem to make sense to perform an operation like sorting on the entire vocab for temporary use.

@@ -1576,15 +1576,17 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, u
Use `restrict_vocab` to ignore all questions containing a word whose frequency
is not in the top-N most frequent words (default top 30,000).

Use `use_lowercase` to convert all words in questions to thier lowercase form before evaluating
the accuracy. It's useful when assuming the text preprocessing also uses lowercase. (default True).
Use `case_insensitive` to convert all words in questions and vocab to their lowercase form before evaluating
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't make a difference in results, but in the spirit of mimicking compute-accuracy.c unless there's a good reason not to, perhaps we should case-normalize to uppercase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right

@jayantj
Copy link
Contributor Author

jayantj commented Jul 26, 2016

@gojomo Made some changes, please have a look?

`restrict_vocab` is an optional integer which limits the vocab to be used
for answering questions. For example, restrict_vocab=10000 would only check
the first 10000 word vectors in the vocabulary order. (This may be meaningful
if you've sorted the vocabulary by descending frequency.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment to what happens for case_insensitive and restrict_vocab together.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current comment no longer mentions default of 30,000 in effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@tmylk
Copy link
Contributor

tmylk commented Aug 2, 2016

If you add changelog and tests than it would look good to me.
Waiting for @gojomo to review too.

@gojomo
Copy link
Collaborator

gojomo commented Aug 2, 2016

I think latest changes match compute-accuracy.c behavior, but would be good to test against it, using same sets of word2vec.c-format-vectors with mixed-cases. If we've matched behavior exactly, then should get exact-same correct-counts for many values of threshold/restrict_vocab, or both case_insensitive options.

@jayantj jayantj force-pushed the accuracy_case_insensitive branch from 2aa5065 to d2f7e0b Compare August 3, 2016 07:37
@jayantj jayantj force-pushed the accuracy_case_insensitive branch from d2f7e0b to 4443883 Compare August 3, 2016 07:42
@jayantj
Copy link
Contributor Author

jayantj commented Aug 4, 2016

Yeah, that makes sense. What do you think the best approach would be, adding pretrained word2vec models to the tests and checking their accuracy against hardcoded values (obtained from running compute-accuracy.c locally), or adding the compute-accuracy binary to the tests and automating everything?

@piskvorky
Copy link
Owner

@jayantj Definitely not the latter.

@gojomo
Copy link
Collaborator

gojomo commented Aug 4, 2016

I was thinking the need could be met by running a few tests manually – for example using the GoogleNews vectors, which are too big to bundle (or even download/load) during automated unit testing.

I suppose a toy-sized set of word-vectors (say just enough to get a few analogies right/wrong) could also be bundled for testing purposes, but it would have to be hand-crafted to trigger problems for any deviation from the compute-accuracy.c behaviors with regard to threshold/case. (That may be more trouble than it's worth, once general conformance has been manually verified...)

@tmylk
Copy link
Contributor

tmylk commented Aug 5, 2016

Waiting for manual verification results on GoogleNews vectors and a small toy mixed case dataset.

@jayantj
Copy link
Contributor Author

jayantj commented Aug 5, 2016

Manually verified on a midsize mixed case corpus (brown) for multiple values of restrict_vocab
I'm running into memory issues while verifying on GoogleNews vectors
Verified for text8 though

Are we skipping on automated tests then?

@tmylk
Copy link
Contributor

tmylk commented Aug 5, 2016

Could you link to a gist or notebook with your manual results?

@jayantj
Copy link
Contributor Author

jayantj commented Aug 11, 2016

Sorry for the really late response
Here's the ipython notebook
I've compared gensim trained models for text8 and gutenberg (I've modified the compute accuracy printf statements a little, the output is still not very readable though)

Found a new error that'd been missed previously while training on gutenberg -

Suppose the results of most_similar for boy girl he she return He followed by She, with the previous logic and case_insensitive=True, He gets ignored as one of the analogy words itself (as the matching is performed on the value and not the index), and the analogy gets marked as correct.

compute-accuracy.c works with indices though. This was how it had been implemented previously in gensim too, the change to value-based matching was in one my commits itself :/

You could actually argue He should be ignored, and the analogy be marked as correct, but since we're following the original implementation exactly, this is definitely not desirable.

Let me know if you find any other issues with the PR

@tmylk
Copy link
Contributor

tmylk commented Aug 12, 2016

@jayantj Could you please pull master and resolve the merge conflicts?

@tmylk
Copy link
Contributor

tmylk commented Aug 14, 2016

Thanks! Merged in 3c47319

@tmylk tmylk closed this Aug 14, 2016
@@ -11,6 +11,7 @@ Changes
* Implemented LsiModel.docs_processed attribute
* Added LdaMallet support. Added LdaVowpalWabbit, LdaMallet example to notebook. Added test suite for coherencemodel and aggregation.
Added `topics` parameter to coherencemodel. Can now provide tokenized topics to calculate coherence value (@dsquareindia, #750)
* Changed `use_lowercase` option in word2vec accuracy to `case_insensitive` to account for case variations in training vocabulary (@jayantj, #714)
Copy link
Owner

@piskvorky piskvorky Aug 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please link directly to the github pullrequest/issue page, so people can check the changes and discussions easily, by visiting the direct link in their browser.

The markdown syntax for links is [link text](link url).

(@tmylk please start doing this consistently in CHANGELOG starting this release)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants