Case insensitive option for word2vec accuracy #804
Conversation
@@ -1610,13 +1612,16 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, u
                    logger.debug("skipping line #%i with OOV words: %s" % (line_no, line.strip()))
                    continue

                original_vocab = self.vocab
                self.vocab = ok_vocab
                ignore = set(self.vocab[v].index for v in [a, b, c])  # indexes of words to ignore
                predicted = None
                # find the most likely prediction, ignoring OOV words and input words
                sims = most_similar(self, positive=[b, c], negative=[a], topn=False, restrict_vocab=restrict_vocab)
The behaviour of `restrict_vocab` in `most_similar` is a little different from its behaviour in `accuracy`. In `most_similar`, simply the first `restrict_vocab` vectors are taken for calculating similarity, whereas in `accuracy`, `ok_vocab` is created by sorting `vocab` by frequency and taking the first `restrict_vocab` words.

This could cause some unintended behaviour, especially with low values for `restrict_vocab`, since it is passed to `most_similar` from `accuracy`. The two sets of vocab used in the two methods could be significantly different, and possibly even completely disjoint.

Setting `self.vocab = ok_vocab` before the `most_similar` call should handle this, I think?
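For reference, a paraphrased sketch of the two interpretations (attribute names like `syn0norm` and `vocab` follow the old `Word2Vec` API, and `model`/`mean` are assumed to exist; treat this as illustrative, not the exact library code):

```python
import numpy as np

# 'model' is a trained Word2Vec; 'mean' is the unit-length query vector
# built from the positive/negative words (details elided).
restrict_vocab = 30000

# most_similar(): restrict_vocab simply slices the raw vector matrix,
# trusting that the first N rows are the N words of interest.
dists = np.dot(model.syn0norm[:restrict_vocab], mean)

# accuracy(): restrict_vocab selects the N most *frequent* words,
# regardless of where their vectors sit in the matrix.
ok_vocab = dict(
    sorted(model.vocab.items(), key=lambda item: -item[1].count)[:restrict_vocab])

# If the vectors aren't frequency-sorted, these two "top N" sets can
# differ arbitrarily -- in the worst case they are disjoint.
```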
Yes, I agree it's currently a subtle bug that this call to `most_similar()` passes the `restrict_vocab` argument, since it may not mean the exact same thing. (It's not always the case that the word-vectors have already been frequency-sorted.) It's only truly correct to pass the count in if the interpretation is the same.

But it's likely a big speedup, in cases where the vocabulary is large compared to `restrict_vocab`, as it prevents lots of unnecessary distance-calculations in the `dot(...)` line of `most_similar()`. (Executing `most_similar()` with a filtered `vocab` doesn't provide the same optimization.)

Perhaps we just want to document that for `restrict_vocab` to work as expected here, the vocabulary must be sorted? (Or even redefine the argument here to mean limiting searches to the first indexes, whether they're the most-frequent or not. So no extra sort-by-frequency here.)
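A rough illustration of why passing the count through is the faster option (again assuming the old `syn0norm` attribute and a precomputed query vector `mean`; a sketch, not the library's exact code):

```python
import numpy as np

# With restrict_vocab passed through, the bulk similarity computation
# touches only the first N rows of the (unit-normalized) vector matrix:
dists = np.dot(model.syn0norm[:restrict_vocab], mean)   # ~N * dim multiplications

# With only self.vocab filtered, the dot product still runs over every
# row -- the filter merely discards candidates afterwards:
dists_all = np.dot(model.syn0norm, mean)                # ~V * dim, with V >> N
```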
Hmm, yes, that's a good catch; I hadn't noticed that a filtered `vocab` doesn't prevent all those vector multiplications.

I think the latter makes the most sense, especially seeing that `sort_vocab` is true by default in the word2vec initialization. It doesn't seem sensible to sort the entire vocab just for temporary use.
@@ -1576,15 +1576,17 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, u
        Use `restrict_vocab` to ignore all questions containing a word whose frequency
        is not in the top-N most frequent words (default top 30,000).

        Use `use_lowercase` to convert all words in questions to thier lowercase form before evaluating
        the accuracy. It's useful when assuming the text preprocessing also uses lowercase. (default True).
        Use `case_insensitive` to convert all words in questions and vocab to their lowercase form before evaluating
It shouldn't make a difference in results, but in the spirit of mimicking compute-accuracy.c, unless there's a good reason not to, perhaps we should case-normalize to uppercase?
All right
@gojomo Made some changes, please have a look?
        `restrict_vocab` is an optional integer which limits the vocab to be used
        for answering questions. For example, restrict_vocab=10000 would only check
        the first 10000 word vectors in the vocabulary order. (This may be meaningful
        if you've sorted the vocabulary by descending frequency.)
Please add a comment about what happens when `case_insensitive` and `restrict_vocab` are used together.
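For illustration, one way the combined behaviour could be specified (`build_ok_vocab` is a hypothetical helper, and the exact collision rule is an assumption, not necessarily what the PR implements):

```python
def build_ok_vocab(model, restrict_vocab=30000, case_insensitive=True):
    # restrict_vocab applies first: keep only the first N words in vector
    # order (meaningful when the vocabulary is frequency-sorted).
    words = sorted(model.vocab, key=lambda w: model.vocab[w].index)[:restrict_vocab]
    if not case_insensitive:
        return dict((w, model.vocab[w]) for w in words)
    ok_vocab = {}
    for w in words:
        # Case folding then applies to that subset: on a collision such as
        # "Apple" vs "apple", the earlier (more frequent) entry wins.
        ok_vocab.setdefault(w.upper(), model.vocab[w])
    return ok_vocab
```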
The current comment no longer mentions the default of 30,000 that is in effect.
Done
If you add a changelog entry and tests, then it would look good to me.
I think the latest changes match compute-accuracy.c behavior, but it would be good to test against it, using the same sets of word2vec.c-format vectors with mixed cases. If we've matched the behavior exactly, then we should get exactly the same correct-counts for many values of threshold/restrict_vocab, and for both case_insensitive options.
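A minimal sketch of such a manual check on the gensim side, assuming the era's `Word2Vec.load_word2vec_format` loader and that `accuracy()` appends an aggregate 'total' section (the file paths are placeholders):

```python
from gensim.models import Word2Vec

# Load the same word2vec.c-format vectors that compute-accuracy.c will read.
model = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

for restrict_vocab in (30000, 100000, 300000):
    for case_insensitive in (True, False):
        sections = model.accuracy('questions-words.txt',
                                  restrict_vocab=restrict_vocab,
                                  case_insensitive=case_insensitive)
        total = sections[-1]  # the aggregate 'total' section
        # Compare these counts against compute-accuracy.c run with the
        # corresponding threshold on the same vectors and questions.
        print(restrict_vocab, case_insensitive,
              len(total['correct']), len(total['incorrect']))
```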
Yeah, that makes sense. What do you think the best approach would be: adding pretrained word2vec models to the tests and checking their accuracy against hardcoded values (obtained from running …)?
@jayantj Definitely not the latter.
I was thinking the need could be met by running a few tests manually – for example using the GoogleNews vectors, which are too big to bundle (or even download/load) during automated unit testing. I suppose a toy-sized set of word-vectors (say, just enough to get a few analogies right/wrong) could also be bundled for testing purposes, but it would have to be hand-crafted to trigger problems for any deviation from the …
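As a sketch of what such a hand-crafted toy set might look like (the words, dimensions, and file name are all made up for illustration; real test vectors would be chosen so specific analogies deliberately succeed or fail rather than drawn at random):

```python
import numpy as np

# A tiny word2vec.c text-format vector file with deliberate case
# collisions, for exercising case_insensitive behaviour.
words = ['King', 'king', 'QUEEN', 'queen', 'man', 'Woman']
dim = 4
vectors = np.random.RandomState(0).randn(len(words), dim)

with open('toy-vectors.txt', 'w') as f:
    f.write('%d %d\n' % (len(words), dim))  # header: vocab size, dimensionality
    for word, vec in zip(words, vectors):
        f.write('%s %s\n' % (word, ' '.join('%.6f' % x for x in vec)))

# Loadable with: Word2Vec.load_word2vec_format('toy-vectors.txt', binary=False)
```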
Waiting for manual verification results on the GoogleNews vectors and a small toy mixed-case dataset.
Manually verified on a midsize mixed-case corpus (Brown) for multiple values of … Are we skipping automated tests, then?
Could you link to a gist or notebook with your manual results?
Sorry for the really late response. Found a new error that had been missed previously, while training on Gutenberg. Suppose the results of …
You could actually argue … Let me know if you find any other issues with the PR.
@jayantj Could you please pull master and resolve the merge conflicts?
Thanks! Merged in 3c47319
@@ -11,6 +11,7 @@ Changes
* Implemented LsiModel.docs_processed attribute
* Added LdaMallet support. Added LdaVowpalWabbit, LdaMallet example to notebook. Added test suite for coherencemodel and aggregation.
  Added `topics` parameter to coherencemodel. Can now provide tokenized topics to calculate coherence value (@dsquareindia, #750)
* Changed `use_lowercase` option in word2vec accuracy to `case_insensitive` to account for case variations in training vocabulary (@jayantj, #714)
Please link directly to the GitHub pull request/issue page, so people can check the changes and discussions easily by visiting the direct link in their browser. The markdown syntax for links is `[link text](link url)`.

(@tmylk please start doing this consistently in the CHANGELOG, starting this release.)
Related to #714

Quick PR to get some feedback. Also, should we have tests for this? There aren't any currently, which seems fine to me, since `accuracy` is an informal evaluation metric rather than basic functionality. A couple of concerns are mentioned as comments.