diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4582d7ca57..cc3f7581c3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,7 @@ Changes
 * Implemented LsiModel.docs_processed attribute
 * Added LdaMallet support. Added LdaVowpalWabbit, LdaMallet example to notebook. Added test suite for coherencemodel and aggregation.
   Added `topics` parameter to coherencemodel. Can now provide tokenized topics to calculate coherence value (@dsquareindia, #750)
+* Changed `use_lowercase` option in word2vec accuracy to `case_insensitive` to account for case variations in training vocabulary (@jayantj, #714)
 
 0.13.1, 2016-06-22
diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index 7d56a79a15..223487413b 100644
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -1573,13 +1573,15 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, c
         The accuracy is reported (=printed to log and returned as a list) for each
         section separately, plus there's one aggregate summary at the end.
 
-        `restrict_vocab` is an optional integer which limits the vocab to be used
-        for answering questions. For example, restrict_vocab=10000 would only check
-        the first 10000 word vectors in the vocabulary order. (This may be meaningful
-        if you've sorted the vocabulary by descending frequency.)
-
-        Use `case_insensitive` to convert all words in questions and vocab to their uppercase form before evaluating
-        the accuracy. Useful in case of case-mismatch between training tokens and question words. (default True).
+        Use `restrict_vocab` to ignore all questions containing a word not in the first `restrict_vocab`
+        words (default top 30,000). This may be meaningful if you've sorted the vocabulary by descending
+        frequency. In case `case_insensitive` is True, the first `restrict_vocab` words are taken first, and then
+        case normalization is performed.
+
+        Use `case_insensitive` to convert all words in questions and vocab to their uppercase form before
+        evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens
+        and question words. In case of multiple case variants of a single word, the vector for the first
+        occurrence (also the most frequent if vocabulary is sorted) is taken.
 
         This method corresponds to the `compute-accuracy` script of the original C word2vec.
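
The ordering described in the updated docstring (truncate to the first `restrict_vocab` words, then normalize case, keeping the first occurrence of each case variant) may be easier to follow with a small standalone sketch. The helpers `build_ok_vocab` and `question_is_answerable` are hypothetical names chosen for illustration. They operate on a plain list of words and are not the code from this patch or part of the gensim API.

```python
def build_ok_vocab(index2word, restrict_vocab=30000, case_insensitive=True):
    """Map each allowed word to its position in `index2word` (hypothetical helper)."""
    # Step 1: keep only the first `restrict_vocab` words, in vocabulary order.
    head = list(enumerate(index2word[:restrict_vocab]))
    if not case_insensitive:
        return {word: idx for idx, word in head}
    # Step 2: normalize case only after truncating. Iterating in reverse lets
    # earlier entries overwrite later ones, so the first occurrence of each
    # case variant is the one that survives.
    return {word.upper(): idx for idx, word in reversed(head)}


def question_is_answerable(question, ok_vocab, case_insensitive=True):
    """Return True only if every word of the question is in the allowed vocab."""
    words = [w.upper() for w in question] if case_insensitive else list(question)
    return all(w in ok_vocab for w in words)


# Toy vocabulary, assumed to be sorted by descending frequency.
vocab = ["the", "Paris", "france", "paris", "France", "berlin"]
ok_vocab = build_ok_vocab(vocab, restrict_vocab=4)

# "Paris" (index 1) comes before "paris" (index 3), so index 1 is kept.
print(ok_vocab["PARIS"])
# "berlin" lies outside the first 4 words, so this question is skipped.
print(question_is_answerable(["paris", "france", "berlin", "the"], ok_vocab))
```

Because truncation happens before normalization, a case variant that lies beyond the first `restrict_vocab` entries never enters the lookup, even when another variant of the same word is inside the cutoff.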