Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec #705

Merged
merged 1 commit into from
Jun 9, 2016

Conversation

devashishd12
Copy link
Contributor

@devashishd12 devashishd12 commented May 24, 2016

Addresses #692. TODO:

  • Raise warning if alpha is raised again after descending to minimum value. (Where should this be ideally done @gojomo?)
  • Raise warning if calling load() on an instance rather than the class.
  • Raise warning if mismatch in expected count for train() and actual.
  • Add test suit to test_doc2vec.py, test_word2vec.py

@tmylk @gojomo am I going on the right track with this? Also how should the warnings be dealt with in doc2vec_inner.c?

@gojomo
Copy link
Collaborator

gojomo commented May 24, 2016

For each of these warnings, the goal should be to make note of the problem exactly once, at the earliest possible time. We don't want the overhead of redundant checks; the user doesn't need X million warnings for their X million text examples. So avoid doing checks for every document, during training, and instead do as soon as the problem parameter/input is seen for the 1st time.

The alpha is managed by the train() method, so the check would have to go there.

Most of these issues apply to Word2Vec as well, and so in most cases the checks will be there (and inherited by Doc2Vec).

You won't likely have to be concerned about either the .c files (because they're auto-generated from the cython .pyx files), or the .pyx files (as those are optimized implementations for which any problems should have been discovered earlier, in pure-python code).

@devashishd12
Copy link
Contributor Author

@gojomo thanks for the help! I've updated the PR. Hope this is better. What should be a good threshold value for a "small" vocab? Also, alpha monotonically decreases right? I'll change the code if that's not the case.

@@ -509,9 +510,16 @@ def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_
Each sentence must be a list of unicode strings.

"""
for sentence in sentences:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds the overhead of a full pass over the corpus in the case where the user has already done things right; that's excessive. It'd be better to add the check to a pass that's already happening (eg scan_vocab()). (And, it may even be a desirable optimization to just check the first example... this mistake is likely to be all-or-nothing, and we don't want to burden a user doing the right thing, with a tens-of-millions-of-examples corpus, with a check of every example.)

Also, this code appears to only work for the Doc2Vec case, when 'sentence' is something TaggedDocument-shaped with a 'words' field. In Word2Vec, there's no such field, the example is just the list-of-tokens. Are you testing in a Word2Vec triggering scenario?

Copy link
Contributor Author

@devashishd12 devashishd12 May 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh alright I didn't know about the all-or-nothing scenario. I've added checks in the scan_vocab methods of Word2Vec and Doc2Vec wherein I'm only checking if the sentence or document.words is an instance of string_types for sentence (or document) 0. So I don't think we need the check for all single characters in vocab since, as you said, this check already covers it. Is this approach fine?

@gojomo
Copy link
Collaborator

gojomo commented May 25, 2016

I don't have any confident idea of a good "your vocabulary is suspiciously small" threshold, so we should probably just skip that idea unless/until there's more evidence it's needed. (It was just one of a bunch of 'and/or' ideas for catching common errors.)

@devashishd12
Copy link
Contributor Author

@gojomo I've made the changes to the code. Could you please check? Also regarding raising the warning for a mismatch in the expected count for train() and actual, what is exactly count over here and what should it be compared against to raise the warning?

@devashishd12 devashishd12 force-pushed the TaggedDocument_warning branch from 4ec40d6 to 2eedf77 Compare May 27, 2016 13:50
@@ -636,6 +636,12 @@ def scan_vocab(self, documents, progress_per=10000, trim_rule=None):
interval_start = default_timer() - 0.00001 # guard against next sample being identical
interval_count = 0
vocab = defaultdict(int)
try:
if isinstance(list(documents)[0].words, string_types):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many users won't be passing an indexable/all-in-memory corpus – so this will break for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for bothering you so much :(
What would be a better way to check then?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the existing iteration, check the current sentence. But, once the first is checked, set a flag to prevent further checks – based on our assumption that the user either gets this "all right" or "all wrong" and that (when wrong) they don't want redundant warnings for every sentence.

@devashishd12
Copy link
Contributor Author

devashishd12 commented May 31, 2016

@tmylk @gojomo I've added unit tests for the build_vocab() warning and the train() warning. Sorry for still using the warnings module. The warnings weren't displaying on the console window with logger.warn and also the testing with warnings.warn seemed more convenient. Is it fine?

vocab = defaultdict(int)
for document_no, document in enumerate(documents):
if not checked_string_types:
if isinstance(document.words, string_types):
warnings.warn("'words' should be a list of unicode strings. True string type provided.")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "True string type", why not output the actual type of document.words?

Plus maybe even output the entire document example, for explicitness and clarity.

@gojomo gojomo changed the title [WIP] ENH: Raise warnings if vocab in single character elements in doc2vec [WIP] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec May 31, 2016
@gojomo
Copy link
Collaborator

gojomo commented May 31, 2016

Changed title to reflect that this can be an issue in Word2Vec as well (and should be checked there too).

I'm in favor of logger.warn() for consistency. That should typically result in visible console output, unless something non-typical has been done with levels/loggers.

@tmylk
Copy link
Contributor

tmylk commented Jun 2, 2016

testfixtures package allows to test for logging output so we can use logger.warn instead of just warnings.
It is just a test time dependency so not an issue.

@devashishd12
Copy link
Contributor Author

Thanks for the suggestion @tmylk however I'm getting an ImportError on import testfixtures while running nosetests. I don't get this error outside of nosetests. I've checked an answer here but I don't think this answer will work in our case since I can't remove the __init__.py nor can I specify a particular target directory since it won't work for other users. Also would this dependency be a problem for other users while running nosetests?

@devashishd12
Copy link
Contributor Author

@tmylk the build still seems to be failing.....

@devashishd12 devashishd12 force-pushed the TaggedDocument_warning branch 3 times, most recently from 840bcc3 to b4195f4 Compare June 6, 2016 16:49
@devashishd12
Copy link
Contributor Author

@piskvorky @gojomo @tmylk I think travis is happy. I think the test failing is unrelated to my PR. Could you please review once?

for sentence_no, sentence in enumerate(sentences):
if not checked_string_types:
if isinstance(sentence, string_types):
logger.warn("'sentences' should be a list of list of unicode strings. %s provided.", type(sentence))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be pedantically accurate, sentences need not be a list – just an iterable. But the important thing being checked here is whether the first item is the expected list-of-strings. So I would word this warning as: "Each sentences item should be a list of words (usually unicode strings). First item here is instead plain %s." The warning in doc2vec could be similarly: "Each words should be a list of words (usually unicode strings). First 'words' here is instead plain %s".) (I find that clearer than just "%s provided" which may at first glance seem like no contradiction with preceding advice.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Made the changes.

@gojomo
Copy link
Collaborator

gojomo commented Jun 6, 2016

Other than my slight suggestion regarding warning wording, this looks good-to-go.

The test failure is in gensim.test.test_glove2word2vec.TestGlove2Word2Vec and thus seems unrelated to these changes. @tmylk I suggest that test should be disabled here and in 'develop' until it can be independently made reliable.

@devashishd12 devashishd12 force-pushed the TaggedDocument_warning branch from b4195f4 to 5bf3329 Compare June 6, 2016 22:18
@devashishd12
Copy link
Contributor Author

I've made the changes in the log message, rebased and squashed.

@devashishd12
Copy link
Contributor Author

@tmylk can this be merged?

@devashishd12 devashishd12 changed the title [WIP] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec [MRG] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec Jun 7, 2016
@tmylk tmylk merged commit 394ceab into piskvorky:develop Jun 9, 2016
@tmylk
Copy link
Contributor

tmylk commented Jun 9, 2016

@dsquareindia Thanks a lot for this very needed PR!

@devashishd12
Copy link
Contributor Author

No problem! Thanks a ton for the help and reviews!

@devashishd12 devashishd12 deleted the TaggedDocument_warning branch June 9, 2016 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants