Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding type check for corpus_file argument #2469

Merged
merged 16 commits into from
May 5, 2019
Merged
19 changes: 18 additions & 1 deletion gensim/models/doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@
except ImportError:
from Queue import Queue # noqa:F401

from collections import namedtuple, defaultdict
from collections import namedtuple, defaultdict, Iterable
from timeit import default_timer

from numpy import zeros, float32 as REAL, empty, ones, \
Expand Down Expand Up @@ -794,6 +794,23 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor

"""
kwargs = {}

# Check if both documents and corpus_file are None
if corpus_file is None and documents is None:
raise TypeError("Either one of corpus_file or documents value must be provided.")
Copy link
Collaborator

@mpenkov mpenkov May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to include a period. Please apply this to other messages as well.

Suggested change
raise TypeError("Either one of corpus_file or documents value must be provided.")
raise TypeError("Either one of corpus_file or documents value must be provided")

I tried committing these nitpicks to your branch myself, but I don't have permissions.


# Check if both documents and corpus_file are not None
if not ((corpus_file is None) ^ (documents is None)):
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the above.

Suggested change
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
raise TypeError("Both corpus_file and documents may not be provided at the same time")


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to combine these two cases with an XOR: if (corpus_file is None) ^ (documents is None):.

Copy link
Contributor Author

@saraswatmks saraswatmks May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way for readability. Also, I noticed True ^ True is False. If I do it the xor way, this case will get skipped.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd actually need to be if not ((corpus_file is None) ^ (documents is None)): - but then essentially one error message, roughly "supply one or the other but not both", would be fine for either mistake.

Copy link
Owner

@piskvorky piskvorky May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong -1 on this: hard to read and reason about. Please stick to simpler (and easier to maintain) constructs .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clearest approach would be not to have a method with such delicate XOR one-but-not-the-other positional parameters - use separate methods with distinct names! But failing that, a one-line XOR, whose next line is an error message with a plain-english description of the constraint, seems pretty simple to me - just accurately reflecting the design choice already made.

# Check if corpus_file is string type
if documents is None and not isinstance(corpus_file, string_types):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If corpus_file must be a valid path to a file, then why not explicitly check for that, instead of checking the type?

Suggested change
if documents is None and not isinstance(corpus_file, string_types):
if documents is None and not os.path.isfile(corpus_file):

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that throw an exception for incorrect (non-string) corpus_file? It's an interesting idea for a try-catch block though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could wrap it in a _is_valid_corpus(corpus_file) function for malformed input.

Given that this input validation will be happening in mulitple places, it's probably not a bad idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky @mpenkov I've kept it simple. I used os.path.isfile as suggested. It returns False at all type of invalid inputs including non-string / float or whatever trash value a user would pass.

Also, to do this for fasttext, I think it will be better:

  1. To create a separate PR
  2. Wait until the this PR is finalised so that I can follow the same code approach there are well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm -1 on creating separate PRs for this, for several reasons:

  1. The changes are tiny. They span several lines of code.
  2. The changes will be identical for each model (e.g. doc2vec, fasttext, etc)
  3. PR overhead (review, merge, release, document) is not worth the effort.
  4. It's salami-slicing.

For your second point, sure, I think it's a good idea to wait until people come to a consensus before expanding your approach to the other models.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov Could you please review the latest changes and let me know is we are good to go or something else needs to be done ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Left you some nitpicky comments.

In the future, please consider ticking "allow commits from maintainers" when creating a PR on github. That allows the reviewers of your PR to push changes to your branch, potentially shortcutting some of the back-and-forth for minor issues like typos, formatting, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov thanks for the suggestion. I believe we are solid now on this PR. I will now make these changes in fasttext module as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please go ahead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov I've done the changes in fasttext module as well. Please have a look.

raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)
raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead." % corpus_file)

Better than %s, because corpus_file may contain spaces.


# Check if documents is iterable
if documents is not None and not isinstance(documents, Iterable):
raise TypeError("documents must be an iterable of list, got %s instead." % documents)

if corpus_file is not None:
# Calculate offsets for each worker along with initial doctags (doctag ~ document/line number in a file)
offsets, start_doctags = self._get_offsets_and_start_doctags_for_corpusfile(corpus_file, self.workers)
Expand Down
18 changes: 17 additions & 1 deletion gensim/test/test_doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,22 @@ def testLoadOldModel(self):
sims_to_infer = loaded_model.docvecs.most_similar([doc0_inferred], topn=len(loaded_model.docvecs))
self.assertTrue(sims_to_infer)

def testDoc2vecTrainParameters(self):

model = doc2vec.Doc2Vec(vector_size=50)
model.build_vocab(documents=list_corpus)

# check if corpus_file is not a string
self.assertRaises(TypeError, model.train, corpus_file=11111)
# check if documents is an iterable
self.assertRaises(TypeError, model.train, documents=11111)
# check is both the parameters are provided
self.assertRaises(TypeError, model.train, documents=sentences, corpus_file='test')
# check if both the parameters are left empty
self.assertRaises(TypeError, model.train, documents=None, corpus_file=None)
# check if corpus_file is an iterable
self.assertRaises(TypeError, model.train, corpus_file=sentences)

@unittest.skipIf(os.name == 'nt', "See another test for Windows below")
def test_get_offsets_and_start_doctags(self):
# Each line takes 6 bytes (including '\n' character)
Expand Down Expand Up @@ -387,7 +403,7 @@ def model_sanity(self, model, keep_training=True):
tmpf = get_tmpfile('gensim_doc2vec.tst')
model.save(tmpf)
loaded = doc2vec.Doc2Vec.load(tmpf)
loaded.train(sentences, total_examples=loaded.corpus_count, epochs=loaded.epochs)
loaded.train(documents=sentences, total_examples=loaded.corpus_count, epochs=loaded.epochs)

def test_training(self):
"""Test doc2vec training."""
Expand Down