Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding type check for corpus_file argument #2469

Merged
merged 16 commits into from
May 5, 2019
Merged
19 changes: 18 additions & 1 deletion gensim/models/doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@
except ImportError:
from Queue import Queue # noqa:F401

from collections import namedtuple, defaultdict
from collections import namedtuple, defaultdict, Iterable
from timeit import default_timer

from numpy import zeros, float32 as REAL, empty, ones, \
Expand Down Expand Up @@ -794,6 +794,23 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor

"""
kwargs = {}

# Check if both documents and corpus_file are None
if corpus_file is None and documents is None:
raise TypeError("Either one of corpus_file or documents value must be provided.")
Copy link
Collaborator

@mpenkov mpenkov May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to include a period. Please apply this to other messages as well.

Suggested change
raise TypeError("Either one of corpus_file or documents value must be provided.")
raise TypeError("Either one of corpus_file or documents value must be provided")

I tried committing these nitpicks to your branch myself, but I don't have permissions.


# Check if both documents and corpus_file are not None
if corpus_file is not None and documents is not None:
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the above.

Suggested change
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.")
raise TypeError("Both corpus_file and documents may not be provided at the same time")


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to combine these two cases with an XOR: if (corpus_file is None) ^ (documents is None):.

Copy link
Contributor Author

@saraswatmks saraswatmks May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way for readability. Also, I noticed True ^ True is False. If I do it the xor way, this case will get skipped.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd actually need to be if not ((corpus_file is None) ^ (documents is None)): - but then essentially one error message, roughly "supply one or the other but not both", would be fine for either mistake.

Copy link
Owner

@piskvorky piskvorky May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong -1 on this: hard to read and reason about. Please stick to simpler (and easier to maintain) constructs .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clearest approach would be not to have a method with such delicate XOR one-but-not-the-other positional parameters - use separate methods with distinct names! But failing that, a one-line XOR, whose next line is an error message with a plain-english description of the constraint, seems pretty simple to me - just accurately reflecting the design choice already made.

# Check if corpus_file is string type
if documents is None and not os.path.isfile(corpus_file):
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file)
raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead." % corpus_file)

Better than %s, because corpus_file may contain spaces.


# Check if documents is iterable
if documents is not None and not isinstance(documents, Iterable):
raise TypeError("documents must be an iterable of list, got %s instead." % documents)

if corpus_file is not None:
# Calculate offsets for each worker along with initial doctags (doctag ~ document/line number in a file)
offsets, start_doctags = self._get_offsets_and_start_doctags_for_corpusfile(corpus_file, self.workers)
Expand Down
18 changes: 17 additions & 1 deletion gensim/test/test_doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,22 @@ def testLoadOldModel(self):
sims_to_infer = loaded_model.docvecs.most_similar([doc0_inferred], topn=len(loaded_model.docvecs))
self.assertTrue(sims_to_infer)

def testDoc2vecTrainParameters(self):

model = doc2vec.Doc2Vec(vector_size=50)
model.build_vocab(documents=list_corpus)

# check if corpus_file is not a string
self.assertRaises(TypeError, model.train, corpus_file=11111)
# check if documents is an iterable
self.assertRaises(TypeError, model.train, documents=11111)
# check is both the parameters are provided
self.assertRaises(TypeError, model.train, documents=sentences, corpus_file='test')
# check if both the parameters are left empty
self.assertRaises(TypeError, model.train, documents=None, corpus_file=None)
# check if corpus_file is an iterable
self.assertRaises(TypeError, model.train, corpus_file=sentences)

@unittest.skipIf(os.name == 'nt', "See another test for Windows below")
def test_get_offsets_and_start_doctags(self):
# Each line takes 6 bytes (including '\n' character)
Expand Down Expand Up @@ -387,7 +403,7 @@ def model_sanity(self, model, keep_training=True):
tmpf = get_tmpfile('gensim_doc2vec.tst')
model.save(tmpf)
loaded = doc2vec.Doc2Vec.load(tmpf)
loaded.train(sentences, total_examples=loaded.corpus_count, epochs=loaded.epochs)
loaded.train(documents=sentences, total_examples=loaded.corpus_count, epochs=loaded.epochs)

def test_training(self):
"""Test doc2vec training."""
Expand Down