-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding type check for corpus_file argument #2469
Changes from 11 commits
40989ba
ad76b83
e1f32a2
a9eeddf
6639089
2ca51ca
6458aea
bc9dce6
97d5619
cdaff9f
b046292
298e8f0
4fb01f4
07a7d5d
8821b92
bfc4360
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -70,7 +70,7 @@ | |||||
except ImportError: | ||||||
from Queue import Queue # noqa:F401 | ||||||
|
||||||
from collections import namedtuple, defaultdict | ||||||
from collections import namedtuple, defaultdict, Iterable | ||||||
from timeit import default_timer | ||||||
|
||||||
from numpy import zeros, float32 as REAL, empty, ones, \ | ||||||
|
@@ -794,6 +794,23 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor | |||||
|
||||||
""" | ||||||
kwargs = {} | ||||||
|
||||||
# Check if both documents and corpus_file are None | ||||||
if corpus_file is None and documents is None: | ||||||
raise TypeError("Either one of corpus_file or documents value must be provided.") | ||||||
|
||||||
# Check if both documents and corpus_file are not None | ||||||
if not ((corpus_file is None) ^ (documents is None)): | ||||||
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.") | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For consistency with the above.
Suggested change
|
||||||
|
||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd be tempted to combine these two cases with an XOR: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did it this way for readability. Also, I noticed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it'd actually need to be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Strong -1 on this: hard to read and reason about. Please stick to simpler (and easier to maintain) constructs . There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The clearest approach would be not to have a method with such delicate XOR one-but-not-the-other positional parameters - use separate methods with distinct names! But failing that, a one-line XOR, whose next line is an error message with a plain-english description of the constraint, seems pretty simple to me - just accurately reflecting the design choice already made. |
||||||
# Check if corpus_file is string type | ||||||
if documents is None and not isinstance(corpus_file, string_types): | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wouldn't that throw an exception for incorrect (non-string) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could wrap it in a _is_valid_corpus(corpus_file) function for malformed input. Given that this input validation will be happening in mulitple places, it's probably not a bad idea. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @piskvorky @mpenkov I've kept it simple. I used Also, to do this for fasttext, I think it will be better:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm -1 on creating separate PRs for this, for several reasons:
For your second point, sure, I think it's a good idea to wait until people come to a consensus before expanding your approach to the other models. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mpenkov Could you please review the latest changes and let me know is we are good to go or something else needs to be done ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. Left you some nitpicky comments. In the future, please consider ticking "allow commits from maintainers" when creating a PR on github. That allows the reviewers of your PR to push changes to your branch, potentially shortcutting some of the back-and-forth for minor issues like typos, formatting, etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mpenkov thanks for the suggestion. I believe we are solid now on this PR. I will now make these changes in fasttext module as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, please go ahead. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mpenkov I've done the changes in fasttext module as well. Please have a look. |
||||||
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Better than %s, because corpus_file may contain spaces. |
||||||
|
||||||
# Check if documents is iterable | ||||||
if documents is not None and not isinstance(documents, Iterable): | ||||||
raise TypeError("documents must be an iterable of list, got %s instead." % documents) | ||||||
|
||||||
if corpus_file is not None: | ||||||
# Calculate offsets for each worker along with initial doctags (doctag ~ document/line number in a file) | ||||||
offsets, start_doctags = self._get_offsets_and_start_doctags_for_corpusfile(corpus_file, self.workers) | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to include a period. Please apply this to other messages as well.
I tried committing these nitpicks to your branch myself, but I don't have permissions.