Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) #1423

Merged
merged 16 commits into from
Jul 18, 2017

Conversation

michaelwsherman
Copy link
Contributor

word2vec.LineSentencePath(path) will read all the files in a directory in the same fashion as word2vec.LineSentence reads a file. This provides an easy way to use a corpus of multiple files when training a word2vec model (or any model compatible with word2vec.LineSentence).

Minimal exeception handling, but some logging.

Michael Sherman added 4 commits June 16, 2017 11:49
added method models.word2vec.LineSentencePath

method to read an entire directory's files in the same style as
models.word2vec.LineSentence
initial attempt at test, including files. test just splits the
lee_background.cor file into two parts and puts them in a directory,
then makes sure they match the unsplit file as loaded by
word2vec.LineSentence
no longer sensitive to an input without a trailing os-specific slash
@michaelwsherman michaelwsherman changed the title Issue #1364 -- word2vec.LineSentencePath class to read a directory as a corpus word2vec.LineSentencePath class to read a directory as a corpus (#1364) Jun 16, 2017
@gojomo
Copy link
Collaborator

gojomo commented Jun 17, 2017

Thanks for the contrib!

The failing automatic check looks like some style issues - you can click the 'Details' link or red 'X' to view the failure logs for more hints.

More substantive comments inline.

@@ -1521,6 +1521,54 @@ def __iter__(self):
i += self.max_sentence_length


class LineSentencePath(object):
Copy link
Collaborator

@gojomo gojomo Jun 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'd consider the name PathLineSentences more typical and descriptive, but other may have an even better name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to comment. I'll get to these next week.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change made, will be reflected in next pull request

self.limit = limit

try:
self.source = os.path.join(source, '') # ensures os-specific slash is at end of path
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on coverage of all related needs:

  • perhaps this should accept a path to a single file, too, and still work in that case?
  • by deferring the actual resolution of initialization parameters to the beginning of __iter__(), the object might be more robust for cases where files are arriving in the target directory between instantiation & 1st iteration. OTOH, that would also mean repeated iterations – as in the common Word2Vec/Doc2Vec multi-pass training, could find different files each time. No strong opinion yet on which approach is better – just pointing out the choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change made to accept a single file and still work, including an additional test case.

I think it is better to resolve the initialization parameters in the __init__(). While there could be some use in not requiring the files to all be present when the object is initialized, I think that possibly changing the files processed every time a new iteration starts is likely to cause confusion. It seems more natural that the default behavior would be to get a list of files and not change them as long as the object is used. This would match the behavior of LineSentence--if you change the contents of the file between iterations, you'll get different results, but you can't change the reference to the file after the object has been created.

I would personally be caught off guard if the files changed between iterations. While this could be useful in some cases, I think it is a risky default behavior. Adding some capabilities to do this, however, may make sense. But I'd rather not do that unless a compelling use case is presented.

What I've done instead is log the list of files read when the object is created at the info level, so there's some sort of explicit record available of what the object is reading.

"""Does LineSentencePath work with a path argument?"""
logging.debug(word2vec)
with utils.smart_open(datapath('lee_background.cor')) as orig:
sentences = word2vec.LineSentencePath(datapath('LineSentencePath'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps rather than creating a custom new split (and duplication) of the lee_background.cor, a new subdir of test-data lee could be added, with just the lee.cor and lee_background.cor files. The test would make sure this new class on the directory yields the same set of docs as the two other files read individually. And eventually, the duplication of these files in two places could be eliminated by making all the other tests/demos/tutorials just grab the lee* files from this new canonical place. (That is: aim for less duplication in the long run by re-using the same files everywhere.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but I think making a lee directory may be worse than having a special PathLineSentences directory.

It seems there are quite a few lee* files in test_data already. I could move those two files you mentioned, but then we'd have a bunch of lee files still in the main test_data directory. Those files could be moved as well, except some of them aren't compatible with PathLineSentences (.bin, .vec) which meant that they couldn't be moved without breaking the test. Then we would have lee* files in two places, which could lead to confusion later on. There is also the issue that something that makes sense to add to this lee directory in the future could break the test.

Given that the PathLinesSentences class operates on a full directory (which makes it somewhat unique) I think it makes more sense to keep it's test dependent on a directory that isn't overloaded for use with other tests.

What I will do is use smaller files that aren't duplicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change made, put in a new tiny test corpus, and changed the test to read the files and combine them manually and compare to PathLineSentences, rather than having a single file duplicating the contents of the PathLineSentences folder in test_data

Michael Sherman added 6 commits June 19, 2017 11:09
in word2vec.py . Test updated as well
in models.word2vec . Tests also updated
had only 1 space before an inline comment, flagged by travis CI build
Removed LineSentencePath directory, created PathLineSentences
lee corpus duplicates were in LineSentencePath, was wasting space
made new small corpus to test PathLineSentences, put in directory
changed test to read both files manually, combine, and compare to
PathLineSentences (rather than having a separate single file to match
the entire contents of the PathLineSentences test_data directory
changed PathLineSentences to support a single file in addition to a
directory, raises a warning to use LineSentence when a single file is
given as a parameter. added corresponding test.
@michaelwsherman
Copy link
Contributor Author

I think I've addressed all the comments. Please re-review.

I'm having some problems with figuring out what travis-ci is angry about. I did find an inline comment missing a whitespace and fixed that. But I think the other comments are due to extra spaces in [i : i +...] line and that's elsewhere in the files. I also see errors with the new corpus file--not sure how to tell travis-ci to not treat those files as code. I'll review after the current style check is done and see if I can get to the bottom of things.

@michaelwsherman
Copy link
Contributor Author

Nevermind, looks like travis just checks my changes -- not code already in the repo. But I did fix another style issue elsewhere in the code.

I think the style check should pass now.

Michael Sherman and others added 3 commits June 19, 2017 13:51
@michaelwsherman
Copy link
Contributor Author

Cleaning up my noob git branching fail, no more dangling branches not actually getting used.

@michaelwsherman
Copy link
Contributor Author

@gojomo or other maintainers--Any other changes? Any questions? Please let me know. Thank you.

@menshikh-iv menshikh-iv changed the title word2vec.LineSentencePath class to read a directory as a corpus (#1364) Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) Jul 18, 2017
@menshikh-iv
Copy link
Contributor

Looks good for me, thank you @michaelwsherman, congratz with your first PR:1st_place_medal:

@menshikh-iv menshikh-iv merged commit b818c91 into piskvorky:develop Jul 18, 2017
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I know I'm late with the review.

Adding some minor code style comments. @menshikh-iv

self.limit = limit

if os.path.isfile(self.source):
logging.warning('single file read, better to use models.word2vec.LineSentence')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be logger (each module has its own logger in gensim).

Applies here and elsewhere in this PR.

self.source = os.path.join(self.source, '') # ensures os-specific slash at end of path
logging.debug('reading directory ' + self.source)
self.input_files = os.listdir(self.source)
self.input_files = [self.source + file for file in self.input_files] # make full paths
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file is a reserved keyword in Python. Better use filename or something like that.

else: # not a file or a directory, then we can't do anything with it
raise ValueError('input is neither a file nor a path')

logging.info('files read into PathLineSentences:' + '\n'.join(self.input_files))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to pass the formatting arguments as arguments: logger.info("%s", y) instead of logger.info("%s" % y) .

Here and elsewhere.

"""
`source` should be a path to a directory (as a string) where all files can be opened by the
LineSentence class. Each file will be read up to
`limit` lines (or no clipped if limit is None, the default).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no => not.

@michaelwsherman
Copy link
Contributor Author

Fixes from @piskvorky in PR #1573 .

menshikh-iv pushed a commit that referenced this pull request Oct 24, 2017
* initial commit of fixes in comments of #1423

* removed unnecessary space in logger

* added support for custom Phrases scorers

* fixed Phrases.__getitem__ to support pluggable scoring #1533

* travisCI style fixes

* fixed __next__() to next() for python 3 compatibilyt

* misc fixes

* spacing fixes for style

* custom scorer support in sklearn api

* Phrases scikit interface tests for pluggable scoring

* missing line breaks

* style, clarity, and robustness fixes requested by @piskvorky

* check in Phrases init to make sure scorer is pickleable

* backwards scoring compatibility when loading a Phrases class

* removal of pickle testing objects in Phrases init

* switched to six for python 2/3 compatibility

* fix docstring
menshikh-iv added a commit that referenced this pull request Oct 25, 2017
@menshikh-iv menshikh-iv mentioned this pull request Oct 25, 2017
menshikh-iv added a commit that referenced this pull request Oct 25, 2017
* replace open->smart_open in annoy tutorial

* style fixes for lda model diff

* fix for #1390

* fix for #1423

* fix doc in Phrases
horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017
…iskvorky#1573)

* initial commit of fixes in comments of piskvorky#1423

* removed unnecessary space in logger

* added support for custom Phrases scorers

* fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533

* travisCI style fixes

* fixed __next__() to next() for python 3 compatibilyt

* misc fixes

* spacing fixes for style

* custom scorer support in sklearn api

* Phrases scikit interface tests for pluggable scoring

* missing line breaks

* style, clarity, and robustness fixes requested by @piskvorky

* check in Phrases init to make sure scorer is pickleable

* backwards scoring compatibility when loading a Phrases class

* removal of pickle testing objects in Phrases init

* switched to six for python 2/3 compatibility

* fix docstring
horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017
* replace open->smart_open in annoy tutorial

* style fixes for lda model diff

* fix for piskvorky#1390

* fix for piskvorky#1423

* fix doc in Phrases
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants