Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) #1423

michaelwsherman · 2017-06-16T22:23:08Z

word2vec.LineSentencePath(path) will read all the files in a directory in the same fashion as word2vec.LineSentence reads a file. This provides an easy way to use a corpus of multiple files when training a word2vec model (or any model compatible with word2vec.LineSentence).

Minimal exception handling, but some logging.

added method models.word2vec.LineSentencePath method to read an entire directory's files in the same style as models.word2vec.LineSentence

initial attempt at test, including files. test just splits the lee_background.cor file into two parts and puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence

no longer sensitive to an input without a trailing os-specific slash
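For readers unfamiliar with `LineSentence`, the behaviour described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code from this PR: the class name `DirectoryLineSentences` is hypothetical, and it assumes a flat directory of plain UTF-8 text files with one whitespace-separated sentence per line.

```python
import os
import tempfile


class DirectoryLineSentences(object):
    """Hypothetical stand-in: iterate over every file in a directory, one sentence per line."""

    def __init__(self, source, limit=None):
        self.limit = limit  # max lines to read from each file (None means no limit)
        # Sorted so the iteration order is deterministic across platforms.
        self.input_files = sorted(
            os.path.join(source, filename) for filename in os.listdir(source)
        )

    def __iter__(self):
        for path in self.input_files:
            with open(path, encoding='utf-8') as fin:
                for line_no, line in enumerate(fin):
                    if self.limit is not None and line_no >= self.limit:
                        break
                    yield line.split()


# Usage: build a two-file corpus in a temporary directory and iterate over it.
corpus_dir = tempfile.mkdtemp()
for name, text in [('a.txt', 'human machine interface\n'),
                   ('b.txt', 'graph of trees\nminors survey\n')]:
    with open(os.path.join(corpus_dir, name), 'w', encoding='utf-8') as fout:
        fout.write(text)

sentences = list(DirectoryLineSentences(corpus_dir))
print(sentences)
```

Because the object is a restartable iterable rather than a one-shot generator, it can be iterated several times, which multi-pass training relies on.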

gojomo · 2017-06-17T00:06:48Z

Thanks for the contrib!

The failing automatic check looks like some style issues - you can click the 'Details' link or red 'X' to view the failure logs for more hints.

More substantive comments inline.

gojomo · 2017-06-17T00:07:35Z

gensim/models/word2vec.py

@@ -1521,6 +1521,54 @@ def __iter__(self):
                        i += self.max_sentence_length


+class LineSentencePath(object):


Personally I'd consider the name PathLineSentences more typical and descriptive, but others may have an even better name.

michaelwsherman · 2017-06-17

Thank you for taking the time to comment. I'll get to these next week.

michaelwsherman · 2017-06-19

Change made, will be reflected in next pull request.

gojomo · 2017-06-17T00:16:18Z

gensim/models/word2vec.py

+        self.limit = limit
+
+        try:
+            self.source = os.path.join(source, '') # ensures os-specific slash is at end of path


Thoughts on coverage of all related needs:

* Perhaps this should accept a path to a single file, too, and still work in that case?

* By deferring the actual resolution of initialization parameters to the beginning of __iter__(), the object might be more robust for cases where files are arriving in the target directory between instantiation and first iteration. OTOH, that would also mean repeated iterations (as in the common Word2Vec/Doc2Vec multi-pass training) could find different files each time. No strong opinion yet on which approach is better; just pointing out the choice.

michaelwsherman · 2017-06-19

Change made to accept a single file and still work, including an additional test case.

I think it is better to resolve the initialization parameters in the __init__(). While there could be some use in not requiring the files to all be present when the object is initialized, I think that possibly changing the files processed every time a new iteration starts is likely to cause confusion. It seems more natural that the default behavior would be to get a list of files and not change them as long as the object is used. This would match the behavior of LineSentence--if you change the contents of the file between iterations, you'll get different results, but you can't change the reference to the file after the object has been created.

I would personally be caught off guard if the files changed between iterations. While this could be useful in some cases, I think it is a risky default behavior. Adding some capabilities to do this, however, may make sense. But I'd rather not do that unless a compelling use case is presented.

What I've done instead is log the list of files read when the object is created at the info level, so there's some sort of explicit record available of what the object is reading.
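The trade-off discussed in this thread can be made concrete with a small sketch. Both class names below are hypothetical and neither is the PR's code; they only contrast taking a snapshot of the file list in `__init__()` with re-listing the directory at the start of every `__iter__()`.

```python
import os
import tempfile


class EagerFiles(object):
    """File list is fixed once, at construction time."""

    def __init__(self, source):
        self.input_files = sorted(os.listdir(source))

    def __iter__(self):
        return iter(self.input_files)


class LazyFiles(object):
    """File list is re-read at the start of every iteration."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return iter(sorted(os.listdir(self.source)))


def touch(directory, name):
    """Create an empty file (illustrative helper)."""
    with open(os.path.join(directory, name), 'w'):
        pass


corpus_dir = tempfile.mkdtemp()
touch(corpus_dir, 'part1.txt')
eager, lazy = EagerFiles(corpus_dir), LazyFiles(corpus_dir)

first_pass_eager, first_pass_lazy = list(eager), list(lazy)
touch(corpus_dir, 'part2.txt')  # a file arrives between training passes
second_pass_eager, second_pass_lazy = list(eager), list(lazy)

print(second_pass_eager)  # still only the file seen at construction
print(second_pass_lazy)   # now includes the late arrival
```

With the eager variant every pass sees the same files, which is the behaviour argued for above; the lazy variant silently changes the corpus between passes.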

gojomo · 2017-06-17T00:23:43Z

gensim/test/test_word2vec.py

+        """Does LineSentencePath work with a path argument?"""
+        logging.debug(word2vec)
+        with utils.smart_open(datapath('lee_background.cor')) as orig:
+            sentences = word2vec.LineSentencePath(datapath('LineSentencePath'))


Perhaps rather than creating a custom new split (and duplication) of the lee_background.cor, a new subdir of test-data lee could be added, with just the lee.cor and lee_background.cor files. The test would make sure this new class on the directory yields the same set of docs as the two other files read individually. And eventually, the duplication of these files in two places could be eliminated by making all the other tests/demos/tutorials just grab the lee* files from this new canonical place. (That is: aim for less duplication in the long run by re-using the same files everywhere.)

michaelwsherman · 2017-06-19

I thought about this, but I think making a lee directory may be worse than having a special PathLineSentences directory.

It seems there are quite a few lee* files in test_data already. I could move those two files you mentioned, but then we'd have a bunch of lee files still in the main test_data directory. Those files could be moved as well, except some of them aren't compatible with PathLineSentences (.bin, .vec), which means that they couldn't be moved without breaking the test. Then we would have lee* files in two places, which could lead to confusion later on. There is also the issue that something that makes sense to add to this lee directory in the future could break the test.

Given that the PathLineSentences class operates on a full directory (which makes it somewhat unique), I think it makes more sense to keep its test dependent on a directory that isn't overloaded for use with other tests.

What I will do is use smaller files that aren't duplicated.

michaelwsherman · 2017-06-19

Change made, put in a new tiny test corpus, and changed the test to read the files and combine them manually and compare to PathLineSentences, rather than having a single file duplicating the contents of the PathLineSentences folder in test_data.
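The test strategy described here (read each file by hand, concatenate, compare with the directory reader) might look roughly like the sketch below. It is self-contained and does not use gensim: `iter_directory_sentences` is a hypothetical stand-in for the class under test, and the file names and contents are made up for illustration.

```python
import os
import tempfile


def iter_directory_sentences(directory):
    """Hypothetical stand-in for the reader under test: yields one token list per line."""
    for filename in sorted(os.listdir(directory)):
        with open(os.path.join(directory, filename), encoding='utf-8') as fin:
            for line in fin:
                yield line.split()


def check_directory_matches_manual_read():
    # Build a tiny two-file corpus (contents are illustrative only).
    corpus_dir = tempfile.mkdtemp()
    contents = {
        '1.txt': 'first file line one\nfirst file line two\n',
        '2.txt': 'second file line one\n',
    }
    for filename, text in contents.items():
        with open(os.path.join(corpus_dir, filename), 'w', encoding='utf-8') as fout:
            fout.write(text)

    # Read both files by hand and combine them, as the updated test does.
    expected = []
    for filename in ('1.txt', '2.txt'):
        with open(os.path.join(corpus_dir, filename), encoding='utf-8') as fin:
            expected.extend(line.split() for line in fin)

    actual = list(iter_directory_sentences(corpus_dir))
    assert actual == expected
    return actual


result = check_directory_matches_manual_read()
```

Comparing against a manual read avoids keeping a third file in test_data that duplicates the directory's contents.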

* Removed LineSentencePath directory, created PathLineSentences
* lee corpus duplicates were in LineSentencePath, was wasting space
* made new small corpus to test PathLineSentences, put in directory
* changed test to read both files manually, combine, and compare to PathLineSentences (rather than having a separate single file to match the entire contents of the PathLineSentences test_data directory)

changed PathLineSentences to support a single file in addition to a directory, raises a warning to use LineSentence when a single file is given as a parameter. added corresponding test.

michaelwsherman · 2017-06-19T17:42:18Z

I think I've addressed all the comments. Please re-review.

I'm having some problems with figuring out what travis-ci is angry about. I did find an inline comment missing a whitespace and fixed that. But I think the other comments are due to extra spaces in [i : i +...] line and that's elsewhere in the files. I also see errors with the new corpus file--not sure how to tell travis-ci to not treat those files as code. I'll review after the current style check is done and see if I can get to the bottom of things.

michaelwsherman · 2017-06-19T17:48:48Z

Never mind, looks like travis just checks my changes -- not code already in the repo. But I did fix another style issue elsewhere in the code.

I think the style check should pass now.

resolved test_word2vec.py manually

michaelwsherman · 2017-06-23T21:27:59Z

Cleaning up my noob git branching fail, no more dangling branches not actually getting used.

michaelwsherman · 2017-06-24T22:42:34Z

@gojomo or other maintainers--Any other changes? Any questions? Please let me know. Thank you.

menshikh-iv · 2017-07-18T09:12:45Z

Looks good to me, thank you @michaelwsherman, congrats on your first PR :1st_place_medal:

piskvorky · 2017-07-25

Sorry, I know I'm late with the review.

Adding some minor code style comments. @menshikh-iv

piskvorky · 2017-07-25T05:19:09Z

gensim/models/word2vec.py

+        self.limit = limit
+
+        if os.path.isfile(self.source):
+            logging.warning('single file read, better to use models.word2vec.LineSentence')


Should be logger (each module has its own logger in gensim).

Applies here and elsewhere in this PR.
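As a rough illustration of this comment: a module-level logger is conventionally created with `logging.getLogger(__name__)` and used instead of calling the root `logging` functions directly. The helper function name below is hypothetical; only the warning text comes from the diff above.

```python
import logging

# One logger per module, named after the module, instead of the root logger.
logger = logging.getLogger(__name__)


def warn_single_file(source):
    """Hypothetical helper: emit the single-file warning through the module logger."""
    logger.warning('single file read, better to use models.word2vec.LineSentence: %s', source)
```

Records emitted this way carry the module's name, so users can raise or lower verbosity per module rather than for the whole application.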

piskvorky · 2017-07-25T05:20:08Z

gensim/models/word2vec.py

+            self.source = os.path.join(self.source, '')  # ensures os-specific slash at end of path
+            logging.debug('reading directory ' + self.source)
+            self.input_files = os.listdir(self.source)
+            self.input_files = [self.source + file for file in self.input_files]  # make full paths


file shadows a built-in name in Python 2 (the file type), although it is not a reserved keyword. Better to use filename or something like that.

piskvorky · 2017-07-25T05:21:01Z

gensim/models/word2vec.py

+        else:  # not a file or a directory, then we can't do anything with it
+            raise ValueError('input is neither a file nor a path')
+
+        logging.info('files read into PathLineSentences:' + '\n'.join(self.input_files))


Better to pass the formatting arguments as arguments: logger.info("%s", y) instead of logger.info("%s" % y).

Here and elsewhere.
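One reason for this advice is that the `logging` module only interpolates the arguments if the record is actually going to be emitted. The sketch below demonstrates that with a hypothetical `CountingList` class that counts how often it is converted to a string; the logger name and file names are made up.

```python
import logging

logger = logging.getLogger('lazy_formatting_demo')
logger.setLevel(logging.WARNING)  # INFO records are dropped by this logger


class CountingList(object):
    """Hypothetical wrapper that records how many times it is rendered as a string."""

    def __init__(self, items):
        self.items = items
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return '\n'.join(self.items)


eager_files = CountingList(['a.txt', 'b.txt'])
lazy_files = CountingList(['a.txt', 'b.txt'])

# The string is built immediately with %, then the record is discarded.
logger.info('files read into PathLineSentences:\n%s' % eager_files)

# The argument is only formatted if INFO is enabled, which it is not here.
logger.info('files read into PathLineSentences:\n%s', lazy_files)

print(eager_files.str_calls, lazy_files.str_calls)
```

The first call pays the formatting cost even though nothing is logged; the second does not.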

piskvorky · 2017-07-25T05:21:55Z

gensim/models/word2vec.py

+        """
+        `source` should be a path to a directory (as a string) where all files can be opened by the
+        LineSentence class. Each file will be read up to
+        `limit` lines (or no clipped if limit is None, the default).


no => not.

michaelwsherman · 2017-09-06T18:27:08Z

Fixes from @piskvorky in PR #1573.

@piskvorky

* initial commit of fixes in comments of #1423
* removed unnecessary space in logger
* added support for custom Phrases scorers
* fixed Phrases.__getitem__ to support pluggable scoring #1533
* travisCI style fixes
* fixed __next__() to next() for python 3 compatibility
* misc fixes
* spacing fixes for style
* custom scorer support in sklearn api
* Phrases scikit interface tests for pluggable scoring
* missing line breaks
* style, clarity, and robustness fixes requested by @piskvorky
* check in Phrases init to make sure scorer is pickleable
* backwards scoring compatibility when loading a Phrases class
* removal of pickle testing objects in Phrases init
* switched to six for python 2/3 compatibility
* fix docstring

* replace open->smart_open in annoy tutorial
* style fixes for lda model diff
* fix for #1390
* fix for #1423
* fix doc in Phrases

Michael Sherman added 4 commits June 16, 2017 11:49

issue piskvorky#1364 first commit, corpus from a directory

44fb606

added method models.word2vec.LineSentencePath method to read an entire directory's files in the same style as models.word2vec.LineSentence

test for word2vec.LineSentencePath issue piskvorky#1364

0a62352

initial attempt at test, including files. test just splits the lee_background.cor file into two parts and puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence

better handling of input for LineSentencePath

b55a844

no longer sensitive to an input without a trailing os-specific slash

Merge branch 'LineSentencePath' into develop

bde9cfd

michaelwsherman mentioned this pull request Jun 16, 2017

Better support for training word2vec models on a corpus consisting of multiple files #1364

Closed

michaelwsherman changed the title ~~Issue #1364 -- word2vec.LineSentencePath class to read a directory as a corpus~~ word2vec.LineSentencePath class to read a directory as a corpus (#1364) Jun 16, 2017

gojomo reviewed Jun 17, 2017

Michael Sherman added 6 commits June 19, 2017 11:09

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

86517a8

…into develop

LineSentencePath renamed PathLineSentences

aef2879

in word2vec.py . Test updated as well

LineSentencePath rename to PathLineSentences

6a21b80

in models.word2vec . Tests also updated

fix whitespace style error

f362e33

had only 1 space before an inline comment, flagged by travis CI build

word2vec.PathLineSentences single file support

ac49054

changed PathLineSentences to support a single file in addition to a directory, raises a warning to use LineSentence when a single file is given as a parameter. added corresponding test.

fixing style issues

bda1fe7

Michael Sherman and others added 3 commits June 19, 2017 13:51

fix style issue

83eb848

Merge branch 'release-2.2.0'

dfd1f8e

Merge branch 'develop' into LineSentencePath

4125143

resolved test_word2vec.py manually

Michael Sherman added 2 commits June 23, 2017 17:28

Merge branch 'master' of https://github.com/RaRe-Technologies/gensim …

14c2265

…into develop

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

45b92f2

…into develop

menshikh-iv changed the title ~~word2vec.LineSentencePath class to read a directory as a corpus (#1364)~~ Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) Jul 18, 2017

menshikh-iv merged commit b818c91 into piskvorky:develop Jul 18, 2017

piskvorky reviewed Jul 25, 2017

menshikh-iv added the style checking label Jul 25, 2017

michaelwsherman mentioned this pull request Aug 3, 2017

models.Phrases multiple scoring methods (#1363) #1464

Merged

michaelwsherman pushed a commit to bloomberg/gensim that referenced this pull request Sep 5, 2017

initial commit of fixes in comments of piskvorky#1423

21c4401

michaelwsherman mentioned this pull request Sep 6, 2017

1533 fix and 1464 1423 comments #1573

Merged

menshikh-iv added a commit that referenced this pull request Oct 25, 2017

fix for #1423

5e312a5

menshikh-iv mentioned this pull request Oct 25, 2017

Small style fixes #1650

Merged

menshikh-iv added a commit that referenced this pull request Oct 25, 2017

Fix code/docstring style (#1650)

67d9634

* replace open->smart_open in annoy tutorial * style fixes for lda model diff * fix for #1390 * fix for #1423 * fix doc in Phrases

menshikh-iv removed the style checking label Oct 26, 2017

horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017

Fix code/docstring style (piskvorky#1650)

754ea54

* replace open->smart_open in annoy tutorial * style fixes for lda model diff * fix for piskvorky#1390 * fix for piskvorky#1423 * fix doc in Phrases
