
Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) #1423

Merged
merged 16 commits on Jul 18, 2017
Changes from 4 commits
48 changes: 48 additions & 0 deletions gensim/models/word2vec.py
@@ -1521,6 +1521,54 @@ def __iter__(self):
i += self.max_sentence_length


class LineSentencePath(object):
@gojomo (Collaborator) commented on Jun 17, 2017:

Personally I'd consider the name PathLineSentences more typical and descriptive, but others may have an even better name.

Contributor Author:

Thank you for taking the time to comment. I'll get to these next week.

Contributor Author:

Change made, will be reflected in next pull request

"""
Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
Like LineSentence, but will process all files in a directory in alphabetical order by filename
"""

def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
"""
`source` should be a path to a directory (as a string) where all files can be opened by the
LineSentence class. Each file will be read up to
`limit` lines (or no clipped if limit is None, the default).
Owner:

no => not.


Example::

sentences = LineSentencePath(os.getcwd() + '\\corpus\\')

The files in the directory should be either text files, .bz2 files, or .gz files.

"""
self.source = source
self.max_sentence_length = max_sentence_length
self.limit = limit

try:
self.source = os.path.join(source, '') # ensures os-specific slash is at end of path
Collaborator:

Thoughts on coverage of all related needs:

  • perhaps this should accept a path to a single file, too, and still work in that case?
  • by deferring the actual resolution of initialization parameters to the beginning of __iter__(), the object might be more robust for cases where files are arriving in the target directory between instantiation & 1st iteration. OTOH, that would also mean repeated iterations – as in the common Word2Vec/Doc2Vec multi-pass training, could find different files each time. No strong opinion yet on which approach is better – just pointing out the choice.

Contributor Author:

Change made to accept a single file and still work, including an additional test case.

I think it is better to resolve the initialization parameters in the __init__(). While there could be some use in not requiring the files to all be present when the object is initialized, I think that possibly changing the files processed every time a new iteration starts is likely to cause confusion. It seems more natural that the default behavior would be to get a list of files and not change them as long as the object is used. This would match the behavior of LineSentence--if you change the contents of the file between iterations, you'll get different results, but you can't change the reference to the file after the object has been created.

I would personally be caught off guard if the files changed between iterations. While this could be useful in some cases, I think it is a risky default behavior. Adding some capabilities to do this, however, may make sense. But I'd rather not do that unless a compelling use case is presented.

What I've done instead is log the list of files read when the object is created at the info level, so there's some sort of explicit record available of what the object is reading.
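
For illustration only, a minimal sketch of how the single-file case might be handled (a hypothetical helper using os.path.isfile/os.path.isdir checks; this is not the code from the later commits):

    import os

    def _resolve_input_files(source):
        """Return a sorted list of full paths, whether `source` is a single file or a directory."""
        # Hypothetical sketch -- not the actual change made in the later commits.
        if os.path.isfile(source):
            return [source]  # a single file behaves like LineSentence on that file
        elif os.path.isdir(source):
            source = os.path.join(source, '')  # ensure os-specific separator at the end of the path
            return sorted(source + name for name in os.listdir(source))
        else:
            raise ValueError('input source is neither a file nor a directory')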

logging.debug('reading directory ' + source)
self.input_files = os.listdir(source)
except OSError:
raise ValueError('input is a file, not a path, use word2vec.LineSentence')
except NameError:
raise ValueError('input source is not a path')

self.input_files = os.listdir(source)
self.input_files.sort() # makes sure it happens in filename order

def __iter__(self):
'''iterate through the files'''
for file_name in self.input_files:
logging.info('reading file ' + file_name + '\n')
with utils.smart_open(self.source + file_name) as fin:
for line in itertools.islice(fin, self.limit):
line = utils.to_unicode(line).split()
i = 0
while i < len(line):
yield line[i : i + self.max_sentence_length]
i += self.max_sentence_length
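
Note that __iter__ splits long lines into chunks of at most max_sentence_length words; for example, with max_sentence_length=3, a five-word line is yielded as one three-word sentence followed by one two-word sentence. As a usage sketch, assuming the class is eventually renamed PathLineSentences (as suggested in the review above); the path 'corpus/' and the hyperparameters are placeholders:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import PathLineSentences

    # Stream one sentence per line from every file in the directory, in filename order.
    sentences = PathLineSentences('corpus/')
    model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
    model.save('word2vec_from_directory.model')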


# Example: ./word2vec.py -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3
if __name__ == "__main__":
import argparse