Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) #1423
@@ -1521,6 +1521,54 @@ def __iter__(self):
                        i += self.max_sentence_length


class LineSentencePath(object):
    """
    Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
    Like LineSentence, but will process all files in a directory in alphabetical order by filename.
    """

    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        """
        `source` should be a path to a directory (as a string) where all files can be opened by the
        LineSentence class. Each file will be read up to
        `limit` lines (or not clipped if limit is None, the default).

        Example::

            sentences = LineSentencePath(os.getcwd() + '\\corpus\\')

        The files in the directory should be either text files, .bz2 files, or .gz files.

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

        try:
            self.source = os.path.join(source, '')  # ensures os-specific slash is at end of path
Thoughts on coverage of all related needs:

Change made to accept a single file and still work, including an additional test case.

I think it is better to resolve the initialization parameters in the

I would personally be caught off guard if the files changed between iterations. While this could be useful in some cases, I think it is a risky default behavior. Adding some capabilities to do this, however, may make sense. But I'd rather not do that unless a compelling use case is presented.

What I've done instead is log the list of files read when the object is created at the info level, so there's some sort of explicit record available of what the object is reading.
            logging.debug('reading directory ' + source)
            self.input_files = os.listdir(source)
        except OSError:
            raise ValueError('input is a file, not a path, use word2vec.LineSentence')
        except NameError:
            raise ValueError('input source is not a path')

        self.input_files = os.listdir(source)
        self.input_files.sort()  # makes sure it happens in filename order

    def __iter__(self):
        '''iterate through the files'''
        for file_name in self.input_files:
            logging.info('reading file ' + file_name + '\n')
            with utils.smart_open(self.source + file_name) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i : i + self.max_sentence_length]
                        i += self.max_sentence_length


# Example: ./word2vec.py -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3
if __name__ == "__main__":
    import argparse

Personally I'd consider the name `PathLineSentences` more typical and descriptive, but others may have an even better name.
Thank you for taking the time to comment. I'll get to these next week.
Change made; it will be reflected in the next pull request.
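The earlier reply in this thread describes three choices: accept a single file as well as a directory, resolve the file list once when the object is created, and log that list at the info level. A rough sketch of that behaviour, under stated assumptions, might look like the following. The helper name `resolve_input_files`, the error message, and the decision to skip subdirectories are all hypothetical and are not taken from the pull request.

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_input_files(source):
    """Return the sorted list of file paths to read for `source` (hypothetical helper)."""
    if os.path.isfile(source):
        # A single file is treated as a one-file corpus.
        files = [source]
    elif os.path.isdir(source):
        # Snapshot the directory now; files added later are not picked up.
        files = sorted(
            os.path.join(source, name) for name in os.listdir(source)
            if os.path.isfile(os.path.join(source, name))
        )
    else:
        raise ValueError('input is neither a file nor a directory: %r' % (source,))
    # Log the snapshot so there is an explicit record of what will be read.
    logger.info('files read into corpus: %s', ', '.join(files))
    return files
```

Calling such a helper from the constructor keeps every iteration over the corpus consistent, at the cost of ignoring files that appear in the directory afterwards.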