word2vec vocab building is not multi-threaded #400
Comments
Can you say a little more about your use case? What is the performance now, what performance do you need, and why? We can certainly optimize this part if there's a compelling reason to :) Note: threading doesn't generally help much with Python, because of the global interpreter lock (GIL). And it looks like you're putting all sentences into plain lists, in RAM, which wouldn't scale (no streaming). |
It's going through a somewhat large history of GitHub events (the past 7 months, ~1.8 billion words). Memory is not an issue in my case, but with your implementation, building the vocab list is almost as slow as training the network, since it only runs on one core out of 32 (to be clear, threads will run properly on separate cores if core affinity is set correctly, but what I added could have been done just as easily with multiprocessing). I realize the way I'm doing it is specific to my use case (not streaming, no max vocab size), but there's no reason for vocab building not to be multi-threaded/multi-process, since it's such a huge performance loss. Streaming could be solved by having one iterator shared between threads and having the iterator handle locking. Not sure how max vocab size would work. |
Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers? Anyway, yes, parallelizing the vocab building sounds useful. If it clashes with the max vocab size pruning, I'd rather drop that and keep the parallel version. The parallelization optimization scenario sounds more commonly useful than pruning... although it brings more maintenance headaches (multiprocessing works differently on Windows). Do you have a way of testing on Windows? If you find a way to build the vocab in parallel, while streaming the input (not everything in RAM), and being robust across platforms, can you please open a PR @btym? For inspiration, see the actual training a few lines lower, which is also streamed and parallelized. |
I may have been mistaken and read the wrong thing; I only tried the changes … I can test on Windows/OS X/Linux conveniently. Will try to submit a PR if I … |
Hey guys, I've been following this issue as this is also a big concern for me. Training on 2B+ word sets takes a whole lot of time, but most of it is actually spent scanning the vocabulary. Based on @piskvorky's suggestion, I went ahead and added multiprocess support to word2vec's scan_vocab, removing max vocab pruning as it was much more complicated to deal with. Haven't run any quantitative benchmarks yet, but I'm planning to do so tomorrow to compare scaling on many cores and with non-trivial LineSentence implementations. For now the only hiccup is that my implementation uses Counter() objects to easily merge each worker's vocab, and those were only introduced in Python 2.7. Have a look at the PR #406 and tell me what you think! |
I need this feature too. I have about 500 GB of training data in a directory, which consists of about 500 files. It would be helpful if multiple files could be scanned at the same time in the vocabulary-building step. |
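A rough sketch of what such a parallel multi-file scan could look like, using the per-worker Counter merge idea from the previous comment; this is illustrative only (not the code from PR #406) and assumes the corpus is a set of whitespace-tokenized text files, one sentence per line:

import multiprocessing as mp
from collections import Counter

def count_file(path):
    # Each worker builds its own local vocab for a single file.
    local_vocab = Counter()
    with open(path, encoding='utf8') as fin:
        for line in fin:
            local_vocab.update(line.split())
    return local_vocab

def scan_vocab_parallel(paths, workers=4):
    # Merge the per-worker counts into a single global vocab.
    global_vocab = Counter()
    with mp.Pool(workers) as pool:
        for local_vocab in pool.imap_unordered(count_file, paths):
            global_vocab.update(local_vocab)
    return global_vocab

On Windows, the Pool creation would need to sit under an if __name__ == '__main__': guard, which is the cross-platform headache mentioned above.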
I have some Cython code from spaCy that is helpful here:
The strategy is to have two tables: an int-to-int table that stores the counts and only knows the hashes of the strings, and a Python table mapping the hashes to the strings. The Python table is only updated when we cross the minimum frequency threshold. I wrote this up very quickly, but so far the performance seems much, much faster than …

import io
import os

from preshed.counter import PreshCounter
from spacy.strings import hash_string

def iter_dir(directory):
    # Minimal helper assumed by __iter__ below: yield the path of every file in the directory.
    for filename in os.listdir(directory):
        yield os.path.join(directory, filename)

class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()
        self.strings = {}
        self.min_freq = min_freq

    def count_doc(self, words):
        # Get counts for this document
        doc_counts = PreshCounter()
        doc_strings = {}
        for word in words:
            key = hash_string(word)
            doc_counts.inc(key, 1)
            doc_strings[key] = word

        n = 0
        for key, count in doc_counts:
            self.counts.inc(key, count)
            # TODO: Why doesn't inc return this? =/
            corpus_count = self.counts[key]
            # Remember the string when we exceed min count
            if corpus_count >= self.min_freq and (corpus_count - count) < self.min_freq:
                self.strings[key] = doc_strings[key]
            n += count
        return n

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with io.open(text_loc, 'r', encoding='utf8') as file_:
                for sent_str in file_:
                    yield sent_str.split() |
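A possible way to drive the class above over a directory of text files (just a usage sketch under the same assumptions as the code, i.e. preshed and spaCy installed and one sentence per line in each file):

corpus = Corpus('/path/to/texts', min_freq=10)
total_words = 0
for sentence in corpus:                    # streams sentences file by file
    total_words += corpus.count_doc(sentence)
print('%d tokens counted, %d distinct words above min_freq'
      % (total_words, len(corpus.strings)))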
Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning (the procedure that removes words with a count of 1). If the number of unique tokens grows too fast, the dictionary is pruned on nearly every doc. To overcome this, it's better to remove all numbers, URLs and other special tokens from the corpus, e.g. by replacing them with placeholders like NUMBER, DATE, etc. Having preprocessed Russian Wikipedia like this, I got the dictionary built in 5 minutes (1.8B words, ~6 GB of preprocessed text). I would suggest mentioning this in the documentation explicitly. |
@windj007 Do you happen to have example code for replacing tokens with placeholders? I assume it is some simple regexes. It would be great to add it to the documentation. |
For Wikipedia, a regex like this one worked for me:

BAD_TOKENS_RE = re.compile(r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

It matches token types such as URLs, special symbols and filenames. At the time I wrote my previous comment, I had a set of regexes and really did replace tokens with placeholders. Since then, I have merged all the regexes into this one and just drop the matched tokens. |
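For the documentation angle, here is a small sketch of how such a filter might be wired into a streaming corpus before vocab building; the wrapper name is hypothetical, the regex is the one above:

import re

BAD_TOKENS_RE = re.compile(
    r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

def clean_sentences(sentences):
    # Drop URLs, numbers, punctuation runs, image filenames and single characters,
    # so the raw vocab stays small and pruning rarely triggers.
    for sentence in sentences:
        cleaned = [token for token in sentence if not BAD_TOKENS_RE.match(token)]
        if cleaned:
            yield cleaned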
Have you used the above code? |
Is there any progress made with …? I'm setting the following variables related to this function:

MAX_VOCAB_SIZE = None
TRIM_RULE = None
SORTED_VOCAB = 1
BATCH_WORDS = 10000

Any suggestions on how to speed this up? I'm not sure how I should use Honnibal's code. |
vocab = defaultdict(int)
sentences = [user-supplied iterable or LineSentence(text_file_path)]
for sentence in sentences:
    for word in sentence:
        vocab[word] += 1

I see two avenues to speed this up: …
If anyone's interested in implementing either, let me know. |
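One cheap option in that direction (my reading of the above, not a gensim change): push the inner loop into C by counting whole sentences with collections.Counter, whose update method has a C fast path in CPython:

from collections import Counter

def scan_vocab_counter(sentences):
    # Same result as the defaultdict loop above, but the per-word work
    # happens inside Counter.update rather than in Python bytecode.
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)
    return vocab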
Hello! I need this and I'm willing to implement it. I actually have a class that extends an iterator and streams a list of papers from an internal VPN resource. My iterator is parallelized, meaning that there is a queue from which the iterator reads sentences. I would like to parallelize build_vocab too. Thanks |
Implement which part, what exactly? IMO the word-counting logic is so trivial that any sort of fine-grained thread synchronization (locks) would do more harm than good. The Python overhead is too large. So, a coarse-grained approach looks more promising. I mean, the proof will be in the benchmark numbers, but for the extra complexity to be worthwhile, we would need a speedup of whole multiples (2x, 3x… not 20% or 30%). Do you think you can do it? |
I thought you already had a plan, but anyway this is my proposal: …
At the end of the process, the global vocab will be the actual vocab; the local vocabs will be lost after the worker processes shut down. |
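A minimal sketch of that proposal as I understand it; the names (count_worker, build_vocab_parallel) are illustrative, and on Windows the process spawning would need an if __name__ == '__main__': guard:

import multiprocessing as mp
from collections import Counter

SENTINEL = None  # tells a worker there is no more input

def count_worker(job_queue, result_queue):
    local_vocab = Counter()            # local vocab, private to this process
    while True:
        sentence = job_queue.get()
        if sentence is SENTINEL:
            break
        local_vocab.update(sentence)
    result_queue.put(local_vocab)      # ship the local vocab back for merging

def build_vocab_parallel(sentences, workers=4):
    job_queue = mp.Queue(maxsize=10000)
    result_queue = mp.Queue()
    procs = [mp.Process(target=count_worker, args=(job_queue, result_queue))
             for _ in range(workers)]
    for proc in procs:
        proc.start()
    for sentence in sentences:         # stream sentences to the workers
        job_queue.put(sentence)
    for _ in procs:
        job_queue.put(SENTINEL)
    global_vocab = Counter()
    for _ in procs:                    # merge the local vocabs into the global one
        global_vocab.update(result_queue.get())
    for proc in procs:
        proc.join()
    return global_vocab

Whether the per-sentence inter-process queuing pays off is exactly the concern raised in the next reply.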
I'd expect that (queuing objects with multiprocessing) to be significantly slower than the current single-threaded approach. But let me know what the numbers say. |
ok. |
Yes please – implement your changes and then open a pull request. |
I ran some tests, and I realized that it should be even faster if I can manage to build the vocabulary myself, since I'm already parallelizing the work to pre-process the text. I dug into the code and found the life-saving function … This solves the issue in my case since, before, build_vocab was the bottleneck, with a single process slowing down the many pre-processing processes. Now I can spawn 50 parallel processes and they all reach 100% CPU usage 100% of the time. |
Yes, if you're able to "squeeze" the word counting into the same process that generates the data, that will avoid the extra corpus loop altogether. Basically, amortize the data-shuffling cost into an existing preprocessing step. This would be true even if you didn't parallelize the preprocessing step, but even more so if you have it parallelized. |
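For anyone wanting to do the same, a sketch of that pattern using build_vocab_from_freq (the function mentioned in a later comment); exact parameter names vary a little between gensim versions, and the tokenization here is a stand-in for the real preprocessing:

from collections import Counter
from gensim.models import Word2Vec

def preprocess_and_count(lines):
    # Tokenize and count in the same pass, so no separate vocab scan is needed.
    sentences, counts = [], Counter()
    for line in lines:
        tokens = line.lower().split()      # stand-in for the real preprocessing
        counts.update(tokens)
        sentences.append(tokens)
    return sentences, counts

with open('corpus.txt', encoding='utf8') as fin:
    sentences, counts = preprocess_and_count(fin)

model = Word2Vec(min_count=5)              # model created without a corpus
model.build_vocab_from_freq(counts)        # reuse the counts, skip the vocab scan
model.train(sentences, total_examples=len(sentences), epochs=5)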
I was wondering if I can do something similar for Phrases, since it suffers from the same single-core limitation? |
Sure – Phrases is also a part of preprocessing. So conceptually, phrase detection belongs to the same computational step. |
OK, but practically, how can I do it? I don't see a build_vocab_from_freq there, or anything else that allows overcoming the one-core limitation. |
I've been running a couple billion words through word2vec, using a 32-core EC2 instance. Actually training the model is blazing fast (relatively), but populating the vocab dict is painfully slow.
I forked and added multi-threading, but max_vocab_size pruning and progress logging were difficult to implement (you can't prune the vocab list of a single worker, since there could be more instances of a word in a different worker's sentences, and pruning after all the vocab counts have been merged defeats the purpose entirely).
Because I had to disable those two things, I'm not going to submit a pull request. Here's what I changed, perhaps @piskvorky or someone with more time could get vocab size pruning working with multithreading?
https://github.com/btym/gensim/blob/develop/gensim/models/word2vec.py#L480