word2vec vocab building is not multi-threaded #400

Open
khogeland opened this issue Jul 14, 2015 · 24 comments
Labels
difficulty medium – Medium issue: required good gensim understanding & python skills
performance – Issue related to performance (in HW meaning)

Comments

@khogeland

I've been running a couple billion words through word2vec, using a 32-core EC2 instance. Actually training the model is blazing fast (relatively), but populating the vocab dict is painfully slow.

I forked and added multi-threading, but max_vocab_size pruning and progress logging were difficult to implement (you can't prune the vocab list of a single worker, since there could be more instances of a word in a different worker's sentences, and pruning after all the vocab counts have been merged defeats the purpose entirely).

Because I had to disable those two things, I'm not going to submit a pull request. Here's what I changed, perhaps @piskvorky or someone with more time could get vocab size pruning working with multithreading?

https://github.com/btym/gensim/blob/develop/gensim/models/word2vec.py#L480

@piskvorky
Owner

Can you say a little more about your use case?

What is the performance now, what performance do you need, and why?

We can certainly optimize this part if there's a compelling reason to :)

Note: threading doesn't generally help much with Python, because of the global interpreter lock (GIL). And it looks like you're putting all sentences into plain lists, in RAM, which wouldn't scale (no streaming).

@khogeland
Author

It's going through a somewhat large history of GitHub events (the past 7 months, ~1.8 billion words). Memory is not an issue in my case, but with your implementation, building the vocab list is almost as slow as training the network, since it's only running on one core out of 32 (to be clear, threads will run properly on separate cores if core affinity is set correctly, but what I added could have been done just as easily with multiprocessing).

I realize the way I'm doing it is specific to my use case (not streaming, no max vocab size), but there's no reason for vocab building not to be multi-threaded/multi-process, since staying single-threaded is such a huge performance loss. Streaming could be solved by having one iterator shared between threads and having the iterator handle the locking. Not sure how max vocab size would work.
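
For illustration, a minimal sketch of that shared-iterator idea (the LockedIterator class and count_worker function here are hypothetical, not gensim API, and as discussed below the GIL means pure-Python counting threads would still gain little):

import itertools
import threading

class LockedIterator:
    # Wrap any sentence stream so that multiple worker threads can safely
    # pull batches from it; only the hand-off is serialized.
    def __init__(self, iterable, batch_size=1000):
        self._iterator = iter(iterable)
        self._lock = threading.Lock()
        self._batch_size = batch_size

    def next_batch(self):
        with self._lock:
            return list(itertools.islice(self._iterator, self._batch_size))

def count_worker(shared_iter, local_vocab):
    # Drain batches and count into a worker-local dict; the per-worker dicts
    # would be merged at the end (which is why per-worker pruning is unsafe).
    while True:
        batch = shared_iter.next_batch()
        if not batch:
            return
        for sentence in batch:
            for word in sentence:
                local_vocab[word] = local_vocab.get(word, 0) + 1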

@piskvorky
Owner

Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers?

Anyway, yes, parallelizing the vocab building sounds useful. If it clashes with the max vocab size pruning, I'd rather drop that and keep the parallel version. The parallelization optimization scenario sounds more commonly useful than pruning... although it brings more maintenance headaches (multiprocessing works differently on Windows). Do you have a way of testing on Windows?

If you find a way to build the vocab in parallel, while streaming the input (not everything in RAM), and being robust across platforms, can you please open a PR @btym?

For inspiration, see the actual training a few lines lower, which is also streamed and parallelized.

@khogeland
Author

I may have been mistaken and read the wrong thing; I only tried the changes on a small set, since I was doing this while the other model was still training :) Should be able to switch to multiprocessing.

Can test on Windows/OSX/Linux conveniently. Will try to submit a PR if I have time this week/weekend.


@fortiema

Hey guys,

Been following this issue, as this is also a big concern for me. Training on 2B+ word corpora takes a whole lot of time, but most of it is actually spent scanning the vocabulary.

Based on @piskvorky's suggestion, I went ahead and added multiprocessing support to word2vec's scan_vocab, removing max vocab pruning since it was much more complicated to deal with.

I haven't run any quantitative benchmarks yet, but I'm planning to do so tomorrow to compare scaling across many cores and with non-trivial LineSentence implementations.

For now the only hiccup is that my implementation uses Counter() to easily merge each worker's vocab, and that was only introduced in Python 2.7. Have a look at PR #406 and tell me what you think!
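
For reference, a minimal sketch of that Counter-merge idea (the function names here are illustrative; the actual PR may be structured differently):

import multiprocessing
from collections import Counter  # added in Python 2.7

def count_chunk(sentences):
    # Worker: count words in one chunk of sentences.
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts

def parallel_vocab(sentence_chunks, processes=4):
    # Count each chunk in a separate process, then merge the partial Counters;
    # Counter addition sums per-word counts across workers.
    pool = multiprocessing.Pool(processes)
    partial_counts = pool.map(count_chunk, sentence_chunks)
    pool.close()
    pool.join()
    return sum(partial_counts, Counter())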

@Huarong

Huarong commented Jul 31, 2015

I need this feature too. I have about 500 GB of training data in a directory, consisting of about 500 files. It would be helpful if multiple files could be scanned at the same time in the vocabulary-building step.

@honnibal

honnibal commented Dec 5, 2015

I have some Cython code from spaCy that is helpful here:

  • The spacy.strings.hash_string function uses Murmurhash to produce a 64-bit integer key
  • The preshed library has a very fast and very memory efficient 64-bit to 64-bit hash table, implemented using open addressing and linear probing. The table is slightly faster than Google's dense_hash_map for my use cases.

The strategy is to have two tables: an int-to-int table that stores the counts and only knows the hashes of the strings, and a Python table mapping the hashes to the strings. The Python table is only updated when we cross the minimum frequency threshold.

I wrote this up very quickly, but so far the performance seems much, much faster than scan_vocab, with much better memory use. I'm up to 1.5B words after running for 10 minutes.

import io
import os

from preshed.counter import PreshCounter
from spacy.strings import hash_string


def iter_dir(directory):
    # Helper assumed by the original snippet: yield paths of all files under `directory`.
    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            yield os.path.join(root, filename)


class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()
        self.strings = {}
        self.min_freq = min_freq

    def count_doc(self, words):
        # Get counts for this document
        doc_counts = PreshCounter()
        doc_strings = {}
        for word in words:
            key = hash_string(word)
            doc_counts.inc(key, 1)
            doc_strings[key] = word

        n = 0
        for key, count in doc_counts:
            self.counts.inc(key, count)
            # TODO: Why doesn't inc return this? =/
            corpus_count = self.counts[key]
            # Remember the string once we cross the min count threshold
            if corpus_count >= self.min_freq and (corpus_count - count) < self.min_freq:
                self.strings[key] = doc_strings[key]
            n += count
        return n

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with io.open(text_loc, 'r', encoding='utf8') as file_:
                for sent_str in file_:
                    yield sent_str.split()
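
Usage would look something like this (a sketch, assuming a directory of whitespace-tokenized text files with one sentence per line):

corpus = Corpus('/path/to/texts', min_freq=10)
total_words = sum(corpus.count_doc(words) for words in corpus)
print(total_words, len(corpus.strings))  # words counted, strings kept above min_freq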

@windj007

Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning (the procedure of removing words with a count of 1). If the number of unique tokens grows too fast, the dictionary is pruned on nearly every doc. To overcome this, it's better to remove all numbers, URLs and other special tokens from the corpus, e.g. by replacing them with placeholders like NUMBER, DATE, etc. Having preprocessed the Russian Wikipedia like this, I got the dictionary built in 5 minutes (1.8B words, ~6 GB of preprocessed text).

I would suggest mentioning this in the documentation explicitly.

@tmylk
Contributor

tmylk commented Oct 5, 2016

@windj007 Do you happen to have example code for replacing tokens with placeholders? I assume it's some simple regexes. It would be great to add it to the documentation.

@windj007

windj007 commented Oct 5, 2016

For Wikipedia, a regex like this one worked for me:

BAD_TOKENS_RE = re.compile(r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

It matches token types such as URLs, special symbols and image filenames.

When I wrote my previous comment, I had a set of regexes and actually replaced tokens with placeholders. Since then I have merged all the regexes into this single one and just drop the matched tokens.
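
Applied during preprocessing, that could look roughly like this (a sketch paired with the BAD_TOKENS_RE definition above; the file name is illustrative, and the text is assumed to be whitespace-tokenized with one sentence per line):

import io

def clean_tokens(tokens):
    # Drop tokens matching BAD_TOKENS_RE above: URLs, numbers, punctuation runs,
    # image filenames and single-character tokens.
    return [tok for tok in tokens if not BAD_TOKENS_RE.match(tok)]

sentences = (clean_tokens(line.split())
             for line in io.open('wiki_preprocessed.txt', encoding='utf8'))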

@menshikh-iv added the difficulty medium and performance labels on Oct 3, 2017
@gauravkoradiya

> Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning […] I would suggest mentioning this in the documentation explicitly.

Have you used the above code?

@spate141

spate141 commented Apr 28, 2021

Has any progress been made with build_vocab() in the 4.0 release? I just started with a 105B-token file and am planning to train a few versions of embedding models. It seems like build_vocab is more expensive than the actual training.

I'm setting the following variables related to this function:

MAX_VOCAB_SIZE  = None
TRIM_RULE       = None
SORTED_VOCAB    = 1
BATCH_WORDS     = 10000

Any suggestions for speeding this up? I'm not sure how I should use @honnibal's code.

@piskvorky
Owner

piskvorky commented Apr 28, 2021

build_vocab is still pure Python = slow. Essentially it does this:

from collections import defaultdict
from gensim.models.word2vec import LineSentence

vocab = defaultdict(int)
sentences = LineSentence(text_file_path)  # or any user-supplied iterable of token lists
for sentence in sentences:
    for word in sentence:
        vocab[word] += 1

I see two avenues to speed up build_vocab:

  1. Optimize LineSentence or the loop above with Cython / C. Especially for the `corpus_file` code path, where we know the input is a file on a local disk (rather than any iterable or remote data stream), we should be able to do much better.

    Plus, we can now use [Bounter](https://github.com/RaRe-Technologies/bounter) for fast in-memory counting, so we don't have to care about vocab pruning at all (see the sketch after this list).

  2. In case of `corpus_file`, we could also parallelize `build_vocab`. Different threads/processes could jump to scan different places in the underlying local text file – something we cannot do with a generic iterable.

If anyone's interested in implementing either, let me know.
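
For a feel of avenue 1, here is a minimal counting sketch using Bounter (illustrative only, not gensim code; LineSentence is gensim's existing streaming reader and 'corpus.txt' is a placeholder):

from bounter import bounter
from gensim.models.word2vec import LineSentence

counts = bounter(size_mb=1024)               # fixed memory budget, so no pruning logic is needed
for sentence in LineSentence('corpus.txt'):  # one whitespace-tokenized sentence per line
    counts.update(sentence)

print(counts.cardinality(), counts['the'])   # approximate number of distinct words, count of one word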

@ciropom

ciropom commented Mar 14, 2023

> build_vocab is still pure Python = slow. […] If anyone's interested in implementing either, let me know.

Hello! I need this and I'm willing to implement it.

I actually have a class that extends an iterator and streams a list of papers from an internal VPN resource, which I tokenize and pass to gensim, but build_vocab is taking ages.

My iterator is parallelized, meaning there is a queue from which the iterator reads sentences, and multiple threads write to it.

I would like to parallelize build_vocab too. Please let me know the details and how to proceed.

Thanks
Danilo

@piskvorky
Owner

piskvorky commented Mar 14, 2023

Implement which part, what exactly?

IMO the word counting logic is so trivial that any sort of fine-grained thread synchronization (locks) would do more harm than good. The Python overhead is too large.

So, a coarse grained approach looks more promising.

I mean, the proof will be in the benchmark numbers, but for the extra complexity to be worthwhile, we would need a speedup of whole multiples (2x, 3x… not 20% or 30%). Do you think you can do it?

@ciropom

ciropom commented Mar 14, 2023

I thought you already had a plan, but anyway, this is my proposal:

  • define a read_queue
  • build_vocab will do:

sentences = ...  # user-supplied iterable or LineSentence(text_file_path)
for sentence in sentences:
    read_queue.put(sentence)

  • spawn multiple Processes that read from read_queue, each one with its own local vocab
  • every SYNC_VOCAB words processed, each process will do global_vocab_queue.put(json.dumps(vocab)); vocab = {}
  • another Process receives the local vocabularies from global_vocab_queue and merges them into a global vocabulary, summing counts where needed

At the end of the process, the global vocab will be the actual vocab; the local vocabs will be lost after the Processes shut down. (A minimal sketch of this pipeline follows below.)
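
A minimal sketch of that pipeline (illustrative only; it ships Counter objects instead of JSON strings, since multiprocessing queues pickle them anyway, and SYNC_VOCAB is the threshold named above):

import multiprocessing as mp
from collections import Counter

SYNC_VOCAB = 100000  # flush the local vocab to the merger roughly every 100k words

def count_worker(read_queue, vocab_queue):
    # Each worker keeps a local vocab and periodically ships it to the merger.
    vocab, words_seen = Counter(), 0
    for sentence in iter(read_queue.get, None):   # None acts as the poison pill
        vocab.update(sentence)
        words_seen += len(sentence)
        if words_seen >= SYNC_VOCAB:
            vocab_queue.put(vocab)
            vocab, words_seen = Counter(), 0
    vocab_queue.put(vocab)                        # flush whatever is left
    vocab_queue.put(None)                         # tell the merger this worker is done

def merge_worker(vocab_queue, n_workers, result_queue):
    # A single merger process sums the partial vocabularies into the global one.
    global_vocab, finished = Counter(), 0
    while finished < n_workers:
        partial = vocab_queue.get()
        if partial is None:
            finished += 1
        else:
            global_vocab.update(partial)
    result_queue.put(global_vocab)

if __name__ == '__main__':
    read_q, vocab_q, result_q = mp.Queue(maxsize=10000), mp.Queue(), mp.Queue()
    workers = [mp.Process(target=count_worker, args=(read_q, vocab_q)) for _ in range(4)]
    merger = mp.Process(target=merge_worker, args=(vocab_q, len(workers), result_q))
    for p in workers + [merger]:
        p.start()
    for sentence in sentences:  # `sentences` stands in for the streamed corpus
        read_q.put(sentence)
    for _ in workers:           # one poison pill per worker
        read_q.put(None)
    global_vocab = result_q.get()
    for p in workers + [merger]:
        p.join()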

@piskvorky
Owner

piskvorky commented Mar 14, 2023

I'd expect that (queuing objects with multiprocessing) to be significantly slower than the current single-threaded approach.

But let me know what the numbers say.

@ciropom

ciropom commented Mar 14, 2023

OK. Practically, should I fork the project and open a merge request afterwards?

@piskvorky
Owner

Yes please – implement your changes and then open a pull request.

@ciropom

ciropom commented Mar 15, 2023

I ran some tests, and I realized it would be even faster if I build the vocabulary myself, since I'm already parallelizing the work to pre-process the text.

I dug into the code and found the life-saving function build_vocab_from_freq, which lets me supply a vocabulary I've prepared myself.

This solves the issue in my case: before, build_vocab was the bottleneck, with one single process slowing down the many pre-processing processes. Now I can spawn 50 parallel processes and they all stay at 100% CPU usage the whole time.
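
For anyone else landing here, the pattern looks roughly like this (a sketch: preprocessed_sentences stands in for your own, possibly parallelized, preprocessing pipeline and must be re-iterable; the hyperparameters are placeholders):

from collections import Counter
from gensim.models import Word2Vec

word_freq, n_sentences = Counter(), 0
for sentence in preprocessed_sentences:      # count while (or after) preprocessing
    word_freq.update(sentence)
    n_sentences += 1

model = Word2Vec(vector_size=300, workers=8)
model.build_vocab_from_freq(word_freq)       # skips gensim's single-process corpus scan
model.train(preprocessed_sentences, total_examples=n_sentences, epochs=model.epochs)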

@piskvorky
Owner

Yes, if you're able to "squeeze" the word counting into the same process that generates the data, that avoids the extra corpus loop altogether. Basically, you amortize the data-shuffling cost into an existing preprocessing step.

This would be true even if you didn't parallelize the preprocessing step. But even more so if you have it parallelized.

@ciropom

ciropom commented Mar 16, 2023

I was wondering: can I do something similar for Phrases? It suffers from the same single-core limitation.

@piskvorky
Owner

Sure – Phrases is also a part of preprocessing. So conceptually, phrase detection belongs to the same computational step.

@ciropom

ciropom commented Mar 16, 2023

OK, but practically, how can I do it? I don't see a build_vocab_from_freq there, or anything else that would allow overcoming the single-core limitation.
