word2vec vocab building is not multi-threaded #400

Open
khogeland opened this issue Jul 14, 2015 · 24 comments
Labels
difficulty medium – Medium issue: required good gensim understanding & python skills
performance – Issue related to performance (in HW meaning)

Comments

@khogeland

I've been running a couple billion words through word2vec, using a 32-core EC2 instance. Actually training the model is blazing fast (relatively), but populating the vocab dict is painfully slow.

I forked and added multi-threading, but max_vocab_size pruning and progress logging were difficult to implement (you can't prune the vocab list of a single worker, since there could be more instances of a word in a different worker's sentences, and pruning after all the vocab counts have been merged defeats the purpose entirely).

Because I had to disable those two things, I'm not going to submit a pull request. Here's what I changed, perhaps @piskvorky or someone with more time could get vocab size pruning working with multithreading?

https://github.com/btym/gensim/blob/develop/gensim/models/word2vec.py#L480

@piskvorky
Owner

Can you say a little more about your use case?

What is the performance now, what performance do you need, and why?

We can certainly optimize this part if there's a compelling reason to :)

Note: threading doesn't generally help much with Python, because of the global interpreter lock (GIL). And it looks like you're putting all sentences into plain lists, in RAM, which wouldn't scale (no streaming).

@khogeland
Author

It's going through a somewhat large history of GitHub events (the past 7 months, ~1.8 billion words). Memory is not an issue in my case, but with your implementation, building the vocab list is almost as slow as training the network, since it's only running on one core out of 32 (to be clear, threads will run properly on separate cores if core affinity is set correctly, but what I added could have been done just as easily with multiprocessing).

I realize the way I'm doing it is specific to my use case (not streaming, no max vocab size), but there's no reason for vocab building not to be multi-threaded/multi-process, since staying single-threaded is such a huge performance loss. Streaming could be solved by having one iterator shared between threads and having the iterator handle the locking. Not sure how max vocab size would work.
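
For illustration, a minimal sketch of that shared-iterator idea (the LockedIterator class and count_worker function here are hypothetical, not gensim API, and as discussed below the GIL means pure-Python counting threads would still gain little):

import itertools
import threading

class LockedIterator:
    # Wrap any sentence stream so that multiple worker threads can safely
    # pull batches from it; only the hand-off is serialized.
    def __init__(self, iterable, batch_size=1000):
        self._iterator = iter(iterable)
        self._lock = threading.Lock()
        self._batch_size = batch_size

    def next_batch(self):
        with self._lock:
            return list(itertools.islice(self._iterator, self._batch_size))

def count_worker(shared_iter, local_vocab):
    # Drain batches and count into a worker-local dict; the per-worker dicts
    # would be merged at the end (which is why per-worker pruning is unsafe).
    while True:
        batch = shared_iter.next_batch()
        if not batch:
            return
        for sentence in batch:
            for word in sentence:
                local_vocab[word] = local_vocab.get(word, 0) + 1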

@piskvorky
Owner

Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers?

Anyway, yes, parallelizing the vocab building sounds useful. If it clashes with the max vocab size pruning, I'd rather drop that and keep the parallel version. The parallelization optimization scenario sounds more commonly useful than pruning... although it brings more maintenance headaches (multiprocessing works differently on Windows). Do you have a way of testing on Windows?

If you find a way to build the vocab in parallel, while streaming the input (not everything in RAM), and being robust across platforms, can you please open a PR @btym?

For inspiration, see the actual training a few lines lower, which is also streamed and parallelized.

@khogeland
Author

I may have been mistaken and read the wrong thing; I only tried the changes on a small set, since I was doing this while the other model was still training :) Should be able to switch to multiprocessing.

Can test on Windows/OSX/Linux conveniently. Will try to submit a PR if I have time this week/weekend.


@fortiema

Hey guys,

Been following this issue, as this is also a big concern for me. Training on 2B+ word corpora takes a whole lot of time, but most of it is actually spent scanning the vocabulary.

Based on @piskvorky's suggestion, I went ahead and added multiprocessing support to word2vec's scan_vocab, removing max vocab pruning since it was much more complicated to deal with.

I haven't run any quantitative benchmarks yet, but I'm planning to do so tomorrow to compare scaling across many cores and with non-trivial LineSentence implementations.

For now the only hiccup is that my implementation uses Counter() to easily merge each worker's vocab, and that was only introduced in Python 2.7. Have a look at PR #406 and tell me what you think!
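
For reference, a minimal sketch of that Counter-merge idea (the function names here are illustrative; the actual PR may be structured differently):

import multiprocessing
from collections import Counter  # added in Python 2.7

def count_chunk(sentences):
    # Worker: count words in one chunk of sentences.
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts

def parallel_vocab(sentence_chunks, processes=4):
    # Count each chunk in a separate process, then merge the partial Counters;
    # Counter addition sums per-word counts across workers.
    pool = multiprocessing.Pool(processes)
    partial_counts = pool.map(count_chunk, sentence_chunks)
    pool.close()
    pool.join()
    return sum(partial_counts, Counter())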

@Huarong

Huarong commented Jul 31, 2015

I need this feature too. I have about 500 GB of training data in a directory, consisting of about 500 files. It would be helpful if multiple files could be scanned at the same time in the vocabulary-building step.

@honnibal

honnibal commented Dec 5, 2015

I have some Cython code from spaCy that is helpful here:

  • The spacy.strings.hash_string function uses Murmurhash to produce a 64-bit integer key
  • The preshed library has a very fast and very memory efficient 64-bit to 64-bit hash table, implemented using open addressing and linear probing. The table is slightly faster than Google's dense_hash_map for my use cases.

The strategy is to have two tables: an int-to-int table that stores the counts and only knows the hashes of the strings, and a Python table mapping the hashes to the strings. The Python table is only updated when we cross the minimum frequency threshold.

I wrote this up very quickly, but so far the performance seems much, much faster than scan_vocab, with much better memory use. I'm up to 1.5B words after running for 10 minutes.

import io
import os

from preshed.counter import PreshCounter
from spacy.strings import hash_string


def iter_dir(directory):
    # Helper assumed by the original snippet: yield paths of all files under `directory`.
    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            yield os.path.join(root, filename)


class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()
        self.strings = {}
        self.min_freq = min_freq

    def count_doc(self, words):
        # Get counts for this document
        doc_counts = PreshCounter()
        doc_strings = {}
        for word in words:
            key = hash_string(word)
            doc_counts.inc(key, 1)
            doc_strings[key] = word

        n = 0
        for key, count in doc_counts:
            self.counts.inc(key, count)
            # TODO: Why doesn't inc return this? =/
            corpus_count = self.counts[key]
            # Remember the string once we cross the min count threshold
            if corpus_count >= self.min_freq and (corpus_count - count) < self.min_freq:
                self.strings[key] = doc_strings[key]
            n += count
        return n

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with io.open(text_loc, 'r', encoding='utf8') as file_:
                for sent_str in file_:
                    yield sent_str.split()
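
Usage would look something like this (a sketch, assuming a directory of whitespace-tokenized text files with one sentence per line):

corpus = Corpus('/path/to/texts', min_freq=10)
total_words = sum(corpus.count_doc(words) for words in corpus)
print(total_words, len(corpus.strings))  # words counted, strings kept above min_freq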

@windj007

Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning (the procedure of removing words with a count of 1). If the number of unique tokens grows too fast, the dictionary is pruned on nearly every doc. To overcome this, it's better to remove all numbers, URLs and other special tokens from the corpus, e.g. by replacing them with placeholders like NUMBER, DATE, etc. Having preprocessed the Russian Wikipedia like this, I got the dictionary built in 5 minutes (1.8B words, ~6 GB of preprocessed text).

I would suggest mentioning this in the documentation explicitly.

@tmylk
Contributor

tmylk commented Oct 5, 2016

@windj007 Do you happen to have example code for replacing tokens with placeholders? I assume it's some simple regexes. It would be great to add it to the documentation.

@windj007

windj007 commented Oct 5, 2016

For Wikipedia, a regex like this one worked for me:

BAD_TOKENS_RE = re.compile(r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

It matches token types such as URLs, special symbols and image filenames.

When I wrote my previous comment, I had a set of regexes and actually replaced tokens with placeholders. Since then I have merged all the regexes into this single one and just drop the matched tokens.
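
Applied during preprocessing, that could look roughly like this (a sketch paired with the BAD_TOKENS_RE definition above; the file name is illustrative, and the text is assumed to be whitespace-tokenized with one sentence per line):

import io

def clean_tokens(tokens):
    # Drop tokens matching BAD_TOKENS_RE above: URLs, numbers, punctuation runs,
    # image filenames and single-character tokens.
    return [tok for tok in tokens if not BAD_TOKENS_RE.match(tok)]

sentences = (clean_tokens(line.split())
             for line in io.open('wiki_preprocessed.txt', encoding='utf8'))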

@menshikh-iv added the difficulty medium and performance labels on Oct 3, 2017
@gauravkoradiya

> Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning […] I would suggest mentioning this in the documentation explicitly.

Have you used the above code?

@spate141

spate141 commented Apr 28, 2021

Has any progress been made with build_vocab() in the 4.0 release? I just started with a 105B-token file and am planning to train a few versions of embedding models. It seems like build_vocab is more expensive than the actual training.

I'm setting the following variables related to this function:

MAX_VOCAB_SIZE  = None
TRIM_RULE       = None
SORTED_VOCAB    = 1
BATCH_WORDS     = 10000

Any suggestions for speeding this up? I'm not sure how I should use @honnibal's code.

@piskvorky
Owner

piskvorky commented Apr 28, 2021

build_vocab is still pure Python = slow. Essentially it does this:

from collections import defaultdict
from gensim.models.word2vec import LineSentence

vocab = defaultdict(int)
sentences = LineSentence(text_file_path)  # or any user-supplied iterable of token lists
for sentence in sentences:
    for word in sentence:
        vocab[word] += 1

I see two avenues to speed up build_vocab:

  1. Optimize LineSentence or the loop above with Cython / C. Especially for the `corpus_file` code path, where we know the input is a file on a local disk (rather than any iterable or remote data stream), we should be able to do much better.

    Plus, we can now use [Bounter](https://github.com/RaRe-Technologies/bounter) for fast in-memory counting, so we don't have to care about vocab pruning at all (see the sketch after this list).

  2. In case of `corpus_file`, we could also parallelize `build_vocab`. Different threads/processes could jump to scan different places in the underlying local text file – something we cannot do with a generic iterable.

If anyone's interested in implementing either, let me know.
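
For a feel of avenue 1, here is a minimal counting sketch using Bounter (illustrative only, not gensim code; LineSentence is gensim's existing streaming reader and 'corpus.txt' is a placeholder):

from bounter import bounter
from gensim.models.word2vec import LineSentence

counts = bounter(size_mb=1024)               # fixed memory budget, so no pruning logic is needed
for sentence in LineSentence('corpus.txt'):  # one whitespace-tokenized sentence per line
    counts.update(sentence)

print(counts.cardinality(), counts['the'])   # approximate number of distinct words, count of one word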

@ciropom

ciropom commented Mar 14, 2023

> build_vocab is still pure Python = slow. […] If anyone's interested in implementing either, let me know.

Hello! I need this and I'm willing to implement it.

I actually have a class that extends an iterator and streams a list of papers from an internal VPN resource, which I tokenize and pass to gensim, but build_vocab is taking ages.

My iterator is parallelized, meaning there is a queue from which the iterator reads sentences, and multiple threads write to it.

I would like to parallelize build_vocab too. Please let me know the details and how to proceed.

Thanks
Danilo

@piskvorky
Owner

piskvorky commented Mar 14, 2023

Implement which part, what exactly?

IMO the word counting logic is so trivial that any sort of fine-grained thread synchronization (locks) would do more harm than good. The Python overhead is too large.

So, a coarse grained approach looks more promising.

I mean, the proof will be in the benchmark numbers, but for the extra complexity to be worthwhile, we would need a speedup of whole multiples (2x, 3x… not 20% or 30%). Do you think you can do it?

@ciropom

ciropom commented Mar 14, 2023

I thought you already had a plan, but anyway, this is my proposal:

  • define a read_queue
  • build_vocab will do:

sentences = ...  # user-supplied iterable or LineSentence(text_file_path)
for sentence in sentences:
    read_queue.put(sentence)

  • spawn multiple Processes that read from read_queue, each one with its own local vocab
  • every SYNC_VOCAB words processed, each process will do global_vocab_queue.put(json.dumps(vocab)); vocab = {}
  • another Process receives the local vocabularies from global_vocab_queue and merges them into a global vocabulary, summing counts where needed

At the end of the process, the global vocab will be the actual vocab; the local vocabs will be lost after the Processes shut down. (A minimal sketch of this pipeline follows below.)
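
A minimal sketch of that pipeline (illustrative only; it ships Counter objects instead of JSON strings, since multiprocessing queues pickle them anyway, and SYNC_VOCAB is the threshold named above):

import multiprocessing as mp
from collections import Counter

SYNC_VOCAB = 100000  # flush the local vocab to the merger roughly every 100k words

def count_worker(read_queue, vocab_queue):
    # Each worker keeps a local vocab and periodically ships it to the merger.
    vocab, words_seen = Counter(), 0
    for sentence in iter(read_queue.get, None):   # None acts as the poison pill
        vocab.update(sentence)
        words_seen += len(sentence)
        if words_seen >= SYNC_VOCAB:
            vocab_queue.put(vocab)
            vocab, words_seen = Counter(), 0
    vocab_queue.put(vocab)                        # flush whatever is left
    vocab_queue.put(None)                         # tell the merger this worker is done

def merge_worker(vocab_queue, n_workers, result_queue):
    # A single merger process sums the partial vocabularies into the global one.
    global_vocab, finished = Counter(), 0
    while finished < n_workers:
        partial = vocab_queue.get()
        if partial is None:
            finished += 1
        else:
            global_vocab.update(partial)
    result_queue.put(global_vocab)

if __name__ == '__main__':
    read_q, vocab_q, result_q = mp.Queue(maxsize=10000), mp.Queue(), mp.Queue()
    workers = [mp.Process(target=count_worker, args=(read_q, vocab_q)) for _ in range(4)]
    merger = mp.Process(target=merge_worker, args=(vocab_q, len(workers), result_q))
    for p in workers + [merger]:
        p.start()
    for sentence in sentences:  # `sentences` stands in for the streamed corpus
        read_q.put(sentence)
    for _ in workers:           # one poison pill per worker
        read_q.put(None)
    global_vocab = result_q.get()
    for p in workers + [merger]:
        p.join()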

@piskvorky
Owner

piskvorky commented Mar 14, 2023

I'd expect that (queuing objects with multiprocessing) to be significantly slower than the current single-threaded approach.

But let me know what the numbers say.

@ciropom

ciropom commented Mar 14, 2023

OK. Practically, should I fork the project and open a merge request afterwards?

@piskvorky
Owner

Yes please – implement your changes and then open a pull request.

@ciropom

ciropom commented Mar 15, 2023

I ran some tests, and I realized it would be even faster if I build the vocabulary myself, since I'm already parallelizing the work to pre-process the text.

I dug into the code and found the life-saving function build_vocab_from_freq, which lets me supply a vocabulary I've prepared myself.

This solves the issue in my case: before, build_vocab was the bottleneck, with one single process slowing down the many pre-processing processes. Now I can spawn 50 parallel processes and they all stay at 100% CPU usage the whole time.
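
For anyone else landing here, the pattern looks roughly like this (a sketch: preprocessed_sentences stands in for your own, possibly parallelized, preprocessing pipeline and must be re-iterable; the hyperparameters are placeholders):

from collections import Counter
from gensim.models import Word2Vec

word_freq, n_sentences = Counter(), 0
for sentence in preprocessed_sentences:      # count while (or after) preprocessing
    word_freq.update(sentence)
    n_sentences += 1

model = Word2Vec(vector_size=300, workers=8)
model.build_vocab_from_freq(word_freq)       # skips gensim's single-process corpus scan
model.train(preprocessed_sentences, total_examples=n_sentences, epochs=model.epochs)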

@piskvorky
Owner

Yes, if you're able to "squeeze" the word counting into the same process that generates the data, that avoids the extra corpus loop altogether. Basically, you amortize the data-shuffling cost into an existing preprocessing step.

This would be true even if you didn't parallelize the preprocessing step. But even more so if you have it parallelized.

@ciropom

ciropom commented Mar 16, 2023

I was wondering: can I do something similar for Phrases? It suffers from the same single-core limitation.

@piskvorky
Owner

Sure – Phrases is also a part of preprocessing. So conceptually, phrase detection belongs to the same computational step.

@ciropom

ciropom commented Mar 16, 2023

OK, but practically, how can I do it? I don't see a build_vocab_from_freq there, or anything else that would allow overcoming the single-core limitation.
