word2vec (& doc2vec) training doesn't benefit from all CPU cores with high workers values #336
Comments
Maybe your input sentence iterator gets saturated? What is the throughput of your sentence stream? Can it sustain that many cores? |
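One way to answer the throughput question is to time a bare pass over the sentence stream with no training attached. The sketch below is only an illustration: `measure_throughput` and the `sample` corpus are hypothetical names, and any iterable of token lists (such as a LineSentence object) could be passed in instead.

```python
import time

def measure_throughput(sentences):
    """Iterate once over `sentences` (an iterable of token lists) and
    return (total_words, words_per_second) without doing any training."""
    start = time.perf_counter()
    total_words = 0
    for tokens in sentences:
        total_words += len(tokens)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small inputs.
    rate = total_words / elapsed if elapsed > 0 else float("inf")
    return total_words, rate

# Hypothetical in-memory corpus standing in for the real sentence stream.
sample = [["the", "quick", "brown", "fox"], ["jumps", "over"]] * 1000
words, rate = measure_throughput(sample)
print(f"{words} words at {rate:,.0f} words/s")
```

If the rate measured this way is not well above the training rate reported in the logs, the iterator itself is a plausible bottleneck.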
I am using the LineSentence iterator that comes with the gensim code. I should have said that on the MacPro the code is slower for any number of cores beyond one. However, on a MacBook Pro I see improvements similar to those in your blog post (going all the way to 8 virtual cores). These improvements are seen with the same code and datasets. My concern was that the output of the FAST_VERSION check differs between the two machines. |
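For comparing machines, the FAST_VERSION check mentioned above can be scripted so that it also runs where gensim is missing. This is a minimal sketch: the `getattr` fallback and the `fast_version` variable are my additions, and only the attribute name `gensim.models.word2vec.FAST_VERSION` comes from this thread.

```python
try:
    from gensim.models import word2vec
    # FAST_VERSION is the module attribute quoted in this thread.
    fast_version = getattr(word2vec, "FAST_VERSION", None)
except ImportError:
    fast_version = None  # gensim is not installed in this environment

print("FAST_VERSION:", fast_version)
```

Running the same snippet on each machine gives a like-for-like record of which compiled variant is in use.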
Hmm. Can you post the output of: import scipy; scipy.show_config() on either machine? |
This is for the MacPro: atlas_threads_info: |
I have the same problem on two independent machines, one running CentOS and one running Ubuntu. On my MacBook Pro it works as expected and as shown in your blog post. 1 workers: 4 workers: 8 workers The CentOS machine has the exact same behavior. FAST_VERSION is 1 on all 3 machines. The last line of each paragraph is the elapsed CPU time. ec2: local centos: macbook: This is the code I am running:
|
Since your input iterator is fast (for both of you), it must be some sort of resource contention. Maybe CPU cache? Or threading failing for some reason? I'm out of simple ideas :( we'll have to profile properly and see what's going on in detail. |
@piskvorky |
While I have never experienced that problem, I get some speed-up on my desktop by forcing BLAS to be single-threaded (in my case Otherwise, I can't really tell what is going on. It could be worth trying out without hyper-threading, but I'd try to exhaust other possibilities before modifying BIOS settings. |
My testing has so far been limited to an MBP where going from 1 to 4 workers has given the expected speedup, the FAST_VERSION reports as '0' (doubles), and the scipy.show_config() info is similar to the above MacPro and MBP reports. (There's one other extra_compile_args value, '-DAPPLE_ACCELERATE_SGEMV_PATCH', but that appears to be related to a different issue.) It certainly seems like a BLAS issue, but for completeness, it would be good to record the exact OSX and Python versions in play. A wild guess: perhaps BLAS tries to use "all" cores it sees, each operation. So each of your 4 python threads may, on each BLAS operation, try to use all 24 virtual cores – meaning peak attempted use of 96 cores, and thus the throughput loss from contention (both scheduling & L1 cache-stomping). If so, when trying the |
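The arithmetic behind that guess can be written out explicitly. The helper names below are hypothetical, and the model (every worker's BLAS call fanning out at the same moment) is the worst case of the guess, not a measurement.

```python
def peak_attempted_threads(python_workers, blas_threads_per_call):
    """Worst case if every worker's BLAS call fans out at the same time."""
    return python_workers * blas_threads_per_call

def oversubscription(python_workers, blas_threads_per_call, hardware_threads):
    """Ratio of attempted threads to hardware threads; above 1 means contention."""
    return peak_attempted_threads(python_workers, blas_threads_per_call) / hardware_threads

# Figures from the guess above: 4 workers, BLAS using all 24 virtual cores.
peak = peak_attempted_threads(4, 24)
ratio = oversubscription(4, 24, 24)
print(peak, ratio)

# With BLAS limited to one thread per call, 4 workers fit within 24 cores.
single_threaded_ratio = oversubscription(4, 1, 24)
print(single_threaded_ratio)
```

Under this model, limiting BLAS to one thread per call is what brings the ratio back below 1.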
Thanks guys. @gojomo |
The setting OPENBLAS_NUM_THREAD has no impact on the behavior, no matter the value (tried 1-8)
I just ran
on my mac, executing
Is this all expected behavior? |
Using gdb on CentOS 6 (20 cores, 40 virtual ones), I randomly picked one of the 20 cores (utilization 5%) and saw this:
Is it possible you are calling randint a bit too often and numpy's RandomState is locking itself out? |
Re: OPENBLAS_NUM_THREADS – note that it should be plural (just in case you tried it as singular, as you typed it here), and also the OpenBLAS README mentions that in some cases (OpenMP in use?) a different variant is required. I suppose it's possible the np.random.randint call is a bottleneck; it happens on each (neg-sampling) train_* while still holding the python GIL. If you confirm more threads all waiting there, it may be worth refactoring the code to call it less frequently. But maybe you're seeing a more general issue, where since a significant amount of the training is still in pythonland, its GIL is preventing full multithreading benefits. Improving that might require deeper refactoring. (Perhaps, changing the train_sentence_* methods to take batches as train_sentences_*, and then doing larger batches of work without the GIL.) |
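As a sketch of the spelling point, the variables can also be set from Python, provided this happens before NumPy, SciPy or gensim are first imported. It assumes an OpenBLAS-backed NumPy; `OMP_NUM_THREADS` is the usual variable for OpenMP builds, and `blas_settings` is just an illustrative name.

```python
import os

# Assumption: NumPy/SciPy are linked against OpenBLAS. Other BLAS builds read
# different variables (for example OMP_NUM_THREADS for OpenMP builds).
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # note the plural "THREADS"
os.environ["OMP_NUM_THREADS"] = "1"

# Import NumPy, SciPy or gensim only after this point, so the BLAS library
# sees the settings when it is first loaded.
blas_settings = {
    name: os.environ[name]
    for name in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")
}
print(blas_settings)
```

Setting the variables in the shell before launching Python has the same effect and avoids any import-order concerns.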
It looks like the randint() calls are the main problem. I recently started testing performance on an 8-core Ubuntu machine, and began seeing the same problem: Where more workers had hurt (Ubuntu), they now help up through the number of cores I can test (8). And even where more workers were already helping (OSX), or where the only contention was with the job-dispatcher thread (workers=1), I've seen reductions in training-time for the same amount of data of 20%-40%. |
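One general way to keep workers from queueing on a single shared generator is to give each worker its own. The sketch below uses only the standard library and is not a description of what the fix in the PR actually does; `worker`, `table_size` and the seeds are hypothetical. In pure Python the GIL still serializes the work, so this shows the ownership pattern, not a speedup.

```python
import random
import threading

def worker(seed, n_draws, table_size, out, index):
    # Each worker owns its generator, so drawing never touches shared state.
    rng = random.Random(seed)
    out[index] = [rng.randrange(table_size) for _ in range(n_draws)]

n_workers, n_draws, table_size = 4, 1000, 10_000
results = [None] * n_workers
threads = [
    threading.Thread(target=worker, args=(seed, n_draws, table_size, results, i))
    for i, seed in enumerate(range(n_workers))
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([len(r) for r in results])
```

Because every generator is seeded separately, each worker's draws are reproducible regardless of how the threads are scheduled.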
That's amazing. Thanks guys for reporting, investigating & @gojomo for fixing! (see my comment in the PR) @jticknor @thomaskern can you test & check this fix? Does it help for you? |
@piskvorky sorry, was on vacation. I see you merged it into develop 2 weeks ago. Do you want me to test it against the fix_randint_slows_workers branch or the current develop branch? |
For verifying this fix, 'develop' is good. (If you're doing any doc2vec stuff, broader testing of the PR #356 "bigdocvec_pr branch" would be helpful but not needed for checking this issue.) |
htop looks good! 20 cores working at ~75% each is a massive increase. However, the training times still look poor: times for a 50k set: times for a 500k set |
Progress! There may still be other contention or bottlenecks that prevent 20 cores from being the fastest. (I was once worried about the locking around the progress-count... but it wasn't a factor in my smaller tests. Still, at many workers similar little hotspots might grow in impact.) It'd be interesting to know at what worker-count (between 2 & 20) throughput was maximized. Bumping up the size of the job-queue (source-code edit: maxsize more than 2x thread-count) or the job-chunksize (optional parameter to train()) would be other small changes that might move the needle in a contended situation. |
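To make the two knobs concrete, here is a toy dispatcher in the same shape: one producer thread cuts the corpus into chunks and feeds a bounded queue that worker threads drain. It is a standard-library sketch, not gensim's code; `chunked`, `count_words_parallel` and `queue_factor` are hypothetical names, and the "work" is just counting words.

```python
import queue
import threading
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def count_words_parallel(sentences, workers=4, chunksize=100, queue_factor=2):
    # Bounded job queue: the producer blocks once maxsize jobs are pending.
    jobs = queue.Queue(maxsize=queue_factor * workers)
    totals = [0] * workers  # one slot per worker, so no locking is needed

    def worker(i):
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more work
                break
            totals[i] += sum(len(s) for s in job)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for chunk in chunked(sentences, chunksize):
        jobs.put(chunk)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sum(totals)

total = count_words_parallel([["a", "b"]] * 1050, workers=3, chunksize=100)
print(total)
```

A larger `queue_factor` lets the producer run further ahead of the workers, and a larger `chunksize` means fewer, bigger hand-offs per worker.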
And one more thought: at the margin, it seems that more-dimensions, more negative-examples, larger windows, or longer sentences/documents each tend to require relatively more time inside the no-GIL/optimized code – and more time there can be a good thing for parallel progress. So all other things held equal, you may get more utilization out of all the cores by upping those parameters. (Whether those parameter tweaks improve results or are worth the net extra time/space cost, could go either way. But I'd expect the 'observed optimal core count' to get higher with each of those values rising.) |
I just tried it with 3 workers on the 50k one: That is already slower than the 2-core run (which I just re-ran, with the same results). |
Using window=10, negative=100, dimensions=400 (unchanged) on the 50k set with 2 and 20 cores, I get:
20 cores: 6837 words/s |
Hi guys! I also got interested in word2vec parallel training performance recently, and observed strange results. I ran some benchmarks of my own on 2 different machines (using LineSentence, 300 dim., 5 words, 10e-5 subsampling and compiled OpenBLAS from source). Unit is Kwords/sec: Core i5-4570, 16GB RAM, Ubuntu 14.04, Python 2.7.6:
Dual Xeon E5-2420 v2, 32GB RAM, CentOS 6, Python 2.7.10:
I ran a basic numpy/BLAS benchmark on both machines (dot() on 1000x1000 matrix). Note: Playing with OPENBLAS_NUM_THREADS value had marginal effect in every case. Although recent commits provided some welcome gains, it appears current code has trouble scaling upwards of 4 threads, and tops around 8 threads. Also, I have no clue how to explain the substantial difference of performance favoring my simple i5 desktop, even if numpy/OpenBLAS performance is higher on the server... Maybe you can help me shed some light on all that? Thanks! |
@fortiema - Thanks for the detailed info-gathering! I've got no sure answers, but a dump of thoughts:
|
@gojomo Thanks for the pointers! About the last point, I compiled Python from source on the server, versus using the built-in Python on the desktop, so if there is a difference the server should have an advantage here. I'm using LineSentence for my benches, so this is as raw IO as it gets, I guess? The only thing being done is splitting on whitespace. I'll have a look at those scheduler parameters; this is mostly new to me! Looking forward to your PR being merged so I can run some additional tests! P.S.: I would also have a look under the hood, but I'm totally new to Cython and parallel Python, and currently lack the time to dig into this. Hopefully I get a chance to pick it up soon; I would be more than happy to help! |
I'm also having a similar problem. I'm using doc2vec with negative sampling and the dbow model. With 64 threads (32 cores), the usage of each thread is similar and hovers around 8%, and I'm getting about 89k words/s. It takes word2vec <10 minutes whereas gensim is taking 1 hour. I've also messed with Some system stats:
The corpus is fed in as a list of tagged documents, with words themselves being lists too so hopefully that's not a bottleneck. Here's a breakdown with different numbers of workers:
Setting This might be related to #450, my sentences are tweets. |
@jskda, any chance you could try the same tests as in your 'breakdown' table, with the code in this branch: https://github.com/gojomo/gensim/tree/batching ? It's my experimental approach to passing larger runs-of-small-examples to the cython nogil code regions. (It requires recompilation of the cython extensions and thus the process is roughly: (1) clear existing gensim installations from environment; (2) clone the git branch; (3) pip install from repo directory: When constructing your model, add a I don't think this will help with workers > 32 (actual cores), but it may help with worker counts in the 4-32 range, making the optimal number for your task higher. (My ideal would be workers = cores will always perform best, but I don't think this change is enough yet to achieve that, for large numbers of cores.) |
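The idea of handing larger runs of small examples to the optimized code can be illustrated with a plain generator that groups short documents until a word budget is reached. This is only a sketch of the grouping step, not the code in the branch; `batch_by_words` and `max_words` are hypothetical names.

```python
def batch_by_words(docs, max_words):
    """Group consecutive token lists so each batch holds at most `max_words`
    words (a single document longer than that gets a batch of its own)."""
    batch, batch_words = [], 0
    for doc in docs:
        if batch and batch_words + len(doc) > max_words:
            yield batch
            batch, batch_words = [], 0
        batch.append(doc)
        batch_words += len(doc)
    if batch:
        yield batch

# Tweet-sized documents of 3, 4, 5 and 2 tokens.
tweets = [["a"] * 3, ["b"] * 4, ["c"] * 5, ["d"] * 2]
batches = list(batch_by_words(tweets, max_words=8))
batch_sizes = [sum(len(doc) for doc in batch) for batch in batches]
print(batch_sizes)
```

With very short texts such as tweets, each batch then represents one longer stretch of work per hand-off instead of many tiny ones.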
I've run the same tests on your branch (ran cython on the pyx files and pip install) and it seems to improve by a few percentage points for more cores, but strangely it reduces performance for 8 threads. I've added in some word2vec counts for comparison too.
|
Thanks for the detailed numbers! I am a bit perplexed by the results, both before and after. I didn't expect, but could kind of understand, a small batching penalty at low worker counts... but it's surprisingly large, and I haven't noticed similar issues in my ad-hoc testing. I'm surprised that the throughput for 64 threads has increased: I would've expected performance to saturate and even suffer from extra contention earlier, at least in python. And the word2vec.c numbers are also mysterious: I don't have a mental model that could explain peak performance at just 4 threads, and worst at 8, but 64-almost-as-good-as-4. Just stabbing in the dark: What are the chances of some hidden, under-controlled variable? Are there other processes competing for IO/CPU, or is this a virtualized environment (where such competition might be hidden)? Is the iterator providing the examples doing something atypical, either in IO or decompression/preprocessing? Are your doc-vecs memory-mapped and randomly-accessed? (Memory-mapping for larger-than-core arrays of doctags-in-training should work best if training passes make an orderly sweep over the mmapped array.) |
@gojomo Sorry word2vec had an extra 0 for 4 threads so it's increasing. This is running on a dedicated server, there are some other processes going on but they're fairly constant in cpu usage. I'll try again now to see if the numbers change much. The documents are provided as a list of |
Re: memory-mapping There's an optional Doc2Vec argument ( Are you (re)shuffling the examples each training-pass? While that can be best for results it might thwart CPU cache efficiency, especially with very small documents.) |
The training set does fit in RAM, and while training there's 11 GB used (the machine has 500 GB). Training is done like this, with only a single pass:
I can give you a version of my script to reproduce those numbers. |
@jskda – sure, the script could be helpful. I want to dig deeper into whatever's blocking greater utilization with large numbers of cores… but may not have much time for it anytime soon. |
I noticed that as well: I ran word2vec on a large EC2 CPU instance with 64 threads, and htop indicates 10% CPU usage per thread (word rate is 800k). Any progress on that? |
64 threads is going to be a challenge given the amount of GIL-locked code that still needs to run: you may get higher utilization with fewer contending threads. If the texts are small, make sure you're using the latest gensim – older versions didn't bundle as many together to minimize GIL-noGIL transitions. But also: make sure your corpus iteration isn't a bottleneck (it's still single-threaded) – if you can pretokenize & keep-in-memory the whole corpus, that may help. Finally, some options that would normally cost in performance – larger windows, more negative examples – may become nearly costless in a situation where contention is a factor. (They help keep each thread in the noGIL blocks longer, using 'free' CPU that'd be idle anyway.) So it's something else to try, if increasing those parameters is helpful to your application. (It isn't always.) Other than that, you may be hitting inherent limits of the current implementation. I'd like to explore working-around such limits further, but don't often have an idle more-than-8-core machine and the time to dig deeper with a dataset/config that shows the problem. |
Thanks! Re: latest gensim - yes, I used dev version In general, it is fast enough for me; it just seemed like it could be a few times faster given the CPU utilization. I can share scripts to reproduce if that helps; I am actually fitting the Wikipedia corpus from gensim. |
Sharing the scripts could help to reproduce (or trigger other suggestions) – but even if available, might not have a chance to look at or run in a similar setup for a long while. But barring code optimizations, my main recommendations are to read pre-tokenized text from memory, and try fewer-workers-than-cores (because it seems the overhead of contention itself becomes an issue) if searching for maximal throughput. (Tweaking other parameters would affect the properties of your end-vectors, and thus would be a project-specific consideration.) |
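Reading pre-tokenized text from memory can be as simple as splitting every line once, up front, and keeping the resulting lists. This is a minimal sketch that assumes the corpus fits in RAM; `load_tokenized` is a hypothetical helper and `io.StringIO` stands in for an open corpus file.

```python
import io

def load_tokenized(lines):
    """Split each non-empty line on whitespace and keep all token lists in a list."""
    return [line.split() for line in lines if line.strip()]

# io.StringIO stands in for an open corpus file here.
raw = io.StringIO("first tweet here\n\nsecond tweet\n")
corpus = load_tokenized(raw)
print(corpus)
```

The resulting list can be iterated repeatedly with no file IO or tokenization cost in the single-threaded feeding loop.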
See especially #532 for other discussion of important factors - but this issue should capture any discussion/decisions about improvements in Word2Vec/Doc2Vec parallel thread utilization/throughput. |
With a corpus_file, we found an optimal speed at workers = cpu (vcpu) available in the instance. And this is using all the cpu of the instance. |
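For anyone wanting to try the corpus_file route, the file is plain text with one sentence per line and tokens separated by spaces (the LineSentence format). The sketch below only prepares such a file and a worker count; `write_line_sentence_file` and the sample sentences are hypothetical, and the gensim call is left as a comment because gensim may not be installed where this runs.

```python
import os
import tempfile

def write_line_sentence_file(sentences, path):
    """Write token lists as one space-separated sentence per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(" ".join(tokens) + "\n")

sentences = [["hello", "world"], ["more", "cores", "please"]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_line_sentence_file(sentences, path)

# One worker per available (virtual) CPU, as reported above.
workers = os.cpu_count() or 1

# With gensim installed, the file could then be passed along these lines:
#   from gensim.models import Word2Vec
#   model = Word2Vec(corpus_file=path, workers=workers, min_count=1)
print(path, workers)
```

Whether workers equal to the CPU count is optimal will still depend on the machine and corpus, so it is worth timing a few values.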
@piskvorky maybe this issue could be closed? |
Yes, the issue is stale. If the issue persists, please open a new ticket. |
I have been using word2vec with great success on a Macbook Pro, seeing nice increases with additional workers. However, when moving the code over to a MacPro (24 virtual cores), the code actually takes longer with the addition of more workers. When I run "gensim.models.word2vec.FAST_VERSION" I get a value of 0. This same code will give me a value of 1 on the Macbook Pro. Is there something I am missing moving over to the Xeon processors?