Doc2vec not parallelizing #532
Comments
As soon as FAST_VERSION is not -1, there are compute-intensive codepaths that avoid holding the python global interpreter lock, and thus you should start to see multiple cores engaged. However, there are still bottlenecks (eg discussions in #336, with some maybe-workarounds) that limit how well all the cores are engaged, especially as the number of cores/workers grow greater than 4 or 8. So you should expect to see some but not full core-engagement. In general, training that spends more time inside the optimized code will achieve higher utilization. That means more dimensions, larger text examples (more words), larger window values, or a larger count of negative samples (if using negative sampling). In fact I've noticed that when training isn't yet saturating all cores, upping some of those parameters (that would normally require more work to be done per example, and thus slower completion) can come 'for free'. |
Ok. Thanks for the explanations. Besides this, I noticed that as it runs, it gets slower and slower, as measured by the number of words/second, and when you look it up in htop, it is consuming just 0.7% of one core. It started at 3.5k words/sec and after a few hours of running it is down to 36/sec. Do you get this kind of behavior too? |
Are you on OSX? If so it might be this 'app nap' issue: #493. If not, I've not seen that behavior and wouldn't expect it: after the initial model setup, training should be at about the same rate early or late. If you're seeing it, I would first take a look at possible IO/network bottlenecks (or maybe throttling) in reading the data, or if the data has been sorted in some way that makes the later examples very different, or if some other code (including perhaps the code feeding in the examples) might have some performance issue (eg linear-scan-from-start-for-each-example) or be triggering swapping. |
No I am not. There is no I/O bottleneck that I can detect. I am streaming the examples directly from a local Mongodb collection. I have trained word2vec models from the same dataset under the same circumstances without any issues. |
There's little difference between what the Doc2Vec and Word2Vec models are doing during training, and certainly no extra or harder steps that would account for a slowdown at the end. The Doc2Vec model might be using much more addressable memory, if there are far more documents/doctags than vocabulary words. That might show up as swapping. But I would most strongly suspect whatever code fetches and feeds the examples to the model. (If the data can fit in main memory, perhaps compose it all there rather than having another DB/disk in the loop during processing. Or even if not, perhaps eliminate any DB/API as a factor by streaming the corpus as text from a fast local volume like an SSD.) |
@gojomo Is this reproducible? Otherwise suggest closing the issue. |
This is absolutely a problem for me. I'm creating a doc2vec model using the command:
(I've paraphrased a lot down here). Using htop, I can see that only one core is actually in use, the rest are idle. I start from a 14.04.4 ubuntu image on an ec2 c4.8xlarge. My provisioning script:
My requirements:
I've tried setting scipy to 0.15.1 and gensim to 0.12.1, and then all versions of gensim to 0.12.4, and scipy to 0.16.1. All say FAST_VERSION = 1. None of them use more than one core at a time. I've also tried the numpy from apt (1.8.6), as well as the latest (1.11.0), with no change. |
How many examples and words are in your 'sentences'? |
Here's a sample from when the code starts running:
|
Looking at this, I'm wondering if the 18 million words being constrained to 300k at a time might be a bottleneck. I have more than enough RAM to handle more than that; any way to increase? Or could there be something else at play? If I set |
Where are you seeing a "300k at a time" constraint? With all the version-variants you've tried, you may want to make absolutely sure of the effective FAST_VERSION value, by printing it just before the training. (Though, you should also see a logged warning if using the pure-Python fallback.) Given how much of the process is still in Python and subject to the GIL, you're unlikely to see full 36-core parallelism, but should be seeing something better than 1. It's possible the attempt at 36-threads is making things harder, so it'd be worth trying smaller worker counts – especially 2 to 16. (If none of these show any hint of using more cores, which I've seen working on Ubuntu 14.04 with 8 workers on an 8-core machine, then I'd start to expect some other odd system-specific limit – something in the support libraries or processor-affinity settings or somesuch.) The part that can thread-parallelize is the cython 'nogil' regions. Generally, things which make the algorithm spend more time there increase the parallelism (when it's working at all). So: a larger Recent gensim also batches multiple examples, up to a total word-count of 10000 ( gensim/gensim/models/word2vec_inner.pyx Line 678 in b99e852
|
So I've tried dropping the number of workers to 1, 2, 4, etc, and they all go to the same speed as the 36 core instance, and htop shows a single core being used. It just feels like, somehow, the threads are still being constrained to a single core. I remember getting this to work before in the past, so I'm wondering if I was misremembering that, or if there is something that's changed in those apt-get updates that's breaking things. How do I enable dbow_words concurrent word-training? Is that changing the setting from I don't know how to read the numpy config settings, but is there something in here that looks incorrect?
It looks similar enough to the output in #336 that nothing stood out as a warning flag to me, but as I mentioned, I'm not sure how to read that. |
You have most probably checked this but maybe there was some cpu binding So I've tried dropping the number of workers to 1, 2, 4, etc, and they all I remember getting this to work before in the past, so I'm wondering if I How do I enable dbow_words concurrent word-training? Is that changing the I don't know how to read the numpy config settings, but is there something
It looks similar enough to the output in #336 — |
@tmylk don't use this style of (email?) replies -- it pollutes github with quoted texts and makes the conversation hard to follow. |
Occasionally when people have more than one python/virtualenv, they're checking the FAST_VERSION in a different one than is running their application code – so I would make absolutely sure that the code that's running slow/single-core is reporting a non-negative FAST_VERSION. Also, while I have no reason to think it's a problem here, I generally try to install anything that can be installed with pip, with pip (rather than apt-get) – so I'd not install python-numpy or python-virtualenv via apt-get. (You might be making some redundant or not-optimally-available installations.) You could also try using conda, which I've had good luck with on Ubuntu. I usually start with the 'miniconda' installer (http://conda.pydata.org/miniconda.html), and create an environment based on the packages that conda does well (numpy/scipy/ipython/notebook). But then, use 'pip' to install gensim (because conda's version lags). Again, no specific reason to think this will help with your issues, but it's worth a try just to mix things up, given the mystery. |
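One way to rule out the "wrong virtualenv" case is to report the interpreter path, the gensim install location, and FAST_VERSION from inside the same process that does the training. The following is only a sketch: the helper name `describe_gensim_environment` is made up for illustration, and it assumes that `FAST_VERSION` is exposed on `gensim.models.doc2vec`, as the check quoted elsewhere in this thread suggests. If gensim cannot be imported at all, the gensim fields stay `None`.

```python
import sys


def describe_gensim_environment():
    """Return the interpreter path and, if gensim imports, its location and FAST_VERSION."""
    info = {"executable": sys.executable, "gensim_file": None, "fast_version": None}
    try:
        import gensim
        from gensim.models import doc2vec
    except Exception:  # a broken install can fail in ways other than ImportError
        return info
    info["gensim_file"] = gensim.__file__
    # Assumption: FAST_VERSION is an attribute of the doc2vec module; None if absent.
    info["fast_version"] = getattr(doc2vec, "FAST_VERSION", None)
    return info


if __name__ == "__main__":
    for key, value in describe_gensim_environment().items():
        print(f"{key}: {value}")
```

Calling this right before training starts, from the training script itself, avoids checking one environment while running another.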
What would I be looking for if some kind of For this provisioned instance (or in a local vagrant, whichever), there's only one venv, and gensim is only installed in the venv. I can only import gensim to check I tried to use 1.11 of numpy, but that didn't seem to alleviate the problem. I'm going to wipe my venv and try again, but I've tried that numerous times in the last little while. I think maybe the best thing would be that I not use my own code, but maybe use some kind of test case. Is there one, and if so, where is it? |
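For the CPU-binding question, a rough check is to compare the number of CPUs the process is allowed to run on with the number the machine has. This sketch uses only the standard library; `os.sched_getaffinity` is available on Linux only, so on other platforms the helper (a hypothetical name, `usable_cpus`) simply assumes there is no restriction.

```python
import os


def usable_cpus():
    """Return (CPUs this process may run on, CPUs in the machine)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):  # Linux only
        allowed = len(os.sched_getaffinity(0))  # 0 means "the current process"
    else:
        allowed = total  # no affinity API here; assume unrestricted
    return allowed, total


if __name__ == "__main__":
    allowed, total = usable_cpus()
    print(f"process may run on {allowed} of {total} CPUs")
    if allowed < total:
        print("affinity is restricted; all worker threads share the allowed CPUs")
```

Running it once at startup and again after the numerical libraries have been imported would show whether anything narrowed the set in between.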
Hi mmroden There is a gensim doc2vec tutorial that is parallelized at https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb It is also easy to test if Python can parallelise some simple task, say x^2, on many cores on your instance. |
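A possible sanity test along these lines, sketched with the standard library only. Note that it swaps the suggested x^2 task for SHA-256 hashing of large buffers: plain Python arithmetic in threads stays serialized by the GIL, whereas `hashlib` releases the GIL for large inputs, which is closer to what gensim's worker threads do in their cython `nogil` regions. The names `hash_rounds` and `timed_run` and the payload size are arbitrary choices for illustration.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = b"x" * (4 * 1024 * 1024)  # large buffers let hashlib release the GIL


def hash_rounds(rounds):
    """Hash the payload `rounds` times and return the final hex digest."""
    digest = hashlib.sha256()
    for _ in range(rounds):
        digest.update(PAYLOAD)
    return digest.hexdigest()


def timed_run(workers, tasks=8, rounds=4):
    """Run `tasks` identical jobs on `workers` threads; return (seconds, digests)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(hash_rounds, [rounds] * tasks))
    return time.perf_counter() - start, digests


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        seconds, _ = timed_run(workers)
        print(f"{workers} workers: {seconds:.2f}s")
```

If wall time drops as the worker count rises (and htop shows several cores busy), threads in this Python process can use multiple cores, and the problem is more likely in the training setup than in the system.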
It seems like it might be useful to have this kind of test available in general, if I'm not the one who opened the ticket. Let me see what I can write up, but would it make sense to roll whatever I make as a potential contribution back? |
It would be useful to have an example to test if any Python parallelisation is possible. It has a place in this issue and in the FAQ for troubleshooting. But it is a general python thing, not particularly related to gensim, so we would not merge it in as a PR. |
Here is my timing test code. When it runs, I can clearly see all processors being used by htop. When I run the model creation code above, I only see one core working (ie, the original problem). So this isn't a general python/python multithreaded issue. I'm running this locally, rather than in the ec2 instance, on a virtual machine with 4 cores. The machine is provisioned using that exact same Is there something wrong in the way that I'm creating the model, some parameter I didn't set? I really am genuinely baffled here.
|
Unfortunately Locally, when I run the following (in an ipython notebook on OSX), it takes about 18 seconds, and reported CPU utilization of the Python process exceeds 500% – indicating multiple cores are engaged. I would also expect it to show multiple cores engaged under
(While this ticket's description seems appropriate for your issue, I highly suspect that the original reporter's issue was something in their corpus streaming. It looks like your code brings all the examples into python-objects in main-memory before training begins, so I don't suspect that yours is really the same issue. Though, if in fact your |
So the posted code pegs all CPUs, and htop shows that one cpu is at max during gensim training. Would there be a way to check that there could be some resource contention or locks? I noticed some people using gdb to attach to the process at random intervals and found randint calls in the other ticket, could something like that be at work here as well? |
The randint issue was fixed a few releases ago, so it's not (exactly) that. Does your |
More debugging: I do see that all cores are used when this step is shown:
That leads me to believe that there's some hidden generator in my corpus. I'm creating it like so:
where
Which looks like it would hold everything in memory to me, and when I change that Side note: When I call corpus creation like so:
(note I get this crash:
Not sure if that's a separate issue. I can get the stack trace if that helps. This issue does not occur if |
Regarding the side note: stack trace in a separate issue will be greatly appreciated. |
I've removed all try/excepts-- I'm having a hell of a time reproducing. If I see it again, I'll make the trace into another issue. |
Does the doc2vec-IMDB.ipynb demo notebook manage to utilize multiple cores on your system? This is an important question to resolve, since if it does, we know that the Python/numpy/scipy/gensim-cython code is working on your system, and don't need to do further investigation of those factors. The "something broke in trial 0, continuing" message does not look like a gensim printout. What does it mean? Is your code printing it after some sort of timeout or other test of the results? So that You could try to make certain the
|
So it looks like the model training from the ipython notebook does run in parallel-- well, this line works:
So that means that the problem is more in the data preparation side, right? The 'something broke' message is from my code-- I was getting random failures when I was setting everything up, but those failures stopped, so I didn't remove it until now. I actually have a group of training data sets, and a holdout set-- I was comparing training set formation and its effect on whether the system would reproduce the label on the set. Hence the different names, there's a function call in between. When I switch to |
Now that you've seen multiple cores engaged (in both the doc2vec-IMDB.ipynb example, and your own code), we know that's working – the remaining issue is increasing core-utilization. You should again try varying things like |
I literally just swapped out |
I think the issue should remain open -- we're really interested in figuring this out. We'll assist you with the debugging any way we can. The |
This is the format of the data that goes into the corpus line:
As such, the The corpus is randomly selected for each holdout run, but a typical run will have ~26k - ~30k items. The document lengths are in the neighborhood of an average of ~706 words, with a max of ~18821 and a min of 3. The number of tags can range from 1 to 11, with an average of 1.05 (ie, most documents are labeled once). |
Please, on a system where you can confirm seeing the "only one core used" problem with the original Otherwise, the size/shape of the data shouldn't have big effects on the throughput, at least not in the latest release (where a batching-of-small-examples optimization exists). (Side note: due to implementation limits in the cython path, a document of more than 10000 words will only have its first 10000 words considered. A workaround suitable for most cases would be to split the document at 10000-word intervals, but re-use the exact same |
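The splitting workaround from the side note could look roughly like this. It is a sketch under assumptions: the helper name `split_long_document` is invented, the 10000-word limit is taken from the comment above, and it assumes that what gets re-used across the pieces is the document's tags (the end of the sentence was lost above). Documents are treated as plain lists of tokens.

```python
MAX_WORDS = 10000  # per-document limit mentioned in the comment above


def split_long_document(words, tags, max_words=MAX_WORDS):
    """Yield (chunk_of_words, tags) pairs, each chunk at most max_words long."""
    for start in range(0, len(words), max_words):
        # Every chunk carries the same tags, so all pieces train the same doc-vector(s).
        yield words[start:start + max_words], tags


if __name__ == "__main__":
    long_doc = [f"w{i}" for i in range(18821)]  # the longest document length reported above
    for chunk, tags in split_long_document(long_doc, ["doc_42"]):
        print(len(chunk), tags)
```

Each pair would then be wrapped in whatever tagged-document class the installed gensim version expects before being added to the corpus.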
I have done the alteration, no change. I'm thinking that what we may need to do is set up some kind of remote debugging session, especially since you guys want to see this in action. Would that work? Who would I contact about doing that? Best case scenario, after five seconds, it's obvious that I screwed something up, and that becomes apparent when the whole system is in view. |
So was it "only one core" both before and after the change? Earlier you mentioned something changed the observed behavior from "only one core" to "many cores but still not more than 100% total". Can you still toggle between those two behaviors? A more precise way to log core-utilization may be to use a command like Is this on the local VM or the AWS VM? If the local VM, has the local VM ever reproduced the "exactly one core" condition? What is the local virtualization system used? For open-source support, I prefer the back-and-forth of a discussion log: it forces precise communication and incremental isolation, and creates an archived series of reference steps others can learn from in future similar situations. If a full set of code & data that reproduces the problem elsewhere can be shared, that's great too. But if fixing it requires looking at proprietary code/systems (even just for a little while), that can't be shared, that's a consulting gig. |
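Besides an external monitoring command, utilization can also be logged from inside the script by comparing CPU time with wall-clock time around the training call. A minimal sketch using only the standard library; the context-manager name `cpu_utilization` is hypothetical. `time.process_time()` sums CPU time over all threads of the process, so a ratio near 1.0 means roughly one core was busy and a ratio well above 1.0 means several were.

```python
import time
from contextlib import contextmanager


@contextmanager
def cpu_utilization(result):
    """Store CPU-seconds / wall-seconds for the enclosed block in result['ratio']."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over all threads of this process
    try:
        yield result
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        result["ratio"] = cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    stats = {}
    with cpu_utilization(stats):
        sum(i * i for i in range(2_000_000))  # stand-in for the model training call
    print(f"average cores busy: {stats['ratio']:.2f}")
```

Wrapping only the training step (not corpus loading or the vocabulary scan) keeps the single-threaded phases out of the measurement.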
I've looked further into this, and I think I have a potential solution (sorry for the delay, but, you know, life). I'm seeing a single core being used when the number of sentences is near to the number of labels; that is, if I have 25k sentences and 10k labels, that uses one core. But! If I have 160k sentences and 1k labels, that uses all cores. So this:
Is 1 core, regardless of the size of the box, while this:
Uses 36 cores pretty well. Does that observation jibe with how the algorithm is parallelizing internally? |
I'll just add to the discussion, noting that I'm having a similar issue. On both Windows and an Ubuntu box, I see no obvious evidence of Doc2vec parallelizing. I'm feeding the model with a
This was originally happening with 0.12.4, but upgrading to 0.13.2 didn't change things. Was there any movement on the point by @mmroden about the ratio of labels to examples? For my case I'm not labeling any documents, so if I understand things correctly the number of labels and documents should be equal. |
@mmroden - I can't think of a good reason that'd make a difference... and to the extent I can imagine possible relationships, I might expect the alternate relation to hold: more tags might parallelize better, because different cores (which might not share the same CPU cache) would be writing to overlapping memory ranges slightly less often. The bigger difference I see between your two scenarios is the number of examples. (The batching that happens means small datasets may never get a chance to spread over many threads.) Does the perceived relationship persist with a larger dataset? |
@jlorince The 1st thing to check: are you sure the cython-optimized versions are running? (There's effectively no parallelism possible without them.) You can check this by ensuring Next is whether your IO/decompression (which can only happen in the master thread) is the real bottleneck. If you have the memory to read all examples into a list, then use that as the corpus, does that help? Because large parts of the code are still subject to the Python GIL, saturating all cores is unattainable, and indeed trying to use more workers can decrease total throughput through more contention (so also try values between 1 and NUM_CORES to see if it helps). But I've always seen at least some multi-thread activity. How are you monitoring the thread activity (specifically on Ubuntu) and determining that parallelization is not happening? |
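To illustrate the "read all examples into a list" suggestion: materializing the stream takes the I/O out of the training loop, and it also gives an object that can be iterated more than once, which a bare generator cannot. The sketch below uses a made-up `stream_examples` generator as a stand-in for a database cursor or compressed file; nothing in it is gensim-specific.

```python
def stream_examples(n):
    """Stand-in for a streaming source (database cursor, compressed file, ...)."""
    for i in range(n):
        yield [f"word{i}", "another", "token"]


def count_passes(corpus, passes=2):
    """Iterate the corpus `passes` times and return the item count seen on each pass."""
    return [sum(1 for _ in corpus) for _ in range(passes)]


if __name__ == "__main__":
    print(count_passes(stream_examples(1000)))        # generator: second pass is empty
    print(count_passes(list(stream_examples(1000))))  # list: every pass sees all items
```

With the list in hand, the same corpus object can be passed for both the vocabulary scan and every training pass, and any remaining idle cores can no longer be blamed on the reader.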
I checked on both systems, and got a |
@jlorince Confirm that loading the docs into a list is the right approach to load into memory and exclude I/O |
@jlorince Yes, |
Ok, just ran it on a subsample of documents, loading everything into memory, and I do see evidence of parallelization (in htop on linux, will test on Windows next). Regarding the discussion above of the pros/cons of multiple cores...if we have the memory to load all the raw data into RAM, is there still a downside to using as many workers as cores? |
The optimal number of worker threads will probably still be somewhere between 1 and the number of cores. Pre-loading into memory eliminates the bottleneck of single-threaded IO/decompression as a cause of idle threads. There's still the issue of the Python GIL, which means the pure-Python portions of the process contend with each other, and beyond some count of workers, such contention can mean more workers will stop helping and start hurting overall throughput. (As noted in some of my comments above, some parameter choices, by allowing more to be done inside the optimized cython GIL-oblivious blocks, can also see less-contention/higher-utilization.) |
@jlorince Thanks for testing it. Waiting for update on Windows. If it parallelizes there then will close the issue. |
Leaving open to investigate relationship between parallelization and tags/docs ratio in #532 (comment) |
My code also does not parallelize. """--------------------------------------------My inputs----------------------------------------------------"""
inputs = LabeledLineSentence(self.filename, self.field) """------------------------------------1st WAY----------------------------------------------------------""" """------------------------------------2nd WAY----------------------------------------------------------""" I've tried both ways on Windows with gensim.models.doc2vec.FAST_VERSION > -1 ensured works parallel but neither way utilizes more than one core. I also don't understand why the first way runs in 5 minutes and the second way runs in 25 minutes. I hope I don't do anything stupid. Thanks in advance. p.s. I have 10845 unique tags from a corpus of 10845 examples. Is this finally a problem of not utilizing the rest of the cores? |
@christinazavou Try |
@tmylk Thanks! This reduces the time (the 1st way now runs in 1 minute and the 2nd way in 7 minutes). However, it still utilizes only one core. (I shouldn't have problems running on multiple cores, because gensim LDAmulticore runs as expected.) |
@christinazavou That sounds like an issue. Could you please post more about your config?
|
@christinazavou This would be better discussed on the project discussion list, as it does not appear you're hitting the known issue tracked here that gensim doesn't parallelize as much as we'd like beyond 2-8 cores, but rather something specific to your usage. That said, there's no surprise that "2nd way" is slower - every call to Also, what are you using to monitor that only one core is being used, and at what stage of the process? The initial creation of the list (reading from files), and the initial scan of that list to discover the corpus vocabulary, will each only use one core no matter what; it's only the model training passes that can start to use multiple cores. So only after your logging indicates that training has begun should you check if multiple cores are engaged, and with |
While there's some useful discussion here, the oldest report of such Word2Vec/Doc2Vec bottlenecks (also with useful discussion) is #336. Closing this issue in favor of continuing discussion of potential improvements there. |
Doc2vec does not use all my cores despite my setting workers=8 when I instantiate it.
My install passes the assert below:
Do I have to do something else?