
Most use cases are much slower on 4-CPU workers #551

Open
crusaderky opened this issue Nov 24, 2022 · 21 comments

@crusaderky
Contributor

crusaderky commented Nov 24, 2022

I ran the following A/B test:
https://github.com/coiled/coiled-runtime/actions/runs/3534162161

dask is 2022.11.1 everywhere.

| runtime | Worker instances (small_client) | Worker instances (test_parquet.py) | Vendor |
| --- | --- | --- | --- |
| AB_baseline | 10x m6i-large: 2 threads, 8 GiB per worker | 16x m6i-xlarge: 4 threads, 16 GiB per worker | Intel |
| AB_4x | 5x m6i-xlarge: 4 threads, 16 GiB per worker | 8x m6i-2xlarge: 8 threads, 32 GiB per worker | Intel |
| AB_m6a | 10x m6a-large: 2 threads, 8 GiB per worker | 16x m6i-xlarge: 4 threads, 16 GiB per worker | AMD |

As you can see, the cluster-wide number of CPUs and the total memory capacity are the same everywhere.
The cost per hour of AB_4x is exactly the same as the baseline's, whereas the AMD instances are 10% cheaper.

Insights

  • Doubling the number of threads and halving the number of workers causes memory usage to drop by 20~45% in most cases. This is expected; the use cases where it drops more are those with higher rates of data duplication among workers.

[Chart: AB_4x vs baseline, average memory]

  • The same change, however, causes overall runtime to balloon across the board, sometimes by more than 80%. This is not expected and probably indicates severe GIL contention in numpy and pandas; however, this is a blind conclusion and should be verified carefully.

[Chart: AB_4x vs baseline, runtime]

  • AMD CPUs are 10% cheaper per hour than Intel ones, but are also up to 50% slower, so are not recommended for any CPU-intensive workload.

[Chart: AB_m6a vs baseline, runtime]

Required actions

  • Investigate the increase in runtime when doubling threads per worker; confirm or rule out GIL contention, pin it down to individual numpy/pandas functions, open upstream tickets, and start tracking them.
  • Figure out how to start multiple single-thread workers on a VM (note that the minimum number of CPUs is 2; see the sketch after this list) and rerun the A/B tests to see whether there's a performance increase compared to baseline.
  • The above will most likely show that 4 GiB (half of an m6i-large) is not enough in most cases; investigate the cost/performance of r6i instances, which have twice as much RAM.
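
For the second point, here is a minimal sketch of the proposed layout, purely for illustration. On a VM this would typically be done with the worker CLI (something like `dask-worker <scheduler> --nworkers 2 --nthreads 1 --memory-limit 4GiB`); the same shape can also be emulated locally for quick experiments. The worker count and memory limit below are placeholders, not the benchmark configuration.

```python
# Local emulation of the "many single-threaded workers" layout, useful for
# quick experiments before touching EC2/Coiled. Numbers are illustrative only.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,           # four worker processes instead of one
    threads_per_worker=1,  # one thread each, so compute threads never share a GIL
    memory_limit="4GiB",   # per-process limit, e.g. half of an m6i-large's 8 GiB
)
client = Client(cluster)
print(client.dashboard_link)
```
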
@ntabris
Member

ntabris commented Nov 24, 2022

What's the goal? Improve dask performance under various circumstances, provide better guidance on good infrastructure choices, something else?

@crusaderky
Contributor Author

The goal is for dask to get 40% better memory performance and fewer network transfers on fewer, larger hosts; but in order to benefit from that, we need to deal with GIL contention first.

@fjetter
Member

fjetter commented Nov 24, 2022

> What's the goal? Improve dask performance under various circumstances, provide better guidance on good infrastructure choices, something else?

The context is that we've had an argument about the performance of workers with many threads. The benchmarks above use the same total number of CPUs and the same total RAM on the cluster, but the distribution is different.

Under the assumption that GIL contention is not an issue, the "few larger VMs/workers" setup should perform better, at least w.r.t. memory usage. We can see that this is true: memory usage is reduced.
Runtime performance, however, takes a severe hit, which is likely caused by the GIL; it may also be due to blocked event loops on our end.

The goal is a combination of things. This would likely translate into better performance for everyone, a better default setting for Coiled, and possible future scenarios where we deploy things a bit differently (e.g. multiple workers on the same, larger VM).

@crusaderky Why compare AMD vs Intel? I don't see how this is related.

Re: Required actions

Let's talk about these points first before going down the rabbit hole.

@ntabris
Member

ntabris commented Nov 24, 2022

> Why compare AMD vs Intel?

FYI for a while we were unintentionally giving customers AMD rather than Intel (we weren't excluding AMD and we were filtering by cost). Multiple customers said "why did my workloads get slower?"

@crusaderky
Contributor Author

> @crusaderky Why compare AMD vs Intel? I don't see how this is related.

We did not have a measure of how much slower AMD instances are.
Before this test, AMD could have been better value for money e.g. if it had been less than 10% slower.

@crusaderky crusaderky changed the title Use cases take over 80% longer on 4-CPU workers Most use cases are much slower on 4-CPU workers Nov 24, 2022
@crusaderky crusaderky assigned crusaderky and unassigned crusaderky Nov 24, 2022
@dchudz
Collaborator

dchudz commented Nov 25, 2022

FWIW I think the platform team would be happy to hear evidence about common workloads where it'd be helpful to have multiple worker processes per instance. That's the sort of data that would help us prioritize allowing that in the platform. (It's possible we'd also want help from dask to implement it.)

My interpretation of Florian's comment is that we're not really sure yet, but maybe I'm reading it wrong.

Anyway let's stay in touch!

@fjetter
Member

fjetter commented Nov 28, 2022

Yes, not sure yet. The one benefit this would have is that transfers between these workers are typically faster, because we don't actually need to go over the network and the kernel can sometimes even apply some optimizations.

The common workload here is again heavy network. I can see this being relevant for both "very large data transfers (higher throughput)" and for "many, many small data transfers (lower latency)". I think we don't have any reliable data to know how impactful this is in which scenario.

I'm wondering how difficult it would be for Coiled to support this, even just for an experiment. There is also a way to achieve the same or something similar from within dask. I'm interested in running this experiment, but I don't know what the cheapest approach is.

@gjoseph92
Contributor

FYI snakebench already supports profiling each test with py-spy in native mode: https://github.com/gjoseph92/snakebench#profiling. This makes it very easy to see how much GIL contention (or if not that, something else) is affecting us.

See the dask demo day video for an example of profiling the GIL in worker threads with py-spy: https://youtu.be/_x7oaSEJDjA?t=2762

@crusaderky
Contributor Author

> I think we don't have any reliable data to know how impactful this is in which scenario.

> I'm wondering how difficult it would be for Coiled to support this, even just for an experiment. There is also a way to achieve the same or something similar from within dask. I'm interested in running this experiment, but I don't know what the cheapest approach is.

I mean, we could test this without Coiled: manually start a bunch of EC2 instances, manually deploy conda on all of them, manually start the cluster. It's a substantial amount of legwork though.

@fjetter
Member

fjetter commented Nov 29, 2022

> I mean, we could test this without Coiled: manually start a bunch of EC2 instances, manually deploy conda on all of them, manually start the cluster. It's a substantial amount of legwork though.

I see three options:

  1. Change something in Coiled, if easily possible; https://github.com/coiled/product/issues/7#issuecomment-1330610879 suggests it might be.
  2. Hack the nanny to start not one but multiple WorkerProcesses (a rough sketch follows after this list). Should also be easy if we do not care about gracefully shutting them down again.
  3. Do the manual thing you mentioned. Might be OK to do once, but if we want to repeat a measurement this would escalate into serious work. I'd rather go for 1. or 2.
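
For what it's worth, option 2 may not even need patching the nanny: the rough, untested sketch below just starts several Nanny instances on one VM, each managing its own single-threaded worker process. The scheduler address, process count, and memory limit are placeholders.

```python
# Sketch only: run several single-threaded worker processes on one VM by
# starting one Nanny per process in a small launcher script.
# Graceful shutdown is deliberately not handled.
import asyncio
from distributed import Nanny

SCHEDULER = "tcp://scheduler:8786"  # placeholder address
N_PROCS = 4

async def main():
    nannies = [
        Nanny(SCHEDULER, nthreads=1, memory_limit="4GiB") for _ in range(N_PROCS)
    ]
    await asyncio.gather(*nannies)    # awaiting a Nanny starts it and its worker
    try:
        await asyncio.Event().wait()  # keep running until the process is killed
    finally:
        await asyncio.gather(*(n.close() for n in nannies))

asyncio.run(main())
```
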

@mrocklin
Member

FWIW I have doubts about the benefit of faster intramachine network communication being significant. I would be curious to hear perspectives on this comment: https://github.com/coiled/product/issues/7#issuecomment-1328076454

I'm curious about GIL-boundedness. I might suggest also looking at this problem in the small with a tool like https://github.com/jcrist/ptime and playing with data types. My hypothesis is that we can make this problem go away somewhat by increasingly relying on Arrow data types in pandas. I don't know though.
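
To make that concrete, here is a rough sketch of such an in-the-small experiment in an IPython session using the %ptime magic; the column names, sizes, and dtypes are made up for illustration, and "string[pyarrow]" requires pyarrow to be installed.

```python
# %ptime runs the same statement concurrently in several threads and reports the
# speedup over serial execution, a quick proxy for how often the GIL is released.
%load_ext ptime

import numpy as np
import pandas as pd

n = 10_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n).astype(str),  # object-dtype string keys
    "value": np.random.random(n),
})
df_arrow = df.assign(key=df.key.astype("string[pyarrow]"))  # Arrow-backed string keys

%ptime df.groupby("key").value.mean()        # expect poor scaling (GIL-bound)
%ptime df_arrow.groupby("key").value.mean()  # expect much better scaling
```
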

@gjoseph92
Contributor

Why don't we just profile it with py-spy and see how much GIL contention there is (#551 (comment))? You just run pytest --pyspy. Then we can quickly assess whether GIL contention is the issue or not, and narrow our search.
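
For an arbitrary cluster outside snakebench, a manual alternative is to look up the worker PIDs and attach py-spy to each process. A rough sketch, assuming py-spy is installed where the workers run and that the script runs on the same host (e.g. a LocalCluster); on EC2 you'd run the py-spy command over SSH instead, and py-spy may need elevated permissions.

```python
# Sketch: attach py-spy to every worker process of an existing cluster.
import os
import subprocess
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address
pids = client.run(os.getpid)             # {worker address: pid}

for addr, pid in pids.items():
    port = addr.rsplit(":", 1)[-1]
    subprocess.Popen([
        "py-spy", "record",
        "--pid", str(pid),
        "--native",              # include native numpy/pandas frames
        "--duration", "60",      # seconds
        "--output", f"worker-{port}.svg",
    ])
```
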

@gjoseph92
Contributor

I ran the tests under py-spy for 2-vcpu and 4-vcpu workers.

Benchmark results: https://observablehq.com/@gjoseph92/snakebench?commits=3b8fe8e&commits=6c9b5e1&measure=duration&groupby=branch&show=passed

It's only 1 repeat and some workers were running py-spy, so I wouldn't put too much weight on the results, but we are seeing most tests run 50% slower, and lots 100-200% slower, when trading workers for threads.

The py-spy profiles are GitHub artifacts which can be downloaded here:

@dchudz
Collaborator

dchudz commented Nov 29, 2022 via email

@gjoseph92
Contributor

If you go to the benchmark results on observable and click through to the commits being compared, you can see the state of everything. But this is the only difference between the two cases gjoseph92/snakebench@6c9b5e1: switching from 10 m6i.large to 5 m6i.xlarge.

The primary point of this was to record py-spy profiles, which snakebench makes it easy to do.

@dchudz
Collaborator

dchudz commented Nov 29, 2022

OK. Thanks. So the lesson is... GIL contention is a common problem and you should frequently use more small workers instead of fewer big ones?

And that big worker instances with more worker processes could be nice but as far as we know now Matt's right (in https://github.com/coiled/product/issues/7#issuecomment-1328076454) that it's not a big deal, and you're basically as well off with many small workers as you would be if Coiled implemented separate worker processes on an instance? (This experiment isn't really meant to address Matt's point there, right?)

@gjoseph92
Contributor

> This experiment isn't really meant to address Matt's point there, right?

No, it's not. It's intended to get a little more of a sense of the "why". I'm thinking, though, that it might also inform how much we want to work on https://github.com/coiled/product/issues/7. For instance, if profiling indicates "there are a couple of obvious things we could change in dask or upstream to alleviate this", then multiple workers per machine on Coiled becomes less important. If it indicates "this is a pervasive issue with the GIL that we're unlikely to solve easily", then we just have to answer:

  1. are there reasons you'd want the same total # of vCPUs in fewer, bigger machines, and how important are they?
  2. how would we communicate to Coiled users that bigger machines generally have worse performance?

@gjoseph92
Contributor

I'm not going to do extensive analysis on these (though I hope someone else does), but here are a few interesting things I see:

  • H2O benchmarks are heavy on parquet IO. They're slower, likely because of the convoy effect with fsspec. There's plenty of discussion of this in dask/dask#9619 ("Read_parquet is slower than expected with S3").

  • H2O q8 was hilariously 850% slower in my benchmark run (amplified by py-spy, certainly). Profiling shows 84% (!!!) of the event loop's time blocked by the GIL, up from 32%. fsspec's event loop is GIL-blocked 64%, up from 42%.

    Interestingly, GIL contention in worker threads is highly imbalanced. In both cases, the first worker thread is GIL-bound 30-40%, and idle about half the time. The other threads are GIL-bound 60-70% of the time, and not very idle.

  • test_sum_residuals reads zarr with fsspec and does a simple computation. It should be IO-bound. It's also slower with 4 CPUs. The fsspecIO event loop is GIL-bound by the convoy effect 35-40% of the time in both cases. The main event loop is GIL-bound 30% instead of 17%.

  • test_anom_mean is a bit slower. I don't see much evidence of GIL contention.

    The main thing I notice is that in both cases, the worker threads are idle 50-60% of the time. The event loop is also very busy: only 10% idle with 4 CPUs. It feels like the worker threads are starved for work. This workload requires a bit of transfer, and as usual, we see enormous amounts of time (~50%) where the event loop is blocked doing "non-blocking" IO. So if the asyncio networking stack is a bottleneck, maybe it makes sense that with more threads per worker, but still just one networking stack, it's less able to keep the 4 threads fed? There's probably a better explanation here.

Some thoughts:

  • Something the most-impacted tests have in common is fsspec. I'm concerned by the fact that fsspec has its own event loop in a separate thread, instead of using the worker's event loop. As I understand, having multiple asyncio event loops in different threads is usually not a good idea. I have no way to know, but I'm curious if resolving this would help performance.
  • GIL contention is definitely higher, but I'm not sure it's 2x higher with 2x the threads.
  • Significant idleness in worker threads makes me wonder if the worker event loop and data transfer are also a bottleneck. If the network stack were a bottleneck, then maybe it follows that with fewer workers, you're trying to push more bytes of data through fewer bottlenecks. (With more workers, the total bytes that each worker needs to transfer should be lower, i.e. you get more parallelism in networking.)
  • I'm running another job only recording Python frames which hold the GIL (via py-spy's --gil flag). Right now, we just see things that are slowed down trying to retake the GIL (like coming out of NumPy code, etc.). This should give a clearer picture of any hotspots that are holding the GIL a lot.

@gjoseph92
Contributor

@mrocklin
Member

Here is a small video measuring the parallel performance of groupby aggregations on a 64-(virtual)-core machine with an increasing number of cores (using ptime) and different data types. I use groupby-aggregations because I feel they're an OK stand-in for a "moderately complex pandas workflow" where the GIL is concerned.

https://drive.google.com/file/d/14CoYhheEKo9KXvOesE9olO_VlsRJwINM/view?usp=share_link

(I'll ask @mrdanjlund77 to make it pretty and put it on youtube for future viewing)

Basic results are as follows:

  • grouping on object dtype: 2-3x speedup
  • grouping on integer dtype: 10x speedup
  • grouping on arrow string dtype: 16x speedup

Now of course, as has been mentioned, the queries that we're looking at aren't really about groupby-aggregation performance; many of them are almost entirely parquet performance. The experiment I ran here wasn't really that representative of these benchmarks (but neither are the benchmarks necessarily representative of a wide range of situations). Hopefully, however, this experiment gives some (but not enough) evidence that we can use significantly larger machines (16 cores seems pretty safe) if we're using good data types.

@mrocklin
Member

Cleaner video: https://www.youtube.com/watch?v=EGgKeZ7_LxQ&ab_channel=Coiled
