
control de/ser offload #3793

Merged: 6 commits merged into dask:master from the not_offload branch on May 25, 2020

Conversation

@martindurant (Member) commented May 12, 2020

Fixes #3776

(posting some scratch code here, to pick up later)

@martindurant (Member, Author) commented

@mrocklin, how would I benchmark this to show whether it makes any difference?

@mrocklin (Member) commented

If you follow the link from the original issue, it points to a comment by @Kobzol. I recommend asking him.

@Kobzol (Contributor) commented May 14, 2020

@martindurant I can run a series of benchmarks with and without this change if you want. However, in my previous benchmarks it was the offloading in to_frames, not from_frames, that showed up as 10-15% of runtime in profiles.
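For context, here is a minimal sketch of the offload pattern being discussed, written against a generic dumps/sizeof pair; it is not distributed's actual implementation, and the threshold and helper names are made up. The detail that matters for this thread is that allowing offload requires estimating the message size first, so disabling offload also skips that sizeof call.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Illustrative names only; the real logic lives in distributed/comm/utils.py.
_offload_executor = ThreadPoolExecutor(max_workers=1)
OFFLOAD_THRESHOLD = 10_000_000  # bytes; hypothetical cutoff


async def serialize_for_comm(msg, dumps, sizeof, allow_offload=True):
    """Serialize ``msg`` with ``dumps``, possibly in a worker thread."""
    if allow_offload and sizeof(msg) > OFFLOAD_THRESHOLD:
        # Large message: hand the CPU-heavy serialization to the executor so
        # the event loop stays responsive.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(_offload_executor, dumps, msg)
    # Small message, or offload disabled (e.g. on the scheduler): serialize
    # inline; note that the sizeof() estimate is skipped entirely in that case.
    return dumps(msg)
```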

@martindurant (Member, Author) commented

@Kobzol, oh! I can do the same thing at both ends, and then I'll ping you to see if it was helpful, thanks.

martindurant changed the title from "WIP: control deser offload" to "WIP: control de/ser offload" on May 14, 2020
@martindurant (Member, Author) commented

@Kobzol

@Kobzol (Contributor) commented May 18, 2020

I ran a few benchmarks. sizeof has mostly, but not completely, disappeared from the trace, because it's used in more places in the scheduler, while you have disabled it only for the main RPC connection. The scheduler uses multiple RPC connections (for example, for clients downloading data), so I think it should be disabled for all RPC connections originating on the scheduler. I'll try to run the benchmark again with allow_offload=False everywhere.

From my current results I have not seen any perf. improvements at all, so it's possible that this change will not have a large effect on the overall pipeline runtime. This is something we have observed quite often with our Rust scheduler implementation: sometimes even a big (local) win does not translate to faster pipelines if it's not on the overall hot path. But even if we found no perf. gains, I still think this change should be merged, as it's simply wasteful to calculate the size on the scheduler.

Hopefully in the future there will be a standardized benchmark suite that will allow us to test these changes easily.

@Kobzol (Contributor) commented May 18, 2020

FWIW, I got some wins when I turned off offloading everywhere (even on clients and workers):
[benchmark results image]

I would be interested in whether these perf. gains can be replicated when offloading is disabled only on the scheduler (but on all of its comms).

@martindurant (Member, Author) commented

Which comms are not avoiding offload on the scheduler?

@Kobzol (Contributor) commented May 20, 2020

The scheduler has an rpc attribute, which is a connection pool that handles out-of-band messages like gather. But it also creates separate connections for each connected client and worker. These connections handle streaming messages (compute-task, task-finished, etc.). It's the difference between self.handlers and worker_handlers/client_handlers in the scheduler.

When I inspect the comm variable in Scheduler::add_worker, it has allow_offload=True. And when I add print(allow_offload) to the to_frames method, it prints mostly False in the scheduler process. That's because most of the messages do not actually go through the Scheduler's rpc comm. But I might be wrong :)
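A sketch of the kind of instrumentation described above, assuming to_frames accepts an allow_offload keyword (as it does with this PR); modules that import the function directly may not see the wrapper, so treat it purely as illustrative.

```python
import distributed.comm.utils as comm_utils

_original_to_frames = comm_utils.to_frames


async def traced_to_frames(*args, allow_offload=True, **kwargs):
    # Report whether this serialization call is permitted to leave the event loop.
    print("to_frames allow_offload =", allow_offload)
    return await _original_to_frames(*args, allow_offload=allow_offload, **kwargs)


comm_utils.to_frames = traced_to_frames
```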

@mrocklin (Member) commented

Almost all messages into or out of a Dask Server pass through one of two places:

  1. Server.rpc, which holds all of the connections that this server initiates
  2. Server.listener, which is created in core.py::Server.listen and handles all of the inbound connections

There are a few other places from time to time, but we have tried hard to centralize connection handling in these two locations.
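A toy sketch of how both of those entry points could be covered, assuming (as this PR does) that comms and the rpc pool expose an allow_offload attribute consulted by to_frames/from_frames; the names loosely mirror distributed/core.py, but this is not the actual diff.

```python
class ToyScheduler:
    def __init__(self, connection_pool, listener):
        # 1. Server.rpc: connections this server initiates.  Flagging the pool
        #    means every comm drawn from it keeps serialization on the event loop.
        self.rpc = connection_pool
        self.rpc.allow_offload = False
        # 2. Server.listener: accepts inbound connections and hands each comm
        #    to handle_comm, where the same flag is applied per comm.
        self.listener = listener

    async def handle_comm(self, comm):
        comm.allow_offload = False  # skip the sizeof()/thread hop for this comm
        await self.serve(comm)

    async def serve(self, comm):
        ...  # dispatch to handlers / stream handlers
```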

@mrocklin (Member) commented

The changes here seem sensible to me. It feels awkward these days to modify an object rather than pass in a parameter, but I suspect that this is nicer than threading the allow_offload parameter through the superclasses, so it makes sense.

I'm +1. It would be good to get feedback from @Kobzol first, though, if he has time to run this through the benchmarks again.

@Kobzol (Contributor) commented May 22, 2020

I ran the benchmarks again and I got the same perf. gains when offloading was disabled only at the scheduler: around a 10-20% overall runtime improvement for my numpy and pandas benchmarks.

Seems to me like a net win, as it was showing up in profiles before on a very hot path and now it's completely gone.

[benchmark results image]

@martindurant (Member, Author) commented

@quasiben, do you want to give this a quick once-over, to see if this might affect UCX at all? (I don't think so.)

martindurant changed the title from "WIP: control de/ser offload" to "control de/ser offload" on May 22, 2020
@martindurant (Member, Author) commented

OK merging.

I would like to mention that distributed.utils.nbytes is needlessly called multiple times in a couple of places. That could be eliminated at the cost of slightly longer code, but I won't do it now unless it is shown to affect performance; the function itself is very short and simple.

martindurant merged commit 0ec78f8 into dask:master on May 25, 2020
martindurant deleted the not_offload branch on May 25, 2020, 17:32
@quasiben (Member) commented

Sorry for missing this, @martindurant. FWIW, the changes should not affect the UCX comms.

@mrocklin (Member) commented

Thanks for leading this, @martindurant!

I would like to mention that distributed.utils.nbytes is needlessly called multiple times in a couple of places. That could be eliminated at the cost of slightly longer code, but I won't do it now unless it is shown to affect performance; the function itself is very short and simple.

This should be fine, I think. The nbytes function is much faster than the sizeof function; nbytes is only called on bytes/memoryview/bytearray-type objects.
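For reference, a sketch of why nbytes is cheap, written to match the behaviour described above rather than guaranteed to be identical to distributed.utils.nbytes: it only ever sees bytes-like frames, so the size is read in constant time instead of recursively walking Python objects the way sizeof does.

```python
def nbytes(frame) -> int:
    # Frames are bytes-like by the time this is called, so len() answers directly.
    if isinstance(frame, (bytes, bytearray)):
        return len(frame)
    # memoryview (and other buffer objects) expose the byte count as .nbytes.
    return frame.nbytes
```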


Linked issue: #3776 (Don't offload serialization in scheduler)
4 participants