
Simple way to retry, solves #4698 #5546

Open · wants to merge 1 commit into base: main

Conversation

@haf commented Nov 25, 2021

@GPUtester (Collaborator)

Can one of the admins verify this patch?

@quasiben (Member)

Add to allowlist

@haf (Author) commented Nov 27, 2021

Failures

[screenshot: CI test failures]

The most relevant bit seems to be test_gather_dep_one_worker_always_busy:

[screenshot: traceback for test_gather_dep_one_worker_always_busy]

I looked into alternatives, like here:

await asyncio.sleep(0.100 * 1.5 ** self.repetitively_busy)

But it seems that logic like this (a counter on the class instance) never resets the retry counter, so after at least one round of retries the system keeps the longer backoff and runs slower from then on.
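To illustrate the point (the names and structure below are only a sketch, not the actual Worker attributes): unless the counter is cleared once a fetch succeeds, every later fetch inherits the accumulated delay.

import asyncio

class Fetcher:
    # Illustrative only: a per-instance backoff counter for busy peers.
    def __init__(self):
        self.repetitively_busy = 0  # grows while the peer keeps reporting "busy"

    async def fetch(self, get_data):
        # get_data is assumed to be a coroutine returning {"status": "busy"} or a result
        response = await get_data()
        if response["status"] == "busy":
            # exponential backoff: 100 ms, 150 ms, 225 ms, ...
            await asyncio.sleep(0.100 * 1.5 ** self.repetitively_busy)
            self.repetitively_busy += 1
            return None
        # without this reset, every later fetch keeps the accumulated delay
        self.repetitively_busy = 0
        return response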

The reason I made the change this way is that this function is already wrapped in a retry, and callers implicitly assume that the returned value always contains data.

test_avoid_oversubscription is also relevant since it triggers this — perhaps the "right" way to solve this is something else?

@fjetter (Member) commented Nov 29, 2021

Thanks for your patch @haf. I believe the fix will require a bit more work, though. While at first glance the retry may look like a viable option, it doesn't quite fit this use case. By default, not even a single retry is allowed (see the configuration distributed.comm.retry), and we should be robust to busy workers regardless of the configured retry values.
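(For context, and as an illustration only: retries are driven by the distributed.comm.retry configuration section; the count sub-key and value below are my reading of the config and may differ across versions.)

import dask

# Assumed key under distributed.comm.retry; the count defaults to 0,
# i.e. a failed comm operation is not retried at all.
dask.config.set({"distributed.comm.retry.count": 3})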

We typically retry on errors, mostly network errors, and need to limit the maximum number of retries to keep the cluster operating stably. Busyness, however, is not something we can limit in that way, since a worker may legitimately appear busy for an extended period of time; we therefore typically allow infinite "busy retries". The logic around the repetitively busy retry count is indeed not perfect and likely slightly buggy, in the sense that it actually retries too often.

The reason why distributed.worker.get_data_from_worker does not implement the busy retry itself is that its primary caller, Worker.gather_dep, has rather sophisticated, application-specific handling for these situations (try fetching from a different worker or escalate to the scheduler) that does not necessarily fit the use case of other callers. Implementing a retry at that level would be very difficult since you would need to accommodate all possible callers.

distributed.utils_comm.get_data_from_worker (we should change the name to avoid confusion) behaves a bit differently and is typically used in a different context. It might be an option to consolidate the two functions into one, but I expect this to be a grander effort. Instead, I would implement a retry mechanism in utils_comm.get_data_from_worker directly: if another worker holding the data is available, try it; otherwise retry this specific worker again until successful. For this retry I would simply introduce a local backoff. Is this something you would like to follow up on?
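A rough sketch of that local backoff (the function name, signature, and response shape below are simplified for illustration and are not the actual distributed.utils_comm API):

import asyncio
import random

async def get_data_with_backoff(workers, keys, fetch, max_delay=5.0):
    # Rotate through candidate workers; back off while every candidate
    # reports itself as busy. fetch(worker, keys) is assumed to be a
    # coroutine returning a dict with a "status" of "OK" or "busy".
    delay = 0.05
    while True:
        for worker in workers:
            response = await fetch(worker, keys)
            if response["status"] == "OK":
                return response
        # every candidate was busy: wait with jitter, then try the list again
        await asyncio.sleep(delay * (1 + random.random()))
        delay = min(delay * 1.5, max_delay)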

@haf (Author) commented Nov 29, 2021

Yes, I rather presumed this would not be a good solution, but I figured I'd get the party started; I'm not the right person to solve the bug. I didn't understand the difference between the two get_data_from_worker functions in their different namespaces, and, as you highlight, I also don't know what "layer" each method sits at or what expectations it has on its callers.

Perhaps, though, with the bug now isolated, there's an opportunity to get it fixed? :)


Successfully merging this pull request may close these issues:

Client/Scheduler gather not robust to busy worker