
Simple way to retry, solves #4698 #5546

Open · wants to merge 1 commit into base: main

Conversation

@haf commented Nov 25, 2021

@GPUtester (Collaborator)

Can one of the admins verify this patch?

@quasiben (Member)

Add to allowlist

@haf (Author) commented Nov 27, 2021

Failures

[screenshot: CI test failures]

The most relevant bit seems to be test_gather_dep_one_worker_always_busy:

[screenshot: traceback for test_gather_dep_one_worker_always_busy]

I looked into alternatives, like here:

await asyncio.sleep(0.100 * 1.5 ** self.repetitively_busy)

But it seems that logic like this (a counter on the class instance) never resets the retry counter, so after at least one round of retries the system keeps the longer backoff and runs slower from then on.
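To illustrate the point (the names and structure below are only a sketch, not the actual Worker attributes): unless the counter is cleared once a fetch succeeds, every later fetch inherits the accumulated delay.

import asyncio

class Fetcher:
    # Illustrative only: a per-instance backoff counter for busy peers.
    def __init__(self):
        self.repetitively_busy = 0  # grows while the peer keeps reporting "busy"

    async def fetch(self, get_data):
        # get_data is assumed to be a coroutine returning {"status": "busy"} or a result
        response = await get_data()
        if response["status"] == "busy":
            # exponential backoff: 100 ms, 150 ms, 225 ms, ...
            await asyncio.sleep(0.100 * 1.5 ** self.repetitively_busy)
            self.repetitively_busy += 1
            return None
        # without this reset, every later fetch keeps the accumulated delay
        self.repetitively_busy = 0
        return response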

The reason I made the change this way is that this function is already wrapped in a retry, and callers implicitly assume that the returned value always contains data.

test_avoid_oversubscription is also relevant since it triggers this — perhaps the "right" way to solve this is something else?

@fjetter (Member) commented Nov 29, 2021

Thanks for your patch @haf. I believe the fix will require a bit more work, though. While at first glance the retry may look like a viable option, it doesn't quite fit this use case. By default, not even a single retry is allowed (see the configuration distributed.comm.retry), and we should be robust to busy workers regardless of the configured retry values.
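(For context, and as an illustration only: retries are driven by the distributed.comm.retry configuration section; the count sub-key and value below are my reading of the config and may differ across versions.)

import dask

# Assumed key under distributed.comm.retry; the count defaults to 0,
# i.e. a failed comm operation is not retried at all.
dask.config.set({"distributed.comm.retry.count": 3})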

We typically retry on errors, mostly network errors, and need to limit the maximum number of retries to keep the cluster operating stably. Busyness, however, is not something we can limit in that way, since a worker may legitimately appear busy for an extended period of time; we therefore typically allow infinite "busy retries". The logic around the repetitively busy retry count is indeed not perfect and likely slightly buggy, in the sense that it actually retries too often.

The reason why distributed.worker.get_data_from_worker does not implement the busy retry itself is that its primary caller, Worker.gather_dep, has rather sophisticated, application-specific handling for these situations (try fetching from a different worker or escalate to the scheduler) that does not necessarily fit the use case of other callers. Implementing a retry at that level would be very difficult since you would need to accommodate all possible callers.

distributed.utils_comm.get_data_from_worker (we should change the name to avoid confusion) behaves a bit differently and is typically used in a different context. It might be an option to consolidate the two functions into one, but I expect this to be a grander effort. Instead, I would implement a retry mechanism in utils_comm.get_data_from_worker directly: if another worker holding the data is available, try it; otherwise retry this specific worker again until successful. For this retry I would simply introduce a local backoff. Is this something you would like to follow up on?
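A rough sketch of that local backoff (the function name, signature, and response shape below are simplified for illustration and are not the actual distributed.utils_comm API):

import asyncio
import random

async def get_data_with_backoff(workers, keys, fetch, max_delay=5.0):
    # Rotate through candidate workers; back off while every candidate
    # reports itself as busy. fetch(worker, keys) is assumed to be a
    # coroutine returning a dict with a "status" of "OK" or "busy".
    delay = 0.05
    while True:
        for worker in workers:
            response = await fetch(worker, keys)
            if response["status"] == "OK":
                return response
        # every candidate was busy: wait with jitter, then try the list again
        await asyncio.sleep(delay * (1 + random.random()))
        delay = min(delay * 1.5, max_delay)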

@haf (Author) commented Nov 29, 2021

Yes, I rather presumed this would not be a good solution, but I figured I'd get the party started; I'm not the right person to solve the bug. I didn't understand the difference between the two get_data_from_worker functions in their different namespaces, and, as you highlight, I also don't know what "layer" each method sits at or what expectations it has on its callers.

Perhaps, though, with the bug now isolated, there's an opportunity to get it fixed? :)


Successfully merging this pull request may close these issues:

Client/Scheduler gather not robust to busy worker