Simple way to retry, solves #4698 #5546
Failures
The most relevant bit seems to be
I looked into alternatives, like here: distributed/distributed/worker.py Line 2998 in 60cb52f
But it would seem that logic like this (on the class instance) never resets the retry count and will therefore make the system run slower after at least one round of retries. I made the change this way because this function is wrapped in a retry, and there's an implicit assumption by callers that there always exists a
test_avoid_oversubscription is also relevant since it triggers this. Perhaps the "right" way to solve this is something else?
Thanks for your patch, @haf. I believe the fix will require a bit more work, though. While at first glance the retry may look like a viable option, it doesn't quite fit this use case. By default, not even a single retry is allowed (see configuration
We typically retry on errors, mostly network errors, and need to limit the maximum number of retries to keep the cluster operating stably. Busyness, however, is not something we can limit, since a worker may appear busy for an extended period of time. Typically, we allow infinite "busy retries". The logic around the repeatedly-busy retry count is indeed not perfect and is likely slightly buggy, in the sense that it actually retries too often. The reason why
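To illustrate the distinction drawn above between error retries and busy retries, here is a minimal sketch. It is not the code in distributed/worker.py: `WorkerBusy`, `fetch_with_retries`, and all parameters are hypothetical names chosen for the example. It assumes that network errors surface as `OSError`, that busyness is signalled by a dedicated exception, and that error retries default to zero as described in the comment above. The busy counter is a local variable, so it starts from zero on every call instead of living on a long-lived instance.

```python
import asyncio


class WorkerBusy(Exception):
    """Hypothetical signal: the peer is reachable but too busy to answer."""


async def fetch_with_retries(
    fetch,
    *,
    max_error_retries=0,  # assumption: no error retries unless configured
    base_delay=0.01,
    max_delay=1.0,
    sleep=asyncio.sleep,
):
    """Call `fetch()` until it succeeds.

    Busy responses are retried without limit, with capped exponential
    backoff. Network errors (OSError) are retried at most
    `max_error_retries` times and then re-raised.
    """
    error_retries = 0
    busy_count = 0  # local to this call, so it resets for the next call
    while True:
        try:
            return await fetch()
        except WorkerBusy:
            # Unbounded: a worker may stay busy for a long time.
            busy_count += 1
            delay = base_delay * 2 ** min(busy_count, 10)
            await sleep(min(max_delay, delay))
        except OSError:
            # Bounded: give up once the configured limit is reached.
            if error_retries >= max_error_retries:
                raise
            error_retries += 1
            await sleep(base_delay)
```

Keeping the two counters separate means a long busy period never consumes the error budget, and a failed connection is never retried more often than the configured limit allows.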
Yes, I kind of presumed this would not be a good solution, but I figured I'd get the party started, even though I'm not the right person to solve the bug. I didn't understand what the difference was between
Perhaps, though, with the bug isolated there might be an opportunity to get it fixed? :)
pre-commit run --all-files