Client/Scheduler gather not robust to busy worker #4698
Comments
This was actually really easy to read the source code for, due to the high quality of the source code in this project. Here's the bug: https://github.com/dask/distributed/blob/main/distributed/utils_comm.py#L63-L88 It creates a coroutine for comms via send/recv to the worker, opening a comms channel. It uses this operation: distributed/distributed/worker.py Lines 4186 to 4195 in 60cb52f
Which creates an RPC request called "get_data" and always returns the response, whether or not it is successful (although it crashes if the protocol is broken). Finding the invoked worker method was then rather easy: distributed/distributed/worker.py Line 1670 in 60cb52f
Now it's just a question of seeing whether it ever returns a message without "data" as a property, which it does in this if statement: distributed/distributed/worker.py Lines 1692 to 1705 in 60cb52f
Which actually does log at debug level that the worker has too many open connections to answer the request right now. So what needs to be done is to update the line that crashes to check whether |
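To illustrate the kind of check described above, here is a minimal sketch of a caller that retries when the worker answers with `status='busy'` instead of reading `"data"` unconditionally. It is not the actual `distributed` code: `gather_with_retry`, `FakeWorker`, the `"OK"` status value, and the retry and backoff parameters are all hypothetical names and assumptions made for the example.

```python
import asyncio


class FakeWorker:
    """Hypothetical stand-in for a worker; not part of distributed."""

    def __init__(self, busy_replies):
        self.busy_replies = busy_replies  # how many times to answer "busy"
        self.calls = 0

    async def get_data(self, keys):
        self.calls += 1
        if self.calls <= self.busy_replies:
            # Same shape as the reply reported in the issue: no "data" key.
            return {"status": "busy"}
        # Assumed shape of a successful reply for this sketch.
        return {"status": "OK", "data": {key: key.upper() for key in keys}}


async def gather_with_retry(get_data, keys, retries=5, delay=0.01):
    """Hypothetical helper: retry while the worker reports it is busy."""
    for attempt in range(retries):
        response = await get_data(keys)
        if response.get("status") == "busy":
            # Back off exponentially before asking again.
            await asyncio.sleep(delay * 2 ** attempt)
            continue
        return response["data"]
    raise RuntimeError(f"worker still busy after {retries} attempts")


worker = FakeWorker(busy_replies=2)
result = asyncio.run(gather_with_retry(worker.get_data, ["x", "y"]))
```

Here the fake worker answers busy twice and succeeds on the third call, so the caller gets its data rather than a `KeyError`. Whether a real fix should retry the same worker, try another one, or report the key as missing is a design decision for the maintainers.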
When I run a lot of tasks on the CNES HPC with a big Dask cluster (512 threads/128 workers), I sometimes get communication errors between the scheduler and the workers. The error is thrown by the module "distributed/utils_comm.py" because the code tries to read the key 'data', which does not exist in the dictionary.
I modified the module code, to display the content of this dictionary when this error is thrown. I saw that in this context, the dictionary just contains the following data:
dict(status='busy')
I cannot provide reproducible code because the software used is not public and I have not found a simple way to reproduce this problem. I can run other tests if needed to give you further information.
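Although the original workload cannot be shared, the failure mode itself can be shown in isolation. The sketch below assumes only what is reported above: the reply is `dict(status='busy')` and the reading code indexes it with `'data'`. The function `unpack_response` is a hypothetical stand-in, not the code in `distributed/utils_comm.py`.

```python
def unpack_response(response):
    # Hypothetical stand-in for code that assumes every reply carries "data".
    return response["data"]


busy_reply = dict(status="busy")  # the dictionary observed in the issue

try:
    unpack_response(busy_reply)
    missing_key = None
except KeyError as exc:
    # The lookup fails because a busy reply has no "data" entry.
    missing_key = exc.args[0]
```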
Environment: