Deadlock after few workers take all tasks #5635
I assume these are the five workers which have so much data spilled on the dashboard? (See bytes stored by worker)
Anything interesting happening in this function? E.g. is there native code running which may not release the GIL?
That strongly indicates another state machine problem.
I'm not sure how to better help diagnose the state machine issues. I find most problems with autoscaling clusters. I could set the cluster size to very high numbers, but the workloads vary a lot and this would be wasteful. I already waste a lot because my autoscale lower bound needs to be high enough to prevent this.
These timeouts are not unexpected in such an auto-scaling situation. The latest release (2022.01.0) includes a fix to the cluster dump function so that it does not fail if there are timeouts to individual workers.
I just encountered a new mode, where the dashboard says 3 tasks are processing but the graph in the lower left shows 0 (and the Info page shows no workers with anything in "processing"). I typically have to delete a pod to unstick things, but this time I don't know which pod! I searched the logs for "overlap-getitem", since that is the most upstream task prefix that is wedged. In the scheduler logs, I see 12 lines in succession (always the same key, but several different workers):
I tried deleting the workers mentioned, some of which are still around, but that did not bring back processing of the missing keys. About 25 seconds before those log entries, I see:
Some 50 minutes later:
That key is seen in the logs as a worker being retired just a few seconds prior to the "Unexpected worker completed task" messages. The same key is re-registered in the middle of those "Unexpected worker completed task" messages. The cluster dump would not upload to github, so I put it here: https://drive.google.com/file/d/1okQ6M8M8smP0oBhvlrdbQLJJbWp7_32U/view?usp=sharing
We have recently started seeing this behavior as well, currently on dask+distributed 2022.1.0 but I think it's been happening for a couple of releases. Same pattern as @chrisroat:
@crusaderky @fjetter any chance #5665 could be related given that this worker was left for dead but then kept going...?
@bnaul #5665 mitigates, but does not fix, the issue. I can see from your dashboard that 4 workers are paused with extremely high amounts of spilled memory. I currently have two changes in my pipeline:
While you wait for those, I would suggest you try to figure out why you have 180 GiB worth of managed data per worker on 4 workers while the rest of the cluster is relatively free. Do those workers contain replicas of data that already exists elsewhere? In that case, the new Active Memory Manager could help you: https://coiled.io/blog/introducing-the-dask-active-memory-manager/ You can also try running rebalance() periodically. Word of warning - the method is not safe to run on a busy scheduler. Making it safe and automatic is also in my pipeline.
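For anyone who wants to try the periodic rebalance idea, here is a minimal stdlib sketch of a background loop. The client object, the interval, and the error handling are assumptions on my part, and per the warning above this is best-effort only:

```python
import threading

def periodic_rebalance(client, interval_s=60.0, stop_event=None):
    """Call client.rebalance() every interval_s seconds until stop_event is set.

    NOTE: rebalance() is not safe on a busy scheduler, so failures are
    swallowed here and we simply wait for the next tick.
    """
    stop_event = stop_event or threading.Event()

    def loop():
        while not stop_event.wait(interval_s):
            try:
                client.rebalance()
            except Exception:
                pass  # best-effort: a busy scheduler may reject the call

    threading.Thread(target=loop, daemon=True).start()
    return stop_event  # call .set() to stop the loop
```

Stopping the loop is just a matter of calling .set() on the returned event.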
I see this consolidation of tasks onto a few workers when I use auto-scaling and spot instances. I think the addition/deletion of workers is throwing a curveball to the scheduler. I tried to repro in #4471 (no deadlock -- just bad scheduling), where you see a small number of workers get all the tasks as the cluster scales up. The graph above is a map_overlap on a 3d-chunked array with 2×17×20 = 680 chunks, reading from and writing to the zarr format. This leads to 82k tasks(!). I notice in an issue you link that one can adjust fuse parameters. Is that worth trying -- could it reduce the number of tasks and perhaps lower the rate of wonky scheduling & deadlocks? FWIW, I did attempt the AMM at one point. If I set the env var DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__START to "true", it turns on -- do I need additional settings?
Do you mean that the 4 workers that I see saturated in the dashboard are the initial ones and the rest of the cluster only goes online when they are already very full?
fuse parameters can reduce the number of tasks, which in turn translates into a reduction in CPU load on the scheduler. If your scheduler CPU is frequently near 100%, this should in turn translate into a reduction in random timeouts and disconnects. I don't think this is your case - unless you're frequently seeing them in the logs.
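For anyone wanting to experiment with fusion, the relevant knobs live under optimization.fuse in the dask config; a sketch (the ave-width value is an illustrative guess, not a recommendation):

```yaml
# e.g. in ~/.config/dask/dask.yaml
optimization:
  fuse:
    active: true    # enable task fusion
    ave-width: 4    # average width of fused subgraphs; tune per workload
```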
Should be enough.
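For reference, the env var mentioned above maps onto the dask config keys below; a YAML sketch (the interval value is illustrative):

```yaml
distributed:
  scheduler:
    active-memory-manager:
      start: true    # same effect as DASK_DISTRIBUTED__SCHEDULER__ACTIVE_MEMORY_MANAGER__START=true
      interval: 2s   # how often the policies run; illustrative value
```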
I don't think so. In the issue I referenced (#4471), I scale from 1 to 4 to 20 workers. It looks like 6 workers get the bulk of the load, and a few others get some load.
I do see a lot of comm errors in the logs.
I could try that on some future run. I do see tons of GIL errors, which I guess must be the read/write to disk that is happening. It does seem like distributing the work better would mitigate things.
FYI I recently merged a PR which might help in this situation. At least the failure scenario I fixed there is likely to happen in scaling situations, particularly if workers are removed: #5786
The "few workers taking all the tasks" thing is explainable and fixable by #6115. This doesn't explain things fully halting though, unless possibly you've run out of disk space (is this possible?)
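On the out-of-disk question, here is a quick stdlib-only check one could run on a worker node; the path is an assumption (dask workers spill under their local directory, which varies by deployment):

```python
import shutil

def free_gib(path="/tmp"):
    """Return free space in GiB at `path` (e.g. the worker's spill directory)."""
    return shutil.disk_usage(path).free / 2**30

print(f"{free_gib():.1f} GiB free")
```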
What happened:
I have a dask-gateway cluster on k8s, and a 63k task graph. The processing has essentially stalled. Five workers have 375 tasks each, while the remaining 75 have zero. The scheduler logs have a bunch of timeout errors. The workers with all the tasks have the "unresponsive... long-running ... GIL" errors -- though it looks like the running tasks are stack. I attempted to use dump_cluster_state, but hit a timeout error.

Anything else we need to know?:
After deleting the pods with the 5 wedged workers, the graph completed.
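For reference, spotting this kind of imbalance programmatically is straightforward once you have a worker-to-task-count mapping (e.g. read off the dashboard or from the client's processing() call); the helper below is a generic sketch with made-up worker names, mirroring the 5-busy/75-idle split reported above:

```python
def find_hotspots(task_counts, ratio=10):
    """Return workers holding at least `ratio` times the median task count."""
    counts = sorted(task_counts.values())
    median = counts[len(counts) // 2]
    threshold = max(1, median * ratio)
    return {w: n for w, n in task_counts.items() if n >= threshold}

# Example: 5 saturated workers, 75 idle ones (names are made up)
counts = {f"worker-{i}": 375 for i in range(5)}
counts.update({f"worker-{i}": 0 for i in range(5, 80)})
print(find_hotspots(counts))
```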
Environment:
Cluster Dump State: