Scheduler deadlocks after asynchronous client.who_has call #5144
Comments
Thanks for raising an issue @pfackeldey. Looking at your example,
I should also add that active memory management in Dask is actively being worked on (xref #4982), so this type of replica tracking should be much more transparent in the future.
Thank you for your fast reply @jrbourbeau! The documentation link https://distributed.dask.org/en/latest/client.html#async-await-operation states in the second part:

Thus I expected the above-mentioned code to work. Also, thank you very much for pointing to the ongoing work on active memory management; I'll keep an eye on this! Best, Peter
What happened:
Dear dask-distributed developers,

First of all, thank you for this wonderful project!

We are using dask-distributed for our local HTCondor computing cluster. In our use case we periodically kill and spawn dask-workers in HTCondor jobs, so that jobs from other users can slide in between our computing runs. We also need to work with heavy input, which we distribute to the dask-workers beforehand using client.scatter. Of course we want to replicate this data as soon as new dask-workers are spawned, so we added an asynchronous periodic callback to the client's IOLoop, which takes care of this replication.

Unfortunately, we noticed that the client.who_has(..., asynchronous=True) call deadlocks our scheduler (unfortunately without a stack trace). Any connection to the scheduler then results in a timeout.

What you expected to happen:
We expected to be able to add an asynchronous callback that uses client.who_has(..., asynchronous=True) to the client's IOLoop without deadlocking the scheduler.

Minimal Complete Verifiable Example:
This is a minimal reproducible example which shows the above-mentioned problem. Since it also happens on a LocalCluster, the problem seems to be batch-system-agnostic.

The output (only once!):
Afterwards the scheduler is stuck.
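For context, the replication pattern described above (periodically ask who holds the scattered data, then copy it to workers that are missing it) can be modeled with plain asyncio. This is a hedged sketch only: the names `replicate_to_new_workers`, `FAKE_WHO_HAS`, and `ALL_WORKERS` are illustrative stand-ins for the scheduler state, not dask's API or the author's actual reproducer.

```python
import asyncio

# Stand-in for the key -> set-of-workers mapping that
# client.who_has(..., asynchronous=True) would report.
FAKE_WHO_HAS = {"scattered-input": {"worker-1"}}
# Stand-in for the current worker set (worker-2 just joined).
ALL_WORKERS = {"worker-1", "worker-2"}

async def replicate_to_new_workers(log):
    """One tick of the periodic callback: find workers missing
    the scattered data and 'replicate' to them."""
    for key, holders in FAKE_WHO_HAS.items():
        missing = ALL_WORKERS - holders
        if missing:
            log.append((key, sorted(missing)))
            holders.update(missing)  # pretend the copy succeeded

async def main():
    log = []
    # Mimic a periodic callback on the client's IOLoop:
    # run the check a few times at a fixed interval.
    for _ in range(3):
        await replicate_to_new_workers(log)
        await asyncio.sleep(0.01)
    return log

log = asyncio.run(main())
print(log)  # replication is triggered only once, for worker-2
```

In the real setup the `FAKE_WHO_HAS` lookup is replaced by an awaited `client.who_has(...)` call inside the callback, which is the step the report says deadlocks the scheduler.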
Anything else we need to know?:
-
Environment:
client.get_versions(check=True) does not throw an error and outputs:

Thank you very much in advance for your input and help!
Best, Peter