distributed.lock.Lock deadlocks on worker failure. #2362
Hrm, do any of the normal Python locks have TTLs? I'm curious what happens with file-based locks in Python when a process goes down.

An alternative would be to clean up locks on workers/clients when those workers/clients are removed from the scheduler. When someone acquired a lock they would also send their client or worker ID. When we removed clients or workers we would check with the LockExtension and it would release any lock associated with that client or worker.

It's worth noting that this behavior may not always be desired. I can imagine situations where the right thing to do is to keep the lock acquired even if the worker fails. I think I agree with you, though, that something like this is probably going to be more useful than harmful.
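A rough sketch of what such scheduler-side cleanup could look like, assuming the lock extension kept an `owners` mapping of lock name to worker address recorded at acquire time; that bookkeeping and the `force_release` call are hypothetical, not the current LockExtension API:

```python
from distributed.diagnostics.plugin import SchedulerPlugin


class ReleaseLocksOnWorkerLoss(SchedulerPlugin):
    """Release locks owned by a worker once that worker leaves the cluster.

    Assumes a (hypothetical) ``owners`` mapping of lock name -> worker address,
    recorded by the lock extension at acquire time.
    """

    def __init__(self, lock_extension):
        self.lock_extension = lock_extension

    def remove_worker(self, scheduler=None, worker=None, **kwargs):
        owners = self.lock_extension.owners  # hypothetical bookkeeping
        for name, owner in list(owners.items()):
            if owner == worker:
                # Force-release the abandoned lock so waiters can proceed.
                self.lock_extension.force_release(name)  # hypothetical method
                del owners[name]
```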
There's a fair bit of prior art for TTL-based lock expiration in the Redis Redlock algorithm (and various clients), NFS locking, and DynamoDB-based locks, but I don't believe any of the Python standard library lock implementations have this form of TTL logic. This may be because the standard implementations are intended for single-process systems with Python semantics (i.e. exception propagation with context managers, try/finally, etc...), where errors either allow lock cleanup or take down the entire process. In contrast, we can't rely on any form of worker/client-based graceful failure or cleanup in a distributed context. As far as I can tell, many file-based locks just allow deadlock to occur if a client fails ungracefully, acknowledging that this deadlock can be resolved by removing the abandoned lock file.

The scheduler-level lock-cleanup-on-worker-failure you propose would likely resolve the problem outlined in this issue. This type of "client lifetime" lock expiration semantic likely isn't used in the systems linked above because, unlike dask, they don't have a model of long-lived client<->server connections that can be used to detect lock-holder failures. My concerns with lock-cleanup-on-worker-failure are that (1) it may be slightly more complex to implement and (2) it does not address issues where the deadlock is not associated with worker failure. (1) may just be a product of my relative lack of familiarity with scheduler extensions, and may be simple to implement via plugin-based worker/client state change callbacks. (2) is harder to pin down, but one could easily imagine a scenario where a worker experiences some long-running blocking failure accessing an external resource.

My weakly-held preference would be toward using an expiration TTL, but I would happily defer to a strong preference or argument for the alternative solution. In either case, I do agree that this functionality must be optional and should likely be opt-in behavior.
OK, you've convinced me; I agree that TTL behavior seems sensible. It's also not incompatible with future cleanup-on-worker-failure behavior. Is this something that you would be willing to implement?
Yes, I'm happy to put together an initial pull request for this feature. I'll target having a prototype for review in the next week or so.
@asford I am running into this issue every once in a while (locks that can no longer be acquired because the holding worker failed). Did you have time to submit a PR? I can't find it in distributed. Thanks.
Hi there, and thanks for the great project Dask is! Has anyone had a chance to work on this?
Not to my knowledge. If you want to help out, it would be welcome. Also cc @fjetter, in case the Semaphore implementation (which I think does support TTLs) could be used to support this.
The Semaphore implementation should serve as a drop-in replacement if used with `max_leases=1`. What the semaphore is doing: every lease it grants is refreshed on a regular interval by the holder and is released by the scheduler if the holder stops refreshing it (e.g. because the worker died), with the relevant timeouts configured in distributed/distributed/distributed.yaml (lines 45 to 47 in 24007c2).
Brief docs: https://docs.dask.org/en/latest/futures.html#semaphore
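For reference, a minimal usage sketch of the drop-in pattern; the lease-timeout value below is illustrative, not the shipped default:

```python
import dask
from distributed import Client, Semaphore


def critical_section(x, sem):
    # Only one task at a time may hold the single lease; if the holder's worker
    # dies, the lease expires after the configured timeout instead of deadlocking.
    with sem:
        return x + 1


if __name__ == "__main__":
    # Illustrative timeout; the shipped default lives in distributed.yaml.
    with dask.config.set({"distributed.scheduler.locks.lease-timeout": "30s"}):
        client = Client()
        sem = Semaphore(max_leases=1, name="resource-lock")
        print(client.gather(client.map(critical_section, range(5), sem=sem)))
```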
Ah, we just hit this as well; it may be worth making a note in the docs. When building an HA system, in the case of a benign crash (e.g., GPU OOM) it's reasonable to expect restarting workers to be safe, but if the crash happens while a distributed Lock is held, the lock is never released and the restarted workers deadlock waiting for it.
Use of `distributed.lock.Lock` causes deadlocks in cases where a lock-holding worker is lost and/or fails to release the distributed lock due to an error. This blocks all subsequent attempts to acquire the named lock, as the lock can no longer be released. This can be trivially demonstrated via:
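The snippet referenced here did not survive extraction; a minimal sketch along these lines (not the original code) illustrates the failure mode described above:

```python
from distributed import Client, Lock


def acquire_then_fail():
    lock = Lock("shared-resource")
    lock.acquire()
    # An error (or a lost worker) before release() leaves the lock held forever.
    raise RuntimeError("failed while holding the lock")


client = Client()
try:
    client.submit(acquire_then_fail).result()
except RuntimeError:
    pass  # the task failed, but the scheduler still records the lock as held

# Every later attempt on the same named lock now blocks; with a timeout it
# returns False, without one it hangs indefinitely.
print(Lock("shared-resource").acquire(timeout=5))  # False
```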
The simplest solution to this issue would be to add an (optional, client-specified) TTL to acquired locks, allowing any subsequent acquire attempt to break the existing lock and acquire the target iff the lock's TTL has passed.
I am happy to open a PR for this feature, but would like a second opinion on a few design considerations, for example whether the lock should support a `renew` operation in addition to `release`.
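A rough sketch of the kind of interface under discussion, with an in-memory dict standing in for scheduler-side state; the `ttl` argument and `renew` method are hypothetical and not part of the current `distributed.Lock` API:

```python
import time


class TTLLock:
    """Hypothetical TTL-aware lock interface; not the current distributed.Lock API.

    The class-level dict stands in for the scheduler's lock table:
    name -> expiry timestamp (None means "no TTL, held until released").
    """

    _held = {}  # shared, in-memory stand-in for scheduler state

    def __init__(self, name):
        self.name = name

    def acquire(self, ttl=None):
        now = time.monotonic()
        if self.name in self._held:
            expires = self._held[self.name]
            if expires is None or expires > now:
                return False  # held and not yet expired
        # Lock is free or its TTL has passed: take (or break) it and record our expiry.
        self._held[self.name] = now + ttl if ttl is not None else None
        return True

    def renew(self, ttl):
        # Push our expiry forward while we are still alive and making progress.
        self._held[self.name] = time.monotonic() + ttl

    def release(self):
        self._held.pop(self.name, None)
```

Under this scheme a holder that dies simply stops renewing, and the first acquire attempt after the TTL elapses breaks the stale lock instead of deadlocking.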