Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distributed.lock.Lock deadlocks on worker failure. #2362

Closed
asford opened this issue Nov 16, 2018 · 10 comments · Fixed by #8770
Closed

distributed.lock.Lock deadlocks on worker failure. #2362

asford opened this issue Nov 16, 2018 · 10 comments · Fixed by #8770

Comments

@asford
Copy link
Contributor

asford commented Nov 16, 2018

Use of distributed.lock.Lock causes deadlocks in cases where a lock-holding worker is lost and/or fails to release the distributed lock due to an error. This blocks all subsequent attempts to acquire the named lock, as the lock can no longer be released.

This can be trivially demonstrated via:

lock_a = Lock("x")
lock_a.acquire()

# simulated worker loss 
del lock_a

# ...never acquirable
lock_b = Lock("x")
lock_b.acquire()

The simplest solution to this issue would be to add an (optional, client specified) TTL to acquired locks, allowing any subsequent acquire attempts to break the existing lock and acquire the target iff the lock TTL has passed.

I am happy to open a PR for this feature, but would like to have a second opinion on a few design considerations:

  1. What is the expected client behavior if the client releases an already-expired lock? The simplest behavior would be to treat the lock as un-acquired and return an equivalent error.
  2. Should locks support a "renew" operation, extending the TTL of an acquired lock? This could conceivably provide support for long-running operations and/or allow improved responsiveness. This may require extending the current API with a renew operation in addition to release.
  3. Should lock expiration occur "on expire" or "on demand"? Expiration "on demand" would only break the target lock if another acquire attempt occurs, which may be desirable behavior but would potentially mask occurrences where a long-running operation exceeds the specified TTL during smaller-scale development and testing.
@mrocklin
Copy link
Member

Hrm, do any of the normal Python locks have TTLs? I'm curious what happens with file-based locks in Python when a process goes down.

An alternative would be to clean up locks on workers/clients when those workers/clients are removed from the scheduler. When someone acquired a lock they would also send their client or worker ID. When we removed clients or workers we would check with the LockExtension and it would release any lock associated with that client or worker.

It's worth noting that this behavior may not always be desired. I can imagine situations where the right thing to do is to keep the lock acquired even if the worker fails. I think I agree with you though that something like this is probably going to be more useful than harmful.

@asford
Copy link
Contributor Author

asford commented Nov 18, 2018

There's a fair bit of prior art for TTL-based lock expiration in redis redlock algorithm (and various clients), nfs locking, and dynamodb-based locks, but I don't believe any of the python standard library lock implementations have this form of TTL logic. This may be because the standard implementations are intended for single-processes systems with python semantics (i.e. exception propagation with context managers, try/finally, etc...) where errors either allow lock cleanup or take down the entire process. In contrast we can't rely on any from of worker/client based graceful failure or cleanup in a distributed context.

As far as I can tell many file-based locks just allow deadlock to occur if a client fails un-gracefully, acknowledging that this deadlock can be resolved by removing the abandoned lock file.

The scheduler-level lock-cleanup-on-worker-failure you propose would likely resolve the problem outlined in this issue. This type of "client lifetime" lock expiration semantic likely isn't used in the systems linked above because, unlike dask, they don't have a model of long-lived client<->server connections that can be used to detect lock-holder failures.

My concerns with lock-cleanup-on-worker-failure are that (1) it may be slightly more complex to implement and (2) it does not address issues where deadlock may not be associated with worker failure. (1) may just be a product of my relative lack of familiarity with scheduler extensions, and may be simple to implement via plugin-based worker/client state change callbacks. (2) is harder to pin down, but one could easily imagine a scenario where a worker experiences some long-running blocking failure accessing an external resource.

My weakly-held preference would be toward using an expiration TTL, but I would happily defer to a strong preference or argument for the alternative solution. In either case, I do agree that this functionality must be optional and should likely be opt-in behavior.

@mrocklin
Copy link
Member

mrocklin commented Nov 19, 2018

OK you've convinced me, I agree that TTL behavior seems sensible. It's also not incompatible with future cleanup-on-worker-failure behavior.

Is this something that you would be willing to implement?

@asford
Copy link
Contributor Author

asford commented Nov 20, 2018

Yes, I'm happy to put together an initial pull for this feature. I'll target having a prototype for review in the next week or so.

@aeantipov
Copy link

@asford I am running into this issue every once in a while (unacquired locks due to workers failing). Did you have time to submit a PR? I can't find it in distributed. Thanks.

@tomaszgy
Copy link

Hi there and thanks for a great project Dask is! Has anonye had a chance to work on this?

@mrocklin
Copy link
Member

Not to my knowledge. If you want to help out it would be welcome.

Also cc @fjetter , in case the Semaphore implementation (which I think does support TTLs) could be used to support this.

@fjetter
Copy link
Member

fjetter commented Oct 29, 2020

The Semaphore implementation should serve as a drop in replacement if used as Lock ~ Semaphore(max_leases=1) and offers a TTL implementation (we just call it slightly differently but I'm open to change :) )
When we introduced the semaphore, we also briefly discussed if it could not be used as a backend for Lock but ultimately, I ran out of time and never got around to do so. IIRC, there are a few subtle api differences in terms of when to await and create the object, which is why we would've needed to change a few tests but it should not impact usage significantly.

What the semaphore is doing, every acquire will create a unique lease_id which is registered with the scheduler. The semaphore on client side then repeatedly refreshes the lease on the scheduler, i.e. as long as the client who acquired the lease is still alive the lease is never freed. If it is never refreshed, the lease is eventually released by the scheduler and therefore it avoids a deadlock.
This behaviour can be tuned to your need with the options distributed.scheduler.locks.lease-validation-interval and distributed.scheduler.locks.lease-timeout, see

locks:
lease-validation-interval: 10s # The interval in which the scheduler validates staleness of all acquired leases. Must always be smaller than the lease-timeout itself.
lease-timeout: 30s # Maximum interval to wait for a Client refresh before a lease is invalidated and released.

brief docs https://docs.dask.org/en/latest/futures.html#semaphore

@tomaszgy
Copy link

@mrocklin and @fjetter, thanks a lot for the quick responses!

I'm going to go with a Lock ~ Semaphore(max_leases=1) approach so thanks for pointing it out :-)

@lmeyerov
Copy link

Ah we just hit this as well - may be worth making a note in docs. When building an HA system, in the case of a benign crash (e.g., GPU OOM), it's reasonable to expect restarting workers to be safe, but if the responsible crasher is due to a Lock, it will currently deadlock even across restarts. (We hit this as bsql is not reentrant, so we have to gate worker access to it.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants