distributed.lock.Lock deadlocks on worker failure. #2362
Hrm, do any of the normal Python locks have TTLs? I'm curious what happens with file-based locks in Python when a process goes down.

An alternative would be to clean up locks on workers/clients when those workers/clients are removed from the scheduler. When someone acquired a lock they would also send their client or worker ID. When we removed clients or workers we would check with the LockExtension and it would release any lock associated with that client or worker.

It's worth noting that this behavior may not always be desired. I can imagine situations where the right thing to do is to keep the lock acquired even if the worker fails. I think I agree with you, though, that something like this is probably going to be more useful than harmful.
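A rough sketch of what such scheduler-side cleanup could look like, assuming the lock extension kept an `owners` mapping of lock name to worker address recorded at acquire time; that bookkeeping and the `force_release` call are hypothetical, not the current LockExtension API:

```python
from distributed.diagnostics.plugin import SchedulerPlugin


class ReleaseLocksOnWorkerLoss(SchedulerPlugin):
    """Release locks owned by a worker once that worker leaves the cluster.

    Assumes a (hypothetical) ``owners`` mapping of lock name -> worker address,
    recorded by the lock extension at acquire time.
    """

    def __init__(self, lock_extension):
        self.lock_extension = lock_extension

    def remove_worker(self, scheduler=None, worker=None, **kwargs):
        owners = self.lock_extension.owners  # hypothetical bookkeeping
        for name, owner in list(owners.items()):
            if owner == worker:
                # Force-release the abandoned lock so waiters can proceed.
                self.lock_extension.force_release(name)  # hypothetical method
                del owners[name]
```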
There's a fair bit of prior art for TTL-based lock expiration in the Redis Redlock algorithm (and various clients), NFS locking, and DynamoDB-based locks, but I don't believe any of the Python standard library lock implementations have this form of TTL logic. This may be because the standard implementations are intended for single-process systems with Python semantics (i.e. exception propagation with context managers, try/finally, etc...), where errors either allow lock cleanup or take down the entire process. In contrast, we can't rely on any form of worker/client-based graceful failure or cleanup in a distributed context. As far as I can tell, many file-based locks just allow deadlock to occur if a client fails ungracefully, acknowledging that this deadlock can be resolved by removing the abandoned lock file.

The scheduler-level lock-cleanup-on-worker-failure you propose would likely resolve the problem outlined in this issue. This type of "client lifetime" lock expiration semantic likely isn't used in the systems linked above because, unlike dask, they don't have a model of long-lived client<->server connections that can be used to detect lock-holder failures. My concerns with lock-cleanup-on-worker-failure are that (1) it may be slightly more complex to implement and (2) it does not address issues where the deadlock is not associated with worker failure. (1) may just be a product of my relative lack of familiarity with scheduler extensions, and may be simple to implement via plugin-based worker/client state change callbacks. (2) is harder to pin down, but one could easily imagine a scenario where a worker experiences some long-running blocking failure accessing an external resource.

My weakly-held preference would be toward using an expiration TTL, but I would happily defer to a strong preference or argument for the alternative solution. In either case, I do agree that this functionality must be optional and should likely be opt-in behavior.
OK, you've convinced me; I agree that TTL behavior seems sensible. It's also not incompatible with future cleanup-on-worker-failure behavior. Is this something that you would be willing to implement?
Yes, I'm happy to put together an initial pull request for this feature. I'll target having a prototype for review in the next week or so.
@asford I am running into this issue every once in a while (locks that can no longer be acquired because the holding worker failed). Did you have time to submit a PR? I can't find it in distributed. Thanks.
Hi there, and thanks for the great project Dask is! Has anyone had a chance to work on this?
Not to my knowledge. If you want to help out, it would be welcome. Also cc @fjetter, in case the Semaphore implementation (which I think does support TTLs) could be used to support this.
The Semaphore implementation should serve as a drop-in replacement if used with `max_leases=1`. What the semaphore is doing: every lease it grants is refreshed on a regular interval by the holder and is released by the scheduler if the holder stops refreshing it (e.g. because the worker died), with the relevant timeouts configured in distributed/distributed/distributed.yaml (lines 45 to 47 in 24007c2).
Brief docs: https://docs.dask.org/en/latest/futures.html#semaphore
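For reference, a minimal usage sketch of the drop-in pattern; the lease-timeout value below is illustrative, not the shipped default:

```python
import dask
from distributed import Client, Semaphore


def critical_section(x, sem):
    # Only one task at a time may hold the single lease; if the holder's worker
    # dies, the lease expires after the configured timeout instead of deadlocking.
    with sem:
        return x + 1


if __name__ == "__main__":
    # Illustrative timeout; the shipped default lives in distributed.yaml.
    with dask.config.set({"distributed.scheduler.locks.lease-timeout": "30s"}):
        client = Client()
        sem = Semaphore(max_leases=1, name="resource-lock")
        print(client.gather(client.map(critical_section, range(5), sem=sem)))
```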
Ah, we just hit this as well; it may be worth making a note in the docs. When building an HA system, in the case of a benign crash (e.g., GPU OOM) it's reasonable to expect restarting workers to be safe, but if the crash happens while a distributed Lock is held, the lock is never released and the restarted workers deadlock waiting for it.
Use of `distributed.lock.Lock` causes deadlocks in cases where a lock-holding worker is lost and/or fails to release the distributed lock due to an error. This blocks all subsequent attempts to acquire the named lock, as the lock can no longer be released. This can be trivially demonstrated via:
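The snippet referenced here did not survive extraction; a minimal sketch along these lines (not the original code) illustrates the failure mode described above:

```python
from distributed import Client, Lock


def acquire_then_fail():
    lock = Lock("shared-resource")
    lock.acquire()
    # An error (or a lost worker) before release() leaves the lock held forever.
    raise RuntimeError("failed while holding the lock")


client = Client()
try:
    client.submit(acquire_then_fail).result()
except RuntimeError:
    pass  # the task failed, but the scheduler still records the lock as held

# Every later attempt on the same named lock now blocks; with a timeout it
# returns False, without one it hangs indefinitely.
print(Lock("shared-resource").acquire(timeout=5))  # False
```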
The simplest solution to this issue would be to add an (optional, client-specified) TTL to acquired locks, allowing any subsequent acquire attempt to break the existing lock and acquire the target iff the lock's TTL has passed.
I am happy to open a PR for this feature, but would like a second opinion on a few design considerations, for example whether the lock should support a `renew` operation in addition to `release`.
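A rough sketch of the kind of interface under discussion, with an in-memory dict standing in for scheduler-side state; the `ttl` argument and `renew` method are hypothetical and not part of the current `distributed.Lock` API:

```python
import time


class TTLLock:
    """Hypothetical TTL-aware lock interface; not the current distributed.Lock API.

    The class-level dict stands in for the scheduler's lock table:
    name -> expiry timestamp (None means "no TTL, held until released").
    """

    _held = {}  # shared, in-memory stand-in for scheduler state

    def __init__(self, name):
        self.name = name

    def acquire(self, ttl=None):
        now = time.monotonic()
        if self.name in self._held:
            expires = self._held[self.name]
            if expires is None or expires > now:
                return False  # held and not yet expired
        # Lock is free or its TTL has passed: take (or break) it and record our expiry.
        self._held[self.name] = now + ttl if ttl is not None else None
        return True

    def renew(self, ttl):
        # Push our expiry forward while we are still alive and making progress.
        self._held[self.name] = time.monotonic() + ttl

    def release(self):
        self._held.pop(self.name, None)
```

Under this scheme a holder that dies simply stops renewing, and the first acquire attempt after the TTL elapses breaks the stale lock instead of deadlocking.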