Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure Lock will not lock up in case of worker failures #8770

Merged
merged 2 commits into from
Jul 17, 2024

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Jul 16, 2024

Closes #2362

This replaces the Lock implementation with the Semaphore. The caveat is that the Lock no longer strictly guarantees that there are never two threads acquiring the lock since the resilience of the Semaphore is timeout based.

In hindsight, the Semaphore should likely have used a plugin based approach that invalidates the leases a worker holds once it is removed but this could be added later.
To still be able to enforce strictness on the lock, I allowed to disable the lease validation logic by setting dask.config.set({"distributed.scheduler.locks.lease-timeout": "inf"}) which I think is a compromise

There is one minor breakage around what happens if a semaphore is "closed". The new implementation will just reset the state. This is necessary to allow patterns like...

    def f(x):
        with Lock("x"):
           ...

    futures = c.map(f, range(20))

where the lock is not initialized ahead of time. I think this is a fair compromise.

Copy link
Contributor

github-actions bot commented Jul 16, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files  ± 0      29 suites  ±0   11h 52m 7s ⏱️ + 2m 27s
 4 087 tests + 5   3 971 ✅ + 6    112 💤 ±0  4 ❌  - 1 
55 286 runs  +70  52 844 ✅ +67  2 438 💤 +6  4 ❌  - 3 

For more details on these failures, see this check.

Results for commit c570707. ± Comparison against base commit 110eac1.

This pull request removes 2 and adds 7 tests. Note that renamed tests count towards both.
distributed.tests.test_scheduler ‑ test_deadlock_dependency_of_queued_released
distributed.tests.test_semaphore ‑ test_close_sync
distributed.tests.test_client ‑ test_release_persisted_collection
distributed.tests.test_client ‑ test_release_persisted_collection_sync
distributed.tests.test_locks ‑ test_locks_inf_lease_timeout
distributed.tests.test_scheduler ‑ test_deadlock_dependency_of_queued_released_when_worker_removed[False]
distributed.tests.test_scheduler ‑ test_deadlock_dependency_of_queued_released_when_worker_removed[True]
distributed.tests.test_scheduler ‑ test_deadlock_dependency_of_queued_released_when_worker_replaced[False]
distributed.tests.test_scheduler ‑ test_deadlock_dependency_of_queued_released_when_worker_replaced[True]

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait self-requested a review July 16, 2024 13:07
Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @fjetter! This change looks good to me. I'm comfortable with having breaking changes for the lock but I have one nit that might be nice to take care of if it's quick.

In hindsight, the Semaphore should likely have used a plugin based approach that invalidates the leases a worker holds once it is removed but this could be added later.

+1, should we add an issue for this?

distributed/lock.py Show resolved Hide resolved
distributed/semaphore.py Outdated Show resolved Hide resolved
@fjetter fjetter merged commit ffcb271 into dask:main Jul 17, 2024
29 of 34 checks passed
@fjetter fjetter deleted the use_semaphore_for_lock branch July 17, 2024 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

distributed.lock.Lock deadlocks on worker failure.
2 participants