-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix test_restarting_does_not_deadlock #8849
Conversation
The deadlock of the test happens with the following logs
so the shuffle is being deactivated prematurely because of a barrier task being finsihed / release ( |
I think the shuffle actually finishes just way too quickly...
|
I replaced the nanny with a simple Worker which is much faster since it doesn't require us to start a process. This should be sufficient |
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 25 files ±0 25 suites ±0 10h 13m 41s ⏱️ + 4m 41s For more details on these failures, see this check. Results for commit 5fc3d74. ± Comparison against base commit 7c2134e. ♻️ This comment has been updated with latest results. |
My above analysis was wrong. The test logic is flawed. It prepares all shuffle input data on worker A but expected transfer tasks to be scheduled on worker B. That's not necessarily guaranteed and explains why we're not hitting this condition every now and then. |
I slightly rewrote the test. I also confirmed that this test triggers the deadlock condition of #8088 which was the original issue this test was introduced for |
I was looking into some of the shuffle related test failures.
I was looking into
test_restarting_does_not_deadlock
and noticed that the test timed out atwait_until_worker_has_tasks
which appeared to be a little suspicious, i.e. one of the workers just never received any tasks (might of course be a fundamental scheduling problem, idk yet)anyhow, I grew suspicious of our "while not condition sleep" patterns and started with migrating this to an event based utility.
cc @hendrikmakait you might be interested in this