Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test_restarting_does_not_deadlock #8849

Merged
merged 5 commits into from
Oct 29, 2024
Merged

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Aug 30, 2024

I was looking into some of the shuffle related test failures.

I was looking into test_restarting_does_not_deadlock and noticed that the test timed out at wait_until_worker_has_tasks which appeared to be a little suspicious, i.e. one of the workers just never received any tasks (might of course be a fundamental scheduling problem, idk yet)

anyhow, I grew suspicious of our "while not condition sleep" patterns and started with migrating this to an event based utility.

cc @hendrikmakait you might be interested in this

@fjetter
Copy link
Member Author

fjetter commented Aug 30, 2024

The deadlock of the test happens with the following logs

2024-08-30 18:20:53,413 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 initialized by task ('shuffle-transfer-4b151ea2e7118ca361cbf67e2e3cbf08', 2) executed on worker tcp://127.0.0.1:64362
2024-08-30 18:20:54,299 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 deactivated due to stimulus 'task-finished-1725034854.297965'

so the shuffle is being deactivated prematurely because of a barrier task being finsihed / release (shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd)
The transition log is huge at this point...
transition_log.txt.zip

@fjetter
Copy link
Member Author

fjetter commented Aug 30, 2024

I think the shuffle actually finishes just way too quickly...

Transition(key=('p2pshuffle-3c0e70f475016b3312ed2e08b06af4fd', 0), start='processing', finish='memory', recommendations={('getitem-f4ac494203a9f98493521efbbad2b8ea', 0): 'processing', 'shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd': 'released'}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.881619), Transition(key='shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd', start='memory', finish='released', recommendations={}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.88164)])

@fjetter
Copy link
Member Author

fjetter commented Aug 30, 2024

I replaced the nanny with a simple Worker which is much faster since it doesn't require us to start a process. This should be sufficient

@fjetter fjetter marked this pull request as ready for review August 30, 2024 16:30
@fjetter fjetter changed the title [WIP] Shuffle test failures Fix test_restarting_does_not_deadlock Aug 30, 2024
Copy link
Contributor

github-actions bot commented Aug 30, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files  ±0      25 suites  ±0   10h 13m 41s ⏱️ + 4m 41s
 4 125 tests ±0   4 011 ✅ +2    111 💤 ±0  3 ❌  - 2 
47 651 runs  ±0  45 543 ✅ +3  2 105 💤  - 1  3 ❌  - 2 

For more details on these failures, see this check.

Results for commit 5fc3d74. ± Comparison against base commit 7c2134e.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member Author

fjetter commented Sep 2, 2024

My above analysis was wrong. The test logic is flawed. It prepares all shuffle input data on worker A but expected transfer tasks to be scheduled on worker B. That's not necessarily guaranteed and explains why we're not hitting this condition every now and then.

@fjetter
Copy link
Member Author

fjetter commented Sep 2, 2024

I slightly rewrote the test. I also confirmed that this test triggers the deadlock condition of #8088 which was the original issue this test was introduced for

@fjetter fjetter merged commit 2953090 into dask:main Oct 29, 2024
28 of 32 checks passed
@fjetter fjetter deleted the shuffle_failures branch October 29, 2024 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant