WIP Worker state transition refactor #4772

fjetter · 2021-04-29T17:17:40Z

This is still very much in flux, but so far I have most tests running, at least locally.

This refactors the worker state machine so that it follows an execution model similar to the scheduler's: we calculate recommendations and messages during a transition, then perform the recommended transitions until we converge and there are no further recommendations.
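To make the execution model above concrete, here is a minimal sketch of a recommendation loop. It is not the code in this PR: the transition table, the handler names, and `run_transitions` are hypothetical, and the only thing it tries to show is "apply recommended transitions until none are left".

```python
from collections import deque

# Hypothetical handlers. Each one returns (recommendations, messages),
# where recommendations maps a key to the state it should move to next.
def waiting_to_ready(key):
    return {key: "executing"}, []

def ready_to_executing(key):
    return {key: "memory"}, [{"op": "task-started", "key": key}]

def executing_to_memory(key):
    return {}, [{"op": "task-finished", "key": key}]

# Hypothetical transition table: (start, finish) -> handler.
TRANSITIONS = {
    ("waiting", "ready"): waiting_to_ready,
    ("ready", "executing"): ready_to_executing,
    ("executing", "memory"): executing_to_memory,
}

def run_transitions(states, recommendations):
    """Apply recommendations until no further ones are produced.

    `states` maps key -> current state and is updated in place.
    Returns the messages collected along the way.
    """
    messages = []
    pending = deque(recommendations.items())
    while pending:
        key, finish = pending.popleft()
        start = states[key]
        if start == finish:
            continue  # nothing to do, already in the target state
        handler = TRANSITIONS.get((start, finish))
        if handler is None:
            # Raise instead of failing silently on an unknown transition.
            raise RuntimeError(f"Impossible transition for {key!r}: {start} -> {finish}")
        new_recommendations, new_messages = handler(key)
        states[key] = finish
        messages.extend(new_messages)
        pending.extend(new_recommendations.items())
    return messages

states = {"x": "waiting"}
messages = run_transitions(states, {"x": "ready"})
```

In this sketch a single recommendation for `x` cascades through `ready` and `executing` until the task ends up in `memory`, and an unknown `(start, finish)` pair raises rather than being ignored.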

The overall theme of this change is to be less forgiving in edge cases, log more, and raise more often. If something unexpected happens, we do not fail silently.
All connected transitions are also linked by a transaction_id, which is generated at the top of the chain and propagated through. While I haven't put this into the logs consistently, it is already added to the transition log so that one can easily follow the reason for a given transition. (This can already be calculated from the recommendations, but in logs the ID is helpful.)
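As a rough illustration of the transaction_id idea, the sketch below tags every transition log entry with the ID created at the top of a chain, so that related entries can be filtered out later. The helper names (`new_transaction_id`, `record_transition`, `story`) and the log layout are assumptions made for this example, not the worker's actual API.

```python
import uuid

def new_transaction_id():
    # Hypothetical: generated once at the top of a chain of transitions.
    return uuid.uuid4().hex

def record_transition(log, key, start, finish, transaction_id):
    # Every entry carries the ID of the chain that caused it.
    log.append((key, start, finish, transaction_id))

def story(log, transaction_id):
    # All transitions that belong to one chain, in the order they happened.
    return [entry for entry in log if entry[3] == transaction_id]

transition_log = []
first = new_transaction_id()
second = new_transaction_id()

record_transition(transition_log, "x", "waiting", "ready", first)
record_transition(transition_log, "y", "waiting", "fetch", second)
record_transition(transition_log, "x", "ready", "executing", first)

chain = story(transition_log, first)
```

Here `chain` contains only the two entries for `x`, even though an unrelated transition for `y` was logged in between.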

I currently have a strong suspicion that this state machine can be described by only exit and enter actions (hence the few sporadic _transition_enter_{} methods), but I haven't converged on this yet.

One major point of this PR is that I get rid of release_key and distinguish between delete and forget actions. This helps with keeping state such as a suspicious counter, and with understanding what is actually going on.
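One possible reading of the delete/forget split is sketched below. The semantics are an assumption for illustration only: here `delete` drops the stored data but keeps the task's bookkeeping (including a suspicious counter), while `forget` removes every trace of the task. `MiniWorker`, `TaskState`, and their fields are hypothetical stand-ins, not the classes touched by this PR.

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    # Hypothetical, stripped-down task record.
    key: str
    state: str = "memory"
    suspicious_count: int = 0

class MiniWorker:
    def __init__(self):
        self.tasks = {}  # key -> TaskState (bookkeeping)
        self.data = {}   # key -> stored result

    def delete(self, key):
        # Assumed meaning: free the data, keep the TaskState and its counters.
        self.data.pop(key, None)
        self.tasks[key].state = "released"

    def forget(self, key):
        # Assumed meaning: drop the data and all bookkeeping for the key.
        self.data.pop(key, None)
        del self.tasks[key]

worker = MiniWorker()
worker.tasks["x"] = TaskState("x", suspicious_count=2)
worker.data["x"] = 123

worker.delete("x")
kept_count = worker.tasks["x"].suspicious_count  # counter survives a delete

worker.forget("x")
```

Under these assumptions, a task that is deleted and later recomputed still remembers how often it looked suspicious, whereas a forgotten task starts from scratch.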

At the very least, the following items need to be finished before this can be considered reviewable:

Get all tests green
Revisit the signatures of transition functions so they accept TaskState instead of key
Dedicated PR for scheduler changes
Enable the test test_broken_deps (it fails on main as well, but is an excellent stress test for the state machine)
Address all TODOs / FIXMEs. If not possible, open a follow-up ticket

mrocklin · 2021-04-29T17:24:17Z

cc @gforsyth

fjetter · 2021-05-05T17:50:18Z

Brief update:

Still tons of TODOs
Locally, all tests I would expect to be affected by this (test_worker, test_client, test_failing_workers, test_steal, ...) are passing 🎉
In the above-mentioned test cases I hit the occasional deadlock connected to failing connections in gather_dep, which I will investigate next. That's also why CI is failing (or at the very least one of the reasons).
Trying to pull some of these changes out into dedicated PRs, but so far without much luck. Not sure if these are problems I can solve individually, though. See Forget erred tasks // Fix deadlocks on worker #4784

fjetter · 2021-08-05T17:35:39Z

Closed in favour of #5046

fjetter added 5 commits April 30, 2021 14:44

Transition recommendation system

df58d5c

revert changes in distributed/core

cff80eb

revert some changes in tests

48233c5

A bit cleanup

4ff46b9

Assert that all workers are alive after a test

37dad44

fjetter force-pushed the simulate_broken_comm branch from 81b86a8 to 37dad44 April 30, 2021 12:44

fjetter added 2 commits April 30, 2021 16:22

Rewrite more towards enter and exit actions

26b3d4b

A step closer

7f03d52

Missing allow_dead_workers and fix group nbytes

05f8e1b

This was referenced May 6, 2021

Forget erred tasks // Fix deadlocks on worker #4784

Merged

KeyError: ('error', 'waiting') #4800

Closed

fjetter mentioned this pull request May 27, 2021

O(1) rebalance #4774

Merged

fjetter mentioned this pull request Jul 6, 2021

Active Memory Management Control System #4982

Open

fjetter closed this Aug 5, 2021
