Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Worker state transition refactor #4772

Closed
wants to merge 8 commits into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Apr 29, 2021

This is still very much in flow but so far I got, at least locally, most tests running

This refactors the worker state machine such that it follows a similar execution model as the scheduler where we calculate recommendations and messages during a transition and perform these recommended transitions until we converge and there are no further recommendations.

The overall theme of this change is to be less forgiving in edge cases, log more, raise more often. If something unexpected is happening we do not fail silently.
All connected transitions are also linked using a transaction_id which is generated at the top of the chain and propagated through. While I haven't put this into the logs consistently, this is already added to the transition log such that one can easily follow what the reason of a given transition is/was (this is already possible to calculate based on the recommendations but in logs the ID is helpful)

I have currently a strong suspicion that this state machine can be described by only exit and enter actions (hence the few sporadic _transition_enter_{} methods but I haven't converged on this, yet.

One major point about this PR is that I get rid of the release_key and distinguish between delete and forget actions. This helps with keeping state like a suspicious counter and helps with understanding what is actually going on

At the very least the following items are to be finished before this can be considered reviewable

  • Get all tests green
  • Reiterate signatures of transition functions to accept TaskState instead of key
  • Dedicated PR for scheduler changes
  • Enable test test_broken_deps (this fails on main as well but is an excellent stress test for the state machine)
  • Adress all TODOs / FIXMEs. If not possible, follow up ticket

@mrocklin
Copy link
Member

cc @gforsyth

@fjetter
Copy link
Member Author

fjetter commented May 5, 2021

Brief update:

  • Still tons of TODOs
  • Locally all tests I would expect to be affected by this test_worker, test_client, test_failing_workers, test_steal, ...) are passing 🎉
  • In the above mentioned test cases I hit the occasional deadlock connected to failing connections in gather_dep I will investigate next. That's also why CI is failing (at the very least one of the reasons)
  • Trying to pull out some changes of this in dedicated PRs but so far no big luck. Not sure if these are problems I can solve individually, though Forget erred tasks // Fix deadlocks on worker #4784

@fjetter
Copy link
Member Author

fjetter commented Aug 5, 2021

Closed in favour of #5046

@fjetter fjetter closed this Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants