Add State Machine Handling to KernelManager #763

Open
blink1073 opened this issue Mar 30, 2022 · 7 comments

@blink1073
Contributor

In #751 we explored better internal state management for the KernelManager class.

We decided there to adopt the transitions library. It is MIT licensed, already on conda-forge, and lets you produce diagrams from your state machine.
We can produce the state machine graph image automatically, backed by CI.

A rough sketch looks like:

Currently, the server adds "execution_state" to the manager object based on kernel status messages. I think eventually we should move that logic here, but it doesn't need to be in this PR.

The states used here could mirror the ones used in JupyterLab:

  • Unknown
  • Starting
  • Running (Maps to Idle and Busy execution states)
  • Terminating
  • Restarting
  • Dead (with optional traceback)

We start in the Unknown state. State transitions are as follows:

  • Unknown -> Starting
  • Starting -> Running or Dead
  • Running -> Terminating, Restarting, or Dead
  • Terminating -> Dead
  • Restarting -> Running or Dead
  • Dead -> Starting

Then we have a state: Unicode() trait and a convenience wait_for_state() method that takes an optional desired state (or returns on any state change). It would also raise an exception, along with the traceback, if the kernel reaches the Dead state when that wasn't the expected state.

The state trait can also be observed directly.

We also have an exception: Unicode() trait that is set when an exception leads to a Dead status.
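A minimal sketch of how the states, transitions, and traits described above could fit together, assuming the transitions library and traitlets. The trigger names and the RuntimeError are illustrative assumptions, not existing jupyter_client API:

```python
# Sketch only: not the actual jupyter_client implementation.
import asyncio

from traitlets import HasTraits, Unicode
from transitions import Machine

# States and transitions taken from the lists above.
STATES = ["Unknown", "Starting", "Running", "Terminating", "Restarting", "Dead"]

TRANSITIONS = [
    {"trigger": "begin_start", "source": ["Unknown", "Dead"], "dest": "Starting"},
    {"trigger": "mark_running", "source": ["Starting", "Restarting"], "dest": "Running"},
    {"trigger": "begin_shutdown", "source": "Running", "dest": "Terminating"},
    {"trigger": "begin_restart", "source": "Running", "dest": "Restarting"},
    {"trigger": "mark_dead", "source": "*", "dest": "Dead"},
]


class KernelLifecycle(HasTraits):
    """Illustrative holder for the lifecycle state described in this issue."""

    state = Unicode("Unknown")
    exception = Unicode("")  # traceback text when a failure leads to Dead

    def __init__(self):
        super().__init__()
        self._state_changed = asyncio.Event()
        # transitions attaches the trigger methods and keeps ``state`` updated;
        # the state trait can still be observed directly via traitlets.
        self._machine = Machine(
            model=self,
            states=STATES,
            transitions=TRANSITIONS,
            initial="Unknown",
            after_state_change=self._notify_state_change,
        )

    def _notify_state_change(self):
        # Wake up anyone blocked in wait_for_state(); clear for the next wait.
        self._state_changed.set()
        self._state_changed.clear()

    async def wait_for_state(self, desired=None):
        """Return on the next state change, or once ``desired`` is reached."""
        while True:
            await self._state_changed.wait()
            if self.state == "Dead" and desired not in (None, "Dead"):
                raise RuntimeError(f"Kernel died: {self.exception}")
            if desired is None or self.state == desired:
                return self.state
```

One reason to lean on transitions here is that its GraphMachine extension can render the machine as a diagram, which is what would back the CI-generated graph image mentioned above.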

@kevin-bates
Member

kevin-bates commented Mar 31, 2022

This is interesting, thanks for opening this issue @blink1073.

If we were to include the triggers that initiate these transitions, it seems like we'd have the following ...

| Trigger | From State | To State | Notes |
| --- | --- | --- | --- |
| start_kernel() | Unknown | Starting | Successful startup initiation |
| ready.set_result() | Starting | Running | Successful startup completion |
| ready.set_exception() | Starting | Dead | Unsuccessful startup |
| shutdown_kernel() | Running | Terminating | Successful shutdown initiation |
| shutdown_ready.set_result() | Terminating | Dead | Successful shutdown completion |
| shutdown_ready.set_exception() | Terminating | Dead | Unsuccessful shutdown |
| restart_kernel() | Running | Restarting | Successful restart initiation |
| shutdown_ready.set_result() and startup_ready.set_result() | Restarting | Running | Successful restart completion; shutdown_ready is the ready future used during shutdown, while startup_ready is the one used during startup |
| shutdown_ready.set_exception() or startup_ready.set_exception() | Restarting | Dead | Unsuccessful restart |
| interrupt_kernel() | Running | Running | Successful interrupt; the kernel's execution state goes from idle/busy to idle |

Because Restarting is literally a composition of shutdown and startup, I'm not sure I've expressed that correctly, nor am I sure about the ready future references.

I suppose we could solely express Restarting in terms of the Terminating and Starting entries, but, from a contextual standpoint, I think it makes sense that KernelManager has a "Restarting" state.

What do others think?

@davidbrochart
Member

About "shutdown and restart", I think we could clarify the specification. Restarting could be as simple as clearing the kernel's execution context, thus virtually removing the need for a "restarting" state.

@Zsailer
Member

Zsailer commented Apr 19, 2022

@kevin-bates, thank you for making that table. That's super helpful as we move into implementing this.

One comment: we might not be able to conclude that shutdown_ready.set_exception() results in a dead state. If the kernel fails to shut down, the state could be Unknown, or still Running. But we can clarify this in the initial PR.

@davidbrochart, I think "restart" has multiple meanings that need a lot more clarification (I'll explain more on the other issue). We can iterate on the restart states in this state machine in future PRs.

I want to propose a few more states here.

  • "Started and Connecting"
  • "Started and Connected"
  • "Started but Failed to Connect"

I think we need more visibility into the ZMQ sockets here. A kernel process might be running, but communication with the kernel is failing for some reason. This could happen when ZMQ sockets go down (not impossible in a remote kernel scenario) or a third-party kernel implements the kernel messaging protocol incorrectly (speaking from personal experience when debugging other kernels 😅 ).
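As a rough sketch of how those connection-oriented states could be detected with today's blocking client APIs (the connection_state values are the proposed ones above, not anything jupyter_client exposes):

```python
# Sketch only: probe connectivity with the existing BlockingKernelClient.
from jupyter_client.manager import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()                      # process is up: "Started and Connecting"

client = km.client()
client.start_channels()
try:
    # wait_for_ready() sends kernel_info requests and waits for a reply,
    # raising RuntimeError on timeout.
    client.wait_for_ready(timeout=30)
    connection_state = "Started and Connected"
except RuntimeError:
    connection_state = "Started but Failed to Connect"
finally:
    client.stop_channels()

km.shutdown_kernel()
```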

@blink1073
Contributor Author

@Zsailer and I talked a bit about this yesterday. A few ideas that came out:

  • The state machine idea was too brittle in practice, as it makes subclassing difficult
  • We should rename the private _restarter class to monitor and have it perform heartbeat checks and track status relating to last_activity, execution_state, and the connectivity status of each of the channels. The nudge logic that is currently in Jupyter Server could probably be handled here as well, rather than being performed on each new websocket connection as it is today (a rough sketch follows this list). The monitor would use the provisioner to provide a lifecycle_status that is starting, dead, etc.
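For illustration, a monitor-side nudge could boil down to something like the sketch below; it uses existing blocking-client calls, but the helper itself is hypothetical and simpler than Jupyter Server's actual nudge logic:

```python
# Sketch of a nudge check against a connected blocking client.
from queue import Empty


def nudge(client, timeout=5):
    """Send a kernel_info request and report whether the kernel replied."""
    msg_id = client.kernel_info()      # kernel_info_request on the shell channel
    try:
        while True:
            reply = client.get_shell_msg(timeout=timeout)
            if reply["parent_header"].get("msg_id") == msg_id:
                return True            # channels are connected and responsive
    except Empty:
        return False                   # no reply: treat the channel as not connected
```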

@kevin-bates
Member

@blink1073 and @Zsailer - this makes sense.

  • Would monitor then be something that could be configurable? (Thinking we should be careful here - at least initially.)
  • Would the culling of a kernel (based on idle status) still reside in the server? (This seems like a good example of application-level logic that belongs in the server.)

> The monitor would use the provisioner to provide a lifecycle_status that is starting, dead, etc.

This seems to imply that the provisioner would need to provide a status that it doesn't provide today, and asking implementers of provisioners to get this right may be asking too much. I'm just wondering whether this shouldn't still be a function of the KernelManager, since lifecycle management runs through it and is, in that sense, independent of provisioner implementations. Since the KernelManager has-a monitor (is that still true?), perhaps the KernelManager could set the lifecycle state on the monitor, and the monitor could, given an instance of a provisioner, fine-tune that status by asking the provisioner whether it's still running (via its poll() method) when necessary (e.g., when the current state is started), since that's how a dead state would be detected today.
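A rough sketch of that division of labor, assuming a hypothetical KernelMonitor class; only the provisioner's existing poll() method, which returns None while the process is alive, is real API here:

```python
# Sketch: the KernelManager pushes lifecycle state in, and the monitor
# refines it by polling the provisioner when needed.
from typing import Optional

from jupyter_client.provisioning import KernelProvisionerBase


class KernelMonitor:
    def __init__(self, provisioner: KernelProvisionerBase):
        self.provisioner = provisioner
        self.lifecycle_status = "unknown"

    def set_lifecycle_status(self, status: str) -> None:
        # Called by the KernelManager as it drives start/shutdown/restart.
        self.lifecycle_status = status

    async def refresh(self) -> str:
        # Only consult the provisioner when we think the kernel is up;
        # poll() returns None while the kernel process is still running.
        if self.lifecycle_status in ("started", "connected"):
            exit_code: Optional[int] = await self.provisioner.poll()
            if exit_code is not None:
                self.lifecycle_status = "dead"
        return self.lifecycle_status
```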

@blink1073
Contributor Author

Hey @kevin-bates, thanks for the feedback. I think it would make sense not to make monitor configurable to start, and to leave the culling out as you said. Yes, the KernelManager has-a monitor, and the lifecycle_status could be managed without changes to provisioners.

@Zsailer
Member

Zsailer commented Aug 31, 2022

Listing out the full set of states while chatting with @blink1073:

  • lifecycle_state

    • starting
    • connecting
    • connected
    • restarting
    • terminating
    • dead
    • unknown
  • execution_state (published on the IOPub channel)

    • idle
    • busy
    • starting
    • unknown
  • last_activity == last change in execution state

The idea is that the "monitor" object would track all of these states.

This monitor object would take the place of the .nudge logic in Jupyter Server and allow other consumers of jupyter_client to benefit from this monitoring of the kernel states.
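As a rough illustration, a hypothetical container for these fields might update execution_state and last_activity from IOPub status messages like so (none of this is existing jupyter_client API):

```python
# Hypothetical monitor state: lifecycle_state is driven by the KernelManager,
# while execution_state and last_activity are updated from IOPub "status" messages.
from datetime import datetime, timezone


class KernelMonitorState:
    def __init__(self):
        self.lifecycle_state = "unknown"   # starting, connecting, connected, ...
        self.execution_state = "unknown"   # idle, busy, starting
        self.last_activity = None

    def handle_iopub_message(self, msg: dict) -> None:
        """Update execution_state/last_activity from a deserialized IOPub message."""
        if msg.get("header", {}).get("msg_type") == "status":
            self.execution_state = msg["content"]["execution_state"]
            self.last_activity = datetime.now(timezone.utc)
```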
