Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report exit status if kernel unexpectedly dies #957

Open
tmaxwell-anthropic opened this issue Jul 7, 2023 · 4 comments
Open

Report exit status if kernel unexpectedly dies #957

tmaxwell-anthropic opened this issue Jul 7, 2023 · 4 comments

Comments

@tmaxwell-anthropic
Copy link

If the Jupyter kernel process unexpectedly dies, Jupyter will show a message in the frontend. However, it doesn't report the exit status of the dead process. The exit status would be useful for diagnosing why the process crashed -- for example, I recently spent a while trying to figure out why a kernel process crashed, before eventually realizing it had been SIGKILL'ed by the Linux OOM killer.

I'm not sure what's the best way to implement this. KernelManager.is_alive doesn't provide a way to report the reason for death. jupyter_server.ServerKernelManager adds a reason field; it's not currently used for this purpose, but perhaps it could be? There would also need to be a way to report the message to the frontend, perhaps via a new field in the kernel status message? And then the frontend would need to be updated to render that field.

For now I'm working around this by subclassing KernelManager and ZMQChannelsWebsocketConnection to report the exit code to the frontend via the stderr: https://gist.github.com/tmaxwell-anthropic/de7a54753d312ccce831dbfd48ef2b12 But it would be nice if this could be a first-class feature.

@jakkdl
Copy link

jakkdl commented Jul 17, 2023

Shouldn't this be reported primarily in https://github.com/jupyter-server/jupyter_server? Only once the server has support for it could any eventual client-side support be handled.

@tmaxwell-anthropic
Copy link
Author

My understanding is that jupyter_client is the "client" of the kernel, not the "client" of the server. I.e. jupyter_client is the component of jupyter_server which is responsible for talking to the kernel process. For example, the logic that checks if a kernel is alive is defined here in jupyter_client: https://github.com/jupyter/jupyter_client/blob/main/jupyter_client/manager.py#L647-L654

I figured this feature would first need support in jupyter_client, then jupyter_server, then the actual frontends (e.g. https://github.com/jupyter/nbclassic). Does that sound plausible to you?

@jakkdl
Copy link

jakkdl commented Jul 18, 2023

Ah, my bad, I missed the distinction between ServerKernelManager and KernelManager. And thanks for the clarification of the client/server distinction, think I'm following now.

The interfaces of KernelManager (in client) and GatewayKernelManager (in server) looks quite messy at the moment, with several fields being getattrd and/or type: ignore[attr-defined]. But reason seems to only be set at initialization, so promoting it to a proper property of KernelManager and also using it when the server is killed makes sense to me - and will at least partly clean up the interface mess.

@kevin-bates
Copy link
Member

The entity that knows the actual state of the kernel "process" is the Kernel Provisioner. Provisioners are an abstraction layer that enables kernel lifecycle management other than just local process management (which is what the default LocalProvisioner provides).

Since what you're driving at is essentially the result of KernelProvisionerBase.poll(), it seems this new method on KernelManager could simply leverage its result. (Its too bad KernelManager.is_alive() converts the result to a boolean, but its been that way since before provisioners.) Should you find it necessary to extend LocalProvisioner, we should include a default implementation of that "extension" on KernelProvisionerBase to accommodate existing provisioner implementations.

Also note that the integer status returned from poll() (and relative to the LocalProvisioner) will not necessarily reflect the true exit status of a Windows kernel process if it was terminated on behalf of a signal. So users should expect some hand-wavy, wiggle-room, in what actually gets reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants