
[DISCUSS] RESTful Asynchronous Kernel Starts #197

Closed
blink1073 opened this issue Apr 5, 2020 · 10 comments
@blink1073
Contributor

blink1073 commented Apr 5, 2020

Jupyter Notebook has had an implicit assumption that starting a kernel process is instantaneous. This results in a race condition when two requests try to start a session at the same time: the second request checks whether a session exists, sees that it does not (because the first one has not yet been saved), and creates a new one. You then end up with two sessions for the same path.

Additionally, if the kernel takes a very long time to start, there is no feedback to the client. In JupyterLab 1.2, this results in "No Kernel!" and a busy indicator until the kernel is started. In JupyterLab 2.0, you see "Kernel" and an idle indicator until the response is received, at which point you see "Python 3". You can see this behavior by adding a yield gen.sleep(10) in MappingKernelManager.start_kernel. Admittedly, we could hack the display to show a pending name, but a better fix is a clearer server/client contract.

(screenshots: kernel status indicators in JupyterLab 1.2 and 2.0)

I propose that we upgrade the server so that a POST to /api/kernels, /api/sessions, or /api/terminals whose body includes an "async" flag (or an equivalent name) receives a response of the following form, following the restcookbook:

HTTP/1.1 202 Accepted
Location: /queue/12345

Text inline from the cookbook:


Instead of creating the actual resources, create a temporary one. Instead of returning a 201 (Created) HTTP response, you can issue a 202 (Accepted) response code. This informs the client that the request has been accepted and understood by the server, but the resource is not (yet) created. Send the temporary resource inside the Location header.

This location can store information about the status of the actual resource: an ETA on when it will be created, what is currently being done or processed.

When the actual resource has been created, the temporary resources can return a 303 (See other) response. The location header returns the URI to the definitive resource. A client can either DELETE the temporary resource, or the server can expire this resource and return a 410 (Gone) later on.


This way, the client immediately gets feedback that a kernel or a session is being created without having to wait for it. By making the behavior opt-in, clients can migrate to it gradually; at some later date we could require that all clients handle 202 responses.

The server would need to be upgraded with new database entries for pending kernels and sessions, and to consult them when determining whether one exists. It would also need to implement the protocol defined above. Even if a client does not request async behavior, we can still use this new table to fix the race condition.
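To make the proposal concrete, here is a minimal sketch of the server-side flow, assuming an in-memory dict standing in for the proposed pending table; all names (post_session, finish_start, the queue URL shape) are hypothetical, not settled API:

```python
import uuid

# Hypothetical in-memory stand-ins for the proposed database tables.
PENDING = {}   # queue_id -> {"path": ..., "status": ...}
SESSIONS = {}  # session_id -> session record

def post_session(path, use_async=False):
    """Handle POST /api/sessions with the proposed opt-in 'async' flag."""
    # The pending table is consulted even for synchronous clients,
    # which is what closes the duplicate-session race condition.
    for rec in list(PENDING.values()) + list(SESSIONS.values()):
        if rec["path"] == path:
            return 200, rec  # an equivalent session already exists
    queue_id = uuid.uuid4().hex
    PENDING[queue_id] = {"path": path, "status": "initializing"}
    if use_async:
        # 202 Accepted: resource not created yet; poll the Location URL.
        return 202, {"Location": f"/queue/{queue_id}"}
    # Synchronous clients block until the kernel has started (elided here).
    return finish_start(queue_id)

def finish_start(queue_id):
    """Called once the kernel process is up: promote pending -> real."""
    rec = PENDING.pop(queue_id)
    rec["status"] = "started"
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = rec
    return 201, {"Location": f"/api/sessions/{session_id}"}
```

The key point is that the second concurrent POST for the same path finds the pending record and does not create a duplicate, whether or not the client opted into async responses.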

@blink1073 blink1073 changed the title [Discuss] RESTful Asynchronous Kernel Starts [DISCUSS] RESTful Asynchronous Kernel Starts Apr 5, 2020
@blink1073
Contributor Author

cc @ellisonbg @jasongrout

@blink1073
Contributor Author

cc @captainsafia @rgbkrk

@kevin-bates
Member

Thank you for starting this, Steve. I think this could be helpful to clients. I had a couple of thoughts and questions:

  1. Adding a status column to reflect that a start request is pending certainly lessens the race condition and sounds like something that should occur anyway.

  2. Session records essentially use the (notebook) path as the "primary key". I'm not that familiar with /api/terminals, but /api/kernels POST requests carry only the kernel name (or nothing at all for the default), so there isn't a "PK" to use unless we, say, let the user specify the kernel id (which I believe is a good idea anyway). Otherwise, I'm not sure how this applies to /api/kernels (and perhaps /api/terminals). Perhaps the async "flag" in the body could serve as a "PK" of sorts.

  3. We should make sure that clients requesting async behavior/responses also tolerate today's responses (for that same request) in case they happen to hit a downlevel server. I'm sure that's the standard approach anyway.

  4. Is it possible to trigger this race condition using Lab or Notebook frontends or are you using a REST client directly to see this?

Let's add this to the server's weekly meeting agenda if you can make it on Thursday.

@captainsafia
Contributor

Thanks for the write-up, Steve. I agree with the general approach. Some thoughts come to mind:

  • Jupyter servers that support the async flag should advertise this support somehow. Perhaps by setting a property in the response to the /api network call.

  • Adding a status column to reflect that a start request is pending certainly lessens the race condition and sounds like something that should occur anyway.

Ideally, whatever status is used to represent a kernel/session that is being launched will be separate from the "starting" status that kernels currently advertise. This will help disambiguate the different states for the end-user and avoid overloading an existing state.

  • Do we have recommendations on how clients should query for the launched kernel/session once they have sent the launch request (polling, server-sent events, etc.)?
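If plain polling were the answer to that last question, the client side could be as simple as the following sketch; the fetch callable, the /queue URL, and the use of 410 for expiry are assumptions drawn from the restcookbook pattern quoted above, not an agreed API:

```python
import time

def poll_for_resource(fetch, queue_url, interval=0.5, timeout=60):
    """Poll a 202 queue URL until the server answers 303 See Other.

    `fetch` is any callable returning (status_code, headers); in a real
    client it would issue an HTTP GET against the server.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, headers = fetch(queue_url)
        if status == 303:                 # resource ready; follow Location
            return headers["Location"]
        if status == 410:                 # temporary resource expired
            raise RuntimeError("pending resource expired")
        time.sleep(interval)              # 202: still starting, try again
    raise TimeoutError(f"kernel not ready after {timeout}s")
```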

@blink1073
Contributor Author

blink1073 commented Apr 6, 2020

Thanks for the feedback @kevin-bates, @captainsafia!

Session records essentially use the (notebook) path as the "primary key".

Not necessarily; you can have multiple sessions with different types or names, for instance. Interestingly, the classic notebook tears down the session instead of changing kernels, but does so synchronously. In jupyterlab/jupyterlab#8174 I implemented an asynchronous version of this so you can switch kernels while a slow kernel is launching.

Is it possible to trigger this race condition using Lab or Notebook frontends

It is possible to trigger it by adding a yield gen.sleep(10) in start_kernel and issuing a POST to /api/sessions before the previous request has completed.
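The race itself can be simulated without a running server; a minimal asyncio sketch (all names hypothetical, with a short sleep standing in for the gen.sleep(10) kernel-start delay):

```python
import asyncio

SESSIONS = {}  # path -> session id, standing in for the sessions table

async def start_session(path, session_id, startup_delay=0.1):
    """Naive handler: check-then-create, with a slow kernel start between."""
    if path in SESSIONS:                 # existence check...
        return SESSIONS[path]
    await asyncio.sleep(startup_delay)   # ...slow kernel start...
    SESSIONS[path] = session_id          # ...save happens only afterwards
    return session_id

async def main():
    # Two concurrent POSTs for the same path both pass the existence
    # check before either saves, so each creates its own session.
    return await asyncio.gather(
        start_session("nb.ipynb", "s1"),
        start_session("nb.ipynb", "s2"),
    )
```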

Jupyter servers that support the async flag should advertise this support somehow. Perhaps by setting a property in the response to the /api network call.

Agreed!

Ideally, whatever status is used to represent a kernel/session that is being launched will be separate from the "starting" status that kernels currently advertise.

We call this "initializing" in JupyterLab currently.

@rgbkrk
Contributor

rgbkrk commented Apr 6, 2020

Thank you for making this issue! I've been working on a kernel that proxies requests while a slow-starting underlying kernel comes up, all while accepting parameters on boot. Delighted to see the conversation going here.

@kevin-bates
Member

Hi Steve, thank you for your response. It reflects my naiveté with the front-end, so I apologize; I was more caught up on the race condition and how to reproduce it than on the issue at hand.

Mostly for my benefit, the idea is that these long-running calls would be mitigated by...

  • detecting the client is requesting this async protocol
  • creating a persistent record to reflect the status of the entity
  • returning a "handle/key" (with 202) to that record which the client uses to check on the status. The "target" entity (or its facilitator) would also use this to update its status
  • updating the entity's startup logic to reflect startup completion (or failure) in the persistent record
  • informing the client of the entity's startup completion/failure by returning information (with 303) with which the client can now use the entity
  • removing the async record (or perhaps not immediately; I see this being useful for HA/DR, and we might want the record to live for the duration of the kernel)

I'm hoping we could make this the normal sequence of events within the server and simply make the response conditional based on whether the client has requested async behavior or not.
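The record lifecycle in those steps could be sketched with a small status enum driving the /queue handler's response; the status names and the 500-on-failure choice are assumptions for illustration, not agreed semantics:

```python
from enum import Enum

class PendingStatus(Enum):
    INITIALIZING = "initializing"   # record created, 202 returned to client
    STARTED = "started"             # startup complete; 303 points at entity
    FAILED = "failed"               # startup failed; surfaced via the record

def check_status(record):
    """What a /queue/<id> handler could return for each record state."""
    if record["status"] is PendingStatus.INITIALIZING:
        return 202, {}                                  # still starting
    if record["status"] is PendingStatus.STARTED:
        return 303, {"Location": record["location"]}    # go use the entity
    return 500, {"error": record.get("reason", "startup failed")}
```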

@blink1073
Contributor Author

I agree that behind the scenes the server should be handling all requests as async, since it may be servicing both types of clients. One thing we have to determine is whether we should have an "exists" check at all, and what it should mean.

If I POST to /api/sessions while a session is being created, and the request has different properties than the last one, should that be a new session? In my mind, POST always creates a new object, and the PK is the id of the object, which the client should be able to provide if desired. The client could consult the list of all pending sessions at /api/sessions/queue to decide whether to create a new session. We could keep sessions in that queue until they are shut down.
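A hypothetical exchange under that model (the client-supplied id, body fields, and the queue endpoint name are illustrative, not settled API):

POST /api/sessions HTTP/1.1
Content-Type: application/json

{"id": "abc123", "path": "nb.ipynb", "kernel": {"name": "python3"}, "async": true}

HTTP/1.1 202 Accepted
Location: /api/sessions/queue/abc123

A client deciding whether to create a new session could first list the pending queue:

GET /api/sessions/queue HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json

[{"id": "abc123", "path": "nb.ipynb", "status": "initializing"}]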

@echarles
Member

In parallel to the discussion on #592, I wonder what changes are needed in the services packages to support the pending/slow-starting kernel proposal.

My feeling, after having worked a bit on some restarting/terminating/... edge cases, is that a state machine (https://en.wikipedia.org/wiki/State_pattern) would really help to deal with the various kernel states. For now, the code is quite ad hoc and does not leverage the power of the well-known state pattern.

This would mean a deep refactoring of the services, which may be out of scope for this requirement, but it is worth at least getting feedback on the idea.

@blink1073
Contributor Author

Closing this, since we have implemented the changes in jupyter_client to enable pending_kernels.
