
[DISCUSS] RESTful Asynchronous Kernel Starts #197

Closed
blink1073 opened this issue Apr 5, 2020 · 10 comments
@blink1073
Contributor

blink1073 commented Apr 5, 2020

Jupyter Notebook has had an implicit assumption that starting a kernel process is instantaneous. This results in a race condition when two requests try to start a session at the same time: the second request checks whether a session exists, sees that it does not (because the first one has not yet been saved), and creates a new one. You then end up with two sessions for the same path.

Additionally, if the kernel takes a very long time to start, there is no feedback to the client. In JupyterLab 1.2, this results in "No Kernel!" and a busy indicator until the kernel is started. In JupyterLab 2.0, you see "Kernel" and an idle indicator until the response is received, at which point you see "Python 3". You can see this behavior by adding a yield gen.sleep(10) in MappingKernelManager.start_kernel. Admittedly, we could hack the display to show a pending name, but a better fix is a clearer server/client contract.

(screenshots: kernel status indicators in JupyterLab 1.2 and 2.0)

I propose that we upgrade the server so that a POST to /api/kernels, /api/sessions, or /api/terminals whose body includes an "async" flag (or an equivalent name) receives a response of the following form, following the restcookbook:

HTTP/1.1 202 Accepted
Location: /queue/12345

Text inline from the cookbook:


Instead of creating the actual resources, create a temporary one. Instead of returning a 201 (Created) HTTP response, you can issue a 202 (Accepted) response code. This informs the client that the request has been accepted and understood by the server, but the resource is not (yet) created. Send the temporary resource inside the Location header.

This location can store information about the status of the actual resource: an ETA on when it will be created, what is currently being done or processed.

When the actual resource has been created, the temporary resources can return a 303 (See other) response. The location header returns the URI to the definitive resource. A client can either DELETE the temporary resource, or the server can expire this resource and return a 410 (Gone) later on.


This way, the client immediately gets feedback that a kernel or a session is being created without having to wait for it. By making the behavior opt-in, clients can migrate to it gradually; at some later date we could require that all clients handle 202 responses.

The server would need to be upgraded with new database entries for pending kernels and sessions, and to consult them when determining whether one exists. It would also need to implement the protocol defined above. Even if a client does not request async behavior, we can still use this new table to fix the race condition.
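To make the proposal concrete, here is a minimal sketch of the server-side flow, assuming an in-memory dict standing in for the proposed pending table; all names (post_session, finish_start, the queue URL shape) are hypothetical, not settled API:

```python
import uuid

# Hypothetical in-memory stand-ins for the proposed database tables.
PENDING = {}   # queue_id -> {"path": ..., "status": ...}
SESSIONS = {}  # session_id -> session record

def post_session(path, use_async=False):
    """Handle POST /api/sessions with the proposed opt-in 'async' flag."""
    # The pending table is consulted even for synchronous clients,
    # which is what closes the duplicate-session race condition.
    for rec in list(PENDING.values()) + list(SESSIONS.values()):
        if rec["path"] == path:
            return 200, rec  # an equivalent session already exists
    queue_id = uuid.uuid4().hex
    PENDING[queue_id] = {"path": path, "status": "initializing"}
    if use_async:
        # 202 Accepted: resource not created yet; poll the Location URL.
        return 202, {"Location": f"/queue/{queue_id}"}
    # Synchronous clients block until the kernel has started (elided here).
    return finish_start(queue_id)

def finish_start(queue_id):
    """Called once the kernel process is up: promote pending -> real."""
    rec = PENDING.pop(queue_id)
    rec["status"] = "started"
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = rec
    return 201, {"Location": f"/api/sessions/{session_id}"}
```

The key point is that the second concurrent POST for the same path finds the pending record and does not create a duplicate, whether or not the client opted into async responses.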

@blink1073 blink1073 changed the title [Discuss] RESTful Asynchronous Kernel Starts [DISCUSS] RESTful Asynchronous Kernel Starts Apr 5, 2020
@blink1073
Contributor Author

cc @ellisonbg @jasongrout

@blink1073
Contributor Author

cc @captainsafia @rgbkrk

@kevin-bates
Member

Thank you for starting this, Steve. I think this could be helpful to clients. I had a couple of thoughts and questions:

  1. Adding a status column to reflect that a start request is pending certainly lessens the race condition and sounds like something that should occur anyway.

  2. Session records essentially use the (notebook) path as the "primary key". I'm not that familiar with /api/terminals, but /api/kernels POST requests carry only the kernel name (or nothing at all for the default), so there isn't a "PK" to use unless we, say, let the user specify the kernel id (which I believe is a good idea anyway). Otherwise, I'm not sure how this applies to /api/kernels (and perhaps /api/terminals). Perhaps the async "flag" in the body could serve as a "PK" of sorts.

  3. We should make sure that clients requesting async behavior/responses also tolerate today's responses (for that same request) in case they happen to hit a downlevel server. I'm sure that's the standard approach anyway.

  4. Is it possible to trigger this race condition using Lab or Notebook frontends or are you using a REST client directly to see this?

Let's add this to the server's weekly meeting agenda if you can make it on Thursday.

@captainsafia
Contributor

Thanks for the write-up, Steve. I agree with the general approach. Some thoughts come to mind:

  • Jupyter servers that support the async flag should advertise this support somehow. Perhaps by setting a property in the response to the /api network call.

  • Adding a status column to reflect that a start request is pending certainly lessens the race condition and sounds like something that should occur anyway.

Ideally, whatever status is used to represent a kernel/session that is being launched will be separate from the "starting" status that kernels currently advertise. This will help disambiguate the different states for the end-user and avoid overloading an existing state.

  • Do we have recommendations on how clients should query for the launched kernel/session once they have sent the launch request (polling, server-sent events, etc.)?
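If plain polling were the answer to that last question, the client side could be as simple as the following sketch; the fetch callable, the /queue URL, and the use of 410 for expiry are assumptions drawn from the restcookbook pattern quoted above, not an agreed API:

```python
import time

def poll_for_resource(fetch, queue_url, interval=0.5, timeout=60):
    """Poll a 202 queue URL until the server answers 303 See Other.

    `fetch` is any callable returning (status_code, headers); in a real
    client it would issue an HTTP GET against the server.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, headers = fetch(queue_url)
        if status == 303:                 # resource ready; follow Location
            return headers["Location"]
        if status == 410:                 # temporary resource expired
            raise RuntimeError("pending resource expired")
        time.sleep(interval)              # 202: still starting, try again
    raise TimeoutError(f"kernel not ready after {timeout}s")
```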

@blink1073
Contributor Author

blink1073 commented Apr 6, 2020

Thanks for the feedback @kevin-bates, @captainsafia!

Session records essentially use the (notebook) path as the "primary key".

Not necessarily; you can have multiple sessions with different types or names, for instance. Interestingly, the classic notebook tears down the session instead of changing kernels, but does so synchronously. In jupyterlab/jupyterlab#8174 I implemented an asynchronous version of this so you can switch kernels while a slow kernel is launching.

Is it possible to trigger this race condition using Lab or Notebook frontends

It is possible to trigger it by adding a yield gen.sleep(10) in start_kernel and issuing a POST to /api/sessions before the previous request has completed.
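The race itself can be simulated without a running server; a minimal asyncio sketch (all names hypothetical, with a short sleep standing in for the gen.sleep(10) kernel-start delay):

```python
import asyncio

SESSIONS = {}  # path -> session id, standing in for the sessions table

async def start_session(path, session_id, startup_delay=0.1):
    """Naive handler: check-then-create, with a slow kernel start between."""
    if path in SESSIONS:                 # existence check...
        return SESSIONS[path]
    await asyncio.sleep(startup_delay)   # ...slow kernel start...
    SESSIONS[path] = session_id          # ...save happens only afterwards
    return session_id

async def main():
    # Two concurrent POSTs for the same path both pass the existence
    # check before either saves, so each creates its own session.
    return await asyncio.gather(
        start_session("nb.ipynb", "s1"),
        start_session("nb.ipynb", "s2"),
    )
```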

Jupyter servers that support the async flag should advertise this support somehow. Perhaps by setting a property in the response to the /api network call.

Agreed!

Ideally, whatever status is used to represent a kernel/session that is being launched will be separate from the "starting" status that kernels currently advertise.

We call this "initializing" in JupyterLab currently.

@rgbkrk
Contributor

rgbkrk commented Apr 6, 2020

Thank you for making this issue! I've been working on a kernel that proxies requests while a slow-starting underlying kernel comes up, all while accepting parameters on boot. Delighted to see the conversation going here.

@kevin-bates
Member

Hi Steve, thank you for your response. It reflects my naiveté with the front-end, so I apologize; I was more caught up on the race condition and how to reproduce it than on the issue at hand.

Mostly for my benefit, the idea is that these long-running calls would be mitigated by...

  • detecting the client is requesting this async protocol
  • creating a persistent record to reflect the status of the entity
  • returning a "handle/key" (with 202) to that record which the client uses to check on the status. The "target" entity (or its facilitator) would also use this to update its status
  • updating the entity's startup logic to reflect startup completion (or failure) in the persistent record
  • informing the client of the entity's startup completion/failure by returning information (with 303) with which the client can now use the entity
  • removing the async record (or perhaps not immediately; I see this being useful for HA/DR, and we might want the record to live for the duration of the kernel)

I'm hoping we could make this the normal sequence of events within the server and simply make the response conditional based on whether the client has requested async behavior or not.
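The record lifecycle in those steps could be sketched with a small status enum driving the /queue handler's response; the status names and the 500-on-failure choice are assumptions for illustration, not agreed semantics:

```python
from enum import Enum

class PendingStatus(Enum):
    INITIALIZING = "initializing"   # record created, 202 returned to client
    STARTED = "started"             # startup complete; 303 points at entity
    FAILED = "failed"               # startup failed; surfaced via the record

def check_status(record):
    """What a /queue/<id> handler could return for each record state."""
    if record["status"] is PendingStatus.INITIALIZING:
        return 202, {}                                  # still starting
    if record["status"] is PendingStatus.STARTED:
        return 303, {"Location": record["location"]}    # go use the entity
    return 500, {"error": record.get("reason", "startup failed")}
```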

@blink1073
Contributor Author

I agree that behind the scenes the server should be handling all requests as async, since it may be servicing both types of clients. One thing we have to determine is whether we should have an "exists" check at all, and what it should mean.

If I POST to /api/sessions while a session is being created, and the request has different properties than the last one, should that be a new session? In my mind, POST always creates a new object, and the PK is the id of the object, which the client should be able to provide if desired. The client could consult the list of all pending sessions at /api/sessions/queue to decide whether to create a new session. We could keep sessions in that queue until they are shut down.
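A hypothetical exchange under that model (the client-supplied id, body fields, and the queue endpoint name are illustrative, not settled API):

POST /api/sessions HTTP/1.1
Content-Type: application/json

{"id": "abc123", "path": "nb.ipynb", "kernel": {"name": "python3"}, "async": true}

HTTP/1.1 202 Accepted
Location: /api/sessions/queue/abc123

A client deciding whether to create a new session could first list the pending queue:

GET /api/sessions/queue HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json

[{"id": "abc123", "path": "nb.ipynb", "status": "initializing"}]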

@echarles
Member

In parallel to the discussion on #592, I wonder what changes are needed in the services packages to support the pending/slow-starting kernel proposal.

My feeling, after having worked a bit on some restarting/terminating/... edge cases, is that a state machine (https://en.wikipedia.org/wiki/State_pattern) would really help to deal with the various kernel states. For now, the code is quite ad hoc and does not leverage the power of the well-known state pattern.

This would mean a deep refactoring of the services, which may be out of scope for this requirement, but it is worth at least getting feedback on the idea.

@blink1073
Contributor Author

Closing this, since we have implemented the changes in jupyter_client to enable pending_kernels.
