Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for async kernel management via subclassing #428

Merged
merged 4 commits into from
Mar 4, 2020

Conversation

kevin-bates
Copy link
Member

@kevin-bates kevin-bates commented Mar 11, 2019

After a couple failed attempts due to API breakage and incompleteness as documented in PR #425, this PR takes an approach using subclassing that @minrk had proposed. As a result, I felt it best to start anew with the discussion and I will close #425, which will still be useful for background. With this approach existing applications can continue using the synchronous classes, while applications that desire async kernel management can do so by setting the kernel_manager_class traitlet to the appropriate asynchronous form. As a result, its a fairly clean separation.

AsyncKernelManager derives from KernelManager and uses async (gen.coroutine) methods where applicable. Similarly, AsyncMultiKernelManger from MultiKernelManager. AsyncIOLoopKernelManager derives from AsyncKernelManger instead of IOLoopKernelManager, otherwise it would not extend AsyncKernelManager. And AsyncIOLoopRestarter dervices from IOLoopRestarter since the base KernelRestarter is an independent class.

I chose to continue the use of @gen.coroutine as opposed to async def for the following reasons...

  1. I'm not sure how it works to cross the two - which would need to happen once Notebook starts using the async classes.
  2. The notebook-based subsystems haven't switched to async/await and felt this could be done at that time.
  3. I'm still (stubbornly) hoping that we can backport this to 5.x for python 2 users. 😄

I'm marking this as a work-in-progress for now until a POC with Notebook and Enterprise Gateway can be completed.

EDIT: Add class hierarchy diagram (darker classes are new).
jupyter_client_kernel_manager_hierarchy

@Carreau
Copy link
Member

Carreau commented Mar 14, 2019

Just subscribing to issue, Slowly catching up.

@kevin-bates kevin-bates changed the title [WIP] Add support for async kernel management via subclassing Add support for async kernel management via subclassing Mar 16, 2019
@kevin-bates
Copy link
Member Author

POC with Notebook and Enterprise Gateway has been completed, so I've removed [WIP]. The test failures don't appear related and I can't exactly figure out what's going on. The tests pass in my env.

@kevin-bates
Copy link
Member Author

@Carreau @takluyver @minrk - I'm hoping at least one of you can make a pass across these changes (and hopefully notebook #4479) sometime in the near future. It would be greatly appreciated.

@Carreau
Copy link
Member

Carreau commented Mar 26, 2019

As this just adds new classes, I'm +1 ,I'd still love other to review. @SylvainCorlay ? Someon from your team ?

I also want to get the test to pass on travis.. they pass on my machine if if we can't figure that out, we can do a known_failure.

@kevin-bates
Copy link
Member Author

Thank you Matthias. Yeah, the tests run locally for me as well. I'm not familiar enough with the test history to know if this might be something odd with Travis infrastructure or not, but am happy to look into things given some guidance.

@Carreau
Copy link
Member

Carreau commented Mar 26, 2019

Yeah this seem to be only on slow systems. The timeout for teardown on PyZmq seem to be 2sec which is too slow for travis. I have the same error locally if I get the timeout down to 0.01s.

See zeromq/pyzmq#1275 as a suggested fix. We should be able to workaround this by patching the teardown method. I'll see if I can get a fix to this branch at some point.

return content

self.addCleanup(kc.stop_channels)
self.addCleanup(km.shutdown_kernel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha ! These two are coroutines, My guess is that addCleanup does not wait for them, and this test, modify a global state which make the session_test fails (there was also a timeout issue).

I'm looking at tornado AsyncTestCase but there does not seem to be an addCleanup for async function. I'll see if we can do something smart, or just to a except with a re-raise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least i think this is it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insight @Carreau. I agree that this might be what's going on (although only km.shutdown_kernel() is a coroutine). Is there a reason that these need to be called as part of the tear down and not just part of the test? (That said, I ran into issues in the signal test calling shutdown kernel.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Carreau - You're right about the culprit being addCleanup(). I went ahead and removed its use for the async tests - replacing with direct calls to shutdown_kernel() and stop_channels() within finally blocks.

N = 5
for i in range(N):
execute("start")
time.sleep(1) # make sure subprocs stay up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sleep seem to also make the test break... not sure why. Tried to replace by yield asyncio.sleep() to no avail.

@kevin-bates
Copy link
Member Author

@Carreau - ping.

@MSeal
Copy link
Contributor

MSeal commented May 13, 2019

@Carreau I'm thinking of having a patch release soon for jupyter_client to pick up the parallel zeromq context fix. This looks a bit meaty of a PR. Should we try to push to get it in and do a minor release or move the other change independently out?

@rgbkrk
Copy link
Member

rgbkrk commented May 28, 2019

Hey I'll start checking this out too. I started to do this by hand for fun yesterday during the kernel workshop then thought maybe there were some good bits in jupyter_client. I'll spend time reviewing here instead of adding my code.

@rgbkrk
Copy link
Member

rgbkrk commented May 28, 2019

This leaves room open later for using zmq.asyncio's Context to set up async/await-able sockets, so it looks like my work was not for naught. 😄

Quite a bit happened while I was on leave, it was good to read the discussion in #425. Thanks for making these PRs @kevin-bates and @minrk + @Carreau for creating plans for each coming major release while beginning async work. ❤️

@meeseeksmachine
Copy link

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/scalable-enterprise-gateway/2014/2

@SylvainCorlay
Copy link
Member

@maartenbreddels @jtpio

@MSeal
Copy link
Contributor

MSeal commented Feb 22, 2020

@kevin-bates @davidbrochart How does this mesh with #506 that got merged? I'm about to do a 6.0 release this weekend with that code which was merged to master with a python 3 only constraint (we're doing that for every jupyter project atm). I wasn't super involved in that PR or this one, though it seems like we could move forward with current master and improve abstractions taken from here in a 6.1 release later if it makes sense.

@kevin-bates
Copy link
Member Author

I think there would be good synergy between the two. This PR is purely targeting the kernel's lifecycle management - addressing the issue of blocking kernel starts, while #506, if I understand correctly, is more about making the communication with the kernel asynchronous via kernel-client. Is that your understanding as well @davidbrochart?

I think using a minor release to introduce this change is probably the right move from the standpoint of not introducing too many changes at once. What is good about this PR is that it simply exposes a sibling kernel manager class that has async support. This way applications can decide when to switch and its merge won't (well, shouldn't) affect the status quo.

@SylvainCorlay
Copy link
Member

Thanks @kevin-bates. Please ping me when you update it. We are definitely interested in this.

Introduced async subclasses that derive from synchronous classes,
overriding appropriate methods with async support.
@kevin-bates
Copy link
Member Author

@SylvainCorlay @davidbrochart - I've made the following changes...

  1. Converted @gen.coroutine/yield to async/await (which wasn't as easy as it sounds. 😄)
  2. Updated the async tests to use David's Async Client where applicable.
  3. I found I needed to update some of the async threading tests to create an event loop, otherwise, they side-affected some of the downstream tests.
  4. There were duplicated test names of 'test_start_sequence_tcp_kernels' where the second meant to be named with 'test_start_sequence_ipc_kernels' so that change was made.

Thank you for your review.

jupyter_client/ioloop/manager.py Outdated Show resolved Hide resolved
jupyter_client/ioloop/restarter.py Outdated Show resolved Hide resolved
jupyter_client/ioloop/restarter.py Outdated Show resolved Hide resolved
jupyter_client/manager.py Outdated Show resolved Hide resolved

# Wait until the kernel terminates.
try:
await asyncio.wait_for(gen.maybe_future(self.kernel.wait()), timeout=5.0)
Copy link
Member

@davidbrochart davidbrochart Mar 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.kernel.wait() is a blocking function (it waits for the process to finish) so gen.maybe_future won't make it async. It means the async nature of this function stack is broken. I think we should have a async_wait function that would poll the status of the process periodically, and this line would be e.g.:

await asyncio.wait_for(await self.kernel_async_wait(), timeout=5.0)

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - good point. I think kernel_async_wait() would look something like this: https://github.com/takluyver/jupyter_kernel_mgmt/blob/master/jupyter_kernel_mgmt/subproc/manager.py#L56-L66

Does that look right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sure does look a lot like finish_shutdown() so I'll look into refactoring these areas a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - good point. I think kernel_async_wait() would look something like this: https://github.com/takluyver/jupyter_kernel_mgmt/blob/master/jupyter_kernel_mgmt/subproc/manager.py#L56-L66

Does that look right?

Indeed. They even use asyncio.subprocess.Process, which would be ideal, but I'm not sure we can have it in jupyter_client.

Copy link
Member Author

@kevin-bates kevin-bates Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - the hierarchy of things gets in the way for that.

Btw, I'm finding I need to indicate that the _async_wait() function is done...

    async def _async_wait(self, pollinterval=0.1):
        # Use busy loop at 100ms intervals, polling until the process is
        # not alive.  If we find the process is no longer alive, complete
        # its cleanup via the blocking wait().  Callers are responsible for
        # issuing calls to wait() using a timeout (see _kill_kernel()).
        while self.is_alive():
            await asyncio.sleep(pollinterval)

        done = asyncio.Future()
        done.set_result(None)
        return done

otherwise I get failures stemming from its introduction since asyncio.wait_for() requires a future. This seems correct to me, but I honestly lack experience in this area. Is there a better way to do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asyncio.wait_for accepts an awaitable - a coroutine, a Task or a Future. _async_wait doesn't have to return a result:

    async def _async_wait(self, pollinterval=0.1):
        while self.is_alive():
            await asyncio.sleep(pollinterval)

But you need to catch the timeout:

try:
    await asyncio.wait_for(self._async_wait(), timeout=5.0)
except asyncio.TimeoutError:
    # Wait timed out, just log warning but continue - not much more we can do.
    self.log.warning("Wait for final termination of kernel timed out - continuing...")
    pass

That is what your code was doing already. I realize my comment above was wrong, there is no await in the wait_for call.

jupyter_client/manager.py Outdated Show resolved Hide resolved
Also added the cache_ports logic that existed in the sync multiKernelManager.
@@ -9,17 +9,20 @@

import warnings

from tornado import gen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Comment on lines +517 to +523
async def _launch_kernel(self, kernel_cmd, **kw):
"""actually launch the kernel

override in a subclass to launch kernel subprocesses differently
"""
res = launch_kernel(kernel_cmd, **kw)
return res
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing async about this function, it doesn't await anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review David, but I don't agree with this comment (and the others below) regarding the marking of a method with async that doesn't await. While it's true that this implementation doesn't await. Subclasses, especially those managing remote kernels, certainly will, since the times to start, interrupt, signal, etc. take an order of magnitude longer (perhaps more) to complete.

Because callers won't know they're talking to a kernel manager that is dealing with remote kernels (nor should they), I believe the async markers need to remain so the caller is always performing await start_kernel(), for example.

Similarly, I think is_alive() needs to be added back into AsyncKernelManager since poll() is an RPC in some remote kernels (and a REST call in others). And, by extension, I believe AsyncKernelClient needs an async is_alive() method since it will call on the kernel manager's is_alive() and we'll need to check the class to conditionally apply the await. I have such an incantation working right now, but I wanted to get your opinion before its push.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You convinced me @kevin-bates! Thanks for putting this work into perspective, I had not thought of the remote kernel use case. It makes more sense now, so please discard my previous comments.

Comment on lines +525 to +542
async def start_kernel(self, **kw):
"""Starts a kernel in a separate process in an asynchronous manner.

If random ports (port=0) are being used, this method must be called
before the channels are created.

Parameters
----------
`**kw` : optional
keyword arguments that are passed down to build the kernel_cmd
and launching the kernel (e.g. Popen kwargs).
"""
kernel_cmd, kw = self.pre_start_kernel(**kw)

# launch the kernel subprocess
self.log.debug("Starting kernel (async): %s", kernel_cmd)
self.kernel = await self._launch_kernel(kernel_cmd, **kw)
self.post_start_kernel(**kw)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since _launch_kernel should not be a coroutine, neither should start_kernel.

Comment on lines +676 to +716
async def interrupt_kernel(self):
"""Interrupts the kernel by sending it a signal.

Unlike ``signal_kernel``, this operation is well supported on all
platforms.
"""
if self.has_kernel:
interrupt_mode = self.kernel_spec.interrupt_mode
if interrupt_mode == 'signal':
if sys.platform == 'win32':
from .win_interrupt import send_interrupt
send_interrupt(self.kernel.win32_interrupt_event)
else:
await self.signal_kernel(signal.SIGINT)

elif interrupt_mode == 'message':
msg = self.session.msg("interrupt_request", content={})
self._connect_control_socket()
self.session.send(self._control_socket, msg)
else:
raise RuntimeError("Cannot interrupt kernel. No kernel is running!")

async def signal_kernel(self, signum):
"""Sends a signal to the process group of the kernel (this
usually includes the kernel and any subprocesses spawned by
the kernel).

Note that since only SIGTERM is supported on Windows, this function is
only useful on Unix systems.
"""
if self.has_kernel:
if hasattr(os, "getpgid") and hasattr(os, "killpg"):
try:
pgid = os.getpgid(self.kernel.pid)
os.killpg(pgid, signum)
return
except OSError:
pass
self.kernel.send_signal(signum)
else:
raise RuntimeError("Cannot signal kernel. No kernel is running!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interrupt_kernel and signal_kernel should not be coroutines, they are not doing anything blocking.

# TerminateProcess() on Win32).
try:
if hasattr(signal, 'SIGKILL'):
await self.signal_kernel(signal.SIGKILL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to await here, signal_kernel should not be a coroutine.

@davidbrochart
Copy link
Member

Just a general question, any reason we have an ioloop directory or should it be under asynchronous?

@kevin-bates
Copy link
Member Author

I guess I don't understand your question. ioloop contains class definitions for both sync and async classes, just like manager.py and multikernelmanager.py. Are you suggesting we move all of these async classes into the asynchronous sub-directory?

@davidbrochart
Copy link
Member

You're right, it doesn't make a lot of sense.
I think this is ready to be merged. Thanks a lot for all the effort you put into this PR!
Comments anybody?

@kevin-bates
Copy link
Member Author

@davidbrochart - thank you for your review - it's much appreciated. I'm really stoked that this has been revived (since there are two other layers above this with dormant async management PRs 😄 ).

@davidbrochart
Copy link
Member

It may be out of this PR's scope, but how would it play with Jupyter server's kernel manager? Will we need to write a MappingAsyncKernelManager, or just rewrite https://github.com/jupyter/jupyter_server/blob/master/jupyter_server/services/kernels/kernelmanager.py so that MappingKernelManager inherits from AsyncMultiKernelManager?

@kevin-bates
Copy link
Member Author

The plan is for juptyer_server to adopt the new kernel management framework via this PR. This transition is targetting Server's 0.4 release. Since kernel management in JKM is only async and the library no longer provides a MultiKernelManager, Server's MappingKernelManager will be straight async. What also needs to happen is applying your client async changes to JKM (please!).

However, the dual-mode behavior is exactly what I did for Notebook and I'm hoping once jupyter_client has a carrier for this PR, then Notebook PR 4479 can be similarly revived. That will then unblock the Enterprise Gateway PR, which will set all the right classes to make the full stack async and something gateway users desperately need!

@davidbrochart
Copy link
Member

Thanks for the feedback, I have a lot to catch up!

@kevin-bates
Copy link
Member Author

No worries. If you're interested in the server efforts, there's a weekly dev meeting on Thursdays: https://jupyter.readthedocs.io/en/latest/community/content-community.html#monthly-meetings

@davidbrochart davidbrochart merged commit 99bb826 into jupyter:master Mar 4, 2020
@davidbrochart
Copy link
Member

@kevin-bates I'm not sure I did the right thing, the commit history shows I am the author of this commit, I don't see you name. Is it because I squashed it?

@SylvainCorlay
Copy link
Member

The plan is for juptyer_server to adopt the new kernel management framework via this PR. This transition is targetting Server's 0.4 release. Since kernel management in JKM is only async and the library no longer provides a MultiKernelManager, Server's MappingKernelManager will be straight async. What also needs to happen is applying your client async changes to JKM (please!).

@kevin-bates I think we need to have a plan for jupyter-server before the move to jupyter_kernel_management goes in. Moving jupyterlab and notebook to server, and also dropping jupyter_client all at once seems very disruptive to me, and should probably be done in two stages.

We have no feedback yet on jupyter_kernel_management, and I would like us to be cautious.

@kevin-bates
Copy link
Member Author

@davidbrochart - Regarding your comment: #428 (comment). No worries - this just means you "touched it last"! 😄

Just kidding. Normally I wouldn't care too much, but this particular effort was significant and has been more than a year coming, so I think this should be corrected. I don't know how this happened, but I would think that commit could be amended with something like this:

git commit --amend --author="Kevin Bates <kbates4@gmail.com>"

cc: @MSeal, @SylvainCorlay for their git fu.

@SylvainCorlay
Copy link
Member

@kevin-bates no worries I will fix the git history first thing tomorrow to include your commits.

@MSeal
Copy link
Contributor

MSeal commented Mar 5, 2020

Thanks for tackling this PR against everyone. It's nice seeing folks helping get these larger changes merged.

If any additional git fu is needed let me know, I can probably help too.

@SylvainCorlay SylvainCorlay mentioned this pull request Mar 5, 2020
@SylvainCorlay
Copy link
Member

@kevin-bates you should now be credited for your commits in master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants