-
Notifications
You must be signed in to change notification settings - Fork 287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for async kernel management via subclassing #428
Conversation
Just subscribing to issue, Slowly catching up. |
POC with Notebook and Enterprise Gateway has been completed, so I've removed [WIP]. The test failures don't appear related and I can't exactly figure out what's going on. The tests pass in my env. |
@Carreau @takluyver @minrk - I'm hoping at least one of you can make a pass across these changes (and hopefully notebook #4479) sometime in the near future. It would be greatly appreciated. |
As this just adds new classes, I'm +1 ,I'd still love other to review. @SylvainCorlay ? Someon from your team ? I also want to get the test to pass on travis.. they pass on my machine if if we can't figure that out, we can do a known_failure. |
Thank you Matthias. Yeah, the tests run locally for me as well. I'm not familiar enough with the test history to know if this might be something odd with Travis infrastructure or not, but am happy to look into things given some guidance. |
Yeah this seem to be only on slow systems. The timeout for teardown on PyZmq seem to be 2sec which is too slow for travis. I have the same error locally if I get the timeout down to 0.01s. See zeromq/pyzmq#1275 as a suggested fix. We should be able to workaround this by patching the teardown method. I'll see if I can get a fix to this branch at some point. |
return content | ||
|
||
self.addCleanup(kc.stop_channels) | ||
self.addCleanup(km.shutdown_kernel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha ! These two are coroutines, My guess is that addCleanup does not wait for them, and this test, modify a global state which make the session_test fails (there was also a timeout issue).
I'm looking at tornado AsyncTestCase but there does not seem to be an addCleanup for async function. I'll see if we can do something smart, or just to a except with a re-raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least i think this is it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insight @Carreau. I agree that this might be what's going on (although only km.shutdown_kernel()
is a coroutine). Is there a reason that these need to be called as part of the tear down and not just part of the test? (That said, I ran into issues in the signal test calling shutdown kernel.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Carreau - You're right about the culprit being addCleanup()
. I went ahead and removed its use for the async tests - replacing with direct calls to shutdown_kernel()
and stop_channels()
within finally blocks.
N = 5 | ||
for i in range(N): | ||
execute("start") | ||
time.sleep(1) # make sure subprocs stay up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sleep seem to also make the test break... not sure why. Tried to replace by yield asyncio.sleep()
to no avail.
9817e71
to
53a8d06
Compare
@Carreau - ping. |
@Carreau I'm thinking of having a patch release soon for jupyter_client to pick up the parallel zeromq context fix. This looks a bit meaty of a PR. Should we try to push to get it in and do a minor release or move the other change independently out? |
Hey I'll start checking this out too. I started to do this by hand for fun yesterday during the kernel workshop then thought maybe there were some good bits in |
This leaves room open later for using Quite a bit happened while I was on leave, it was good to read the discussion in #425. Thanks for making these PRs @kevin-bates and @minrk + @Carreau for creating plans for each coming major release while beginning async work. ❤️ |
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/scalable-enterprise-gateway/2014/2 |
b1fe8e4
to
3e8ee4a
Compare
8433088
to
7797238
Compare
@kevin-bates @davidbrochart How does this mesh with #506 that got merged? I'm about to do a 6.0 release this weekend with that code which was merged to master with a python 3 only constraint (we're doing that for every jupyter project atm). I wasn't super involved in that PR or this one, though it seems like we could move forward with current master and improve abstractions taken from here in a 6.1 release later if it makes sense. |
I think there would be good synergy between the two. This PR is purely targeting the kernel's lifecycle management - addressing the issue of blocking kernel starts, while #506, if I understand correctly, is more about making the communication with the kernel asynchronous via kernel-client. Is that your understanding as well @davidbrochart? I think using a minor release to introduce this change is probably the right move from the standpoint of not introducing too many changes at once. What is good about this PR is that it simply exposes a sibling kernel manager class that has async support. This way applications can decide when to switch and its merge won't (well, shouldn't) affect the status quo. |
Thanks @kevin-bates. Please ping me when you update it. We are definitely interested in this. |
Introduced async subclasses that derive from synchronous classes, overriding appropriate methods with async support.
ff60c50
to
f6f8696
Compare
@SylvainCorlay @davidbrochart - I've made the following changes...
Thank you for your review. |
jupyter_client/manager.py
Outdated
|
||
# Wait until the kernel terminates. | ||
try: | ||
await asyncio.wait_for(gen.maybe_future(self.kernel.wait()), timeout=5.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.kernel.wait()
is a blocking function (it waits for the process to finish) so gen.maybe_future
won't make it async. It means the async nature of this function stack is broken. I think we should have a async_wait
function that would poll the status of the process periodically, and this line would be e.g.:
await asyncio.wait_for(await self.kernel_async_wait(), timeout=5.0)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - good point. I think kernel_async_wait()
would look something like this: https://github.com/takluyver/jupyter_kernel_mgmt/blob/master/jupyter_kernel_mgmt/subproc/manager.py#L56-L66
Does that look right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sure does look a lot like finish_shutdown()
so I'll look into refactoring these areas a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - good point. I think
kernel_async_wait()
would look something like this: https://github.com/takluyver/jupyter_kernel_mgmt/blob/master/jupyter_kernel_mgmt/subproc/manager.py#L56-L66Does that look right?
Indeed. They even use asyncio.subprocess.Process
, which would be ideal, but I'm not sure we can have it in jupyter_client
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - the hierarchy of things gets in the way for that.
Btw, I'm finding I need to indicate that the _async_wait() function is done...
async def _async_wait(self, pollinterval=0.1):
# Use busy loop at 100ms intervals, polling until the process is
# not alive. If we find the process is no longer alive, complete
# its cleanup via the blocking wait(). Callers are responsible for
# issuing calls to wait() using a timeout (see _kill_kernel()).
while self.is_alive():
await asyncio.sleep(pollinterval)
done = asyncio.Future()
done.set_result(None)
return done
otherwise I get failures stemming from its introduction since asyncio.wait_for()
requires a future. This seems correct to me, but I honestly lack experience in this area. Is there a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asyncio.wait_for
accepts an awaitable - a coroutine, a Task or a Future. _async_wait
doesn't have to return a result:
async def _async_wait(self, pollinterval=0.1):
while self.is_alive():
await asyncio.sleep(pollinterval)
But you need to catch the timeout:
try:
await asyncio.wait_for(self._async_wait(), timeout=5.0)
except asyncio.TimeoutError:
# Wait timed out, just log warning but continue - not much more we can do.
self.log.warning("Wait for final termination of kernel timed out - continuing...")
pass
That is what your code was doing already. I realize my comment above was wrong, there is no await
in the wait_for
call.
Also added the cache_ports logic that existed in the sync multiKernelManager.
jupyter_client/ioloop/restarter.py
Outdated
@@ -9,17 +9,20 @@ | |||
|
|||
import warnings | |||
|
|||
from tornado import gen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
async def _launch_kernel(self, kernel_cmd, **kw): | ||
"""actually launch the kernel | ||
|
||
override in a subclass to launch kernel subprocesses differently | ||
""" | ||
res = launch_kernel(kernel_cmd, **kw) | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing async
about this function, it doesn't await
anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review David, but I don't agree with this comment (and the others below) regarding the marking of a method with async
that doesn't await
. While it's true that this implementation doesn't await. Subclasses, especially those managing remote kernels, certainly will, since the times to start, interrupt, signal, etc. take an order of magnitude longer (perhaps more) to complete.
Because callers won't know they're talking to a kernel manager that is dealing with remote kernels (nor should they), I believe the async markers need to remain so the caller is always performing await start_kernel()
, for example.
Similarly, I think is_alive()
needs to be added back into AsyncKernelManager since poll()
is an RPC in some remote kernels (and a REST call in others). And, by extension, I believe AsyncKernelClient needs an async is_alive()
method since it will call on the kernel manager's is_alive()
and we'll need to check the class to conditionally apply the await
. I have such an incantation working right now, but I wanted to get your opinion before its push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You convinced me @kevin-bates! Thanks for putting this work into perspective, I had not thought of the remote kernel use case. It makes more sense now, so please discard my previous comments.
async def start_kernel(self, **kw): | ||
"""Starts a kernel in a separate process in an asynchronous manner. | ||
|
||
If random ports (port=0) are being used, this method must be called | ||
before the channels are created. | ||
|
||
Parameters | ||
---------- | ||
`**kw` : optional | ||
keyword arguments that are passed down to build the kernel_cmd | ||
and launching the kernel (e.g. Popen kwargs). | ||
""" | ||
kernel_cmd, kw = self.pre_start_kernel(**kw) | ||
|
||
# launch the kernel subprocess | ||
self.log.debug("Starting kernel (async): %s", kernel_cmd) | ||
self.kernel = await self._launch_kernel(kernel_cmd, **kw) | ||
self.post_start_kernel(**kw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since _launch_kernel
should not be a coroutine, neither should start_kernel
.
async def interrupt_kernel(self): | ||
"""Interrupts the kernel by sending it a signal. | ||
|
||
Unlike ``signal_kernel``, this operation is well supported on all | ||
platforms. | ||
""" | ||
if self.has_kernel: | ||
interrupt_mode = self.kernel_spec.interrupt_mode | ||
if interrupt_mode == 'signal': | ||
if sys.platform == 'win32': | ||
from .win_interrupt import send_interrupt | ||
send_interrupt(self.kernel.win32_interrupt_event) | ||
else: | ||
await self.signal_kernel(signal.SIGINT) | ||
|
||
elif interrupt_mode == 'message': | ||
msg = self.session.msg("interrupt_request", content={}) | ||
self._connect_control_socket() | ||
self.session.send(self._control_socket, msg) | ||
else: | ||
raise RuntimeError("Cannot interrupt kernel. No kernel is running!") | ||
|
||
async def signal_kernel(self, signum): | ||
"""Sends a signal to the process group of the kernel (this | ||
usually includes the kernel and any subprocesses spawned by | ||
the kernel). | ||
|
||
Note that since only SIGTERM is supported on Windows, this function is | ||
only useful on Unix systems. | ||
""" | ||
if self.has_kernel: | ||
if hasattr(os, "getpgid") and hasattr(os, "killpg"): | ||
try: | ||
pgid = os.getpgid(self.kernel.pid) | ||
os.killpg(pgid, signum) | ||
return | ||
except OSError: | ||
pass | ||
self.kernel.send_signal(signum) | ||
else: | ||
raise RuntimeError("Cannot signal kernel. No kernel is running!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interrupt_kernel
and signal_kernel
should not be coroutines, they are not doing anything blocking.
# TerminateProcess() on Win32). | ||
try: | ||
if hasattr(signal, 'SIGKILL'): | ||
await self.signal_kernel(signal.SIGKILL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to await here, signal_kernel
should not be a coroutine.
Just a general question, any reason we have an |
I guess I don't understand your question. |
You're right, it doesn't make a lot of sense. |
@davidbrochart - thank you for your review - it's much appreciated. I'm really stoked that this has been revived (since there are two other layers above this with dormant async management PRs 😄 ). |
It may be out of this PR's scope, but how would it play with Jupyter server's kernel manager? Will we need to write a |
The plan is for juptyer_server to adopt the new kernel management framework via this PR. This transition is targetting Server's 0.4 release. Since kernel management in JKM is only async and the library no longer provides a However, the dual-mode behavior is exactly what I did for Notebook and I'm hoping once jupyter_client has a carrier for this PR, then Notebook PR 4479 can be similarly revived. That will then unblock the Enterprise Gateway PR, which will set all the right classes to make the full stack async and something gateway users desperately need! |
Thanks for the feedback, I have a lot to catch up! |
No worries. If you're interested in the server efforts, there's a weekly dev meeting on Thursdays: https://jupyter.readthedocs.io/en/latest/community/content-community.html#monthly-meetings |
@kevin-bates I'm not sure I did the right thing, the commit history shows I am the author of this commit, I don't see you name. Is it because I squashed it? |
@kevin-bates I think we need to have a plan for jupyter-server before the move to jupyter_kernel_management goes in. Moving jupyterlab and notebook to server, and also dropping jupyter_client all at once seems very disruptive to me, and should probably be done in two stages. We have no feedback yet on jupyter_kernel_management, and I would like us to be cautious. |
@davidbrochart - Regarding your comment: #428 (comment). No worries - this just means you "touched it last"! 😄 Just kidding. Normally I wouldn't care too much, but this particular effort was significant and has been more than a year coming, so I think this should be corrected. I don't know how this happened, but I would think that commit could be amended with something like this:
cc: @MSeal, @SylvainCorlay for their git fu. |
@kevin-bates no worries I will fix the git history first thing tomorrow to include your commits. |
Thanks for tackling this PR against everyone. It's nice seeing folks helping get these larger changes merged. If any additional git fu is needed let me know, I can probably help too. |
@kevin-bates you should now be credited for your commits in master! |
After a couple failed attempts due to API breakage and incompleteness as documented in PR #425, this PR takes an approach using subclassing that @minrk had proposed. As a result, I felt it best to start anew with the discussion and I will close #425, which will still be useful for background. With this approach existing applications can continue using the synchronous classes, while applications that desire async kernel management can do so by setting the
kernel_manager_class
traitlet to the appropriate asynchronous form. As a result, its a fairly clean separation.AsyncKernelManager
derives fromKernelManager
and uses async (gen.coroutine) methods where applicable. Similarly,AsyncMultiKernelManger
fromMultiKernelManager
.AsyncIOLoopKernelManager
derives fromAsyncKernelManger
instead ofIOLoopKernelManager
, otherwise it would not extendAsyncKernelManager
. AndAsyncIOLoopRestarter
dervices fromIOLoopRestarter
since the baseKernelRestarter
is an independent class.I chose to continue the use of
@gen.coroutine
as opposed toasync def
for the following reasons...async/await
and felt this could be done at that time.I'm marking this as a work-in-progress for now until a POC with Notebook and Enterprise Gateway can be completed.
EDIT: Add class hierarchy diagram (darker classes are new).