EG becomes unusable after receiving FD added twice error due to a race condition! #1051
Comments
Hi @rahul26goyal - I am not able to reproduce this issue and would like you to try to reproduce it on EG 2.6, if that's possible. I will proceed with the PR's review anyway and we can decide if its merge is appropriate at that time. Thanks for your understanding.
Thanks for trying this out @kevin-bates. I will try to reproduce on our setup with EG 2.6 and get back.
Resolved via #1054.
Description
This issue is related to #1047
We are seeing a ZMQStream FD leak on the Jupyter Enterprise Gateway server while running kernels on Kubernetes. We are running native Spark Kubernetes kernels.
Based on the analysis below, the leak happens at the Jupyter application layer, which integrates with the Tornado IOLoop (add_handler) to manage the ZMQSocket streams.
At a high level, the leak occurs when a kernel restart races with a shutdown of the same kernel, which leaks an FD for 1 minute (the kernel_info_timeout timeout).
This issue is seen only on remote kernels, not on local kernels, because of various differences that come with remote kernels.
Sample exception trace from one such occurrence!
more on this below!
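To make the failure mode easier to picture, here is a toy model of one plausible way a leaked registration could surface as an "FD added twice" error. Everything in it is an assumption for illustration: ToyLoop is a hypothetical stand-in for an event loop's fd-to-handler table, the fd number 42 is made up, and the premise that the OS reuses the fd number while the stale handler is still registered is a guess at the mechanism, not something taken from the EG or Tornado code.

```python
class ToyLoop:
    """Hypothetical stand-in for an event loop's fd-to-handler table."""

    def __init__(self):
        self.handlers = {}

    def add_handler(self, fd, handler):
        # Refuse a second registration for an fd that is still tracked.
        if fd in self.handlers:
            raise ValueError(f"fd {fd} added twice")
        self.handlers[fd] = handler

    def remove_handler(self, fd):
        self.handlers.pop(fd, None)


def simulate(close_stream_on_shutdown):
    loop = ToyLoop()
    fd = 42  # made-up fd of the kernel_info socket opened by the restart
    loop.add_handler(fd, "kernel_info stream (restart)")

    # Shutdown arrives while the restart is still waiting for kernel_info.
    if close_stream_on_shutdown:
        loop.remove_handler(fd)

    # Assumption: the same fd number is later handed to a new socket.
    try:
        loop.add_handler(fd, "stream for a new kernel")
        return "ok"
    except ValueError as exc:
        return str(exc)
```

In this model, simulate(False) leaves the stale handler in place and the second registration fails, while simulate(True) cleans up on shutdown and the fd can be reused.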
Reproduce
Since there is a race condition involved here, the scenario is not easy to reproduce. But we have been able to reproduce this issue multiple times by doing the following steps:
Diagnosis of the issue
Given below are the log lines from one such scenario, which we captured and analyzed in depth. To do that, we had to add new log lines in multiple places across different code packages, so you will see log lines which may not look familiar! 😜
Comments for each event are available inline in the logs below.
I have also removed some log lines which were not relevant to the issue.
Expected behavior
There should not be any FD leak, but one occurs on EG because of the remote nature of the kernel and the extra time needed to fetch kernel_info from the remote kernel. On local kernels this issue is rare, probably because the response arrives immediately and the race condition does not occur.
We need to handle this race condition between kernel restart and shutdown requests on EG.
A few thoughts at a high level:
- Use the kernel.restarting field, which is already set during kernel restart on EG.
- Track the kernel_info_request connection socket within the kernel object and close it when executing a shutdown request. This probably requires a change in notebook / jupyter_server, where MappingKernelManager is present.
Open to other suggestions as well, and I am interested in contributing the fix back.
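As a rough illustration of the two ideas above, the sketch below models them with plain asyncio. It is not the MappingKernelManager API: ToyKernel, its fields (restarting, info_wait, open_streams, idle), and the restart, shutdown, and demo functions are all hypothetical names, and an asyncio.sleep stands in for waiting on the remote kernel_info reply.

```python
import asyncio


class ToyKernel:
    """Hypothetical kernel record; not the real MappingKernelManager API."""

    def __init__(self):
        self.restarting = False   # mirrors the flag mentioned in the issue
        self.info_wait = None     # tracked wait for the kernel_info reply
        self.open_streams = 0     # stands in for registered ZMQ stream FDs
        self.idle = asyncio.Event()
        self.idle.set()


async def restart(kernel, reply_delay):
    kernel.restarting = True
    kernel.idle.clear()
    kernel.open_streams += 1
    # A sleep stands in for waiting on the remote kernel_info reply.
    kernel.info_wait = asyncio.create_task(asyncio.sleep(reply_delay))
    try:
        await kernel.info_wait
        return "restarted"
    except asyncio.CancelledError:
        # In this sketch only shutdown() cancels the wait.
        return "aborted"
    finally:
        kernel.open_streams -= 1  # the stream is closed on every path
        kernel.info_wait = None
        kernel.restarting = False
        kernel.idle.set()


async def shutdown(kernel, cancel_pending):
    if kernel.restarting:
        if cancel_pending and kernel.info_wait is not None:
            # Second idea: close the tracked kernel_info request right away.
            kernel.info_wait.cancel()
        # First idea: do not proceed with shutdown while a restart is running.
        await kernel.idle.wait()
    return kernel.open_streams  # streams still open once shutdown proceeds


async def demo(cancel_pending):
    kernel = ToyKernel()
    restart_task = asyncio.create_task(restart(kernel, reply_delay=0.05))
    await asyncio.sleep(0)  # let the restart begin before shutting down
    leaked = await shutdown(kernel, cancel_pending)
    return await restart_task, leaked
```

Running asyncio.run(demo(True)) cancels the pending wait so shutdown proceeds immediately, while asyncio.run(demo(False)) makes shutdown wait for the restart to finish. In both cases no stream is left open. Which trade-off fits EG (delaying shutdown versus aborting the restart) is a design choice this sketch does not settle.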
Context
Thanks!