Tests which fail in multiprocessing contexts #1018
Conversation
Thanks for posting. Let's leave this up while the jupyter_client change gets reviewed, and then re-evaluate whether the tests need to be made quicker to merge or not.
Apologies for being disconnected for a while. This was the first weekend I was around to go through the past couple of weeks' open source PRs (it's been a busy month).
Do you have any suggestions for debugging test_parallel_fork_notebooks further? I could probably take a deeper dive on Monday into what's happening here.
assert captured.err == ""


@pytest.mark.xfail
def test_parallel_fork_notebooks(capfd):
I can't get this test to pass on my Ubuntu 18.04 machine. It will always hang if the thread is still running when multiprocessing launches.
I tried with and without the zeromq cleanup, and with and without the pending jupyter_client release changes. I haven't dug in deeper as to why this still hangs.
input_file,
opts,
res,
functools.partial(label_parallel_notebook, label=label),
@mpacer I wanted to keep the run_notebook helper from getting more complex. Could we check Parallel Execute A.ipynb and Parallel Execute B.ipynb into the PR to avoid adding this?
Sorry for not reaching out about this directly. What I would recommend doing is the following.
Instead of
threads = [
    threading.Thread(
        target=run_notebook,
        args=(
            input_file,
            opts,
            res,
            functools.partial(label_parallel_notebook, label=label),
        ),
    )
    for label in ("A", "B")
]
go with
input_files = [label_parallel_notebook(input_file, label=label) for label in ("A", "B")]
threads = [
    threading.Thread(
        target=run_notebook,
        args=(
            this_input_file,
            opts,
            res,
        ),
    )
    for this_input_file in input_files
]
That way you don't need to change the signature of run_notebook. This should work because preprocessing the notebook to add a cell doesn't need to happen concurrently, but the notebooks themselves will still be run concurrently.
# Destroy the context - if you don't do this, the context
# will survive across the fork, and then fail to start properly.
import zmq
zmq.Context.instance().destroy()
FYI, I tested this with the new jupyter_client. Could we add a TODO to delete this when jupyter_client>=5.2.4 releases?
I haven't gotten an actual stack trace here, but in tests I have now seen failures which indicate there might be another issue on the kernel side with port management.
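To make the intent of the cleanup in the diff above concrete, here is a minimal sketch (not the PR's test code) of the destroy-before-fork pattern: pyzmq's singleton context is torn down in the parent so the forked child builds a fresh one on first use.

```python
import multiprocessing
import zmq

def child():
    # With the parent's singleton destroyed before the fork, this call
    # creates a fresh context in the child instead of inheriting a dead one.
    ctx = zmq.Context.instance()
    ctx.term()

if __name__ == "__main__":
    # Simulate the parent having used zmq, as a kernel launch would.
    zmq.Context.instance()

    # Destroy the context - if you don't, it survives across the fork
    # and then fails to start properly in the child.
    zmq.Context.instance().destroy()

    p = multiprocessing.Process(target=child)
    p.start()
    p.join()
```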
@@ -69,14 +72,17 @@ def build_preprocessor(opts):
     return preprocessor


-def run_notebook(filename, opts, resources):
+def run_notebook(filename, opts, resources, preprocess_notebook=None):
As described below, you can avoid adding this complication, which also introduces an additional meaning for what it means to preprocess a notebook.
}
)

nb.cells.insert(1, label_cell)
I would make a deepcopy rather than mutating the notebook that was passed in. Suggested change:

nb = deepcopy(nb)
nb.cells.insert(1, label_cell)
Used for parallel testing to label two notebooks which are run simultaneously.
"""
label_cell = nbformat.NotebookNode(
Rather than creating a NotebookNode directly, I would use the new_code_cell function from nbformat.v4:
https://github.com/jupyter/nbformat/blob/11903688167d21af96c92f7f5bf0634ab51819f1/nbformat/v4/nbbase.py#L98-L110
The reason to do that is that it ensures the code cell itself will be valid according to the v4 spec.
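A sketch of how label_parallel_notebook could look with both suggestions applied (the deepcopy above and new_code_cell here). The exact cell source is an assumption, since the PR's cell contents aren't shown in this thread.

```python
from copy import deepcopy

from nbformat.v4 import new_code_cell

def label_parallel_notebook(nb, label):
    """Return a copy of nb with a labelling cell inserted.

    Used for parallel testing to label two notebooks which are run
    simultaneously.
    """
    nb = deepcopy(nb)  # don't mutate the notebook that was passed in
    # new_code_cell fills in the fields the v4 spec requires, so the
    # inserted cell validates; the source string here is illustrative.
    nb.cells.insert(1, new_code_cell(source='label = "{}"'.format(label)))
    return nb
```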
return nb


def test_parallel_notebooks(capfd, tmpdir):
Are capfd and tmpdir pytest fixtures? Otherwise, where are these being populated when it comes time to test them?
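For reference, both names are pytest built-in fixtures, injected by parameter name; a minimal illustration:

```python
def test_fixture_injection(capfd, tmpdir):
    # capfd captures stdout/stderr at the file-descriptor level;
    # tmpdir is a per-test temporary directory (a py.path.local object).
    print("hello")
    out_file = tmpdir.join("out.txt")
    out_file.write("data")

    captured = capfd.readouterr()
    assert captured.out == "hello\n"
    assert out_file.read() == "data"
```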
nbconvert/preprocessors/execute.py
@@ -220,6 +220,20 @@ class ExecutePreprocessor(Preprocessor):
        )
    ).tag(config=True)

    ipython_hist_file = Unicode(
I think you might want to remove this change and put it in its own PR, as it stands alone as a huge benefit.
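The diff above shows only the opening line of the new traitlet. A hedged reconstruction of what such a declaration plausibly looks like; the default value and help text are assumptions, not the PR's actual wording:

```python
from traitlets import Unicode

# Hypothetical reconstruction; only "ipython_hist_file = Unicode(" is
# visible in the diff above.
ipython_hist_file = Unicode(
    default_value=':memory:',
    help="""Path to the SQLite history file for an IPython kernel.

    Using ':memory:' avoids writing a per-kernel history database to
    disk, which can contend when many kernels run in parallel.
    """
).tag(config=True)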
That definitely suggests that whatever is assigning ports is unaware of the other instances that are assigning ports. This might have been introduced by getting rid of the singleton ZmqContext, or it might have always been present but less likely to occur because fewer ports were being assigned. I wonder if there are lessons to be learned from how the MultiKernelManager handles this.

But the fact that we're running into this issue might mean that we need to provide a primitive on the nbconvert side to enable this, rather than relying on library users to solve it in their own parallel code.
The subprocess wouldn't have the zeromq change or parent process manipulation, as that's unreleased and isolated by process, so I believe it's unrelated to the singleton change.

I think I found the bug: https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L185-L187 has a race condition where, between when it finds a free port (it's written to handle port conflicts) and when it actually uses it, the port could have been taken by another process. It's likely not safe regardless of nbconvert parallelism if you have a notebook server up and are running nbconvert or another notebook server at the same time. I'll open an issue / fix for ipykernel.

On the later comment you had: I don't know if an nbconvert primitive would help, because I might spawn multiple nbconvert processes at the same time outside of a parent python process (e.g. spawning nbconvert processes in a bash loop). This would probably defeat the approach MultiKernelManager takes. Maybe we could use a file lock to enable cross-process nbconvert instances to wait for each kernel to launch before starting a new one, but that would only make nbconvert parallelism safe (and slow), not safe across multiple types of clients connecting to zeromq. Correct me if you're thinking of something else here. Basically, I think the kernel wrappers should solve it, and we should help improve the tools for making kernels so they don't have to think about this if we can.
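An illustration of that time-of-check/time-of-use window (not the actual ipykernel code): a port found by binding and releasing can be claimed by another process before it is re-bound.

```python
import socket

def find_free_port():
    # Ask the OS for an unused port by binding to port 0, then release it.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port

port = find_free_port()
# Race window: between the close() above and the bind() below, any other
# process (another kernel, a notebook server) may take the port, so this
# bind can fail with EADDRINUSE under contention.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", port))
s.close()
```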
I now have a local fix for the ipykernel socket issue. But that failure mode is pretty rare and wasn't what we were hitting here.

With some modifications to the tests, and by printing a few lines wrapping execute starts and stops, I was able to narrow down that any processes that start while the threaded execution is running fail. Looking deeper, adding a log message whenever a zmq context instance is generated revealed that processes forked before the parent process finishes executing would not generate new context instances. I believe the traitlets usage or some other wrapper is leaving the context object on the class and reusing it in the forked processes instead of recomputing it.
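A sketch of the kind of diagnostic that can reveal this, assuming the fork start method: if the singleton leaks across the fork, the child reports the inherited context object's id rather than allocating a fresh context.

```python
import multiprocessing
import os
import zmq

def report():
    # If Context._instance leaked across the fork, this returns the
    # inherited object (same id as the parent's) rather than creating
    # a fresh context in the child.
    print("child pid", os.getpid(), "context id", id(zmq.Context.instance()))

if __name__ == "__main__":
    print("parent pid", os.getpid(), "context id", id(zmq.Context.instance()))
    p = multiprocessing.Process(target=report)  # forks on Linux by default
    p.start()
    p.join()
```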
Seems like processes 1-4 aren't even making it to the kernel client initialization. I'll do some more debugging tonight and pin down where they fail more definitively.
Found the parallel processing issue: I tracked it down to the GIL freezing up before entering the kernel client initialization. Basically, the process hangs if the GIL is still held when the multiprocessing forks try to acquire it, and they deadlock.

Going to modify the PR to remove the thread-then-fork test and get the rest merged so we can move forward. There is a race condition in ipykernel that's difficult to trigger; I have a PR partially written to solve it, and it's the only remaining race condition I've seen crop up.
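Not the PR's test, but a minimal illustration of the fork-while-threaded deadlock described above: a lock held by a parent thread at fork time is copied into the child in its locked state, with no thread left to release it, so running this hangs at p.join() on platforms using the fork start method.

```python
import multiprocessing
import threading
import time

lock = threading.Lock()

def hold_lock():
    with lock:
        time.sleep(5)  # hold the lock while the parent forks

def child():
    # The child inherits the lock in its locked state, but the thread
    # that would release it doesn't exist after the fork: deadlock.
    with lock:
        print("child acquired lock")  # never reached

if __name__ == "__main__":
    threading.Thread(target=hold_lock).start()
    time.sleep(0.1)  # make sure the thread holds the lock when we fork
    p = multiprocessing.Process(target=child)  # 'fork' start method on Linux
    p.start()
    p.join()  # hangs: the child can never acquire the inherited lock
```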
Known race condition in ipykernel now has a PR to address: ipython/ipykernel#412
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/nbconvert-5-6-0-release/1867/1
@MSeal and I got a ways toward supporting multiprocessing at the PyCon sprints. The two tests added here, one of which still fails, demonstrate the problem from nbconvert. I am working on a minimum reproduction for jupyter_client as well, but I wanted to put this branch up somewhere so that we can discuss & share the reproducing examples. I've marked both tests as XFail, but since they are both slow, I don't even know that this should get merged.