Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Check if python is exiting on a core worker on non-main Python thread #49547

Merged
merged 6 commits into from
Jan 10, 2025
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions python/ray/_raylet.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -2362,6 +2362,10 @@ cdef CRayStatus check_signals() nogil:
# The Python exceptions are not handled if it is raised from cdef,
# so we have to handle it here.
try:
if sys.is_finalizing():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that the segfault occurs in with gil because it dereferences something that has already been released. Is it possible to avoid acquiring the GIL in that situation? I guess it is impossible to determine the state of the Python interpreter without the GIL. If it is impossible, I am fine with the current solution.

It would be helpful if you could test it with a reproduction both with and without this PR so that we can know whether this solution helps or not because it's hard to write tests for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya there's no way to get python interpreter state without the gil, so the other option becomes reworking the signal checking in C++ as i attempted here #49319, but it'll require some more work to handle some edge-case tests where we rely on knowing the Python SystemExit state to exit out of the core worker.

Ideally this should fix, because sys.is_finalizing works on non-main thread and should allow us to exit out before python starts freeing the resources needed for with gil. But ya will try more to repro with vs. without this change, and update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar logic used here

if sys.is_finalizing():

from #47702. Only checking for system exit on ChannelTimeout though instead of each time through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's hard to reproduce, I'm fine with running a RayCG program hundreds of times to check if no segfault related to acquiring the GIL occurs.

return CRayStatus.IntentionalSystemExit(
"Python is exiting.".encode("utf-8")
)
PyErr_CheckSignals()
except KeyboardInterrupt:
return CRayStatus.Interrupted(b"")
Expand Down
6 changes: 0 additions & 6 deletions python/ray/dag/tests/experimental/test_accelerated_dag.py
Original file line number Diff line number Diff line change
Expand Up @@ -1110,12 +1110,6 @@ def test_dag_exception_chained(ray_start_regular, capsys):
# Can use the DAG after exceptions are thrown.
assert ray.get(compiled_dag.execute(1)) == 2

# Note: somehow the auto triggered teardown() from ray.shutdown()
# does not finish in time for this test, leading to a segfault
# of the following test (likely due to a dangling monitor thread
# upon the new Ray init).
compiled_dag.teardown()


@pytest.mark.parametrize("single_fetch", [True, False])
def test_dag_exception_multi_output(ray_start_regular, single_fetch, capsys):
Expand Down
Loading