Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][compiled graphs] Wait for monitor thread to join before shutting down #48808

Merged
merged 3 commits into from
Nov 20, 2024

Conversation

stephanie-wang
Copy link
Contributor

@stephanie-wang stephanie-wang commented Nov 19, 2024

Why are these changes needed?

Each compiled graph starts a monitor thread to tear down the DAG upon detecting an error in one of the workers' task loops. Currently, during driver shutdown, this thread can live past the lifetime of the C++ CoreWorker. This causes a silent process exit when the thread later tries to call on the CoreWorker but it has already been destructed. To prevent this from happening, this fix joins the monitor thread before destructing the CoreWorker.

Related issue number

Closes #48288.

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
@stephanie-wang stephanie-wang added the go add ONLY when ready to merge, run all tests label Nov 19, 2024
This reverts commit bbc748a.
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
@stephanie-wang stephanie-wang changed the title [wip][donotmerge] Fixing #48288 [core][compiled graphs] Wait for monitor thread to join before shutting down Nov 20, 2024
Copy link
Contributor

@ruisearch42 ruisearch42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

monitor.teardown(kill_actors=kill_actors)
monitor.join(timeout=ctx.teardown_timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the PR description,

To prevent this from happening, this fix joins the monitor thread before destructing the CoreWorker.

I don't understand when the CoreWorker destruction happens and why calling monitor.join in __del__ ensures that the monitor thread joins before the CoreWorker is destructed.

I guess the CoreWorker will be terminated when ray.shutdown() is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, CoreWorker is destructed in ray.shutdown(), and the monitor thread could still be running at that point.

@stephanie-wang stephanie-wang merged commit eea0057 into ray-project:master Nov 20, 2024
4 of 5 checks passed
@stephanie-wang stephanie-wang deleted the fix-48288 branch November 20, 2024 18:18
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
…ng down (ray-project#48808)

Each compiled graph starts a monitor thread to tear down the DAG upon
detecting an error in one of the workers' task loops. Currently, during
driver shutdown, this thread can live past the lifetime of the C++
CoreWorker. This causes a silent process exit when the thread later
tries to call on the CoreWorker but it has already been destructed. To
prevent this from happening, this fix joins the monitor thread *before*
destructing the CoreWorker.

## Related issue number

Closes ray-project#48288.

---------

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
…ng down (ray-project#48808)

Each compiled graph starts a monitor thread to tear down the DAG upon
detecting an error in one of the workers' task loops. Currently, during
driver shutdown, this thread can live past the lifetime of the C++
CoreWorker. This causes a silent process exit when the thread later
tries to call on the CoreWorker but it has already been destructed. To
prevent this from happening, this fix joins the monitor thread *before*
destructing the CoreWorker.

## Related issue number

Closes ray-project#48288.

---------

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Signed-off-by: hjiang <dentinyhao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky
4 participants