
CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky #48288

Closed
can-anyscale opened this issue Oct 28, 2024 · 60 comments · Fixed by #48433 or #48808
Labels
- bug: Something that is supposed to be working; but isn't
- ci-test
- core: Issues that should be addressed in Ray Core
- flaky-tracker: Issue created via Flaky Test Tracker https://flaky-tests.ray.io/
- P1: Issue that should be fixed within a few weeks
- ray-test-bot: Issues managed by OSS test policy
- stability

Comments

@can-anyscale
Collaborator

CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing. Recent failures:
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d1c2-1479-41d6-bf43-5727365f667f
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d187-63be-42ea-9a3e-4e2ff54b9f96

DataCaseName-linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag-END
Managed by OSS Test Policy

@can-anyscale added labels Oct 28, 2024: bug (Something that is supposed to be working; but isn't), ci-test, core (Issues that should be addressed in Ray Core), flaky-tracker (Issue created via Flaky Test Tracker https://flaky-tests.ray.io/), ray-test-bot (Issues managed by OSS test policy), stability, triage (Needs triage, e.g. priority, bug/not-bug, and owning component), weekly-release-blocker (Issues that will be blocking Ray weekly releases)
@can-anyscale
Collaborator Author

new and flaky test

@can-anyscale changed the title from "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing" to "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky" Oct 28, 2024
@can-anyscale reopened this Oct 28, 2024
@can-anyscale
Collaborator Author

CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky. Recent failures:
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d1c2-1479-41d6-bf43-5727365f667f
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d187-63be-42ea-9a3e-4e2ff54b9f96

DataCaseName-linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag-END
Managed by OSS Test Policy

@can-anyscale
Collaborator Author

CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky. Recent failures:
- https://buildkite.com/ray-project/postmerge/builds/6705#0192d3e2-fd1a-44c7-8a2b-f82ff3f69711
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d1c2-1479-41d6-bf43-5727365f667f
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d187-63be-42ea-9a3e-4e2ff54b9f96

DataCaseName-linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag-END
Managed by OSS Test Policy

@can-anyscale removed the weekly-release-blocker label (Issues that will be blocking Ray weekly releases) Oct 28, 2024
@can-anyscale changed the title from "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky" to "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing" Oct 29, 2024
@can-anyscale reopened this Oct 29, 2024
@can-anyscale changed the title from "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing" to "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky" Oct 29, 2024
@jjyao added the P1 label (Issue that should be fixed within a few weeks) and removed the triage label (Needs triage, e.g. priority, bug/not-bug, and owning component) Nov 18, 2024
@can-anyscale changed the title from "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky" to "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing" Nov 19, 2024
@can-anyscale reopened this Nov 19, 2024
@can-anyscale
Collaborator Author

CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing. Recent failures:
- https://buildkite.com/ray-project/postmerge/builds/7003#01934641-525d-4110-a4cd-fbc99786a157
- https://buildkite.com/ray-project/postmerge/builds/7003#01934641-5261-47c3-a715-bb7543bf040d
- https://buildkite.com/ray-project/postmerge/builds/6995#01934568-b84e-4330-b414-23e8da465728
- https://buildkite.com/ray-project/postmerge/builds/6986#019341f4-6b78-4d15-aa0b-75596572bd99

DataCaseName-linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag-END
Managed by OSS Test Policy

@can-anyscale changed the title from "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is consistently_failing" to "CI test linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag is flaky" Nov 19, 2024
stephanie-wang added a commit that referenced this issue Nov 20, 2024
…ng down (#48808)

Each compiled graph starts a monitor thread to tear down the DAG upon
detecting an error in one of the workers' task loops. Currently, during
driver shutdown, this thread can outlive the C++ CoreWorker. This causes
a silent process exit when the thread later tries to call into the
CoreWorker after it has already been destructed. To prevent this, this
fix joins the monitor thread *before* destructing the CoreWorker.

## Related issue number

Closes #48288.

---------

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
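
The shutdown ordering described in the commit message above can be illustrated with a small, self-contained sketch. This is not Ray's code: `FakeCoreWorker`, `MonitorThread`, `shutdown`, and `run_demo` are hypothetical names invented for this example, and a Python exception stands in for the silent process exit that a call into a destructed C++ object would cause. It only shows the general pattern of stopping and joining a background thread before tearing down the object it uses.

```python
import threading
import time


class FakeCoreWorker:
    """Hypothetical stand-in for the C++ CoreWorker (not Ray's real class)."""

    def __init__(self):
        self.alive = True

    def check_status(self):
        # Calling into a destructed worker is the failure being illustrated.
        if not self.alive:
            raise RuntimeError("CoreWorker used after destruction")
        return "ok"

    def destroy(self):
        self.alive = False


class MonitorThread(threading.Thread):
    """Hypothetical monitor that polls the worker until told to stop."""

    def __init__(self, worker, poll_interval_s=0.01):
        super().__init__(daemon=True)
        self.worker = worker
        self.poll_interval_s = poll_interval_s
        self.stop_event = threading.Event()
        self.errors = []

    def run(self):
        while not self.stop_event.is_set():
            try:
                self.worker.check_status()
            except RuntimeError as e:
                self.errors.append(e)
                return
            # wait() returns early once stop_event is set.
            self.stop_event.wait(self.poll_interval_s)

    def stop_and_join(self, timeout_s=5.0):
        self.stop_event.set()
        self.join(timeout_s)


def shutdown(worker, monitor):
    # Join the monitor first so it can no longer touch the worker...
    monitor.stop_and_join()
    # ...and only then destroy the worker.
    worker.destroy()


def run_demo():
    worker = FakeCoreWorker()
    monitor = MonitorThread(worker)
    monitor.start()
    time.sleep(0.05)  # let the monitor poll a few times
    shutdown(worker, monitor)
    return monitor.is_alive(), worker.alive, monitor.errors
```

Reversing the two steps in `shutdown` (destroying the worker before joining the monitor) would let the monitor's next poll reach an already-destroyed worker, which is the race the commit message describes.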
@can-anyscale reopened this Nov 20, 2024
jecsand838 pushed a commit to jecsand838/ray that referenced this issue Dec 4, 2024
…ng down (ray-project#48808)

Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this issue Dec 7, 2024
…ng down (ray-project#48808)

Signed-off-by: hjiang <dentinyhao@gmail.com>
4 participants