[CI] linux://:gcs_client_test is failing/flaky on master #34344
Labels: core (Issues that should be addressed in Ray Core)
Comments
pcmoritz pushed a commit that referenced this issue on Apr 22, 2023:
Why are these changes needed?

The current theory is as follows. The pubsub io service is created and run inside the GcsServer, so if the pubsub io service is accessed after the GcsServer has been destroyed, it segfaults. Today, when we call rpc::DrainAndResetExecutor during teardown, it recreates the executor thread pool. So if teardown proceeds as DrainAndResetExecutor -> GcsServer's internal pubsub posts a new SendReply to the newly created thread pool -> GcsServer.reset -> pubsub io service destroyed -> SendReply invoked from the newly created thread pool, the process segfaults.

NOTE: the segfault originates in the pubsub service, as the failing frame shows:

#2 0x7f92034d9129 in ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberPollRequest, ray::rpc::GcsSubscriberPollReply>::HandleRequestImpl()::'lambda'(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)::operator()(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>) const::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/grpc_common_lib/ray/rpc/server_call.h:212:48

As a fix, I only drain the thread pool, and reset it after all operations are fully cleaned up (and only from tests). There should be no need to reset during regular process termination of the raylet, the GCS, or core workers.

Related issue number

Closes #34344

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
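To make the race concrete, here is a minimal, self-contained C++ sketch (not Ray's actual code) of the teardown ordering the fix establishes. SimpleExecutor and FakeGcsServer are hypothetical stand-ins for the gRPC reply executor touched by rpc::DrainAndResetExecutor and for the GcsServer with its internal pubsub io service; the point is that Drain() flushes pending replies without creating a replacement thread pool, so no queued callback can outlive the server state it captures.

```cpp
// Hedged sketch: toy executor illustrating drain-then-destroy ordering.
// Not Ray's implementation; names below are illustrative stand-ins.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// A toy single-thread executor standing in for the gRPC reply thread pool.
class SimpleExecutor {
 public:
  SimpleExecutor() : worker_([this] { Run(); }) {}

  // Queue a task; silently dropped once draining has begun.
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    if (draining_) return;  // late posts are dropped, not run on a fresh pool
    tasks_.push(std::move(task));
    cv_.notify_one();
  }

  // Run every queued task to completion and stop the worker. Crucially, this
  // does NOT recreate a replacement thread pool the way the old
  // DrainAndResetExecutor did, so nothing can post a reply mid-teardown.
  void Drain() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      draining_ = true;
      cv_.notify_one();
    }
    if (worker_.joinable()) worker_.join();
  }

  ~SimpleExecutor() { Drain(); }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return draining_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // drained and empty: exit the worker
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool draining_ = false;
  std::thread worker_;  // declared last so other members exist before Run()
};

// Stand-in for the GcsServer owning state that reply callbacks dereference.
struct FakeGcsServer {
  int pubsub_state = 42;  // destroyed together with the server
};

int main() {
  auto executor = std::make_unique<SimpleExecutor>();
  auto server = std::make_unique<FakeGcsServer>();

  // A "SendReply" that touches server-owned state.
  FakeGcsServer* raw = server.get();
  executor->Post([raw] { std::cout << "reply: " << raw->pubsub_state << "\n"; });

  // The safe teardown order from the fix:
  executor->Drain();  // 1. flush every pending reply while the state is alive
  server.reset();     // 2. only now may the io-service-owned state go away
  executor.reset();   // 3. reset the executor last (tests recreate it here)
}
```

Reversing steps 1 and 2, or recreating the pool during the drain as the old DrainAndResetExecutor did, lets a queued reply run against freed state, which matches the use-after-free in the stack trace above.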
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue on May 4, 2023:
(Same commit message as the Apr 22 commit above, with the issue reference written as ray-project#34344 from the fork.) Signed-off-by: SangBin Cho <rkooo567@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
architkulkarni pushed a commit to architkulkarni/ray that referenced this issue on May 16, 2023:
(Same commit message as the Apr 22 commit above, with the issue reference written as ray-project#34344 from the fork.) Signed-off-by: SangBin Cho <rkooo567@gmail.com>
....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://:gcs_client_test-END
....