[CI] linux://:gcs_client_test is failing/flaky on master #34344
Labels: core (Issues that should be addressed in Ray Core)
Comments
pcmoritz pushed a commit that referenced this issue on Apr 22, 2023:
Why are these changes needed?

The current theory is as follows. The pubsub io service is created and run inside the GcsServer, so if the pubsub io service is accessed after the GcsServer has been destroyed, it segfaults. Today, when we call rpc::DrainAndResetExecutor during teardown, it recreates the executor thread pool. So if teardown proceeds as DrainAndResetExecutor -> GcsServer's internal pubsub posts a new SendReply to the newly created thread pool -> GcsServer.reset -> pubsub io service destroyed -> SendReply invoked from the newly created thread pool, the process segfaults.

NOTE: the segfault originates in the pubsub service, as the failing frame shows:

#2 0x7f92034d9129 in ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberPollRequest, ray::rpc::GcsSubscriberPollReply>::HandleRequestImpl()::'lambda'(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)::operator()(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>) const::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/grpc_common_lib/ray/rpc/server_call.h:212:48

As a fix, I only drain the thread pool, and reset it after all operations are fully cleaned up (and only from tests). There should be no need to reset during regular process termination of the raylet, the GCS, or core workers.

Related issue number

Closes #34344

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
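To make the race concrete, here is a minimal, self-contained C++ sketch (not Ray's actual code) of the teardown ordering the fix establishes. SimpleExecutor and FakeGcsServer are hypothetical stand-ins for the gRPC reply executor touched by rpc::DrainAndResetExecutor and for the GcsServer with its internal pubsub io service; the point is that Drain() flushes pending replies without creating a replacement thread pool, so no queued callback can outlive the server state it captures.

```cpp
// Hedged sketch: toy executor illustrating drain-then-destroy ordering.
// Not Ray's implementation; names below are illustrative stand-ins.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// A toy single-thread executor standing in for the gRPC reply thread pool.
class SimpleExecutor {
 public:
  SimpleExecutor() : worker_([this] { Run(); }) {}

  // Queue a task; silently dropped once draining has begun.
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    if (draining_) return;  // late posts are dropped, not run on a fresh pool
    tasks_.push(std::move(task));
    cv_.notify_one();
  }

  // Run every queued task to completion and stop the worker. Crucially, this
  // does NOT recreate a replacement thread pool the way the old
  // DrainAndResetExecutor did, so nothing can post a reply mid-teardown.
  void Drain() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      draining_ = true;
      cv_.notify_one();
    }
    if (worker_.joinable()) worker_.join();
  }

  ~SimpleExecutor() { Drain(); }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return draining_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // drained and empty: exit the worker
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool draining_ = false;
  std::thread worker_;  // declared last so other members exist before Run()
};

// Stand-in for the GcsServer owning state that reply callbacks dereference.
struct FakeGcsServer {
  int pubsub_state = 42;  // destroyed together with the server
};

int main() {
  auto executor = std::make_unique<SimpleExecutor>();
  auto server = std::make_unique<FakeGcsServer>();

  // A "SendReply" that touches server-owned state.
  FakeGcsServer* raw = server.get();
  executor->Post([raw] { std::cout << "reply: " << raw->pubsub_state << "\n"; });

  // The safe teardown order from the fix:
  executor->Drain();  // 1. flush every pending reply while the state is alive
  server.reset();     // 2. only now may the io-service-owned state go away
  executor.reset();   // 3. reset the executor last (tests recreate it here)
}
```

Reversing steps 1 and 2, or recreating the pool during the drain as the old DrainAndResetExecutor did, lets a queued reply run against freed state, which matches the use-after-free in the stack trace above.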
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue on May 4, 2023:
(Same commit message as the Apr 22 commit above, with the issue reference written as ray-project#34344 from the fork.) Signed-off-by: SangBin Cho <rkooo567@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
architkulkarni pushed a commit to architkulkarni/ray that referenced this issue on May 16, 2023:
(Same commit message as the Apr 22 commit above, with the issue reference written as ray-project#34344 from the fork.) Signed-off-by: SangBin Cho <rkooo567@gmail.com>
....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://:gcs_client_test-END
....