-
-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support hard-kill for unresponsive actors #1005
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code itself looks good, but I'm not sure about the API: The way it is implemented now means that every call to shutdown()
and every creation of a terminator
actor carries the risk of ending the process.
This implies that the function can only be used in "high-stakes" scenarios, where not killing the actors represents catastrophic failure, which seems to limit the usefulness of this function in general. Imho, letting the caller decide how to handle failure seems like the better approach. What do you think?
I agree that it might not be the best to terminate immediately here. But I'm not sure how to integrate this flexibility into the existing API. How would you beef up |
dc33b20
to
2abe5c1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me modulo some word-smithing and the CHANGELOG entry.
Since the deterministic actor system in the unit tests lines up all timestamps according to their order of expiry, without actually sleeping/waiting, we need to add an epsilon value to avoid request timeouts firing sporadically. Setting epsilon to 1 microsecond, things seem to work. But it failed with 1 nanosecond.
Thanks for the eagle eyes. I've integrated all your final suggestions in 6db05d9 and will merge after CI gives green light. |
The weird thing is that the filesystem actor actually terminates. Verified by temporality installing a custom exit handler. CAF just won't send us a DOWN message.
I've had to hack around an issue that may be related to actor-framework/actor-framework#1110 in the last commit (da5fbfe). It ain't pretty, but I currently don't know a better way to fix the detached actor issue. We simply can't put a detached actor into our shutdown utility because it won't send us a DOWN message. I tried coming up with a reproducible CAF example, but didn't get far enough in replicating our logic. Here's my attempt at what point I gave up: #include <caf/all.hpp>
using namespace caf;
using actor_type = typed_actor<
reacts_to<int>
>;
actor_type::behavior_type aut(actor_type::pointer) {
return {
[](int) {
// nop
}
};
}
behavior parent(event_based_actor* self) {
auto a = self->spawn<linked + detached>(aut);
self->set_down_handler([=](const down_msg& msg) {
self->quit(msg.reason);
});
self->set_exit_handler([=](const exit_msg& msg) {
self->monitor(a);
self->unlink_from(a);
self->send_exit(a, exit_reason::user_shutdown);
});
return {
[=](int) {
// nop
}
};
}
int main() {
actor_system_config cfg;
actor_system sys{cfg};
scoped_actor self{sys};
auto a = self->spawn(parent);
self->monitor(a);
self->send_exit(a, exit_reason::user_shutdown);
self->receive([=](const down_msg& msg) { });
} I'm not sure if if it's worth going into this detached actor issue in depth before actor-framework/actor-framework#1110 gets fixed. I already spent a lot of cycles on this but don't see light at the end of the tunnel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, the workaround doesn't even feel that hacky since it could be argued that the filesystem is somehow "more global" than the node, and that there might be a use case where we want to use it after the node has shut down.
libvast/src/system/terminator.cpp
Outdated
VAST_WARNING(self, "failed to terminate all actors within 10 mins"); | ||
VAST_WARNING(self, "initiates hard kill of", | ||
self->state.remaining_actors.size(), "remaining actors"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These messages can be joined. Also, "10 mins" shouldn't be hardcoded here—the value is configurable now, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full ACK regarding the hardcoding - this is fixed now.
While they can be joined, I'd like to keep log lines short. It doesn't hurt if you get a multi-line warning in my opinion.
No description provided.