Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support hard-kill for unresponsive actors #1005

Merged
merged 22 commits into from
Aug 14, 2020
Merged

Support hard-kill for unresponsive actors #1005

merged 22 commits into from
Aug 14, 2020

Conversation

mavam
Copy link
Member

@mavam mavam commented Aug 6, 2020

No description provided.

@mavam mavam marked this pull request as ready for review August 6, 2020 12:48
@mavam mavam requested a review from lava August 11, 2020 06:22
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code itself looks good, but I'm not sure about the API: The way it is implemented now means that every call to shutdown() and every creation of a terminator actor carries the risk of ending the process.

This implies that the function can only be used in "high-stakes" scenarios, where not killing the actors represents catastrophic failure, which seems to limit the usefulness of this function in general. Imho, letting the caller decide how to handle failure seems like the better approach. What do you think?

libvast/src/system/terminator.cpp Show resolved Hide resolved
libvast/src/system/terminator.cpp Outdated Show resolved Hide resolved
@mavam
Copy link
Member Author

mavam commented Aug 11, 2020

I agree that it might not be the best to terminate immediately here. But I'm not sure how to integrate this flexibility into the existing API. How would you beef up shutdown<Policy>(...)? Use terminate<Policy>(...) when the caller should handle a shutdown failure?

@mavam
Copy link
Member Author

mavam commented Aug 11, 2020

@lava please take a look at 91bd9ca as an example on how the node would do a shutdown after receiving an error from the terminator. It's a lot more verbose.

@mavam mavam requested a review from lava August 12, 2020 09:20
@mavam mavam force-pushed the story/ch17933 branch 2 times, most recently from dc33b20 to 2abe5c1 Compare August 12, 2020 10:06
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me modulo some word-smithing and the CHANGELOG entry.

libvast/vast/system/shutdown.hpp Show resolved Hide resolved
libvast/vast/system/terminate.hpp Show resolved Hide resolved
libvast/vast/system/shutdown.hpp Show resolved Hide resolved
libvast/vast/system/terminator.hpp Outdated Show resolved Hide resolved
@mavam
Copy link
Member Author

mavam commented Aug 12, 2020

Thanks for the eagle eyes. I've integrated all your final suggestions in 6db05d9 and will merge after CI gives green light.

mavam and others added 5 commits August 12, 2020 15:40
The weird thing is that the filesystem actor actually terminates.
Verified by temporality installing a custom exit handler. CAF just won't
send us a DOWN message.
@mavam
Copy link
Member Author

mavam commented Aug 13, 2020

I've had to hack around an issue that may be related to actor-framework/actor-framework#1110 in the last commit (da5fbfe). It ain't pretty, but I currently don't know a better way to fix the detached actor issue. We simply can't put a detached actor into our shutdown utility because it won't send us a DOWN message.

I tried coming up with a reproducible CAF example, but didn't get far enough in replicating our logic. Here's my attempt at what point I gave up:

#include <caf/all.hpp>

using namespace caf;

using actor_type = typed_actor<
  reacts_to<int>
>;

actor_type::behavior_type aut(actor_type::pointer) {
  return {
    [](int) {
      // nop
    }
  };
}

behavior parent(event_based_actor* self) {
  auto a = self->spawn<linked + detached>(aut);
  self->set_down_handler([=](const down_msg& msg) {
    self->quit(msg.reason);
  });
  self->set_exit_handler([=](const exit_msg& msg) {
    self->monitor(a);
    self->unlink_from(a);
    self->send_exit(a, exit_reason::user_shutdown);
  });
  return {
    [=](int) {
      // nop
    }
  };
}

int main() {
  actor_system_config cfg;
  actor_system sys{cfg};
  scoped_actor self{sys};
  auto a = self->spawn(parent);
  self->monitor(a);
  self->send_exit(a, exit_reason::user_shutdown);
  self->receive([=](const down_msg& msg) { });
}

I'm not sure if if it's worth going into this detached actor issue in depth before actor-framework/actor-framework#1110 gets fixed. I already spent a lot of cycles on this but don't see light at the end of the tunnel.

@mavam mavam requested a review from lava August 14, 2020 07:17
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, the workaround doesn't even feel that hacky since it could be argued that the filesystem is somehow "more global" than the node, and that there might be a use case where we want to use it after the node has shut down.

libvast/vast/system/terminator.hpp Outdated Show resolved Hide resolved
vast.conf Outdated Show resolved Hide resolved
libvast/vast/defaults.hpp Outdated Show resolved Hide resolved
libvast/vast/system/shutdown.hpp Outdated Show resolved Hide resolved
libvast_test/src/node.cpp Outdated Show resolved Hide resolved
Comment on lines 108 to 110
VAST_WARNING(self, "failed to terminate all actors within 10 mins");
VAST_WARNING(self, "initiates hard kill of",
self->state.remaining_actors.size(), "remaining actors");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These messages can be joined. Also, "10 mins" shouldn't be hardcoded here—the value is configurable now, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full ACK regarding the hardcoding - this is fixed now.

While they can be joined, I'd like to keep log lines short. It doesn't hurt if you get a multi-line warning in my opinion.

@mavam mavam merged commit fc8b794 into master Aug 14, 2020
@mavam mavam deleted the story/ch17933 branch August 14, 2020 11:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants