Support hard-kill for unresponsive actors #1005

mavam · 2020-08-06T10:26:03Z

No description provided.

libvast/src/system/terminator.cpp

lava

The code itself looks good, but I'm not sure about the API: The way it is implemented now means that every call to shutdown() and every creation of a terminator actor carries the risk of ending the process.

This implies that the function can only be used in "high-stakes" scenarios, where not killing the actors represents catastrophic failure, which seems to limit the usefulness of this function in general. IMHO, letting the caller decide how to handle failure seems like the better approach. What do you think?

libvast/src/system/terminator.cpp

mavam · 2020-08-11T19:21:42Z

I agree that it might not be best to terminate immediately here. But I'm not sure how to integrate this flexibility into the existing API. How would you beef up shutdown<Policy>(...)? Use terminate<Policy>(...) when the caller should handle a shutdown failure?

mavam · 2020-08-11T19:50:05Z

@lava please take a look at 91bd9ca as an example of how the node would do a shutdown after receiving an error from the terminator. It's a lot more verbose.
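The split discussed in this thread can be sketched in plain C++ without CAF. This is a minimal illustration under stated assumptions, not the code from 91bd9ca: `actor_stub`, `sketch::terminate`, and `sketch::shutdown` are hypothetical stand-ins, whereas the real functions operate on actor handles and timeouts. The idea is that `terminate` reports which actors failed to stop and leaves the decision to the caller, while `shutdown` layers the abort-on-failure policy on top.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in for an actor: a name plus whether it reacts to EXIT.
struct actor_stub {
  std::string name;
  bool responsive;
};

// Tries to stop every actor and returns the names of those that did not
// terminate. An empty result means success; the caller decides what to do
// with a non-empty one.
std::vector<std::string> terminate(const std::vector<actor_stub>& actors) {
  std::vector<std::string> remaining;
  for (const auto& actor : actors)
    if (!actor.responsive)
      remaining.push_back(actor.name);
  return remaining;
}

// High-stakes wrapper: ends the process when termination fails.
void shutdown(const std::vector<actor_stub>& actors) {
  if (!terminate(actors).empty())
    std::abort();
}

} // namespace sketch
```

A caller such as the node could log the returned names and continue with its own cleanup, which is more verbose than calling `shutdown` but avoids ending the process.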

lava

Looks good to me modulo some word-smithing and the CHANGELOG entry.

libvast/vast/system/shutdown.hpp

libvast/vast/system/terminate.hpp

libvast/vast/system/shutdown.hpp

libvast/vast/system/terminator.hpp

Since the deterministic actor system in the unit tests lines up all timestamps according to their order of expiry, without actually sleeping or waiting, we need to add an epsilon value to avoid request timeouts firing sporadically. With epsilon set to 1 microsecond, things seem to work, but it failed with 1 nanosecond.
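One way to picture the epsilon is the toy model below. It is not CAF's test coordinator: `test_clock` and `fire_order` are hypothetical, and the sketch assumes that the request timeout is meant to fire after the timer it guards. The labels and durations are made up, and the model does not explain why 1 nanosecond was insufficient in the actual tests.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical deterministic clock: nothing sleeps, pending timeouts simply
// fire in order of their deadline.
class test_clock {
public:
  void schedule(std::chrono::nanoseconds delay, std::string label) {
    pending_.emplace(now_ + delay, std::move(label));
  }

  // Advances to the last deadline and returns the labels in firing order.
  std::vector<std::string> run_all() {
    std::vector<std::string> fired;
    for (auto& entry : pending_) {
      now_ = entry.first;
      fired.push_back(std::move(entry.second));
    }
    pending_.clear();
    return fired;
  }

private:
  std::chrono::nanoseconds now_{0};
  std::multimap<std::chrono::nanoseconds, std::string> pending_;
};

// Schedules a kill timer and the request timeout guarding it. The epsilon
// pushes the request timeout strictly behind the timer, so the firing order
// no longer depends on the order of scheduling.
inline std::vector<std::string>
fire_order(std::chrono::nanoseconds grace, std::chrono::nanoseconds epsilon) {
  test_clock clock;
  clock.schedule(grace + epsilon, "request timeout");
  clock.schedule(grace, "kill timer");
  return clock.run_all();
}

} // namespace sketch
```

With an epsilon of zero, both deadlines tie and the order falls back to whichever timeout was scheduled first, which is the kind of ambiguity the offset removes in this model.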

mavam · 2020-08-12T10:24:00Z

Thanks for the eagle eyes. I've integrated all your final suggestions in 6db05d9 and will merge after CI gives the green light.

The weird thing is that the filesystem actor actually terminates. Verified by temporarily installing a custom exit handler. CAF just won't send us a DOWN message.

mavam · 2020-08-13T19:53:26Z

I've had to hack around an issue that may be related to actor-framework/actor-framework#1110 in the last commit (da5fbfe). It ain't pretty, but I currently don't know a better way to fix the detached actor issue. We simply can't put a detached actor into our shutdown utility because it won't send us a DOWN message.

I tried coming up with a reproducible CAF example, but didn't get far enough in replicating our logic. Here's my attempt at the point where I gave up:

#include <caf/all.hpp>

using namespace caf;

using actor_type = typed_actor<
  reacts_to<int>
>;

// Actor under test: accepts an int and does nothing with it.
actor_type::behavior_type aut(actor_type::pointer) {
  return {
    [](int) {
      // nop
    }
  };
}

behavior parent(event_based_actor* self) {
  // Spawn the actor under test in its own thread, linked to the parent.
  auto a = self->spawn<linked + detached>(aut);
  // Quit once the monitored child reports DOWN.
  self->set_down_handler([=](const down_msg& msg) {
    self->quit(msg.reason);
  });
  // On EXIT, swap the link for a monitor and forward the shutdown to the child.
  self->set_exit_handler([=](const exit_msg&) {
    self->monitor(a);
    self->unlink_from(a);
    self->send_exit(a, exit_reason::user_shutdown);
  });
  return {
    [=](int) {
      // nop
    }
  };
}

int main() {
  actor_system_config cfg;
  actor_system sys{cfg};
  scoped_actor self{sys};
  auto a = self->spawn(parent);
  self->monitor(a);
  self->send_exit(a, exit_reason::user_shutdown);
  // Block until the parent's DOWN message arrives.
  self->receive([=](const down_msg&) { });
}

I'm not sure if it's worth going into this detached actor issue in depth before actor-framework/actor-framework#1110 gets fixed. I already spent a lot of cycles on this but don't see light at the end of the tunnel.

lava

Looks good to me, the workaround doesn't even feel that hacky since it could be argued that the filesystem is somehow "more global" than the node, and that there might be a use case where we want to use it after the node has shut down.

libvast/vast/system/terminator.hpp

vast.conf

libvast/vast/defaults.hpp

libvast/vast/system/shutdown.hpp

libvast_test/src/node.cpp

dominiklohmann · 2020-08-14T09:56:21Z

libvast/src/system/terminator.cpp

+      VAST_WARNING(self, "failed to terminate all actors within 10 mins");
+      VAST_WARNING(self, "initiates hard kill of",
+                   self->state.remaining_actors.size(), "remaining actors");


These messages can be joined. Also, "10 mins" shouldn't be hardcoded here—the value is configurable now, isn't it?

mavam

Full ACK regarding the hardcoding - this is fixed now.

While they can be joined, I'd like to keep log lines short. It doesn't hurt if you get a multi-line warning in my opinion.
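As an illustration of the point about hardcoding, the two warning lines can be derived from the configured timeout. This is only a sketch: `kill_warnings` is a hypothetical helper that returns the text, whereas the PR passes its arguments to `VAST_WARNING` directly, and rendering the timeout in seconds is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Builds the two short warning lines from the configured timeout instead of
// a hardcoded "10 mins". Hypothetical helper for illustration only.
std::vector<std::string>
kill_warnings(std::chrono::seconds timeout, std::size_t remaining) {
  return {
    "failed to terminate all actors within "
      + std::to_string(timeout.count()) + "s",
    "initiates hard kill of " + std::to_string(remaining)
      + " remaining actors",
  };
}

} // namespace sketch
```

Keeping the two strings separate matches the preference for short log lines while still reflecting whatever timeout the user configured.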

mavam added the enhancement ✨ label Aug 6, 2020

mavam marked this pull request as ready for review August 6, 2020 12:48

mavam requested a review from lava August 11, 2020 06:22

lava reviewed Aug 11, 2020

mavam force-pushed the story/ch17933 branch from 91bd9ca to c4b43ee Compare August 12, 2020 08:47

mavam requested a review from lava August 12, 2020 09:20

mavam force-pushed the story/ch17933 branch 2 times, most recently from dc33b20 to 2abe5c1 Compare August 12, 2020 10:06

lava approved these changes Aug 12, 2020

mavam and others added 13 commits August 12, 2020 12:13

Add new logic for killing unresponsive actors

2225da5

Make all termination timeouts configurable

55ad6f5

Add comment on sequential monitoring

d68364e

Do not abort when hard kill failed

f4aae50

Use terminate function to implement shutdown

4c2e1c6

Remove third confusing shutdown parameter

54fec3f

Streamline terminate semantics

3a191bc

Abort within shutdown

fca23d2

Clearly document shutdown vs terminate

85f86a8

Rename timeout parameters and adjust defaults

0dc154d

Add changelog entry

69ec1f3

Perform small touch-ups and doc polishing

6db05d9

mavam force-pushed the story/ch17933 branch from dedd758 to 6db05d9 Compare August 12, 2020 10:23

mavam added 2 commits August 12, 2020 13:01

Bump epsilon to please CI

fb51528

Use terminate function consistently

bde44d3

mavam and others added 5 commits August 12, 2020 15:40

Streamline implementation of terminate

85ce736

Make parameter naming more consistent

aa70aa0

Only do multi-step shutdown for finite timeouts

835e5ba

Tweak node API for shutdown and make configurable

5a5d470

Workaround issue with detached actor monitoring

da5fbfe

The weird thing is that the filesystem actor actually terminates. Verified by temporarily installing a custom exit handler. CAF just won't send us a DOWN message.

Make workaround less obtrusive

f7a2bd7

mavam requested a review from lava August 14, 2020 07:17

lava approved these changes Aug 14, 2020

dominiklohmann reviewed Aug 14, 2020

Improve comments and documentation

a75623f

mavam force-pushed the story/ch17933 branch from 116778f to a75623f Compare August 14, 2020 11:28

mavam merged commit fc8b794 into master Aug 14, 2020

mavam deleted the story/ch17933 branch August 14, 2020 11:28
