Make shutdown more resilient #75

Open · rmccue opened this issue Mar 27, 2024 · 5 comments
Comments

rmccue commented Mar 27, 2024

Cavalcade has a persistent failure mode where jobs "stall": they are marked in the DB as running but aren't actually running. This leaves them in a limbo state where the jobs can't be rescheduled (as they still appear in wp_next_scheduled() etc.), but aren't ever going to complete.

@owaincuvelier identified that many (all?) of our "stalled" jobs on Altis are coming from the container termination. This termination happens either as a result of autoscaling or when containers are replaced during deployments.

This is a known issue but I thought we had worked around it effectively; it appears that is not the case, so we need to improve the handling to be more aggressive.

The problem

When the container is gracefully terminated in ECS:

  • ECS sends a SIGTERM to the process
  • Cavalcade Runner catches this signal and runs Runner::terminate()
  • terminate() outputs `Cavalcade received terminate signal (15), shutting down 1 worker(s)` into the logs
  • terminate() waits for the workers to finish up and stops spawning new workers, removing the runner from the "pool"
  • In theory:
    • The workers complete their jobs and are marked as completed/failed as normal
    • The runner outputs `Shutting down! (Terminated by signal: 15)`

In practice for longer-lived jobs (>60s), what actually happens is:

  • terminate() keeps waiting for the workers to finish
  • ECS hits the stopTimeout (30s by default, but on Altis we have this configured to 60s)
  • ECS sends a SIGKILL to the process, immediately terminating the process and its children; hence, the job status is never updated

We increased the stopTimeout to help address this, but in practice it didn't help much, and it can't help with tasks that run longer than the timeout.

The key issue is that terminate() doesn't propagate the signal forward to the worker processes, so they will continue the job. For short-lived jobs this is fine, but long-running jobs will carry on as if nothing has changed until they are abruptly killed.
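
To make the gap concrete: the workers are child processes spawned by the runner, so nothing stops them unless the runner explicitly forwards the signal. A minimal sketch of what forwarding could look like, assuming the runner holds proc_open() handles for its workers (names here are hypothetical, not the actual Runner API):

```php
<?php
// Hypothetical sketch: forward a signal to each still-running worker child.
// Assumes each entry in $processes is a proc_open() resource.
function forward_signal_to_workers( array $processes, int $signal = SIGTERM ): void {
	foreach ( $processes as $process ) {
		$status = proc_get_status( $process );
		if ( $status['running'] ) {
			posix_kill( $status['pid'], $signal );
		}
	}
}
```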

Suggested solution

I'd like to suggest the following solutions, combined together (a rough sketch follows at the end of this comment):

  • In terminate(), after 30s, if workers are still running, we send a SIGTERM to their processes
  • In terminate(), after 90s, if workers are still running, we send a SIGKILL to their processes
  • Jobs killed by one of the two signals are marked as failed, and the logs indicate that they were killed
  • In our task config, we increase the stopTimeout to the maximum of 120s

This has the following results:

  • This keeps the current behaviour as it is for short running jobs (<30s)
  • For longer-running jobs which aren't designed to run long, they'll get killed by the SIGTERM and an error message will appear in the logs
  • For longer-running jobs which are designed to run long, they can catch the SIGTERM and have 60s to decide how to shut down gracefully
  • For jobs which get killed, the logs will reflect this and allow developers a better understanding
  • The "stuck" jobs caused by this process should be eliminated

For others using Cavalcade, we can clearly document this process and how systems should be configured. We should make the 30s/90s thresholds configurable for different systems, including the ability to set either or both to 0 (meaning immediately) or to disable the functionality entirely.
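
To illustrate how the pieces above could fit together, here's a rough sketch of the escalation logic (hypothetical structure and helper names, not the actual Runner code; the 30s/90s values would come from configuration rather than being hard-coded):

```php
<?php
// Rough sketch of an escalating shutdown for still-running workers.
// Each $worker entry wraps a proc_open() resource plus its job record;
// mark_job_failed() is a hypothetical helper.
const TERM_AFTER = 30; // seconds before live workers are sent SIGTERM
const KILL_AFTER = 90; // seconds before live workers are sent SIGKILL

function shutdown_workers( array $workers ): void {
	$start = time();

	while ( $workers ) {
		$elapsed = time() - $start;

		foreach ( $workers as $id => $worker ) {
			$status = proc_get_status( $worker['process'] );

			if ( ! $status['running'] ) {
				// Worker has exited; its job status is handled by the normal
				// completed/failed bookkeeping, so stop tracking it.
				unset( $workers[ $id ] );
				continue;
			}

			if ( $elapsed >= KILL_AFTER && empty( $worker['killed'] ) ) {
				// Hard kill: mark the job failed and note in the logs that it was killed.
				mark_job_failed( $worker['job'], 'Job killed by SIGKILL during shutdown' );
				posix_kill( $status['pid'], SIGKILL );
				$workers[ $id ]['killed'] = true;
			} elseif ( $elapsed >= TERM_AFTER && empty( $worker['termed'] ) ) {
				// First escalation: forward SIGTERM so the job can shut down gracefully.
				posix_kill( $status['pid'], SIGTERM );
				$workers[ $id ]['termed'] = true;
			}
		}

		sleep( 1 );
	}
}
```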

rmccue commented Mar 27, 2024

> • In terminate(), after 30s if workers are still running, we send a SIGTERM to the processes
> • In terminate(), after 90s if workers are still running, we send a SIGKILL to the process

Aside: we may want to pick defaults that maximise the time before the worker is sent the SIGTERM, to avoid killing naive jobs. We should give jobs at least 60s to decide how to shut down, must stay within the 120s maximum, and need some time for ourselves to mark jobs as failed.

e.g. we could pick 50s/110s, which would give us 10s to update the DB and mark the jobs as failed.
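
For reference, a long-running job that wants to use that grace period would need to install its own handler, roughly like this (a sketch, assuming pcntl is available in the worker and the worker no longer ignores SIGTERM; the work-queue names are placeholders):

```php
<?php
// Sketch of a long-running job using the SIGTERM grace period to stop cleanly.
pcntl_async_signals( true );

$shutting_down = false;
pcntl_signal( SIGTERM, function () use ( &$shutting_down ) {
	$shutting_down = true;
} );

$work_items = get_work_items(); // placeholder for the job's own work queue

foreach ( $work_items as $item ) {
	if ( $shutting_down ) {
		// Save progress so the job can resume on its next scheduled run, then bail.
		break;
	}
	process_item( $item ); // placeholder for the job's own work unit
}
```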

rmccue commented Mar 27, 2024

Note: we'll also need to remove the ignore of SIGTERM in the command by default.
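
For context, "the ignore" is the worker process ignoring SIGTERM; if that's done via a pcntl_signal() call (or equivalent), the change would be roughly along these lines (illustrative only, not the exact code in the command):

```php
<?php
// Illustrative only: if the command currently ignores SIGTERM along these lines…
pcntl_signal( SIGTERM, SIG_IGN );

// …it would instead need default handling (or an explicit handler) so that the
// forwarded SIGTERM actually reaches the job:
pcntl_signal( SIGTERM, SIG_DFL );
```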

rmccue commented Mar 27, 2024

> In terminate(), after 30s if workers are still running, we send a SIGTERM to the processes

Something I'm not entirely clear on is how this will interact with init daemons; in humanmade/Cavalcade#20, we explicitly ignored SIGTERM because I think upstart was sending SIGTERM to all children.

From some research, systemd would require KillMode=mixed so that SIGTERM is only sent to the runner, but SIGKILL can clean up the whole group. (TimeoutStopSec would also need to be set appropriately to last longer than the Cavalcade equivalent.)

In our Dockerised case, we are the init daemon, so that's not an issue. I assume other init daemons have equivalents too that we can document.
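
For example, a systemd service for the runner might end up with something along these lines (a sketch only; the path is made up, and the timeout would need tuning so TimeoutStopSec outlasts the runner's own SIGTERM/SIGKILL escalation):

```ini
# Illustrative systemd unit fragment for the runner (not a tested config).
[Service]
ExecStart=/usr/local/bin/cavalcade-runner
# Send SIGTERM to the main (runner) process only on stop…
KillMode=mixed
# …but give the runner's own escalation time to finish before systemd
# SIGKILLs the remaining control group.
TimeoutStopSec=150
```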

@kovshenin

3600s seems to be a perfectly valid stopTimeout if we wanted to wait for long-running jobs to finish. Any reason we're not doing that? It seems an oversight to set it to 60s, unless we want some resiliency in Fargate?

rmccue commented Apr 18, 2024

@kovshenin Per the docs for stopTimeout:

> The valid values are 2-120 seconds.

(That said, ECS is specific to Altis rather than Cavalcade Runner, so I've covered that in the internal issue.)
