Make shutdown more resilient #75

Open · rmccue opened this issue Mar 27, 2024 · 5 comments
Comments

rmccue commented Mar 27, 2024

Cavalcade has a persistent failure mode where jobs "stall": they are marked in the DB as running but aren't actually running. This leaves them in a limbo state where the jobs can't be rescheduled (as they still appear in wp_next_scheduled() etc.), but aren't ever going to complete.

@owaincuvelier identified that many (all?) of our "stalled" jobs on Altis are coming from the container termination. This termination happens either as a result of autoscaling or when containers are replaced during deployments.

This is a known issue but I thought we had worked around it effectively; it appears that is not the case, so we need to improve the handling to be more aggressive.

The problem

When the container is gracefully terminated in ECS:

  • ECS sends a SIGTERM to the process
  • Cavalcade Runner catches this signal and runs Runner::terminate()
  • terminate() outputs `Cavalcade received terminate signal (15), shutting down 1 worker(s)` into the logs
  • terminate() waits for the workers to finish up and stops spawning new workers, removing the runner from the "pool"
  • In theory:
    • The workers complete their jobs and are marked as completed/failed as normal
    • The runner outputs `Shutting down! (Terminated by signal: 15)`

In practice for longer-lived jobs (>60s), what actually happens is:

  • terminate() keeps waiting for the workers to finish
  • ECS hits the stopTimeout (30s by default, but on Altis we have this configured to 60s)
  • ECS sends a SIGKILL to the process, immediately terminating the process and its children; hence, the job status is never updated

We increased the stopTimeout to help address this, but in practice it didn't help much, and it can't help with tasks that run longer than the timeout.

The key issue is that terminate() doesn't propagate the signal forward to the worker processes, so they will continue the job. For short-lived jobs this is fine, but long-running jobs will carry on as if nothing has changed until they are abruptly killed.
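
To make the gap concrete: the workers are child processes spawned by the runner, so nothing stops them unless the runner explicitly forwards the signal. A minimal sketch of what forwarding could look like, assuming the runner holds proc_open() handles for its workers (names here are hypothetical, not the actual Runner API):

```php
<?php
// Hypothetical sketch: forward a signal to each still-running worker child.
// Assumes each entry in $processes is a proc_open() resource.
function forward_signal_to_workers( array $processes, int $signal = SIGTERM ): void {
	foreach ( $processes as $process ) {
		$status = proc_get_status( $process );
		if ( $status['running'] ) {
			posix_kill( $status['pid'], $signal );
		}
	}
}
```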

Suggested solution

I'd like to suggest the following solutions, combined together (a rough sketch follows at the end of this comment):

  • In terminate(), after 30s, if workers are still running, we send a SIGTERM to their processes
  • In terminate(), after 90s, if workers are still running, we send a SIGKILL to their processes
  • Jobs killed by one of the two signals are marked as failed, and the logs indicate that they were killed
  • In our task config, we increase the stopTimeout to the maximum of 120s

This has the following results:

  • This keeps the current behaviour as it is for short running jobs (<30s)
  • For longer-running jobs which aren't designed to run long, they'll get killed by the SIGTERM and an error message will appear in the logs
  • For longer-running jobs which are designed to run long, they can catch the SIGTERM and have 60s to decide how to shut down gracefully
  • For jobs which get killed, the logs will reflect this and allow developers a better understanding
  • The "stuck" jobs caused by this process should be eliminated

For others using Cavalcade, we can clearly document this process and how systems should be configured. We should make the 30s/90s thresholds configurable for different systems, including the ability to set either or both to 0 (meaning immediately) or to disable the functionality entirely.
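
To illustrate how the pieces above could fit together, here's a rough sketch of the escalation logic (hypothetical structure and helper names, not the actual Runner code; the 30s/90s values would come from configuration rather than being hard-coded):

```php
<?php
// Rough sketch of an escalating shutdown for still-running workers.
// Each $worker entry wraps a proc_open() resource plus its job record;
// mark_job_failed() is a hypothetical helper.
const TERM_AFTER = 30; // seconds before live workers are sent SIGTERM
const KILL_AFTER = 90; // seconds before live workers are sent SIGKILL

function shutdown_workers( array $workers ): void {
	$start = time();

	while ( $workers ) {
		$elapsed = time() - $start;

		foreach ( $workers as $id => $worker ) {
			$status = proc_get_status( $worker['process'] );

			if ( ! $status['running'] ) {
				// Worker has exited; its job status is handled by the normal
				// completed/failed bookkeeping, so stop tracking it.
				unset( $workers[ $id ] );
				continue;
			}

			if ( $elapsed >= KILL_AFTER && empty( $worker['killed'] ) ) {
				// Hard kill: mark the job failed and note in the logs that it was killed.
				mark_job_failed( $worker['job'], 'Job killed by SIGKILL during shutdown' );
				posix_kill( $status['pid'], SIGKILL );
				$workers[ $id ]['killed'] = true;
			} elseif ( $elapsed >= TERM_AFTER && empty( $worker['termed'] ) ) {
				// First escalation: forward SIGTERM so the job can shut down gracefully.
				posix_kill( $status['pid'], SIGTERM );
				$workers[ $id ]['termed'] = true;
			}
		}

		sleep( 1 );
	}
}
```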

rmccue commented Mar 27, 2024

> • In terminate(), after 30s if workers are still running, we send a SIGTERM to the processes
> • In terminate(), after 90s if workers are still running, we send a SIGKILL to the process

Aside: we may want to pick defaults that maximise the time before the worker is sent the SIGTERM, to avoid killing naive jobs. We should give jobs at least 60s to decide how to shut down, must stay within the 120s maximum, and need some time for ourselves to mark jobs as failed.

e.g. we could pick 50s/110s, which would give us 10s to update the DB and mark the jobs as failed.
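
For reference, a long-running job that wants to use that grace period would need to install its own handler, roughly like this (a sketch, assuming pcntl is available in the worker and the worker no longer ignores SIGTERM; the work-queue names are placeholders):

```php
<?php
// Sketch of a long-running job using the SIGTERM grace period to stop cleanly.
pcntl_async_signals( true );

$shutting_down = false;
pcntl_signal( SIGTERM, function () use ( &$shutting_down ) {
	$shutting_down = true;
} );

$work_items = get_work_items(); // placeholder for the job's own work queue

foreach ( $work_items as $item ) {
	if ( $shutting_down ) {
		// Save progress so the job can resume on its next scheduled run, then bail.
		break;
	}
	process_item( $item ); // placeholder for the job's own work unit
}
```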

rmccue commented Mar 27, 2024

Note: we'll also need to remove the ignore of SIGTERM in the command by default.
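
For context, "the ignore" is the worker process ignoring SIGTERM; if that's done via a pcntl_signal() call (or equivalent), the change would be roughly along these lines (illustrative only, not the exact code in the command):

```php
<?php
// Illustrative only: if the command currently ignores SIGTERM along these lines…
pcntl_signal( SIGTERM, SIG_IGN );

// …it would instead need default handling (or an explicit handler) so that the
// forwarded SIGTERM actually reaches the job:
pcntl_signal( SIGTERM, SIG_DFL );
```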

rmccue commented Mar 27, 2024

> In terminate(), after 30s if workers are still running, we send a SIGTERM to the processes

Something I'm not entirely clear on is how this will interact with init daemons; in humanmade/Cavalcade#20, we explicitly ignored SIGTERM because I think upstart was sending SIGTERM to all children.

From some research, systemd would require KillMode=mixed so that SIGTERM is only sent to the runner, but SIGKILL can clean up the whole group. (TimeoutStopSec would also need to be set appropriately to last longer than the Cavalcade equivalent.)

In our Dockerised case, we are the init daemon, so that's not an issue. I assume other init daemons have equivalents too that we can document.
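
For example, a systemd service for the runner might end up with something along these lines (a sketch only; the path is made up, and the timeout would need tuning so TimeoutStopSec outlasts the runner's own SIGTERM/SIGKILL escalation):

```ini
# Illustrative systemd unit fragment for the runner (not a tested config).
[Service]
ExecStart=/usr/local/bin/cavalcade-runner
# Send SIGTERM to the main (runner) process only on stop…
KillMode=mixed
# …but give the runner's own escalation time to finish before systemd
# SIGKILLs the remaining control group.
TimeoutStopSec=150
```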

@kovshenin

3600s seems to be a perfectly valid stopTimeout if we wanted to wait for long-running jobs to finish. Any reason we're not doing that? It seems an oversight to set it to 60s, unless we want some resiliency in Fargate?

rmccue commented Apr 18, 2024

@kovshenin Per the docs for stopTimeout:

> The valid values are 2-120 seconds.

(That said, ECS is specific to Altis rather than Cavalcade Runner, so I've covered that in the internal issue.)
