Make shutdown more resilient #75
Comments
Aside: we may want to pick defaults that maximise the time before the worker is sent the SIGTERM, to avoid killing naive jobs. We should give jobs at least 60s to decide how to shut down, must stay within a maximum of 120s, and need some time for ourselves to mark jobs as failed. e.g. we could pick 50s/110s, which would give us 10s to update the DB to mark jobs as failed.
Note: we'll also need to remove the default ignoring of SIGTERM in the command.
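For illustration, here is the difference in a long-running PHP CLI process. This is a minimal sketch only; the handler body and messages are hypothetical and this is not the actual Cavalcade command wiring:

```php
<?php
// Minimal sketch of ignoring vs. handling SIGTERM in a long-running PHP CLI
// process; illustrative only, not the actual Cavalcade command code.

// Current behaviour being discussed: SIGTERM is ignored entirely.
// pcntl_signal(SIGTERM, SIG_IGN);

// Proposed behaviour: handle SIGTERM so the job can wind down cleanly.
pcntl_signal(SIGTERM, function (int $signo): void {
    // A real job would finish or checkpoint its current work here.
    fwrite(STDERR, "Received SIGTERM ({$signo}), exiting cleanly\n");
    exit(0);
});

while (true) {
    // ... do a unit of work ...
    sleep(1);
    pcntl_signal_dispatch();
}
```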
Something I'm not entirely clear on is how this will interact with init daemons; in humanmade/Cavalcade#20, we explicitly ignored SIGTERM because I think upstart was sending SIGTERM to all children. From some research, systemd would require […]. In our Dockerised case, we are the init daemon, so that's not an issue. I assume other init daemons have equivalents too that we can document.
3600 seems to be a perfectly valid stopTimeout if we wanted to wait for long-running jobs to finish. Any reason we're not doing that? Seems to be an oversight to set it to 60, unless we want some resiliency in Fargate?
@kovshenin Per the docs for stopTimeout:
(That said, ECS is specific to Altis rather than Cavalcade Runner, so I've covered that in the internal issue.)
Cavalcade has a persistent failure mode where jobs "stall" - i.e. they are marked in the DB as `running` but aren't actually running. This leaves them in a limbo state where the jobs can't be rescheduled (as they still appear in `wp_next_scheduled()` etc.), but aren't ever going to complete.

@owaincuvelier identified that many (all?) of our "stalled" jobs on Altis are coming from container termination. This termination happens either as a result of autoscaling or when containers are replaced during deployments.
This is a known issue but I thought we had worked around it effectively; it appears that is not the case, so we need to improve the handling to be more aggressive.
The problem
When the container is gracefully terminated in ECS:

- The termination signal (SIGTERM) triggers `Runner::terminate()`
- `terminate()` outputs `Cavalcade received terminate signal (15), shutting down 1 worker(s)` into the logs
- `terminate()` waits for the workers to finish up and stops spawning new workers, removing the runner from the "pool"
- Once the workers have finished, the runner logs `Shutting down! (Terminated by signal: 15)` and exits
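In code terms, the behaviour described above is roughly this shape. This is an illustrative sketch only, not the actual Runner implementation; `$workers` is assumed to hold the `proc_open()` handles for running jobs:

```php
<?php
// Illustrative sketch of the current shutdown behaviour, not the real code.
// $workers is assumed to be an array of proc_open() handles for running jobs.
function terminate(array &$workers, int $signal): void
{
    printf(
        "Cavalcade received terminate signal (%d), shutting down %d worker(s)\n",
        $signal,
        count($workers)
    );

    // Stop spawning new workers and wait for existing ones to finish.
    // Nothing here ever signals the worker processes themselves.
    while (! empty($workers)) {
        foreach ($workers as $id => $process) {
            if (! proc_get_status($process)['running']) {
                proc_close($process);
                unset($workers[$id]);
            }
        }
        sleep(1);
    }

    printf("Shutting down! (Terminated by signal: %d)\n", $signal);
}
```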
In practice, for longer-lived jobs (>60s), what actually happens is:

- `terminate()` keeps waiting for the workers to finish
- ECS kills the container once the stopTimeout expires, abruptly terminating the workers mid-job

We increased the stopTimeout to help address this, but in practice it didn't help massively, and it doesn't help with any tasks which last longer.
The key issue is that `terminate()` doesn't propagate the signal forward to the worker processes, so they will continue the job. For any short-lived jobs this is fine, but for long-running jobs, this means they'll continue on as if nothing has changed until they are abruptly killed.

Suggested solution
I'd like to suggest the following solutions, combined together:

- In `terminate()`, after 30s, if workers are still running, we send a SIGTERM to the processes
- In `terminate()`, after 90s, if workers are still running, we send a SIGKILL to the process (see the sketch at the end of this issue)

This has the following results: […]
For others using Cavalcade, we can document this process clearly, along with how systems should be configured. We should make the 30s/90s configurable for different systems, including the ability to set either/both to `0` (meaning immediately) or to disable the functionality entirely.
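As a sketch of how the escalation and the configurable timeouts could fit together, here is a minimal illustration. It assumes `$workers` holds `proc_open()` handles; the function name, parameters, and defaults are hypothetical, not existing Cavalcade Runner options:

```php
<?php
// Sketch of the proposed escalation, with the 30s/90s thresholds configurable.
// Passing 0 sends the signal immediately; passing null disables that step.
// The SIGTERM/SIGKILL constants come from the pcntl extension.
function shutdown_workers(array &$workers, ?int $term_after = 30, ?int $kill_after = 90): void
{
    $start     = time();
    $sent_term = false;
    $sent_kill = false;

    while (! empty($workers)) {
        $elapsed = time() - $start;

        // Step 1: ask the workers to stop cleanly.
        if (! $sent_term && $term_after !== null && $elapsed >= $term_after) {
            foreach ($workers as $process) {
                proc_terminate($process, SIGTERM);
            }
            $sent_term = true;
        }

        // Step 2: force-kill anything still running, leaving time to mark
        // the jobs as failed before the container itself is killed.
        if (! $sent_kill && $kill_after !== null && $elapsed >= $kill_after) {
            foreach ($workers as $process) {
                proc_terminate($process, SIGKILL);
            }
            $sent_kill = true;
        }

        // Reap workers that have exited; a real implementation would update
        // the job status (completed/failed) in the database here.
        foreach ($workers as $id => $process) {
            if (! proc_get_status($process)['running']) {
                proc_close($process);
                unset($workers[$id]);
            }
        }

        sleep(1);
    }
}
```

One thing worth documenting alongside this: `proc_terminate()` only signals the immediate child, so if the worker command is wrapped in a shell, the signal may land on the shell rather than the job itself and would need forwarding.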