Hard kill resilience with execution counts #922
Comments
We recently introduced (#830) an extension that raises a rescuable exception if the job was interrupted/terminated during execution: https://github.com/bensheldon/good_job/#interrupts. I think that might address your need... unless the termination happens while Active Job is deserializing the arguments, before the Active Job execution callbacks are invoked (I could imagine that hydrating a huge number of global-id objects could wreck it).

I also like your suggestion of atomically incrementing a value when the job is first fetched. I am trying to defer that sort of thing until #831 ever happens, but if it's a dealbreaker I don't want to defer it too much.
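For concreteness, here is a minimal sketch of opting a (hypothetical) job class into that behavior, assuming the rescuable exception is `GoodJob::InterruptError` as described in the linked README section; adjust the class name to whatever your good_job version actually raises:

```ruby
# A minimal sketch, assuming the rescuable exception is GoodJob::InterruptError
# (see the "Interrupts" section of the README linked above). HeavyJob is a
# placeholder name for illustration.
class HeavyJob < ApplicationJob
  # Retry a bounded number of times after a hard kill, then give up instead of
  # looping forever as a poison pill.
  retry_on GoodJob::InterruptError, wait: 30.seconds, attempts: 3

  def perform(record)
    # memory-heavy work goes here
  end
end
```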
What is this black sorcery 🧙 OK, we'll try that new interrupt exception; I think it is sane enough to configure it for all jobs, actually (at least for us). Incrementing a "checkout counter" could be done with the selection query, as part of the …
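The comment above trails off, but the underlying idea of bumping a counter atomically at checkout might look roughly like the sketch below. This is purely illustrative: `checkout_count` is a hypothetical column that does not exist in good_job's schema, and the plain `FOR UPDATE SKIP LOCKED` fetch is used here only for brevity in place of good_job's advisory-lock-based fetch.

```ruby
# Illustrative only: NOT good_job's actual fetch query. The point is the shape
# of an atomic "increment on checkout": the counter is bumped in the same
# statement that claims the job, so a later hard kill still leaves a persisted
# record that the execution was started. checkout_count is hypothetical.
checked_out = ActiveRecord::Base.connection.exec_query(<<~SQL).first
  UPDATE good_jobs
  SET checkout_count = checkout_count + 1
  WHERE id = (
    SELECT id
    FROM good_jobs
    WHERE finished_at IS NULL
    ORDER BY priority DESC NULLS LAST, created_at ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 1
  )
  RETURNING *
SQL
```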
We have encountered a peculiar pattern with one of our heavy jobs. When executing, it would exhaust the RAM limit on the GCP instance. As we run Docker + "naked" GCP, what would happen is that GCP would reboot our instance with no warning (the OOM killer would kill the process, there would consequently be no healthcheck response, and GCP would "auto heal" by rebooting). Despite configuring the executions limit with `retry_on`, we haven't found a way to make sure these retries are honored in the case of such "hard kills". In our case this led to a "poison pill" job which would endlessly restart on the cluster, exhausting the memory of the instance and leading to another hard reboot. The advisory lock gets released properly, of course.
The SRE book describes a nice way of measuring the number of failures: it recommends recording "start" and "ok" events, but not "failure", because the system may fail during execution in such a way that there is no opportunity to record the failure at all.
Could we implement something similar (or change the semantics of `executions` to increment on checkout, for instance) so that there would be some protection against those hard kills? This could imply that the display in the dashboard would also change: for example, a job that "is executing" might become "executing, or the executing system has been killed or hung". Or it could imply just a change in where the executions get incremented, avoiding endless restarts.
Curious to know what the options would be?