-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standardise using Cronitor for all scheduled tasks #3380
Conversation
Previously we only used this wrapper for tasks in "nightly_tasks" and "reporting_tasks", but it's important these tasks are running regularly as well, and we shouldn't rely on them emitting lots of errors in order to know otherwise. Since I'm about to add another task to this file, this seems like a good opportunity to fix what seems to be a gap in our monitoring. Some of the tasks run frequently, others less so. We may want to tune Cronitor for some of them - the more critical ones - and set a sensible default for others e.g. a bit over a day.
a68eb66
to
ac6ce90
Compare
I'd suggest a review from @leohemsted on this but there is a chance that each cronitor task specified in here needs an equivalent set up manually in the cronitor user interface... |
@idavidmcdonald is correct - we'll need to create the alert in cronitor, and then add the URL/task-name to the |
Thanks @leohemsted, I didn't realise. I'll mention some of what @servingUpAces said on Slack too:
Resource-wise, it's just a couple of HTTP requests, which aren't going to slow down anything considerably when we're talking in minutes. We'll also still continue running the task if Cronitor is down for some reason. I'm planning to add a
Here's an overview of what each proposal would look like:
Still feels a bit arbitrary. I'm trying to think about them in terms of "would I bother writing a manual alert for this?". I don't think we need to monitor monitoring tasks like "trigger link tests", and we could potentially drop the "tv numbers" one. We also don't need to have an alert for everything, but (2) strikes me as a valuable addition. What do we think? (I would also accept a blanket "let's leave it as-is and not add |
Been quiet on this for a while but I guess one thought I'm struggling with a bit is - what's the next step if a fifteen minutely task fails? What's the lag on us finding out about it through a cronitor email? We added cronitor originally because nightly tasks were silently failing due to paas cells rolling at a similar time of night to these tasks running, and them being killed without having a chance to emit a stack trace error log. If a nightly task silently fails, we'll likely investigate the cause, and may well need to and manually kick it off again. If a fifteen minutely task silently fails, if we happen to see the cronitor email before the next run, which isn't a guarantee given there'd be a buffer before the email sends. Would we have any action other than "wait for the next 15 minute interval and see what happens"? Same with the hourly tasks. They're non-essential to run immediately, if we delete old invites an hour late, that shouldn't impact the system badly. If either task noisily fails we'll get an error log, which will send an email out which we should act on. If we're not acting on those emails, I think the answer is to reduce the amount of error logs we emit that we can't act on, rather than increasing the volume of a different type of error log. I think I'd rather we just improve the quality of the Or you'll find before long we're all filtering cronitor emails into a folder somewhere that none of us read and the whole cycle will continue. |
I don't have a strong opinion on this (partially because I'm finding it hard to evaluate which is the best option). Unless people feel strongly and we have a good idea of how long it might take to start using one of the suggested approaches, I'd suggest we don't do anything about changing our cronitor approach for the moment. Happy for you to make the call Ben. |
@idavidmcdonald I'm content to leave it as-is and just document how we use it now, so it's clear in future, based on some of @leohemsted's comment and talking in Slack. I propose we have a quick decision record like the following, which links back to this PR (can discuss elsewhere if controversial):
What do you think? |
I like your idea Ben and that is a nice length. One thing that might be good is to clarify that this is currently what we are doing, rather than a decision made today to adopt a new process (I think). |
Thanks @idavidmcdonald. I've a note about it being retrospective in the Wiki version. Closing in favour of new documentation: https://github.com/alphagov/notifications-manuals/wiki/Decision-records#use-cronitor-to-monitor-daily-tasks---7122021 |
https://www.pivotaltracker.com/story/show/180344153
Previously we only used this wrapper for tasks in "nightly_tasks"
and "reporting_tasks", but it's important these tasks are running
regularly as well, and we shouldn't rely on them emitting lots of
errors in order to know otherwise.
Since I'm about to add another task to this file, this seems like
a good opportunity to fix what seems to be a gap in our monitoring.
Some of the tasks run frequently, others less so. We may want to
tune Cronitor for some of them - the more critical ones - and set
a sensible default for others e.g. a bit over a day.