Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardise using Cronitor for all scheduled tasks #3380

Closed
wants to merge 1 commit into from

Conversation

benthorner
Copy link
Contributor

https://www.pivotaltracker.com/story/show/180344153

Previously we only used this wrapper for tasks in "nightly_tasks"
and "reporting_tasks", but it's important these tasks are running
regularly as well, and we shouldn't rely on them emitting lots of
errors in order to know otherwise.

Since I'm about to add another task to this file, this seems like
a good opportunity to fix what seems to be a gap in our monitoring.

Some of the tasks run frequently, others less so. We may want to
tune Cronitor for some of them - the more critical ones - and set
a sensible default for others e.g. a bit over a day.

Previously we only used this wrapper for tasks in "nightly_tasks"
and "reporting_tasks", but it's important these tasks are running
regularly as well, and we shouldn't rely on them emitting lots of
errors in order to know otherwise.

Since I'm about to add another task to this file, this seems like
a good opportunity to fix what seems to be a gap in our monitoring.

Some of the tasks run frequently, others less so. We may want to
tune Cronitor for some of them - the more critical ones - and set
a sensible default for others e.g. a bit over a day.
@benthorner benthorner force-pushed the scheduled-cronitor-180344153 branch from a68eb66 to ac6ce90 Compare November 24, 2021 13:42
@idavidmcdonald
Copy link
Contributor

I'd suggest a review from @leohemsted on this but there is a chance that each cronitor task specified in here needs an equivalent set up manually in the cronitor user interface...

@leohemsted
Copy link
Contributor

@idavidmcdonald is correct - we'll need to create the alert in cronitor, and then add the URL/task-name to the CRONITOR_KEYS dict in credentials/production/paas/environment-variables

@benthorner
Copy link
Contributor Author

Thanks @leohemsted, I didn't realise.

I'll mention some of what @servingUpAces said on Slack too:

  • We didn't add cronitor to catch bugs but to catch tasks that run once a day but do not complete because the app is restart while it's processing.
  • Scheduled jobs is probably alright to check that it runs once an hour, we trigger the task once every 15 minutes, but services can only schedule once an hour.
  • I'm more convinced about adding this monitor for less frequent jobs. But still not 100% convinced.
  • Although adding £28/month to our running costs doesn't seem like much, we should still consider it and check with Caley or at least make her aware that the bill is going up.

Resource-wise, it's just a couple of HTTP requests, which aren't going to slow down anything considerably when we're talking in minutes. We'll also still continue running the task if Cronitor is down for some reason.

I'm planning to add a @cronitor for the new hourly task I'm adding, so I'd like to reach a conclusion on the existing ones here. Here's a few different proposals of what we could be checking for:

  1. ">= daily" - check tasks that run about once a day, in case they get killed due to a deploy / transient issue.
  2. ">= daily + critical tasks" - adds monitoring where we do scheduling on behalf of the user (jobs, alerts).
  3. ">= daily + critical tasks + maintenance" - also adds protection for issues that would build up over time.

Here's an overview of what each proposal would look like:

Task Frequency proposal (>= daily) proposal (>= daily + critical tasks) proposal (>= daily + critical tasks + maintenance)
run-scheduled-jobs every 15 mins
delete-verify-codes ~hourly
delete-invitations ~hourly
switch-current-sms-provider-on-slow-delivery minutely
check-job-status minutely
tend-providers-back-to-middle every 5 mins
check-for-missing-rows-in-completed-jobs every 10 mins
replay-created-notifications every 15 mins
check-if-letters-still-pending-virus-check week-daily (change to daily)
check-if-letters-still-in-created week-daily (change to daily)
check-for-services-with-high-failure-rates-or-sending-to-tv-numbers week-daily
trigger-link-tests every 15 mins
auto-expire-broadcast-messages every 5 mins
remove-yesterdays-planned-tests-on-govuk-alerts daily
(my new hourly task) hourly

Still feels a bit arbitrary. I'm trying to think about them in terms of "would I bother writing a manual alert for this?". I don't think we need to monitor monitoring tasks like "trigger link tests", and we could potentially drop the "tv numbers" one. We also don't need to have an alert for everything, but (2) strikes me as a valuable addition.

What do we think?

(I would also accept a blanket "let's leave it as-is and not add @cronitor to anything for now".)

@leohemsted
Copy link
Contributor

Been quiet on this for a while but I guess one thought I'm struggling with a bit is - what's the next step if a fifteen minutely task fails? What's the lag on us finding out about it through a cronitor email?

We added cronitor originally because nightly tasks were silently failing due to paas cells rolling at a similar time of night to these tasks running, and them being killed without having a chance to emit a stack trace error log. If a nightly task silently fails, we'll likely investigate the cause, and may well need to and manually kick it off again.

If a fifteen minutely task silently fails, if we happen to see the cronitor email before the next run, which isn't a guarantee given there'd be a buffer before the email sends. Would we have any action other than "wait for the next 15 minute interval and see what happens"? Same with the hourly tasks. They're non-essential to run immediately, if we delete old invites an hour late, that shouldn't impact the system badly.

If either task noisily fails we'll get an error log, which will send an email out which we should act on. If we're not acting on those emails, I think the answer is to reduce the amount of error logs we emit that we can't act on, rather than increasing the volume of a different type of error log.

I think I'd rather we just improve the quality of the production-notify-delivery-worker-reporting-exception-warning and production-notify-delivery-worker-periodic-exception-warning alarm emails, rather than add a whole new and relatively costly way of finding out that things didn't work.

Or you'll find before long we're all filtering cronitor emails into a folder somewhere that none of us read and the whole cycle will continue.

@idavidmcdonald
Copy link
Contributor

I don't have a strong opinion on this (partially because I'm finding it hard to evaluate which is the best option).

Unless people feel strongly and we have a good idea of how long it might take to start using one of the suggested approaches, I'd suggest we don't do anything about changing our cronitor approach for the moment.

Happy for you to make the call Ben.

@benthorner
Copy link
Contributor Author

@idavidmcdonald I'm content to leave it as-is and just document how we use it now, so it's clear in future, based on some of @leohemsted's comment and talking in Slack. I propose we have a quick decision record like the following, which links back to this PR (can discuss elsewhere if controversial):

More detail in this discussion PR.

We use Cronitor to monitor infrequent but critical tasks have run as expected e.g. collate-letter-pdfs-to-be-sent must run on schedule. While we usually get an alert if a task logs an exception, Cronitor covers us if a task fails silently, which is usually due to a deployment or if an instance gets recycled.

We don't add Cronitor to more frequent tasks. Adding a task requires extra config in -credentials and costs more. There's also a risk of reducing the value of Cronitor alerts if we alert on all tasks indiscriminately, since more frequent tasks naturally recur without needing any action from us.

We did consider the potential for Cronitor to cover us for schedule / scheduler bugs, but we should get alerts about these together with other errors. We don't add Cronitor for daily -alert- tasks (e.g. check-if-letters-still-in-created) - although these are critical, they are really temporary until we have more timely alerts.

What do you think?

@idavidmcdonald
Copy link
Contributor

I like your idea Ben and that is a nice length.

One thing that might be good is to clarify that this is currently what we are doing, rather than a decision made today to adopt a new process (I think).

@benthorner
Copy link
Contributor Author

Thanks @idavidmcdonald. I've a note about it being retrospective in the Wiki version.

Closing in favour of new documentation: https://github.com/alphagov/notifications-manuals/wiki/Decision-records#use-cronitor-to-monitor-daily-tasks---7122021

@benthorner benthorner closed this Dec 7, 2021
@benthorner benthorner deleted the scheduled-cronitor-180344153 branch December 7, 2021 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants