This repository has been archived by the owner on Jul 12, 2023. It is now read-only.

Switch to forward-progress alerting #1929

Merged
merged 1 commit into from
Mar 20, 2021

Conversation

sethvargo
Member

This changes most of our background jobs to alert when forward progress is not achieved, rather than on specific failures. This required increasing the run frequency of the modeler and appsync workers because of how Cloud Monitoring calculates windows.

Part of #1777:

  • appsync-worker - appsync runs every 4h, alert after 2 failures
  • backup-database-worker - alert on all failures, page, playbook should have us run a manual backup
  • cleanup worker - cleanup runs every 1h, alert after 4 failures
  • docker-mirror-worker - email on failures, do not page. no longer in use
  • e2e - page on each failure, but not if the scheduler fails to reach the container
  • modeler - modeler runs every 4h, alert after 2 failures
  • realm-key-rotation-worker - realm-key-rotation runs every 15m, alert after 2 failures
  • rotation worker - rotation runs every 30m, alert after 2 failures
  • stats-puller - stats-puller runs every 15m, alert after 2 failures

There's a 5min buffer at the end of each alert window, so that if a later attempt has only just succeeded, it still lands inside the window and the alert does not fire.

Release Note

Switch to forward-progress alerting for most background jobs. See the updated ForwardProgressFailed.md documentation for more information.

@sethvargo sethvargo requested a review from a team as a code owner March 20, 2021 00:11
@google-cla google-cla bot added the cla: yes Auto: added by CLA bot when all committers have signed a CLA. label Mar 20, 2021
@@ -59,14 +59,14 @@ func init() {
Aggregation: view.Count(),
},
{
Name: metricPrefix + "/token_success",
Member Author

Had to rename these so we can use the same MQL query for .success (they are new metrics I just created in the last PR, so no prod impact).

@sethvargo sethvargo requested a review from whaught March 20, 2021 00:19
@sethvargo sethvargo merged commit 564c0f2 into main Mar 20, 2021
@sethvargo sethvargo deleted the sethvargo/ff branch March 20, 2021 20:26
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 22, 2021