This repository has been archived by the owner on Jul 12, 2023. It is now read-only.

Add monitoring and alerting for realm capacity #645

Merged
merged 4 commits into from
Sep 23, 2020

Conversation

femnad
Contributor

@femnad femnad commented Sep 23, 2020

Fixes #572

Proposed Changes

  • Record metric for realm capacity on issuing a verification token
  • Add dashboard and alerting for monitoring realm capacity

Release Note

Record and monitor realm verification token capacity

@googlebot googlebot added the cla: yes Auto: added by CLA bot when all committers have signed a CLA. label Sep 23, 2020
Member

@sethvargo sethvargo left a comment

Hmm, I was thinking more that we'd generate a metric/alert when a realm reached its capacity (or when the remaining capacity was less than some small number like "10"). The result from Take should include all the info you need without making the extra db calls.

pkg/controller/issueapi/issue.go
@femnad
Contributor Author

femnad commented Sep 23, 2020

Hmm, I was thinking more that we'd generate a metric/alert when a realm reached its capacity (or when the remaining capacity was less than some small number like "10"). The result from Take should include all the info you need without making the extra db calls.

Looking at what Take returns, I see now that this would have been more straightforward. Should I change this to compare the remaining count to a constant (10) and have a two-state alert based on that?
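For illustration, the fixed-threshold idea could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from this PR: `takeResult`, `fakeLimiter`, `newFakeLimiter`, `nearCapacity`, and `lowCapacityThreshold` are hypothetical names, and the real Take has its own signature. The point is only that the limiter's result already carries the limit and the remaining count, so no extra db call is needed.

```go
package main

import "fmt"

// takeResult is a hypothetical stand-in for the values a limiter's Take
// call reports. The field names are assumptions, not the real API.
type takeResult struct {
	Limit     uint64 // total tokens configured for the realm
	Remaining uint64 // tokens left after this call
	OK        bool   // whether a token was granted
}

// fakeLimiter is an in-memory limiter used only for this illustration.
type fakeLimiter struct {
	limit uint64
	used  map[string]uint64
}

func newFakeLimiter(limit uint64) *fakeLimiter {
	return &fakeLimiter{limit: limit, used: make(map[string]uint64)}
}

// Take consumes one token for key and reports the resulting state.
func (l *fakeLimiter) Take(key string) takeResult {
	if l.used[key] >= l.limit {
		return takeResult{Limit: l.limit, Remaining: 0, OK: false}
	}
	l.used[key]++
	return takeResult{Limit: l.limit, Remaining: l.limit - l.used[key], OK: true}
}

// lowCapacityThreshold mirrors the "small number like 10" from the review.
const lowCapacityThreshold = 10

// nearCapacity reports whether fewer than lowCapacityThreshold tokens remain.
// It uses only the Take result, so it needs no additional lookup.
func nearCapacity(r takeResult) bool {
	return r.Remaining < lowCapacityThreshold
}

func main() {
	limiter := newFakeLimiter(12)
	for i := 0; i < 4; i++ {
		r := limiter.Take("realm-1")
		fmt.Printf("remaining=%d nearCapacity=%t\n", r.Remaining, nearCapacity(r))
	}
}
```

A two-state alert would then fire on the boolean alone, which is simple but ignores how large the realm's quota is.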

Member

@sethvargo sethvargo left a comment

/assign @icco

@@ -254,6 +255,8 @@ func (c *Controller) HandleIssue() http.Handler {
return
}

c.recordCapacity(ctx, realm, remaining)
Member

Yeah, so we know their total configured limit and the number remaining. I'm gonna defer to @icco on whether we want to alert on a fixed value (e.g. 10) or a ratio here (e.g. < 10% quota remaining).

Contributor

We wanna alert on percentage, but the metrics should be the actual numbers. So record remaining and total here, and calculate the percentage in the alert.
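The split between raw metrics and a derived percentage can be sketched like this. It is an illustrative sketch only: `capacitySample`, `remainingRatio`, and `shouldAlert` are hypothetical names, and in practice the ratio would be computed by the alerting backend rather than in Go. Treating a zero limit as "no quota configured" is also an assumption made here.

```go
package main

import "fmt"

// capacitySample holds the raw numbers that would be recorded as metrics.
// The type and field names are hypothetical.
type capacitySample struct {
	Limit     uint64 // total configured quota
	Remaining uint64 // tokens still available
}

// remainingRatio derives the fraction of quota left from the raw numbers.
// Assumption: a zero limit means no quota is configured, so it reports 1.
func remainingRatio(s capacitySample) float64 {
	if s.Limit == 0 {
		return 1
	}
	return float64(s.Remaining) / float64(s.Limit)
}

// shouldAlert reports whether the remaining fraction is below minRatio,
// e.g. 0.10 for "less than 10% quota remaining".
func shouldAlert(s capacitySample, minRatio float64) bool {
	return remainingRatio(s) < minRatio
}

func main() {
	samples := []capacitySample{
		{Limit: 1000, Remaining: 500},
		{Limit: 1000, Remaining: 50},
		{Limit: 20, Remaining: 5},
	}
	for _, s := range samples {
		fmt.Printf("limit=%d remaining=%d alert=%t\n",
			s.Limit, s.Remaining, shouldAlert(s, 0.10))
	}
}
```

Keeping the recorded values as plain counts means the threshold can change later without touching the code that emits them.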

Contributor Author

Can we have alerting based on multiple metrics? From what I can tell Stackdriver only seemed to consider a single metric, am I wrong?

Contributor

Uhm, my understanding was that we can do math now with Query Notation, but I have not tried it.

Contributor Author

Yeah, I can do something like the following to calculate memory utilization in metrics explorer via MQL:

{ fetch 'compute.googleapis.com/instance/memory/balloon/ram_used'
; fetch 'compute.googleapis.com/instance/memory/balloon/ram_size' }
| join
| div

However, I couldn't get something similar to work in a Terraform alert policy filter; it doesn't seem possible to select multiple metrics by ORing the metric type.

I'll dig further, but in the meantime I've added metrics for remaining and issued tokens, which is closer to what we want here.

Contributor

Ah, makes sense. Recording all three below seems fine.

Contributor

@icco icco left a comment

.

@sethvargo
Member

/lgtm
/hold

I'd like @icco to give the final approval

* Avoid db calls by using the remaining token count returned by Take
* Not recording capacity is not fatal
* Reverse the capacity logic to alert above 90% utilization
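The last two points of this commit message can be sketched together as below. This is a hedged illustration, not the merged code: `recorder`, `errRecordFailed`, `utilization`, `overUtilized`, and `issueToken` are hypothetical names. It shows a recording failure being logged without failing the request, and utilization computed as issued tokens over the limit so that an alert fires above 0.9.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// recorder is a hypothetical hook that writes capacity metrics somewhere.
type recorder func(realmID string, limit, remaining uint64) error

// errRecordFailed simulates a metrics backend being unavailable.
var errRecordFailed = errors.New("metrics backend unavailable")

// utilization is the fraction of the quota already issued.
// Assumption: inconsistent input (no limit, or remaining > limit) yields 0.
func utilization(limit, remaining uint64) float64 {
	if limit == 0 || remaining > limit {
		return 0
	}
	issued := limit - remaining
	return float64(issued) / float64(limit)
}

// overUtilized reports whether utilization exceeds max, e.g. 0.9 for 90%.
func overUtilized(limit, remaining uint64, max float64) bool {
	return utilization(limit, remaining) > max
}

// issueToken stands in for the request handler. A failure to record
// capacity is logged and otherwise ignored, so issuing still succeeds.
func issueToken(realmID string, limit, remaining uint64, rec recorder) string {
	if err := rec(realmID, limit, remaining); err != nil {
		log.Printf("failed to record capacity for %s: %v", realmID, err)
	}
	return "issued"
}

func main() {
	failing := func(string, uint64, uint64) error { return errRecordFailed }
	fmt.Println(issueToken("realm-1", 100, 5, failing))
	fmt.Println(overUtilized(100, 5, 0.9))
}
```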
@femnad
Contributor Author

femnad commented Sep 23, 2020

New changes are detected. LGTM label has been removed.

Had to rebase on main to fix a conflict.

Contributor

@icco icco left a comment

/lgtm

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: femnad, icco

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@icco
Contributor

icco commented Sep 23, 2020

/unhold

@google-oss-robot google-oss-robot merged commit 4a8b236 into google:main Sep 23, 2020
@femnad femnad deleted the 572-capacity-monitoring-pr branch September 23, 2020 21:29
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 6, 2020
Development

Successfully merging this pull request may close these issues.

Set up monitoring and alerting when realms are at certain capacity levels
5 participants