-
Notifications
You must be signed in to change notification settings - Fork 83
Add monitoring and alerting for realm capacity #645
Add monitoring and alerting for realm capacity #645
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I was more thinking we'd generate a metric/alert when a realm reached their capacity (or when the capacity was less than some small number like "10". The result from Take
should include all the info you need without making the extra db calls.
Seeing the return of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign @icco
@@ -254,6 +255,8 @@ func (c *Controller) HandleIssue() http.Handler { | |||
return | |||
} | |||
|
|||
c.recordCapacity(ctx, realm, remaining) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, so we know their total configured limit and the number remaining. I'm gonna defer to @icco on whether we want to alert on a fixed val (e.g. 10) or a ratio here (e.g. < 10% quota remaining).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wanna alert on percentage, but the metrics should be the actual numbers. So record remaining and total here, and in the alert calculate the percentage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have alerting based on multiple metrics? From what I can tell Stackdriver only seemed to consider a single metric, am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, my understanding was that we can do math now with Query Notation, but I have not tried
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can do something like the following to calculate memory utilization in metrics explorer via MQL:
{ fetch 'compute.googleapis.com/instance/memory/balloon/ram_used'
; fetch 'compute.googleapis.com/instance/memory/balloon/ram_size' }
| join
| div
however, couldn't make something similar work via Terraform alert policy filter, it doesn't seem possible to select multiple metrics by OR
ring the metric type.
I'll dig further, but in the meantime, added metrics for remaining and issued tokens, which is closer to what we want here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah makes sense. Recording all three below seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
/lgtm I'd like @icco to give the final approval |
* Avoid db calls by using the remaning token count from returnee of Take * Not recording capacity is not fatal * Reverse the capacity logic to alert above 90% utilization
Had to rebase on main to fix a conflict. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
@@ -254,6 +255,8 @@ func (c *Controller) HandleIssue() http.Handler { | |||
return | |||
} | |||
|
|||
c.recordCapacity(ctx, realm, remaining) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah makes sense. Recording all three below seems fine.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: femnad, icco The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/unhold |
Fixes #572
Proposed Changes
Release Note