-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lower CortexIngesterRestarts severity #321
Conversation
Signed-off-by: Marco Pracucci <marco@pracucci.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, downscale is the cause, restart is a symptom. Typically on downscale I'd expect single restart, not multiple. We have caught issues with this alert before (mostly in Loki, but still). I would keep it as is, and perhaps tune the parameters.
An ingester restart is not an issue per se, not even if it happen twice in a row. |
I think that depends on actual reason (cause) why it was restarted twice in a row. Restarts due to "query of death" or some wrong pushed data would be bad. Restarts due to autoscaler are fine. |
That said, since other alerts cover our write and read path, maybe lowering priority for this one is fine. 🤔 |
Agree on this. However, even if that's the case, if it just happens on a single ingester it's still not a critical thing (not critical enough to be waken up during the night because doesn't negatively affect the cluster). If it happens on multiple ingesters at the same time, SLO and other critical alerts (eg. requests failure) will trigger: these latter alerts are more symptom based than "an ingester is restarting too frequently". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think this can be moved to warning
as other causes should be covered by SLOs, etc.
…ster-restarts-severity Lower CortexIngesterRestarts severity
What this PR does:
Last night I've been paged by
CortexIngesterRestarts
but it was a false positive caused by a K8S cluster downscaling triggered by the node autoscaler. There was no impact on the Cortex cluster availability, given the ingester pod disruption budget was honored.It's not the first time this is happening and, in this PR, I'm proposing to lower
CortexIngesterRestarts
severity fromcritical
towarning
. An ingester restart is not an issue per se. The signal may still be useful in case of a cluster outage, but we shouldn't have a critical alert on it.Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]