Lower CortexIngesterRestarts severity #321

pracucci · 2021-06-07T06:55:29Z

What this PR does:
Last night I was paged by CortexIngesterRestarts, but it was a false positive caused by a K8S cluster downscale triggered by the node autoscaler. There was no impact on the Cortex cluster's availability, because the ingester pod disruption budget was honored.

This isn't the first time it has happened, so in this PR I'm proposing to lower the CortexIngesterRestarts severity from critical to warning. An ingester restart is not an issue per se. The signal may still be useful in case of a cluster outage, but we shouldn't have a critical alert on it.
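As a rough illustration of what the proposed change amounts to (this is not the actual cortex-jsonnet definition), the sketch below models an alerting rule as a plain Python dict and lowers its `severity` label. The `expr` and `for` values are placeholders, and `lower_severity` is a hypothetical helper introduced only for this example.

```python
import copy

# Hypothetical representation of an alerting rule. The expression and
# duration are placeholders, not the real CortexIngesterRestarts definition.
rule = {
    "alert": "CortexIngesterRestarts",
    "expr": "<restart-detection expression>",
    "for": "<duration>",
    "labels": {"severity": "critical"},
}


def lower_severity(rule: dict, new_severity: str = "warning") -> dict:
    """Return a copy of the rule with its severity label replaced."""
    updated = copy.deepcopy(rule)
    updated["labels"]["severity"] = new_severity
    return updated


lowered = lower_severity(rule)
print(lowered["labels"]["severity"])
```

Only the label changes: the alert still fires on the same condition, but it would be routed as a warning instead of paging.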

Which issue(s) this PR fixes:
N/A

Checklist

CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pstibrany

In my opinion, the downscale is the cause and the restart is a symptom. Typically, on a downscale I'd expect a single restart, not multiple. We have caught issues with this alert before (mostly in Loki, but still). I would keep it as is, and perhaps tune the parameters.
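To make "tune the parameters" concrete, here is a minimal sketch of a restart-count check with an adjustable window and threshold. The function names and the default values (`window_s=1800.0`, `min_restarts=2`) are assumptions chosen for illustration, not the alert's real settings.

```python
from typing import Sequence


def restarts_in_window(restart_times: Sequence[float], now: float, window_s: float) -> int:
    """Count restarts whose timestamp falls within the last window_s seconds."""
    return sum(1 for t in restart_times if now - window_s <= t <= now)


def should_fire(
    restart_times: Sequence[float],
    now: float,
    window_s: float = 1800.0,  # assumed window, for illustration only
    min_restarts: int = 2,     # assumed threshold, for illustration only
) -> bool:
    """Fire only when an ingester restarted at least min_restarts times in the window."""
    return restarts_in_window(restart_times, now, window_s) >= min_restarts
```

With these assumed defaults, a single restart during a downscale would not fire, while two restarts close together would. Raising `min_restarts` or shortening `window_s` makes the check less sensitive.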

pracucci · 2021-06-07T07:24:52Z

> In my opinion, the downscale is the cause and the restart is a symptom. Typically, on a downscale I'd expect a single restart, not multiple. We have caught issues with this alert before (mostly in Loki, but still). I would keep it as is, and perhaps tune the parameters.

An ingester restart is not an issue per se, not even if it happens twice in a row.

pstibrany · 2021-06-07T07:49:17Z

> An ingester restart is not an issue per se, not even if it happens twice in a row.

I think that depends on the actual reason (cause) it was restarted twice in a row. Restarts due to a "query of death" or some bad data being pushed would be a problem. Restarts due to the autoscaler are fine.

pstibrany · 2021-06-07T07:52:12Z

> I think that depends on the actual reason (cause) it was restarted twice in a row. Restarts due to a "query of death" or some bad data being pushed would be a problem. Restarts due to the autoscaler are fine.

That said, since other alerts cover our write and read path, maybe lowering priority for this one is fine. 🤔

pracucci · 2021-06-07T12:03:30Z

> I think that depends on the actual reason (cause) it was restarted twice in a row. Restarts due to a "query of death" or some bad data being pushed would be a problem. Restarts due to the autoscaler are fine.

I agree with this. However, even in that case, if it happens on just a single ingester it's still not critical (not critical enough to be woken up at night for, because it doesn't negatively affect the cluster). If it happens on multiple ingesters at the same time, the SLO and other critical alerts (e.g. request failures) will trigger: these latter alerts are more symptom-based than "an ingester is restarting too frequently".
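The reasoning above can be sketched as a small decision function: restarts on their own never escalate past a warning, while a user-visible symptom (here, a request failure ratio above a threshold) is what pages. The function name `classify` and the `max_failure_ratio` default of `0.01` are arbitrary assumptions for illustration, not values taken from the Cortex alerts.

```python
def classify(
    restarting_ingesters: int,
    request_failure_ratio: float,
    max_failure_ratio: float = 0.01,  # assumed SLO threshold, illustration only
) -> str:
    """Return the severity that a symptom-based alerting policy would assign."""
    # Symptom-based: failing requests are what justify paging someone.
    if request_failure_ratio > max_failure_ratio:
        return "critical"
    # Cause-based: restarts alone are worth a look, but not a page.
    if restarting_ingesters > 0:
        return "warning"
    return "ok"


print(classify(restarting_ingesters=1, request_failure_ratio=0.0))
print(classify(restarting_ingesters=3, request_failure_ratio=0.05))
```

In this sketch, even several ingesters restarting at once stay at warning as long as requests keep succeeding, which mirrors the argument that the critical signal should come from the SLO and request-failure alerts.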

owen-d

Agreed, I think this can be moved to warning as other causes should be covered by SLOs, etc.

Lower CortexIngesterRestarts severity

bd0dad6

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci requested a review from a team as a code owner June 7, 2021 06:55

pstibrany reviewed Jun 7, 2021

pstibrany approved these changes Jun 7, 2021

owen-d approved these changes Jun 7, 2021

Merge branch 'main' into lower-ingester-restarts-severity

2624c08

pracucci merged commit e7cbfe4 into main Jun 8, 2021

pracucci deleted the lower-ingester-restarts-severity branch June 8, 2021 13:23

simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021

Merge pull request grafana/cortex-jsonnet#321 from grafana/lower-inge…

edd68a4

…ster-restarts-severity Lower CortexIngesterRestarts severity
