Replace ruler alerts, and add playbooks. #347

Merged
merged 8 commits into from
Jul 2, 2021
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -17,6 +17,7 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
+* [CHANGE] Replace `CortexRulerFailedEvaluations` with two new alerts: `CortexRulerTooManyFailedPushes` and `CortexRulerTooManyFailedQueries`. #347
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
30 changes: 25 additions & 5 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -527,20 +527,40 @@
   name: 'ruler_alerts',
   rules: [
     {
-      alert: 'CortexRulerFailedEvaluations',
+      alert: 'CortexRulerTooManyFailedPushes',
       expr: |||
-        sum by (%s, instance, rule_group) (rate(cortex_prometheus_rule_evaluation_failures_total[1m]))
+        100 * (
+          sum by (%s, instance) (rate(cortex_ruler_write_requests_failed_total[1m]))
           /
-        sum by (%s, instance, rule_group) (rate(cortex_prometheus_rule_evaluations_total[1m]))
-          > 0.01
+          sum by (%s, instance) (rate(cortex_ruler_write_requests_total[1m]))
+        ) > 1
       ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
       'for': '5m',
       labels: {
+        severity: 'critical',
+      },
+      annotations: {
+        message: |||
+          Cortex Ruler {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% write errors.
+        |||,
+      },
+    },
+    {
+      alert: 'CortexRulerTooManyFailedQueries',
+      expr: |||
+        100 * (
+          sum by (%s, instance) (rate(cortex_ruler_queries_failed_total[1m]))
+          /
+          sum by (%s, instance) (rate(cortex_ruler_queries_total[1m]))
+        ) > 1
+      ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+      'for': '5m',
+      labels: {
         severity: 'warning',
       },
       annotations: {
         message: |||
-          Cortex Ruler {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% errors for the rule group {{ $labels.rule_group }}.
+          Cortex Ruler {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% errors while evaluating rules.
         |||,
       },
     },
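The `||| % [...]` construct in the alert rules fills the two `%s` placeholders with the configured aggregation labels. Jsonnet's `%` operator follows Python-style string formatting, so the substitution can be sketched in Python. The label value `cluster, namespace` is an assumed example, not taken from this diff; the real value comes from `$._config.alert_aggregation_labels`.

```python
# Assumed example value; the real one is read from
# $._config.alert_aggregation_labels in the mixin config.
ALERT_AGGREGATION_LABELS = "cluster, namespace"

# Same template text as the CortexRulerTooManyFailedPushes expression.
EXPR_TEMPLATE = """\
100 * (
  sum by (%s, instance) (rate(cortex_ruler_write_requests_failed_total[1m]))
  /
  sum by (%s, instance) (rate(cortex_ruler_write_requests_total[1m]))
) > 1
"""


def render_expr(template: str, labels: str) -> str:
    """Fill both %s placeholders with the same aggregation labels."""
    return template % (labels, labels)


if __name__ == "__main__":
    # Prints the PromQL that would end up in the generated alert rule.
    print(render_expr(EXPR_TEMPLATE, ALERT_AGGREGATION_LABELS))
```

Both sides of the division must be aggregated by the same labels, which is why the same value is passed twice.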
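As a rough illustration of what both new expressions compute, the sketch below takes per-ruler request rates and returns the rulers whose failure percentage exceeds the `> 1` threshold. The function name and the input shape are hypothetical, and the `for: 5m` hold period is not modeled.

```python
from typing import Dict, List, Tuple

THRESHOLD_PERCENT = 1.0  # corresponds to "> 1" in the alert expressions


def failing_rulers(rates: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Return (instance, failure percent) for rulers above the threshold.

    `rates` maps a ruler instance to (failed requests/s, total requests/s),
    i.e. the two rate() values that the expression divides.
    """
    result = []
    for instance, (failed, total) in sorted(rates.items()):
        if total <= 0:
            # With no traffic PromQL computes 0/0 = NaN, which never
            # satisfies "> 1", so such a ruler produces no alert.
            continue
        percent = 100.0 * failed / total
        if percent > THRESHOLD_PERCENT:
            result.append((instance, percent))
    return result


if __name__ == "__main__":
    sample = {"ruler-0": (0.5, 10.0), "ruler-1": (0.05, 10.0), "ruler-2": (0.0, 0.0)}
    print(failing_rulers(sample))
```

Because the result is a percentage, the value substituted into the alert message via `$value` reads directly as "N% errors".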
21 changes: 19 additions & 2 deletions cortex-mixin/docs/playbooks.md
@@ -144,9 +144,26 @@ More information:

 This alert occurs when a ruler is unable to validate whether or not it should claim ownership over the evaluation of a rule group. The most likely cause is that one of the ruler ring entries is unhealthy. If this is the case, go to the ring admin HTTP page and forget the unhealthy ruler. The other possible cause is an error returned by the ring client. If this is the case, debug the ring based on the backend implementation in use.

-### CortexRulerFailedEvaluations
+### CortexRulerTooManyFailedPushes

-_TODO: this playbook has not been written yet._
+This alert fires when rulers cannot push new samples (the result of rule evaluation) to ingesters.
+
+In general, pushing samples can fail due to problems with Cortex operations (e.g. too many ingesters have crashed, so the ruler cannot write samples to them), or due to problems with the resulting data (e.g. a user hitting the limit on the number of series, out-of-order samples, etc.).
+This alert fires only for the first kind of problem, not for problems caused by limits or invalid rules.
+
+How to **fix**:
+- Investigate the ruler logs to find out why the ruler cannot write samples.
+
+### CortexRulerTooManyFailedQueries
+
+This alert fires when rulers fail to evaluate rule queries.
+
+A rule evaluation may fail for many reasons, e.g. an invalid PromQL expression, or a query hitting the limit on the number of chunks. These are "user errors", and this alert ignores them.
+
+There is a more important category of errors: failures to read data from store-gateways or ingesters. These errors would result in a 500 response when run from the querier. This alert fires if there are too many such failures.
+
+How to **fix**:
+- Investigate the ruler logs to find out why the ruler cannot evaluate queries. Note that the ruler logs rule evaluation errors even for "user errors", but those do not cause the alert to fire. Focus on problems with ingesters or store-gateways.

 ### CortexRulerMissedEvaluations

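Both new playbooks rest on the same distinction: "user errors" are ignored, while operational failures count toward the alert. The sketch below illustrates that accounting under the assumption that every request increments a total counter and only operational failures increment a failed counter. The `Outcome` and `RequestCounters` names are hypothetical and do not reflect the ruler's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    # e.g. series limit hit, out-of-order samples, invalid PromQL expression
    USER_ERROR = "user_error"
    # e.g. ingesters or store-gateways unavailable
    OPERATIONAL_ERROR = "operational_error"


@dataclass
class RequestCounters:
    """Stand-in for a `*_total` / `*_failed_total` counter pair."""

    total: int = 0
    failed: int = 0

    def record(self, outcome: Outcome) -> None:
        # Every request counts toward the total ...
        self.total += 1
        # ... but only operational failures count as failed.
        if outcome is Outcome.OPERATIONAL_ERROR:
            self.failed += 1

    def failure_percent(self) -> float:
        return 100.0 * self.failed / self.total if self.total else 0.0


if __name__ == "__main__":
    counters = RequestCounters()
    for outcome in [Outcome.SUCCESS] * 97 + [Outcome.USER_ERROR] * 2 + [Outcome.OPERATIONAL_ERROR]:
        counters.record(outcome)
    print(counters, counters.failure_percent())
```

Under this model a tenant that keeps hitting limits raises neither alert, which matches the playbooks' advice to focus on ingester and store-gateway problems in the ruler logs.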