add alert for mhc in short circuit
This adds an alert that fires when a MachineHealthCheck (MHC) has been in
short-circuit for more than 30 minutes, and adds documentation for it.

ref: https://issues.redhat.com/browse/OCPCLOUD-922
elmiko committed Jul 9, 2021
1 parent fdaf9a9 commit 9410f63
Showing 2 changed files with 32 additions and 0 deletions.
22 changes: 22 additions & 0 deletions docs/user/Alerts.md
@@ -69,3 +69,25 @@ due to either network issue or missing service definition.

### Resolution
Investigate the logs of the machine-api-operator to determine why it is unable to gather machines and machinesets, or investigate the collection of metrics.

## MachineHealthCheckUnterminatedShortCircuit
A MachineHealthCheck has been in short circuit for an extended period of time
and is no longer remediating unhealthy machines.

### Query
```
# for: 30m
mapi_machinehealthcheck_short_circuit == 1
```
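To illustrate what the `for: 30m` clause means for this query, here is a minimal Python sketch of the usual Prometheus alert lifecycle: the alert is pending while the expression holds, fires once it has held continuously for the `for` duration, and resets as soon as one evaluation fails. The function name `alert_states` and the one-minute evaluation interval are assumptions made for illustration. They are not part of the machine-api-operator or of Prometheus itself.

```python
from datetime import timedelta

# Hypothetical values for illustration: the rule's `for` duration and an
# assumed evaluation interval of one minute.
FOR_DURATION = timedelta(minutes=30)
EVAL_INTERVAL = timedelta(minutes=1)


def alert_states(samples, step=EVAL_INTERVAL, for_duration=FOR_DURATION):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    `samples` holds the value of mapi_machinehealthcheck_short_circuit
    observed at each evaluation (1 means short-circuited).
    """
    held_for = None  # how long the condition has held without interruption
    states = []
    for value in samples:
        if value != 1:
            # Any evaluation where the condition is false resets the timer.
            held_for = None
            states.append("inactive")
            continue
        held_for = timedelta(0) if held_for is None else held_for + step
        states.append("firing" if held_for >= for_duration else "pending")
    return states
```

Under these assumptions, a MachineHealthCheck that leaves short-circuit even briefly within the 30-minute window does not trigger the alert.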

### Possible Causes
* The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check

### Resolution
Check that the `maxUnhealthy` field on the MachineHealthCheck is not set too low.
With a low `maxUnhealthy` value, the MachineHealthCheck enters short-circuit
when only a few nodes are unhealthy.
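As a rough illustration of why a low `maxUnhealthy` value triggers short-circuit so easily, the following Python sketch models the comparison. It assumes that `maxUnhealthy` is either an absolute count or a percentage string such as `"40%"`, and that percentages are rounded down against the number of machines the check targets. The function names are hypothetical and this is not the operator's actual code.

```python
def allowed_unhealthy(max_unhealthy, total_machines):
    """Resolve maxUnhealthy (int or percentage string) to an absolute count.

    Assumption: percentages are rounded down.
    """
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy[:-1])
        return (percent * total_machines) // 100
    return int(max_unhealthy)


def is_short_circuited(max_unhealthy, total_machines, unhealthy_machines):
    """Remediation stops once unhealthy machines exceed the allowed count."""
    return unhealthy_machines > allowed_unhealthy(max_unhealthy, total_machines)


# With 5 targeted machines, "40%" allows 2 unhealthy machines:
# a third unhealthy machine short-circuits the check.
print(is_short_circuited("40%", 5, 2))
print(is_short_circuited("40%", 5, 3))
```

In this model, `"10%"` of 5 machines rounds down to 0, so a single unhealthy machine is enough to stop remediation.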

If the `maxUnhealthy` value looks acceptable, the next step is to inspect the
unhealthy machines and remediate them manually if possible. This can usually be achieved
by deleting the machines in question and allowing the Machine API to recreate them.
10 changes: 10 additions & 0 deletions install/0000_90_machine-api-operator_04_alertrules.yaml
@@ -52,3 +52,13 @@ spec:
            severity: critical
          annotations:
            message: "machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api"
    - name: machine-health-check-unterminated-short-circuit
      rules:
        - alert: MachineHealthCheckUnterminatedShortCircuit
          expr: |
            mapi_machinehealthcheck_short_circuit == 1
          for: 30m
          labels:
            severity: warning
          annotations:
            message: "machine health check {{ $labels.name }} has been short circuited for more than 30 minutes"
