Merge pull request openshift#886 from elmiko/add-mhc-sc-alert
[OCPCLOUD-922] add alert for mhc in short circuit
openshift-merge-robot committed Jul 26, 2021
2 parents 1cd30d8 + 6bc53e1 commit 356e121
Showing 2 changed files with 37 additions and 0 deletions.
27 changes: 27 additions & 0 deletions docs/user/Alerts.md
@@ -69,3 +69,30 @@ due to either network issue or missing service definition.

### Resolution
Investigate the logs of the machine-api-operator to determine why it is unable to gather machines and machinesets, or investigate the collection of metrics.

## MachineHealthCheckUnterminatedShortCircuit
A MachineHealthCheck has been in short circuit for an extended period of time
and is no longer remediating unhealthy machines.

### Query
```
# for: 30m
mapi_machinehealthcheck_short_circuit == 1
```

### Possible Causes
* The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check

### Resolution
Ensure that the `maxUnhealthy` field on the MachineHealthCheck is not set too low.
With a low `maxUnhealthy` value, the MachineHealthCheck enters short circuit when only
a few nodes are unhealthy. The right value depends on each cluster and its users' needs,
but in general you should consider the size of your cluster and the maximum number of
machines that can be unhealthy before the MachineHealthCheck stops attempting
remediation. Consider setting this value to a percentage (e.g. `50%`) so that the
MachineHealthCheck continues to perform as expected as your cluster grows.
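
As a rough illustration of the percentage form described above, a MachineHealthCheck might look like the following sketch. The resource name, selector labels, and timeout values are hypothetical placeholders and are not taken from this change; adjust them to match the machines in your cluster.

```yaml
# Minimal sketch, assuming a worker MachineSet labelled as shown below.
# The name, labels, and timeouts are illustrative assumptions only.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mhc                # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker   # assumed label
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
  # A percentage scales with the number of selected machines, so the check
  # short-circuits only when more than half of them are unhealthy.
  maxUnhealthy: "50%"
```

With a percentage, the threshold grows with the set of machines the selector matches, whereas a fixed integer stays the same as the cluster scales.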

If the `maxUnhealthy` value looks acceptable, the next step is to inspect the
unhealthy machines and remediate them manually if possible. This can usually be achieved
by deleting the machines in question and allowing the Machine API to recreate them.
10 changes: 10 additions & 0 deletions install/0000_90_machine-api-operator_04_alertrules.yaml
@@ -52,3 +52,13 @@ spec:
            severity: critical
          annotations:
            message: "machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api"
    - name: machine-health-check-unterminated-short-circuit
      rules:
        - alert: MachineHealthCheckUnterminatedShortCircuit
          expr: |
            mapi_machinehealthcheck_short_circuit == 1
          for: 30m
          labels:
            severity: warning
          annotations:
            message: "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
