MDRaid Alert #76
Comments
If you use the textfile collector script from here, you can get much more insight into your array. After that, you can use a very simple alert rule like:
Hi @xixipangma, in the node-exporter code I see some nice metrics: https://github.com/prometheus/node_exporter/blob/master/collector/mdadm_linux.go#L42 I'm not sure node_md_disk and node_md_disk_active still exist, and I don't have any MD devices on my side. Could you please test the following alerts 🙏?
Hi @samber, I tested this on one of the md nodes running node_exporter version 0.18.1, and I don't see these new metrics yet. I still see the old node_md_disk and node_md_disk_active, and the old ones carry no information about the md state. Maybe it isn't released yet.
Check this: https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md
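For anyone else who wants to check which md metrics their node_exporter version actually exposes, here is a rough sketch of how that could be scripted. It assumes the exporter listens on the default `http://localhost:9100/metrics`; the helper names `md_metric_names` and `fetch_metrics` are made up for this example and are not part of node_exporter.

```python
import re
import urllib.request

# Matches the metric name at the start of a sample line,
# e.g. the `node_md_state` in `node_md_state{device="md0"} 1`.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def md_metric_names(exposition_text):
    """Return the set of node_md_* metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match and match.group(1).startswith("node_md_"):
            names.add(match.group(1))
    return names


def fetch_metrics(url="http://localhost:9100/metrics"):
    """Fetch the raw metrics page (the URL is an assumption: node_exporter's default port)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Running `print(sorted(md_metric_names(fetch_metrics())))` on an md node should list whichever `node_md_*` names that exporter version provides, which makes it easy to tell whether the newer state-labelled metrics are available yet.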
Oh! Let's wait 2 weeks before merging this 😉 Thanks for your suggestion!
Just a nit - there are no [1]: https://github.com/prometheus/node_exporter/blob/release-0.18/collector/mdadm_linux.go#L235
Exactly. :) In next node_exporter release, metric This means that for the next node_exporter version alerts relying on
# Fires when an md array is reported as inactive.
- alert: RAIDCriticalFailure
  expr: 'node_md_state{state="inactive"} >= 1'
  labels:
    severity: critical
  annotations:
    summary: "Degraded RAID array on {{ $labels.instance }}"
    description: "RAID array '{{ $labels.device }}' is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically."
# Fires when any member disk of an md array is reported as failed.
- alert: RAIDDiskFailure
  expr: 'node_md_disks{state="failed"} > 0'
  labels:
    severity: warning
  annotations:
    summary: "Failed device in RAID array on {{ $labels.instance }}"
    description: "At least one device in RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap."
Thanks for your help! Merged to master: https://github.com/samber/awesome-prometheus-alerts/pull/82/files It looks like node-exporter v1 will be released soon ;)
It would be handy if we could add an MDRaid alert for md RAID array degradation. Here's what I have.