MDRaid Alert #76

xixipangma · 2020-01-23T21:38:14Z

It would be handy to add an MDRaid alert for md RAID array degradation. Here's what I have:

- alert: MDRaidDegrade
  expr: (node_md_disk - node_md_disk_active) != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID."
    description: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID {{ $labels.device }}. VALUE - {{ $value }}."


paulfantom · 2020-01-23T21:43:22Z

If you use the textfile collector script from here, you can get much more insight into your array. With that in place, a very simple alert rule is enough:

node_md_info_FailedDevices > 0
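
Wrapped in a complete rule, that expression might look like the sketch below. Only the metric name node_md_info_FailedDevices comes from the comment above; the group name, alert name, `for` duration, severity, and annotation text are assumptions made for illustration.

```yaml
groups:
  - name: mdraid-textfile            # hypothetical group name
    rules:
      - alert: MDRaidFailedDevices   # hypothetical alert name
        # Metric exposed by the textfile collector script mentioned above.
        expr: 'node_md_info_FailedDevices > 0'
        for: 1m                      # assumed; mirrors the 1m in the original proposal
        labels:
          severity: critical         # assumed severity
        annotations:
          summary: "Failed device in MD RAID array on {{ $labels.instance }}"
          description: "The textfile collector reports {{ $value }} failed device(s) in an MD RAID array on {{ $labels.instance }}."
```

The annotations reference only the instance label, since the label set produced by the script is not shown in this thread.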

samber · 2020-01-24T11:27:16Z

Hi @xixipangma

In the node-exporter code, I see some nice metrics => https://github.com/prometheus/node_exporter/blob/master/collector/mdadm_linux.go#L42

I'm not sure node_md_disk and node_md_disk_active still exist.

I don't have any MD devices on my side. Could you please test the following alerts 🙏 ?

count(node_md_state{state="recovering"}) by (device) > 0 (warning)
count(node_md_state{state="resync"}) by (device) > 0 (info)
count(node_md_state{state="inactive"}) by (device) > 0 (warning)
node_md_disks_required > count(node_md_state{state="active"}) by (device) (critical)
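
Two details are worth checking when testing these expressions. The node_exporter changelog describes node_md_state as a state set, which would mean one series per state, with value 1 on the current state and 0 on the others. Under that assumption, count() over a state matcher counts series regardless of their value, so comparing the sample value directly is the safer check. The sketch below expresses the same four ideas as rule entries. It also assumes that the number of active disks is exposed as node_md_disks{state="active"}; the group and alert names are hypothetical.

```yaml
groups:
  - name: mdraid-state               # hypothetical group name
    rules:
      - alert: MDRaidRecovering      # hypothetical alert names throughout
        # Fires only for the series whose state is currently set (value 1).
        expr: 'node_md_state{state="recovering"} == 1'
        labels:
          severity: warning
      - alert: MDRaidResync
        expr: 'node_md_state{state="resync"} == 1'
        labels:
          severity: info
      - alert: MDRaidInactive
        expr: 'node_md_state{state="inactive"} == 1'
        labels:
          severity: warning
      - alert: MDRaidMissingDisks
        # ignoring (state) lets the two metrics match on their remaining labels.
        expr: '(node_md_disks_required - ignoring (state) node_md_disks{state="active"}) > 0'
        labels:
          severity: critical
```

Because each expression keeps the original series labels, the device label stays available for annotations without a by (device) clause.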

xixipangma · 2020-01-24T12:23:59Z

Hi @samber

I tested this on one of the md nodes running node_exporter version 0.18.1, and I don't see these new metrics yet. I still see the old node_md_disk and node_md_disk_active, and the old ones carry no information about the md state. Maybe it isn't released yet.

node_md_is_active is replaced by node_md_state with a state set of "active", "inactive", "recovering", "resync".

Check this https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md

samber · 2020-01-24T12:39:12Z

Oh!

Let's wait 2 weeks before merging this 😉

Thanks for your suggestion!

paulfantom · 2020-01-24T16:35:46Z

Just a nit - there are no node_md_disk_active metrics in node_exporter, but there is node_md_disks_active [1]. Same for node_md_disk, it should be node_md_disks [2]

[1]: https://github.com/prometheus/node_exporter/blob/release-0.18/collector/mdadm_linux.go#L235
[2]: https://github.com/prometheus/node_exporter/blob/release-0.18/collector/mdadm_linux.go#L242

> Maybe it isn't released yet

Exactly. :)

In the next node_exporter release, the node_md_disks_active metric will no longer be reported, because the whole collector was rewritten to include more information (such as spare drives and failed devices). More in prometheus/node_exporter#1403.

This means that alerts relying on node_md_disks_active won't work with the next node_exporter version. From that release on, I think the following alerts should cover the major problems with RAID arrays:

  - alert: RAIDCriticalFailure
    expr: 'node_md_state{state="inactive"} >= 1'
    labels:
      severity: critical
    annotations:
      summary: "Degraded RAID array on {{ $labels.instance }}"
      description: "RAID array '{{ $labels.device }}' is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically."
  - alert: RAIDDiskFailure
    expr: 'node_md_disks{state="failed"} > 0'
    labels:
      severity: warning
    annotations:
      summary: "Failed device in RAID array on {{ $labels.instance }}"
      description: "At least one device in a RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap."

samber · 2020-03-07T16:54:30Z

Thanks for your help!

Merged to master => https://github.com/samber/awesome-prometheus-alerts/pull/82/files

Looks like node-exporter v1 will be released soon ;)

samber mentioned this issue Mar 7, 2020

Added RAID alerts (node-exporter) #82

Merged

samber closed this as completed Mar 7, 2020
