MDRaid Alert #76

Closed
xixipangma opened this issue Jan 23, 2020 · 6 comments

Comments

@xixipangma

It would be handy to add an MDRaid alert for md RAID array degradation. Here's what I have:

- alert: MDRaidDegrade
  expr: (node_md_disk - node_md_disk_active) != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID."
    description: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID {{ $labels.device }}. VALUE - {{ $value }}."

@paulfantom

paulfantom commented Jan 23, 2020

If you use the textfile collector script from here, you can get much more insight into your array. After that, you can use a very simple alert rule like:

node_md_info_FailedDevices > 0
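
For anyone who wants to drop that expression into a rules file, a minimal sketch of a complete rule might look like the following. The group name, the alert name `MDRaidFailedDevices`, the `for` duration, and the severity are assumptions and not part of the script or this thread. The annotation only uses the `instance` label because the labels emitted by the textfile script are not shown here.

```yaml
groups:
  - name: mdraid-textfile            # hypothetical group name
    rules:
      - alert: MDRaidFailedDevices   # hypothetical alert name
        # Metric name as quoted above; assumed to be exposed by the textfile collector script.
        expr: node_md_info_FailedDevices > 0
        for: 1m                      # assumed; mirrors the duration in the original proposal
        labels:
          severity: critical         # assumed severity
        annotations:
          summary: "Failed device reported for an MD RAID array on {{ $labels.instance }}"
```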

@samber
Owner

samber commented Jan 24, 2020

Hi @xixipangma

In the node-exporter code, I see some nice metrics => https://github.com/prometheus/node_exporter/blob/master/collector/mdadm_linux.go#L42

I'm not sure node_md_disk and node_md_disk_active still exist.

I don't have any MD devices on my side. Could you please test the following alerts 🙏 ?

  • count(node_md_state{state="recovering"}) by (device) > 0 (warning)
  • count(node_md_state{state="resync"}) by (device) > 0 (info)
  • count(node_md_state{state="inactive"}) by (device) > 0 (warning)
  • node_md_disks_required > count(node_md_state{state="active"}) by (device) (critical)
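
A sketch of how these candidates could be written as rules, under one assumption worth checking on a real host: if node_md_state is a state set (one series per state, with value 1 for the current state and 0 for the others), then count() counts series regardless of their value, so comparing the sample value directly may be closer to the intent. The group and alert names are hypothetical. The last rule additionally assumes that node_md_disks carries a state label with an "active" value.

```yaml
groups:
  - name: mdraid-state               # hypothetical group name
    rules:
      - alert: MDRaidRecovering      # hypothetical name
        # Fires only when the "recovering" series is the one set to 1.
        expr: node_md_state{state="recovering"} == 1
        labels:
          severity: warning
      - alert: MDRaidResync          # hypothetical name
        expr: node_md_state{state="resync"} == 1
        labels:
          severity: info
      - alert: MDRaidInactive        # hypothetical name
        expr: node_md_state{state="inactive"} == 1
        labels:
          severity: warning
      - alert: MDRaidMissingDisks    # hypothetical name
        # Assumption: node_md_disks has a state label including "active".
        # ignoring(state) lets it match node_md_disks_required, which lacks that label.
        expr: node_md_disks_required > ignoring(state) node_md_disks{state="active"}
        labels:
          severity: critical
```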

@xixipangma
Author

xixipangma commented Jan 24, 2020

Hi @samber

I tested this on one of the md nodes running node_exporter version 0.18.1, and I don't see these new metrics yet. I still see the old node_md_disk and node_md_disk_active, and the old ones carry no information about the md state. Maybe it isn't released yet.

node_md_is_active is replaced by node_md_state with a state set of "active", "inactive", "recovering", "resync".

Check this https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md

@samber
Owner

samber commented Jan 24, 2020

Oh!

Let's wait 2 weeks before merging this 😉

Thanks for your suggestion!

@paulfantom

paulfantom commented Jan 24, 2020

Just a nit: there is no node_md_disk_active metric in node_exporter, but there is node_md_disks_active [1]. The same goes for node_md_disk; it should be node_md_disks [2].

[1]: https://github.com/prometheus/node_exporter/blob/release-0.18/collector/mdadm_linux.go#L235
[2]: https://github.com/prometheus/node_exporter/blob/release-0.18/collector/mdadm_linux.go#L242
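
With those names, the rule from the top of this issue would read roughly as follows on node_exporter 0.18. This is only a sketch: everything except the two metric names is carried over from the original proposal, and it has not been tested against a live array here.

```yaml
- alert: MDRaidDegrade
  # Metric names per the release-0.18 collector linked above.
  expr: (node_md_disks - node_md_disks_active) != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID."
    description: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID {{ $labels.device }}. VALUE - {{ $value }}."
```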


Maybe it isn't released yet

Exactly. :)

In the next node_exporter release, the metric node_md_disks_active will no longer be reported, as the whole collector was rewritten to include more information (such as spare drives or failed devices). More in prometheus/node_exporter#1403.

This means that, for the next node_exporter version, alerts relying on node_md_disks_active won't work. I think that, from the next node_exporter release, the following alerts should cover the major problems with RAID arrays:

  - alert: RAIDCriticalFailure
    expr: 'node_md_state{state="inactive"} >= 1'
    labels:
      severity: critical
    annotations:
      summary: "Degraded RAID array on {{ $labels.instance }}"
      description: "RAID array '{{ $labels.device }}' is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically."
  - alert: RAIDDiskFailure
    expr: 'node_md_disks{state="failed"} > 0'
    labels:
      severity: warning
    annotations:
      summary: "Failed device in RAID array on {{ $labels.instance }}"
      description: "At least one device in a RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap."

@samber
Owner

samber commented Mar 7, 2020

Thanks for your help!

Merged to master => https://github.com/samber/awesome-prometheus-alerts/pull/82/files

Looks like node-exporter v1 will be released soon ;)

@samber samber closed this as completed Mar 7, 2020