
Feature request: mdadm disk fail metric #261

Closed
hryamzik opened this issue Jun 22, 2016 · 14 comments

@hryamzik

hryamzik commented Jun 22, 2016

Node exporter doesn't report the number of failed disks in mdadm arrays, arguably the most useful metric for this collector.

mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]

unused devices: <none>
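For reference, the failed-member information is right there in the `/proc/mdstat` text: each failed member is flagged with `(F)` after its slot number, e.g. `sdb1[4](F)`. A minimal parsing sketch (standalone Python, not the exporter's actual parser; the function name and regexes are mine):

```python
import re

def count_failed_disks(mdstat_text):
    """Return {md_device: failed_disk_count} parsed from /proc/mdstat text.

    Failed members are flagged with "(F)" after the member's slot
    number, e.g. "sdb1[4](F)". This is a sketch, not node_exporter's
    real parser.
    """
    failed = {}
    for line in mdstat_text.splitlines():
        m = re.match(r'^(md\d+)\s*:\s*active', line)
        if m:
            failed[m.group(1)] = len(re.findall(r'\[\d+\]\(F\)', line))
    return failed

# The mdstat sample from this issue:
sample = """\
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]
"""

print(count_failed_disks(sample))  # {'md5': 1, 'md1': 0}
```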

current metrics:

# HELP node_md_blocks Total number of blocks on device.
# TYPE node_md_blocks gauge
node_md_blocks{device="md1"} 5.8581824e+07
node_md_blocks{device="md5"} 8.790400512e+09
# HELP node_md_blocks_synced Number of blocks synced on device.
# TYPE node_md_blocks_synced gauge
node_md_blocks_synced{device="md1"} 5.8581824e+07
node_md_blocks_synced{device="md5"} 8.790400512e+09
# HELP node_md_disks Total number of disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md1"} 2
node_md_disks{device="md5"} 4
# HELP node_md_disks_active Number of active disks of device.
# TYPE node_md_disks_active gauge
node_md_disks_active{device="md1"} 2
node_md_disks_active{device="md5"} 3
# HELP node_md_is_active Indicator whether the md-device is active or not.
# TYPE node_md_is_active gauge
node_md_is_active{device="md1"} 1
node_md_is_active{device="md5"} 1

P.S.: I do see the node_md_disks - node_md_disks_active calculation, but I'm not sure how it should work with hot spares.

@SuperQ
Member

SuperQ commented Jun 23, 2016

Maybe instead of node_md_disks and node_md_disks_active we should have a label value for state:

node_md_disks{device="md5",state="active"} 4
node_md_disks{device="md5",state="failed"} 0
node_md_disks{device="md5",state="spare"} 1
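A toy aggregation showing how this label scheme could also account for the hot-spare question raised above, assuming per-disk states are already known (the function and state names here are illustrative, not from node_exporter):

```python
from collections import Counter

def disks_by_state(disk_states):
    """Aggregate per-disk states into the proposed labeled metric.

    disk_states: mapping of member device -> state string.
    Returns a count for every known state, so that series with a
    value of 0 are still exported (failed=0 is a signal, too).
    """
    known_states = ("active", "failed", "spare")
    counts = Counter(disk_states.values())
    return {state: counts.get(state, 0) for state in known_states}

# md5 from the mdstat sample above: three active members, one failed.
md5 = {"sda1": "active", "sdc1": "active", "sdd1": "active", "sdb1": "failed"}
for state, value in disks_by_state(md5).items():
    print(f'node_md_disks{{device="md5",state="{state}"}} {value}')
```

Always emitting all three states keeps the series set stable, so dashboards and alerts don't have to deal with a `state="failed"` series appearing only once something breaks.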

@frittentheke
Contributor

May I ask whether this issue is still being pursued now that PR #492 has been closed?

@hryamzik
Author

I didn't manage to find the corresponding PR and can't check on a real system right now. However, the issue looks addressed, so let's close it for now.

@mpursley
Contributor

@hryamzik (and anyone else searching for why node_exporter doesn't have a metric for md software RAID disk states, e.g. failed, active, etc.):

I found a few PRs that add the disk states, but none of them got merged, e.g.:
#648
#492

Seems like there's more debate than consensus in those PRs, and they get closed over time...
In the meantime, I have updated the md_info text collector to include these disk states in node_md_info_* metrics, e.g. node_md_info, node_md_info_FailedDevices, node_md_info_WorkingDevices, etc.

See that PR and the updated md_info text collector here:
#1204

@discordianfish discordianfish reopened this Feb 9, 2019
@discordianfish
Member

@mpursley You're right, the PRs weren't merged. As mentioned here: #648 (comment)
The consensus was that we would like to have the functionality, but in its own module that node_exporter just uses. That could be part of node_exporter, but it would be great if it were externally maintained.

@mpursley
Contributor

mpursley commented Feb 15, 2019

Yeah, makes sense. Another option people can use in the meantime is this (now merged) text collector script (running in a cronjob as root)...

https://github.com/prometheus/node_exporter/blob/master/text_collector_examples/md_info_detail.sh

Thanks @discordianfish
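For anyone unfamiliar with the pattern that script uses: a textfile-collector script just writes Prometheus exposition text to a `.prom` file in the directory passed to node_exporter's `--collector.textfile.directory` flag, and node_exporter serves it on the next scrape. A minimal Python stand-in for such a cron job (file name and metric values are illustrative; the atomic-rename step is the important part):

```python
import os
import tempfile

def write_textfile_metrics(directory, metrics):
    """Atomically write a .prom file for node_exporter's textfile collector.

    Writing to a temp file first and then rename()-ing it into place
    prevents node_exporter from scraping a half-written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(metrics)
    final_path = os.path.join(directory, "md_info.prom")
    os.rename(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path

# Example payload in Prometheus exposition format:
metrics = (
    '# HELP node_md_info_FailedDevices Failed devices per md array.\n'
    '# TYPE node_md_info_FailedDevices gauge\n'
    'node_md_info_FailedDevices{device="md5"} 1\n'
)
path = write_textfile_metrics(tempfile.mkdtemp(), metrics)
print(open(path).read())
```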

@You-NeverKnow
Contributor

Hi everyone. I was planning to extract the stat-extraction complexity in mdadm_linux.go into a different repository, and then call functions from that repo in mdadm_linux.go so node_exporter only serves the metrics. Does that sound good?

@SuperQ
Member

SuperQ commented Jun 10, 2019

@You-NeverKnow Yes, sounds great, we've been moving all of the generic /proc and /sys parsing to prometheus/procfs.

@You-NeverKnow
Contributor

I'm confused. I was under the impression that we would use the GET_ARRAY_INFO ioctl to retrieve RAID array statuses, per this comment: #648 (comment).

Would you rather use the mdstat parser in procfs instead?

@SuperQ
Member

SuperQ commented Jun 10, 2019

#648 was never merged, so we're still just doing proc file parsing.

We could go the syscall route or the parsing route. I haven't looked at the mdadm stuff recently, but afaik there's some information you can only get by parsing. Also, we need to make sure any syscalls are available as non-root: for safety, we don't allow code in node_exporter that requires root-level access.
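One non-root alternative to both the ioctl and the /proc/mdstat text format is md's sysfs interface: the kernel exposes world-readable attributes such as `/sys/block/<md>/md/degraded` (the number of missing or failed members). A hedged sketch, assuming that attribute exists on the running kernel; the `sysfs_base` parameter is there only to make the demo self-contained:

```python
import os
import tempfile

def degraded_count(md_device, sysfs_base="/sys/block"):
    """Read the number of missing/failed members of an md array from
    the sysfs "md/degraded" attribute. Returns None if the attribute
    is absent (old kernel, or not an md device)."""
    path = os.path.join(sysfs_base, md_device, "md", "degraded")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None

# Demo against a fake sysfs tree (no root and no real arrays needed).
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "md5", "md"))
with open(os.path.join(base, "md5", "md", "degraded"), "w") as f:
    f.write("1\n")

print(degraded_count("md5", sysfs_base=base))  # 1
print(degraded_count("md9", sysfs_base=base))  # None
```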

SuperQ pushed a commit that referenced this issue Jul 1, 2019
* Closes issue #261 on node_exporter.

Delegated mdstat parsing to procfs project. mdadm_linux.go now only exports the metrics.
-> Added disk labels: "fail", "spare", "active" to indicate disk status
-> Changed metric node_md_disks_total ==> node_md_disks_required
-> Removed test cases for mdadm_linux.go, as the functionality they tested for has been moved to procfs project.

Signed-off-by: Advait Bhatwadekar <advait123@ymail.com>
@judos

judos commented Mar 29, 2020

Just came across this very useful feature request. Anything changed since last year?

For users: the link above is broken; the current locations are:
script: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
doc: https://github.com/prometheus/node_exporter#textfile-collector

@hoffie
Contributor

hoffie commented Mar 29, 2020

Just came across this very useful feature request. Anything changed since last year?

Yes, I would say so. #1403 was merged, which included the refactoring into procfs and the addition of the state label. I think the merge of that PR was also supposed to close this issue? @SuperQ

The change is part of v1.0.0-rc.0.
Technically, it should be possible to see disks in the "failed" state now. However, this only holds as long as the kernel has not yet removed the failed disks from the array (see #1655).

@judos

judos commented Mar 29, 2020

Awesome, I switched to the v1.0.0-rc.0 version and I get the metrics I wanted, e.g.:

node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0

I would also consider the ticket closed 😄 Thanks for the fast help!

@discordianfish
Member

Great and thanks for confirming. Closing.

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
* Closes issue prometheus#261 on node_exporter.

Delegated mdstat parsing to procfs project. mdadm_linux.go now only exports the metrics.
-> Added disk labels: "fail", "spare", "active" to indicate disk status
-> Changed metric node_md_disks_total ==> node_md_disks_required
-> Removed test cases for mdadm_linux.go, as the functionality they tested for has been moved to procfs project.

Signed-off-by: Advait Bhatwadekar <advait123@ymail.com>