
Feature request: mdadm disk fail metric #261

Closed
hryamzik opened this issue Jun 22, 2016 · 14 comments

@hryamzik

hryamzik commented Jun 22, 2016

Node exporter doesn't report the number of failed disks in mdadm arrays, arguably the most useful metric for this collector.

mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]

unused devices: <none>
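For reference, the failed-member information is right there in the `/proc/mdstat` text: each failed member is flagged with `(F)` after its slot number, e.g. `sdb1[4](F)`. A minimal parsing sketch (standalone Python, not the exporter's actual parser; the function name and regexes are mine):

```python
import re

def count_failed_disks(mdstat_text):
    """Return {md_device: failed_disk_count} parsed from /proc/mdstat text.

    Failed members are flagged with "(F)" after the member's slot
    number, e.g. "sdb1[4](F)". This is a sketch, not node_exporter's
    real parser.
    """
    failed = {}
    for line in mdstat_text.splitlines():
        m = re.match(r'^(md\d+)\s*:\s*active', line)
        if m:
            failed[m.group(1)] = len(re.findall(r'\[\d+\]\(F\)', line))
    return failed

# The mdstat sample from this issue:
sample = """\
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]
"""

print(count_failed_disks(sample))  # {'md5': 1, 'md1': 0}
```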

current metrics:

# HELP node_md_blocks Total number of blocks on device.
# TYPE node_md_blocks gauge
node_md_blocks{device="md1"} 5.8581824e+07
node_md_blocks{device="md5"} 8.790400512e+09
# HELP node_md_blocks_synced Number of blocks synced on device.
# TYPE node_md_blocks_synced gauge
node_md_blocks_synced{device="md1"} 5.8581824e+07
node_md_blocks_synced{device="md5"} 8.790400512e+09
# HELP node_md_disks Total number of disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md1"} 2
node_md_disks{device="md5"} 4
# HELP node_md_disks_active Number of active disks of device.
# TYPE node_md_disks_active gauge
node_md_disks_active{device="md1"} 2
node_md_disks_active{device="md5"} 3
# HELP node_md_is_active Indicator whether the md-device is active or not.
# TYPE node_md_is_active gauge
node_md_is_active{device="md1"} 1
node_md_is_active{device="md5"} 1

P.S.: I do see the node_md_disks - node_md_disks_active calculation, but I'm not sure how it should work with hot spares.

@SuperQ
Member

SuperQ commented Jun 23, 2016

Maybe instead of node_md_disks and node_md_disks_active we should have a label value for state:

node_md_disks{device="md5",state="active"} 4
node_md_disks{device="md5",state="failed"} 0
node_md_disks{device="md5",state="spare"} 1
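A toy aggregation showing how this label scheme could also account for the hot-spare question raised above, assuming per-disk states are already known (the function and state names here are illustrative, not from node_exporter):

```python
from collections import Counter

def disks_by_state(disk_states):
    """Aggregate per-disk states into the proposed labeled metric.

    disk_states: mapping of member device -> state string.
    Returns a count for every known state, so that series with a
    value of 0 are still exported (failed=0 is a signal, too).
    """
    known_states = ("active", "failed", "spare")
    counts = Counter(disk_states.values())
    return {state: counts.get(state, 0) for state in known_states}

# md5 from the mdstat sample above: three active members, one failed.
md5 = {"sda1": "active", "sdc1": "active", "sdd1": "active", "sdb1": "failed"}
for state, value in disks_by_state(md5).items():
    print(f'node_md_disks{{device="md5",state="{state}"}} {value}')
```

Always emitting all three states keeps the series set stable, so dashboards and alerts don't have to deal with a `state="failed"` series appearing only once something breaks.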

@frittentheke
Contributor

May I ask whether this issue is still being pursued now that PR #492 has been closed?

@hryamzik
Author

I didn't manage to find the corresponding PR and can't check on a real system right now. However, the issue looks addressed, so let's close it for now.

@mpursley
Contributor

@hryamzik (and anyone else searching for why node_exporter doesn't have a metric for md software RAID disk states, e.g. failed, active, etc.):

I found a few PRs that add the disk states, but none of them got merged, e.g.:
#648
#492

Seems like there's more debate than consensus in those PRs, and they get closed over time...
In the meantime, I have updated the md_info text collector to include these disk states in node_md_info_* metrics, e.g. node_md_info, node_md_info_FailedDevices, node_md_info_WorkingDevices, etc.

See that PR and the updated md_info text collector here:
#1204

@discordianfish discordianfish reopened this Feb 9, 2019
@discordianfish
Member

@mpursley You're right, the PRs weren't merged. As mentioned here: #648 (comment)
The consensus was that we would like to have the functionality, but in its own module that node_exporter just uses. That could be part of node_exporter, but it would be great if it were externally maintained.

@mpursley
Contributor

mpursley commented Feb 15, 2019

Yeah, makes sense. Another option people can use in the meantime is this (now merged) text collector script (running in a cronjob as root)...

https://github.com/prometheus/node_exporter/blob/master/text_collector_examples/md_info_detail.sh

Thanks @discordianfish
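For anyone unfamiliar with the pattern that script uses: a textfile-collector script just writes Prometheus exposition text to a `.prom` file in the directory passed to node_exporter's `--collector.textfile.directory` flag, and node_exporter serves it on the next scrape. A minimal Python stand-in for such a cron job (file name and metric values are illustrative; the atomic-rename step is the important part):

```python
import os
import tempfile

def write_textfile_metrics(directory, metrics):
    """Atomically write a .prom file for node_exporter's textfile collector.

    Writing to a temp file first and then rename()-ing it into place
    prevents node_exporter from scraping a half-written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(metrics)
    final_path = os.path.join(directory, "md_info.prom")
    os.rename(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path

# Example payload in Prometheus exposition format:
metrics = (
    '# HELP node_md_info_FailedDevices Failed devices per md array.\n'
    '# TYPE node_md_info_FailedDevices gauge\n'
    'node_md_info_FailedDevices{device="md5"} 1\n'
)
path = write_textfile_metrics(tempfile.mkdtemp(), metrics)
print(open(path).read())
```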

@You-NeverKnow
Contributor

Hi everyone. I was planning to extract the stat-extraction complexity in mdadm_linux.go into a different repository, and then call functions from that repo in mdadm_linux.go so node_exporter only serves the metrics. Does that sound good?

@SuperQ
Member

SuperQ commented Jun 10, 2019

@You-NeverKnow Yes, sounds great, we've been moving all of the generic /proc and /sys parsing to prometheus/procfs.

@You-NeverKnow
Contributor

I'm confused. I was under the impression that we would use the GET_ARRAY_INFO ioctl to retrieve RAID array statuses, per this comment: #648 (comment).

Would you rather use the mdstat parser in procfs instead?

@SuperQ
Member

SuperQ commented Jun 10, 2019

#648 was never merged, so we're still just doing proc file parsing.

We could go the syscall route or the parsing route. I haven't looked at the mdadm stuff recently, but afaik there's some information you can only get by parsing. Also, we need to make sure any syscalls are available as non-root: for safety, we don't allow code in node_exporter that requires root-level access.
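One non-root alternative to both the ioctl and the /proc/mdstat text format is md's sysfs interface: the kernel exposes world-readable attributes such as `/sys/block/<md>/md/degraded` (the number of missing or failed members). A hedged sketch, assuming that attribute exists on the running kernel; the `sysfs_base` parameter is there only to make the demo self-contained:

```python
import os
import tempfile

def degraded_count(md_device, sysfs_base="/sys/block"):
    """Read the number of missing/failed members of an md array from
    the sysfs "md/degraded" attribute. Returns None if the attribute
    is absent (old kernel, or not an md device)."""
    path = os.path.join(sysfs_base, md_device, "md", "degraded")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None

# Demo against a fake sysfs tree (no root and no real arrays needed).
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "md5", "md"))
with open(os.path.join(base, "md5", "md", "degraded"), "w") as f:
    f.write("1\n")

print(degraded_count("md5", sysfs_base=base))  # 1
print(degraded_count("md9", sysfs_base=base))  # None
```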

SuperQ pushed a commit that referenced this issue Jul 1, 2019
* Closes issue #261 on node_exporter.

Delegated mdstat parsing to procfs project. mdadm_linux.go now only exports the metrics.
-> Added disk labels: "fail", "spare", "active" to indicate disk status
-> Changed metric node_md_disks_total ==> node_md_disks_required
-> Removed test cases for mdadm_linux.go, as the functionality they tested for has been moved to procfs project.

Signed-off-by: Advait Bhatwadekar <advait123@ymail.com>
@judos

judos commented Mar 29, 2020

Just came across this very useful feature request. Anything changed since last year?

For users: the link above is broken; the current locations are:
script: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
doc: https://github.com/prometheus/node_exporter#textfile-collector

@hoffie
Contributor

hoffie commented Mar 29, 2020

Just came across this very useful feature request. Anything changed since last year?

Yes, I would say so. #1403 was merged, which included the refactoring into procfs and the addition of the state label. I think the merge of that PR was also supposed to close this issue? @SuperQ

The change is part of v1.0.0-rc.0.
Technically, it should be possible to see disks in the "failed" state now. However, this only holds as long as the kernel has not yet removed the failed disks from the array (see #1655).

@judos

judos commented Mar 29, 2020

Awesome, I switched to the v1.0.0-rc.0 version and I get the metrics I wanted, e.g.:

node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0

I would also consider the ticket closed 😄 Thanks for the fast help!

@discordianfish
Member

Great and thanks for confirming. Closing.

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
* Closes issue prometheus#261 on node_exporter.

Delegated mdstat parsing to procfs project. mdadm_linux.go now only exports the metrics.
-> Added disk labels: "fail", "spare", "active" to indicate disk status
-> Changed metric node_md_disks_total ==> node_md_disks_required
-> Removed test cases for mdadm_linux.go, as the functionality they tested for has been moved to procfs project.

Signed-off-by: Advait Bhatwadekar <advait123@ymail.com>