node_exporter can't work on Centos 7 #697
This sounds like you have a firewall dropping packets. Do you get the same issue when connecting from localhost?
That is a good idea, so I ran the test again. Here is what I did; please correct me if I am wrong.

1. Check the firewall, and confirm that port 9100 is open:
[aaa@localhost ~]$ sudo systemctl disable firewalld

2. Check the node_exporter log:
[aaa@localhost ~]$ sudo ./node_exporter --log.level="debug"

3. Another check of iptables:
[aaa@localhost ~]$ sudo iptables -L
Chain FORWARD (policy ACCEPT)
Chain OUTPUT (policy ACCEPT)
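To double-check the firewall theory independently of node_exporter, a plain TCP connect test shows whether port 9100 is reachable at all. This is only a diagnostic sketch; port_open is a made-up helper name, and it assumes bash with /dev/tcp support:

```shell
#!/usr/bin/env bash
# Sketch: distinguish "firewall drops packets" from "service accepts the
# connection but hangs". This tests only the TCP handshake, not the HTTP
# response.
port_open() {
  local host=$1 port=$2
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

if port_open 127.0.0.1 9100; then
  echo "port 9100 accepts connections"
else
  echo "port 9100 closed or filtered"
fi
```

A firewall that silently drops packets typically makes the connect attempt hit the 2-second timeout, while a merely closed port fails immediately with connection refused.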
From the node_exporter output, I think the process gets the query and collects the metrics, but it does not send out the response. Please take a look.
I don't know the root cause, but it works on another CentOS 7 machine, so I am closing this.
Some update: I think it is a bug in node_exporter on CentOS 7, so I am reopening it. Here is what I did while checking the logs:

[aaa@localhost node_exporter-0.15.0.linux-amd64]$ curl http://127.0.0.1:9100/ -m 10
<title>Node Exporter</title>Node Exporter
[aaa@localhost node_exporter-0.15.0.linux-amd64]$ curl http://10.29.101.105:9100/ -m 10
<title>Node Exporter</title>Node Exporter
[aaa@localhost node_exporter-0.15.0.linux-amd64]$ curl http://10.29.101.105:9100/metrics -m 10
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received
[aaa@localhost node_exporter-0.15.0.linux-amd64]$ curl http://127.0.0.1:9100/metrics -m 10
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received
[aaa@localhost node_exporter-0.15.0.linux-amd64]$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy DROP)
Chain OUTPUT (policy ACCEPT)
Chain DOCKER (0 references)
Chain DOCKER-ISOLATION (0 references)
Chain DOCKER-USER (0 references)
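The pattern above (the landing page answers, but /metrics times out from both localhost and the external address) is the key signal: the listener is fine and only the collection path hangs. A small wrapper around curl's -m deadline makes the two cases easy to compare side by side; probe is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: fetch a URL with a hard deadline so a hanging handler shows up
# as FAIL (curl exits 28 on timeout) instead of blocking the terminal.
probe() {
  local url=$1
  if curl -sS -m 5 -o /dev/null "$url" 2>/dev/null; then
    echo "OK   $url"
  else
    echo "FAIL $url (timeout or error)"
  fi
}

probe http://127.0.0.1:9100/          # landing page: served from memory
probe http://127.0.0.1:9100/metrics   # metrics: runs every collector
```

If the first probe prints OK and the second prints FAIL, a firewall is effectively ruled out and the problem lives in one of the collectors.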
When I roll back to https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz, it works again.
Based on the log you included, the only collector not reporting OK was hwmon.
If that fixes the issue, run node_exporter with only this collector enabled so we can narrow it down. If this happens a lot, we might want to disable hwmon by default?
(but so far, this is the only report)
Post some strace logs for debugging.
@SuperQ I think you are correct: when I use ./node_exporter --no-collector.hwmon, it works well.
Please try running and then scraping the node_exporter with all collectors except the hwmon collector, and attach the log and trace output from that command.
Lol. Wish I had seen this. My thread tonight is here: What I saw: 0.14.0 WORKING, 0.15.0 BROKEN. When I refreshed the browser, the request for metrics somehow dropped off and the agent changed from Prometheus to the Firefox browser. Trace files are attached for both the working and the broken test case.
@DanielNeedles Please try the command I listed above to trace only the hwmon collector.
@SuperQ Thanks so much! And I apologize in advance for my failure to find and RTFM, but can you point me to the URL/man page on Node Exporter besides the source code? I know 0.14.0 is collecting metrics for memory and CPU, because I am graphing it, so clearly I don't understand what you mean by "collect[]" isn't supported. Is there a more in-depth description of these various parameters and what node_exporter is actually performing? Thanks! There probably should be something in the README if this is the command folks need to use on RHEL7/CentOS7. 8-) Attached is the resulting trace.
@DanielNeedles We added a new feature to the 0.15.0 release, the collect[] parameter; it is not supported by 0.14.0.
You included only the log output of the exporter, but not the strace contents. Also, it would help to set the log level to debug, so that I can get the timing info for your curl test command and see where the hwmon collector is failing.
@SuperQ My bad, I missed that you had the -o option already. Adding that file. This is the command I used, for your reference:
I have encountered the same issue with Ubuntu 14.04 LTS:
node_exporter 0.14.0 worked fine; with node_exporter 0.15.0 I need to disable the hwmon collector.
@stephan-vollmer Could you also run the same procedure above, so we can trace the root cause of what is going on with these older kernels?
@SuperQ Did the new trace I provided help at all? Or do you need something else?
Yes, a little. It was attempting to open temp1_input and got an error. We may not be catching the error correctly, or there is some other problem. If you look in the trace, you could try cat on the last file it tries to open. I will see about adding some more debug logging to this collector.
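One way to reproduce the suspected hang outside node_exporter is to read every hwmon input file with a per-file deadline, so a blocking sysfs file gets reported instead of wedging the whole scrape. This is only a diagnostic sketch (scan_hwmon is a made-up name); the real collector is Go code, not shell:

```shell
#!/usr/bin/env bash
# Sketch: read each hwmon *_input file with a 1-second deadline.
# A file that blocks forever (the suspected kernel bug) prints
# HANG/ERR instead of hanging this script.
scan_hwmon() {
  local root=${1:-/sys/class/hwmon} f v
  for f in "$root"/hwmon*/*_input; do
    [ -e "$f" ] || continue
    if v=$(timeout 1 cat "$f" 2>/dev/null); then
      echo "ok       $f = $v"
    else
      echo "HANG/ERR $f"
    fi
  done
}

scan_hwmon "$@"
```

Running this on an affected box should point at the exact sensor file (e.g. a temp1_input) that blocks or errors.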
@SuperQ Here is the trace file I created.
I also have problems with running 0.15.0 on CentOS (in Docker), in my case using the filesystem collector. I need to exclude the "net.*" filesystems to avoid loads of errors. My kernel is:
@lindhor my workaround was to use 0.14.0 for now.
I'm seeing similar behavior on Ubuntu 14.04 x86_64 (on VMware) with node_exporter 0.15.0. A curl call to /metrics just hangs after the GET has been sent. Disabling the hwmon collector (or downgrading to node_exporter-0.14.0) solves the problem. I can provide more data if needed.
@hrak That helps. Again, I'm seeing an error on the hwmon read, but I don't see any obvious places where we're trying to read a file and don't catch errors.
I've built a test node_exporter binary with some more verbose debug logging. I will set up some test machines with older distros to try to reproduce the bug.
I've done several hours of testing on a laptop (ThinkPad x230) running CentOS 7 with both the production and test binaries and have not been able to reproduce the error. 😞 If anyone is interested in testing, please try building from the superq/hwmon_test branch. It includes a bunch of extra debug text, and only enables hwmon by default.
Man, I am sorry about that! I really really hate it when that happens. I’ll take a gander later today.
@SuperQ I just tried your branch, and this is what I'm getting:

What I also noticed is that I'm getting a 'Resource temporarily unavailable' error when trying to read from the hwmon file directly.
@hrak Thanks. I think what we might have to do is change how we read those files. Can you give me some details on what your hardware is, or what hwmon0 is?
@SuperQ This is on Ubuntu 14.04 x86_64 on VMware. Actually, I'm seeing a pattern of this only happening on Ubuntu 14.04 in our VMware env; bare metal is fine.
I'm assuming that vmware is trying to present a CPU that Ubuntu thinks has coretemp support. I'm still trying to figure out if that is what breaks the read.
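To see what the guest kernel actually exposes, it can help to list each hwmon device's sysfs "name" file and check whether the coretemp module is loaded. These are standard Linux sysfs/procfs locations, not node_exporter-specific; hwmon_names is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: list hwmon devices by their sysfs "name" file and report
# whether the coretemp kernel module is currently loaded.
hwmon_names() {
  local root=${1:-/sys/class/hwmon} d
  shopt -s nullglob
  for d in "$root"/hwmon*; do
    printf '%s: %s\n' "${d##*/}" "$(cat "$d/name" 2>/dev/null || echo '?')"
  done
}

hwmon_names
grep -qw coretemp /proc/modules 2>/dev/null \
  && echo "coretemp module loaded" \
  || echo "coretemp module not loaded"
```

On a VM, a hwmon entry named coretemp combined with unreadable temp*_input files would support the theory that the virtual CPU advertises sensors the hypervisor never backs with data.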
@SuperQ I just found out that installing kernel 4.4.0 solves it.
@hrak Good news there. Does the kernel load/report hwmon coretemp metrics correctly?
It's reporting bogus values from the looks of things, but I guess that makes sense in a VM?
I still wonder what changed between 0.14.0 and 0.15.0 that broke this on 3.13 kernels, though. 0.14.0 worked fine on these VMs.
Got this same issue running on RHEL7, and fixed it with the hwmon workaround. But one side note: of the 4 nodes I'm running, 3 of them are on kube 1.8.0 and 1 is on 1.8.1.
@lucasvel Can you report on what kind of platform these are on? vmware? Can you also report the relevant sensor output?
@SuperQ Yes, they are on vmware.
Yes, it seems like the same vmware-related problem.
@SuperQ I can confirm that. Just unloading the coretemp module works around it. Thanks!
We have published v0.15.1, which contains the workaround for broken hwmon data. Please post if it improves things.
@SuperQ Just tested v0.15.1 and it works like a charm, without the need to disable the hwmon collector. Thank you!
That’s great! I hate those sorts of kernel bugs. Good job finding a workaround.
@SuperQ verified 0.15.1 working here as well. Thanks!
I'm still getting errors: anything rootfs related is giving permission errors. This wasn't an issue in v0.14.0. My docker run command is:

Is there something I'm missing?
@lindhor I've tried several things but no luck, still getting that error. CentOS 7.3 / node_exporter 0.15.2.
Host operating system: output of uname -a:
[aaa@localhost ~]$ uname -a
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version:
node_exporter, version 0.15.0 (branch: HEAD, revision: 6e2053c)
build user: root@168089f37ad9
build date: 20171006-11:33:58
go version: go1.9.1
node_exporter command line flags
sudo ./node_exporter --log.level="debug" --no-collector.zfs
Are you running node_exporter in Docker?
No. It didn't work in Docker either.
What did you do that produced an error?
[aaa@localhost ~]$ sudo ./node_exporter --log.level="debug" --no-collector.zfs
INFO[0000] Starting node_exporter (version=0.15.0, branch=HEAD, revision=6e2053c557f96efb63aef3691f15335a70baaffd) source="node_exporter.go:43"
INFO[0000] Build context (go=go1.9.1, user=root@168089f37ad9, date=20171006-11:33:58) source="node_exporter.go:44"
INFO[0000] No directory specified, see --collector.textfile.directory source="textfile.go:57"
INFO[0000] Enabled collectors: source="node_exporter.go:50"
INFO[0000] - filesystem source="node_exporter.go:52"
INFO[0000] - vmstat source="node_exporter.go:52"
INFO[0000] - edac source="node_exporter.go:52"
INFO[0000] - hwmon source="node_exporter.go:52"
INFO[0000] - infiniband source="node_exporter.go:52"
INFO[0000] - meminfo source="node_exporter.go:52"
INFO[0000] - textfile source="node_exporter.go:52"
INFO[0000] - cpu source="node_exporter.go:52"
INFO[0000] - entropy source="node_exporter.go:52"
INFO[0000] - arp source="node_exporter.go:52"
INFO[0000] - sockstat source="node_exporter.go:52"
INFO[0000] - loadavg source="node_exporter.go:52"
INFO[0000] - netdev source="node_exporter.go:52"
INFO[0000] - wifi source="node_exporter.go:52"
INFO[0000] - timex source="node_exporter.go:52"
INFO[0000] - xfs source="node_exporter.go:52"
INFO[0000] - netstat source="node_exporter.go:52"
INFO[0000] - diskstats source="node_exporter.go:52"
INFO[0000] - mdadm source="node_exporter.go:52"
INFO[0000] - time source="node_exporter.go:52"
INFO[0000] - conntrack source="node_exporter.go:52"
INFO[0000] - filefd source="node_exporter.go:52"
INFO[0000] - ipvs source="node_exporter.go:52"
INFO[0000] - stat source="node_exporter.go:52"
INFO[0000] - uname source="node_exporter.go:52"
INFO[0000] - bcache source="node_exporter.go:52"
INFO[0000] Listening on :9100 source="node_exporter.go:76"
DEBU[0005] OK: bcache collector succeeded after 0.000072s. source="collector.go:126"
DEBU[0005] CPU "/sys/bus/cpu/devices/cpu0" is missing cpufreq source="cpu_linux.go:114"
DEBU[0005] CPU "/sys/bus/cpu/devices/cpu0" is missing thermal_throttle source="cpu_linux.go:135"
DEBU[0005] Package "/sys/bus/node/devices/node0" CPU "0" is missing package_throttle source="cpu_linux.go:166"
DEBU[0005] OK: cpu collector succeeded after 0.000471s. source="collector.go:126"
DEBU[0005] Ignoring mount point: /sys source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /proc source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/kernel/security source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/shm source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/pts source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/systemd source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/pstore source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/devices source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/cpu,cpuacct source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/memory source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/perf_event source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/freezer source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/net_cls source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/cpuset source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/hugetlb source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/fs/cgroup/blkio source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/kernel/config source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /proc/sys/fs/binfmt_misc source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /sys/kernel/debug source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/mqueue source="filesystem_linux.go:42"
DEBU[0005] Ignoring mount point: /dev/hugepages source="filesystem_linux.go:42"
DEBU[0005] OK: filesystem collector succeeded after 0.000990s. source="collector.go:126"
DEBU[0005] OK: edac collector succeeded after 0.000022s. source="collector.go:126"
DEBU[0005] Unable to detect InfiniBand devices source="infiniband_linux.go:110"
DEBU[0005] OK: infiniband collector succeeded after 0.000081s. source="collector.go:126"
DEBU[0005] Set node_mem: map[string]float64{"WritebackTmp":0, "HugePages_Rsvd":0, "DirectMap4k":6.2849024e+07, "DirectMap2M":2.084569088e+09, "Inactive_file":1.22441728e+08, "SwapTotal":2.147479552e+09, "KernelStack":2.12992e+06, "PageTables":4.52608e+06, "VmallocTotal":3.5184372087808e+13, "Buffers":970752, "Unevictable":0, "HugePages_Total":0, "Dirty":8192, "CommitLimit":3.112349696e+09, "Active_file":3.760128e+07, "AnonPages":5.5844864e+07, "SUnreclaim":1.570816e+07, "Committed_AS":4.00420864e+08, "MemAvailable":1.667551232e+09, "Cached":1.6797696e+08, "Mlocked":0, "HugePages_Surp":0, "MemTotal":1.929740288e+09, "MemFree":1.611333632e+09, "SwapFree":2.147479552e+09, "Shmem":8.904704e+06, "SReclaimable":2.299904e+07, "Bounce":0, "AnonHugePages":8.388608e+06, "Active":9.3741056e+07, "Inactive":1.3099008e+08, "HugePages_Free":0, "Hugepagesize":2.097152e+06, "Writeback":0, "VmallocChunk":3.5184201691136e+13, "Inactive_anon":8.548352e+06, "Mapped":4.3884544e+07, "Slab":3.87072e+07, "NFS_Unstable":0, "VmallocUsed":1.61316864e+08, "HardwareCorrupted":0, "SwapCached":0, "Active_anon":5.6139776e+07} source="meminfo.go:48"
DEBU[0005] OK: meminfo collector succeeded after 0.000517s. source="collector.go:126"
DEBU[0005] OK: textfile collector succeeded after 0.000000s. source="collector.go:126"
DEBU[0005] OK: entropy collector succeeded after 0.000061s. source="collector.go:126"
DEBU[0005] OK: arp collector succeeded after 0.000098s. source="collector.go:126"
DEBU[0005] OK: sockstat collector succeeded after 0.000149s. source="collector.go:126"
DEBU[0005] return load 0: 0.000000 source="loadavg.go:51"
DEBU[0005] return load 1: 0.010000 source="loadavg.go:51"
DEBU[0005] return load 2: 0.050000 source="loadavg.go:51"
DEBU[0005] OK: loadavg collector succeeded after 0.000116s. source="collector.go:126"
DEBU[0005] OK: netdev collector succeeded after 0.000479s. source="collector.go:126"
DEBU[0005] OK: wifi collector succeeded after 0.000247s. source="collector.go:126"
DEBU[0005] OK: timex collector succeeded after 0.000024s. source="collector.go:126"
DEBU[0005] OK: xfs collector succeeded after 0.000157s. source="collector.go:126"
DEBU[0005] OK: netstat collector succeeded after 0.001702s. source="collector.go:126"
DEBU[0005] Ignoring device: fd0 source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: sda1 source="diskstats_linux.go:175"
DEBU[0005] Ignoring device: sda2 source="diskstats_linux.go:175"
DEBU[0005] OK: diskstats collector succeeded after 0.000333s. source="collector.go:126"
DEBU[0005] OK: mdadm collector succeeded after 0.000084s. source="collector.go:126"
DEBU[0005] Return time: 1507873748.996584 source="time.go:47"
DEBU[0005] OK: time collector succeeded after 0.000041s. source="collector.go:126"
DEBU[0005] OK: conntrack collector succeeded after 0.000086s. source="collector.go:126"
DEBU[0005] OK: filefd collector succeeded after 0.000064s. source="collector.go:126"
DEBU[0005] ipvs collector metrics are not available for this system source="ipvs_linux.go:113"
DEBU[0005] OK: ipvs collector succeeded after 0.000099s. source="collector.go:126"
DEBU[0005] OK: stat collector succeeded after 0.000185s. source="collector.go:126"
DEBU[0005] OK: uname collector succeeded after 0.000049s. source="collector.go:126"
DEBU[0005] OK: vmstat collector succeeded after 0.008549s. source="collector.go:126"
Use curl to test:
[aaa@localhost ~]$ curl http://10.29.101.101:9100/metrics --max-time 10 -kvv
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received
What did you expect to see?
No output from http://10.29.101.101:9100/metrics on CentOS 7, while the same binary works well on Ubuntu 16.04. It seems to hang at some step; would you please take a look?
What did you see instead?