Cyclic load peaks induced by node_exporter #1963

Closed
niko opened this issue Feb 10, 2021 · 11 comments

@niko

niko commented Feb 10, 2021

We're running node_exporter on a bunch of Ubuntu 18.04 servers within a Docker container. We were experiencing mysterious cyclic load peaks every 105 minutes, and by gradually switching off services we identified node_exporter as the culprit. At first we had been running node_exporter with only a few collectors excluded. We have now almost eliminated the problem by excluding most of the collectors. Here's the git diff:

-        command: --no-collector.bcache --no-collector.textfile --no-collector.timex --no-collector.wifi --no-collector.xfs --no-collector.zfs
+        command: --no-collector.bcache --no-collector.btrfs --no-collector.textfile --no-collector.wifi --no-collector.arp --no-collector.bonding --no-collector.conntrack --no-collector.cpufreq --no-collector.entropy --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.netclass --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.pressure --no-collector.rapl --no-collector.sockstat --no-collector.softnet --no-collector.thermal_zone --no-collector.time --no-collector.timex --no-collector.udp_queues --no-collector.xfs --no-collector.zfs

Just to give you a sense of the issue, here are some screenshots of the load graphs. As you can tell, we deployed node_exporter with reduced collectors at about 0:30.

Fileservers 24h:

image

Streamingserver 24h:

image

Strangely there seems to be a second order interference. Here's one server over a 9 day time period:

image

There is a wave of peaks at 16:30, 1:00, 14:00, 23:00, 11:30, 20:30, 9:00, 18:00.

Conclusion:

I think node_exporter should default to fewer collectors. I think you should warn people about activating too many collectors, even the more harmless ones. If possible, the intervals at which work runs inside node_exporter should somehow be synchronized.

I don't want to sound too harsh. node_exporter is a great tool for our infrastructure and I really love Prometheus! Thanks!

@uniemimu
Contributor

See also #1880

Anybody hitting strange load issues ought to first try removing the cpufreq collector.
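For example, by starting node_exporter with the --no-collector.cpufreq flag.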

@niko changed the title from Cyclic load peaks to Cyclic load peaks induced by node_exporter Feb 10, 2021
@discordianfish
Member

Yeah, please try to see if it happens without the cpu collector. If that's the case, you can close this issue and we'll follow up in #1880.

@SuperQ
Member

SuperQ commented Feb 10, 2021

It would be useful to provide metric data, specifically what metrics are in your graphs, and the results of rate(process_cpu_seconds_total[1m]).
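For example, queries along these lines (a sketch: the job label is just a placeholder for whatever your scrape config uses, and node_load1 assumes your graphs are based on the load average metrics):

    rate(process_cpu_seconds_total{job="node-exporter"}[1m])
    node_load1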

@SuperQ
Member

SuperQ commented Feb 10, 2021

If possible, the intervals at which work runs inside node_exporter should somehow be synchronized.

I think you misunderstand how the node_exporter works. There are no intervals in the node_exporter, or most Prometheus exporters for that matter. The node_exporter collects data on demand when Prometheus scrapes /metrics.
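The only interval involved is the scrape_interval configured on the Prometheus side, roughly like this (a minimal sketch; the job name, interval, and target address are just examples):

    scrape_configs:
      - job_name: node
        scrape_interval: 15s        # this is what determines how often the collectors actually run
        static_configs:
          - targets: ['myhost:9100']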

@SuperQ
Member

SuperQ commented Feb 10, 2021

It would also be useful to include some system information:

  • What kernel version
  • What kind of hardware/VM?
  • How many CPUs?

Based on the evidence provided so far, this seems like a graph of load average. This is going to be highly susceptible to run queue noise, and isn't exactly a good metric to rely on for "load".

I also suspect that if you set the environment variable GOMAXPROCS=1 in your node_exporter container and ran with the default flags, everything would be normal.
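If the container is started via docker-compose (which the command: line in the diff above suggests), setting it would look roughly like this (a sketch; the service name is just an example):

    node_exporter:
      environment:
        - GOMAXPROCS=1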

@SuperQ
Member

SuperQ commented Feb 10, 2021

If my theory about GOMAXPROCS is true, #1964 will help.

@discordianfish
Member

Can you maybe also provide graphs of CPU usage split by mode? I'd like to see whether this is actual CPU usage in userspace or in kernel space, which would indicate some issue there.
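Something like the following query over the standard node_cpu_seconds_total metric would show that (a sketch; adjust the range and any instance selector as needed):

    sum by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))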

@xelatirdan

xelatirdan commented Jul 14, 2021

I faced the same issue with node-exporter, and disabling the cpufreq collector (--no-collector.cpufreq) solved the problem. You can see it after 15:00 in the screenshot:
image

As @uniemimu wrote, this seems related to issue #1880.

@discordianfish
Member

I think we should collect the following information for each case:

  • What kernel version
  • What kind of hardware/VM?
  • How many CPUs?
  • CPU usage split by mode

@xelatirdan Can you provide these details?

@xelatirdan

Hello @discordianfish

What kernel version

Debian 10.10
# uname -a
Linux server1 4.19.0-17-amd64 #1 SMP Debian 4.19.194-1 (2021-06-10) x86_64 GNU/Linux

What kind of hardware/VM?

I see the issue on all servers; it doesn't depend on which CPU is used.
AMD EPYC 7402P 24-Core Processor or Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz

How many CPUs?

On the EPYC, 24 cores (48 HT); on the Xeon, 16 cores (32 HT).

CPU usage split by mode

On the CPU usage graph I don't see any spikes:
image

I use node-exporter in a docker container:
quay.io/prometheus/node-exporter:v1.1.2
Version:
node_exporter, version 1.1.2 (branch: HEAD, revision: b597c1244d7bef49e6f3359c87a56dd7707f6719)

@discordianfish
Member

Thanks! Yeah, looks like it's the issue in #1880. Let's close this and continue over there.
