Cyclic load peaks induced by node_exporter #1963

Closed
niko opened this issue Feb 10, 2021 · 11 comments

@niko

niko commented Feb 10, 2021

We're running node_exporter on a bunch of Ubuntu 18.04 servers within a Docker container. We were experiencing mysterious cyclic load peaks every 105 minutes, and by gradually switching off services we identified node_exporter as the culprit. At first we had been running node_exporter with only a few collectors excluded. We have now almost eliminated the problem by excluding most of the collectors. Here's the git diff:

-        command: --no-collector.bcache --no-collector.textfile --no-collector.timex --no-collector.wifi --no-collector.xfs --no-collector.zfs
+        command: --no-collector.bcache --no-collector.btrfs --no-collector.textfile --no-collector.wifi --no-collector.arp --no-collector.bonding --no-collector.conntrack --no-collector.cpufreq --no-collector.entropy --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.netclass --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.pressure --no-collector.rapl --no-collector.sockstat --no-collector.softnet --no-collector.thermal_zone --no-collector.time --no-collector.timex --no-collector.udp_queues --no-collector.xfs --no-collector.zfs

Just to give you a sense of the issue, here are some screenshots of the load graphs. As you can tell, we deployed node_exporter with reduced collectors at about 0:30.

Fileservers 24h:

image

Streamingserver 24h:

image

Strangely there seems to be a second order interference. Here's one server over a 9 day time period:

image

There is a wave of peaks at 16:30, 1:00, 14:00, 23:00, 11:30, 20:30, 9:00, 18:00.

Conclusion:

I think node_exporter should default to fewer collectors. I think you should warn people about activating too many collectors, even the more harmless ones. If possible, the intervals at which work runs inside node_exporter should somehow be synchronized.

I don't want to sound too harsh. node_exporter is a great tool for our infrastructure and I really love Prometheus! Thanks!

@uniemimu
Contributor

See also #1880

Anybody hitting strange load issues ought to first try removing the cpufreq collector.
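For example, by starting node_exporter with the --no-collector.cpufreq flag.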

@niko changed the title from Cyclic load peaks to Cyclic load peaks induced by node_exporter Feb 10, 2021
@discordianfish
Member

Yeah, please try to see if it happens without the cpu collector. If that's the case, you can close this issue and we'll follow up in #1880.

@SuperQ
Member

SuperQ commented Feb 10, 2021

It would be useful to provide metric data, specifically what metrics are in your graphs, and the results of rate(process_cpu_seconds_total[1m]).
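For example, queries along these lines (a sketch: the job label is just a placeholder for whatever your scrape config uses, and node_load1 assumes your graphs are based on the load average metrics):

    rate(process_cpu_seconds_total{job="node-exporter"}[1m])
    node_load1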

@SuperQ
Member

SuperQ commented Feb 10, 2021

If possible, the intervals at which work runs inside node_exporter should somehow be synchronized.

I think you misunderstand how the node_exporter works. There are no intervals in the node_exporter, or most Prometheus exporters for that matter. The node_exporter collects data on demand when Prometheus scrapes /metrics.
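The only interval involved is the scrape_interval configured on the Prometheus side, roughly like this (a minimal sketch; the job name, interval, and target address are just examples):

    scrape_configs:
      - job_name: node
        scrape_interval: 15s        # this is what determines how often the collectors actually run
        static_configs:
          - targets: ['myhost:9100']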

@SuperQ
Member

SuperQ commented Feb 10, 2021

It would also be useful to include some system information:

  • What kernel version
  • What kind of hardware/VM?
  • How many CPUs?

Based on the evidence provided so far, this seems like a graph of load average. This is going to be highly susceptible to run queue noise, and isn't exactly a good metric to rely on for "load".

I also suspect that if you set the environment variable GOMAXPROCS=1 in your node_exporter container and ran with the default flags, everything would be normal.
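If the container is started via docker-compose (which the command: line in the diff above suggests), setting it would look roughly like this (a sketch; the service name is just an example):

    node_exporter:
      environment:
        - GOMAXPROCS=1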

@SuperQ
Member

SuperQ commented Feb 10, 2021

If my theory about GOMAXPROCS is true, #1964 will help.

@discordianfish
Member

Can you maybe also provide graphs of CPU usage split by mode? I'd like to see whether this is actual CPU usage in userspace or in kernel space, which would indicate some issue there.
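Something like the following query over the standard node_cpu_seconds_total metric would show that (a sketch; adjust the range and any instance selector as needed):

    sum by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))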

@xelatirdan

xelatirdan commented Jul 14, 2021

I faced the same issue with node-exporter, and disabling the cpufreq collector (--no-collector.cpufreq) solved the problem. You can see it after 15:00 in the screenshot:
image

As @uniemimu wrote, this seems related to issue #1880.

@discordianfish
Member

I think we should collect the following information for each case:

  • What kernel version
  • What kind of hardware/VM?
  • How many CPUs?
  • CPU usage split by mode

@xelatirdan Can you provide these details?

@xelatirdan

Hello @discordianfish

What kernel version

Debian 10.10
# uname -a
Linux server1 4.19.0-17-amd64 #1 SMP Debian 4.19.194-1 (2021-06-10) x86_64 GNU/Linux

What kind of hardware/VM?

I see the issue on all servers; it doesn't depend on which CPU is used.
AMD EPYC 7402P 24-Core Processor or Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz

How many CPUs?

On the EPYC, 24 cores (48 HT); on the Xeon, 16 cores (32 HT).

CPU usage split by mode

On the CPU usage graph I don't see any spikes:
image

I use node-exporter in a docker container:
quay.io/prometheus/node-exporter:v1.1.2
Version:
node_exporter, version 1.1.2 (branch: HEAD, revision: b597c1244d7bef49e6f3359c87a56dd7707f6719)

@discordianfish
Member

Thanks! Yeah, looks like it's the issue in #1880. Let's close this and continue over there.
