
node_exporter uses excessive CPU on a ThunderX2 host #1880

Closed
djeverett opened this issue Nov 2, 2020 · 46 comments

Comments

@djeverett

djeverett commented Nov 2, 2020

Host operating system: output of uname -a

Linux rhel-8-2-aarch64-gpu-1 4.18.0-193.14.2.el8_2.aarch64 #1 SMP Wed Jul 29 19:46:42 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

node_exporter version: output of node_exporter --version

1.0.1

node_exporter command line flags

--collector.systemd --collector.textfile --collector.textfile.directory=/var/lib/node_exporter --collector.filesystem --collector.filesystem.ignored-fs-types='^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fuse..*|fusectl|hugetlbfs|iso9660|mqueue|nfs|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|tmpfs)$' --collector.filesystem.ignored-mount-points='^/(dev|proc|run|sys)($|/)' --web.listen-address=0.0.0.0:9100

Are you running node_exporter in Docker?

No

What did you do that produced an error?

On a ThunderX2 machine, run node_exporter and top. Then run curl localhost:9100/metrics and watch for node_exporter in top.

The output of lscpu on this host is:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               2
Model name:          ThunderX2 99xx
Stepping:            0x1
CPU max MHz:         2500.0000
CPU min MHz:         1000.0000
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
NUMA node1 CPU(s):   128-255
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

What did you expect to see?

%CPU for node_exporter is a few percent.

What did you see instead?

CPU usage goes up to more than 1200%.

@uniemimu
Contributor

uniemimu commented Nov 5, 2020

The cpu (-freq) collector(s) parallelize a lot, which can cause a sudden burst of threads accessing related kernel functionality pretty much at the same time. Could you try running node_exporter without the cpu* collectors to see if the issue is in those?

@djeverett
Author

djeverett commented Nov 5, 2020

Thanks for the response. I can confirm that the excessive CPU load is not seen if node_exporter is started with --no-collector.cpufreq. We are not actually using the cpufreq metrics, so disabling the collector is an appropriate solution for us.

@uniemimu
Contributor

uniemimu commented Nov 6, 2020

A long time ago I recall suggesting that we not query the frequency for hyperthreaded cores. Their frequency is always the same as that of the physical core, so one could just as well copy the value from the related physical core. That would reduce the thread burden by 50% on some CPUs, and as such would be a valid optimization. But I fear that is probably not quite a sufficient remedy.

Perhaps limiting the thread count to something much lower than the physical core count is the way to go. Or the related kernel-side functionality should be investigated to find out what exactly is so problematic when many threads try to read the frequency information at the same time.
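
For illustration, a minimal Go sketch of the kind of limit proposed above (hypothetical, not node_exporter's actual code; the cap of 4 and the CPU count of 8 are arbitrary). A buffered channel acts as a semaphore around the per-CPU sysfs reads:

    package main

    import (
        "fmt"
        "os"
        "strings"
        "sync"
    )

    func main() {
        const cpus = 8          // illustrative; in practice runtime.NumCPU() or a sysfs glob
        const maxConcurrent = 4 // arbitrary cap, far below the CPU count
        sem := make(chan struct{}, maxConcurrent)
        var wg sync.WaitGroup
        for i := 0; i < cpus; i++ {
            wg.Add(1)
            go func(cpu int) {
                defer wg.Done()
                sem <- struct{}{}        // blocks while maxConcurrent reads are in flight
                defer func() { <-sem }() // frees the slot
                p := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu)
                b, err := os.ReadFile(p)
                if err != nil {
                    return // CPU offline or cpufreq not available
                }
                fmt.Printf("cpu%d: %s kHz\n", cpu, strings.TrimSpace(string(b)))
            }(i)
        }
        wg.Wait()
    }

At most maxConcurrent reads then hit cpufreq sysfs at once, at the cost of a slower scrape on large machines.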

Last but not least, I would like to emphasize that reading the frequency information with as many threads as possible is also problematic in the sense that, on some hardware architectures, it has a rather significant impact on the result — a classic computer-science case of the observer effect. Starting as many threads as there are cores just to read the frequency information "fast" will, on certain high-core-count architectures, immediately kick the cores out of any turbo frequencies they might have been using, rendering the information that is read rather useless. You'll never see those turbo frequencies reported by node_exporter, even if they are in fact being used. I could also imagine overall CPU frequencies going up (though not as far as turbo) during the read period if spinlocks are getting congested, which could explain the perceived excessive CPU usage. Of the latter I have no proof; the lack of turbo frequencies getting reported I know and have witnessed for sure on high-core-count machines.

@SuperQ please take note

@SuperQ
Member

SuperQ commented Nov 6, 2020

The problem with limiting threads is that it will make the scrape slow and just defer the problem.

Remember, these are not OS threads; they are goroutines. If you want to limit the core time of node_exporter, it's easier to use GOMAXPROCS or cgroup controls.
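
For example (generic Go, nothing node_exporter-specific assumed): the cap can be set either by launching the process with the environment variable GOMAXPROCS=1, or from code:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        prev := runtime.GOMAXPROCS(1) // cap the scheduler to one OS thread; returns the old value
        fmt.Printf("GOMAXPROCS lowered from %d to %d\n", prev, runtime.GOMAXPROCS(0)) // 0 just queries
    }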

@uniemimu
Contributor

I can verify that this can also be an issue for certain high-core-count Xeons running short-ish scrape intervals, and the workaround is disabling the cpufreq collector.

I haven't seen any issues with low core count nodes.

@discordianfish
Member

Can y'all confirm this happens with both the cpu and the cpufreq collector?

Given that more people are reporting this now, we need to do something about it. I suspect there is some kernel issue that we're triggering. We're saying:

	// Execute the parsing of each CPU in parallel.
	// This is done because the kernel intentionally delays access to each CPU by
	// 50 milliseconds to avoid DDoSing possibly expensive functions.

@SuperQ
Member

SuperQ commented Feb 10, 2021

IIRC, @rtreffer found the explicit 50ms delay in the kernel code.

@uniemimu
Contributor

Can y'all confirm this happens with both the cpu and the cpufreq collector?

For us, disabling the cpufreq collector was sufficient to bring the load down, and the nagging from the fellow maintaining one particular cluster ceased with that. We do still run the cpu collector.

@SuperQ
Member

SuperQ commented Feb 10, 2021

Thanks, I'm guessing this would be solved with GOMAXPROCS=1 as well.

@discordianfish
Member

@uniemimu Can you try setting GOMAXPROCS=1, enable the cpufreq collector and see if you still see the issue?

SuperQ added a commit that referenced this issue Feb 10, 2021
Avoid running on all CPUs by limiting the Go runtime to one CPU by
default. Avoids having Go routines schedule on every CPU, driving up the
visible run queue length on high CPU count systems.

#1880

Signed-off-by: Ben Kochie <superq@gmail.com>
@SuperQ
Member

SuperQ commented Feb 10, 2021

For reference, the original issue about cpufreq taking too long is here: #1086

@uniemimu
Contributor

@uniemimu Can you try setting GOMAXPROCS=1, enable the cpufreq collector and see if you still see the issue?

It is better, and doesn't get out of control anymore on a 96-core machine. However, on an idling machine you can still easily tell that cpufreq is running; the 1m load went from 0.5 to about 1.5. Compare that with the default GOMAXPROCS, where the system becomes rather unstable and logging in with ssh is very sluggish, with the user quickly going for sudo kill -9 $(pidof node_exporter). Load averages are in the hundreds then.

@SuperQ
Member

SuperQ commented Feb 10, 2021

Any chance we could get access to one of these systems to do some tracing? Unfortunately, I have no such hardware to test with.

@uniemimu
Contributor

uniemimu commented Feb 11, 2021

Any chance we could get access to one of these systems to do some tracing? Unfortunately, I have no such hardware to test with.

Unfortunately, I can't give access to these from my side. But I did a perf trace:

 42.43% swapper
     99.72% [kernel.kallsyms]

 18.80% node_exporter
     82.58% [kernel.kallsyms]
     17.23% node_exporter

 14.47% kubelet
     62.91% kubelet
     36.19% [kernel.kallsyms]

  6.66% containerd-shim
     61.98% [kernel.kallsyms]
     36.74% containerd-shim-runc-v2
      1.27% [vdso]

  3.37% containerd
     68.11% containerd
     31.02% [kernel.kallsyms]

  1.20% alertmanager
     55.03% alertmanager
     43.07% [kernel.kallsyms]
      1.21% [vdso]

  1.06% weaver
     58.93% weaver
     37.85% [kernel.kallsyms]
      2.67% [unknown]

I only allowed it to run for a while with the default GOMAXPROCS; system load had climbed to around 10 when I started tracing, and it would have climbed into the hundreds within an hour. The issue is obviously on the kernel side; where exactly is another matter. I was running the 5.8.0-41-generic kernel during the trace, but 5.9.0 is no better.

@SuperQ
Member

SuperQ commented Feb 11, 2021

What would be useful is a Go pprof CPU profile.

  • Start node_exporter (on any port you want)
  • go tool pprof -seconds 60 "http://${target}:${port}/debug/pprof/profile"
  • Scrape the test node_exporter.
  • Save the profile (e.g. ${HOME}/pprof/pprof.node_exporter.samples.cpu.001.pb.gz)

The pprof tool will gather data for as long as you specify -seconds.
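
For context, a Go service typically exposes these endpoints by importing net/http/pprof, which registers the /debug/pprof/* handlers on the default mux. A generic sketch (the port is arbitrary, and node_exporter wires this up through its own web server rather than exactly like this):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }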

@uniemimu
Contributor

Meanwhile I took a look at a kernel.map that I happened to find, and the thing node_exporter is reaching for in the kernel points to osq_lock. That makes perfect sense to me; I was expecting it to be fighting over a (spin)lock down there for one reason or another.

@uniemimu
Contributor

pprof_tree.txt

@eero-t

eero-t commented Feb 11, 2021

@SuperQ With >4/5 of the node_exporter CPU cycles going to kernel per-CPU spinlock spinning on Xeon, I think strace is enough to debug the issue, and it can be done on any machine.

See if there are any syscalls that might require kernel-internal spinlocks (when you enable just the cpufreq collector), and reduce the number of those that scale up with the number of host CPUs.

@eero-t

eero-t commented Feb 11, 2021

Looking at the kernel osq_lock history: https://github.com/torvalds/linux/commits/master/kernel/locking/osq_lock.c
There haven't been any recent performance changes there, so the kernel version doesn't matter as long as it's from 2018 or later.

@SuperQ
Member

SuperQ commented Feb 12, 2021

Alright, I guess we need to trace which exact thing is causing the spinlock race, and put some kind of concurrency wrapper around it.

@cgwalters

So the core tension here seems to be that the kernel is intentionally trying to rate-limit userspace here for good reason, versus node_exporter not wanting to block. But isn't explicitly using just one asynchronous goroutine here sufficient?

@discordianfish
Member

@cgwalters That's still a bit unclear. We don't think it can be explained in this case by intentional throttling alone, but nobody has had time yet to dive deeper into the issue. We tried GOMAXPROCS=1, but using one goroutine might be worth a try.
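
A rough sketch of that one-goroutine idea (a reading of the proposal, not an actual patch): a single background goroutine refreshes the frequencies sequentially, and scrapes only read a cached snapshot:

    package collector

    import (
        "sync"
        "time"
    )

    // freqCache decouples scrapes from sysfs: only the refresher goroutine
    // touches the kernel, one CPU at a time.
    type freqCache struct {
        mu  sync.RWMutex
        khz map[int]uint64
    }

    // run polls forever; read is expected to do sequential per-CPU sysfs reads.
    func (c *freqCache) run(read func() map[int]uint64, every time.Duration) {
        for {
            v := read() // no goroutine-per-CPU burst
            c.mu.Lock()
            c.khz = v
            c.mu.Unlock()
            time.Sleep(every)
        }
    }

    // snapshot is what the scrape path would call.
    func (c *freqCache) snapshot() map[int]uint64 {
        c.mu.RLock()
        defer c.mu.RUnlock()
        return c.khz
    }

The trade-off: scrapes return slightly stale values, but concurrent scrapes can no longer multiply the sysfs load.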

SuperQ added a commit that referenced this issue Nov 28, 2022
Avoid running on all CPUs by limiting the Go runtime to one CPU by
default. Avoids having Go routines schedule on every CPU, driving up the
visible run queue length on high CPU count systems.

This also helps workaround a kernel deadlock issue with reading from
sysfs concurrently.

See:
* #1880
* #2500

Signed-off-by: Ben Kochie <superq@gmail.com>
SuperQ added a commit that referenced this issue Nov 29, 2022
NOTE: This changes the Go runtime "GOMAXPROCS" to 1. This is done to limit the
  concurrency of the exporter to 1 CPU thread at a time in order to avoid a
  race condition problem in the Linux kernel (#2500) and parallel IO issues
  on nodes with high numbers of CPUs/CPU threads (#1880).

* [CHANGE] Default GOMAXPROCS to 1 #2530
* [FEATURE] Add multiple listeners and systemd socket listener activation #2393
* [ENHANCEMENT] Add RTNL version of netclass collector #2492, #2528
* [BUGFIX] Fix diskstats exclude flags #2487
* [BUGFIX] Bump go/x/crypt and go/x/net #2488
* [BUGFIX] Fix hwmon label sanitizer #2504
* [BUGFIX] Use native endianness when encoding InetDiagMsg #2508
* [BUGFIX] Fix btrfs device stats always being zero #2516
* [BUGFIX] Security: Update exporter-toolkit (CVE-2022-46146) #2531

Signed-off-by: Ben Kochie <superq@gmail.com>
@SuperQ
Member

SuperQ commented Nov 29, 2022

I have published node_exporter v1.5.0 which now defaults GOMAXPROCS to 1.

We have also converted a couple of the collectors from reading files to netlink, which should improve performance greatly.

For one of these, however, you will need to use the flag --collector.netclass.netlink to enable it.

@britcey

britcey commented Jan 27, 2023

I've had enough issues on my large physical hosts that I'm thinking cpufreq should be disabled by default (do we have any sense of how widely used those metrics are?).

A sample:

[Screenshot: graphs of 1m load average and node_exporter CPU usage, 2023-01-27]

48 CPUs (2 sockets, 12 cores per socket, 2 threads per core). node_exporter 1.3.1 with GOMAXPROCS=1 set via a systemd override (I can share how much worse it looked before setting that).

Those queries are

avg_over_time(node_load1{instance="$instance"}[$__interval])

and

rate(process_cpu_seconds_total{instance="$instance", job="node"}[$__rate_interval])

You can see when I disabled the cpufreq collector on the 26th: all the little load spikes went away (there are still some spikes from real work), and there was a nice drop in CPU utilization as well.

Installing 1.5.0 today yielded even better CPU results. I briefly re-enabled cpufreq on 1.5.0 - still an issue.

cpufreq is probably only an issue on these sorts of large hosts, but it seems common enough on those that it should not be enabled by default.

This looks to be an issue on about 20% of my hosts - i.e., all of the large ones.

@SuperQ
Member

SuperQ commented Jan 27, 2023

It would be useful to get a pprof CPU profile when the problem is happening.

Load average is not really a useful metric, as it doesn't really reflect any real world problem.

@britcey

britcey commented Jan 27, 2023

Load average is not really a useful metric, as it doesn't really reflect any real world problem.

Agreed, but it's what got my users complaining. The CPU utilization is real (not huge, but real).

Let me scare up a suitable test host and get that pprof profile.

@britcey

britcey commented Jan 27, 2023

pprof.node_exporter.samples.cpu.001.pb.gz

This is from the same machine that's shown in the graph. node_exporter 1.5.0, only --web.listen-address & --collector.textfile.directory specified.

It was just a single scrape - let me know if you want one with repeated scrapes (my poll interval is 10s).

@SuperQ
Member

SuperQ commented Jan 28, 2023

Hmm, that pprof file only shows 80ms of CPU time over 60s. Even if it's just one scrape over 10s, that's only 0.8% average CPU use. I don't see much we can improve there. Most of the time is spent in syscalls, which we don't have a lot of control over; that's up to the kernel.

@britcey

britcey commented Jan 28, 2023

Could there be a cumulative issue that shows up after repeated scrapes? You can see from the graph that CPU utilization was cut by more than half when cpufreq was disabled in production, which seems significant.

It'll also be 2 scrapes in 10 seconds, since there are two pollers.

@SuperQ
Member

SuperQ commented Jan 28, 2023

Could there be a cumulative issue that shows up after repeated scrapes

Not likely. This code is very simple and basically no different from other collectors.

The only issue that I can see, which is specific to cpufreq, is that there is something wrong with the kernel's access to the data.

@britcey

britcey commented Jan 28, 2023

This is interesting:

I've got a group with a few dozen hosts running the same workload (OpenStack host nodes), but about half are AMD-based and the others Xeon-based. All of the Xeon-based hosts show elevated CPU consumption by node_exporter (~8%, with a couple significantly higher).

None of the AMD-based ones do (~1.2%).

(They're all running 1.5.0.)

Edit:

It's a bit more complicated, as the Xeon-based hosts are still running 1.3.1. I'll talk to the group next week and see if they can get to 1.5.0 across the board so we can do an apples-to-apples comparison.

@eero-t

eero-t commented Jan 30, 2023

@britcey How many cores and hyperthreads do your Xeon and AMD hosts have?

(I'm asking because freq info is per core, so that affects how much overhead querying it has.)

@eero-t

eero-t commented Jan 30, 2023

It's a bit more complicated, as the Xeon-based hosts are still running 1.3.1. I'll talk to the group next week and see if they can get to 1.5.0 across the board so we can do an apples-to-apples comparison.

@britcey The relevant change is GOMAXPROCS=1. You can do that with pod spec env var without upgrading node_exporter.

As for the 1.5.0 instances, the new --collector.netclass.netlink option is not the default, so you could try whether that has any impact.

@eero-t

eero-t commented Jan 30, 2023

Load average is not really a useful metric, as it doesn't really reflect any real world problem.

IMHO the run queue wait list size is a more useful metric than CPU frequency, which is basically a (strongly biased) random number gauge: its value can change hundreds of times per second, but node_exporter does not poll it at a frequency that would catch those changes (for obvious reasons).

The only issue that I can see, which is specific to cpufreq, is there is something wrong with the Kernel access to the data.

When it's done from threads reading that for all cores at the same time, it's lock contention.

While core frequency may once (in the last century) have been a kernel variable you could just read, nowadays frequency is controlled by firmware, and the kernel needs to use MSR registers to query the current frequency for a given core.

Accessing core MSR registers requires locks and waits. See:
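
To make that concrete, a hedged Go sketch of how userspace can read an MSR through the Linux msr driver (requires root and modprobe msr; register 0x198, IA32_PERF_STATUS, is an Intel-specific example chosen purely for illustration):

    package main

    import (
        "encoding/binary"
        "fmt"
        "os"
    )

    // readMSR reads one 64-bit MSR: the register number is the file offset
    // into /dev/cpu/<n>/msr, and each read returns exactly 8 bytes.
    func readMSR(cpu int, msr int64) (uint64, error) {
        f, err := os.Open(fmt.Sprintf("/dev/cpu/%d/msr", cpu))
        if err != nil {
            return 0, err
        }
        defer f.Close()
        buf := make([]byte, 8)
        if _, err := f.ReadAt(buf, msr); err != nil {
            return 0, err
        }
        return binary.LittleEndian.Uint64(buf), nil
    }

    func main() {
        v, err := readMSR(0, 0x198)
        if err != nil {
            fmt.Println("read failed (needs root and the msr module):", err)
            return
        }
        fmt.Printf("IA32_PERF_STATUS: %#x\n", v)
    }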

@britcey

britcey commented Jan 31, 2023

@britcey How many cores and hyperthreads your Xeon and AMD hosts have?

Two models of Xeon:
CPU(s): 40
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 1

CPU(s): 96
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2

AMD:
CPU(s): 64
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1

@britcey The relevant change is GOMAXPROCS=1. You can do that with pod spec env var without upgrading node_exporter.

I had a further CPU reduction going from 1.3.1 with GOMAXPROCS=1 to 1.5.0, but just setting GOMAXPROCS=1 would indeed be a better comparison - I'll see what they say.

@britcey

britcey commented Feb 13, 2023

FYI, I haven't forgotten this, just waiting on another group to upgrade node_exporter to 1.5.0 so we can compare their AMD & Xeon hosts.

@lyveng

lyveng commented May 18, 2023

@SuperQ We are facing a similar issue and we have a reproducible test case. We use similar hardware:
CPU(s): 96
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2

This problem doesn't seem to affect all workloads. Specifically, it affected the performance of a certain Java workload. We tried running stress-ng with the cpufreq collector enabled and disabled and noticed no noticeable impact there. But for the Java workload the impact was massive:

  1. throughput dropped by 10%
  2. mean latency increased by 25%
  3. tail latencies doubled, i.e., increased by 100%

The primary observation is that the run queue length (seen in the output of vmstat) drops almost to 0 every 5 to 10 seconds. CPU utilisation also drops almost to 0 during this time. We did JFR profiling for the same workload with and without cpufreq enabled in node exporter; CPU utilisation was spiky there as well.

[Graph: Java workload CPU utilisation with the cpufreq collector enabled]

[Graph: Java workload CPU utilisation with the cpufreq collector disabled]

stress-ng output when cpufreq collector is enabled

# stress-ng --taskset 0-95 -c 96 --cpu-method all --metrics-brief -t 30
stress-ng: info:  [2376019] dispatching hogs: 96 cpu
stress-ng: info:  [2376019] successful run completed in 30.03s
stress-ng: info:  [2376019] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [2376019]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [2376019] cpu             2044289     30.00   2841.41      1.38     68136.33         719.11

stress-ng output when cpufreq collector is disabled

# stress-ng --taskset 0-95 -c 96 --cpu-method all --metrics-brief -t 30
stress-ng: info:  [2378142] dispatching hogs: 96 cpu
stress-ng: info:  [2378142] successful run completed in 30.03s
stress-ng: info:  [2378142] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [2378142]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [2378142] cpu             2033389     30.00   2850.53      1.09     67773.24         713.06
# stress-ng --taskset 0-95 -c 96 --cpu-method all --metrics-brief -t 30
stress-ng: info:  [2378627] dispatching hogs: 96 cpu
stress-ng: info:  [2378627] successful run completed in 30.03s
stress-ng: info:  [2378627] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [2378627]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [2378627] cpu             2054517     30.00   2835.17      1.87     68475.13         724.18

We also generated flamegraph with and without cpufreq collector enabled

  1. flamegraph with cpufreq collector enabled - bmprofile_with_cpufreq
  2. flamegraph with cpufreq collector disabled - bmprofile_wo_cpufreq

From both of these flamegraphs, the function osq_lock seems to have been held for almost the same time, ~14.9%, in both cases.

We also ran the bpf tool klockstat with and without cpufreq collector enabled

  1. cpufreq collector enabled - lock_analysis_cpufreq_java.log
  2. cpufreq collector disabled - lock_analysis_wo_cpufreq_java.log

The specific lines of interest in the klockstat output when the cpufreq collector is enabled are:

                                  Caller   Avg Spin  Count   Max spin Total spin
           b'kernfs_dop_revalidate+0x33'     255479 833203 18446744071814924592 212866659508
           b'kernfs_dop_revalidate+0x33'     196896 416598 18446744071812460639 82026580458
           b'kernfs_iop_permission+0x29'     249223 1249808 18446744071811711629 311481831642
           b'kernfs_iop_permission+0x29'     244866 416594 18446744071810605667 102009733672
              b'kernfs_iop_getattr+0x28'     175374 416592 18446744071807082659 73059481320

It looks like the max spin and total spin values are crossing the limits of a u64.
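
A quick sanity check of those numbers: reinterpreted as a signed 64-bit integer, 18446744071814924592 is a small negative value, which points at sign wrap-around rather than a real spin time:

    package main

    import "fmt"

    func main() {
        v := uint64(18446744071814924592) // "Max spin" from the klockstat output above
        fmt.Println(int64(v))             // prints -1894627024
    }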

So these functions (kernfs_dop_revalidate, kernfs_iop_permission, kernfs_iop_getattr) have mutexes where the threads are blocking, and these functions in turn call osq_lock, which is a spinlock. The primary difference between cpufreq enabled and disabled is that acquiring these mutexes takes a lot longer when the cpufreq collector is enabled. Unfortunately the flamegraph shows the stack between java and these functions as unknown; I'm already using the perf-map-agent scripts to generate the flame graph, and symbols for the application code seem to work fine.
I tried to use the fileslower bpf tool to see which file opens are blocking on these mutexes, but in both cases (cpufreq enabled and disabled) the java process was accessing only cgroup-related files and the max latency was only 10-15ms.

I tried writing a bpftrace program to record files that are opened via the openat2 syscall and take longer than a certain threshold, like 50ms, but I'm getting some errors in the program. I'll try to get it running.

@SuperQ We can have a joint debugging session to root-cause this if required. We would not be able to give access to the systems since they are behind a VPN. For now we have disabled the cpufreq collector on our systems.

@SuperQ
Member

SuperQ commented May 18, 2023

@lyveng What version of node_exporter was used in your testing?

@lyveng

lyveng commented May 18, 2023

Node Exporter - v1.4.1
Kubernetes - 1.22.12
Kernel - 5.10.0-21-amd64
OS - Debian 11.6

I also checked the Lock Instances tab of the JFR dumps in Java Mission Control. There were no major differences in Java's locks between the two dumps. I think this is because JFR records only concurrency primitives native to Java, like the synchronized keyword.

@lyveng

lyveng commented May 18, 2023

Node Exporter is allocated the following resources:

node-exporter:
  resources:
    limits:
      cpu: 100m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
kube-rbac-proxy:
  resources:
    limits:
      cpu: 20m
      memory: 40Mi
    requests:
      cpu: 10m
      memory: 20Mi

We install node-exporter using the kube-prometheus stack.

@SuperQ
Member

SuperQ commented May 18, 2023

@lyveng That version is too old. As noted above, 1.5.0 from November 2022 includes changes that should make this problem go away.

Also, I don't recommend setting CPU limits on the node_exporter, rather, using the new version that sets GOMAXPROCS to 1.

@lyveng

lyveng commented May 23, 2023

Setting GOMAXPROCS to 1 in node exporter 1.4.1 fixed the issue, although I didn't run it for a longer period to validate. I went through #2500 to understand how the application threads get blocked because of the kernel bug. So it doesn't seem to be related to cpufreq as such, except that one goroutine per core is spawned (coderef), which causes heavy concurrent access to sysfs on systems with a large number of CPUs.

We are using Linux kernel 5.10. I'm trying to figure out whether commit 289caf5d8f6c61c6d2b7fd752a7f483cd153f182, which seems to fix this race condition (as per the comment), is present in our kernel or not.

Also is there any other reason why this ticket is kept open? Just asking to see if there are any other bugs to look out for.

@SuperQ
Member

SuperQ commented May 23, 2023

We left this open while waiting for confirmation that the issue is sufficiently fixed, and in case there were additional mitigations or fixes that would solve it more permanently.

If everyone is happy with the GOMAXPROCS workaround, we can call this solved.

Ping @britcey

jaimeyh added a commit to sysdiglabs/node_exporter that referenced this issue Feb 28, 2024
* Fixup codespell (#2455)

* Fix some mistakes
* Switch to an ignore file.

Signed-off-by: Ben Kochie <superq@gmail.com>

Signed-off-by: Ben Kochie <superq@gmail.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.2.0 to 1.2.2 (#2459)

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.2.0 to 1.2.2.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.2.0...v1.2.2)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump github.com/prometheus/client_golang (#2460)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.12.2 to 1.13.0.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.12.2...v1.13.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Wrap accesses of c.osFilename and c.osMtime in
mutex to prevent race condition.

Signed-off-by: Robin Nabel <rnabel@ucdavis.edu>

* feat: add support macos version (#2471)

Signed-off-by: Serhii Freidin <sfreydin@macpaw.com>

Signed-off-by: Serhii Freidin <sfreydin@macpaw.com>

* Merge metrics descriptions in textfile collector (#2475)

The textfile collector will now provide a unified metric description
(that will look like "Metric read from file/a.prom, file/b.prom")
for metrics collected accross several text-files that don't already
have a description.

Also change the error handling in the textfile collector tests to
ContinueOnError to better mirror the real-life use-case.

Signed-off-by: Guillaume Espanel <guillaume.espanel.ext@ovhcloud.com>

Signed-off-by: Guillaume Espanel <guillaume.espanel.ext@ovhcloud.com>

* Skip zfs iostats (#2451)

skip over the zfs IO metrics if their paths are missing

Signed-off-by: tnextday <fw2k4@163.com>

Signed-off-by: tnextday <fw2k4@163.com>

* Add btrfs device error stats (#2193)

* Improve metrics filesystem scanning logic
* Makes ioctl syscalls to load the device error stats.
* Adds filesystem mountpoint labels to existing metrics for ease of use.

Signed-off-by: Marcus Cobden <leth@users.noreply.github.com>

* Update common Prometheus files (#2473)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Release 1.4.0 (#2478)

* [CHANGE] Merge metrics descriptions in textfile collector #2475
* [FEATURE] [node-mixin] Add darwin dashboard to mixin #2351
* [FEATURE] Add "isolated" metric on cpu collector on linux #2251
* [FEATURE] Add cgroup summary collector #2408
* [FEATURE] Add selinux collector #2205
* [FEATURE] Add slab info collector #2376
* [FEATURE] Add sysctl collector #2425
* [FEATURE] Also track the CPU Spin time for OpenBSD systems #1971
* [FEATURE] Add support for MacOS version #2471
* [ENHANCEMENT] [node-mixin] Add missing selectors #2426
* [ENHANCEMENT] [node-mixin] Change current datasource to grafana's default #2281
* [ENHANCEMENT] [node-mixin] Change disk graph to disk table #2364
* [ENHANCEMENT] [node-mixin] Change io time units to %util #2375
* [ENHANCEMENT] Ad user_wired_bytes and laundry_bytes on *bsd #2266
* [ENHANCEMENT] Add additional vm_stat memory metrics for darwin #2240
* [ENHANCEMENT] Add device filter flags to arp collector #2254
* [ENHANCEMENT] Add diskstats include and exclude device flags #2417
* [ENHANCEMENT] Add node_softirqs_total metric #2221
* [ENHANCEMENT] Add rapl zone name label option #2401
* [ENHANCEMENT] Add slabinfo collector #1799
* [ENHANCEMENT] Allow user to select port on NTP server to query #2270
* [ENHANCEMENT] collector/diskstats: Add labels and metrics from udev #2404
* [ENHANCEMENT] Enable builds against older macOS SDK #2327
* [ENHANCEMENT] qdisk-linux: Add exclude and include flags for interface name #2432
* [ENHANCEMENT] systemd: Expose systemd minor version #2282
* [ENHANCEMENT] Use netlink for tcpstat collector #2322
* [ENHANCEMENT] Use netlink to get netdev stats #2074
* [ENHANCEMENT] Add additional perf counters for stalled frontend/backend cycles #2191
* [ENHANCEMENT] Add btrfs device error stats #2193
* [BUGFIX] [node-mixin] Fix fsSpaceAvailableCriticalThreshold and fsSpaceAvailableWarning #2352
* [BUGFIX] Fix concurrency issue in ethtool collector #2289
* [BUGFIX] Fix concurrency issue in netdev collector #2267
* [BUGFIX] Fix diskstat reads and write metrics for disks with different sector sizes #2311
* [BUGFIX] Fix iostat on macos broken by deprecation warning #2292
* [BUGFIX] Fix NodeFileDescriptorLimit alerts #2340
* [BUGFIX] Sanitize rapl zone names #2299
* [BUGFIX] Add file descriptor close safely in test #2447
* [BUGFIX] Fix race condition in os_release.go #2454
* [BUGFIX] Skip ZFS IO metrics if their paths are missing #2451

Signed-off-by: Ben Kochie <superq@gmail.com>

Signed-off-by: Ben Kochie <superq@gmail.com>

* Archived fixtures/udev similar to fixtures/sys to avoid go-get errors, fixes #2482 (#2485)

Signed-off-by: Darshil Chanpura <darshil@thatwebsite.xyz>

* Fix diskstats exclude flags (#2487)

Correctly handle the new `collector.diskstats.device-exclude` flag to
avoid errors when using the old `collector.diskstats.ignored-devices`
flag.

Fixes: https://github.com/prometheus/node_exporter/issues/2486

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update ISSUE_TEMPLATE.md

Signed-off-by: Johannes 'fish' Ziemke <github@5pi.de>

* Bump crypto and net CVE-2022-27191 CVE-2022-27664 (#2488)

* Bump crypto and net CVE-2022-27191 CVE-2022-27664

Signed-off-by: Jason Culligan <jason.culligan@intel.com>

* Fix hwmon label sanitizer (#2504)

We don't need to fully sanitize the hwmon label values to metric/label
name strings.
* Just make sure they're valid UTF-8.
* Always included the label metric to avoid group_left failures.

Signed-off-by: Ben Kochie <superq@gmail.com>

Signed-off-by: Ben Kochie <superq@gmail.com>

* build(deps): bump github.com/opencontainers/selinux

Bumps [github.com/opencontainers/selinux](https://github.com/opencontainers/selinux) from 1.10.1 to 1.10.2.
- [Release notes](https://github.com/opencontainers/selinux/releases)
- [Commits](https://github.com/opencontainers/selinux/compare/v1.10.1...v1.10.2)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/selinux
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/mdlayher/netlink from 1.6.0 to 1.6.2

Bumps [github.com/mdlayher/netlink](https://github.com/mdlayher/netlink) from 1.6.0 to 1.6.2.
- [Release notes](https://github.com/mdlayher/netlink/releases)
- [Changelog](https://github.com/mdlayher/netlink/blob/main/CHANGELOG.md)
- [Commits](https://github.com/mdlayher/netlink/compare/v1.6.0...v1.6.2)

---
updated-dependencies:
- dependency-name: github.com/mdlayher/netlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/coreos/go-systemd/v22 from 22.3.2 to 22.4.0 (#2493)

Bumps [github.com/coreos/go-systemd/v22](https://github.com/coreos/go-systemd) from 22.3.2 to 22.4.0.
- [Release notes](https://github.com/coreos/go-systemd/releases)
- [Commits](https://github.com/coreos/go-systemd/compare/v22.3.2...v22.4.0)

---
updated-dependencies:
- dependency-name: github.com/coreos/go-systemd/v22
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs/node-mixin: add fsMointpointSelector to alerts and dashboards (#2446)

* docs/node-mixin: add fsMountpointSelector

This adds the option to add a `mountpoint` selector to filesystem
related alerts. The default is `mountpoint!=""`.

* docs/node-mixins: add fsMountpointSelector to dashboards

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>

* Use native endianness when encoding InetDiagMsg (#2508)

Note however that the InetDiagMsg struct contains a InetDiagSockID
member, which itself contains some members which are explicitly
specified as big-endian in Linux kernel source:

struct inet_diag_sockid {
	__be16	idiag_sport;
	__be16	idiag_dport;
	__be32	idiag_src[4];
	__be32	idiag_dst[4];
	__u32	idiag_if;
	__u32	idiag_cookie[2];
};

node_exporter currently does not use these members for anything, so this
is acceptable (for now).

Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>

* Add multiple listeners and systemd socket listener activation (#2393)

Update exporter-toolkit to v0.8.1 to enable new listener support.

Signed-off-by: Perry Naseck <git@perrynaseck.com>

* Add procfs fallback to netdev collector (#2509)

Some systems have broken netlink messages due to patched kernels. Since
these messages can not be parsed, add a flag to fall back to parsing
from `/proc/net/dev`.

Fixes: https://github.com/prometheus/node_exporter/issues/2502

Signed-off-by: Ben Kochie <superq@gmail.com>

Signed-off-by: Ben Kochie <superq@gmail.com>

* build(deps): bump github.com/prometheus/client_model from 0.2.0 to 0.3.0

Bumps [github.com/prometheus/client_model](https://github.com/prometheus/client_model) from 0.2.0 to 0.3.0.
- [Release notes](https://github.com/prometheus/client_model/releases)
- [Commits](https://github.com/prometheus/client_model/compare/v0.2.0...v0.3.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_model
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.2.2 to 1.2.3

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.2.2 to 1.2.3.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.2.2...v1.2.3)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* Fix btrfs device stats always being zero (#2516)

* Respect rootfs path config option in btrfs ioctl
* Fix btrfs device stats always being zero

Signed-off-by: Marcus Cobden <leth@users.noreply.github.com>

* readme: remove RHEL/CentOS/Fedora installation info (#2525)

Copr community prometheus-exporters repository is obsoleted.

Signed-off-by: Otto Sabart <seberm@seberm.com>

Signed-off-by: Otto Sabart <seberm@seberm.com>

* add RTNL version of netclass collector (#2492)

* update rtnetlink package to v1.2.3
* add RTNL version of netclass collector that have all the metrics that netdev collector provides, too.

Signed-off-by: Haoyu Sun <hasun@redhat.com>

* update to golang.org/x/sys v0.1.0 (#2524)

Signed-off-by: Manuel Stausberg <stausberg@denic.de>

* Refactor netclass_rtnl collector (#2528)

* Refactor netclass_rtnl collector

Merge the netclass_rtnl collector into the netclass collector.
* Disabled by default
* Followup to #2492

Signed-off-by: Ben Kochie <superq@gmail.com>

* Default GOMAXPROCS to 1 (#2530)

Avoid running on all CPUs by limiting the Go runtime to one CPU by
default. Avoids having Go routines schedule on every CPU, driving up the
visible run queue length on high CPU count systems.

This also helps workaround a kernel deadlock issue with reading from
sysfs concurrently.

See:
* https://github.com/prometheus/node_exporter/issues/1880
* https://github.com/prometheus/node_exporter/issues/2500

Signed-off-by: Ben Kochie <superq@gmail.com>

* Bump Go modules (#2531)

* Update all Go modules to latest.
* Update Go minimum version to 1.18.
Signed-off-by: Ben Kochie <superq@gmail.com>

* Release v1.5.0

NOTE: This changes the Go runtime "GOMAXPROCS" to 1. This is done to limit the
  concurrency of the exporter to 1 CPU thread at a time in order to avoid a
  race condition problem in the Linux kernel (#2500) and parallel IO issues
  on nodes with high numbers of CPUs/CPU threads (#1880).

* [CHANGE] Default GOMAXPROCS to 1 #2530
* [FEATURE] Add multiple listeners and systemd socket listener activation #2393
* [ENHANCEMENT] Add RTNL version of netclass collector #2492, #2528
* [BUGFIX] Fix diskstats exclude flags #2487
* [BUGFIX] Bump go/x/crypt and go/x/net #2488
* [BUGFIX] Fix hwmon label sanitizer #2504
* [BUGFIX] Use native endianness when encoding InetDiagMsg #2508
* [BUGFIX] Fix btrfs device stats always being zero #2516
* [BUGFIX] Security: Update exporter-toolkit (CVE-2022-46146) #2531

Signed-off-by: Ben Kochie <superq@gmail.com>

* Correct documentation for --web.config.file flag

The --web.config flag changed to --web.config.file in
440a132c389dddd405b44d2d5e47157d1159e4d0 and was realised in the recent
v1.5.0 release.

Signed-off-by: Joe Groocock <me@frebib.net>

* Log current value of GOMAXPROCS

With `--runtime.gomaxprocs=0`, the GOMAXPROXS value will default to the
number of logical CPUs. In this case, it is more useful to log the
actual value than the value set by the user via the command-line.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* add options for perf profilers

Signed-off-by: mchtech <michu_an@126.com>

* fix the docker link in the ISSUE_TEMPLATE

Signed-off-by: Yury Vidineev <adeptg@gmail.com>

* Replace mistaken ) with }, resulting in parsable promql

Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>

* Update v1.5.0 release notes

Add note about the change to the experimental web flag.

Fixes: https://github.com/prometheus/node_exporter/issues/2535

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Migrate arp_linux.go to procfs

Signed-off-by: James Bach <qweet.ing@gmail.com>
Signed-off-by: James Bach <james.bach@wise.com>
Signed-off-by: jalev <qweet.ing@gmail.com>

* Change var name to match previous

Signed-off-by: James Bach <james.bach@wise.com>
Signed-off-by: jalev <qweet.ing@gmail.com>

* Fix lint issues

Signed-off-by: jalev <qweet.ing@gmail.com>

* Bump perf-utils version to 0.6.0

This change updates the perf-utils library to 0.6.0 which has some fixes
for automatically detecting the correct tracefs mountpoint if available.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>

* Fix thermal_zone collector noise

Add a check for missing/unreadable thermal zone stats and ignore if not
availlable.

Fixes: https://github.com/prometheus/node_exporter/issues/2552

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Enable uname collector on NetBSD too

This collector works just fine without any further changes.

Signed-off-by: Benny Siegert <bsiegert@gmail.com>

* build(deps): bump github.com/mdlayher/netlink from 1.7.0 to 1.7.1

Bumps [github.com/mdlayher/netlink](https://github.com/mdlayher/netlink) from 1.7.0 to 1.7.1.
- [Release notes](https://github.com/mdlayher/netlink/releases)
- [Changelog](https://github.com/mdlayher/netlink/blob/main/CHANGELOG.md)
- [Commits](https://github.com/mdlayher/netlink/compare/v1.7.0...v1.7.1)

---
updated-dependencies:
- dependency-name: github.com/mdlayher/netlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/josharian/native from 1.0.0 to 1.1.0

Bumps [github.com/josharian/native](https://github.com/josharian/native) from 1.0.0 to 1.1.0.
- [Release notes](https://github.com/josharian/native/releases)
- [Commits](https://github.com/josharian/native/compare/v1.0.0...v1.1.0)

---
updated-dependencies:
- dependency-name: github.com/josharian/native
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix cpustat when some cpus are offline

Signed-off-by: Jia Xin <alexjx@gmail.com>

* build(deps): bump github.com/prometheus/common from 0.37.0 to 0.39.0

Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.37.0 to 0.39.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Commits](https://github.com/prometheus/common/compare/v0.37.0...v0.39.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update e2e output for new common version.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* NetBSD support for the meminfo collector

This depends on a recent change to golang.org/x/sys that adds a
unix.SysctlUvmexp function.

Signed-off-by: Benny Siegert <bsiegert@gmail.com>

* memory_bsd: Fix a problem fetching the user wire count on FreeBSD

Signed-off-by: David O'Rourke <david.orourke@gmail.com>

* Optimize cpufreq collector

Move metric descriptiions to package vars to avoid allocating them every
time `NewCPUFreqCollector()` is called.

Signed-off-by: Ben Kochie <superq@gmail.com>

* build(deps): bump github.com/hodgesds/perf-utils from 0.6.0 to 0.7.0

Bumps [github.com/hodgesds/perf-utils](https://github.com/hodgesds/perf-utils) from 0.6.0 to 0.7.0.
- [Release notes](https://github.com/hodgesds/perf-utils/releases)
- [Commits](https://github.com/hodgesds/perf-utils/compare/v0.6.0...v0.7.0)

---
updated-dependencies:
- dependency-name: github.com/hodgesds/perf-utils
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Deprecate ntp collector

The ntp collector has always been a source of confusion and problems.
The data it produces is more of a blackbox probe against an NTP server.
The time sync / offset data produced is not what users expect.

Mark this collector as deprecated to be removed in v2.0.0

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* build(deps): bump golang.org/x/net from 0.4.0 to 0.7.0

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.4.0 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/compare/v0.4.0...v0.7.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* Remove metrics of offline CPUs in CPU collector

Signed-off-by: Haoyu Sun <hasun@redhat.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.3.0 to 1.3.1

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.3.0 to 1.3.1.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.3.0...v1.3.1)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/opencontainers/selinux

Bumps [github.com/opencontainers/selinux](https://github.com/opencontainers/selinux) from 1.10.2 to 1.11.0.
- [Release notes](https://github.com/opencontainers/selinux/releases)
- [Commits](https://github.com/opencontainers/selinux/compare/v1.10.2...v1.11.0)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/selinux
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump golang.org/x/sys from 0.5.0 to 0.6.0

Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.5.0 to 0.6.0.
- [Release notes](https://github.com/golang/sys/releases)
- [Commits](https://github.com/golang/sys/compare/v0.5.0...v0.6.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update exporter-toolkit

* Bump exporter-toolkit to the latest release.
* Use new toolkit landing page function.
* Update kingpin flags.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Bump exporter-toolkit

Pick up the fixes for 32-bit mode and updated HTML template.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update build

* Update Go to 1.20
* Update golangci-lint.
* Update CI orb.
* Fix staticcheck issue in perf collector.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Allow root path as metrics path. (#2590)

Signed-off-by: LamGC <lam827@lamgc.net>

* Fix spelling issues

Minor typo fixup.

Signed-off-by: Ben Kochie <superq@gmail.com>

* interrupts_linux: Fix fields on aarch64 (#2631)

* interrupts_linux: Fix fields on aarch64

Fixes #2557

---------

Signed-off-by: Daniël van Eeden <git@myname.nl>

* feat: add support for cpu freq governor metrics

Signed-off-by: Lukas Coppens <lukas.coppens@be-mobile.com>

* feat: add support for cpu freq governor metrics

Signed-off-by: Lukas Coppens <lukas.coppens@be-mobile.com>

* Reduce priviliges needed for btrfs device stats

Signed-off-by: Marcus Cobden <leth@users.noreply.github.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* build(deps): bump github.com/safchain/ethtool from 0.2.0 to 0.3.0

Bumps [github.com/safchain/ethtool](https://github.com/safchain/ethtool) from 0.2.0 to 0.3.0.
- [Release notes](https://github.com/safchain/ethtool/releases)
- [Commits](https://github.com/safchain/ethtool/compare/v0.2.0...v0.3.0)

---
updated-dependencies:
- dependency-name: github.com/safchain/ethtool
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/prometheus/common from 0.41.0 to 0.42.0

Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.41.0 to 0.42.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Commits](https://github.com/prometheus/common/compare/v0.41.0...v0.42.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* NetBSD support for CPU collector (#2626)

* Added CPU collector for NetBSD to provide load and temperature statistics

---------

Signed-off-by: Matthias Petermann <mp@petermann-it.de>

* feat: added suspended as a node_zfs_zpool_state (#2449)

Signed-off-by: Pablo Caderno <kaderno@gmail.com>

* build(deps): bump github.com/mdlayher/netlink from 1.7.1 to 1.7.2

Bumps [github.com/mdlayher/netlink](https://github.com/mdlayher/netlink) from 1.7.1 to 1.7.2.
- [Release notes](https://github.com/mdlayher/netlink/releases)
- [Changelog](https://github.com/mdlayher/netlink/blob/main/CHANGELOG.md)
- [Commits](https://github.com/mdlayher/netlink/compare/v1.7.1...v1.7.2)

---
updated-dependencies:
- dependency-name: github.com/mdlayher/netlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/prometheus/client_golang

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.14.0 to 1.15.0.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.14.0...v1.15.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* doc: added undocumented include and exclude flags (#2670)

* doc: added undocumented exclude flags


Signed-off-by: David Calvert <david@0xdc.me>

* Expose administrative state of network interfaces as 'adminstate'. (#2515)

Signed-off-by: Maximilian Wilhelm <max@sdn.clinic>

* Use go-runit fork, mark collector as deprecated

Signed-off-by: Johannes Ziemke <github@5pi.de>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.3.1 to 1.3.2 (#2673)

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.3.1 to 1.3.2.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.3.1...v1.3.2)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs (node/mixin): fix annotation for Skew alert (#2671)

This updates the annotation for the NodeClockSkewDetected mixin alert to
match the new threshold set.

Original discussion was in this PR: https://github.com/prometheus/node_exporter/pull/1480

I spent an embarrassingly large amount of time trying to figure out how
the heck that alert would mean 300s of clock skew. Turns out the
annotation was just left the same after the threshold change.

Signed-off-by: Will Bollock <wbollock@linode.com>

* collector/netisr_freebsd.go: Added collector for netisr subsystem. (#2668)

Signed-off-by: Jonathan Davies <jpds@protonmail.com>

* Do not hand define struct clockinfo here. Instead use the version from (#2663)

x/sys/unix. The clockinfo struct was altered beginning of 2021 and this
code was not adjusted.

Signed-off-by: Claudio Jeker <claudio@openbsd.org>

* Fix filesystem collector for OpenBSD to not print loads of zero bytes in name (#2637)

Use the filesystem collector for all OpenBSD archs, there is no reason to
only use it on amd64 systems.

Signed-off-by: Claudio Jeker <claudio@openbsd.org>

* collector: fix comment and remove redundant parentheses (#2691)

Signed-off-by: cui fliter <imcusg@gmail.com>

* bcache: remove cache_readaheads_totals metrics #2103 (#2583)

* bcache: remove cache_readaheads_totals metrics #2103

Signed-off-by: Saleh Sal <0xack13@gmail.com>

* Append bcacheReadaheadMetrics when CacheReadaheads value exists

Signed-off-by: Saleh Sal <0xack13@gmail.com>

* Update test cases for cachereadahead greater than zero

Signed-off-by: Saleh Sal <0xack13@gmail.com>

---------

Signed-off-by: Saleh Sal <0xack13@gmail.com>

* Fix CVE-2022-41723 by upgrading x/net to v0.10.0 (#2694)

Signed-off-by: Nitin Shelke <nshelke@cloudera.com>

* Update e2e output fixtures (#2696)

Fix up correct e2e output for node_power_supply_info.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update Go modules (#2695)

Update Prometheus modules to latest releases.
* Add missing fixtures for cpus online/offline.

Signed-off-by: Ben Kochie <superq@gmail.com>

* fix(zfs): add `memory_available_bytes`, fix `dbufstats` filename on Linux (#2687)

* Fix zfs memory_available_bytes collector
* Fix zfs dbufstats collector
---------

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

* Update Go module for ema/qdisc (#2700)

* Update Go module for ema/qdisc

---------

Signed-off-by: jbradleynh <jbradley@fastly.com>

* Deprecate supervisord collector

Mark the `supervisord` as deprecated. This process
supevisor, like `runit`, is of scope for the node_exporter.

Signed-off-by: Ben Kochie <superq@gmail.com>

* collector/diskstats: Use SCSI_IDENT_SERIAL as serial (#2612)

On most hard drives, `ID_SERIAL_SHORT` and `SCSI_IDENT_SERIAL` are identical,
but on some SAS drives they do differ. In that case, `SCSI_IDENT_SERIAL`
corresponds to the serial number printed on the drive label, and to the value
returned by `smartctl -i`.

So use that value by default for the `serial` label on the `node_disk_info`
metric, and fallback to `ID_SERIAL_SHORT` only if it's undefined.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>

* softnet: additionals metrics from softnet_data,  (#2592)

* softnet: additionals metrics from softnet_data, https://github.com/prometheus/procfs/pull/473
---------

Signed-off-by: remi <remijouannet@gmail.com>
Signed-off-by: Rémi Jouannet <remijouannet@gmail.com>

* exposing softirq metrics (#2294)

Signed-off-by: abbeywoodyear <abbey.woodyear@thehutgroup.com>

* netlink: read missing attributes from sysfs (#2669)

Read missing dev_id, name_assign_type, and addr_assign_type
from sysfs, since they only take a device-specific lock and
not the whole RTNL lock. This means reading them is much less
impactful on other system processes than many of the other
attributes in sysfs that do take the RTNL lock.

Signed-off-by: Dan Williams <dcbw@redhat.com>
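
A minimal sketch of the sysfs path this takes, assuming the standard /sys/class/net layout; device name and error handling are simplified:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// sysfsNetAttr reads one attribute file for a network device from sysfs,
// which only takes a device-specific lock rather than the RTNL lock.
func sysfsNetAttr(device, attr string) (string, error) {
	b, err := os.ReadFile(filepath.Join("/sys/class/net", device, attr))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	for _, attr := range []string{"dev_id", "name_assign_type", "addr_assign_type"} {
		v, err := sysfsNetAttr("eth0", attr)
		if err != nil {
			continue // the attribute may be absent on some devices
		}
		fmt.Printf("%s=%s\n", attr, v)
	}
}
```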

* Update ansible role in README.md (#2702)

https://github.com/cloudalchemy/ansible-node-exporter has been deprecated

Signed-off-by: Johannes Dilli <jd1@users.noreply.github.com>

* Release v1.6.0 (#2701)

* [CHANGE] Fix cpustat when some cpus are offline #2318
* [CHANGE] Remove metrics of offline CPUs in CPU collector #2605
* [CHANGE] Deprecate ntp collector #2603
* [CHANGE] Remove bcache `cache_readaheads_totals` metrics #2583
* [CHANGE] Deprecate supervisord collector #2685
* [FEATURE] Enable uname collector on NetBSD #2559
* [FEATURE] NetBSD support for the meminfo collector #2570
* [FEATURE] NetBSD support for CPU collector #2626
* [FEATURE] Add FreeBSD collector for netisr subsystem #2668
* [FEATURE] Add softirqs collector #2669
* [ENHANCEMENT] Add suspended as a `node_zfs_zpool_state` #2449
* [ENHANCEMENT] Add administrative state of Linux network interfaces #2515
* [ENHANCEMENT] Log current value of GOMAXPROCS #2537
* [ENHANCEMENT] Add profiler options for perf collector #2542
* [ENHANCEMENT] Allow root path as metrics path #2590
* [ENHANCEMENT] Add cpu frequency governor metrics #2569
* [ENHANCEMENT] Add new landing page #2622
* [ENHANCEMENT] Reduce privileges needed for btrfs device stats #2634
* [ENHANCEMENT] Add ZFS `memory_available_bytes` #2687
* [ENHANCEMENT] Use `SCSI_IDENT_SERIAL` as serial in diskstats #2612
* [ENHANCEMENT] Read missing netlink netclass attributes from sysfs #2669
* [BUGFIX] perf: fixes for automatically detecting the correct tracefs mountpoints #2553
* [BUGFIX] Fix `thermal_zone` collector noise #2554
* [BUGFIX] Fix a problem fetching the user wire count on FreeBSD #2584
* [BUGFIX] interrupts: Fix fields on linux aarch64 #2631
* [BUGFIX] Remove metrics of offline CPUs in CPU collector #2605
* [BUGFIX] Fix OpenBSD filesystem collector string parsing #2637
* [BUGFIX] Fix bad reporting of `node_cpu_seconds_total` in OpenBSD #2663

Signed-off-by: Ben Kochie <superq@gmail.com>

* build(deps): bump github.com/beevik/ntp from 0.3.0 to 1.0.0

Bumps [github.com/beevik/ntp](https://github.com/beevik/ntp) from 0.3.0 to 1.0.0.
- [Release notes](https://github.com/beevik/ntp/releases)
- [Changelog](https://github.com/beevik/ntp/blob/main/RELEASE_NOTES.md)
- [Commits](https://github.com/beevik/ntp/compare/v0.3.0...v1.0.0)

---
updated-dependencies:
- dependency-name: github.com/beevik/ntp
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/prometheus/procfs from 0.10.0 to 0.10.1

Bumps [github.com/prometheus/procfs](https://github.com/prometheus/procfs) from 0.10.0 to 0.10.1.
- [Release notes](https://github.com/prometheus/procfs/releases)
- [Commits](https://github.com/prometheus/procfs/compare/v0.10.0...v0.10.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/procfs
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.3.2 to 1.3.3

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.3.2 to 1.3.3.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.3.2...v1.3.3)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* Parallelize stat calls in Linux filesystem collector.

This change adds the ability to process multiple stat calls in parallel.
Processing is rate-limited based on the new flag
`collector.filesystem.stat-workers` (default 4).

Caveat: filesystem stats information is no longer in the same order as
returned by `/proc/1/mounts`.  This should not be an issue.

Caveat: This change currently uses unbuffered channels to prove
correctness without reliance on buffers.  Buffered channels will yield
superior performance.

Signed-off-by: Erica Mays <erica@emays.dev>
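
A minimal sketch of the worker-pool shape this describes, assuming Linux and the golang.org/x/sys/unix Statfs call; names and error handling are simplified relative to the real collector:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sys/unix"
)

// statAll stats every mount point using a fixed number of workers,
// mirroring the rate limiting of --collector.filesystem.stat-workers.
func statAll(mountPoints []string, workers int) map[string]unix.Statfs_t {
	jobs := make(chan string) // unbuffered, as the commit notes
	results := make(map[string]unix.Statfs_t)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for mp := range jobs {
				var st unix.Statfs_t
				if err := unix.Statfs(mp, &st); err != nil {
					continue // a real collector would surface this error
				}
				mu.Lock()
				results[mp] = st // completion order, not /proc/1/mounts order
				mu.Unlock()
			}
		}()
	}
	for _, mp := range mountPoints {
		jobs <- mp
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	for mp, st := range statAll([]string{"/", "/tmp"}, 4) {
		fmt.Printf("%s: %d free blocks\n", mp, st.Bfree)
	}
}
```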

* fix misspelling in CHANGELOG.md (#2717)

Signed-off-by: juzhao <juzhao@redhat.com>

* Bump wifi Go module (#2719)

Update github.com/mdlayher/wifi to the latest commit.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Bump ethtool library (#2720)

Update to latest release.

Signed-off-by: Ben Kochie <superq@gmail.com>

* add missing linkspeeds (#2711)

Signed-off-by: Cam Cope <ccope@crusoeenergy.com>

* Update common Prometheus files (#2723)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Update golangci-lint config (#2722)

* Migrate from Python codespell to golangci-lint misspell.
* Inline errcheck exclude list in the golangci-lint config.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Add mountpoint to NodeFilesystem alerts

This helps to identify the alerting filesystem.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Decrease NodeFilesystem pending time to 15m

30m is too long: there is a risk of running out of disk space/inodes completely if something is filling up the disk very fast (like a log file).

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add CPU and memory alerts

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add failed systemd service alert

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Decrease NodeNetwork*Errs pending period

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Set 'at' everywhere as preposition for instance

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add NodeDiskIOSaturation alert

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add %(nodeExporterSelector)s to Network and conntrack alerts

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add diskDevice selector

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Fix NodeMemoryHighUtilization alert

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add NodeSystemSaturation and NodeMemoryMajorPagesFaults

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Decrease NodeSystemdServiceFailed severity to warning

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Extend alert description

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add comma after 'mounted on'

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add thresholds for memory alerts

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Add thresholds for memory, disk and system alerts

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Set severity to NodeCPUHighUsage to info

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Update NodeSystemSaturation severity

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Revert alerts pending durations

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Add cpu vulnerabilities reporting from sysfs (#2721)

* Add cpu vulnerabilities reporting from sysfs

---------

Signed-off-by: Michal Wasilewski <michal@mwasilewski.net>

* build(deps): bump github.com/beevik/ntp from 1.0.0 to 1.1.1

Bumps [github.com/beevik/ntp](https://github.com/beevik/ntp) from 1.0.0 to 1.1.1.
- [Release notes](https://github.com/beevik/ntp/releases)
- [Changelog](https://github.com/beevik/ntp/blob/main/RELEASE_NOTES.md)
- [Commits](https://github.com/beevik/ntp/compare/v1.0.0...v1.1.1)

---
updated-dependencies:
- dependency-name: github.com/beevik/ntp
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/prometheus/client_golang

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.15.1 to 1.16.0.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.15.1...v1.16.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Add include and exclude filter for hwmon collector (#2699)

* Add include and exclude flags chip name flags to hwmon collector, following example in systemd collector

---------

Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: Ben Kochie <superq@gmail.com>

* Add missing ethtool flag documentation (#2743)

Signed-off-by: Gabi Davar <grizzly.nyo@gmail.com>

* Update all Include and Exclude variables to use the systemdUnit naming prefix (#2740)

Leave an annotation about using regexps instead of device_filter.go, so
@SuperQ doesn't need to remember everything.

Signed-off-by: Conall O'Brien <conall@conall.net>

* Fixup hwmon chip include (#2739)

Use the correct include value to the device filter function.
* Add new bogus hwmon fixture.
* Update end-to-end test to use hwmon chip include flag.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Release v1.6.1 (#2747)

Rebuild with latest Go compiler bugfix release.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Synchronize common files from prometheus/prometheus (#2736)

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Fixup linting issues

* Disable unused-parameter check.
* Fixup minor linting issues.

Signed-off-by: Ben Kochie <superq@gmail.com>

---------

Signed-off-by: prombot <prometheus-team@googlegroups.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
Co-authored-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files (#2752)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Include drm collector in README

The DRM collector was missing from the README; this change adds it together with a short description.

Signed-off-by: L <3177243+LukeLR@users.noreply.github.com>

* collector/netdev_linux.go: Fallback to 32-bit stats (#2757)

On some platforms, `msg.Attributes.Stats64` is `nil` because the kernel doesn't
expose 64-bit stats. In that case, return `msg.Attributes.Stats` instead, which
are the 32-bit equivalent.

Note that `RXOtherhostDropped` isn't available in that case, so we hardcode it
to zero.

Fixes #2756.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
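
A minimal sketch of the fallback logic, with simplified stand-in types for rtnetlink's 64-bit and 32-bit stats attributes (the real types carry many more counters):

```go
package main

import "fmt"

// Simplified stand-ins for the Stats64/Stats netlink attributes.
type stats64 struct{ RXPackets, TXPackets, RXOtherhostDropped uint64 }
type stats32 struct{ RXPackets, TXPackets uint32 }

type linkStats struct{ RXPackets, TXPackets, RXOtherhostDropped uint64 }

// linkStatsFrom prefers the 64-bit counters and falls back to the 32-bit
// ones when the kernel does not expose Stats64.
func linkStatsFrom(s64 *stats64, s32 *stats32) linkStats {
	if s64 != nil {
		return linkStats(*s64)
	}
	// 32-bit fallback: RXOtherhostDropped is unavailable, hardcode zero.
	return linkStats{
		RXPackets:          uint64(s32.RXPackets),
		TXPackets:          uint64(s32.TXPackets),
		RXOtherhostDropped: 0,
	}
}

func main() {
	fmt.Println(linkStatsFrom(nil, &stats32{RXPackets: 42, TXPackets: 7}))
}
```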

* build(deps): bump github.com/beevik/ntp from 1.1.1 to 1.3.0 (#2762)

Signed-off-by: Ben Kochie <superq@gmail.com>

* build(deps): bump github.com/prometheus/procfs from 0.11.0 to 0.11.1 (#2763)

Bumps [github.com/prometheus/procfs](https://github.com/prometheus/procfs) from 0.11.0 to 0.11.1.
- [Release notes](https://github.com/prometheus/procfs/releases)
- [Commits](https://github.com/prometheus/procfs/compare/v0.11.0...v0.11.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/procfs
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.3.3 to 1.3.4 (#2765)

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.3.3 to 1.3.4.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.3.3...v1.3.4)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Drop redundant GOOS build tags if already in filename

Drop redundant GOOS build tags at start of file if the constraint is
already specified by the filename, e.g. foo_GOOS.go or
foo_GOOS_GOARCH.go, avoiding potential confusion in future.

cf. https://pkg.go.dev/cmd/go#hdr-Build_constraints

Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
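
A small illustration of the rule, with a hypothetical `nofoo` tag standing in for the exporter's collector-disabling tags:

```go
// foo_linux.go: the _linux filename suffix already constrains this file to
// GOOS=linux, so a leading "//go:build linux" line would be redundant and is
// dropped. Explicit tags stay only for constraints the filename cannot
// express, such as disabling the collector at build time:

//go:build !nofoo

package collector
```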

* Sync build tags in *_test.go (#2767)

Ensure that unwanted tests are correctly excluded when various build
tags are specified, i.e. when the code that they test would be excluded
from compilation.

Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>

* Upgrade github.com/ema/qdisc to v1.0.0 to improve qdisc collector performance (#2779)

Signed-off-by: Oliver Geiselhardt-Herms <ogh@deepl.com>
Co-authored-by: Oliver Geiselhardt-Herms <ogh@deepl.com>

* Add CPU MHz as the value for "node_cpu_info" metric

For CPUs which don't have an available (or insertable) cpufreq driver,
the /proc/cpuinfo file can sometimes have accurate CPU core frequency
measurements. This change replaces the constant value of "1" for the
"node_cpu_info" metric with the parsed CPU MHz value from
/proc/cpuinfo for each core.

Signed-off-by: John Kordich <jkordich@gmail.com>

* Update e2e-output.txt with new expected metric values

Changes the e2e-output.txt file to have the expected CPU MHz values
for the node_cpu_info metric.

Signed-off-by: John Kordich <jkordich@gmail.com>

* Add new node_cpu_frequency_hertz metric

Revert changes to node_cpu_info and add new node_cpu_frequency_hertz
metric for measuring CPU frequency from /proc/cpuinfo

Signed-off-by: John Kordich <jkordich@gmail.com>

* Change log message from Warn to Debug

Signed-off-by: John Kordich <jkordich@gmail.com>

Co-authored-by: Ben Kochie <superq@gmail.com>
Signed-off-by: John Kordich <jkordich@gmail.com>
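
A minimal sketch of where the value comes from, assuming the usual /proc/cpuinfo layout; the helper name is illustrative:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuFrequencies parses the per-core "cpu MHz" fields from /proc/cpuinfo
// and converts them to hertz, as node_cpu_frequency_hertz reports.
func cpuFrequencies() ([]float64, error) {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hertz []float64
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Lines look like: "cpu MHz         : 2500.000"
		if k, v, ok := strings.Cut(s.Text(), ":"); ok && strings.TrimSpace(k) == "cpu MHz" {
			mhz, err := strconv.ParseFloat(strings.TrimSpace(v), 64)
			if err != nil {
				continue // some platforms omit or garble this field
			}
			hertz = append(hertz, mhz*1e6)
		}
	}
	return hertz, s.Err()
}

func main() {
	freqs, _ := cpuFrequencies()
	fmt.Println(freqs)
}
```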

* fix(qdisc) flag naming corrected for consistency (#2782)

* fix collector qdisc flag naming for consistency

---------

Signed-off-by: jbradleynh <jbradley@fastly.com>

* btrfs: close btrfs.FS handle after use

Despite being quite hard to provoke (< 10% in my testing), the btrfs
collector would occasionally leave stale FDs relating to btrfs
mountpoints, making the filesystems unable to be unmounted.

Fixes: #2772.

Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
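
A minimal sketch of the shape of the fix, assuming the dennwc/btrfs `Open(path, readOnly)` signature the exporter pulls in; treat the call as illustrative:

```go
package main

import (
	"log"

	"github.com/dennwc/btrfs"
)

func devStats(mountPath string) error {
	// Open a read-only handle to the btrfs filesystem.
	fs, err := btrfs.Open(mountPath, true)
	if err != nil {
		return err
	}
	// The fix: always close the handle, otherwise a stale FD can keep the
	// filesystem from being unmounted.
	defer fs.Close()
	// ... gather device stats via fs here ...
	return nil
}

func main() {
	if err := devStats("/mnt/btrfs"); err != nil {
		log.Fatal(err)
	}
}
```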

* Update to Go 1.21 (#2796)

* Update Go build to 1.21.
* Update machine images to Ubuntu 22.04 current.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Optionally fetch ARP stats via rtnetlink instead of procfs (#2777)

* Optionally fetch ARP stats via rtnetlink instead of procfs

Implement collection of ARP stats via rtnetlink to work around
shortcomings in the output of /proc/net/arp, which truncates InfiniBand
link-layer addresses.

Fixes: #2776

---------

Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
Co-authored-by: Ben Kochie <superq@gmail.com>
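
A minimal sketch, assuming the jsimonetti/rtnetlink neighbour-listing API: entries arrive over netlink with full-length link-layer addresses, unlike the fixed-width /proc/net/arp text that truncates InfiniBand addresses:

```go
package main

import (
	"fmt"
	"log"

	"github.com/jsimonetti/rtnetlink"
)

func main() {
	conn, err := rtnetlink.Dial(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// List neighbour (ARP/NDP) entries via netlink.
	neighbours, err := conn.Neigh.List()
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range neighbours {
		if n.Attributes == nil {
			continue
		}
		// Full-length link-layer address, even for InfiniBand (20 bytes).
		fmt.Printf("if=%d ip=%s lladdr=%s\n",
			n.Index, n.Attributes.Address, n.Attributes.LLAddress)
	}
}
```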

* build(deps): bump golang.org/x/sys from 0.10.0 to 0.12.0 (#2797)

Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.10.0 to 0.12.0.
- [Commits](https://github.com/golang/sys/compare/v0.10.0...v0.12.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update common Prometheus files (#2798)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Add ZFS freebsd per dataset stats (#2753)

* Rename parsePoolObjsetFile to parseLinuxPoolObjsetFile to better reflect
its scope
* Create a new parseFreeBSDPoolObjsetStats function, to generate a list
of per pool metrics to be queried via sysctl


---------

Signed-off-by: Conall O'Brien <conall@conall.net>

* Move RO status before error return

Signed-off-by: Metbog <metbog@gmail.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* fix(zfs)  zfs `arcstats.p` on FreeBSD 14.0+ (#2754)

* dongjiang, fix zfs arcstats.p

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

* dongjiang, fix gofmt -s

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

* change warn log to debug log by code review

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

---------

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

* Fix promhttp_metric_handler_errors_total metric not being disabled by flag

Signed-off-by: ToMe25 <ToMe25@gmx.de>

* build(deps): bump github.com/prometheus/client_golang (#2815)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.16.0 to 1.17.0.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.16.0...v1.17.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix inconsistent variable name, to address compilation issue (#2820)

https://github.com/prometheus/node_exporter/issues/2819

Signed-off-by: Conall O'Brien <conall@conall.net>

* Update README.md: update the 'more details' url in the section 'TLS endpoint' (#2814)

* Update README.md: correct the wrong url(link to exporter-toolkit web-config) in the section 'TLS endpoint'

Signed-off-by: yang-stressfree <68363665+yang-stressfree@users.noreply.github.com>

* Update README.md

Co-authored-by: Ben Kochie <superq@gmail.com>
Signed-off-by: yang-stressfree <68363665+yang-stressfree@users.noreply.github.com>

---------

Signed-off-by: yang-stressfree <68363665+yang-stressfree@users.noreply.github.com>
Co-authored-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* build(deps): bump golang.org/x/net from 0.11.0 to 0.17.0

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.11.0 to 0.17.0.
- [Commits](https://github.com/golang/net/compare/v0.11.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): bump github.com/prometheus/procfs from 0.11.1 to 0.12.0

Bumps [github.com/prometheus/procfs](https://github.com/prometheus/procfs) from 0.11.1 to 0.12.0.
- [Release notes](https://github.com/prometheus/procfs/releases)
- [Commits](https://github.com/prometheus/procfs/compare/v0.11.1...v0.12.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/procfs
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update e2e fixtures

Update for fixes in https://github.com/prometheus/procfs/pull/543

Signed-off-by: Ben Kochie <superq@gmail.com>

* NFSd: fix nfsd v4 index miss (#2824)

* fix nfsd v4 index miss

---------

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

* fix README section about exposing memory statistics

Signed-off-by: joey <zchengjoey@gmail.com>

* Fix typo in CHANGELOG.md (#2836)

Use # consistently for PR number.

Signed-off-by: nemobis <federicoleva@tiscali.it>

* Update common Prometheus files (#2840)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* build(deps): bump github.com/prometheus/common from 0.44.0 to 0.45.0 (#2837)

Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.44.0 to 0.45.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Commits](https://github.com/prometheus/common/compare/v0.44.0...v0.45.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump github.com/prometheus/client_model (#2838)

Bumps [github.com/prometheus/client_model](https://github.com/prometheus/client_model) from 0.4.1-0.20230718164431-9a2bf3000d16 to 0.5.0.
- [Release notes](https://github.com/prometheus/client_model/releases)
- [Commits](https://github.com/prometheus/client_model/commits/v0.5.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_model
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Release 1.7.0 (#2845)

* [FEATURE] Add ZFS freebsd per dataset stats #2753
* [FEATURE] Add cpu vulnerabilities reporting from sysfs #2721
* [ENHANCEMENT] Parallelize stat calls in Linux filesystem collector #1772
* [ENHANCEMENT] Add missing linkspeeds to ethtool collector #2711
* [ENHANCEMENT] Add CPU MHz as the value for `node_cpu_info` metric #2778
* [ENHANCEMENT] Improve qdisc collector performance #2779
* [ENHANCEMENT] Add include and exclude filter for hwmon collector #2699
* [ENHANCEMENT] Optionally fetch ARP stats via rtnetlink instead of procfs #2777
* [BUGFIX] Fix ZFS arcstats on FreeBSD 14.0+ #2754
* [BUGFIX] Fallback to 32-bit stats in netdev #2757
* [BUGFIX] Close btrfs.FS handle after use #2780
* [BUGFIX] Move RO status before error return #2807
* [BUGFIX] Fix `promhttp_metric_handler_errors_total` being always active #2808
* [BUGFIX] Fix nfsd v4 index miss #2824

Signed-off-by: Ben Kochie <superq@gmail.com>

* Add NodeBondingDegraded alert (#2843)

Signed-off-by: Ayoub Nasr <ayoub.nasr@scality.com>

* Make filesystem space prediction window configurable (#2844)

Signed-off-by: fitz123 <alugovoi@ordercapital.com>

* NFSd: handle new wdeleg_getattr attribute in /proc/net/rpc/nfsd (#2810)

This attribute was introduced in v6.6-rc1.

The relevant changes in procfs were merged here:

https://github.com/prometheus/procfs/pull/574

and are part of procfs v0.11.2

I have also figured out that the stat should be part of the v4 ops
counters struct, but that will need changes to both procfs and this
code. Since people are already using 6.6-rc1, I think it's better to get
the code out there --- even if they don't care about wdeleg_getattr,
currently they get _no_ nfsd stats with 6.6-rc1.

I will make two follow-up PRs to clean this up in the next releases of
procfs and node-exporter.

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>

* Update common Prometheus files (#2851)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Update containerization warnings (#2855)

Running node_exporter in containers is now a fairly well understood
problem. Replace the warnings with something less dire and more
prescriptive.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Fix debug log in cpu collector (#2857)

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* build(deps): bump github.com/alecthomas/kingpin/v2 from 2.3.2 to 2.4.0 (#2865)

Bumps [github.com/alecthomas/kingpin/v2](https://github.com/alecthomas/kingpin) from 2.3.2 to 2.4.0.
- [Release notes](https://github.com/alecthomas/kingpin/releases)
- [Commits](https://github.com/alecthomas/kingpin/compare/v2.3.2...v2.4.0)

---
updated-dependencies:
- dependency-name: github.com/alecthomas/kingpin/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump howett.net/plist from 1.0.0 to 1.0.1 (#2862)

Bumps [howett.net/plist](https://github.com/DHowett/go-plist) from 1.0.0 to 1.0.1.
- [Commits](https://github.com/DHowett/go-plist/compare/v1.0.0...v1.0.1)

---
updated-dependencies:
- dependency-name: howett.net/plist
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add new collector and metrics for XFRM (#2544) (#2866)

Signed-off-by: Gavin Lam <gavin.oss@tutamail.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.3.5 to 1.4.0 (#2864)

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.3.5 to 1.4.0.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.3.5...v1.4.0)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump golang.org/x/sys from 0.13.0 to 0.15.0 (#2863)

Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.13.0 to 0.15.0.
- [Commits](https://github.com/golang/sys/compare/v0.13.0...v0.15.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add TCPOFOQueue to default netstat metrics (#2867)

Adds a counter for TCP packets received out of order. This can be an
indication that there is packet loss on the path packets travel towards
this server. In that case, the sender will retransmit (and we can
already monitor Tcp_RetransSegs there), but we have no way to
monitor packet loss on the receiver side. When a packet is received
and the receiver detects a previous one missing, it increases the
TCPOFOQueue counter and replies with a selective ACK to the sender,
both possible indications of packet loss. Packet loss can be confirmed
by taking packet captures, ignoring Wireshark's automated analysis, and
carefully looking at the data being retransmitted based on the TCP
sequence numbers.

Just like RetransSegs, TCPOFOQueue should be interesting for any
deployment as a means to detect packet loss, so it is added to the
default list here.

Signed-off-by: François Rigault <frigo@amadeus.com>
Co-authored-by: François Rigault <frigo@amadeus.com>
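
For reference, a minimal sketch of where the counter is read from: /proc/net/netstat holds paired header/value lines per protocol extension, and TCPOFOQueue sits in the TcpExt rows:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// tcpOFOQueue finds the TCPOFOQueue counter in /proc/net/netstat, where a
// "TcpExt:" header line of field names is followed by a "TcpExt:" value line.
func tcpOFOQueue() (uint64, error) {
	f, err := os.Open("/proc/net/netstat")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		header := strings.Fields(s.Text())
		if len(header) == 0 || header[0] != "TcpExt:" || !s.Scan() {
			continue
		}
		values := strings.Fields(s.Text())
		// Both lines start with the "TcpExt:" token, so indexes line up.
		for i, name := range header {
			if name == "TCPOFOQueue" && i < len(values) {
				return strconv.ParseUint(values[i], 10, 64)
			}
		}
	}
	return 0, fmt.Errorf("TCPOFOQueue not found")
}

func main() {
	v, err := tcpOFOQueue()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("TCPOFOQueue:", v)
}
```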

* Update common Prometheus files (#2870)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Add mitigation information to the linux vulnerabilities collector (#2806)

While the CPU vulnerabilities collector was added in https://github.com/prometheus/node_exporter/pull/2721 , it currently does not include information about the mitigation strategy used for a given vulnerability.

This information can be quite valuable, as different mitigation strategies often come with different performance impacts.

This commit adds a third label to the cpu_vulnerabilities_info metric, to include the "mitigation" used for a given vulnerability; if a given vulnerability does not affect a node, or the node is still vulnerable, the mitigation is expected to be empty.

Signed-off-by: João Lima <jlima@cloudflare.com>
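
A minimal sketch of the sysfs source, assuming the standard layout where each file under /sys/devices/system/cpu/vulnerabilities/ reads "Not affected", "Vulnerable…", or "Mitigation: <strategy>":

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	files, err := filepath.Glob("/sys/devices/system/cpu/vulnerabilities/*")
	if err != nil {
		return
	}
	for _, file := range files {
		b, err := os.ReadFile(file)
		if err != nil {
			continue
		}
		status := strings.TrimSpace(string(b))
		// The mitigation label is filled only when a mitigation is active;
		// it stays empty for "Not affected" and "Vulnerable" states.
		mitigation := ""
		if rest, ok := strings.CutPrefix(status, "Mitigation: "); ok {
			mitigation = rest
		}
		fmt.Printf("vuln=%s status=%q mitigation=%q\n",
			filepath.Base(file), status, mitigation)
	}
}
```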

* Update common Prometheus files (#2872)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* build(deps): bump golang.org/x/crypto from 0.14.0 to 0.17.0 (#2877)

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.14.0 to 0.17.0.
- [Commits](https://github.com/golang/crypto/compare/v0.14.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update common Prometheus files (#2879)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* build(deps): bump github.com/prometheus/exporter-toolkit (#2885)

Bumps [github.com/prometheus/exporter-toolkit](https://github.com/prometheus/exporter-toolkit) from 0.10.0 to 0.11.0.
- [Release notes](https://github.com/prometheus/exporter-toolkit/releases)
- [Changelog](https://github.com/prometheus/exporter-toolkit/blob/master/CHANGELOG.md)
- [Commits](https://github.com/prometheus/exporter-toolkit/compare/v0.10.0...v0.11.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/exporter-toolkit
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump github.com/beevik/ntp from 1.3.0 to 1.3.1 (#2886)

Bumps [github.com/beevik/ntp](https://github.com/beevik/ntp) from 1.3.0 to 1.3.1.
- [Release notes](https://github.com/beevik/ntp/releases)
- [Changelog](https://github.com/beevik/ntp/blob/main/RELEASE_NOTES.md)
- [Commits](https://github.com/beevik/ntp/compare/v1.3.0...v1.3.1)

---
updated-dependencies:
- dependency-name: github.com/beevik/ntp
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump github.com/prometheus/client_golang (#2887)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.17.0 to 1.18.0.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.17.0...v1.18.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update common Prometheus files (#2897)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* diskstats: ignore zram devices on linux systems by default (#2898)

Signed-off-by: DBS-ST-VIT <dbs-st-vit@users.noreply.github.com>
Co-authored-by: DBS-ST-VIT <dbs-st-vit@users.noreply.github.com>

* Bump golang-builder version (#2908)

Signed-off-by: Alper Polat <gitperr@gmail.com>

* exec_bsd: Fix labels for vm.stats.sys.v_syscall sysctl (#2895)

Signed-off-by: David O'Rourke <david.orourke@gmail.com>

* chore: remove constant from function (#2884)

Signed-off-by: tyltr <tylitianrui@126.com>

* build(deps): bump github.com/prometheus/common from 0.45.0 to 0.46.0 (#2910)

Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.45.0 to 0.46.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Commits](https://github.com/prometheus/common/compare/v0.45.0...v0.46.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump github.com/jsimonetti/rtnetlink from 1.4.0 to 1.4.1 (#2909)

Bumps [github.com/jsimonetti/rtnetlink](https://github.com/jsimonetti/rtnetlink) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/jsimonetti/rtnetlink/releases)
- [Commits](https://github.com/jsimonetti/rtnetlink/compare/v1.4.0...v1.4.1)

---
updated-dependencies:
- dependency-name: github.com/jsimonetti/rtnetlink
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix hwmon nil ptr (#2873)

* fix hwmon nil ptr

The symlink may be missing in some cases.

---------

Signed-off-by: TaoGe <6657718+yowenter@users.noreply.github.com>

* Fix hwmon error capture (#2915)

Fix golangci-lint "ineffectual assignment" by correctly capturing any
errors within the hwmon gathering loop.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files (#2917)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Revert "Add ZFS freebsd per dataset stats (#2753)" (#2925)

This reverts commit f34aaa61092fe7e3c6618fdb0b0d16a68a291ff7.

Signed-off-by: Caleb Webber <caleb@codingthemsoftly.com>

* filesystem: fix mountTimeout not working issue (#2903)

Signed-off-by: DongWei <jiangxuege@hotmail.com>

* Fix description for NodeDiskIOSaturation alert (#2929)

NodeDiskIOSaturation description should say 30m per the "for" clause

Signed-off-by: Taylor Sly <slyt@users.noreply.github.com>

* Enforce no subprocess policy (#2926)

Add depguard to golangci-lint to enforce the no-os/exec policy.

Signed-off-by: Ben Kochie <superq@gmail.com>

* filesystem: surface device errors (#2923)

filesystem: surface filesystem device error

Fixes: #2918
---------

Signed-off-by: Pamela Mei i540369 <pamela.mei@sap.com>

* Revert "filesystem: fix mountTimeout not working issue (#2903)" (#2932)

This reverts commit 9f1f791ac2e1377781c4f8807a23d86d92ad6499.

Signed-off-by: Ben Kochie <superq@gmail.com>

* Update common Prometheus files (#2939)

Signed-off-by: prombot <prometheus-team@googlegroups.com>

* Update base image Go version

* Update dependencies with vulnerabilities

* Update dependencies

---------

Signed-off-by: Ben Kochie <superq@gmail.com>
Signed-off-by: dependabot[bot] <support@github.com>
Sign…
@kucharskim
Copy link

@SuperQ which version of Node Exporter should have this problem fixed?

@SuperQ
Copy link
Member

SuperQ commented Mar 7, 2024

@kucharskim Please do not at-mention me asking about things that are answered several times in this issue. That is rude, and especially bad that you have failed to read the thread here. If you continue to do this you will be banned.

@SuperQ
Copy link
Member

SuperQ commented Mar 7, 2024

Since this has been fixed, and we have not gotten any additional reports, I am closing this.

@SuperQ SuperQ closed this as completed Mar 7, 2024
@prometheus prometheus locked as resolved and limited conversation to collaborators Mar 7, 2024