
Kernel bug skewing node_cpu{mode="steal"} metrics #742

Closed
pecigonzalo opened this issue Nov 21, 2017 · 22 comments
@pecigonzalo

pecigonzalo commented Nov 21, 2017

Host operating system: output of uname -a

Debian 9
Linux ip-10-11-1-110 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

prom/node-exporter:v0.15.0

node_exporter command line flags

/usr/bin/docker run \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /:/rootfs:ro \
    -e HOST_HOSTNAME="/rootfs/etc/hostname" \
    -p 19997:9100 \
    --pid="host" \
    --name %i \
    basi/node-exporter:v1.14.0 \
    -collector.procfs "/host/proc" \
    -collector.sysfs /host/sys \
    -collector.textfile.directory /etc/node-exporter/ \
    -collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
    -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

sum(rate(node_cpu{instance="$instance"}[1m])) by (mode) * 100 / count_scalar(node_cpu{mode="user", instance="$instance"})

What did you expect to see?

Roughly 100% total usage when aggregating across all mode values

What did you see instead?

Some level of steal is expected as this is running on AWS, but I believe the value is out of proportion.

Element Value
{mode="guest"} 0
{mode="user"} 0
{mode="iowait"} 0.033333333333366175
{mode="system"} 0
{mode="irq"} 0
{mode="idle"} 94.95555555551417
{mode="steal"} 67456900655.031654
{mode="nice"} 0
{mode="softirq"} 0

Stats

# cat /proc/stat
cpu  10054 454 10504 1435808 6220 0 64 1802 0 0
cpu0 4970 268 5291 717279 3584 0 8 752 0 0
cpu1 5084 186 5212 718529 2636 0 56 1049 0 0
@SuperQ
Member

SuperQ commented Nov 21, 2017

This sounds like you're getting funny data out of /proc/stat at some point. In order to debug this we need the raw samples from your system.

You can query this with the instant API or expression browser.

node_cpu{instance="$instance",mode="steal"}[5m]
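
For reference, a minimal sketch of pulling those raw samples straight out of the Prometheus HTTP query API (the Prometheus address below is a placeholder, and the instance selector still needs to be substituted with the real instance label):

# Ask the instant-query API for the raw samples of the last 5 minutes.
# http://prometheus:9090 is a placeholder for your Prometheus server.
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=node_cpu{instance="1.2.3.4:9100",mode="steal"}[5m]'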

@pecigonzalo
Author

Here:

Element Value
node_cpu{cpu="cpu1",instance="$instance",instance_ip="10.10.1.232",job="node-exporter",mode="steal"} 13823647282.09 1511271040.93112768804791.68 1511271055.93111656879505.85 1511271070.93110569135126.41 1511271085.9319520238760.52 1511271100.9318489180768.2 1511271115.9317448212568.35 1511271130.9316378702970.78 1511271145.9315347644978.46 1511271160.9314318965435.95 1511271175.9313243113305.56 1511271190.9312174396524.6 1511271205.9311105283335.33 1511271220.93128241980.03 1511271235.93117391998573.92 1511271250.93116297118845.05 1511271265.93115237519455.02 1511271280.93114185451822.72 1511271295.93113121491941.37 1511271310.93112064667409.46 1511271325.931
node_cpu{cpu="cpu0",instance="$instance",instance_ip="10.10.1.232",job="node-exporter",mode="steal"} 14983696025.52 1511271040.93113903879812.11 1511271055.93112808207266.65 1511271070.93111754950409.44 1511271085.93110629547241.36 1511271100.9319505729706.48 1511271115.9318397768503.66 1511271130.9317265626394.45 1511271145.9316177089198.41 1511271160.9315109561642.35 1511271175.9314041241269.69 1511271190.9312933676475.17 1511271205.9311838400338 1511271220.931755016449.89 1511271235.93118087060379.65 1511271250.93116983063259.85 1511271265.93115901661413.24 1511271280.93114786564860.99 1511271295.93113680189291.38 1511271310.93112567867597.25 1511271325.931

Although, as you can see from my /proc/stat output, the values look normal. Or do you mean at some point in the past?

@SuperQ
Member

SuperQ commented Nov 21, 2017

It looks like the formatting got a little messed up, but after cleaning it up you can see that what is being recorded is completely different from what /proc/stat says. Something is getting corrupted.

13823647282.09 1511271040.931
12768804791.68 1511271055.931
11656879505.85 1511271070.931
10569135126.41 1511271085.931
9520238760.52 1511271100.931
8489180768.2 1511271115.931
7448212568.35 1511271130.931
6378702970.78 1511271145.931
5347644978.46 1511271160.931
4318965435.95 1511271175.931
3243113305.56 1511271190.931
2174396524.6 1511271205.931
1105283335.33 1511271220.931
28241980.03 1511271235.931
17391998573.92 1511271250.931
16297118845.05 1511271265.931
15237519455.02 1511271280.931
14185451822.72 1511271295.931
13121491941.37 1511271310.931
12064667409.46 1511271325.931

What do you get if you manually curl 'http://$instance:9100/metrics' | grep node_cpu?

@pecigonzalo
Author

This is in docker, so using the mapped port:

> curl 'http://localhost:19997/metrics' | grep node_cpu
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="guest"} 0
node_cpu{cpu="cpu0",mode="idle"} 998557.46
node_cpu{cpu="cpu0",mode="iowait"} 492.78
node_cpu{cpu="cpu0",mode="irq"} 0
node_cpu{cpu="cpu0",mode="nice"} 6.02
node_cpu{cpu="cpu0",mode="softirq"} 1.91
node_cpu{cpu="cpu0",mode="steal"} 1.208129581039e+10
node_cpu{cpu="cpu0",mode="system"} 661.86
node_cpu{cpu="cpu0",mode="user"} 798.07
node_cpu{cpu="cpu1",mode="guest"} 0
node_cpu{cpu="cpu1",mode="idle"} 998280.19
node_cpu{cpu="cpu1",mode="iowait"} 312.78
node_cpu{cpu="cpu1",mode="irq"} 0
node_cpu{cpu="cpu1",mode="nice"} 1.84
node_cpu{cpu="cpu1",mode="softirq"} 7.64
node_cpu{cpu="cpu1",mode="steal"} 6.6865899412e+08
node_cpu{cpu="cpu1",mode="system"} 670.77
node_cpu{cpu="cpu1",mode="user"} 807.26

@pecigonzalo
Author

pecigonzalo commented Nov 21, 2017

OK, I found a major problem with my report; I apologize for the confusion.
The config for the current process is actually:

/usr/bin/docker run \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /:/rootfs:ro \
    -e HOST_HOSTNAME="/rootfs/etc/hostname" \
    -p 19997:9100 \
    --pid="host" \
    --name %i \
    basi/node-exporter:v1.14.0 \
    -collector.procfs "/host/proc" \
    -collector.sysfs /host/sys \
    -collector.textfile.directory /etc/node-exporter/ \
    -collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
    -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"

I took that information from my config VCS (which references 0.15) rather than from the environment that is actually running, which is on 0.14.

PS: Made an edit to my first report, to avoid confusion.

@SuperQ
Member

SuperQ commented Nov 21, 2017

Ahh, great, that makes a lot more sense. 😃

@SuperQ closed this as completed on Nov 21, 2017
@pecigonzalo
Author

Sorry @SuperQ, the issue is still relevant to 0.14; I just pasted the wrong command for what I was actually running on the server.
I'll deploy 0.15 and test before we possibly reopen this, to make sure it was not somehow fixed in 0.15.

@SuperQ
Member

SuperQ commented Nov 23, 2017

OK. I suspect there's some kind of corruption inside Docker that is messing up reading the values from /proc/stat. The only way to really know what's going on is to run a debug binary that prints the raw data.

@wdoekes

wdoekes commented Dec 1, 2017

Not related to node_exporter, but I'm seeing something similar with steal time on a (cloudvps) node.

Idle time (the per-second delta should be at most ~100 ticks per CPU):

# num=-1; while true; do new=$(awk '/^cpu2 /{print $5}' /proc/stat); \
    test $num -ne -1 && echo $((new-num)); num=$new; sleep 1; done 
99
95
56
19
95
93
92
95
...

^-- looks okay

Steal time:

# num=-1; while true; do new=$(awk '/^cpu2 /{print $9}' /proc/stat); \
    test $num -ne -1 && echo $((new-num)); num=$new; sleep 1; done 
-832457433
-396408302
-753175773
-515330792
-792816603
-951379924
-832457433
-554971622
...

^-- off the charts

# uname -a
Linux hostname 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu xenial system, with dmidecode:

System Information
        Manufacturer: OpenStack Foundation
        Product Name: OpenStack Nova
        Version: 2014.1.3

In our case, we noticed this because Zabbix reports 100% steal time continuously, while other tools like top and vmstat only report some steal (when the diff is > 0).

Interestingly, of the 4 CPUs, CPU0 appears to behave fine, with only small steal-time increments.

I'm curious to know what you see on your systems.
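
A minimal bash sketch that generalizes the same check to every CPU at once (assuming the /proc/stat layout above, with steal in column 9; purely a diagnostic aid, not part of node_exporter):

# Sample the steal column for every CPU once per second and flag any
# decrease; a healthy counter only ever goes up.
declare -A prev
while sleep 1; do
  while read -r cpu u n s idle io irq sirq steal rest; do
    case "$cpu" in cpu[0-9]*) ;; *) continue ;; esac
    if [ -n "${prev[$cpu]}" ] && (( steal < prev[$cpu] )); then
      echo "$cpu: steal went backwards (${prev[$cpu]} -> $steal)"
    fi
    prev[$cpu]=$steal
  done < /proc/stat
done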

@grobie reopened this on Jan 8, 2018
@grobie
Member

grobie commented Jan 8, 2018

We just found a node that experiences the same issue. It's running on EC2 and reports a decreasing steal counter. Applying rate() to a decreasing counter then reports roughly 5B CPU seconds spent in steal mode per second, because rate() treats every decrease as a counter reset and counts the huge new value as an increase from zero. ;-)

$ for i in {1..10}; do cat /proc/stat | grep cpu0; sleep 1; done
cpu0 22536 17835 22383 42574061 31924 0 584 1262648371943 0 0
cpu0 22536 17835 22383 42574160 31924 0 584 1258406803117 0 0
cpu0 22536 17835 22383 42574258 31924 0 584 1253808466819 0 0
cpu0 22536 17835 22383 42574357 31924 0 584 1249011926370 0 0
cpu0 22536 17835 22383 42574455 31924 0 584 1242827956866 0 0
cpu0 22536 17835 22383 42574553 31924 0 584 1237278240645 0 0
cpu0 22536 17835 22383 42574651 31924 0 584 1230975348650 0 0
cpu0 22536 17835 22383 42574749 31924 0 584 1224038203373 0 0
cpu0 22536 17835 22383 42574847 31924 0 584 1218765972963 0 0
cpu0 22536 17835 22383 42574945 31924 0 584 1212938770930 0 0

Kernel: 4.10.15
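
To find other affected instances fleet-wide, one possible sketch is to ask Prometheus for steal counters that appear to reset, which a healthy steal counter should essentially never do (the server address is a placeholder; node_cpu is the pre-0.16 metric name used throughout this thread):

# Count apparent counter resets (decreases) in the steal counter over the
# last hour; genuine reboots also show up here, so treat matches as
# candidates rather than proof.
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=resets(node_cpu{mode="steal"}[1h]) > 0'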

@grobie
Member

grobie commented Jan 8, 2018

According to https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/ this affects guest kernel versions 4.8, 4.9 and 4.10.

It also appears that similar bugs existed earlier in kernel 3.x, but they were fixed in 4.x versions before 4.8.

Note that not all guests handle this condition in the same way: 3.2
guests still get the overflow in /proc/stat, but their scheduler
continues to work as expected. 3.16 guests OTOH go nuts once steal time
overflows and stop accumulating system & user time, while entering an
erratic state where steal time in /proc/stat is decreasing on every
clock tick.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61
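
For quick triage, a rough shell sketch of that version check (assumption: only the 4.8-4.10 guest kernels called out above are of interest; this is a heuristic, not an authoritative list of affected kernels):

# Warn if the running kernel is in the guest-kernel range reported as affected.
ver=$(uname -r)
case "$ver" in
  4.8.*|4.9.*|4.10.*) echo "kernel $ver is in the reported-affected range" ;;
  *) echo "kernel $ver is outside the reported range" ;;
esac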

@grobie changed the title from "Odd huge value for node_cpu:steal skewing metrics" to "Kernel bug skewing node_cpu{mode="steal"} metrics" on Jan 8, 2018
@pecigonzalo
Author

Amazing find. I did not have time to add more information, but we still see the issue; thanks for finding this and reopening.

@discordianfish
Member

Oh nice :)
It's an interesting question how to handle this on our side. At the very least we should keep this issue open for tracking, but we could go as far as logging a warning for known-to-be-bad kernel versions?

@SuperQ
Member

SuperQ commented Jan 8, 2018

@discordianfish We had a chat on IRC and agreed that tracking kernel bugs is largely out of scope for the node_exporter. However, we want to document this case in the README under known issues.

@SuperQ
Member

SuperQ commented Jan 15, 2018

One thing that came up in conversation was the option to split out the steal mode from the normal metric, like we do with guest and guest_nice.

node_cpu_steal_seconds_total

What do people think about this?

@brian-brazil
Contributor

Is there double counting in node_cpu due to steal? That'd be the only reason to split it out in my mind.

@SuperQ
Member

SuperQ commented Feb 21, 2018

I did some investigation, and I came to the conclusion that "steal" is valid as a label. I don't know if there's a lot we can do in the node_exporter, as this is a kernel bug.

@SuperQ
Member

SuperQ commented Jun 12, 2018

@grobie @discordianfish Do you think there is anything we should do about this given it's a kernel bug, not a node_exporter bug?

@yuankunzhang

Hi @SuperQ, I don't think node_exporter should do anything to work around this issue. Since the issue indicates an abnormal situation, it is exactly the expected behavior for node_exporter to catch and report it.

@grobie
Member

grobie commented Jun 12, 2018

While we could choose to emit warnings (log and/or new metric) if run on problematic kernels, I don't think that would be sustainable.

@yuankunzhang

The issue only happens (occasionally) after a VM live migration. We can:

  1. Upgrade to a newer kernel;
  2. Or pay more attention to the status of the VM during migrations (so that we can act quickly after the issue pops up).

Sadly, as far as I know, the only way to fix the /proc/stat counters is to restart the VM.

@discordianfish
Member

Yes, I don't think we can do much here, so let's close this.
