
Kernel bug skewing node_cpu{mode="steal"} metrics #742

Closed
pecigonzalo opened this issue Nov 21, 2017 · 22 comments
@pecigonzalo

pecigonzalo commented Nov 21, 2017

Host operating system: output of uname -a

Debian 9
Linux ip-10-11-1-110 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

prom/node-exporter:v0.15.0

node_exporter command line flags

/usr/bin/docker run \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /:/rootfs:ro \
    -e HOST_HOSTNAME="/rootfs/etc/hostname" \
    -p 19997:9100 \
    --pid="host" \
    --name %i \
    basi/node-exporter:v1.14.0 \
    -collector.procfs "/host/proc" \
    -collector.sysfs /host/sys \
    -collector.textfile.directory /etc/node-exporter/ \
    -collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
    -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

sum(rate(node_cpu{instance="$instance"}[1m])) by (mode) * 100 / count_scalar(node_cpu{mode="user", instance="$instance"})

What did you expect to see?

Roughly 100% total usage when aggregating across all mode values

What did you see instead?

Some level of steal is expected as this is running on AWS, but I believe the value is out of proportion.

Element Value
{mode="guest"} 0
{mode="user"} 0
{mode="iowait"} 0.033333333333366175
{mode="system"} 0
{mode="irq"} 0
{mode="idle"} 94.95555555551417
{mode="steal"} 67456900655.031654
{mode="nice"} 0
{mode="softirq"} 0

Stats

# cat /proc/stat
cpu  10054 454 10504 1435808 6220 0 64 1802 0 0
cpu0 4970 268 5291 717279 3584 0 8 752 0 0
cpu1 5084 186 5212 718529 2636 0 56 1049 0 0
@SuperQ
Member

SuperQ commented Nov 21, 2017

This sounds like you're getting funny data out of /proc/stat at some point. In order to debug this we need the raw samples from your system.

You can query this with the instant API or expression browser.

node_cpu{instance="$instance",mode="steal"}[5m]
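
For reference, a minimal sketch of pulling those raw samples straight out of the Prometheus HTTP query API (the Prometheus address below is a placeholder, and the instance selector still needs to be substituted with the real instance label):

# Ask the instant-query API for the raw samples of the last 5 minutes.
# http://prometheus:9090 is a placeholder for your Prometheus server.
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=node_cpu{instance="1.2.3.4:9100",mode="steal"}[5m]'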

@pecigonzalo
Author

Here:

Element Value
node_cpu{cpu="cpu1",instance="$instance",instance_ip="10.10.1.232",job="node-exporter",mode="steal"} 13823647282.09 1511271040.93112768804791.68 1511271055.93111656879505.85 1511271070.93110569135126.41 1511271085.9319520238760.52 1511271100.9318489180768.2 1511271115.9317448212568.35 1511271130.9316378702970.78 1511271145.9315347644978.46 1511271160.9314318965435.95 1511271175.9313243113305.56 1511271190.9312174396524.6 1511271205.9311105283335.33 1511271220.93128241980.03 1511271235.93117391998573.92 1511271250.93116297118845.05 1511271265.93115237519455.02 1511271280.93114185451822.72 1511271295.93113121491941.37 1511271310.93112064667409.46 1511271325.931
node_cpu{cpu="cpu0",instance="$instance",instance_ip="10.10.1.232",job="node-exporter",mode="steal"} 14983696025.52 1511271040.93113903879812.11 1511271055.93112808207266.65 1511271070.93111754950409.44 1511271085.93110629547241.36 1511271100.9319505729706.48 1511271115.9318397768503.66 1511271130.9317265626394.45 1511271145.9316177089198.41 1511271160.9315109561642.35 1511271175.9314041241269.69 1511271190.9312933676475.17 1511271205.9311838400338 1511271220.931755016449.89 1511271235.93118087060379.65 1511271250.93116983063259.85 1511271265.93115901661413.24 1511271280.93114786564860.99 1511271295.93113680189291.38 1511271310.93112567867597.25 1511271325.931

Although, as you can see from my /proc/stat output, the values look normal. Or do you mean at some point in the past?

@SuperQ
Member

SuperQ commented Nov 21, 2017

It looks like the formatting got a little messed up, but after cleaning it up you can see that what is being recorded is completely different from what /proc/stat says. Something is getting corrupted.

13823647282.09 1511271040.931
12768804791.68 1511271055.931
11656879505.85 1511271070.931
10569135126.41 1511271085.931
9520238760.52 1511271100.931
8489180768.2 1511271115.931
7448212568.35 1511271130.931
6378702970.78 1511271145.931
5347644978.46 1511271160.931
4318965435.95 1511271175.931
3243113305.56 1511271190.931
2174396524.6 1511271205.931
1105283335.33 1511271220.931
28241980.03 1511271235.931
17391998573.92 1511271250.931
16297118845.05 1511271265.931
15237519455.02 1511271280.931
14185451822.72 1511271295.931
13121491941.37 1511271310.931
12064667409.46 1511271325.931

What do you get if you manually curl 'http://$instance:9100/metrics' | grep node_cpu?

@pecigonzalo
Author

This is in docker, so using the mapped port:

> curl 'http://localhost:19997/metrics' | grep node_cpu
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="guest"} 0
node_cpu{cpu="cpu0",mode="idle"} 998557.46
node_cpu{cpu="cpu0",mode="iowait"} 492.78
node_cpu{cpu="cpu0",mode="irq"} 0
node_cpu{cpu="cpu0",mode="nice"} 6.02
node_cpu{cpu="cpu0",mode="softirq"} 1.91
node_cpu{cpu="cpu0",mode="steal"} 1.208129581039e+10
node_cpu{cpu="cpu0",mode="system"} 661.86
node_cpu{cpu="cpu0",mode="user"} 798.07
node_cpu{cpu="cpu1",mode="guest"} 0
node_cpu{cpu="cpu1",mode="idle"} 998280.19
node_cpu{cpu="cpu1",mode="iowait"} 312.78
node_cpu{cpu="cpu1",mode="irq"} 0
node_cpu{cpu="cpu1",mode="nice"} 1.84
node_cpu{cpu="cpu1",mode="softirq"} 7.64
node_cpu{cpu="cpu1",mode="steal"} 6.6865899412e+08
node_cpu{cpu="cpu1",mode="system"} 670.77
node_cpu{cpu="cpu1",mode="user"} 807.26

@pecigonzalo
Author

pecigonzalo commented Nov 21, 2017

OK, I found a major problem with my report; I apologize for the confusion.
The config for the current process is actually:

/usr/bin/docker run \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /:/rootfs:ro \
    -e HOST_HOSTNAME="/rootfs/etc/hostname" \
    -p 19997:9100 \
    --pid="host" \
    --name %i \
    basi/node-exporter:v1.14.0 \
    -collector.procfs "/host/proc" \
    -collector.sysfs /host/sys \
    -collector.textfile.directory /etc/node-exporter/ \
    -collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
    -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"

I took that information from my config VCS (which references 0.15) rather than from the environment that is actually running, which is on 0.14.

PS: Made an edit to my first report, to avoid confusion.

@SuperQ
Member

SuperQ commented Nov 21, 2017

Ahh, great, that makes a lot more sense. 😃

@SuperQ closed this as completed on Nov 21, 2017
@pecigonzalo
Author

Sorry @SuperQ, the issue is still relevant to 0.14; I just pasted the wrong command for what I was actually running on the server.
I'll deploy 0.15 and test before we possibly reopen this, to make sure it was not somehow fixed in 0.15.

@SuperQ
Member

SuperQ commented Nov 23, 2017

OK. I suspect there's some kind of corruption inside Docker that is messing up reading the values from /proc/stat. The only way to really know what's going on is to run a debug binary that prints the raw data.

@wdoekes

wdoekes commented Dec 1, 2017

Not related to node_exporter, but I'm seeing something similar with steal time on a (cloudvps) node.

Idle time (the per-second delta should be at most ~100 ticks per CPU):

# num=-1; while true; do new=$(awk '/^cpu2 /{print $5}' /proc/stat); \
    test $num -ne -1 && echo $((new-num)); num=$new; sleep 1; done 
99
95
56
19
95
93
92
95
...

^-- looks okay

Steal time:

# num=-1; while true; do new=$(awk '/^cpu2 /{print $9}' /proc/stat); \
    test $num -ne -1 && echo $((new-num)); num=$new; sleep 1; done 
-832457433
-396408302
-753175773
-515330792
-792816603
-951379924
-832457433
-554971622
...

^-- off the charts

# uname -a
Linux hostname 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu xenial system, with dmidecode:

System Information
        Manufacturer: OpenStack Foundation
        Product Name: OpenStack Nova
        Version: 2014.1.3

In our case, we noticed this because Zabbix reports 100% steal time continuously, while other tools like top and vmstat only report some steal (when the diff is > 0).

Interestingly, of the 4 CPUs, CPU0 appears to behave fine, with only small steal-time increments.

I'm curious to know what you see on your systems.
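
A minimal bash sketch that generalizes the same check to every CPU at once (assuming the /proc/stat layout above, with steal in column 9; purely a diagnostic aid, not part of node_exporter):

# Sample the steal column for every CPU once per second and flag any
# decrease; a healthy counter only ever goes up.
declare -A prev
while sleep 1; do
  while read -r cpu u n s idle io irq sirq steal rest; do
    case "$cpu" in cpu[0-9]*) ;; *) continue ;; esac
    if [ -n "${prev[$cpu]}" ] && (( steal < prev[$cpu] )); then
      echo "$cpu: steal went backwards (${prev[$cpu]} -> $steal)"
    fi
    prev[$cpu]=$steal
  done < /proc/stat
done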

@grobie reopened this on Jan 8, 2018
@grobie
Member

grobie commented Jan 8, 2018

We just found a node that experiences the same issue. It's running on EC2 and reports a decreasing steal counter. Applying rate() to a decreasing counter then reports roughly 5B CPU seconds spent in steal mode per second, because rate() treats every decrease as a counter reset and counts the huge new value as an increase from zero. ;-)

$ for i in {1..10}; do cat /proc/stat | grep cpu0; sleep 1; done
cpu0 22536 17835 22383 42574061 31924 0 584 1262648371943 0 0
cpu0 22536 17835 22383 42574160 31924 0 584 1258406803117 0 0
cpu0 22536 17835 22383 42574258 31924 0 584 1253808466819 0 0
cpu0 22536 17835 22383 42574357 31924 0 584 1249011926370 0 0
cpu0 22536 17835 22383 42574455 31924 0 584 1242827956866 0 0
cpu0 22536 17835 22383 42574553 31924 0 584 1237278240645 0 0
cpu0 22536 17835 22383 42574651 31924 0 584 1230975348650 0 0
cpu0 22536 17835 22383 42574749 31924 0 584 1224038203373 0 0
cpu0 22536 17835 22383 42574847 31924 0 584 1218765972963 0 0
cpu0 22536 17835 22383 42574945 31924 0 584 1212938770930 0 0

Kernel: 4.10.15
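
To find other affected instances fleet-wide, one possible sketch is to ask Prometheus for steal counters that appear to reset, which a healthy steal counter should essentially never do (the server address is a placeholder; node_cpu is the pre-0.16 metric name used throughout this thread):

# Count apparent counter resets (decreases) in the steal counter over the
# last hour; genuine reboots also show up here, so treat matches as
# candidates rather than proof.
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=resets(node_cpu{mode="steal"}[1h]) > 0'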

@grobie
Member

grobie commented Jan 8, 2018

According to https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/ this affects guest kernel versions 4.8, 4.9 and 4.10.

It also appears that similar bugs existed earlier in kernel 3.x, but they were fixed in 4.x versions before 4.8.

Note that not all guests handle this condition in the same way: 3.2
guests still get the overflow in /proc/stat, but their scheduler
continues to work as expected. 3.16 guests OTOH go nuts once steal time
overflows and stop accumulating system & user time, while entering an
erratic state where steal time in /proc/stat is decreasing on every
clock tick.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61
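
For quick triage, a rough shell sketch of that version check (assumption: only the 4.8-4.10 guest kernels called out above are of interest; this is a heuristic, not an authoritative list of affected kernels):

# Warn if the running kernel is in the guest-kernel range reported as affected.
ver=$(uname -r)
case "$ver" in
  4.8.*|4.9.*|4.10.*) echo "kernel $ver is in the reported-affected range" ;;
  *) echo "kernel $ver is outside the reported range" ;;
esac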

@grobie changed the title from "Odd huge value for node_cpu:steal skewing metrics" to "Kernel bug skewing node_cpu{mode="steal"} metrics" on Jan 8, 2018
@pecigonzalo
Author

Amazing find. I did not have time to add more information, but we still see the issue; thanks for finding this and reopening.

@discordianfish
Member

Oh nice :)
It's an interesting question how to handle this on our side. At the very least we should keep this issue open for tracking, but we could go as far as logging a warning for known-to-be-bad kernel versions?

@SuperQ
Member

SuperQ commented Jan 8, 2018

@discordianfish We had a chat on IRC and agreed that tracking kernel bugs is largely out of scope for the node_exporter. However, we want to document this case in the README under known issues.

@SuperQ
Member

SuperQ commented Jan 15, 2018

One thing that came up in conversation was the option to split out the steal mode from the normal metric, like we do with guest and guest_nice.

node_cpu_steal_seconds_total

What do people think about this?

@brian-brazil
Contributor

Is there double counting in node_cpu due to steal? That'd be the only reason to split it out in my mind.

@SuperQ
Member

SuperQ commented Feb 21, 2018

I did some investigation, and I came to the conclusion that "steal" is valid as a label. I don't know if there's a lot we can do in the node_exporter, as this is a kernel bug.

@SuperQ
Member

SuperQ commented Jun 12, 2018

@grobie @discordianfish Do you think there is anything we should do about this given it's a kernel bug, not a node_exporter bug?

@yuankunzhang

Hi @SuperQ, I don't think node_exporter should do anything to work around this issue. Since the issue indicates an abnormal situation, it is exactly the expected behavior for node_exporter to catch and report it.

@grobie
Member

grobie commented Jun 12, 2018

While we could choose to emit warnings (log and/or new metric) if run on problematic kernels, I don't think that would be sustainable.

@yuankunzhang

The issue only happens (occasionally) after a VM live migration. We can:

  1. Upgrade to a newer kernel;
  2. Or pay more attention to the status of the VM during migrations (so that we can act quickly after the issue pops up).

Sadly, as far as I know, the only way to fix the /proc/stat counters is to restart the VM.

@discordianfish
Member

Yes, I don't think we can do much here, so let's close this.
