Kernel bug skewing node_cpu{mode="steal"} metrics #742
This sounds like you're getting funny data out of You can query this with the instant API or expression browser.
Here:
Although, as you can see, the values in my /proc/stat look normal. Or do you mean at some point in the past? |
It looks like the formatting got a little messed up, but after cleaning it up you can see that what is being recorded is completely different from what
What do you get if you manually |
This is in docker, so using the mapped port:
Ok I found a major problem with my report, I apologize for the confusion.
I took the version from my config VCS (0.15) rather than from the environment that is actually running (0.14). PS: I edited my first report to avoid confusion. |
Ahh, great, that makes a lot more sense. 😃 |
Sorry SuperQ, the issue is still relevant to 0.14; I just pasted the wrong 'command I was running on the server'. |
Ok. I suspect there's some kind of corruption inside docker that is messing up reading the values from |
Not related to node_exporter, but I'm seeing something similar with steal time on a (cloudvps) node. Idle time (should be at most 100):
^-- looks okay Steal time:
^-- off the charts
Ubuntu xenial system, with dmidecode:
In our case, this was observed when Zabbix reported 100% steal time continuously, while other tools like top and vmstat only reported some steal (when diff > 0). Interestingly, of the 4 CPUs, CPU0 appears to behave OK, with only small steal-time increments. I'm curious to know what you see on your systems. |
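A per-CPU comparison like the one described above can be sketched in a few lines. This is a minimal sketch, not something taken from Zabbix, top, or vmstat: the function names are hypothetical, and it assumes the usual /proc/stat layout where steal is the eighth value after the CPU name and counters tick at 100 per second (USER_HZ), which may differ on some systems.

```python
def parse_steal(proc_stat_text):
    """Return {cpu_name: steal_ticks} for the per-CPU lines of /proc/stat."""
    steal = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines: cpuN user nice system idle iowait irq softirq steal ...
        # The aggregate "cpu" line is skipped so each core is judged on its own.
        if len(fields) >= 9 and fields[0].startswith("cpu") and fields[0] != "cpu":
            steal[fields[0]] = int(fields[8])
    return steal


def steal_deltas(before_text, after_text):
    """Difference in steal ticks per CPU between two /proc/stat snapshots."""
    before = parse_steal(before_text)
    after = parse_steal(after_text)
    return {cpu: after[cpu] - before[cpu] for cpu in after if cpu in before}


def suspicious_cpus(deltas, interval_seconds, ticks_per_second=100):
    """CPUs whose steal went backwards or grew faster than wall-clock time.

    ticks_per_second=100 is an assumption (the common USER_HZ value).
    """
    limit = interval_seconds * ticks_per_second
    return sorted(cpu for cpu, d in deltas.items() if d < 0 or d > limit)
```

To try it, read /proc/stat twice a known number of seconds apart and pass both snapshots to `steal_deltas`, then pass the result and the interval to `suspicious_cpus`. A negative delta on some cores but not others would match the pattern described in this comment.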
We just found a node which experiences the same issue. It's running on EC2 and reports a decreasing steal counter. Applying rate to a decreasing counter then results in a report of about 5B CPU seconds spent in steal mode per second ;-)
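Why a decreasing counter produces such a large rate can be illustrated with a simplified model of reset handling. This is only a sketch under stated assumptions: it treats any drop as a counter reset to zero, which is how Prometheus documents `rate()` as behaving, but it leaves out the extrapolation Prometheus applies at window boundaries. The function names are hypothetical.

```python
def naive_increase(samples):
    """Last value minus first value, with no reset handling."""
    return samples[-1] - samples[0]


def reset_aware_increase(samples):
    """Total increase over the samples, treating any drop as a reset to zero."""
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # A drop is read as "the counter restarted from 0", so the
            # entire current value is counted as new increase.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


def per_second_rate(samples, window_seconds):
    """Reset-aware increase divided by the window length in seconds."""
    return reset_aware_increase(samples) / window_seconds
```

With a large counter that drops only slightly, the reset-aware increase is close to the counter's whole absolute value, so dividing by a one-minute window can yield billions of "seconds per second". A well-behaved counter never triggers that branch.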
Kernel: 4.10.15 |
According to https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/ this affects guest kernel versions 4.8, 4.9 and 4.10. It also appears that similar bugs existed earlier in kernel 3.x, but were fixed in 4.x releases before 4.8.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61 |
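For a quick one-off check of whether a host falls in the range named above, something like the following could be used. It is a rough sketch: the helper names are hypothetical, the affected series are taken only from this comment (4.8, 4.9, 4.10), and distribution kernels may carry backported fixes, so a match is a hint rather than proof.

```python
import re

# Kernel series reported as affected in this thread; an assumption, not an
# authoritative list.
AFFECTED_SERIES = {(4, 8), (4, 9), (4, 10)}


def kernel_series(release):
    """Extract (major, minor) from a release string such as '4.9.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def may_be_affected(release):
    """True if the release string belongs to one of the listed series."""
    return kernel_series(release) in AFFECTED_SERIES
```

On a running system the release string can be obtained from `platform.release()` in the Python standard library, which returns the same value as `uname -r`.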
Amazing find, I did not have time to correctly add more information but we still see the issue, thanks for reopening/finding this. |
Oh nice :) |
@discordianfish We had a chat on IRC and largely agreed that tracking kernel bugs is out of scope for the node_exporter. However, we want to document this case in the README under known issues. |
One thing that came up in conversation was the option to split out the
What do people think about this? |
Is there double counting in node_cpu due to steal? That'd be the only reason to split it out in my mind. |
I did some investigation, and I came to the conclusion that "steal" is valid as a label. I don't know if there's a lot we can do in the |
@grobie @discordianfish Do you think there is anything we should do about this given it's a kernel bug, not a node_exporter bug? |
Hi @SuperQ , I don't think node_exporter should do anything to get around this issue. Since the issue indicates an abnormal situation, it is exactly the expected behavior for node_exporter to catch and report it. |
While we could choose to emit warnings (log and/or new metric) if run on problematic kernels, I don't think that would be sustainable. |
The issue only happens (occasionally) after a VM live migration. We can:
Sadly, as far as I know, the only way to fix the |
Yes I think we can't do much here, so let's close this. |
Host operating system: output of uname -a
Debian 9
Linux ip-10-11-1-110 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
prom/node-exporter:v0.15.0
node_exporter command line flags
Are you running node_exporter in Docker?
Yes
What did you do that produced an error?
sum(rate(node_cpu{instance="$instance"}[1m])) by (mode) * 100 / count_scalar(node_cpu{mode="user", instance="$instance"})
What did you expect to see?
Around 100% usage if we aggregate all mode values.
What did you see instead?
Some level of steal is expected as this is running on AWS, but I believe the value is out of proportion.
Stats