Query CPU usage per process in percent #494

Closed
scorpiock opened this issue Mar 31, 2020 · 20 comments

@scorpiock

Hello,

I am configuring Grafana to get CPU utilization by a process with wmi_process_cpu_time_total but it doesn't give the accurate utilization percentage.

I tried many combinations with no luck. Wondering if anyone has a working query or some known issues?

Tried the following:

sum by (process) (rate(wmi_process_cpu_time_total{instance=~"$server.*", process !~"Idle"}[5m]))
avg by (process_id) (rate(wmi_process_cpu_time_total{instance=~"$server", process!~"Idle"}[5m])) * 100

The servers have multi-core CPUs. Am I doing anything wrong?

@carlpett
Collaborator

carlpett commented Apr 1, 2020

Hey @scorpiock,
How is this inaccuracy manifesting itself?

@scorpiock
Author

Hello @carlpett

I looked at the CPU utilization on the server and found that the numbers don't match at all. The query result is far too low. When I compared, one of the processes was taking around 15% of the CPU, but the sum query above showed around 1%.

Is there any query you advise to test?

@carlpett
Collaborator

carlpett commented Apr 4, 2020

The query looks correct, so that should be fine. Are you comparing with Task Manager, or some other tool?

@scorpiock
Author

I compared with task manager and process explorer, both.

@Mario-Hofstaetter
Contributor

Mario-Hofstaetter commented Apr 5, 2020

This has been updated to use the windows_ prefix of current releases. Use wmi_ if you run an older version.

@scorpiock I believe I have a working query. It was not as easy as I first thought, but I will need this myself soon.

This gives me accurate (as far as I can tell) CPU usage in percent per process ✔️:

100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total{instance="$instance", process!="Idle"}[5m]))
 / on(instance) group_left sum by(instance) (rate(windows_cpu_time_total{instance="$instance"}[5m]))

(Edit 2020-11-18: Added filter for useless Idle process.)

I am still learning Prometheus, but I will try to explain what the heck I did there.

1. Ratio

You need to calculate the ratio of CPU time per process to total CPU time. That is

100 * windows_process_cpu_time_total / windows_cpu_time_total

2. windows_cpu_time_total vs windows_process_cpu_time_total

It is important (for my scenario) to use windows_cpu_time_total instead of sum(windows_process_cpu_time_total) for the total load.

I am whitelisting (--collector.process.whitelist) a few processes to monitor to keep metrics count low. So for me just adding up windows_process_cpu_time_total would exclude many other processes on the machine I do not explicitly monitor.

Using windows_cpu_time_total{mode!="idle"} is also a bad idea: if the server is under light load, monitored processes would show up as a big percentage of the small total load.
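To make point 2 concrete, here is a small Python sketch of the arithmetic. All numbers are made up for illustration (a 4-core machine, one process using one full core, 0.2 CPU-seconds per second of other busy time, and a second whitelisted process at 0.1). It also assumes that the rate of windows_cpu_time_total summed over all cores and modes comes out at roughly one CPU-second per second per core.

```python
def percent(process_rate, total_rate):
    """CPU usage of one process as a percentage of the chosen total."""
    return 100.0 * process_rate / total_rate


CORES = 4            # hypothetical machine
process_rate = 1.0   # hypothetical: one process keeps one core busy
other_busy = 0.2     # hypothetical: all other non-idle CPU time
other_listed = 0.1   # hypothetical: a second whitelisted process

# Denominator 1: every core and every mode, idle included (assumed ~1 s/s per core).
all_modes_total = float(CORES)
# Denominator 2: idle excluded, so the total shrinks under light load.
non_idle_total = process_rate + other_busy
# Denominator 3: only the whitelisted processes added up.
whitelisted_total = process_rate + other_listed

print(round(percent(process_rate, all_modes_total), 1))    # 25.0
print(round(percent(process_rate, non_idle_total), 1))     # 83.3
print(round(percent(process_rate, whitelisted_total), 1))  # 90.9
```

Only the first denominator gives the figure a tool like Task Manager would show for this hypothetical process; the other two inflate it.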

3. many-to-one

The division is a many-to-one match (if you have more than one process name). For this, Prometheus needs matching labels on both sides.

In my case that is the label instance, selected with the on(instance) statement (think SQL JOIN ON).

In my case, the many vector is on the left side, so I must use the group_left statement.

Even when there is only one instance value, the sum by(instance) is needed so that the label is not dropped; otherwise the join does not work. The same is true for instance in sum by(instance, process, process_id): if it is removed, the query no longer returns data.
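The matching described in point 3 can be imitated in plain Python. This is only a rough model of the idea, not how Prometheus implements it: the helpers sum_by and divide_group_left are hypothetical names, and the sample values and label sets (core, mode) are invented for the example.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum values, keeping only the labels named in `keep` (like sum by(...))."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        out[key] += value
    return dict(out)


def divide_group_left(many, one, on):
    """Divide each 'many' series by the single 'one' series sharing the `on` labels."""
    one_index = {tuple(dict(k).get(name) for name in on): v for k, v in one.items()}
    result = {}
    for key, value in many.items():
        match = tuple(dict(key).get(name) for name in on)
        if match in one_index:  # series without a partner are dropped
            result[key] = 100.0 * value / one_index[match]
    return result


# Hypothetical per-second rates for two processes on one 4-core server.
process_rates = [
    ({"instance": "srv1", "process": "prime95", "process_id": "100", "mode": "user"}, 0.75),
    ({"instance": "srv1", "process": "prime95", "process_id": "100", "mode": "privileged"}, 0.25),
    ({"instance": "srv1", "process": "sqlservr", "process_id": "200", "mode": "user"}, 0.5),
]
cpu_rates = [
    ({"instance": "srv1", "core": str(core), "mode": mode}, 0.5)
    for core in range(4)
    for mode in ("idle", "user")
]

per_process = sum_by(process_rates, ["instance", "process", "process_id"])
total = sum_by(cpu_rates, ["instance"])
percent = divide_group_left(per_process, total, on=["instance"])

# Dropping the instance label on the right side leaves nothing to join on.
broken = divide_group_left(per_process, sum_by(cpu_rates, []), on=["instance"])

for key, value in percent.items():
    print(dict(key)["process"], value)
print(broken)
```

With these invented numbers the sketch reports 25.0 for prime95 and 12.5 for sqlservr, while the variant without the instance label returns an empty result, mirroring the "no data" behaviour described above.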

Wrapping up

I tested this on a 4-Core CPU using prime95 to generate load. First with two worker threads, then running one.

[Screenshot: prime95-2threads]

The values matched those reported in Task Manager / Process Explorer (~44% and ~25% load), so I believe the query is working. If this can be achieved more easily I'd be glad to know how.

Wrong trails

If one does not consider the points in 2) , stuff like this happens.

Using ❌ :

topk(5, 100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total [5m])) / on(instance) group_left sum by(instance) (rate(windows_cpu_time_total{mode!="idle"}[5m])))

[Screenshot: wrong1]

Here idle time is ignored, so the prime95 CPU usage appears higher than it really was, and the sqlservr CPU usage seems to increase when the total system load dropped, although it actually stayed the same.

This is even worse ❌ :

topk(5, 100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total{instance="$instance"}[5m])) / on(instance) group_left sum by(instance) (rate(windows_process_cpu_time_total{instance="$instance"}[5m])))

[Screenshot: wrong2]

This only shows the load of the monitored processes relative to each other.

@scorpiock
Author

Thank you so much for your help. It works like charm.

@Mario-Hofstaetter
Contributor

@carlpett Could anybody please edit the title of this issue? Surely other people will google for this task.
Something like "Query CPU usage per process in percent" or so, and add the label question.

I will make a PR to add the query above to the process collector docs, since it looks like a success 😃

scorpiock changed the title from "CPU Utilization by Process is not accurate" to "Query CPU usage per process in percent" on Apr 7, 2020

@carlpett
Collaborator

carlpett commented Apr 9, 2020

@Mario-Hofstaetter Nice detective work! I'm a bit surprised that the query needed to be that complicated, but your reasoning makes sense.
I'll update the topic!

@zlepper
Contributor

zlepper commented Jun 30, 2020

Sorry to necro a closed question like this, but how do you do that query now that wmi is no longer a thing in the collector? I can't seem to find the corresponding values?

(I'm also a beginner at both Prometheus and Grafana currently, so i'm sorry if this is a stupid question)

@carlpett
Collaborator

@zlepper You can simply replace wmi_ with windows_ in the metric names and it should work.

@zlepper
Contributor

zlepper commented Jun 30, 2020

And indeed it works. I was an idiot and had forgotten to enable the process collector. Thank you very much! :D

@zakiharis

Sorry to necro this question again, and sorry to tag you, @Mario-Hofstaetter.

I am wondering whether something is wrong with my metrics.

[Screenshot: Screenshot_20210114_182245]

The CPU usage reached 22%, but when I look at the CPU load per process and sum up the percentages, the total doesn't even reach 22%.

[Screenshot: Screenshot_20210114_182450]

I'm using the exact query from above.

@Mario-Hofstaetter
Contributor

Mario-Hofstaetter commented Jan 14, 2021

@zakiharis Can you share your PromQL queries (what is the query for the left-hand chart?)
or the whole Grafana dashboard?

Edit: Can you also copy and paste the command-line parameters of windows_exporter.exe from Task Manager?

@zakiharis

@Mario-Hofstaetter
thank you for replying to my question.

left side:

100 - avg(irate(windows_cpu_time_total{hostname=~"$hostname",mode="idle"}[5m]))*100

right side:

100 * sum by(hostname, process, process_id) (rate(windows_process_cpu_time_total{hostname="$hostname", process!="Idle"}[5m]))
 / on(hostname) group_left sum by(hostname) (rate(windows_cpu_time_total{hostname="$hostname"}[5m]))

and below is the params:

"C:\Program Files\windows_exporter\windows_exporter.exe" --log.format logger:eventlog?name=windows_exporter --collectors.enabled cpu,cs,logical_disk,memory,net,os,process,service,system,tcp,textfile --telemetry.addr :9182   

@Mario-Hofstaetter
Contributor

Is $hostname a unique string, or a regular expression?
In the first query you use the =~ operator, in the second it's =. You can check by removing the avg and seeing whether too many time series from different servers are returned.
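As a rough illustration of why the operator matters, the following Python sketch compares an equality match with a fully anchored regex match, which is how PromQL treats =~. The host names are invented, and Python's re module is only a stand-in for the RE2 syntax Prometheus uses.

```python
import re


def matches_equal(value, pattern):
    """Model of the = matcher: exact string comparison."""
    return value == pattern


def matches_regex(value, pattern):
    """Model of the =~ matcher: the regex must match the whole value."""
    return re.fullmatch(pattern, value) is not None


hosts = ["web1", "web10", "db1"]  # hypothetical hostname label values

exact = [h for h in hosts if matches_equal(h, "web1")]
regex = [h for h in hosts if matches_regex(h, "web1.*")]

print(exact)  # ['web1']
print(regex)  # ['web1', 'web10']
```

If a dashboard variable expands to a pattern, the =~ query can pull in series from more servers than the = query, and an avg over them would hide that.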

I will check with one of my systems, but later in the evening (TZ GMT+1)

@zakiharis

@Mario-Hofstaetter

$hostname is a unique string, just like instance.

Removing the avg returns one series per core, all from the same server.

[Screenshot: Screenshot_20210114_214033]

Thank you again for checking

@zakiharis

Hi @Mario-Hofstaetter

Just checking in: did you manage to try this on your system?

@kyleli666

kyleli666 commented Feb 17, 2025

@Mario-Hofstaetter @carlpett @zlepper
Hi, I'm getting values larger than 100% on AWS EC2 with windows_exporter v0.30.4, scrape interval 10s, and this query

100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total{instance="$instance", process!="Idle"}[1m]))
 / on(instance) group_left sum by(instance) (rate(windows_cpu_time_total{instance="$instance"}[1m]))

Even with a [5m] range I sometimes get sudden values higher than 1500%. Apart from these spikes, the values look good to me.
[Image]
What can be the problem here?

@Mario-Hofstaetter
Contributor

@kyleli666
I've been on a sabbatical for some time, so I have no Prometheus at hand at the moment. Some thoughts:

  • If your range window is at least 2.5 × the scrape interval, [1m] should be OK afaik
  • pick one of the faulty values (pid xxxxx) and check the metrics for duplicates / weird labels. There is a sum by(instance, process, process_id).. so if those 3 labels are not unique, it might increase the value. I have not used version v0.30 myself so far.

@kyleli666

kyleli666 commented Feb 18, 2025

@Mario-Hofstaetter, Thanks a lot! After digging into the data source, I found that the windows_process_cpu_time_total metric sometimes drops to 0 or near 0, so when it comes back, its rate is dramatically large. I'm trying to filter out these zeros, but it does not seem easy 🤦.

[Image]

Sometimes the counter just gets smaller, but not 0.

[Image]
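A rough sketch of why a dip in a counter turns into a spike: Prometheus treats any decrease in a counter as a reset and adds the value seen before the drop back in. The Python below is a simplified model of that adjustment only (it leaves out the extrapolation that rate() and increase() also perform), and the sample values are invented.

```python
def increase_with_reset_handling(samples):
    """Simplified counter increase: every drop is treated as a counter reset."""
    correction = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Assume the counter restarted from zero, so add back what was lost.
            correction += prev
    return samples[-1] - samples[0] + correction


healthy = [1000.0, 1001.0, 1002.0, 1003.0]      # hypothetical CPU seconds
dip_to_zero = [1000.0, 1001.0, 0.0, 1003.0]     # one sample reads 0
dip_smaller = [1000.0, 1001.0, 990.0, 1003.0]   # one sample reads lower, not 0

print(increase_with_reset_handling(healthy))      # 3.0
print(increase_with_reset_handling(dip_to_zero))  # 1004.0
print(increase_with_reset_handling(dip_smaller))  # 1004.0
```

In this model a single bad sample makes three seconds of real CPU time look like roughly a thousand, whether the counter dips to 0 or only to a smaller value, which would be consistent with the spikes described above.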
