Query CPU usage per process in percent #494

scorpiock · 2020-03-31T10:45:38Z

Hello,

I am configuring Grafana to get CPU utilization by a process with wmi_process_cpu_time_total but it doesn't give the accurate utilization percentage.

I tried many combinations with no luck. Wondering if anyone has a working query or some known issues?

Tried the following:

sum by (process) (rate(wmi_process_cpu_time_total{instance=~"$server.*", process !~"Idle"}[5m]))
avg by (process_id) (rate(wmi_process_cpu_time_total{instance=~"$server",process!~"Idle"}[5m])) * 100

Servers are having multi-core CPUs. Am I doing anything wrong?


carlpett · 2020-04-01T20:23:26Z

Hey @scorpiock,
How is this inaccuracy manifesting itself?

scorpiock · 2020-04-01T20:30:38Z

> Hey @scorpiock,
> How is this inaccuracy manifesting itself?

Hello @carlpett

I checked the CPU utilization on the server and the numbers don't match at all; the query result is far too low. When I compared, one of the processes was using around 15% of the CPU, but the sum query above showed around 1%.

Is there a query you would advise me to test?

carlpett · 2020-04-04T13:41:19Z

The query looks correct, so that should be fine. Are you comparing with Task Manager, or some other tool?

scorpiock · 2020-04-04T19:11:17Z

I compared with both Task Manager and Process Explorer.

Mario-Hofstaetter · 2020-04-05T00:13:00Z

This has been updated to use the `windows_` prefix of current releases. Use `wmi_` if you run an older version.

@scorpiock I believe I have a working query. It was not as easy as I first thought, but I will need this myself soon.

This gives me accurate (as far as I can tell) CPU usage in percent per process ✔️:

100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total{instance="$instance", process!="Idle"}[5m]))
 / on(instance) group_left sum by(instance) (rate(windows_cpu_time_total{instance="$instance"}[5m]))

(Edit 2020-11-18: Added filter for useless Idle process.)

I am still learning Prometheus, but I will try to explain what the heck I did there.

1. Ratio

You need to calculate the ratio of the CPU time used per process to the total CPU time. That is

100 * windows_process_cpu_time_total / windows_cpu_time_total
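As a worked illustration of that ratio, the sketch below computes a per-process percentage from two scrapes of the counters. The sample numbers and the helper name `cpu_percent` are invented for illustration; only the idea (process CPU-seconds divided by the CPU-seconds of all cores and modes) comes from the explanation above.

```python
def cpu_percent(proc_start, proc_end, total_start, total_end):
    """Percent of all available CPU time one process used between two scrapes."""
    proc_delta = proc_end - proc_start
    total_delta = total_end - total_start
    if total_delta <= 0:
        raise ValueError("total CPU time must increase between scrapes")
    return 100.0 * proc_delta / total_delta


# Hypothetical 4-core machine, scrapes 60 s apart:
# all cores and modes together accumulate 4 * 60 = 240 CPU-seconds.
total_start, total_end = 10_000.0, 10_240.0
# Hypothetical process that burned 60 CPU-seconds (one full core).
proc_start, proc_end = 500.0, 560.0

print(cpu_percent(proc_start, proc_end, total_start, total_end))  # 25.0
```

One fully busy thread on a 4-core machine comes out as 25%, which is the scale Task Manager uses.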

2. `windows_cpu_time_total` vs `windows_process_cpu_time_total`

It is important (for my scenario) to use windows_cpu_time_total instead of sum(windows_process_cpu_time_total) for the total load.

I am whitelisting (--collector.process.whitelist) a few processes to monitor to keep metrics count low. So for me just adding up windows_process_cpu_time_total would exclude many other processes on the machine I do not explicitly monitor.

Also, using windows_cpu_time_total{mode!="idle"} is a bad idea: if the server is under light load, the monitored processes would show up as a big percentage of the small total load.
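To make the effect of the denominator concrete, here is a small arithmetic sketch with made-up numbers (a lightly loaded 4-core machine observed for 60 seconds). The variable names are hypothetical and do not correspond to exporter metrics.

```python
def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


cores, window_seconds = 4, 60.0
all_modes = cores * window_seconds   # 240 CPU-seconds, idle time included

process_busy = 6.0                   # CPU-seconds used by the monitored process
other_busy = 6.0                     # CPU-seconds used by everything else
non_idle = process_busy + other_busy # what a mode!="idle" denominator would see

print(percent(process_busy, all_modes))  # 2.5  -> share of the whole machine
print(percent(process_busy, non_idle))   # 50.0 -> share of the busy time only
```

The same process reads as 2.5% or 50% depending only on whether idle time is part of the total.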

3. many-to-one

The division is a many-to-one match (if you have more than one process name). For this, Prometheus needs matching labels on both sides.

In my case that is the label instance, selected with the on(instance) statement (think SQL JOIN ON).

In my case, the "many" vector is on the left side, so I must use the group_left statement.

Even when there is only one instance value, the sum by(instance) is needed so that the label is not dropped; otherwise the join does not work. The same is true for instance in sum by(instance, process, process_id): if it is removed, the query no longer returns data.
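The matching behaviour can be mimicked in a few lines of Python. This is only a rough model of a many-to-one division joined on a label list, not how Prometheus is implemented; the function name `divide_group_left` and all sample values are hypothetical.

```python
def divide_group_left(many, one, on):
    """Divide each sample in `many` by the single sample in `one` that has
    the same values for the `on` labels. Unmatched samples are dropped."""
    index = {}
    for labels, value in one:
        key = tuple(labels.get(name) for name in on)
        if key in index:
            raise ValueError(f"duplicate series on the 'one' side for {key}")
        index[key] = value
    result = []
    for labels, value in many:
        key = tuple(labels.get(name) for name in on)
        if key in index:
            result.append((labels, value / index[key]))
    return result


per_process = [
    ({"instance": "srv1", "process": "sqlservr", "process_id": "1234"}, 1.0),
    ({"instance": "srv1", "process": "prime95", "process_id": "5678"}, 2.0),
]
total = [({"instance": "srv1"}, 4.0)]   # like sum by(instance)
total_unlabelled = [({}, 4.0)]          # like a plain sum(): label dropped

print([v for _, v in divide_group_left(per_process, total, on=["instance"])])
# [0.25, 0.5]
print(divide_group_left(per_process, total_unlabelled, on=["instance"]))
# []
```

The second call returns nothing because the right-hand side no longer carries an instance value to match on, which mirrors the empty result described above.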

Wrapping up

I tested this on a 4-Core CPU using prime95 to generate load. First with two worker threads, then running one.

The values matched those reported in Task Manager / Process Explorer (~44% and ~25% load), so I believe the query is working. If this can be achieved more easily, I'd be glad to know how.

Wrong trails

If one does not consider the points in 2), stuff like this happens.

Using ❌ :

topk(5, 100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total [5m])) / on(instance) group_left sum by(instance) (rate(windows_cpu_time_total{mode!="idle"}[5m])))

Here idle time is ignored, so prime95 CPU usage appears higher than it really was, and sqlservr CPU usage seems to increase when the total system load dropped, although it actually stayed the same.

This is even worse ❌ :

topk(5, 100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total{instance="$instance"}[5m])) / on(instance) group_left sum by(instance) (rate(windows_process_cpu_time_total{instance="$instance"}[5m])))

This only shows the load of monitored processes relative to each other.
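A short sketch of why shares relative to the monitored processes mislead: the values always add up to 100%, and a process with constant usage appears to grow when its neighbour slows down. The numbers and the helper `relative_shares` are invented for illustration.

```python
def relative_shares(cpu_seconds):
    """Percentage of each process relative to the sum of all listed processes."""
    total = sum(cpu_seconds.values())
    return {name: 100.0 * used / total for name, used in cpu_seconds.items()}


# Hypothetical CPU-seconds per 60 s window; sqlservr stays constant at 6.
heavy = relative_shares({"prime95": 30.0, "sqlservr": 6.0})
light = relative_shares({"prime95": 10.0, "sqlservr": 6.0})

print(round(heavy["sqlservr"], 1))  # 16.7
print(light["sqlservr"])            # 37.5
```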

scorpiock · 2020-04-07T10:24:07Z

Thank you so much for your help. It works like a charm.

Mario-Hofstaetter · 2020-04-07T10:33:41Z

@carlpett Could somebody please edit the title of this issue? Surely other people will google for this task.
Something like "Query CPU usage per process in percent" or so, and add the question label.

I will make a PR to add the query above to the process collector docs, since it looks like a success 😃

carlpett · 2020-04-09T12:06:09Z

@Mario-Hofstaetter Nice detective work! I'm a bit surprised that the query needed to be that complicated, but your reasoning makes sense.
I'll update the topic!

zlepper · 2020-06-30T08:34:40Z

Sorry to necro a closed question like this, but how do you do that query now that wmi is no longer a thing in the collector? I can't seem to find the corresponding values?

(I'm also a beginner at both Prometheus and Grafana, so I'm sorry if this is a stupid question.)

carlpett · 2020-06-30T08:36:13Z

@zlepper You can simply replace wmi_ with windows_ in the metric names and it should work.

zlepper · 2020-06-30T08:51:36Z

And indeed it does. I was an idiot and had forgotten to enable the process collector. Thank you very much! :D

zakiharis · 2021-01-14T10:26:47Z

Sorry to necro this question again, and sorry to tag you, @Mario-Hofstaetter.

I'm just wondering whether something is wrong with my metrics.

The CPU usage reached 22%, but when I look at the CPU load per process and sum up the percentages, it doesn't even reach 22%.

I'm using the exact query from above.

Mario-Hofstaetter · 2021-01-14T13:19:25Z

@zakiharis Can you share your PromQL queries (what is the query for the left-hand chart?)
or the whole Grafana dashboard?

Edit: Also, can you copy and paste the command-line parameters of windows_exporter.exe from Task Manager?

zakiharis · 2021-01-14T13:30:41Z

@Mario-Hofstaetter
thank you for replying to my question.

left side:

100 - avg(irate(windows_cpu_time_total{hostname=~"$hostname",mode="idle"}[5m]))*100

right side:

100 * sum by(hostname, process, process_id) (rate(windows_process_cpu_time_total{hostname="$hostname", process!="Idle"}[5m]))
 / on(hostname) group_left sum by(hostname) (rate(windows_cpu_time_total{hostname="$hostname"}[5m]))

and below is the params:

"C:\Program Files\windows_exporter\windows_exporter.exe" --log.format logger:eventlog?name=windows_exporter --collectors.enabled cpu,cs,logical_disk,memory,net,os,process,service,system,tcp,textfile --telemetry.addr :9182

Mario-Hofstaetter · 2021-01-14T13:34:29Z

Is $hostname a unique string or a regular expression?
In the first query you use the =~ operator, in the second it's =. You can check by removing the avg and seeing whether too many time series from different servers are returned.

I will check with one of my systems, but later in the evening (TZ GMT+1)
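For readers unsure about the difference between the two matchers, here is a rough Python analogy. Prometheus regex matchers are fully anchored, which `re.fullmatch` approximates for simple patterns (Prometheus uses RE2 syntax, so this is not an exact equivalent). The hostnames and helper names are hypothetical.

```python
import re


def matches_equal(value, selector):
    """Analogue of label="selector": exact string comparison."""
    return value == selector


def matches_regex(value, pattern):
    """Analogue of label=~"pattern": the whole value must match."""
    return re.fullmatch(pattern, value) is not None


hostnames = ["web01", "web02", "db01"]

print([h for h in hostnames if matches_regex(h, "web.*")])  # ['web01', 'web02']
print([h for h in hostnames if matches_equal(h, "web.*")])  # []
print([h for h in hostnames if matches_regex(h, "web01")])  # ['web01']
```

If a dashboard variable expands to a pattern, the `=~` panel can aggregate several servers while the `=` panel matches none or one, so the two panels would no longer describe the same machine.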

zakiharis · 2021-01-14T13:41:33Z

@Mario-Hofstaetter

$hostname is a unique string just like instance

With the avg removed, it returns one series per core, all from the same server.

Thank you again for checking

zakiharis · 2021-01-20T06:42:08Z

Hi @Mario-Hofstaetter

I just wanted to check in: did you manage to try it on your system?

kyleli666 · 2025-02-17T10:53:35Z

@Mario-Hofstaetter @carlpett @zlepper
Hi, I'm getting values larger than 100% on AWS EC2 with windows_exporter v0.30.4, scrape interval 10s, and this query

100 * sum by(instance, process, process_id) (rate(windows_process_cpu_time_total{instance="$instance", process!="Idle"}[1m]))
 / on(instance) group_left sum by(instance) (rate(windows_cpu_time_total{instance="$instance"}[1m]))

Even with a [5m] duration I get occasional sudden values higher than 1500%. Apart from these spikes, the values look good to me.

What can be the problem here?

Mario-Hofstaetter · 2025-02-17T22:36:26Z

> What can be the problem here?

@kyleli666
I've been on a sabbatical for some time, so I have no Prometheus on hand at the moment. Some thoughts:

- If the range window is at least 2.5 × your scrape interval, [1m] should be OK afaik.
- Pick one of the faulty values (pid xxxxx) and check the metrics for duplicates / weird labels. There is a sum by(instance, process, process_id), so if those 3 labels are not unique, it might increase the value. I have not used version v0.30 myself so far.
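One way to carry out the duplicate check is to export the label sets of the series in question and count how many share the same grouping labels. The sketch below does that on invented data; the `job` label and every value are hypothetical, and `series_per_group` is an illustrative helper, not part of any Prometheus client.

```python
from collections import Counter


def series_per_group(series, by):
    """Count how many series share each combination of the `by` labels."""
    return Counter(tuple(labels.get(name, "") for name in by) for labels in series)


# Hypothetical label sets, e.g. copied from a query result.
series = [
    {"instance": "srv1", "process": "sqlservr", "process_id": "1234", "job": "windows"},
    {"instance": "srv1", "process": "sqlservr", "process_id": "1234", "job": "windows-copy"},
    {"instance": "srv1", "process": "prime95", "process_id": "5678", "job": "windows"},
]

counts = series_per_group(series, by=["instance", "process", "process_id"])
suspicious = {key: n for key, n in counts.items() if n > 1}
print(suspicious)  # {('srv1', 'sqlservr', '1234'): 2}
```

A count above one is only a hint: it is harmless when the extra label is one you intend to sum over, and a problem when it means the same process is counted twice.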

kyleli666 · 2025-02-18T05:03:51Z

> I've been on a sabbatical for some time so no prometheus on hand atm. Some thoughts:

@Mario-Hofstaetter, thanks a lot! After digging into the data source, I found that the windows_process_cpu_time_total metric sometimes drops to 0 or near 0, so when it suddenly comes back, its rate is dramatically large. I'm trying to filter out these zeros, but it does not seem easy 🤦.

Sometimes the counter just gets smaller without dropping to 0.
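The size of those spikes follows from how rate() handles counters: any decrease is treated as a counter reset, so the value after the drop is counted as fresh increase. The sketch below is a simplified model of that reset handling (it ignores the extrapolation rate() also performs) with invented sample values.

```python
def increase_with_resets(samples):
    """Total increase over consecutive counter samples, treating every drop
    as a reset to zero (simplified model, no extrapolation)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur         # assumed restart from 0, so all of `cur` is new
        else:
            total += cur - prev
    return total


healthy = [1000.0, 1010.0, 1020.0, 1030.0]
drop_to_zero = [1000.0, 1010.0, 0.0, 1030.0]    # one bad sample reads 0
partial_drop = [1000.0, 1010.0, 900.0, 1030.0]  # counter dips but not to 0

print(increase_with_resets(healthy))       # 30.0
print(increase_with_resets(drop_to_zero))  # 1040.0
print(increase_with_resets(partial_drop))  # 1040.0
```

In this model a single bad sample turns a real increase of 30 CPU-seconds into 1040, roughly 35 times too much, which is the order of magnitude of a jump from normal values to well over 1000%.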

carlpett added the collector/cpu label Apr 1, 2020

scorpiock closed this as completed Apr 7, 2020

scorpiock changed the title ~~CPU Utilization by Process is not accurate~~ Query CPU usage per process in percent Apr 7, 2020

carlpett added the ❓ question label Apr 9, 2020

katepangLiu mentioned this issue Jun 25, 2021

Inaccurate windows_cpu_time_total value #810

Closed

Adel-CHT mentioned this issue Oct 12, 2021

CPU metric doesn't works on Physical machine (physical server) #850

Closed

paoloyx mentioned this issue Feb 3, 2025

Need help on determining millicores CPU usage for the "windows_exporter" process itself #1868

Open
