Add cpu_speed (in mhz) to cpu or system measurement #4256

Closed
randallt opened this issue Jun 8, 2018 · 10 comments · Fixed by #8988
Labels
area/system, feature request

Comments

@randallt

randallt commented Jun 8, 2018

Add cpu_speed to cpu or system measurement

We are currently in the process of converting from Ganglia to Telegraf. (yeah!) Unfortunately, we have some existing dependence on the Ganglia cpu_speed metric. This is not found in Telegraf.

Proposal:

Add a cpu_speed field or equivalent to the cpu or system measurement. This would be in MHz.

Use case:

This helps mostly in the capacity management area, when mapping cpu mhz of an application group that is targeted for migration to new hosts. We can get the cpu speed other ways, of course, but having it directly and natively in Telegraf would be optimal.

@danielnelson
Contributor

A quick look suggests that we could use the cpu.Info() function from gopsutil to pull in some additional cpu fields:

type InfoStat struct {
	CPU        int32    `json:"cpu"`
	VendorID   string   `json:"vendorId"`
	Family     string   `json:"family"`
	Model      string   `json:"model"`
	Stepping   int32    `json:"stepping"`
	PhysicalID string   `json:"physicalId"`
	CoreID     string   `json:"coreId"`
	Cores      int32    `json:"cores"`
	ModelName  string   `json:"modelName"`
	Mhz        float64  `json:"mhz"`
	CacheSize  int32    `json:"cacheSize"`
	Flags      []string `json:"flags"`
	Microcode  string   `json:"microcode"`
}

This reads and parses /proc/cpuinfo on Linux.
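
To make the parsing step concrete, here is a minimal Python sketch of the same idea. It is not gopsutil's implementation: the function name `parse_cpu_mhz` and the sample text are hypothetical, and it assumes the x86 Linux layout in which every logical CPU has a `cpu MHz : <value>` line.

```python
# Hypothetical two-core excerpt in /proc/cpuinfo format (x86 layout assumed).
SAMPLE = """processor\t: 0
model name\t: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
cpu MHz\t\t: 2700.000

processor\t: 1
model name\t: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
cpu MHz\t\t: 2700.000
"""


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of each logical CPU, in file order."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines of the form "cpu MHz : 2700.000" are of interest.
        if sep and key.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


if __name__ == "__main__":
    print(parse_cpu_mhz(SAMPLE))
```

On a real host the text would come from reading /proc/cpuinfo instead of `SAMPLE`. Architectures that do not report a `cpu MHz` line would simply yield an empty list.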

danielnelson added the feature request and area/system labels Jun 8, 2018
@phemmer
Contributor

phemmer commented Jun 8, 2018

Well it depends on what we're really looking for here. Are we wanting maximum speed, or current speed? What about the max or min limits?

@randallt
Author

randallt commented Jun 11, 2018

I was looking for just the "CPU MHz" field from the Linux command 'lscpu', which appears to be the same as the "cpu MHz" field of each CPU core from 'cat /proc/cpuinfo'. This doesn't change for me and matches the CPU description, like "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz". I'm on a VMware infrastructure, though.

I was not looking for instantaneous frequency or even max boost frequency, just the frequency that corresponds to the CPU description--which when combined with the CPU core count can help give some comparative sense of capacity among VMs and environments.

@danielnelson
Contributor

This field is the instantaneous frequency of the processor, but there are also min and max values. Here they are on my laptop:

CPU MHz:             499.877
CPU max MHz:         3400.0000
CPU min MHz:         400.0000

Even though min/max don't change, I can see the usefulness of collecting them across a fleet of systems. The main thing we should decide is whether collecting this data should be opt-in or whether it is light enough that we should just add it. I think we can add these 3 fields to the standard fields collected by the cpu plugin, since the extra load should be fairly light.

@phemmer
Contributor

phemmer commented Jun 12, 2018

What about the limits? Limits might be useful on embedded (or other) systems which adjust the limits to conserve power.
Dunno if gopsutil provides them all in one spot, but they can all be obtained from /sys/devices/system/cpu/cpu*/cpufreq/

Basically all the fields and their relationships with each other are:
cpuinfo_min_freq <= scaling_min_freq <= scaling_cur_freq <= scaling_max_freq <= cpuinfo_max_freq
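
A minimal Python sketch of reading these five files and checking the relationship stated above. The helper names `read_cpufreq` and `is_ordered` are hypothetical; it assumes the files hold plain integers in kHz (the unit the cpufreq sysfs interface uses) and that some of them may be missing or unreadable, for example on VMs.

```python
from pathlib import Path

# Field names from the comment above, in their expected ascending order.
FIELDS = [
    "cpuinfo_min_freq",
    "scaling_min_freq",
    "scaling_cur_freq",
    "scaling_max_freq",
    "cpuinfo_max_freq",
]


def read_cpufreq(cpufreq_dir):
    """Read whichever frequency files (kHz) exist in one cpufreq directory."""
    values = {}
    for name in FIELDS:
        try:
            values[name] = int((Path(cpufreq_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or non-numeric files are skipped.
            continue
    return values


def is_ordered(values):
    """Check the min <= ... <= max chain over the fields that were read."""
    present = [values[name] for name in FIELDS if name in values]
    return all(a <= b for a, b in zip(present, present[1:]))


if __name__ == "__main__":
    for d in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq")):
        vals = read_cpufreq(d)
        print(d.parent.name, vals, "ordered" if is_ordered(vals) else "out of order")
```

Skipping absent files rather than failing keeps the sketch usable on hosts that expose only part of the interface.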

@randallt
Author

This feature request should probably also be reconciled with this PR:
#4215

@dewi-ny-je

dewi-ny-je commented Mar 2, 2020

I tried in the past to read lscpu and to input the data into InfluxDB using the exec input plugin.

The values were always higher than what I would get by running the same command from the command line, because by the time Telegraf gets to run the plugin, the CPU or kernel has already increased the frequency.

I would say that the plugin makes little sense, unless it is proven to provide reliable values.

I'm running an E3-1220 v2 on Ubuntu 18.04.

@jose-d
Contributor

jose-d commented Jul 15, 2020

> I tried in the past to read lscpu and to input the data into InfluxDB using the exec input plugin.
> The values were always higher than what I would get by running the same command from the command line because by the time telegraf gets to run the plugin, the CPU or kernel already increased the frequency.
>
> I would say that the plugin makes little sense, unless it is proven to provide reliable values.

I see what you mean. In use cases like mine (HPC machines with XX cores), one can assume that the noise introduced by Telegraf itself affects just a few (?) cores (?). Anyway, I'm going to write some exec() collection of /sys/devices/system/cpu/cpuXXX/cpufreq/cpuinfo_cur_freq and keep it running for some weeks on a few compute nodes to see the real-life results.

@jose-d
Contributor

jose-d commented Jul 15, 2020

Here is a proof-of-concept-grade collector meant to be used as an exec input in Telegraf:

https://github.com/jose-d/telegraf-collectors/blob/master/cpufreq-monitor/give_stats.py

In the end I collect the data from /sys/devices/system/cpu/cpuNN/cpufreq/scaling_cur_freq, as it is readable by a non-root user (on CentOS 7).
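
For readers who want the general shape of such a collector, here is a minimal Python sketch. It is not the linked give_stats.py: the measurement name `cpufreq`, the field name `scaling_cur_freq_khz`, and the function names are all hypothetical. It assumes the sysfs values are integers in kHz and prints one InfluxDB line-protocol point per CPU.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def format_line(cpu, khz):
    """Build one line-protocol point; 'cpufreq' is a made-up measurement name."""
    return f"cpufreq,cpu={cpu} scaling_cur_freq_khz={khz}i"


def collect(root=CPU_ROOT):
    """Return one line per CPU that exposes a readable scaling_cur_freq."""
    lines = []
    for path in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        cpu = path.parent.parent.name  # directory name, e.g. "cpu12"
        try:
            khz = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip CPUs whose file is unreadable or not a plain integer.
            continue
        lines.append(format_line(cpu, khz))
    return lines


if __name__ == "__main__":
    for line in collect():
        print(line)
```

A script like this could be listed under `commands` in an `[[inputs.exec]]` section with `data_format = "influx"`, so that Telegraf parses the printed lines directly.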

Screenshot from Grafana:

(It actually shows the reason why this monitoring is useful for me: detecting suboptimal usage of CPU resources by $users.)

[Screenshot: node details - Grafana, 2020-07-15]

@quentinmit

quentinmit commented Jun 19, 2022

Just looking at scaling_cur_freq or cpuinfo_cur_freq is going to cause a whole bunch of aliasing because the frequency normally changes much more often than the Telegraf update interval. It would be better to collect stats/time_in_state which gives a cumulative counter of time spent at each state, which could correctly show you if half the time is spent at one frequency and half at another.
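
A minimal Python sketch of that approach, under stated assumptions: each line of stats/time_in_state is taken to be `<frequency in kHz> <cumulative time>`, and the time unit cancels out of the average. The function names and sample values are hypothetical. Taking the difference between two samples gives the time spent at each frequency during the interval, from which a time-weighted mean follows.

```python
def parse_time_in_state(text):
    """Map frequency (kHz) to its cumulative time counter."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[int(parts[0])] = int(parts[1])
    return counters


def mean_khz(prev, cur):
    """Time-weighted mean frequency between two samples, or None if no time elapsed."""
    # Per-frequency time spent during the interval.
    deltas = {freq: cur[freq] - prev.get(freq, 0) for freq in cur}
    total = sum(deltas.values())
    if total <= 0:
        return None
    return sum(freq * dt for freq, dt in deltas.items()) / total


if __name__ == "__main__":
    # Made-up samples: 500 time units at each of two frequencies in between.
    before = parse_time_in_state("3400000 1000\n400000 5000\n")
    after = parse_time_in_state("3400000 1500\n400000 5500\n")
    print(mean_khz(before, after))  # 1900000.0
```

Because the counters are cumulative, the result does not depend on when within the interval the frequency changed, which is what avoids the aliasing described above.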
