hwmon duplicate core temps #333

Closed
mgulick opened this issue Oct 26, 2016 · 8 comments · Fixed by #334

mgulick commented Oct 26, 2016

With node_exporter 0.13.0-rc.1 on Linux (Debian 8), I'm getting the following debug messages from Prometheus:

Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_0"} => 54 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_1"} => 72 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_10"} => 74 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_11"} => 68 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_12"} => 72 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_2"} => 74 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_3"} => 56 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_4"} => 76 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_8"} => 65 @[1477489208.369] source="scrape.go:460"
Oct 26 09:40:08 sbtools-11-ah prometheus[9625]: time="2016-10-26T09:40:08-04:00" level=debug msg="Sample discarded" error="sample with repeated timestamp but different value" sample=node_hwmon_temp_celsius{chip="coretemp", instance="sb6934glnxa64:9100", job="sbfarm-nodes", sensor="core_9"} => 73 @[1477489208.369] source="scrape.go:460"

A sample from /metrics:

# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="coretemp",sensor="core_0"} 65
node_hwmon_temp_celsius{chip="coretemp",sensor="core_0"} 62
node_hwmon_temp_celsius{chip="coretemp",sensor="core_1"} 62
node_hwmon_temp_celsius{chip="coretemp",sensor="core_1"} 63
node_hwmon_temp_celsius{chip="coretemp",sensor="core_10"} 62
node_hwmon_temp_celsius{chip="coretemp",sensor="core_10"} 62
node_hwmon_temp_celsius{chip="coretemp",sensor="core_11"} 66
node_hwmon_temp_celsius{chip="coretemp",sensor="core_11"} 61
...

CPU is a Xeon E5-2680 v3

Let me know if there is any other information I can provide.
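For anyone who wants to check their own /metrics output for the same symptom, here is a minimal sketch of a duplicate-series check. It is not part of node_exporter or Prometheus; the function name `duplicateSeries` is made up for this illustration, and it assumes the simple `name{labels} value` line shape shown in the sample above (no trailing timestamp, no spaces inside label values).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// duplicateSeries returns every series identifier (metric name plus label
// set) that occurs more than once in a text exposition, in first-seen order.
// Assumption: each sample line looks like `name{labels} value`.
func duplicateSeries(exposition string) []string {
	counts := map[string]int{}
	var order []string

	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and HELP/TYPE comments.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Everything before the last space is the series identifier.
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		series := line[:i]
		if counts[series] == 0 {
			order = append(order, series)
		}
		counts[series]++
	}

	var dups []string
	for _, s := range order {
		if counts[s] > 1 {
			dups = append(dups, s)
		}
	}
	return dups
}

func main() {
	// Excerpt of the sample reported in this issue.
	sample := `# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="coretemp",sensor="core_0"} 65
node_hwmon_temp_celsius{chip="coretemp",sensor="core_0"} 62
node_hwmon_temp_celsius{chip="coretemp",sensor="core_1"} 62
node_hwmon_temp_celsius{chip="coretemp",sensor="core_1"} 63
`
	for _, s := range duplicateSeries(sample) {
		fmt.Println("duplicate series:", s)
	}
}
```

Any series reported here has the same name and label set twice in one scrape, which is what leads Prometheus to discard the second sample.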

brian-brazil (Contributor) commented

@rtreffer

SuperQ (Member) commented Oct 26, 2016

It looks like this is an issue on dual-socket hardware: there are two coretemp trees on these nodes.

SuperQ (Member) commented Oct 26, 2016

I think we need to include the device path as a label.

For example:

/sys/class/hwmon$ ls -l hwmon*/device
lrwxrwxrwx 1 root root 0 Oct 26 15:04 hwmon0/device -> ../../../0000:02:00.0
lrwxrwxrwx 1 root root 0 Oct 26 15:05 hwmon1/device -> ../../../0000:06:00.0
lrwxrwxrwx 1 root root 0 Oct 26 15:00 hwmon2/device -> ../../../coretemp.0
lrwxrwxrwx 1 root root 0 Oct 26 15:01 hwmon3/device -> ../../../coretemp.1

rtreffer (Contributor) commented

Ah, I suspected this might happen. @brian-brazil I removed the path-derived name, which is guaranteed to be unique. Should I re-add it to the labels?

brian-brazil (Contributor) commented

So that'd be coretemp.1 here rather than coretemp?

rtreffer (Contributor) commented

Both solutions can lead to problems.

Using path fragments

Pro: Guaranteed to be unique
Con: Can change between kernel versions or even reboots. Possibly cryptic (e.g. isa-0000:02:00.0)

Using names as exported

Pro: Human readable, stable across reboots
Con: Might not be unique

(I actually tried to come up with a situation where the name is not unique, but did not think of the one shown here.)

My feeling right now would be to use the device path/name if a device link exists, then fall back to the exported name, and finally to the hwmon name.
Basically just changing the precedence in https://github.com/prometheus/node_exporter/blob/master/collector/hwmon_linux.go#L296
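To make the proposed ordering concrete, here is a small sketch of the fallback. It is an illustration only and not the code in hwmon_linux.go; the function `chipLabel` and its three string parameters are hypothetical, with an empty string standing in for "not available".

```go
package main

import "fmt"

// chipLabel picks a chip label using the precedence proposed above:
// the device name if a device link exists, otherwise the exported
// name, otherwise the hwmon directory name.
func chipLabel(deviceName, exportedName, hwmonName string) string {
	switch {
	case deviceName != "":
		return deviceName
	case exportedName != "":
		return exportedName
	default:
		return hwmonName
	}
}

func main() {
	// Two sockets that share the exported name "coretemp" but have
	// distinct device links, as in the listing earlier in this thread.
	fmt.Println(chipLabel("coretemp.0", "coretemp", "hwmon2"))
	fmt.Println(chipLabel("coretemp.1", "coretemp", "hwmon3"))
	// No device link: fall back to the exported name.
	fmt.Println(chipLabel("", "coretemp", "hwmon2"))
	// Neither is available: fall back to the hwmon name.
	fmt.Println(chipLabel("", "", "hwmon2"))
}
```

With this ordering, the two coretemp trees get different labels whenever their device links differ, and the readable exported name is only used when no device link exists.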

rtreffer (Contributor) commented

That would use platform-coretemp.0 as the name, which is always unique.

brian-brazil (Contributor) commented

Sounds like a plan.

horazont added a commit to cloudandheat/node_exporter that referenced this issue Nov 28, 2016
The chip label generation has been changed in prometheus#334 to prefer the
unique device path (e.g. the location on the PCI bus) due to prometheus#333.

Here, a new label, chipName, is introduced which, again, carries
the human-readable sensor name (e.g. coretemp). It is used in
addition to the existing labels.

This makes it possible to mitigate the downsides of the solution to prometheus#333
(namely that the device path may not be stable across kernels and
reboots) for cases where it does not matter that multiple devices
may have the same human-readable name (e.g. aggregation or where
at most one device of a type is present).
horazont added a commit to cloudandheat/node_exporter that referenced this issue Dec 1, 2016
The chip label generation has been changed in prometheus#334 to prefer the
unique device path (e.g. the location on the PCI bus) due to prometheus#333.

Here, a new annotation metric ``node_hwmon_chip_names`` is
introduced, which links the unique chip sysfs path to a
human-readable chip name which may not be unique among chip sysfs
paths (for example, dual-slot systems have multiple
chipType="coretemp" sensors).

This makes it possible to mitigate the downsides of the solution to prometheus#333
(namely that the device path may not be stable across kernels and
reboots) for cases where it does not matter that multiple devices
may have the same human-readable name (e.g. aggregation or where
at most one device with a common chip name is present).

For cases where no human-readable name can be derived, the
annotation metric is not emitted.
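
The aggregation use case described in that commit message can be sketched as a lookup from unique chip identifier to human-readable name. This is only an illustration of the idea: the function `maxByChipName`, the chip identifiers, and the temperature values below are all made up, and the label layout of the real `node_hwmon_chip_names` metric is not shown in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// maxByChipName groups readings keyed by unique chip identifier under the
// chip's human-readable name and keeps the highest reading per name.
// Chips without an entry in chipNames are skipped, mirroring the case
// where no annotation is emitted.
func maxByChipName(chipNames map[string]string, temps map[string]float64) map[string]float64 {
	out := map[string]float64{}
	for chip, t := range temps {
		name, ok := chipNames[chip]
		if !ok {
			continue
		}
		if cur, seen := out[name]; !seen || t > cur {
			out[name] = t
		}
	}
	return out
}

func main() {
	// Hypothetical identifiers and values, for illustration only.
	chipNames := map[string]string{
		"platform_coretemp_0": "coretemp",
		"platform_coretemp_1": "coretemp",
	}
	temps := map[string]float64{
		"platform_coretemp_0": 65,
		"platform_coretemp_1": 62,
		"pci_0000_02_00_0":    48, // no readable name known, so it is skipped
	}

	result := maxByChipName(chipNames, temps)
	names := make([]string, 0, len(result))
	for n := range result {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%s max=%v\n", n, result[n])
	}
}
```

The unique identifier stays the key for individual series, while the readable name is used only where several chips may share it.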