Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ethtool: Prevent duplicate metric names #2187

Merged
merged 1 commit into from
Nov 15, 2021

Conversation

bdrung
Copy link
Contributor

@bdrung bdrung commented Oct 29, 2021

Sanitizing the metric names can lead to duplicate metric names:

caller=level.go:63 level=error caller="error gathering metrics: [from Gatherer #2] collected metric \"node_ethtool_giant_hdr\" { label:<name:\"device\" value:\"ens192\" > untyped:<value:0" msg=" > } was collected before with the same name and label values"

Generate a map from the sanitized metric names to the metric names from ethtool. In case of duplicate sanitized metric names drop both metrics, because it is unknown which one to take.

Fixes: #2185

Sanitizing the metric names can lead to duplicate metric names:

```
caller=level.go:63 level=error caller="error gathering metrics: [from Gatherer prometheus#2] collected metric \"node_ethtool_giant_hdr\" { label:<name:\"device\" value:\"ens192\" > untyped:<value:0" msg=" > } was collected before with the same name and label values"
```

Generate a map from the sanitized metric names to the metric names from
ethtool. In case of duplicate sanitized metric names drop both metrics,
because it is unknown which one to take.

Fixes: prometheus#2185
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
@ventifus
Copy link
Contributor

LGTM

@discordianfish
Copy link
Member

Not sure if we should drop them or add an index or something to make them unique.
@SuperQ wdyt?

@ventifus
Copy link
Contributor

ventifus commented Nov 1, 2021

The driver in question, vmxnet3 (that's the VMware paravirt NIC driver) makes a creative "visual indent" hierarchy of ethtool metrics. I just checked one of my systems that has multi-queue enabled and it's worse than I thought. Not only are there two "giant hdr" entries per queue, but the individual metrics themselves are not uniquely named per queue.

NIC statistics:
     Tx Queue#: 0
       TSO pkts tx: 180312179
       TSO bytes tx: 2088168382975
       ucast pkts tx: 6553058629
       ucast bytes tx: 2657614853748
       mcast pkts tx: 4922
       mcast bytes tx: 306340
       bcast pkts tx: 0
       bcast bytes tx: 0
       pkts tx err: 0
       pkts tx discard: 0
       drv dropped tx total: 0
          too many frags: 0
          giant hdr: 0
          hdr err: 0
          tso: 0
       ring full: 9
       pkts linearized: 0
       hdr cloned: 2
       giant hdr: 0
     Tx Queue#: 1
       TSO pkts tx: 180417461
       TSO bytes tx: 2110097903563
       ucast pkts tx: 3001677632
       ucast bytes tx: 2455481947665
       mcast pkts tx: 21
       mcast bytes tx: 1806
       bcast pkts tx: 9445
       bcast bytes tx: 396690
       pkts tx err: 0
       pkts tx discard: 0
       drv dropped tx total: 0
          too many frags: 0
          giant hdr: 0
          hdr err: 0
          tso: 0
       ring full: 3
       pkts linearized: 0
       hdr cloned: 1
       giant hdr: 0
...

This is trying to show a hierarchical relationship, so the various "giant hdr" metrics are really more like
Tx Queue#: 0 / drv dropped tx total / giant hdr
Tx Queue#: 0 / giant hdr
Tx Queue#: 1 / drv dropped tx total / giant hdr
Tx Queue#: 1 / giant hdr

We could attempt to reconstruct that hierarchy by counting leading spaces. We could add an index number, but we'd end up with "giant hdr number 1", "giant hdr number 2" for queue 0, and "giant hdr number 3" and "giant hdr number 4" for queue 2?

@bdrung
Copy link
Contributor Author

bdrung commented Nov 1, 2021

The unfolding of the indentation should be done in the ethtool library IMO.

@discordianfish
Copy link
Member

Yeah ideally the ethtool lib takes care of parsing this as a hierarchy. I'm quite surprised though that these syscalls are returning indented strings?

But until this happens, the best we can do is either drop duplicates or add an index. I think we'll need to do the later since otherwise it's simply impossible to get the otherwise drops metrics. Thoughts?

@bdrung
Copy link
Contributor Author

bdrung commented Nov 3, 2021

I prefer dropping it, because adding an index makes the metric name fragile IMO. Is "giant hdr number 2" the giant hdr from Tx Queue number 1 or did the queue change?

@discordianfish
Copy link
Member

I'd need to be a stable order, sure. But this seems better than not being able to monitor "giant hdr number 2" at all, no?

@ventifus
Copy link
Contributor

ventifus commented Nov 3, 2021

We're going to have to drop the duplicates for now until I can get the ethtool library to do something useful here. The Stats() call returns a map[string]uint64, so they come to us unordered.

As to why the syscall is returning indented strings, they're hardcoded in the driver.

@discordianfish
Copy link
Member

Ah got it, well then this LGTM.

@discordianfish
Copy link
Member

@SuperQ PTAL

@SuperQ SuperQ merged commit d85cbaa into prometheus:master Nov 15, 2021
@bdrung bdrung deleted the prevent-duplicates branch November 15, 2021 10:30
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
Sanitizing the metric names can lead to duplicate metric names:

```
caller=level.go:63 level=error caller="error gathering metrics: [from Gatherer #2] collected metric \"node_ethtool_giant_hdr\" { label:<name:\"device\" value:\"ens192\" > untyped:<value:0" msg=" > } was collected before with the same name and label values"
```

Generate a map from the sanitized metric names to the metric names from
ethtool. In case of duplicate sanitized metric names drop both metrics,
because it is unknown which one to take.

Fixes: prometheus#2185
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
Sanitizing the metric names can lead to duplicate metric names:

```
caller=level.go:63 level=error caller="error gathering metrics: [from Gatherer #2] collected metric \"node_ethtool_giant_hdr\" { label:<name:\"device\" value:\"ens192\" > untyped:<value:0" msg=" > } was collected before with the same name and label values"
```

Generate a map from the sanitized metric names to the metric names from
ethtool. In case of duplicate sanitized metric names drop both metrics,
because it is unknown which one to take.

Fixes: prometheus#2185
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ethtool error: "panic: "node_ethtool_ err" is not a valid metric name"
4 participants