-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
infiniband: Add new collector for InfiniBand statistics #450
Conversation
Nice! Though I would guess that InfiniBand is probably uncommon enough that it shouldn't be enabled by default. I'll review the rest now. |
Good point! I made sure that it works regardless of whether or not InfiniBand is installed, but I can certainly make that change if desired. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of my comments involve general Go stylistic suggestions. As always, if one of the maintainers of the project (or you, the author) disagree, please feel free to reply to my suggestions.
collector/infiniband.go
Outdated
// limitations under the License. | ||
|
||
// +build linux | ||
// +build !noinfiniband |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the collectors have this tag while others do not. Question for maintainers: are these collector-specific tags still being used in practice anywhere?
collector/infiniband.go
Outdated
i.infinibandCounters = map[string]string{ | ||
"link_downed": "Number of times the link failed to recover from an error state and went down", | ||
"link_error_recovery": "Number of times the link successfully recovered from an error state", | ||
"multicast_rcv_packets": "Number of multicast packets received (including errors)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the precedent is to spell out "receive".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree with you that spelling out "receive" would be beneficial, but multicast_rcv_packets
is the name of the file I read the metric from, hence the use here. I extract every metric's filename from the key in this map.
collector/infiniband.go
Outdated
"link_downed": "Number of times the link failed to recover from an error state and went down", | ||
"link_error_recovery": "Number of times the link successfully recovered from an error state", | ||
"multicast_rcv_packets": "Number of multicast packets received (including errors)", | ||
"multicast_xmit_packets": "Number of multicast packets transmitted (including errors)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, but "transmit" instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment for my reasoning.
collector/infiniband.go
Outdated
for metric, description := range i.infinibandCounters { | ||
i.metricDescs[metric] = prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, subsystem, metric+"_total"), | ||
fmt.Sprintf(description), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove the fmt.Sprintf
wrapper. No need and I'd imagine that descriptions probably should not be variable anyway.
collector/infiniband.go
Outdated
subsystem := "infiniband" | ||
i.metricDescs = make(map[string]*prometheus.Desc) | ||
i.infinibandPath = "class/infiniband" | ||
i.infinibandNoDevicesFound = errors.New("InfiniBand / No devices detected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems unusual to pass around errors as struct members like this. I'd put them up above in a var()
block and just reference them by name that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these pieces of data, I had Robert pull them into the structure based on the feedback I received from my previous pull requests about global variables. Totally cool if any of the maintainers want us to pull them back out.
collector/infiniband.go
Outdated
|
||
subsystem := "infiniband" | ||
i.metricDescs = make(map[string]*prometheus.Desc) | ||
i.infinibandPath = "class/infiniband" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I doubt this will change, it may also be easier to just put this in a const()
block up top and reference it that way, instead of passing it to functions.
collector/infiniband.go
Outdated
|
||
// Extract just the filenames which equates to the port numbers. | ||
for i, port := range ports { | ||
_, ports[i] = filepath.Split(port) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filepath.Base(port).
collector/infiniband_linux_test.go
Outdated
"testing" | ||
) | ||
|
||
func TestDevices(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test names should be namespaced something like TestInfiniBand*
for ease of finding failing tests.
collector/infiniband_linux_test.go
Outdated
} | ||
|
||
if len(devices) != 1 { | ||
t.Fatal(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will just do t.Fatal(nil)
in effect. Should add a more descriptive error message.
collector/infiniband_linux_test.go
Outdated
} | ||
} | ||
|
||
func TestLinkDowned(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the end to end tests, the remaining tests in this file seem unnecessary, as they're essentially just reading the value of a fixture and confirming it. They aren't actually testing that the value is appropriately output with other metrics.
collector/infiniband.go
Outdated
@@ -0,0 +1,166 @@ | |||
// Copyright 2017 The Prometheus Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should be named infiniband_linux.go
.
9c74ef8
to
7886d4c
Compare
@mdlayher thank you very much for the feedback! I am a recent Go convert from python and am still getting used to the new environment. I just pushed the changes you suggested - let me know if there is anything else you see that needs to be changed. |
No problem at all, I'm always happy to help. I don't have time to review this again tonight but I'm happy to take another look tomorrow. Or, if someone from the Prometheus organization would like to take a look, that would be great too! |
Okay! Starting in on another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting there! More nitpicky Go type things and a couple requests to bubble up errors and provide more descriptive test failures.
collector/infiniband_linux.go
Outdated
const infinibandPath = "class/infiniband" | ||
|
||
var ( | ||
infinibandNoDevicesFound = errors.New("InfiniBand / No devices detected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errors should start with err*
, so errInfinibandDevicesNotFound
or similar. Same for next error.
collector/infiniband_linux.go
Outdated
i.counters = map[string]string{ | ||
"link_downed": "Number of times the link failed to recover from an error state and went down", | ||
"link_error_recovery": "Number of times the link successfully recovered from an error state", | ||
"multicast_rcv_packets": "Number of multicast packets received (including errors)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expand "rcv" to "receive" and "xmit" to "transmit" on all metric names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdlayher As @roclark said in his previous response on this subject, he's using those names because those are the actual metric names used for those metrics in /sys. I took at look at the existing e2e-output.txt file and saw four other metrics from the netstat collector using "Rcv" within the metric name as well (no cases of xmit though). We're open to the most correct answer being to expand these names, but it seemed like the status quo was to maintain the source location's naming (that's what I did for all the ZFS metrics). I figure we'll need input from @discordianfish or @SuperQ on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I must have missed that reply with the new commit. Yeah, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdlayher It's a pretty busy thread, no worries!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was using node_network_{receive,transmit}_*
as a reference for what it's worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. @roclark and I were talking about some of the inconsistency across metrics in the collectors here (certainly not saying our own choices are perfect, mind you), maybe at some point it would make sense to try to standardize within this exporter about things like when to expand abbreviation, use camelcase, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree. At this point I am just curious if it's okay to go back and "fix" metrics names since that would break metrics for previous timeseries when the node_exporter is upgraded.
I was working on a tool at one point that would lint metric names and tell you what the conventions were, but I haven't gotten back to it recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #453 about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From skimming through the PR, giving it 'better' names would require storing some mapping between the filename and the metric, correct? That's what we want to avoid and rather keep the filename as metric name.
Arguably it's somewhat in between, since you store a list of files that might change in the collector anyway, so could also make it a mapping and use names following https://prometheus.io/docs/practices/naming/.
I'm personally fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to explicitly list every file, then you should also fix the names while you're at it. If you were automatically finding the files, then doing a straight pass through would be fine.
collector/infiniband_linux.go
Outdated
// Retrieve a list of InfiniBand devices. | ||
func infinibandDevices(infinibandPath string) ([]string, error) { | ||
devices, err := filepath.Glob(filepath.Join(infinibandPath, "/*")) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little unusual to have a blank line before an error check like this.
collector/infiniband_linux.go
Outdated
prometheus.BuildFQName(Namespace, subsystem, metric+"_total"), | ||
description, | ||
[]string{"device", "port"}, | ||
nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Total nitpick, but if you break the parenthesis onto its own line, things line up nicely and it becomes easy to add or remove arguments in one line later.
Ex:
m := foo(
"bar",
"baz",
)
collector/infiniband_linux.go
Outdated
// Retrieve a list of ports for the InfiniBand device. | ||
func infinibandPorts(infinibandPath, device string) ([]string, error) { | ||
ports, err := filepath.Glob(filepath.Join(infinibandPath, device, "ports/*")) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove blank line.
collector/infiniband_linux_test.go
Outdated
|
||
func TestInfiniBandDevices(t *testing.T) { | ||
devices, err := infinibandDevices("fixtures/sys/class/infiniband") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove blank line.
collector/infiniband_linux_test.go
Outdated
|
||
func TestInfiniBandPorts(t *testing.T) { | ||
ports, err := infinibandPorts("fixtures/sys/class/infiniband", "mlx4_0") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove blank line.
collector/infiniband_linux_test.go
Outdated
} | ||
|
||
if len(devices) != 1 { | ||
t.Fatal("Retrieved an unexpected number of InfiniBand devices") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to specify how many devices were found. I typically do:
if l := len(devices); l != 1 {
t.Fatalf("unexpected number of InfiniBand devices: %d", l)
}
collector/infiniband_linux_test.go
Outdated
t.Fatal(err) | ||
} | ||
|
||
if len(ports) != 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above about a more descriptive error for failing the test.
collector/infiniband_linux.go
Outdated
return &i, nil | ||
} | ||
|
||
// Retrieve a list of InfiniBand devices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go comments should be of the form:
// infinibandDevices retrieves a list of InfiniBand devices.
7886d4c
to
36df228
Compare
Thanks again for the reviews Matt! Always good to get some helpful tips from an expert. Hopefully I got everything sorted out with my latest push but as always, let me know if otherwise. I look forward to getting feedback from some of the other maintainers. |
36df228
to
61cb3f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one final comment about format of error messages. There is also the outstanding question of whether or not this should be on by default.
I'll leave the rest for @SuperQ and @discordianfish . Thanks for bearing with my reviews!
collector/infiniband_linux.go
Outdated
const infinibandPath = "class/infiniband" | ||
|
||
var ( | ||
errInfinibandNoDevicesFound = errors.New("InfiniBand / No devices detected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I forgot to mention last time (sorry!), error messages should have text that starts lowercase (so they can be chained if needed). You probably want some text like no InfiniBand devices detected
. Same goes for the ports error.
An example of why it's typically done this way (with three different errors that each add context):
metrics collection failed: infiniband collector failed: no InfiniBand devices detected".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries! I'd rather get all of these sorted out now instead of causing headaches later.
61cb3f7
to
c392648
Compare
Appears this needs a rebase on master now. |
3bef2cf
to
f0e6fc1
Compare
Rebased off master so all merge conflicts are gone again. |
No further feedback from me. Will let @SuperQ , @discordianfish , and @brian-brazil look this over and also determine if it should be enabled by default. |
collector/infiniband_linux.go
Outdated
"link_error_recovery": "Number of times the link successfully recovered from an error state", | ||
"multicast_rcv_packets": "Number of multicast packets received (including errors)", | ||
"multicast_xmit_packets": "Number of multicast packets transmitted (including errors)", | ||
"port_rcv_data": "Number of data octets received on all links", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bytes
should be in this name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See discussion above: Using good names if we have to track them here anyway or us the filenames here. If we add bytes here, we should also change rcv to receive etc etc, right?
I lean towards making all names follow our best practices but fine either way.
eb77cd5
to
95c5c82
Compare
I edited the names to hopefully be more in line with the Prometheus standards. Let me know if you have any thoughts! And if not, are we good to merge this? Thanks everyone for the help! |
collector/infiniband_linux.go
Outdated
var i infinibandCollector | ||
|
||
// Filenames of all InfiniBand counter metrics including a detailed description. | ||
i.counters = map[string][]string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a little more type safety and readability, I'd probably go with a new type for the value in this map:
type infinibandMetric struct {
File string
Help string
}
Alternatively, I suppose you could do an array ([2]string
), but a struct feels cleaner to me than either a slice or an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed another update using a struct. It works as I intend, but as always, let me know if I can make it more Go-idiomatic.
Add new metrics for the InfiniBand network protocol including the amount of packets sent and received, the number of times the link has been downed and how many times the link has recovered from an error state. Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
95c5c82
to
f809bfd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM again. Up to maintainers to determine if this should be on by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. In general we enable collectors by default that won't generate any metrics or significant load if run on a system without support for whatever the collector is doing. That's the case here so it can be enabled by default.
@brian-brazil @SuperQ any additional thoughts from either of you? |
@SuperQ How about this one? Or are we waiting to pull it in until @brian-brazil re-reviews? |
I was trying to hold off a lot of things so that we could cut 0.14. I guess we can add this to the pile. 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@SuperQ Ya know, I was starting to think that was the plan (hold some PRs back to cut a release)...thanks for pulling this in! We're done at HPE for now until we get Omnipath cards in, so it's helpful for us to have all of our contributed code merged ahead of 0.14. |
Thanks @SuperQ for the review and merge! As Joe mentioned, once I get my hands on some Omni-Path cards, I will crank out another collector. |
Allow parsing multiple `xprt` fields present in the same NFS transport stats data. Fixes: prometheus#450 Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Add new metrics for the InfiniBand network protocol including the amount of packets sent and received, the number of times the link has been downed and how many times the link has recovered from an error state.
Signed-Off-By: Robert Clark robert.d.clark@hpe.com