Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

infiniband: Add new collector for InfiniBand statistics #450

Merged
merged 5 commits into from
Feb 16, 2017

Conversation

roclark
Copy link
Contributor

@roclark roclark commented Feb 2, 2017

Add new metrics for the InfiniBand network protocol including the amount of packets sent and received, the number of times the link has been downed and how many times the link has recovered from an error state.

Signed-Off-By: Robert Clark robert.d.clark@hpe.com

@mdlayher
Copy link
Contributor

mdlayher commented Feb 2, 2017

Nice! Though I would guess that InfiniBand is probably uncommon enough that it shouldn't be enabled by default. I'll review the rest now.

@roclark
Copy link
Contributor Author

roclark commented Feb 2, 2017

Good point! I made sure that it works regardless of whether or not InfiniBand is installed, but I can certainly make that change if desired.

Copy link
Contributor

@mdlayher mdlayher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of my comments involve general Go stylistic suggestions. As always, if one of the maintainers of the project (or you, the author) disagree, please feel free to reply to my suggestions.

// limitations under the License.

// +build linux
// +build !noinfiniband
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the collectors have this tag while others do not. Question for maintainers: are these collector-specific tags still being used in practice anywhere?

i.infinibandCounters = map[string]string{
"link_downed": "Number of times the link failed to recover from an error state and went down",
"link_error_recovery": "Number of times the link successfully recovered from an error state",
"multicast_rcv_packets": "Number of multicast packets received (including errors)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the precedent is to spell out "receive".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree with you that spelling out "receive" would be beneficial, but multicast_rcv_packets is the name of the file I read the metric from, hence the use here. I extract every metric's filename from the key in this map.

"link_downed": "Number of times the link failed to recover from an error state and went down",
"link_error_recovery": "Number of times the link successfully recovered from an error state",
"multicast_rcv_packets": "Number of multicast packets received (including errors)",
"multicast_xmit_packets": "Number of multicast packets transmitted (including errors)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, but "transmit" instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment for my reasoning.

for metric, description := range i.infinibandCounters {
i.metricDescs[metric] = prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, metric+"_total"),
fmt.Sprintf(description),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove the fmt.Sprintf wrapper. No need and I'd imagine that descriptions probably should not be variable anyway.

subsystem := "infiniband"
i.metricDescs = make(map[string]*prometheus.Desc)
i.infinibandPath = "class/infiniband"
i.infinibandNoDevicesFound = errors.New("InfiniBand / No devices detected")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unusual to pass around errors as struct members like this. I'd put them up above in a var() block and just reference them by name that way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these pieces of data, I had Robert pull them into the structure based on the feedback I received from my previous pull requests about global variables. Totally cool if any of the maintainers want us to pull them back out.


subsystem := "infiniband"
i.metricDescs = make(map[string]*prometheus.Desc)
i.infinibandPath = "class/infiniband"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I doubt this will change, it may also be easier to just put this in a const() block up top and reference it that way, instead of passing it to functions.


// Extract just the filenames which equates to the port numbers.
for i, port := range ports {
_, ports[i] = filepath.Split(port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filepath.Base(port).

"testing"
)

func TestDevices(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test names should be namespaced something like TestInfiniBand* for ease of finding failing tests.

}

if len(devices) != 1 {
t.Fatal(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just do t.Fatal(nil) in effect. Should add a more descriptive error message.

}
}

func TestLinkDowned(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the end to end tests, the remaining tests in this file seem unnecessary, as they're essentially just reading the value of a fixture and confirming it. They aren't actually testing that the value is appropriately output with other metrics.

@@ -0,0 +1,166 @@
// Copyright 2017 The Prometheus Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be named infiniband_linux.go.

@roclark
Copy link
Contributor Author

roclark commented Feb 2, 2017

@mdlayher thank you very much for the feedback! I am a recent Go convert from python and am still getting used to the new environment. I just pushed the changes you suggested - let me know if there is anything else you see that needs to be changed.

@mdlayher
Copy link
Contributor

mdlayher commented Feb 3, 2017

No problem at all, I'm always happy to help. I don't have time to review this again tonight but I'm happy to take another look tomorrow. Or, if someone from the Prometheus organization would like to take a look, that would be great too!

@mdlayher
Copy link
Contributor

mdlayher commented Feb 3, 2017

Okay! Starting in on another review.

Copy link
Contributor

@mdlayher mdlayher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there! More nitpicky Go type things and a couple requests to bubble up errors and provide more descriptive test failures.

const infinibandPath = "class/infiniband"

var (
infinibandNoDevicesFound = errors.New("InfiniBand / No devices detected")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors should start with err*, so errInfinibandDevicesNotFound or similar. Same for next error.

i.counters = map[string]string{
"link_downed": "Number of times the link failed to recover from an error state and went down",
"link_error_recovery": "Number of times the link successfully recovered from an error state",
"multicast_rcv_packets": "Number of multicast packets received (including errors)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand "rcv" to "receive" and "xmit" to "transmit" on all metric names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdlayher As @roclark said in his previous response on this subject, he's using those names because those are the actual metric names used for those metrics in /sys. I took at look at the existing e2e-output.txt file and saw four other metrics from the netstat collector using "Rcv" within the metric name as well (no cases of xmit though). We're open to the most correct answer being to expand these names, but it seemed like the status quo was to maintain the source location's naming (that's what I did for all the ZFS metrics). I figure we'll need input from @discordianfish or @SuperQ on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I must have missed that reply with the new commit. Yeah, sounds good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdlayher It's a pretty busy thread, no worries!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using node_network_{receive,transmit}_* as a reference for what it's worth.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. @roclark and I were talking about some of the inconsistency across metrics in the collectors here (certainly not saying our own choices are perfect, mind you), maybe at some point it would make sense to try to standardize within this exporter about things like when to expand abbreviation, use camelcase, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree. At this point I am just curious if it's okay to go back and "fix" metrics names since that would break metrics for previous timeseries when the node_exporter is upgraded.

I was working on a tool at one point that would lint metric names and tell you what the conventions were, but I haven't gotten back to it recently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #453 about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From skimming through the PR, giving it 'better' names would require storing some mapping between the filename and the metric, correct? That's what we want to avoid and rather keep the filename as metric name.
Arguably it's somewhat in between, since you store a list of files that might change in the collector anyway, so could also make it a mapping and use names following https://prometheus.io/docs/practices/naming/.

I'm personally fine either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to explicitly list every file, then you should also fix the names while you're at it. If you were automatically finding the files, then doing a straight pass through would be fine.

// Retrieve a list of InfiniBand devices.
func infinibandDevices(infinibandPath string) ([]string, error) {
devices, err := filepath.Glob(filepath.Join(infinibandPath, "/*"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unusual to have a blank line before an error check like this.

prometheus.BuildFQName(Namespace, subsystem, metric+"_total"),
description,
[]string{"device", "port"},
nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total nitpick, but if you break the parenthesis onto its own line, things line up nicely and it becomes easy to add or remove arguments in one line later.

Ex:

m := foo(
    "bar",
    "baz",
)

// Retrieve a list of ports for the InfiniBand device.
func infinibandPorts(infinibandPath, device string) ([]string, error) {
ports, err := filepath.Glob(filepath.Join(infinibandPath, device, "ports/*"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove blank line.


func TestInfiniBandDevices(t *testing.T) {
devices, err := infinibandDevices("fixtures/sys/class/infiniband")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove blank line.


func TestInfiniBandPorts(t *testing.T) {
ports, err := infinibandPorts("fixtures/sys/class/infiniband", "mlx4_0")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove blank line.

}

if len(devices) != 1 {
t.Fatal("Retrieved an unexpected number of InfiniBand devices")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to specify how many devices were found. I typically do:

if l := len(devices); l != 1 {
    t.Fatalf("unexpected number of InfiniBand devices: %d", l)
}

t.Fatal(err)
}

if len(ports) != 2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above about a more descriptive error for failing the test.

return &i, nil
}

// Retrieve a list of InfiniBand devices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go comments should be of the form:

// infinibandDevices retrieves a list of InfiniBand devices.

@roclark
Copy link
Contributor Author

roclark commented Feb 3, 2017

Thanks again for the reviews Matt! Always good to get some helpful tips from an expert. Hopefully I got everything sorted out with my latest push but as always, let me know if otherwise. I look forward to getting feedback from some of the other maintainers.

Copy link
Contributor

@mdlayher mdlayher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one final comment about format of error messages. There is also the outstanding question of whether or not this should be on by default.

I'll leave the rest for @SuperQ and @discordianfish . Thanks for bearing with my reviews!

const infinibandPath = "class/infiniband"

var (
errInfinibandNoDevicesFound = errors.New("InfiniBand / No devices detected")
Copy link
Contributor

@mdlayher mdlayher Feb 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I forgot to mention last time (sorry!), error messages should have text that starts lowercase (so they can be chained if needed). You probably want some text like no InfiniBand devices detected. Same goes for the ports error.

An example of why it's typically done this way (with three different errors that each add context):

metrics collection failed: infiniband collector failed: no InfiniBand devices detected".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries! I'd rather get all of these sorted out now instead of causing headaches later.

@mdlayher
Copy link
Contributor

mdlayher commented Feb 4, 2017

Appears this needs a rebase on master now.

@roclark roclark force-pushed the add-infiniband branch 4 times, most recently from 3bef2cf to f0e6fc1 Compare February 4, 2017 18:36
@roclark
Copy link
Contributor Author

roclark commented Feb 6, 2017

Rebased off master so all merge conflicts are gone again.

@mdlayher
Copy link
Contributor

mdlayher commented Feb 6, 2017

No further feedback from me. Will let @SuperQ , @discordianfish , and @brian-brazil look this over and also determine if it should be enabled by default.

"link_error_recovery": "Number of times the link successfully recovered from an error state",
"multicast_rcv_packets": "Number of multicast packets received (including errors)",
"multicast_xmit_packets": "Number of multicast packets transmitted (including errors)",
"port_rcv_data": "Number of data octets received on all links",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes should be in this name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See discussion above: Using good names if we have to track them here anyway or us the filenames here. If we add bytes here, we should also change rcv to receive etc etc, right?
I lean towards making all names follow our best practices but fine either way.

@roclark roclark force-pushed the add-infiniband branch 2 times, most recently from eb77cd5 to 95c5c82 Compare February 7, 2017 16:52
@roclark
Copy link
Contributor Author

roclark commented Feb 7, 2017

I edited the names to hopefully be more in line with the Prometheus standards. Let me know if you have any thoughts! And if not, are we good to merge this? Thanks everyone for the help!

var i infinibandCollector

// Filenames of all InfiniBand counter metrics including a detailed description.
i.counters = map[string][]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a little more type safety and readability, I'd probably go with a new type for the value in this map:

type infinibandMetric struct {
    File string
    Help string
}

Alternatively, I suppose you could do an array ([2]string), but a struct feels cleaner to me than either a slice or an array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed another update using a struct. It works as I intend, but as always, let me know if I can make it more Go-idiomatic.

Robert Clark added 2 commits February 7, 2017 11:09
Add new metrics for the InfiniBand network protocol including the amount of packets sent and received, the number of times the link has been downed and how many times the link has recovered from an error state.

Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Robert Clark added 3 commits February 7, 2017 11:09
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Copy link
Contributor

@mdlayher mdlayher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again. Up to maintainers to determine if this should be on by default.

Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In general we enable collectors by default that won't generate any metrics or significant load if run on a system without support for whatever the collector is doing. That's the case here so it can be enabled by default.

@roclark
Copy link
Contributor Author

roclark commented Feb 9, 2017

@brian-brazil @SuperQ any additional thoughts from either of you?

@joehandzik
Copy link
Contributor

@SuperQ How about this one? Or are we waiting to pull it in until @brian-brazil re-reviews?

@SuperQ
Copy link
Member

SuperQ commented Feb 16, 2017

I was trying to hold off a lot of things so that we could cut 0.14. I guess we can add this to the pile. 😄

Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SuperQ SuperQ changed the title Add new collector for InfiniBand statistics infiniband: Add new collector for InfiniBand statistics Feb 16, 2017
@SuperQ SuperQ merged commit 38cd07e into prometheus:master Feb 16, 2017
@joehandzik
Copy link
Contributor

@SuperQ Ya know, I was starting to think that was the plan (hold some PRs back to cut a release)...thanks for pulling this in! We're done at HPE for now until we get Omnipath cards in, so it's helpful for us to have all of our contributed code merged ahead of 0.14.

@roclark
Copy link
Contributor Author

roclark commented Feb 16, 2017

Thanks @SuperQ for the review and merge! As Joe mentioned, once I get my hands on some Omni-Path cards, I will crank out another collector.

@grobie grobie mentioned this pull request Mar 7, 2017
tamcore pushed a commit to gitgrave/node_exporter that referenced this pull request Oct 22, 2024
Allow parsing multiple `xprt` fields present in the same NFS transport
stats data.

Fixes: prometheus#450

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants