Add a new input plugin for InfiniBand card/port statistics #6631

Merged
merged 16 commits into from
Jan 16, 2020

Conversation

willfurnell
Contributor

@willfurnell willfurnell commented Nov 7, 2019

This PR adds a new input plugin for InfiniBand card/port statistics.

Implements #5686

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

@willfurnell willfurnell changed the title WIP: Add a new input plugin for InfiniBand card/port statistics Add a new input plugin for InfiniBand card/port statistics Nov 7, 2019
@willfurnell
Contributor Author

I've added a test, cleaned up the code, and added a README, although the packaging seems to be failing for unrelated reasons.

@danielnelson
Contributor

The packaging failure is related, but it is on Windows. Presumably this plugin wouldn't work on Windows, so I think the best way to handle this is to mirror this change from the ethtool PR:

https://github.com/influxdata/telegraf/pull/5865/files#diff-980231a87c3954198c19881c18fd0126

@willfurnell
Contributor Author

willfurnell commented Nov 8, 2019

Looks like the import I am using, rdmamap, is causing the packaging failure on Windows, as it imports netns. I'm not sure how to prevent this though, as I'm not importing it on non-Linux builds.

I tried go vet on netns on a Windows box, and it produces this error :(

vet.exe: .\netns.go:70:26: cannot use int(*ns) (value of type int) as syscall.Handle value in argument to syscall.Close

I guess we could fork rdmamap and remove the Docker-related bits, as they aren't used in this code anyway, and thereby prevent this issue from occurring. Or is there a way to only vet dependencies on a system they will be run on? EDIT: Although I can see that the docker library uses netns, so I have no idea why it hasn't been picked up before. And other packages use netlink, which uses netns...

@gregorybrzeski
Contributor

I would love to use this plugin; however, it's Linux-specific, and InfiniBand cards are used extensively on Windows as well as on Linux. We use them mainly on Windows.

Windows has numerous counters available for Mellanox cards via perfmon:

> Get-Counter -ListSet *mellanox* | select countersetname

CounterSetName
--------------
Mellanox IB Adapter Traffic Counters
Mellanox IB Adapter Diagnostic Counters
Mellanox Adapter Diagnostic Counters
Mellanox Adapter Traffic Counters
Mellanox Adapter QoS Counters
Mellanox WinOF Bus Counters

These counters can be monitored using the win_perf_counters input plugin. That plugin, or parts of it, could be reused here to provide support for Windows.

On the other hand, is this a plugin for InfiniBand (IB) stats in general, taking multiple vendors into account, or for Mellanox card IB stats? The name suggests the former. Of course, Mellanox is currently the main vendor for InfiniBand equipment after QLogic was sold to Intel. Oracle tried to produce their own IB chipset, but I am not sure how successful they were or whether this equipment is on the market.

Maybe design the plugin in such a way that it can be extended to other vendors (in the future?) and platforms (Windows as well as Linux) using the same interface?

List of all counters available on Windows in perfmon:

> (Get-Counter -ListSet *mellanox*).counter
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Discarded
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Bad CRC Error
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Symbol Error
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Frame Length Error
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Errors
\Mellanox IB Adapter Traffic Counters(*)\Packets Outbound Discarded
\Mellanox IB Adapter Traffic Counters(*)\Packets Outbound Errors
\Mellanox IB Adapter Traffic Counters(*)\Control Packets
\Mellanox IB Adapter Traffic Counters(*)\Packets Total/Sec
\Mellanox IB Adapter Traffic Counters(*)\Packets Total
\Mellanox IB Adapter Traffic Counters(*)\KBytes Total/Sec
\Mellanox IB Adapter Traffic Counters(*)\Bytes Total
\Mellanox IB Adapter Traffic Counters(*)\Packets Sent/Sec
\Mellanox IB Adapter Traffic Counters(*)\Packets Sent
\Mellanox IB Adapter Traffic Counters(*)\KBytes Sent/Sec
\Mellanox IB Adapter Traffic Counters(*)\Bytes Sent
\Mellanox IB Adapter Traffic Counters(*)\Packets Received/Sec
\Mellanox IB Adapter Traffic Counters(*)\Packets Received
\Mellanox IB Adapter Traffic Counters(*)\KBytes Received/Sec
\Mellanox IB Adapter Traffic Counters(*)\Bytes Received
\Mellanox IB Adapter Diagnostic Counters(*)\TX Ring Is Full Packets
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Timeout Received
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Duplicate Request Received
\Mellanox IB Adapter Diagnostic Counters(*)\CQ Overflows
\Mellanox IB Adapter Diagnostic Counters(*)\Requester RNR NAK Retries Exceeded Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Transport Retries Exceeded Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Remote Operation Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Out-of-order Sequence Received
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Out-of-order Sequence NAK
\Mellanox IB Adapter Diagnostic Counters(*)\Responder RNR NAK
\Mellanox IB Adapter Diagnostic Counters(*)\Requester RNR NAK
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Remote Access Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Remote Access Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Invalid Request Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Invalid Request Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder CQE Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester CQE Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Protection Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Protection Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder QP Operation Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester QP Operation Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Length Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Length Errors
\Mellanox Adapter Diagnostic Counters(*)\Device detected stalled state
\Mellanox Adapter Diagnostic Counters(*)\Packet detected as stalled
\Mellanox Adapter Diagnostic Counters(*)\Packets discarded due to TC in stalled state
\Mellanox Adapter Diagnostic Counters(*)\Packets discarded due to Head-Of-Queue lifetime limit
\Mellanox Adapter Diagnostic Counters(*)\Dropless Mode Entries
\Mellanox Adapter Diagnostic Counters(*)\Dropless Mode Exits
\Mellanox Adapter Diagnostic Counters(*)\TX Ring Is Full Packets
\Mellanox Adapter Diagnostic Counters(*)\Requester Timeout Received
\Mellanox Adapter Diagnostic Counters(*)\Responder Duplicate Request Received
\Mellanox Adapter Diagnostic Counters(*)\CQ Overflows
\Mellanox Adapter Diagnostic Counters(*)\Requester RNR NAK Retries Exceeded Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Transport Retries Exceeded Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Remote Operation Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Out-of-order Sequence Received
\Mellanox Adapter Diagnostic Counters(*)\Requester Out-of-order Sequence NAK
\Mellanox Adapter Diagnostic Counters(*)\Responder RNR NAK
\Mellanox Adapter Diagnostic Counters(*)\Requester RNR NAK
\Mellanox Adapter Diagnostic Counters(*)\Responder Remote Access Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Remote Access Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Invalid Request Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Invalid Request Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder CQE Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester CQE Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Protection Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Protection Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder QP Operation Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester QP Operation Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Length Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Length Errors
\Mellanox Adapter Traffic Counters(*)\Packets Received Discarded
\Mellanox Adapter Traffic Counters(*)\Packets Received Bad CRC Error
\Mellanox Adapter Traffic Counters(*)\Packets Received Symbol Error
\Mellanox Adapter Traffic Counters(*)\Packets Received Frame Length Error
\Mellanox Adapter Traffic Counters(*)\Packets Received Errors
\Mellanox Adapter Traffic Counters(*)\Packets Outbound Discarded
\Mellanox Adapter Traffic Counters(*)\Packets Outbound Errors
\Mellanox Adapter Traffic Counters(*)\Control Packets
\Mellanox Adapter Traffic Counters(*)\Packets Total/Sec
\Mellanox Adapter Traffic Counters(*)\Packets Total
\Mellanox Adapter Traffic Counters(*)\KBytes Total/Sec
\Mellanox Adapter Traffic Counters(*)\Bytes Total
\Mellanox Adapter Traffic Counters(*)\Packets Sent/Sec
\Mellanox Adapter Traffic Counters(*)\Packets Sent
\Mellanox Adapter Traffic Counters(*)\KBytes Sent/Sec
\Mellanox Adapter Traffic Counters(*)\Bytes Sent
\Mellanox Adapter Traffic Counters(*)\Packets Received/Sec
\Mellanox Adapter Traffic Counters(*)\Packets Received
\Mellanox Adapter Traffic Counters(*)\KBytes Received/Sec
\Mellanox Adapter Traffic Counters(*)\Bytes Received
\Mellanox Adapter QoS Counters(*)\Responder Ignored ECN due CNP coalesce
\Mellanox Adapter QoS Counters(*)\Sent Discard Frames
\Mellanox Adapter QoS Counters(*)\Requester Traffic Rate Low Peak
\Mellanox Adapter QoS Counters(*)\Requester Traffic Rate High Peak
\Mellanox Adapter QoS Counters(*)\Responder CNP Sent Successfully
\Mellanox Adapter QoS Counters(*)\Responder ECN Handled Successfully
\Mellanox Adapter QoS Counters(*)\Responder Ignored ECN
\Mellanox Adapter QoS Counters(*)\Responder Active CNP
\Mellanox Adapter QoS Counters(*)\Requester Successfully Handled Limitation Request
\Mellanox Adapter QoS Counters(*)\Requester Ignored Limitation Request
\Mellanox Adapter QoS Counters(*)\Requester Allocated Rate Limiters
\Mellanox Adapter QoS Counters(*)\Requester Total Allocated Rate Limiters
\Mellanox Adapter QoS Counters(*)\Requester Current Total Rate
\Mellanox Adapter QoS Counters(*)\Requester Average Total Rate
\Mellanox Adapter QoS Counters(*)\Rcv Pause Duration
\Mellanox Adapter QoS Counters(*)\Rcv Pause Frames
\Mellanox Adapter QoS Counters(*)\Sent Pause Duration
\Mellanox Adapter QoS Counters(*)\Sent Pause Frames
\Mellanox Adapter QoS Counters(*)\Packets Total/Sec
\Mellanox Adapter QoS Counters(*)\Packets Total
\Mellanox Adapter QoS Counters(*)\KBytes Total/Sec
\Mellanox Adapter QoS Counters(*)\Bytes Total
\Mellanox Adapter QoS Counters(*)\Packets Sent/Sec
\Mellanox Adapter QoS Counters(*)\Packets Sent
\Mellanox Adapter QoS Counters(*)\KBytes Sent/Sec
\Mellanox Adapter QoS Counters(*)\Bytes Sent
\Mellanox Adapter QoS Counters(*)\Packets Received/Sec
\Mellanox Adapter QoS Counters(*)\Packets Received
\Mellanox Adapter QoS Counters(*)\KBytes Received/Sec
\Mellanox Adapter QoS Counters(*)\Bytes Received
\Mellanox WinOF Bus Counters(*)\Arrived RDMA CNPs
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for MR
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for EQ
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for CQ
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for QP
\Mellanox WinOF Bus Counters(*)\MTT entries used for MR
\Mellanox WinOF Bus Counters(*)\MTT entries used for EQ
\Mellanox WinOF Bus Counters(*)\MTT entries used for CQ
\Mellanox WinOF Bus Counters(*)\MTT entries used for QP
\Mellanox WinOF Bus Counters(*)\MPT entries used for MR
\Mellanox WinOF Bus Counters(*)\MPT entries used for EQ
\Mellanox WinOF Bus Counters(*)\MPT entries used for CQ
\Mellanox WinOF Bus Counters(*)\MPT entries used for QP
\Mellanox WinOF Bus Counters(*)\External Doorbell Drop/sec
\Mellanox WinOF Bus Counters(*)\External Doorbell Push/sec
\Mellanox WinOF Bus Counters(*)\External Blueflame Replace/sec
\Mellanox WinOF Bus Counters(*)\External Blueflame hit/sec
\Mellanox WinOF Bus Counters(*)\MPT Miss/sec
\Mellanox WinOF Bus Counters(*)\MTT Miss/sec
\Mellanox WinOF Bus Counters(*)\EQ Miss/sec
\Mellanox WinOF Bus Counters(*)\CQ Miss/sec
\Mellanox WinOF Bus Counters(*)\RQ Miss/sec
\Mellanox WinOF Bus Counters(*)\SQ Miss/sec
\Mellanox WinOF Bus Counters(*)\Receive WQE cache lookup/sec
\Mellanox WinOF Bus Counters(*)\Receive WQE cache hit/sec
\Mellanox WinOF Bus Counters(*)\Steering/QPC Back-pressure/sec
\Mellanox WinOF Bus Counters(*)\WQE fetch/Atomic Back-pressure/sec
\Mellanox WinOF Bus Counters(*)\Scatter Back-pressure/sec
\Mellanox WinOF Bus Counters(*)\No-WQE Drops/sec
\Mellanox WinOF Bus Counters(*)\PCI Back-pressure/sec

@willfurnell
Contributor Author

Hi @gregorybrzeski - that looks interesting. To me it looks like the Windows drivers support more counters than their Linux counterparts (the KBytes/Sec ones), and they are named differently too. I am unable to develop for Windows, as I only have access to InfiniBand on RHEL and CentOS, so someone else would have to contribute this.

I also only have Mellanox hardware to test with, but according to the kernel documentation, it looks like all InfiniBand devices will be enumerated under /sys/class/infiniband: https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-class-infiniband

So stats should be consistent between different vendors on Linux I think?

At the moment the plugin is split into infiniband_linux.go, with support for Linux, and infiniband_nonlinux.go for all other platforms, so an infiniband_windows.go file could be created and Windows support added there.

@willfurnell
Contributor Author

I've finally got it building on Windows by making sure that any references to the rdmamap library are only built on Linux - a much simpler solution than the one I suggested earlier :)

@willfurnell
Contributor Author

Is it possible to request a review of this please? Thanks!

Contributor

@danielnelson danielnelson left a comment

Looks good, would just like a few minor changes.


func init() {
inputs.Add("infiniband", func() telegraf.Input {
log.Print("W! [inputs.infiniband] Current platform is not supported")
Contributor

Move this into an Init() function, as we noticed that this placement causes the warning to be printed at every startup.

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ethtool/ethtool_notlinux.go

Also, a bit of a nitpick, but can you call this file infiniband_notlinux.go for improved consistency?

Contributor Author

Sure, done! Don't worry about nitpicking, I'm always happy to learn a better way to do things :)

@@ -0,0 +1,8 @@
Copyright 2019 United Kingdom Research and Innovation
Contributor

Can you remove this file? We don't include any additional LICENSE or copyright notices outside of the top level LICENSE. You will still retain copyright on this code; check the CLA for details.

Contributor Author

Yep removed


// Sample configuration for plugin
var InfinibandConfig = `
## no config required
Contributor

You can remove this comment line; Telegraf has magic to add something very similar when there is no config.

Contributor Author

Removed

rdmaDevices := rdmamap.GetRdmaDeviceList()

if len(rdmaDevices) == 0 {
return fmt.Errorf("No InfiniBand devices found on this system! Check /sys/class/infiniband/ exists")
Contributor

https://github.com/golang/go/wiki/CodeReviewComments#error-strings

return fmt.Errorf("no InfiniBand devices found in /sys/class/infiniband/")

Contributor Author

Replaced based on your comment.

Contributor Author

@willfurnell willfurnell left a comment

I've implemented all changes as per your request :)

@danielnelson danielnelson added this to the 1.14.0 milestone Jan 16, 2020
@danielnelson danielnelson merged commit 182104f into influxdata:master Jan 16, 2020
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023