Expose each libfabric NIC as one NIC device to the user in case of non-NVIDIA platforms #544
+20
−20
This patch exposes each libfabric NIC as one NIC device to the user on non-NVIDIA platforms.
Previously, the plugin would group NICs around trainium devices, or error out if the accelerator was neither an NVIDIA GPU nor a trainium device. This patch consists of the following two changes:
RDMA: Expose each NIC as a device in case of unknown accelerator
The NIC grouping code for the RDMA protocol groups NICs around
accelerators. To find the accelerators, the grouping code uses
hard-coded accelerator identifiers. The supported hard-coded
accelerators are NVIDIA GPUs and trainium devices.
In case of an unknown accelerator, the grouping code would fail and
error out.
This patch modifies the code such that each libfabric NIC is exposed
as one device to the user of the plugin when there is no known
accelerator around which a libfabric NIC can be grouped.
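The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `struct nic_info`, `group_nics`, and the `accel_id`/`group_id` fields are hypothetical names, and the sketch assumes that every known accelerator has at least one NIC attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for illustration; not the plugin's real data structure. */
struct nic_info {
    int accel_id;  /* index of the nearest known accelerator, -1 if none */
    int group_id;  /* output: index of the device this NIC is exposed under */
};

/*
 * Assign every NIC to a device group and return the number of devices
 * exposed to the user.
 *
 * With no known accelerator (num_accels == 0), each NIC becomes its own
 * device instead of the grouping failing.
 */
size_t group_nics(struct nic_info *nics, size_t num_nics, size_t num_accels)
{
    assert(nics != NULL || num_nics == 0);

    if (num_accels == 0) {
        /* Fallback: one device per libfabric NIC. */
        for (size_t i = 0; i < num_nics; i++)
            nics[i].group_id = (int)i;
        return num_nics;
    }

    /* Known accelerators: NICs that share an accelerator share a group.
     * Assumption: every accelerator has at least one NIC. */
    for (size_t i = 0; i < num_nics; i++) {
        assert(nics[i].accel_id >= 0 &&
               (size_t)nics[i].accel_id < num_accels);
        nics[i].group_id = nics[i].accel_id;
    }
    return num_accels;
}
```

With four NICs and no known accelerator, this sketch reports four devices; with four NICs split across two known accelerators, it reports two.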
topo: Avoid grouping of multiple NICs to trainium accelerator
Since a TRN accelerator is composed of multiple cores, the number of
trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
provides trainium maximal freedom in routing data over NICs.
In the long run, a better solution might be to expose the actual
number of cores to the plugin and take that number into account when
grouping NICs.
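One possible reading of that long-run idea is sketched below. It is purely illustrative and not part of this patch: `num_devices_for_cores` and `device_for_nic` are hypothetical helpers, and the round-robin distribution of NICs is an assumption, not something the plugin is known to do.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of NIC devices to expose when the core
 * count is known. Each core gets its own device if enough NICs exist.
 */
size_t num_devices_for_cores(size_t num_cores, size_t num_nics)
{
    if (num_cores == 0)
        return num_nics;  /* no core information: one device per NIC */
    return num_cores < num_nics ? num_cores : num_nics;
}

/*
 * Hypothetical helper: device index for a given NIC, spreading NICs
 * round-robin across the exposed devices.
 */
size_t device_for_nic(size_t nic_idx, size_t num_devices)
{
    return num_devices == 0 ? 0 : nic_idx % num_devices;
}
```

Under these assumptions, two cores and eight NICs would yield two devices with four NICs each, while eight cores and four NICs would still yield only four devices.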
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.