Expose each libfabric NIC as one NIC device to the user in case of non-NVIDIA platforms #544

Merged
maxtmann merged 2 commits into aws:master from feature/avoid_grouping_on_TRN_platforms on Aug 27, 2024

Conversation

maxtmann
Contributor

This patch exposes each libfabric NIC as one NIC device to the user on non-NVIDIA platforms.

Previously, the plugin would group NICs around the Trainium device, or error out if the accelerator was neither an NVIDIA GPU nor a Trainium device. This patch applies the following two changes:

RDMA: Expose each NIC as a device in case of unknown accelerator

The NIC grouping code for the RDMA protocol groups NICs around
accelerators. To find the accelerators, the grouping code uses
hard-coded accelerator identifiers. The supported hard-coded
accelerators are NVIDIA GPUs and Trainium devices.
In case of an unknown accelerator, the grouping code would fail and
error out.
This patch modifies the code such that each libfabric NIC is exposed
as one device to the user of the plugin in the absence of known
accelerators around which a libfabric NIC could be grouped.
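
As a rough illustration of that fallback (the structures and function below are hypothetical, not the plugin's actual API), the change amounts to building one single-NIC device per libfabric NIC whenever no known accelerator is detected:

```c
#include <stddef.h>

/* Hypothetical types; the real plugin defines its own NIC and device structures. */
struct nic_info {
	const char *fabric_name;
};

struct device_group {
	struct nic_info *nics;
	size_t num_nics;
};

/*
 * Fallback used when no NVIDIA GPU or Trainium device is found: instead of
 * erroring out, expose each libfabric NIC as its own single-NIC device.
 * Returns the number of devices made visible to the user of the plugin.
 */
static size_t expose_each_nic_as_device(struct nic_info *nics, size_t num_nics,
					struct device_group *groups_out)
{
	for (size_t i = 0; i < num_nics; i++) {
		groups_out[i].nics = &nics[i];
		groups_out[i].num_nics = 1;
	}
	return num_nics;
}
```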

topo: Avoid grouping of multiple NICs to trainium accelerator

Since a TRN accelerator is composed of multiple cores, the number of
Trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove Trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
gives Trainium maximal freedom in routing data over the NICs.
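
A minimal sketch of this idea (hypothetical names and structure; only the NVIDIA PCI vendor ID is a real value, and this is not the plugin's actual table) drops the Trainium entry from the set of accelerators the topology code groups NICs around:

```c
/* Hypothetical table of accelerator identifiers used as NIC grouping targets. */
struct accel_match {
	unsigned int pci_vendor_id;
	const char *label;
};

static const struct accel_match grouped_accelerators[] = {
	{ 0x10de, "NVIDIA GPU" },	/* real NVIDIA PCI vendor ID */
	/* Trainium entry intentionally removed: with no grouping target found on
	 * Trainium platforms, each libfabric NIC falls through to the
	 * one-device-per-NIC path instead of being grouped around the accelerator. */
};
```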

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account during
NIC grouping.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

maxtmann requested a review from a team as a code owner on August 26, 2024 16:36
rauteric previously approved these changes on Aug 26, 2024
maxtmann force-pushed the feature/avoid_grouping_on_TRN_platforms branch from f0cb8f6 to 2b4de50 on August 26, 2024 17:40
rauteric previously approved these changes on Aug 26, 2024
bwbarrett previously approved these changes on Aug 26, 2024
liralon previously approved these changes on Aug 26, 2024
maxtmann dismissed stale reviews from liralon, bwbarrett, and rauteric via 414997b on August 27, 2024 17:32
maxtmann force-pushed the feature/avoid_grouping_on_TRN_platforms branch from 2b4de50 to 414997b on August 27, 2024 17:32

RDMA: Expose each NIC as a device in case of unknown accelerator
Verified: this commit was signed with the committer's verified signature (maxtmann, Michael Axtmann).
The NIC grouping code for the RDMA protocol groups NICs around
accelerators. To find the accelerators, the grouping code uses
hard-coded accelerator identifiers. The supported hard-coded
accelerators are NVIDIA GPUs and Trainium devices.
In case of an unknown accelerator, the grouping code would fail and
error out.
This patch modifies the code such that each libfabric NIC is exposed
as one device to the user of the plugin in the absence of known
accelerators around which a libfabric NIC could be grouped.

Signed-off-by: Michael Axtmann <axtmannm@amazon.com>

topo: Avoid grouping of multiple NICs to trainium accelerator
Verified: this commit was signed with the committer's verified signature (maxtmann, Michael Axtmann).
Since a TRN accelerator is composed of multiple cores, the number of
Trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove Trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
gives Trainium maximal freedom in routing data over the NICs.

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account during
NIC grouping.

Signed-off-by: Michael Axtmann <axtmannm@amazon.com>
maxtmann force-pushed the feature/avoid_grouping_on_TRN_platforms branch from 414997b to 2021785 on August 27, 2024 17:51
maxtmann merged commit 3ec37f8 into aws:master on Aug 27, 2024
31 checks passed