Expose each libfabric NIC as one NIC device to the user in case of non-NVIDIA platforms #544

Merged
maxtmann merged 2 commits into aws:master from feature/avoid_grouping_on_TRN_platforms on Aug 27, 2024

Conversation

maxtmann
Contributor

This patch exposes each libfabric NIC as one NIC device to the user on non-NVIDIA platforms.

Previously, the plugin would group NICs around the Trainium device, or error out if the accelerator was neither an NVIDIA GPU nor a Trainium device. This patch applies the following two changes:

RDMA: Expose each NIC as a device in case of unknown accelerator

The NIC grouping code for the RDMA protocol groups NICs around
accelerators. To find the accelerators, the grouping code uses
hard-coded accelerator identifiers. The supported hard-coded
accelerators are NVIDIA GPUs and Trainium devices.
In case of an unknown accelerator, the grouping code would fail and
error out.
This patch modifies the code such that each libfabric NIC is exposed
as one device to the user of the plugin in the absence of known
accelerators around which a libfabric NIC could be grouped.
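
As a rough illustration of that fallback (the structures and function below are hypothetical, not the plugin's actual API), the change amounts to building one single-NIC device per libfabric NIC whenever no known accelerator is detected:

```c
#include <stddef.h>

/* Hypothetical types; the real plugin defines its own NIC and device structures. */
struct nic_info {
	const char *fabric_name;
};

struct device_group {
	struct nic_info *nics;
	size_t num_nics;
};

/*
 * Fallback used when no NVIDIA GPU or Trainium device is found: instead of
 * erroring out, expose each libfabric NIC as its own single-NIC device.
 * Returns the number of devices made visible to the user of the plugin.
 */
static size_t expose_each_nic_as_device(struct nic_info *nics, size_t num_nics,
					struct device_group *groups_out)
{
	for (size_t i = 0; i < num_nics; i++) {
		groups_out[i].nics = &nics[i];
		groups_out[i].num_nics = 1;
	}
	return num_nics;
}
```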

topo: Avoid grouping of multiple NICs to trainium accelerator

Since a TRN accelerator is composed of multiple cores, the number of
Trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove Trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
gives Trainium maximal freedom in routing data over the NICs.
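
A minimal sketch of this idea (hypothetical names and structure; only the NVIDIA PCI vendor ID is a real value, and this is not the plugin's actual table) drops the Trainium entry from the set of accelerators the topology code groups NICs around:

```c
/* Hypothetical table of accelerator identifiers used as NIC grouping targets. */
struct accel_match {
	unsigned int pci_vendor_id;
	const char *label;
};

static const struct accel_match grouped_accelerators[] = {
	{ 0x10de, "NVIDIA GPU" },	/* real NVIDIA PCI vendor ID */
	/* Trainium entry intentionally removed: with no grouping target found on
	 * Trainium platforms, each libfabric NIC falls through to the
	 * one-device-per-NIC path instead of being grouped around the accelerator. */
};
```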

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account during
NIC grouping.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

maxtmann requested a review from a team as a code owner on August 26, 2024 16:36
rauteric previously approved these changes on Aug 26, 2024
maxtmann force-pushed the feature/avoid_grouping_on_TRN_platforms branch from f0cb8f6 to 2b4de50 on August 26, 2024 17:40
rauteric previously approved these changes on Aug 26, 2024
bwbarrett previously approved these changes on Aug 26, 2024
liralon previously approved these changes on Aug 26, 2024
maxtmann dismissed stale reviews from liralon, bwbarrett, and rauteric via 414997b on August 27, 2024 17:32
maxtmann force-pushed the feature/avoid_grouping_on_TRN_platforms branch from 2b4de50 to 414997b on August 27, 2024 17:32

RDMA: Expose each NIC as a device in case of unknown accelerator
Verified: this commit was signed with the committer's verified signature (maxtmann, Michael Axtmann).
The NIC grouping code for the RDMA protocol groups NICs around
accelerators. To find the accelerators, the grouping code uses
hard-coded accelerator identifiers. The supported hard-coded
accelerators are NVIDIA GPUs and Trainium devices.
In case of an unknown accelerator, the grouping code would fail and
error out.
This patch modifies the code such that each libfabric NIC is exposed
as one device to the user of the plugin in the absence of known
accelerators around which a libfabric NIC could be grouped.

Signed-off-by: Michael Axtmann <axtmannm@amazon.com>

topo: Avoid grouping of multiple NICs to trainium accelerator
Verified: this commit was signed with the committer's verified signature (maxtmann, Michael Axtmann).
Since a TRN accelerator is composed of multiple cores, the number of
Trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove Trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
gives Trainium maximal freedom in routing data over the NICs.

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account during
NIC grouping.

Signed-off-by: Michael Axtmann <axtmannm@amazon.com>
maxtmann force-pushed the feature/avoid_grouping_on_TRN_platforms branch from 414997b to 2021785 on August 27, 2024 17:51
maxtmann merged commit 3ec37f8 into aws:master on Aug 27, 2024
31 checks passed