Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdma: do local RDMA read on all NIC rails for flush() #652

Merged
merged 1 commit into from
Oct 14, 2024

Conversation

taeilum00
Copy link
Contributor

@taeilum00 taeilum00 commented Oct 7, 2024

This changes flush() implementation in such a way that local RDMA read is requested to all NIC rails for multi-rail cases. This is to make sure all data traffic paths are safely flushed from all NICs to GPU.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@taeilum00 taeilum00 requested a review from a team as a code owner October 7, 2024 23:24
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last sentence in your patch is misleading. The flush function is never disabled. It's just not called in some situations.

@taeilum00 taeilum00 changed the title rdma: do fi_read on all NIC rails for flush() rdma: do local RDMA read on all NIC rails for flush() Oct 9, 2024
@ghost
Copy link

ghost commented Oct 9, 2024

Please drop the merge commit and squash your fixups.

A rebased branch is here: https://github.com/aws-nslick/nccl-net-ofi/tree/taeil-flush-rebased

You can do this locally by just calling:

git checkout flush
git remote add upstream git@github.com:aws/aws-ofi-nccl.git
git remote update --prune
git branch --set-upstream-to=upstream/master
git rebase --signoff -i upstream/master
git push --force-with-lease origin flush

@taeilum00 taeilum00 force-pushed the flush branch 2 times, most recently from 8361649 to 497d444 Compare October 10, 2024 00:15
Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the two small nits, looks good to me.

This changes flush() implementation in such a way that local
RDMA read is requested to all NIC rails for multi-rail cases.
This is to make sure all data traffic paths are safely flushed
from all NICs to GPU.

Signed-off-by: Taeil Um <taeilum@amazon.com>
@rauteric rauteric merged commit d1e2fd8 into aws:master Oct 14, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants