fix(rdma): send periodic control messages to sync sender/receiver #640

Merged
merged 2 commits into from
Oct 2, 2024

Conversation

rauteric
Contributor

@rauteric rauteric commented Oct 1, 2024

We hit a bug where the sender sends a long run of eager messages and
ends up outpacing the receiver by more than the width of the message
buffer, causing an error.

In this patch, the receiver sends a control message to the sender at least
every `msgbuff_size - max_requests` messages, and the sender pauses if it
has not received a control message within that many messages.
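
A minimal sketch of the windowing rule described above, for illustration only. Every struct, field, and function name here is hypothetical and is not taken from the plugin's source; the sketch also assumes that each control message reports the sequence number the receiver has reached, which the description does not state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; these are not the plugin's data structures. */
struct sync_window {
	uint16_t msgbuff_size;  /* width of the message buffer */
	uint16_t max_requests;  /* maximum number of in-flight requests */
};

/* Number of messages allowed between two control messages. */
static inline uint16_t sync_interval(const struct sync_window *w)
{
	return (uint16_t)(w->msgbuff_size - w->max_requests);
}

struct sender_state {
	uint32_t next_seq;       /* sequence number of the next message to send */
	uint32_t last_ctrl_seq;  /* sequence number reported by the latest control message (assumed) */
};

/* Sender side: returns true if another eager message may be sent now.
 * Unsigned subtraction keeps the comparison valid across wraparound. */
static inline bool sender_may_send_eager(const struct sender_state *s,
					 const struct sync_window *w)
{
	return (uint32_t)(s->next_seq - s->last_ctrl_seq) < sync_interval(w);
}

struct receiver_state {
	uint32_t msgs_since_ctrl;  /* messages seen since the last control message */
};

/* Receiver side: call once per received message.
 * Returns true when a control message is due, and resets the counter. */
static inline bool receiver_on_message(struct receiver_state *r,
				       const struct sync_window *w)
{
	r->msgs_since_ctrl++;
	if (r->msgs_since_ctrl >= sync_interval(w)) {
		r->msgs_since_ctrl = 0;
		return true;
	}
	return false;
}
```

In this sketch the sender would check `sender_may_send_eager()` before each eager send and hold the message when it returns false, resuming once a control message advances `last_ctrl_seq`. Because the sender can run at most `msgbuff_size - max_requests` messages past the last control message, it should stay within the message buffer's width.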

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rauteric rauteric requested a review from a team as a code owner October 1, 2024 00:20
@rauteric rauteric force-pushed the rdma_ctrl_sync_fix branch from fc22f71 to a418a10 Compare October 2, 2024 00:55
@rauteric
Contributor Author

rauteric commented Oct 2, 2024

Refactored some of the logic to avoid duplicate code.

@rauteric rauteric force-pushed the rdma_ctrl_sync_fix branch from a418a10 to bb32ff8 Compare October 2, 2024 00:57
@rauteric
Contributor Author

rauteric commented Oct 2, 2024

Rebased on master

@rauteric rauteric force-pushed the rdma_ctrl_sync_fix branch from bb32ff8 to 5205e62 Compare October 2, 2024 19:11
This lock will be used more generally.

Signed-off-by: Eric Raut <eraut@amazon.com>
We hit a bug where the sender sends a long run of eager messages and
ends up outpacing the receiver by more than the width of the message
buffer, causing an error.

In this patch, the receiver sends a control message to the sender at least
every `msgbuff_size - max_requests` messages, and the sender pauses if it
has not received a control message within that many messages.

Signed-off-by: Eric Raut <eraut@amazon.com>
@rauteric
Contributor Author

rauteric commented Oct 2, 2024

Addressed feedback from Nick

@rauteric rauteric force-pushed the rdma_ctrl_sync_fix branch from 5205e62 to 8dc7adf Compare October 2, 2024 19:12
@rauteric
Contributor Author

rauteric commented Oct 2, 2024

Rebased on master

@rauteric rauteric merged commit 2479540 into aws:master Oct 2, 2024
31 checks passed
@rauteric rauteric deleted the rdma_ctrl_sync_fix branch October 3, 2024 00:37