-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ch4/coll: Fix reduce composition alpha with non-zero root #6543
Conversation
test:mpich/ch4/most |
|
test:mpich/ch4/most |
I was trying to enhance the test to cover this bug, with following patch:
But strangely, the enhanced test passes even without your fix. It appears that the posix release gather algorithm works even with the wrong usage of |
test:mpich/ch4/most |
I was wondering why |
Both |
@hzhou did you figure out a consistent test config for this one? |
I haven't got time yet. I'll approve the PR and add the test as future PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TODO: add test as future PRs
We need to handle the case where a non-zero root uses MPI_IN_PLACE. Otherwise we could try reading from a bad address and crash. Fixes pmodels#6540. NOTE: For single node reduce operation with non-zero root, this composition incurs an extra copy from rank 0->root.
@@ -594,14 +594,19 @@ MPL_STATIC_INLINE_PREFIX int MPIDI_Reduce_intra_composition_alpha(const void *se | |||
recvbuf = (void *) ((char *) recvbuf - true_lb); | |||
} | |||
|
|||
/* non-zero root needs to send from recvbuf if using MPI_IN_PLACE */ | |||
intra_sendbuf = (sendbuf == MPI_IN_PLACE && root != 0) ? recvbuf : sendbuf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found why the release_gather code was immune to the bug. It is due to
mpich/src/mpid/ch4/shm/posix/release_gather/nb_reduce_release_gather.h
Lines 397 to 399 in 8451885
if (send_buf == MPI_IN_PLACE) { | |
send_buf = recv_buf; | |
} |
In general, are we allowed to have sendbuf
and recvbuf
overlap? Apparently all our reduce algorithms works with this replacement, but I wonder whether this is a potential hazard.
Pull Request Description
We need to handle the case where a non-zero root uses
MPI_IN_PLACE. Otherwise we could try reading from a bad address and
crash. Fixes #6540.
NOTE: For single node reduce operation with non-zero root, this
composition incurs an unnecessary extra copy.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.