v2.x communicator code updates #2215
Conversation
(cherry picked from commit 7397276) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit simplifies the communicator context ID generation by removing the blocking code. The high level calls: ompi_comm_nextcid and ompi_comm_activate remain but now call the non-blocking variants and wait on the resulting request. This was done to remove the parallel paths for context ID generation in preparation for further improvements of the CID generation code. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 035c2e2) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
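The blocking-wrapper-over-non-blocking pattern this commit describes can be sketched in plain C. All names below (fake_request_t, comm_nextcid_nb, request_wait) are hypothetical stand-ins for illustration, not the actual Open MPI symbols:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool complete;
    int  result;      /* the context ID once the request completes */
} fake_request_t;

/* Non-blocking variant: starts CID negotiation and returns via a request.
 * A real implementation would kick off an allreduce-based negotiation;
 * here we complete immediately for illustration. */
static void comm_nextcid_nb(fake_request_t *req)
{
    req->result   = 42;
    req->complete = true;
}

/* Wait until the request completes (a progress engine would be
 * driven inside this loop). */
static int request_wait(fake_request_t *req)
{
    while (!req->complete) {
        /* progress would be made here */
    }
    return req->result;
}

/* Blocking wrapper: the high-level call survives, but is now just
 * "start the non-blocking variant, then wait on the request". */
static int comm_nextcid(void)
{
    fake_request_t req = { .complete = false, .result = -1 };
    comm_nextcid_nb(&req);
    return request_wait(&req);
}
```

The payoff is a single code path: the non-blocking variant is the only real implementation, and the blocking entry point becomes a thin shim.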
This commit introduces a new algorithm for MPI_Comm_split_type. The old algorithm performed an allgather on the communicator to decide which processes were part of the new communicators. This does not scale well in either time or memory. The new algorithm performs a couple of allreduces to determine the global parameters of the MPI_Comm_split_type call. If any rank gives an inconsistent split_type (as defined by the standard) an error is returned without proceeding further. The algorithm then creates a communicator with all the ranks that match the split_type (no communication required) in the same order as the original communicator. It then does an allgather on the new communicator (which should be much smaller) to determine 1) if the new communicator is in the correct order, and 2) if any ranks in the new communicator supplied MPI_UNDEFINED as the split_type. If either of these conditions is detected the new communicator is split using ompi_comm_split and the intermediate communicator is freed. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 4c49c42) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
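The post-allgather check on the smaller intermediate communicator can be sketched in plain C (no MPI). The function name, array layout, and the SPLIT_UNDEFINED stand-in are hypothetical, not the Open MPI code:

```c
#include <assert.h>
#include <stdbool.h>

#define SPLIT_UNDEFINED (-1)   /* stand-in for MPI_UNDEFINED */

/* orig_rank[i]: rank in the original communicator of member i of the
 * intermediate communicator; split_type[i]: that member's split_type.
 * Returns true when a follow-up comm_split is still required. */
static bool needs_final_split(const int *orig_rank, const int *split_type,
                              int n)
{
    for (int i = 0; i < n; ++i) {
        /* any member that asked for MPI_UNDEFINED must be dropped */
        if (SPLIT_UNDEFINED == split_type[i]) return true;
        /* members must already appear in original-communicator order */
        if (i > 0 && orig_rank[i] <= orig_rank[i - 1]) return true;
    }
    return false;
}
```

In the common case neither condition holds, so the extra ompi_comm_split and the free of the intermediate communicator are skipped entirely.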
(cherry picked from commit 36a9063) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Back-ported from 01a653d Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
…_allreduce_intra_pmix_nb() (cherry picked from commit bbc6d4b) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit ba77d9b) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
That was causing CUDA collectives to crash. (cherry picked from commit 61e900e) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit should restore the pre-non-blocking behavior of the CID allocator when threads are used. There are two primary changes: 1) do not hold the cid allocator lock past the end of a request callback, and 2) if a lower id communicator is detected during CID allocation back off and let the lower id communicator finish before continuing. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit fbbf743) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
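The back-off rule in change (2) above can be sketched as a single predicate (hypothetical names, not the Open MPI implementation): when two communicators race to allocate CIDs from different threads, the one holding the lower existing ID is allowed to finish first.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the communicator with ID my_id must back off and
 * retry later, because a communicator with a lower ID is also in the
 * middle of CID allocation. active_id == -1 means no other allocation
 * is in flight. */
static bool must_back_off(int my_id, int active_id)
{
    return active_id != -1 && active_id < my_id;
}
```

Because the ordering is total (communicator IDs are distinct), one of any pair of racing communicators always proceeds, which avoids the livelock a symmetric retry rule could produce.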
This commit updates the intercomm allgather to do a local comm bcast as the final step. This should resolve a hang seen in intercomm tests. Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from commit 54cc829) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Use MPI_MIN instead of MPI_MAX when appropriate; otherwise a currently used CID can be reused, and bad things will likely happen. Refs open-mpi#2061 (cherry picked from commit 3b968ec) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
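A plain-C illustration of why the reduction operation matters here (hypothetical layout, not the Open MPI code): suppose each rank contributes a flag for a candidate CID, 1 if it is free locally and 0 if it is in use locally. The CID may only be chosen when it is free on every rank, i.e. when the flags are combined with MIN (logical AND); combining with MAX would declare the CID free as soon as any single rank has it free, handing out an in-use CID.

```c
#include <assert.h>

/* MIN-reduction over per-rank availability flags: 1 only if the
 * candidate CID is free on EVERY rank. */
static int reduce_min(const int *flags, int nranks)
{
    int r = flags[0];
    for (int i = 1; i < nranks; ++i)
        if (flags[i] < r) r = flags[i];
    return r;
}

/* MAX-reduction: 1 if the CID is free on ANY rank -- the buggy
 * behavior this commit fixes. */
static int reduce_max(const int *flags, int nranks)
{
    int r = flags[0];
    for (int i = 1; i < nranks; ++i)
        if (flags[i] > r) r = flags[i];
    return r;
}
```

With flags {1, 0, 1} (rank 1 still uses the CID), the MIN-reduction correctly yields 0 (not free), while the MAX-reduction yields 1 and would allow the reuse described above.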
(cherry picked from commit 803897a) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6c6e35b) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@jsquyres FYI. Please let me know whether this meets the requirements of a bug fix. This was originally intended for 2.1.0 when the target was December. The code has been soaking on master for a while and is probably ready to go now.
@hjelmn I just checked the MPI standard and it is illegal to supply different values for the split_type (page 247, line 45), with the exception of MPI_UNDEFINED. Thus, I wonder if we really need the validity check. Second, I understand this operation as a different form of MPI_Comm_split, where the color is globally defined based on prior local knowledge. In other words, as each process has the information about the entire process placement and architecture, it can decide the local color based on this. Once the color is defined, it can simply call MPI_Comm_split.
@bosilca Processes that supply MPI_UNDEFINED still need to know what the split type is or we will hang. MPI_Comm_split_type is indeed just a special case of MPI_Comm_split but the restrictions allow us to do some optimization. The algorithm I implemented does the following:

1. Performs a couple of allreduces to determine the global parameters of the call and to detect an inconsistent split_type, returning an error without proceeding further if one is found.
2. Creates a communicator containing all the ranks that match the split_type, in the same order as the original communicator, with no communication required.
3. Performs an allgather on the (much smaller) new communicator to determine whether it is already in the correct order and whether any rank supplied MPI_UNDEFINED.
4. Only if one of those conditions is detected, splits the intermediate communicator with ompi_comm_split and frees it.
I see. So instead of the MPI_Comm_split allgather you assume that a 4-int reduction, followed by a communicator creation and a smaller allgather, will lead to better results. Do you have any pointer to what the improvement is?
There is a graph on #1873 that shows the improvement on an XC40 on up to 2048 ranks. See https://cloud.githubusercontent.com/assets/1226817/16821220/5658435c-4912-11e6-8e9c-bde7e8639711.png |
bot:lanl:retest
the code looks good. 👍
@bosilca We're using the GitHub reviews these days -- the 👍 is no longer enough. 😄
For some reason I don't have the review at the top on this ticket (but I did on Gilles's PR).
This is way too big a code change this late in the attempt to release 2.1.0. If the release is delayed considerably, we'll think about merging this in.
After discussion with @hppritcha, I moved the milestone back to v2.1.0. I have also confirmed that this fixes COMM_SPAWN (i.e., #2234). It would be nice if we could have a smaller version of this for v2.0.x -- e.g., could we leave out 91337bf?
@jsquyres Sure. I don't see why it can't be done. Will take a look tomorrow.
@hjelmn Any progress on making a smaller PR for v2.0.2?
This PR contains the following:
The last "feature" is probably the most important part of the PR. Before this PR, MPI_Comm_split_type used MPI_Allgather on the communicator. This scales very poorly and is a performance bug in Open MPI. As such, I see this PR as a performance bug fix, not necessarily a new feature. If you disagree then postpone this to 2.2.0.