v2.x communicator code updates #2215

Merged
merged 13 commits into from
Oct 31, 2016


hjelmn
Member

@hjelmn hjelmn commented Oct 12, 2016

This PR contains the following:

  • Cleanup of the CID code so that only the non-blocking path remains. This reduces the overhead of maintaining multiple CID paths. (Cleanup)
  • Optimization for CID generation on intercomm.
  • Optimization for MPI_Comm_split_type.

The last "feature" is probably the most important part of the PR. Before this PR, MPI_Comm_split_type used MPI_Allgather on the communicator, which scales very poorly and is a performance bug in Open MPI. I therefore see this PR as a performance bug fix, not necessarily a new feature. If you disagree, postpone this to 2.2.0.

bosilca and others added 13 commits October 12, 2016 11:48
(cherry picked from commit 7397276)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit simplifies the communicator context ID generation by
removing the blocking code. The high level calls: ompi_comm_nextcid
and ompi_comm_activate remain but now call the non-blocking variants
and wait on the resulting request. This was done to remove the
parallel paths for context ID generation in preparation for further
improvements of the CID generation code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit 035c2e2)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
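
The "blocking call becomes non-blocking start plus wait" structure described in the commit above can be sketched in plain C. This is only an illustration and not Open MPI code: `demo_request_t`, `demo_nextcid_nb`, `demo_progress`, `demo_wait`, and `demo_nextcid` are hypothetical names, and the "rounds" counter stands in for the communication steps a real CID agreement would need.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request object: a multi-step operation that is advanced
 * by repeated progress calls (a stand-in for a real request). */
typedef struct {
    int  steps_left;  /* remaining rounds before the operation finishes */
    int  result;      /* value that is valid once complete is true */
    bool complete;
} demo_request_t;

/* Non-blocking variant: only starts the operation and returns at once. */
void demo_nextcid_nb(demo_request_t *req, int rounds, int cid)
{
    req->steps_left = rounds;
    req->result     = cid;
    req->complete   = (rounds == 0);
}

/* Advance the request by one step; returns true once it is complete. */
bool demo_progress(demo_request_t *req)
{
    if (!req->complete && --req->steps_left <= 0) {
        req->complete = true;
    }
    return req->complete;
}

/* Spin on the progress function until the request completes. */
void demo_wait(demo_request_t *req)
{
    while (!demo_progress(req)) {
        /* a real implementation would drive the network here */
    }
}

/* Blocking variant: a thin wrapper, so only one code path exists. */
int demo_nextcid(int rounds, int cid)
{
    demo_request_t req;
    demo_nextcid_nb(&req, rounds, cid);
    demo_wait(&req);
    return req.result;
}
```

With this shape, any later improvement to the non-blocking path is picked up by the blocking entry point without a second implementation to maintain.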
This commit introduces a new algorithm for MPI_Comm_split_type. The
old algorithm performed an allgather on the communicator to decide
which processes were part of the new communicators. This does not
scale well in either time or memory.

The new algorithm performs a couple of all reductions to determine the
global parameters of the MPI_Comm_split_type call. If any rank gives
an inconsistent split_type (as defined by the standard) an error is
returned without proceeding further. The algorithm then creates a
communicator with all the ranks that match the split_type (no
communication required) in the same order as the original
communicator. It then does an allgather on the new communicator (which
should be much smaller) to determine 1) if the new communicator is in
the correct order, and 2) if any ranks in the new communicator
supplied MPI_UNDEFINED as the split_type. If either of these
conditions is detected, the new communicator is split using
ompi_comm_split and the intermediate communicator is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 4c49c42)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
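
As a rough illustration of how a pair of reductions can establish the global split_type and catch inconsistent input, the sketch below simulates a MIN and a MAX reduction over an array holding one value per rank. It is an assumption about the general technique, not the code in this PR: `DEMO_UNDEFINED`, `DEMO_ERR_INCONSISTENT`, and `demo_global_split_type` are hypothetical, and the loop stands in for the all-reductions.

```c
#include <limits.h>
#include <stddef.h>

#define DEMO_UNDEFINED        (-1)  /* stand-in for MPI_UNDEFINED */
#define DEMO_ERR_INCONSISTENT (-2)  /* hypothetical error code */

/* Simulate a MIN and a MAX reduction over the split_type of every rank.
 * Ranks that pass DEMO_UNDEFINED contribute the identity element of each
 * reduction, so they never influence the agreed value but still learn it. */
int demo_global_split_type(const int *split_types, size_t nranks)
{
    int gmin = INT_MAX;
    int gmax = INT_MIN;

    for (size_t i = 0; i < nranks; ++i) {
        int undefined = (split_types[i] == DEMO_UNDEFINED);
        int local_min = undefined ? INT_MAX : split_types[i];
        int local_max = undefined ? INT_MIN : split_types[i];
        if (local_min < gmin) gmin = local_min;
        if (local_max > gmax) gmax = local_max;
    }

    if (gmax == INT_MIN) return DEMO_UNDEFINED;         /* every rank opted out */
    if (gmin != gmax)    return DEMO_ERR_INCONSISTENT;  /* ranks disagree */
    return gmax;                                        /* the agreed split_type */
}
```

Because every rank sees the same reduced values, every rank reaches the same verdict, which is what allows an error to be returned before any communicator is created.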
(cherry picked from commit 36a9063)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Back-ported from 01a653d

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
…_allreduce_intra_pmix_nb()

(cherry picked from commit bbc6d4b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit ba77d9b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
That was causing CUDA collective to crash.

(cherry picked from commit 61e900e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit should restore the pre-non-blocking behavior of the CID
allocator when threads are used. There are two primary changes: 1)
do not hold the cid allocator lock past the end of a request callback,
and 2) if a lower id communicator is detected during CID allocation
back off and let the lower id communicator finish before continuing.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit fbbf743)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the intercomm allgather to do a local comm bcast
as the final step. This should resolve a hang seen in intercomm
tests.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from commit 54cc829)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.

Refs open-mpi#2061

(cherry picked from commit 3b968ec)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
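
To see why the choice of reduction operator matters for CID agreement, here is a small simulation of a two-round scheme: every rank proposes its lowest free CID and the proposals are combined with MAX, then every rank reports whether the candidate is free locally. The commit message does not say which reduction was changed, so treating the confirmation flag as the affected one is an assumption, and `demo_agree_cid` and `DEMO_MAX_CID` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_CID 16  /* hypothetical size of each rank's CID table */

/* used[r * DEMO_MAX_CID + c] is true when rank r already uses CID c.
 * Returns the agreed CID, or -1 if none is available. */
int demo_agree_cid(const bool *used, size_t nranks, bool confirm_with_min)
{
    if (nranks == 0) return -1;

    int start = 0;
    while (start < DEMO_MAX_CID) {
        /* Round 1: each rank proposes its lowest free CID >= start;
         * the proposals are reduced with MAX. */
        int candidate = -1;
        for (size_t r = 0; r < nranks; ++r) {
            int c = start;
            while (c < DEMO_MAX_CID && used[r * DEMO_MAX_CID + c]) ++c;
            if (c > candidate) candidate = c;
        }
        if (candidate >= DEMO_MAX_CID) return -1;

        /* Round 2: each rank reports 1 if the candidate is free locally. */
        int flag_min = 1;
        int flag_max = 0;
        for (size_t r = 0; r < nranks; ++r) {
            int flag = used[r * DEMO_MAX_CID + candidate] ? 0 : 1;
            if (flag < flag_min) flag_min = flag;
            if (flag > flag_max) flag_max = flag;
        }

        /* MIN accepts only if the CID is free on every rank;
         * MAX would accept if it is free on at least one rank. */
        if (confirm_with_min ? flag_min : flag_max) return candidate;
        start = candidate + 1;
    }
    return -1;
}
```

For example, if rank 0 uses CIDs 0 and 1 while rank 1 uses CIDs 0 and 2, the first candidate is 2. Confirming with MIN rejects it and the next round settles on 3, whereas confirming with MAX would hand out 2 even though rank 1 still uses it.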
(cherry picked from commit 803897a)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6c6e35b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Member Author

hjelmn commented Oct 12, 2016

@jsquyres FYI. Please let me know whether this meets the requirements of a bug fix. This was originally intended for 2.1.0 when the target was December. The code has been soaking on master for a while and is probably ready to go now.

@bosilca
Member

bosilca commented Oct 12, 2016

@hjelmn I just checked the MPI standard, and it is illegal to supply different values for the split_type (page 247, line 45), with the exception of MPI_UNDEFINED. Thus, I wonder if we really need the validity check. Second, I understand this operation as a different form of MPI_Comm_split, where the color is globally defined based on prior local knowledge. In other words, as each process has the information about the entire process placement and architecture, it can decide the local color based on this. Once the color is defined, it can simply call MPI_Comm_split.

@hjelmn
Member Author

hjelmn commented Oct 12, 2016

@bosilca Processes that supply MPI_UNDEFINED still need to know what the split type is or we will hang.

MPI_Comm_split_type is indeed just a special case of MPI_Comm_split but the restrictions allow us to do some optimization. The algorithm I implemented does the following:

  • Form the local and remote groups based on information about the process placement.
  • Use the above groups to create a new intermediary communicator.
  • Perform an all-gather on the (hopefully much-smaller) intermediate communicator if either 1) the procs may need to be reordered, or 2) if any procs supplied MPI_UNDEFINED.
  • If reordering/dropping ranks is needed MPI_Comm_split is run on the intermediate communicator to do the dirty work.
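
The steps above can be sketched as a small single-process simulation. It is a simplified model and not the Open MPI implementation: `node_of` stands in for the process placement information, `key` for the ordering key each rank passes, `undefined` for ranks that supplied MPI_UNDEFINED, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Steps 1-2: build the membership of the intermediate communicator for
 * rank `me` from placement data alone (no communication), preserving the
 * rank order of the original communicator. Returns the member count. */
size_t demo_local_group(const int *node_of, size_t nranks, size_t me,
                        size_t *members)
{
    size_t n = 0;
    for (size_t r = 0; r < nranks; ++r) {
        if (node_of[r] == node_of[me]) {
            members[n++] = r;
        }
    }
    return n;
}

/* Steps 3-4: with the per-member data that an allgather on the small
 * intermediate communicator would provide, decide whether the fallback
 * split is needed (keys out of order, or a member opted out). */
bool demo_needs_split(const size_t *members, size_t n,
                      const int *key, const bool *undefined)
{
    for (size_t i = 0; i < n; ++i) {
        if (undefined[members[i]]) {
            return true;  /* this rank must be dropped */
        }
        if (i > 0 && key[members[i]] < key[members[i - 1]]) {
            return true;  /* ranks must be reordered */
        }
    }
    return false;  /* the intermediate communicator is already final */
}
```

In the common case where keys follow the original rank order and nobody opts out, `demo_needs_split` returns false and the intermediate communicator can be used as the result, so the only collective data exchange is over the small group.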

@bosilca
Member

bosilca commented Oct 12, 2016

I see. So instead of the MPI_Comm_split allgather, you assume that a 4-int reduction, followed by a communicator creation and a smaller allgather, will lead to better results. Do you have any pointers to what the improvement is?

@hjelmn
Member Author

hjelmn commented Oct 12, 2016

There is a graph on #1873 that shows the improvement on an XC40 on up to 2048 ranks.

See https://cloud.githubusercontent.com/assets/1226817/16821220/5658435c-4912-11e6-8e9c-bde7e8639711.png

@jsquyres
Member

bot:lanl:retest

@bosilca
Member

bosilca commented Oct 17, 2016

the code looks good. 👍

@jsquyres
Member

@bosilca We're using the Github reviews these days -- the 👍 is no longer enough. 😄

@bosilca
Member

bosilca commented Oct 17, 2016

For some reason I don't have the review at the top on this ticket (but I did on Gilles's PR).

@jsquyres jsquyres added this to the v2.2.0 milestone Oct 17, 2016
@jsquyres jsquyres removed this from the v2.1.0 milestone Oct 17, 2016
@hppritcha
Member

hppritcha commented Oct 17, 2016

This is way too big a code change this late in the attempt to release 2.1.0. If the release is delayed considerably we'll think about merging this in.

@jsquyres jsquyres modified the milestones: v2.1.0, v2.2.0 Oct 31, 2016
@jsquyres
Member

After discussion with @hppritcha, I moved the milestone back to v2.1.0.

I have also confirmed that this fixes COMM_SPAWN (i.e., #2234).

It would be nice if we could have a smaller version of this for v2.0.x -- e.g., could we leave out 91337bf?

@jsquyres jsquyres merged commit 87a79fa into open-mpi:v2.x Oct 31, 2016
@hjelmn
Member Author

hjelmn commented Oct 31, 2016

@jsquyres Sure. I don't see why it can't be done. Will take a look tomorrow.

@jsquyres
Member

jsquyres commented Nov 3, 2016

@hjelmn Any progress on making a smaller PR for v2.0.2?
