-
Notifications
You must be signed in to change notification settings - Fork 868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
non-blocking collectives: retain MPI_op and MPI_Datatype(s) #2154
non-blocking collectives: retain MPI_op and MPI_Datatype(s) #2154
Conversation
5b1cc1d
to
a525940
Compare
@bosilca can you please review this ? when a non blocking collective is invoked, user operators and datatypes (e.g. this solves issues when user op/datatype are free'd by the app before a non-blocking collective completes. i previously made a prototype but at the there cannot be any issues with intrinsic operators nor predefined datatypes, so i did not bother retain/release them. my understanding is there is a plan to stop doing that for non blocking point-to-point (e.g. i implemented that by "inserting" a completion callback ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a Signed-off-by line to this PR's commit.
a525940
to
c9b7ac4
Compare
bot:ibm:retest |
@ggouaillardet Is this PR is supposed to fix Issue #2151? |
@ggouaillardet as we briefly discussed during SC, I think the best approach will be to stop doing any refcount inside the library and instead move them all up into the MPI layer. This will allow us to release/retain all the non-predefined types and ops only once per operation. |
moving to 2.0.3 |
@bosilca this PR can be seen as a first step. it handles ddt/ops for non blocking collectives at the MPI level. meanwhile, do you have any comments on this PR ? |
This patch looks extremely big, a lot of code for a little benefit. I wonder why can't we simply add the datatype and op release code next to the calls to Aren't we supposed to retain the datatype (ompi_coll_base_retain_datatypes) and op (ompi_coll_base_retain_op) before the call to the non-blocking functions themselves ? |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/3a4218f168345072c128648ce107aeef |
c9b7ac4
to
e776fab
Compare
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/2d5c1a28af517e6eef3a24c202f7cce9 |
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/c7c9ec868e4ece21fece1f8e503c878c |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/3b47b61d703cc97ac46275531114cc15 |
e776fab
to
4fe431b
Compare
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/e0d9e4795c9d80a35b550f2ed3ae84ea |
4fe431b
to
f5fcd84
Compare
f5fcd84
to
8f6a839
Compare
:bot:aws:retest |
8f6a839
to
646391b
Compare
@bosilca this is quite an old PR we discussed way before it was issued. |
Someone asked me about the state of this PR today. Just checking in to see where things are standing? This seems to fix Issue #2151 |
I am waiting for directions from @bosilca since he objected this PR. |
@jjhursey note I did not test this PR vs persistent non blocking collectives. |
@bosilca Thoughts on this PR? |
Here are few questions and suggestions:
|
Thanks @bosilca, I will update the PR based on your feedback. The MPI standard is explicit about the life expectancy of the array arguments for the non blocking calls, and this makes things even easier as you suggested.
|
Thanks @ggouaillardet, I've been looking for that sentence for hours. |
646391b
to
e4dcf83
Compare
I pushed a new commit to address the first point |
541a868
to
a578e45
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking good. I had few minor suggestions for improvement, mostly to make the code more readable. If you make them, feel free to merge directly.
ompi/mca/coll/base/coll_base_util.c
Outdated
|
||
static void release_vecs_callback(ompi_coll_base_nbc_request_t *request) { | ||
ompi_communicator_t *comm = request->super.req_mpi_object.comm; | ||
int count = OMPI_COMM_IS_INTER(comm)?ompi_comm_remote_size(comm):ompi_comm_size(comm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work for persistent neighbor collectives ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you meant non blocking neighbor alltoallw (regardless this is a persistent operation or not), right ? in this case, this is a good catch I fixed in the latest commit.
Two questions:
|
|
@bosilca I made the requested changes. Unless some changes are requested, I will squash the commits and merge this PR tomorrow. |
MPI standard states a user MPI_Op and/or user MPI_Datatype can be free'd after a call to a non blocking collective and before the non-blocking collective completes. Retain user (only) MPI_Op and MPI_Datatype when the non blocking call is invoked, and set a request callback so they are free'd when the MPI_Request completes. Thanks Thomas Ponweiser for reporting this Fixes open-mpi#2151 Fixes open-mpi#1304 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
f11203a
to
0fe756d
Compare
@ggouaillardet @hppritcha, This functionality (along with 65660e5) was never backported to either the v4.0.x, or v4.1.x release streams. My opinion is that this is quite a bit of change for this late in the release streams, and that we should probably wait for v5.0.0 for this functionality. What are your thoughts? |
No description provided.