[v5.0.x] backport bugfixes created during mtt bug bash #11821

Merged
merged 5 commits on Jul 14, 2023

Conversation

wzamazon
Contributor

@wzamazon wzamazon commented Jul 12, 2023

This PR backports 5 patches that address issues found in MTT:
coll/han: fix bug in reduce in place.
This commit fixes reduce_big_in_place.

runtime/params: set ompi_mpi_compat_mpi3 to true by default.
This commit fixes MPI_Errhandler_set/get/free.

pml/cm: fix buffer usage in MCA_PML_CM_HVY_SEND_REQUEST_BSEND_ALLOC()
This commit fixes MPI_Bsend_init.

group: defer the insertion of ompi_group_all_failed_procs to group_table
This commit fixes MPI_Group_union/intersection.

communicator: add errhandler_type back
This commit fixes MPI_Errhandler_fatal.

wzamazon added 4 commits July 12, 2023 17:32
The two pre-defined groups, group_null and group_empty, must be the
0th and 1st groups in the group_table for MPI_Group_f2c to be able
to convert a Fortran group index to a C group.

However, prior to this patch, ompi_group_all_failed_procs was inserted
into the group table as the 0th group, which broke MPI_Group_f2c.

This patch moves the insertion of ompi_group_all_failed_procs to
after group_null and group_empty.

Signed-off-by: Wei Zhang <wzam@amazon.com>
(cherry picked from commit 75ea1df)
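
For context, a minimal self-contained sketch of why the table order matters (hypothetical names such as `my_group_f2c` and a simplified `group_table`, not the actual ompi source): the Fortran handle is simply an index into the group table, so the predefined groups must occupy slots 0 and 1 before anything else is inserted.

```c
/* Simplified sketch, not the ompi implementation: MPI_Group_f2c-style
 * conversion treats the Fortran handle as an index into a group table,
 * so slots 0 and 1 must hold the predefined groups. */
#include <stddef.h>

#define GROUP_TABLE_SIZE 16

typedef struct my_group { const char *name; } my_group_t;
static my_group_t *group_table[GROUP_TABLE_SIZE];

static my_group_t group_null  = { "group_null"  };
static my_group_t group_empty = { "group_empty" };
static my_group_t all_failed  = { "all_failed_procs" };

/* Fortran-to-C conversion: the Fortran integer handle is a table index. */
static my_group_t *my_group_f2c(int f_handle)
{
    if (f_handle < 0 || f_handle >= GROUP_TABLE_SIZE) return NULL;
    return group_table[f_handle];
}

int main(void)
{
    /* Correct initialization order: predefined groups first ... */
    group_table[0] = &group_null;   /* must be index 0 */
    group_table[1] = &group_empty;  /* must be index 1 */
    /* ... and only then the fault-tolerance group. */
    group_table[2] = &all_failed;

    /* A Fortran handle of 1 now converts to group_empty, as expected. */
    return my_group_f2c(1) == &group_empty ? 0 : 1;
}
```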
In MCA_PML_CM_HVY_SEND_REQUEST_BSEND_ALLOC(), the call to
opal_convertor_pack() changes the convertor's state, so the convertor
needs to be reset to its original state afterwards.

This is done by calling opal_convertor_prepare_for_send(), which
should be passed the original send buffer provided by the
application, sendreq->req_addr.

However, prior to this change, the function was called with
sendreq->req_buff, which is the temporary buffer used for the send.

As a result, when the same request was used a second time, the wrong
data was copied to the outgoing buffer, causing memory corruption.

This patch addresses the issue.

Signed-off-by: Wei Zhang <wzam@amazon.com>
(cherry picked from commit d71fe93)
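
As an illustration of the usage pattern that exposed the bug (an assumed reproducer written for this note, not taken from the PR or from MTT), a persistent buffered-send request started more than once exercises exactly this path; prior to the fix, the second start could pack stale data from the temporary buffer.

```c
/* Run with at least 2 ranks, e.g. mpirun -n 2 ./bsend_persist
 * Illustrative reproducer only: rank 0 starts the same persistent buffered
 * send twice; before the fix, the second MPI_Start() could pack wrong data. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = 0;
    int bufsize = 2 * (MPI_BSEND_OVERHEAD + (int)sizeof(int));
    void *attached = malloc(bufsize);
    MPI_Buffer_attach(attached, bufsize);

    if (0 == rank) {
        MPI_Request req;
        MPI_Bsend_init(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < 2; i++) {
            data = i + 1;                     /* new payload for each start */
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
    } else if (1 == rank) {
        int recv_val;
        for (int i = 0; i < 2; i++) {
            MPI_Recv(&recv_val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* With the bug, the second receive could carry corrupted data. */
        }
    }

    MPI_Buffer_detach(&attached, &bufsize);
    free(attached);
    MPI_Finalize();
    return 0;
}
```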
ompi_mpi_compat_mpi3 controls whether the default MPI behavior
follows the MPI-3 standard or the MPI-4 standard.

Because the main branch follows the MPI-3 standard, this
parameter's default value should be true, but it was false
prior to this patch.

This patch addresses the issue.

Signed-off-by: Wei Zhang <wzam@amazon.com>
(cherry picked from commit 67a71fc)
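
For reference, one way to check the value this parameter takes at run time is the MPI_T control-variable interface. The exact cvar name exposed by Open MPI is an assumption here (it may differ from the C symbol ompi_mpi_compat_mpi3), so this sketch searches by substring.

```c
/* Sketch: scan the MPI_T control variables for a name containing
 * "compat_mpi3" and print its current value. Error codes are ignored
 * for brevity; the matching cvar name is an assumption. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int provided, ncvars;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvars);

    for (int i = 0; i < ncvars; i++) {
        char name[256];
        int name_len = sizeof(name), verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, NULL, NULL, &binding, &scope);
        if (strstr(name, "compat_mpi3") != NULL) {
            MPI_T_cvar_handle handle;
            int count, value = -1;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            /* Read as an int for simplicity; adjust if the reported
             * datatype differs. */
            MPI_T_cvar_read(handle, &value);
            printf("%s = %d\n", name, value);  /* expect 1 (true) after the fix */
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_T_finalize();
    return 0;
}
```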
Prior to this patch, the reduce code tried to apply arithmetic
operations to the sendbuf even when it was MPI_IN_PLACE.
This patch addresses the issue.

Signed-off-by: Wei Zhang <wzam@amazon.com>
(cherry picked from commit d520921)
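
A minimal sketch of the guard the fix amounts to (a hypothetical helper, not the coll/han code): when the caller passes MPI_IN_PLACE, the contribution already lives in recvbuf and the operator must not be applied to sendbuf.

```c
#include <mpi.h>
#include <stdio.h>

/* Never apply the operator to sendbuf when the caller passed MPI_IN_PLACE;
 * in that case the rank's data is already in recvbuf. */
static void combine_contribution(const void *sendbuf, void *recvbuf, int count,
                                 MPI_Datatype dtype, MPI_Op op)
{
    if (MPI_IN_PLACE == sendbuf) {
        return;                        /* recvbuf already holds this rank's data */
    }
    MPI_Reduce_local(sendbuf, recvbuf, count, dtype, op);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Usage pattern exercised by the failing test: MPI_IN_PLACE at the root. */
    int value = rank + 1, result = rank + 1;
    if (0 == rank) {
        MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        printf("sum = %d\n", result);
    } else {
        MPI_Reduce(&value, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    /* Local illustration of the guard itself. */
    int a = 2, b = 3;
    combine_contribution(&a, &b, 1, MPI_INT, MPI_SUM);           /* b becomes 5 */
    combine_contribution(MPI_IN_PLACE, &b, 1, MPI_INT, MPI_SUM); /* b stays 5 */

    MPI_Finalize();
    return 0;
}
```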
@github-actions github-actions bot added this to the v5.0.0 milestone Jul 12, 2023
@wzamazon
Contributor Author

bot:nvidia:retest

Previous commit 2d68804 removed "errhandler_type" from the
communicator and replaced it with "errhandler->eh_mpi_object_type".

However, for an errhandler to be invoked on a communicator,
errhandler_type must always be OMPI_ERRHANDLER_TYPE_COMM, whereas
errhandler->eh_mpi_object_type can be
OMPI_ERRHANDLER_TYPE_PREDEFINED for predefined error handlers
such as MPI_ERRORS_ARE_FATAL.

This patch adds "errhandler_type" back to the communicator
to address the issue.

Signed-off-by: Wei Zhang <wzam@amazon.com>
(cherry picked from commit 216c221)
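
A schematic sketch of the distinction (hypothetical struct and enum names, not the ompi source): dispatching on the handler's own eh_mpi_object_type rejects predefined handlers, whereas a type stored on the communicator is always the comm type.

```c
#include <stdio.h>

typedef enum {
    ERRHANDLER_TYPE_PREDEFINED,
    ERRHANDLER_TYPE_COMM,
    ERRHANDLER_TYPE_WIN,
    ERRHANDLER_TYPE_FILE
} errhandler_type_t;

typedef struct errhandler {
    errhandler_type_t eh_mpi_object_type;  /* PREDEFINED for MPI_ERRORS_ARE_FATAL */
} errhandler_t;

typedef struct communicator {
    errhandler_t     *errhandler;
    errhandler_type_t errhandler_type;     /* re-added field: always TYPE_COMM */
} communicator_t;

static int can_invoke_on_comm(const communicator_t *comm)
{
    /* Buggy check: fails for predefined handlers attached to a communicator.
     * return comm->errhandler->eh_mpi_object_type == ERRHANDLER_TYPE_COMM; */

    /* Fixed check: the communicator records the invocation type itself. */
    return comm->errhandler_type == ERRHANDLER_TYPE_COMM;
}

int main(void)
{
    errhandler_t fatal = { ERRHANDLER_TYPE_PREDEFINED };
    communicator_t comm = { &fatal, ERRHANDLER_TYPE_COMM };
    printf("invoke on comm: %s\n", can_invoke_on_comm(&comm) ? "yes" : "no");
    return 0;
}
```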
@wzamazon
Contributor Author

Added a commit, "communicator: add errhandler_type back", which fixes MPI_Errhandler_fatal.

@jsquyres
Member

@janjust @B-a-S It looks like NVIDIA CI is failing on all 5.0.x builds. Can someone investigate?

@B-a-S
Contributor

B-a-S commented Jul 14, 2023

> @janjust @B-a-S It looks like NVIDIA CI is failing on all 5.0.x builds. Can someone investigate?

Fixed by changing the compiler to oshcc in the corresponding test.
