Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fence_nb to flux pmix #8380

Merged
merged 1 commit into from
Jan 19, 2021
Merged

Add fence_nb to flux pmix #8380

merged 1 commit into from
Jan 19, 2021

Conversation

samiilvonen
Copy link

Flux runs segfault in MPI_Finalize if ucx pml is used. It seems that opal_common_ucx_mca_pmix_fence calls opal_pmix.fence_nb without checking if the fence_nb is implemented or not. This patch adds a 'fake' fence_nb call that uses PMI_Barrier similarly to the fence call.

opal_common_ucx_del_proc call fails if pmix doesn't implement fence_nb

Signed-off-by: Sami Ilvonen <sami.ilvonen@csc.fi>
@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@jjhursey
Copy link
Member

ok to test

@jjhursey
Copy link
Member

Do you also need to apply this to v4.1.x?

@samiilvonen
Copy link
Author

Do you also need to apply this to v4.1.x?

I don't have a test setup for v4.1.x yet, but opal_common_ucx_mca_pmix_fence seem to be identical in v4.1.x so the fix is probably needed there too.

@jjhursey
Copy link
Member

This seems ok. It does make the nonblocking call (fence_nb) a blocking call (PMI_Barrier) under the hood, but I don't think that will break anything in OMPI so it's fine for this component IMHO.

It might be useful to fix opal_common_ucx_mca_pmix_fence to call the blocking interface if it is provided, but that should be a PR against OMPI master the UCX folks should probably review. So that can be a separate activity.

@jjhursey jjhursey added this to the v4.0.6 milestone Jan 14, 2021
@jjhursey jjhursey added the bug label Jan 14, 2021
@jsquyres
Copy link
Member

Is this needed on master?

If so, I'd prefer if it went to master first, and then to v4.0.x and v4.1.x.

Thanks!

@rhc54
Copy link
Contributor

rhc54 commented Jan 14, 2021

There is no flux plugin on master, Jeff - strictly PMIx

@hppritcha hppritcha merged commit c9ed1f7 into open-mpi:v4.0.x Jan 19, 2021
@hppritcha hppritcha added the NEWS label Jan 19, 2021
SteVwonder added a commit to SteVwonder/SDK that referenced this pull request Jul 29, 2021
Problem: the OpenMPI packaged with centos 7 does not include the PMI
plugin, which is required to boot under Flux. The OpenMPI version
packaged with centos 8 (4.0.5) has a bug that causes a segfault on
finalization, which is fixed in the next patch version.
PR that fixes the bug: (open-mpi/ompi#8380)

Solution: for centos 7, hand compile the same version of OpenMPI (1.10),
but with the `--with-pmi` flag.  For centos 8, hand compile the version
of OpenMPI with the patch (4.0.6).  Since this is an involved process,
add a helper script that will also make adding other MPIs much easier.
SteVwonder added a commit to SteVwonder/SDK that referenced this pull request Aug 12, 2021
Problem: the OpenMPI packaged with centos 7 does not include the PMI
plugin, which is required to boot under Flux. The OpenMPI version
packaged with centos 8 (4.0.5) has a bug that causes a segfault on
finalization, which is fixed in the next patch version.
PR that fixes the bug: (open-mpi/ompi#8380)

Solution: for centos 7, hand compile the same version of OpenMPI (1.10),
but with the `--with-pmi` flag.  For centos 8, hand compile the version
of OpenMPI with the patch (4.0.6).  Since this is an involved process,
add a helper script that will also make adding other MPIs much easier.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants