async modex issue PMIX v4.2.x #3077

Closed
janjust opened this issue May 18, 2023 · 3 comments

Comments

janjust commented May 18, 2023

Background information

A simple MPI_Init()/MPI_Finalize() program fails to bootstrap when async modex is enabled (-mca pmix_base_async_modex 1).
A workaround is to set -x PMIX_MCA_gds=hash.

What version of the PMIx Reference Library are you using?

+1492c0b3102b02dd854851c458ee68229f35f5a9 3rd-party/openpmix (v4.2.3rc1-1-g1492c0b3)
+4636ea79dce7dea0fe9d27e669a5bfda6b095216 3rd-party/prrte (v3.0.1rc1-1-g4636ea79dc)

Describe how PMIx was installed

From source, OMPI v5.0.x internal pmix version (see above)

Please describe the system on which you are running

  • Operating system/version: RHEL 8.7
  • Computer hardware: x86
  • Network type: IB ConnectX-6

Details of the problem

A simple MPI_Init()/MPI_Finalize() program reproduces the issue.
With async modex enabled, it fails to bootstrap with either the UCX or the OB1 PML.

[Wed May  3 20:56:51 2023][1,8]<stdout>: [1683136611.217979] [jazz25:109263:0]        mm_xpmem.c:245  UCX  ERROR   xpmem_get(segid=0x20001aad1) failed: No such file or directory
[Wed May  3 20:56:51 2023][1,8]<stdout>: [1683136611.217995] [jazz25:109263:0]           mm_ep.c:172  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x7f6c61a75000: Shared memory error
[Wed May  3 20:56:51 2023][1,6]<stderr>: [jazz25.swx.labs.mlnx:109262] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433  Error: ucp_ep_create(proc=12) failed: Shared memory error
[Wed May  3 20:56:51 2023][1,34]<stderr>: [jazz25.swx.labs.mlnx:109276] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433  Error: ucp_ep_create(proc=12) failed: Shared memory error

`mpirun -np 16 -H jazz12:28,jazz13:28 --display-map --tag-output --timestamp-output --mca pml ob1 -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node -x LD_LIBRARY_PATH -x PMIX_MCA_gdsXX=hash ./a.out`

The failure rate depends on scale. I can reproduce it with as few as NP=8, PPN=4, but there the failure rate is maybe 50%, as opposed to 100% at NP=16, PPN=NP/2 or greater.
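
For comparison, here is a minimal dry-run sketch of the failing invocation next to the one with the workaround. It only prints the command lines and launches nothing. The `build_cmd` helper is hypothetical; the hosts, flags, and `./a.out` are copied from the command above, with the display and timestamp options left out for brevity. It assumes the `XX` suffix in `PMIX_MCA_gdsXX` was a way of switching the workaround off, so the failing variant simply omits the variable.

```shell
#!/bin/sh
# Dry run: print the mpirun command line from the report instead of launching it.
# build_cmd is a hypothetical helper, not part of Open MPI or PMIx.
build_cmd() {
    workaround="$1"   # "yes" adds the gds=hash workaround, anything else omits it
    cmd="mpirun -np 16 -H jazz12:28,jazz13:28 --map-by node --mca pml ob1"
    cmd="$cmd -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0"
    cmd="$cmd -mca pmix_base_collect_data 0 -x LD_LIBRARY_PATH"
    if [ "$workaround" = "yes" ]; then
        cmd="$cmd -x PMIX_MCA_gds=hash"
    fi
    printf '%s ./a.out\n' "$cmd"
}

build_cmd no    # variant reported to fail at this scale
build_cmd yes   # variant with the reported workaround
```

The two printed lines differ only in the trailing `-x PMIX_MCA_gds=hash`, which makes it easy to check that nothing else changed between a failing and a passing run.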

rhc54 (Contributor) commented May 19, 2023

> set -x PMIX_MCA_gds=ds21

I think you mean =hash as you cannot set the GDS to ds21 - the library will reject that setting.

janjust (Author) commented May 19, 2023

@rhc54 you are correct, I meant =hash

rhc54 (Contributor) commented May 31, 2023

Closing this, as we have identified it as a timeout problem: OMPI needs to raise its timeout value for PMIx_Get when using async modex.
