A simple MPI_Init()/MPI_Finalize() program fails to bootstrap when async modex is enabled with `-mca pmix_base_async_modex 1`.
A workaround is to set `-x PMIX_MCA_gds=hash`.
Please describe the system on which you are running
Operating system/version: RHEL 8.7
Computer hardware: x86
Network type: IB Connect-X 6
Details of the problem
A simple MPI_Init()/MPI_Finalize() program reproduces the issue.
With async modex enabled, the bootstrap fails with either the UCX or the OB1 PML:
[Wed May 3 20:56:51 2023][1,8]<stdout>: [1683136611.217979] [jazz25:109263:0] mm_xpmem.c:245 UCX ERROR xpmem_get(segid=0x20001aad1) failed: No such file or directory
[Wed May 3 20:56:51 2023][1,8]<stdout>: [1683136611.217995] [jazz25:109263:0] mm_ep.c:172 UCX ERROR mm ep failed to connect to remote FIFO id 0x7f6c61a75000: Shared memory error
[Wed May 3 20:56:51 2023][1,6]<stderr>: [jazz25.swx.labs.mlnx:109262] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433 Error: ucp_ep_create(proc=12) failed: Shared memory error
[Wed May 3 20:56:51 2023][1,34]<stderr>: [jazz25.swx.labs.mlnx:109276] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433 Error: ucp_ep_create(proc=12) failed: Shared memory error
`mpirun -np 16 -H jazz12:28,jazz13:28 --display-map --tag-output --timestamp-output --mca pml ob1 -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node -x LD_LIBRARY_PATH -x PMIX_MCA_gdsXX=hash ./a.out`
The failure depends on scale. I can reproduce it with as few as NP=8, PPN=4, but the failure rate there is roughly 50%, compared with 100% at NP=16 with PPN=NP/2 or greater.
Background information
What version of the PMIx Reference Library are you using?
+1492c0b3102b02dd854851c458ee68229f35f5a9 3rd-party/openpmix (v4.2.3rc1-1-g1492c0b3)
+4636ea79dce7dea0fe9d27e669a5bfda6b095216 3rd-party/prrte (v3.0.1rc1-1-g4636ea79dc)
Describe how PMIx was installed
From source, OMPI v5.0.x internal pmix version (see above)