
btl/ofi selects libfabric shm provider with libfabric 1.20+ #12233

Closed
wenduwan opened this issue Jan 13, 2024 · 3 comments

wenduwan (Contributor) commented Jan 13, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

main and v5.0.x branches

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure ... --with-libfabric=<libfabric that has prov/shm enabled>

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Submodules are not relevant in this case, but we need a new libfabric with this commit: ofiwg/libfabric@7e84ace#diff-9a108fdcddc323ee5ec91488c1fbdd907d733c960c0bdcddd507803ec1bf3081

Please describe the system on which you are running

  • Operating system/version: Tested on Amazon Linux 2, but the issue is not OS-specific
  • Computer hardware: hpc6a.48xlarge EC2 instances
  • Network type: EFA

Details of the problem

The problem shows up with one-sided applications. We reproduced it with the Intel MPI Benchmarks (IMB-RMA):

$ mpirun --map-by ppr:1:node -n 2 --hostfile hostfile --mca btl_ofi_verbose 1 --mca btl ^tcp mpi-benchmarks-IMB-v2021.7/IMB-RMA All_put_all

[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:587: mtl:ofi:provider_include = "(null)"
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:590: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:725: EFA specific fi_getinfo(): No data available
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:765: fi_getinfo(): No data available
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:725: EFA specific fi_getinfo(): Success
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:344: mtl:ofi:provider: efa
[ip-172-31-20-174.us-east-2.compute.internal:35762] mtl_ofi_component.c:369: mtl:ofi:provider:domain: rdmap0s6-rdm
[ip-172-31-20-174.us-east-2.compute.internal:35762] btl_ofi_component.c:308: btl:ofi:provider_include = "(null)"
[ip-172-31-20-174.us-east-2.compute.internal:35762] btl_ofi_component.c:310: btl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-20-174.us-east-2.compute.internal:35762] btl_ofi_component.c:69: btl:ofi: "shm" in exclude list
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:587: mtl:ofi:provider_include = "(null)"
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:590: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:725: EFA specific fi_getinfo(): No data available
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:765: fi_getinfo(): No data available
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:725: EFA specific fi_getinfo(): Success
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:344: mtl:ofi:provider: efa
[ip-172-31-24-238.us-east-2.compute.internal:32618] mtl_ofi_component.c:369: mtl:ofi:provider:domain: rdmap0s6-rdm
[ip-172-31-24-238.us-east-2.compute.internal:32618] btl_ofi_component.c:308: btl:ofi:provider_include = "(null)"
[ip-172-31-24-238.us-east-2.compute.internal:32618] btl_ofi_component.c:310: btl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream,usnic,net"
[ip-172-31-24-238.us-east-2.compute.internal:32618] btl_ofi_component.c:69: btl:ofi: "shm" in exclude list
#----------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2021.7, MPI-RMA part
#----------------------------------------------------------------
# Date                  : Sat Jan 13 02:34:41 2024
# Machine               : x86_64
# System                : Linux
# Release               : 5.10.205-195.804.amzn2.x86_64
# Version               : #1 SMP Fri Jan 5 01:22:18 UTC 2024
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# mpi-benchmarks-IMB-v2021.7/IMB-RMA All_put_all

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# All_put_all
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[13163,1],1]) is on host: ip-172-31-24-238
  Process 2 ([[13163,1],0]) is on host: ip-172-31-20-174
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[ip-172-31-24-238:32618] *** Process received signal ***
[ip-172-31-24-238:32618] Signal: Segmentation fault (11)
[ip-172-31-24-238:32618] Signal code: Address not mapped (1)
[ip-172-31-24-238:32618] Failing at address: 0xb8
[ip-172-31-24-238:32618] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f65ada1d8e0]
[ip-172-31-24-238:32618] [ 1] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(+0x22120)[0x7f659f2a0120]
[ip-172-31-24-238:32618] [ 2] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(ompi_osc_rdma_new_peer+0x49)[0x7f659f2a06c9]
[ip-172-31-24-238:32618] [ 3] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(ompi_osc_rdma_peer_lookup+0x87)[0x7f659f2a0907]
[ip-172-31-24-238:32618] [ 4] /opt/amazon/openmpi5/lib64/openmpi/mca_osc_rdma.so(+0x1b239)[0x7f659f299239]
[ip-172-31-24-238:32618] [ 5] /opt/amazon/openmpi5/lib64/libmpi.so.40(ompi_osc_base_select+0x13b)[0x7f65ae5eb13b]
[ip-172-31-24-238:32618] [ 6] /opt/amazon/openmpi5/lib64/libmpi.so.40(ompi_win_create+0x93)[0x7f65ae5642c3]
[ip-172-31-24-238:32618] [ 7] /opt/amazon/openmpi5/lib64/libmpi.so.40(MPI_Win_create+0xc8)[0x7f65ae5aa498]
[ip-172-31-24-238:32618] [ 8] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x44b8f5]
[ip-172-31-24-238:32618] [ 9] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x42d65d]
[ip-172-31-24-238:32618] [10] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x436f01]
[ip-172-31-24-238:32618] [11] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x405a6e]
[ip-172-31-24-238:32618] [12] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f65ad68013a]
[ip-172-31-24-238:32618] [13] mpi-benchmarks-IMB-v2021.7/IMB-RMA[0x40442a]
[ip-172-31-24-238:32618] *** End of error message ***

This happens because, with the libfabric change, the shm provider now also satisfies btl/ofi's capability requirement, i.e. FI_HMEM | FI_ATOMIC | FI_RMA. fi_getinfo(...) returns shm first, but btl/ofi then drops it because it is on the exclusion list, and as a result btl/ofi selects no provider at all.

In this case the user's intention was to use another provider, e.g. efa, which does not support FI_HMEM, but that never happened because shm was returned first by fi_getinfo(...).

This behavior was introduced in 5.0.x by the optional FI_HMEM check.
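
For reference, here is a minimal standalone sketch (not the actual btl/ofi code) of the kind of capability query involved: once prov/shm advertises FI_HMEM, it can satisfy the full hint set and come back first from fi_getinfo().

#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    /* btl/ofi needs RMA and atomics; FI_HMEM is the optional capability
     * added in 5.0.x for accelerator memory. */
    hints->caps = FI_RMA | FI_ATOMIC | FI_HMEM;

    int ret = fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info);
    if (0 != ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    /* With libfabric 1.20+ the first entry can be "shm". */
    printf("first provider: %s\n", info->fabric_attr->prov_name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}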

Proposed solution

  1. I think we should refactor the provider selection logic in both mtl/ofi and btl/ofi to respect the {mtl,btl}_ofi_provider_{include,exclude} MCA parameters. Specifically, right after each fi_getinfo(...) call we should apply the include/exclude filter first, and return an error if no qualified provider is found (see the sketch after this list).

  2. For this particular problem, I'm surprised that shm was selected at all, since it only supports intra-node communication. I wonder whether we should also request FI_REMOTE_COMM | FI_LOCAL_COMM, as mtl/ofi does.
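
A rough sketch of the proposed filtering step (hypothetical code, not the actual patch; provider_name_allowed() is an assumed stand-in for the existing MCA include/exclude check):

#include <stdbool.h>
#include <rdma/fabric.h>

/* Assumed helper: returns true iff prov_name passes the
 * {mtl,btl}_ofi_provider_{include,exclude} MCA filter. */
extern bool provider_name_allowed(const char *prov_name);

/* Keep only allowed providers from the list fi_getinfo() returned.
 * Returns the filtered list head, or NULL if nothing qualified, in
 * which case the caller should raise an error instead of silently
 * selecting no provider. */
static struct fi_info *filter_providers(struct fi_info *list)
{
    struct fi_info *head = NULL, *tail = NULL, *next;

    for (struct fi_info *cur = list; NULL != cur; cur = next) {
        next = cur->next;
        cur->next = NULL;  /* detach so fi_freeinfo frees one node */
        if (provider_name_allowed(cur->fabric_attr->prov_name)) {
            if (NULL == head) {
                head = cur;
            } else {
                tail->next = cur;
            }
            tail = cur;
        } else {
            fi_freeinfo(cur);  /* e.g. drops "shm" when it is excluded */
        }
    }
    return head;
}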

Mitigation

  • We can force fi_getinfo to return specific providers by setting the libfabric FI_PROVIDER environment variable, e.g. -x FI_PROVIDER=<the desired provider>.
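
For example, assuming efa is the intended provider, the reproducer above can be rerun with FI_PROVIDER forwarded to the ranks via -x:

$ mpirun --map-by ppr:1:node -n 2 --hostfile hostfile \
    -x FI_PROVIDER=efa --mca btl ^tcp \
    mpi-benchmarks-IMB-v2021.7/IMB-RMA All_put_all
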
@wenduwan wenduwan added this to the v5.0.2 milestone Jan 13, 2024
@wenduwan wenduwan self-assigned this Jan 13, 2024
wenduwan added a commit to wenduwan/ompi that referenced this issue Jan 13, 2024
This PR addresses open-mpi#12233

Since 5.0.x, we introduced an optional FI_HMEM capability in ofi provider
selection logic (both mtl and btl) in order to support accelerator memory.
As described in the issue, this introduced a bug that can cause the wrong
ofi provider to be selected, even if the user explicitly includes/excludes
the provider name.

This change refactors the selection logic to correctly handle the
include/exclude list, and therefore fixes the bug.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
wenduwan added a commit to wenduwan/ompi that referenced this issue Jan 16, 2024
janjust (Contributor) commented Jan 18, 2024

Will be fixed when #12234 is backported.

wenduwan (Contributor, Author) commented:
Main branch PR merged. Will open backport.

wenduwan added a commit to wenduwan/ompi that referenced this issue Jan 19, 2024
(cherry picked from commit 29efcef)
wenduwan (Contributor, Author) commented:
PRs merged. Closing.
