Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
tarball
On x86:
./configure --prefix=/global/home/users/johns/opt/4.0.1 --with-slurm=no --with-ucx=/global/home/users/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal
and on ARM:
./configure --prefix=/home/johns/opt/4.0.1 --with-slurm=no --with-ucx=/home/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal
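Both builds include --enable-heterogeneous; as a sanity check (the paths below assume the install prefixes from the configure lines above), ompi_info on each node should report a line like "Heterogeneous support: yes":
[johns@jupiter008 ~]$ /global/home/users/johns/opt/4.0.1/bin/ompi_info | grep -i heterogeneous
[johns@jupiter-bf09 ~]$ /home/johns/opt/4.0.1/bin/ompi_info | grep -i heterogeneous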
Please describe the system on which you are running
- Operating system/version: CentOS Linux 7 (Core)/CentOS Linux 7 (AltArch)
- Computer hardware: x86_64-unknown-linux-gnu/aarch64-unknown-linux-gnu
- Network type: InfiniBand (mlx5)
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
I would like to launch a job with a single mpirun across a heterogeneous system that has both Arm and x86 cores, where Open MPI is installed in different locations on each architecture.
[johns@jupiter008 ~]$ mpirun -H jupiter008 hostname : --prefix /home/johns/opt/4.0.1 -H jupiter-bf09 /usr/bin/hostname
jupiter008.hpcadvisorycouncil.com
[jupiter-bf09:02375] [[9309,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[9309,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
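In case it helps narrow down where the unpack fails, the same launch can be rerun with grpcomm verbosity turned up (grpcomm_base_verbose is a standard MCA debug parameter; level 10 is only a guess at a useful setting):
[johns@jupiter008 ~]$ mpirun --mca grpcomm_base_verbose 10 -H jupiter008 hostname : --prefix /home/johns/opt/4.0.1 -H jupiter-bf09 /usr/bin/hostname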
I do not know if this is relevant, but I can launch jobs from the x86 node onto the Arm nodes:
[johns@jupiter008 ~]$ mpirun --prefix /home/johns/opt/4.0.1 -H jupiter-bf08,jupiter-bf09 /usr/bin/hostname
jupiter-bf08
jupiter-bf09
Issue #4437 was similar, but it was resolved by switching to a homogeneous system; that is not an option here, since I have to use the heterogeneous system.