mpirun launch failure on heterogeneous system #6762

@snyjm-18

Description

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source tarball, configured separately on each architecture.

On x86:
./configure --prefix=/global/home/users/johns/opt/4.0.1 --with-slurm=no --with-ucx=/global/home/users/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal

On ARM:
./configure --prefix=/home/johns/opt/4.0.1 --with-slurm=no --with-ucx=/home/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal

Please describe the system on which you are running

  • Operating system/version: CentOS Linux 7 (Core)/CentOS Linux 7 (AltArch)
  • Computer hardware: x86_64-unknown-linux-gnu/aarch64-unknown-linux-gnu
  • Network type: InfiniBand (mlx5)

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I would like to launch a job with a single mpirun across a heterogeneous system that has both Arm and x86 nodes, where Open MPI is installed in a different location on each architecture.

[johns@jupiter008 ~]$ mpirun -H jupiter008 hostname : --prefix /home/johns/opt/4.0.1 -H jupiter-bf09 /usr/bin/hostname
jupiter008.hpcadvisorycouncil.com
[jupiter-bf09:02375] [[9309,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[9309,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

I do not know if this is relevant, but I can launch jobs from an Intel node onto the Arm nodes:

[johns@jupiter008 ~]$ mpirun --prefix /home/johns/opt/4.0.1 -H jupiter-bf08,jupiter-bf09 /usr/bin/hostname
jupiter-bf08
jupiter-bf09

Issue #4437 was similar, but it was resolved by switching to a homogeneous system; that is not an option here, since I have to use the heterogeneous system.
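If a small test program is helpful, a minimal sketch is below (my assumption is that the program itself hardly matters, since the failure appears to happen during daemon wireup in grpcomm_direct.c before the application launches; even the /usr/bin/hostname case above crashes). It uses only standard MPI calls and would be compiled with the mpicc from each installation, then launched with the same MPMD-style mpirun command shown above:

/* hetero_hello.c - minimal cross-architecture test program (sketch).
 * Compile with the mpicc from each installation (x86 and ARM) and
 * launch with the MPMD-style mpirun command shown above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &name_len);

    /* Each rank reports which host it landed on. */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}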
