OMPI 4.0.1 TCP connection errors beyond 86 nodes #6786

Closed · dfaraj opened this issue Jul 2, 2019 · 19 comments

@dfaraj

dfaraj commented Jul 2, 2019

Thank you for taking the time to submit an issue!

Background information

We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run the basic OSU osu_mbw_mr test on 2, 4, ..., 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[63011,0],0] on node r1i2n13
  Remote daemon: [[63011,0],40] on node r1i3n17

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i2n13
  target node:  r1i2n14

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
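
For reference, the MCA parameter mentioned in that help text can be set either on the mpirun command line or through the environment; a minimal sketch, using the same arguments as the runs shown later in this report:

mpirun --mca routed direct -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

# or via the environment; any OMPI_MCA_<param> variable is picked up by mpirun
export OMPI_MCA_routed=direct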

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded 4.0.1 from the Open MPI site:

./configure --prefix=/store/dfaraj/SW/packages/ompi/4.0.1 CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx

Please describe the system on which you are running

  • Operating system/version: RH 7.6
  • Computer hardware: dual socket Xeon nodes
  • Network type: OPA

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

When we run:
n=86

mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it works fine. With n=88:

mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

we get the TCP error described earlier.
If I do (n=88):

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it works.
If I set n=160:

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it hangs. Actually, I don't think it is truly hanging; it is probably just ssh-ing to every node and going very slowly.
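
If the flat (non-tree) launch is merely slow rather than hung, one knob that may be worth checking is plm_rsh_num_concurrent, which caps how many ssh sessions mpirun keeps open at once while launching daemons; raising it is only a guess, not a verified fix for this cluster:

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 --mca plm_rsh_num_concurrent 256 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi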


@jsquyres
Member

jsquyres commented Jul 2, 2019

Can you try the latest 4.0.x nightly snapshot from https://www.open-mpi.org/nightly/v4.0.x/ ?
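
A rough sketch of rebuilding from a snapshot tarball with the same configure options used for 4.0.1; the tarball name and install prefix below are placeholders, use whatever the nightly page currently lists:

tar xjf openmpi-v4.0.x-<snapshot>.tar.bz2
cd openmpi-v4.0.x-<snapshot>
./configure --prefix=<install-prefix> CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx
make -j 8 && make install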

@dfaraj
Author

dfaraj commented Jul 3, 2019

Jeff,
unfortunately, it did not work.
I downloaded the latest nightly (Jun 29) and built it:
mpirun (Open MPI) 4.0.2a1
I get the output below (with or without -x OMPI_MCA_routed=direct):

-bash-4.2$ n=88;cat $PBS_NODEFILE|uniq|head -n$n > myhosts; mpirun -v -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[10127,0],0] on node r1i0n3
  Remote daemon: [[10127,0],24] on node r1i0n27

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i0n3
  target node:  r1i0n24

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.


@jsquyres
Member

jsquyres commented Jul 3, 2019

This seems to be related to #6198

@ggouaillardet
Contributor

@dfaraj did you build Open MPI with tm support?
If yes, you do not need the -host ... option when invoking mpirun from a PBS script.

Can you run dmesg on r1i0n27 and see whether the orted daemon was killed or crashed?
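
One way to check whether a build picked up tm (PBS/Torque) support is to look for the tm components in ompi_info, assuming the ompi_info from this installation is first in PATH:

ompi_info | grep ": tm"
# lines such as "MCA plm: tm" and "MCA ras: tm" indicate tm support was built in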

@dfaraj
Author

dfaraj commented Jul 8, 2019

Using the nightly build from July 8, I can now run on 120 nodes,
but I get many "PSM Endpoint is closed or does not exist" messages at the end:

mpirun  -x PATH -x LD_LIBRARY_PATH  -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 60 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                     201.15      201150585.24
2                     402.90      201452499.81
4                     804.60      201150585.24
8                    1611.02      201376936.23
16                   2692.21      168262926.87
32                   5447.01      170219058.97
64                  11116.06      173688421.87
128                 21903.79      171123325.12
256                 42120.43      164532918.17
512                 77898.52      152145544.68
1024               138210.63      134971317.86
2048               216297.37      105613949.90
4096               233120.24       56914121.91
8192               227438.11       27763440.94
16384              224695.84       13714345.50
32768              223378.66        6816975.58
65536              223170.85        3405316.90
131072             223532.40        1705416.86
262144             224219.72         855330.36
524288             224360.32         427933.35
1048576            224607.08         214202.01
2097152            224046.23         106833.57
4194304            224024.17          53411.52
8388608            223812.17          26680.49
16777216           222752.49          13277.08
All processes entering MPI_Finalize
r1i1n0.369647PSM Endpoint is closed or does not exist
r1i1n11.172666PSM Endpoint is closed or does not exist

@dfaraj dfaraj closed this as completed Jul 8, 2019
@dfaraj
Author

dfaraj commented Jul 8, 2019

I have built the same OMPI using OFI instead of PSM2 directly, and that endpoint error is now gone.
So I guess this serves as a workaround. I would like to run this installation on 100+ nodes before I close this issue. Thanks, guys, so far.
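
For reference, the OFI-based build described here would look roughly like this; a sketch only, where <install-prefix> is a placeholder, libfabric is assumed to live under /usr, and the configure option is assumed to be --with-ofi on this tree:

./configure --prefix=<install-prefix> CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-ofi=/usr --without-psm2 --without-verbs --without-ucx
make -j 8 && make install

# the OFI MTL can then be requested explicitly if needed:
mpirun --mca pml cm --mca mtl ofi -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4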

@dfaraj dfaraj reopened this Jul 8, 2019
@nitinpatil1985

nitinpatil1985 commented Jul 9, 2019

Using the Intel 2018 compiler and the Open MPI Jul 09, 2019 nightly tarball, I get the following errors:

undefined reference to `mpi_type_extent_'

undefined reference to `mpi_type_struct_'

From the previous posts it seems that using "--enable-mpi1-compatibility" solves the problem,
but this option is no longer supported in the recent version. Is there any way to get rid of this error?

@ggouaillardet
Contributor

I guess you are using the master branch. In this case you only have two options:

  • modernize your code
  • use a release branch such as v4.0.x with the option you mentioned (as sketched below)

Code modernization is by far the best way.
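
A sketch of the second option, building a v4.0.x release tarball with the removed MPI-1 symbols re-enabled; compilers and the install prefix are placeholders:

./configure --prefix=<install-prefix> CC=icc CXX=icpc FC=ifort --enable-mpi1-compatibility
make -j 8 && make install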

@dfaraj
Author

dfaraj commented Jul 10, 2019

more updates:

So I have 2 Xeon nodes (2 sockets, each with 20 cores), and each node has one HFI.
I have built the July 8 OMPI nightly build with OFI.

When I run with just 20 cores per node things work; the moment I go beyond one socket, I get errors:

hpeopa1:~ dfaraj$ c=20; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 20 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                      52.87       52872849.32
2                     105.39       52696398.90
4                     211.47       52867642.74
8                     421.24       52655052.18
16                    761.92       47620268.94
32                   1530.09       47815364.45
64                   3082.14       48158495.87
128                  5612.96       43851254.76
256                  7964.71       31112129.81
512                  8999.70       17577543.53
1024                 9613.64        9388317.08
2048                 9975.88        4871035.43
4096                10112.03        2468757.93
8192                10243.00        1250366.03
16384               11394.42         695460.17
32768               11380.27         347298.19
65536               11315.76         172664.78
131072              11382.10          86838.52
262144              11360.35          43336.28
^C
hpeopa1:~ dfaraj$ c=21; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa2
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa2
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa1
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa2
  Framework: pml
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa2
  Framework: pml
--------------------------------------------------------------------------
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa1
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa1
  Framework: pml
--------------------------------------------------------------------------
[hpeopa2:21415] PML cm cannot be selected
[hpeopa2:21429] PML cm cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa1
  Framework: pml
--------------------------------------------------------------------------
[hpeopa1:12797] PML cm cannot be selected
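
The repeated "assign_context command failed: Device or resource busy" messages suggest the single HFI ran out of user hardware contexts once more than 20 ranks per node tried to open it. Assuming the standard hfi1/PSM2 knobs apply to this OFI build, two things worth checking (guesses, not verified on this system):

# how many user contexts the hfi1 driver was configured with
cat /sys/module/hfi1/parameters/num_user_contexts

# ask PSM2 to share hardware contexts between ranks on the node
PSM2_SHAREDCONTEXTS=1 mpirun -mca orte_base_help_aggregate 0 -x PSM2_SHAREDCONTEXTS -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi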

@nitinpatil1985

nitinpatil1985 commented Jul 12, 2019

From the above discussion, Open MPI is not working on 87+ nodes.

Is there any way to run Open MPI 4.0.1 with -mca pml ucx on more than 100 nodes?

Can we use the recent nightly master tarball with the --enable-mpi1-compatibility option?

@ggouaillardet
Contributor

no

@dfaraj
Author

dfaraj commented Aug 19, 2019

any updates?

@jsquyres
Member

I'm lost in the conversation here. Is the problem being discussed the TCP/SSH 86 process issue, or some PSM issue?

If this is no longer the TCP/SSH 86 process issue, this issue should be closed and a new issue should be opened to discuss the new question (please don't mix multiple topics on a single github issue -- thanks).

@nitinpatil1985

Is there any update on Open MPI with which we can run our application on 87+ nodes and with the --enable-mpi1-compatibility option?

@jsquyres
Member

A lot of fixes have gone into the v4.0.x branch in the run-time area. Can you try the latest v4.0.x nightly snapshot?

For the MPI-1 compatibility, you should talk to your upstream application providers and (strongly) encourage them to upgrade their source code -- those APIs were deprecated in 1996, and were finally removed in 2012. It's time for them to stop being used.

FWIW: We're likely (but not guaranteed) to continue the MPI-1 compatibility in Open MPI v5.0 -- beyond that, I can't promise anything.

@nitinpatil1985

Thanks, Jeff for the information!

@jjhursey
Member

I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see Issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might help with the launch issue that was originally reported (probably not the PSM issue).
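
A minimal sketch of one way to test that PR; building from git requires autogen.pl plus a reasonably recent autotools chain, and the prefix and configure options below are placeholders:

git clone https://github.com/open-mpi/ompi.git && cd ompi
git fetch origin pull/6944/head:pr-6944 && git checkout pr-6944
./autogen.pl
./configure --prefix=<install-prefix> <configure options as before>
make -j 8 && make install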

@dfaraj
Author

dfaraj commented Sep 12, 2019

I just tested the latest nightly build and I no longer see this problem on OPA or EDR fabrics.
Thanks, guys, for the fixes.

@dfaraj dfaraj closed this as completed Sep 12, 2019
@gpaulsen
Member

Thanks for verifying @dfaraj!
