OMPI 4.0.1 TCP connection errors beyond 86 nodes #6786

Closed · dfaraj opened this issue Jul 2, 2019 · 19 comments

@dfaraj

dfaraj commented Jul 2, 2019

Thank you for taking the time to submit an issue!

Background information

We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run the basic OSU osu_mbw_mr test on 2, 4, ..., 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[63011,0],0] on node r1i2n13
  Remote daemon: [[63011,0],40] on node r1i3n17

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i2n13
  target node:  r1i2n14

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
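
For reference, the MCA parameter mentioned in that help text can be set either on the mpirun command line or through the environment; a minimal sketch, using the same arguments as the runs shown later in this report:

mpirun --mca routed direct -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

# or via the environment; any OMPI_MCA_<param> variable is picked up by mpirun
export OMPI_MCA_routed=direct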

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded 4.0.1 from the Open MPI site:

./configure --prefix=/store/dfaraj/SW/packages/ompi/4.0.1 CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx

Please describe the system on which you are running

  • Operating system/version: RH 7.6
  • Computer hardware: dual socket Xeon nodes
  • Network type: OPA

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

When we run:
n=86

mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it works fine. With n=88:

mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

we get the TCP error described earlier.
If I do (n=88):

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it works.
If I set n=160:

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it hangs. Actually, I don't think it is truly hanging; it is probably just ssh-ing to every node and going very slowly.
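
If the flat (non-tree) launch is merely slow rather than hung, one knob that may be worth checking is plm_rsh_num_concurrent, which caps how many ssh sessions mpirun keeps open at once while launching daemons; raising it is only a guess, not a verified fix for this cluster:

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 --mca plm_rsh_num_concurrent 256 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi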


@jsquyres
Member

jsquyres commented Jul 2, 2019

Can you try the latest 4.0.x nightly snapshot from https://www.open-mpi.org/nightly/v4.0.x/ ?
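
A rough sketch of rebuilding from a snapshot tarball with the same configure options used for 4.0.1; the tarball name and install prefix below are placeholders, use whatever the nightly page currently lists:

tar xjf openmpi-v4.0.x-<snapshot>.tar.bz2
cd openmpi-v4.0.x-<snapshot>
./configure --prefix=<install-prefix> CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx
make -j 8 && make install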

@dfaraj
Author

dfaraj commented Jul 3, 2019

Jeff,
unfortunately, it did not work.
I downloaded the latest nightly (Jun 29) and built it:
mpirun (Open MPI) 4.0.2a1
I get the output below (with or without -x OMPI_MCA_routed=direct):

-bash-4.2$ n=88;cat $PBS_NODEFILE|uniq|head -n$n > myhosts; mpirun -v -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[10127,0],0] on node r1i0n3
  Remote daemon: [[10127,0],24] on node r1i0n27

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i0n3
  target node:  r1i0n24

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.


@jsquyres
Member

jsquyres commented Jul 3, 2019

This seems to be related to #6198

@ggouaillardet
Contributor

@dfaraj did you build Open MPI with tm support?
If yes, you do not need the -host ... option when invoking mpirun from a PBS script.

Can you run dmesg on r1i0n27 and see whether the orted daemon was killed or crashed?
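
One way to check whether a build picked up tm (PBS/Torque) support is to look for the tm components in ompi_info, assuming the ompi_info from this installation is first in PATH:

ompi_info | grep ": tm"
# lines such as "MCA plm: tm" and "MCA ras: tm" indicate tm support was built in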

@dfaraj
Author

dfaraj commented Jul 8, 2019

Using the nightly build from July 8, I can now run on 120 nodes,
but I get many "PSM Endpoint is closed or does not exist" messages at the end:

mpirun  -x PATH -x LD_LIBRARY_PATH  -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 60 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                     201.15      201150585.24
2                     402.90      201452499.81
4                     804.60      201150585.24
8                    1611.02      201376936.23
16                   2692.21      168262926.87
32                   5447.01      170219058.97
64                  11116.06      173688421.87
128                 21903.79      171123325.12
256                 42120.43      164532918.17
512                 77898.52      152145544.68
1024               138210.63      134971317.86
2048               216297.37      105613949.90
4096               233120.24       56914121.91
8192               227438.11       27763440.94
16384              224695.84       13714345.50
32768              223378.66        6816975.58
65536              223170.85        3405316.90
131072             223532.40        1705416.86
262144             224219.72         855330.36
524288             224360.32         427933.35
1048576            224607.08         214202.01
2097152            224046.23         106833.57
4194304            224024.17          53411.52
8388608            223812.17          26680.49
16777216           222752.49          13277.08
All processes entering MPI_Finalize
r1i1n0.369647PSM Endpoint is closed or does not exist
r1i1n11.172666PSM Endpoint is closed or does not exist

@dfaraj dfaraj closed this as completed Jul 8, 2019
@dfaraj
Author

dfaraj commented Jul 8, 2019

I have built the same OMPI using OFI instead of PSM2 directly, and that endpoint error is now gone.
So I guess this serves as a workaround. I would like to run this installation on 100+ nodes before I close this issue. Thanks, guys, so far.
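
For reference, the OFI-based build described here would look roughly like this; a sketch only, where <install-prefix> is a placeholder, libfabric is assumed to live under /usr, and the configure option is assumed to be --with-ofi on this tree:

./configure --prefix=<install-prefix> CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-ofi=/usr --without-psm2 --without-verbs --without-ucx
make -j 8 && make install

# the OFI MTL can then be requested explicitly if needed:
mpirun --mca pml cm --mca mtl ofi -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4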

@dfaraj dfaraj reopened this Jul 8, 2019
@nitinpatil1985

nitinpatil1985 commented Jul 9, 2019

Using the Intel 2018 compiler and the Open MPI Jul 09, 2019 nightly tarball, I get the following errors:

undefined reference to `mpi_type_extent_'

undefined reference to `mpi_type_struct_'

From the previous posts it seems that using "--enable-mpi1-compatibility" solves the problem,
but this option is no longer supported in the recent version. Is there any way to get rid of this error?

@ggouaillardet
Contributor

I guess you are using the master branch. In this case you only have two options:

  • modernize your code
  • use a release branch such as v4.0.x with the option you mentioned (as sketched below)

Code modernization is by far the best way.
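
A sketch of the second option, building a v4.0.x release tarball with the removed MPI-1 symbols re-enabled; compilers and the install prefix are placeholders:

./configure --prefix=<install-prefix> CC=icc CXX=icpc FC=ifort --enable-mpi1-compatibility
make -j 8 && make install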

@dfaraj
Author

dfaraj commented Jul 10, 2019

more updates:

So I have 2 Xeon nodes (2 sockets, each with 20 cores), and each node has one HFI.
I have built the July 8 OMPI nightly build with OFI.

When I run with just 20 cores per node things work; the moment I go beyond one socket, I get errors:

hpeopa1:~ dfaraj$ c=20; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 20 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                      52.87       52872849.32
2                     105.39       52696398.90
4                     211.47       52867642.74
8                     421.24       52655052.18
16                    761.92       47620268.94
32                   1530.09       47815364.45
64                   3082.14       48158495.87
128                  5612.96       43851254.76
256                  7964.71       31112129.81
512                  8999.70       17577543.53
1024                 9613.64        9388317.08
2048                 9975.88        4871035.43
4096                10112.03        2468757.93
8192                10243.00        1250366.03
16384               11394.42         695460.17
32768               11380.27         347298.19
65536               11315.76         172664.78
131072              11382.10          86838.52
262144              11360.35          43336.28
^C
hpeopa1:~ dfaraj$ c=21; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa2
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa2
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa1
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa2
  Framework: pml
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa2
  Framework: pml
--------------------------------------------------------------------------
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa1
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa1
  Framework: pml
--------------------------------------------------------------------------
[hpeopa2:21415] PML cm cannot be selected
[hpeopa2:21429] PML cm cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa1
  Framework: pml
--------------------------------------------------------------------------
[hpeopa1:12797] PML cm cannot be selected
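
The repeated "assign_context command failed: Device or resource busy" messages suggest the single HFI ran out of user hardware contexts once more than 20 ranks per node tried to open it. Assuming the standard hfi1/PSM2 knobs apply to this OFI build, two things worth checking (guesses, not verified on this system):

# how many user contexts the hfi1 driver was configured with
cat /sys/module/hfi1/parameters/num_user_contexts

# ask PSM2 to share hardware contexts between ranks on the node
PSM2_SHAREDCONTEXTS=1 mpirun -mca orte_base_help_aggregate 0 -x PSM2_SHAREDCONTEXTS -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi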

@nitinpatil1985

nitinpatil1985 commented Jul 12, 2019

From the above discussion, Open MPI is not working on 87+ nodes.

Is there any way to run Open MPI 4.0.1 with -mca pml ucx on more than 100 nodes?

Can we use the recent nightly master tarball with the --enable-mpi1-compatibility option?

@ggouaillardet
Contributor

no

@dfaraj
Author

dfaraj commented Aug 19, 2019

any updates?

@jsquyres
Member

I'm lost in the conversation here. Is the problem being discussed the TCP/SSH 86 process issue, or some PSM issue?

If this is no longer the TCP/SSH 86 process issue, this issue should be closed and a new issue should be opened to discuss the new question (please don't mix multiple topics on a single github issue -- thanks).

@nitinpatil1985

Is there any update on Open MPI with which we can run our application on 87+ nodes and with the --enable-mpi1-compatibility option?

@jsquyres
Member

A lot of fixes have gone into the v4.0.x branch in the run-time area. Can you try the latest v4.0.x nightly snapshot?

For the MPI-1 compatibility, you should talk to your upstream application providers and (strongly) encourage them to upgrade their source code -- those APIs were deprecated in 1996, and were finally removed in 2012. It's time for them to stop being used.

FWIW: We're likely (but not guaranteed) to continue the MPI-1 compatibility in Open MPI v5.0 -- beyond that, I can't promise anything.

@nitinpatil1985

Thanks, Jeff for the information!

@jjhursey
Member

I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see Issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might help with the launch issue that was originally reported (probably not the PSM issue).
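
A minimal sketch of one way to test that PR; building from git requires autogen.pl plus a reasonably recent autotools chain, and the prefix and configure options below are placeholders:

git clone https://github.com/open-mpi/ompi.git && cd ompi
git fetch origin pull/6944/head:pr-6944 && git checkout pr-6944
./autogen.pl
./configure --prefix=<install-prefix> <configure options as before>
make -j 8 && make install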

@dfaraj
Author

dfaraj commented Sep 12, 2019

I just tested the latest nightly build and I no longer see this problem on OPA or EDR fabrics.
Thanks, guys, for the fixes.

@dfaraj dfaraj closed this as completed Sep 12, 2019
@gpaulsen
Member

Thanks for verifying @dfaraj!
