
Failure while mpirun job depends on the order of the hosts #4516

Closed
karasevb opened this issue Nov 20, 2017 · 13 comments

@karasevb
Member

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.x
master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

./configure --prefix=`pwd`/install --enable-orterun-prefix-by-default --with-slurm --with-pmi --with-ucx

Please describe the system on which you are running

  • Operating system/version:
    RedHat 7.2
  • Computer hardware:
    Intel dual socket Broadwell
  • Network type:
    IB

Details of the problem

Running on nodes node1,node2 works well, but if the order of the nodes is changed to node2,node1, the run fails:

ssh node1
mpirun --bind-to core --map-by node -H node2,node1 -np 2  $HPCX_MPI_DIR/tests/osu-micro-benchmarks-5.3.2/osu_allreduce
--------------------------------------------------------------------------
[node2:13941] Error: pml_yalla.c:95 - recv_ep_address() Failed to receive EP address
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
@karasevb
Member Author

This issue was introduced with commit fe9b584

@jsquyres
Member

@karasevb FYI: I filed a dup of this in #4726 (and closed it; leaving this one open). Just mentioning it here as a cross reference -- when this issue is fixed, we should check to make sure that the case cited in #4726 works as well.

@rhc54
Contributor

rhc54 commented Jan 17, 2018

FWIW: I found the problem. The daemon on node2 incorrectly assigns rank=0 to node1 and rank=1 to itself, while the daemon on node1 (correctly) assigns rank=0 to node2 and rank=1 to itself. Thus, the daemons start two rank=1 procs, and no rank=0 as they each think the other one launched it.
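To make the divergence concrete, here is a minimal sketch in C (illustrative only; this is not the actual ORTE mapping code, and the round-robin --map-by node assignment below is an assumption based on this report) of how two daemons that see the node list in different orders end up owning conflicting ranks:

#include <stdio.h>

/* Illustrative only: round-robin ("--map-by node") assignment of ranks to
 * hosts, driven purely by the order of the node list each daemon sees. */
static void assign_ranks(const char *who, const char *nodes[], int nnodes, int nprocs)
{
    for (int rank = 0; rank < nprocs; rank++) {
        printf("%s: rank %d -> %s\n", who, rank, nodes[rank % nnodes]);
    }
}

int main(void)
{
    /* mpirun on node1 was given "-H node2,node1" ... */
    const char *seen_by_node1_daemon[] = { "node2", "node1" };
    /* ... but the daemon on node2 rebuilds the list with itself reordered. */
    const char *seen_by_node2_daemon[] = { "node1", "node2" };

    assign_ranks("daemon on node1", seen_by_node1_daemon, 2, 2);
    assign_ranks("daemon on node2", seen_by_node2_daemon, 2, 2);
    /* The output shows both daemons claiming rank 1 locally and neither
     * starting rank 0 -- matching the behavior described above. */
    return 0;
}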

@jsquyres
Member

@rhc54 Cool -- thanks! Would this also explain what was happening in #4726?

@karasevb Can you fix this, perchance?

@rhc54
Contributor

rhc54 commented Jan 17, 2018

Yeah. The race condition is due to the shortness of "hostname" execution. If it finishes fast enough on the first node, then the second one receives notification that the other proc died and correctly computes its local rank. You can reproduce it 100% of the time by using a longer-running executable.
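If you prefer not to rely on /bin/sleep, any trivially long-running program works as the launched executable; for example, a hypothetical stand-in such as:

/* spin.c: a deliberately slow, non-MPI program to launch under mpirun so
 * that both daemons are still alive while local ranks are being computed. */
#include <unistd.h>

int main(void)
{
    sleep(5);   /* long enough to remove the race seen with "hostname" */
    return 0;
}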

@gpaulsen gpaulsen added this to the v3.1.0 milestone Jan 18, 2018
@jsquyres
Member

I did some testing today: I ran this scenario:

master$ ssh hostA
hostA$ mpirun -H hostA,hostB sleep 1
hostA$ mpirun -H hostB,hostA sleep 1

Per @rhc54's comment above (#4516 (comment)), hostname is a test program short enough that the problem only shows up as a race condition, whereas a longer-running test program should make it occur 100% of the time. So I ran sleep 1 for my tests today, which should be long enough to trigger it every time.

Results:

  • On master: problem occurs 100% of the time, as predicted.
  • On v3.0.0: cannot reproduce the problem after 100 runs.
  • On v3.0.x head: cannot reproduce the problem after 100 runs.
  • On v3.x head: cannot reproduce the problem after 100 runs.

I also checked: my original testing (on #4726) was on master, not any of the v3.x.y branches.

So AFAICT, this problem exists on master but does not exist (or no longer exists?) on any of the 3.x.y release branches.

@jladd-mlnx @karasevb Just because I can't reproduce it doesn't mean that the problem isn't still there on the v3.x.y branches (as indicated by @karasevb's initial report). Can you try to reproduce on the v3.x.y branches?

If no one can reproduce on the v3.x.y branches, then we just change the labels/milestone on this issue.

@gpaulsen
Member

I tested with an older Open MPI build from v3.0.x and couldn't reproduce it at all on ppc64le.
On master I saw the problem occur 100% of the time on ppc64le.

Based on Jeff's testing, I'm removing blocker and targets other than master.

@karasevb
Member Author

Reproduced with v3.1.x (9885c21):

head$ ssh hostA
hostA$ ./mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -H hostA,hostB -np 2 ./ring_c
<it works>
hostA$ ./mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -H hostB,hostA -np 2 ./ring_c
<hang>
proc at hostA:
(gdb) bt
#0  0x00007fb76ec049a8 in opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:39
#1  0x00007fb76ec04f2b in opal_timer_linux_get_cycles_sys_timer () at timer_linux_component.c:232
#2  0x00007fb76eb4677a in opal_progress_events () at runtime/opal_progress.c:181
#3  0x00007fb76eb4687d in opal_progress () at runtime/opal_progress.c:242
#4  0x00007fb7607833ca in ompi_request_wait_completion (req=0x871580) at ../../../../ompi/request/request.h:413
#5  0x00007fb760784849 in mca_pml_ob1_recv (addr=0x7ffebfcf8b58, count=1, datatype=0x601080 <ompi_mpi_int>, src=0, tag=201,
    comm=0x601280 <ompi_mpi_comm_world>, status=0x0) at pml_ob1_irecv.c:135
#6  0x00007fb76f80b05b in PMPI_Recv (buf=0x7ffebfcf8b58, count=1, type=0x601080 <ompi_mpi_int>, source=0, tag=201,
    comm=0x601280 <ompi_mpi_comm_world>, status=0x0) at precv.c:79
#7  0x0000000000400a53 in main (argc=1, argv=0x7ffebfcf8c58) at ring_c.c:52
(gdb) frame 7
#7  0x0000000000400a53 in main (argc=1, argv=0x7ffebfcf8c58) at ring_c.c:52
52              MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
(gdb) p rank
$1 = 1
proc at hostB:
(gdb) bt
#0  opal_progress () at runtime/opal_progress.c:257
#1  0x00007f6b994b83ca in ompi_request_wait_completion (req=0x1926380) at ../../../../ompi/request/request.h:413
#2  0x00007f6b994b9849 in mca_pml_ob1_recv (addr=0x7ffd96362018, count=1, datatype=0x601080 <ompi_mpi_int>, src=0, tag=201,
    comm=0x601280 <ompi_mpi_comm_world>, status=0x0) at pml_ob1_irecv.c:135
#3  0x00007f6bac94505b in PMPI_Recv (buf=0x7ffd96362018, count=1, type=0x601080 <ompi_mpi_int>, source=0, tag=201,
    comm=0x601280 <ompi_mpi_comm_world>, status=0x0) at precv.c:79
#4  0x0000000000400a53 in main (argc=1, argv=0x7ffd96362118) at ring_c.c:52
(gdb) frame 4
#4  0x0000000000400a53 in main (argc=1, argv=0x7ffd96362118) at ring_c.c:52
52              MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
(gdb) p rank
$1 = 1

From the backtraces above we can see that both procs have the same rank (rank 1).
We are working on this.
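For reference, the hang is exactly what you would expect when rank 0 is missing. The code below paraphrases the structure of Open MPI's examples/ring_c.c (simplified to a single trip around the ring; not the verbatim source): only rank 0 injects the first message, so two procs that both believe they are rank 1 block forever in the MPI_Recv shown at ring_c.c:52 in the backtraces.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    if (0 == rank) {
        /* Only rank 0 starts the ring. If no proc was launched as rank 0,
         * nothing is ever sent. */
        message = 10;
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    /* This is the receive seen in the backtraces: with two rank-1 procs
     * and no rank 0, it can never complete. */
    MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (0 != rank) {
        /* Forward the message once around the ring. */
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    printf("rank %d of %d got message %d\n", rank, size, message);
    MPI_Finalize();
    return 0;
}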

@karasevb
Member Author

@artpol84

@artpol84
Contributor

artpol84 commented Feb 2, 2018

@karasevb the fix was merged into master.
Please PR to release branches.

@rhc54
Contributor

rhc54 commented Feb 3, 2018

Also need to include 73ef976

@karasevb
Member Author

karasevb commented Feb 5, 2018

Done for v3.1: #4787.
The v3.0 branch does not have this issue.

@karasevb
Member Author

The related PRs have been merged; this issue can be closed.

karasevb referenced this issue Apr 9, 2018
Fixed the desync of job nodelists between mpirun and the orted
daemons. The issue was observed when using RSH launching, because the
user can provide an arbitrary order of nodes relative to the HNP
placement. The mpirun process propagates the daemons' nodelist order
to the nodes, but the HNP itself assembles the nodelist based on the
user-provided order. As a result, rank assignment was calculated
differently on the orteds and on mpirun.

Consider the following example:
* The user launches mpirun on node cn2.
* The hostlist is cn1,cn2,cn3,cn4; ppn=1.
* mpirun passes the hostlist cn[2:2,1,3-4]@0(4) to the orteds.
As a result, mpirun will assign rank 0 to cn1 while the orteds will
assign rank 0 to cn2 (because the orteds see cn2 as the first element
in the node list).

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
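To restate the commit's example as a tiny sketch (illustrative only; the two orderings below are taken from the commit message, while the mapping loop is an assumption, not ORTE code), rank 0 lands on different nodes depending on which view of the list is used:

#include <stdio.h>

int main(void)
{
    /* User-supplied order, which mpirun used for its own rank computation. */
    const char *mpirun_view[] = { "cn1", "cn2", "cn3", "cn4" };
    /* Order encoded in the regex cn[2:2,1,3-4]@0(4) sent to the daemons:
     * the HNP node (cn2) comes first. */
    const char *orted_view[]  = { "cn2", "cn1", "cn3", "cn4" };

    for (int rank = 0; rank < 4; rank++) {          /* ppn = 1 */
        printf("rank %d: mpirun places it on %s, orted places it on %s\n",
               rank, mpirun_view[rank], orted_view[rank]);
    }
    return 0;
}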