Failure while mpirun job depends on the order of the hosts #4516
Comments
This issue was introduced with commit fe9b584
FWIW: I found the problem. The daemon on node2 incorrectly assigns rank=0 to node1 and rank=1 to itself, while the daemon on node1 (correctly) assigns rank=0 to node2 and rank=1 to itself. Thus, the daemons start two rank=1 procs and no rank=0, as each thinks the other one launched it.
Yeah. The race condition comes from how quickly "hostname" finishes. If it finishes fast enough on the first node, then the second one receives notification that the other proc died and correctly computes its local rank. You can reproduce it 100% of the time by using a longer-running executable.
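A quick sketch of that kind of reproducer, assuming a two-node hostlist like the one in this report (the 30-second sleep, the node names, and the node order are illustrative, not taken from the original commands):

# "hostname" exits almost immediately, so the race is only hit sometimes.
# A process that stays alive longer keeps both (wrongly) rank=1 procs running,
# so the bad rank assignment shows up on every run.
mpirun -np 2 --host node2,node1 sleep 30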
I did some testing today; I ran the scenario from @rhc's comment on #4516. Results:
I also checked: my original testing (on #4726) was on master, not on any of the v3.x.y branches. So AFAICT, this problem exists on master but does not exist (or no longer exists?) on any of the v3.x.y release branches. @jladd-mlnx @karasevb Just because I can't reproduce it doesn't mean that the problem isn't still there on the v3.x.y branches (as indicated by @karasevb's initial report). Can you try to reproduce on the v3.x.y branches? If no one can reproduce on the v3.x.y branches, then we can just change the labels/milestone on this issue.
I tested with an older Open MPI v3.0.x build, and I couldn't reproduce it at all on ppc64le. Based on Jeff's testing, I'm removing the blocker label and all target milestones other than master.
Reproduced with v3.1.x (9885c21):
From the backtraces above we can see that both procs have the same rank.
@karasevb the fix was merged into master.
We also need to include 73ef976.
Done for v3.1: #4787
The related PRs have been merged; this issue can be closed.
Fixed the desync of the job nodelist between mpirun and the orted daemons. The issue was observed with RSH launching because the user can provide an arbitrary order of nodes relative to the HNP placement. The mpirun process propagates the daemons' nodelist order to the nodes, but the HNP itself assembles its nodelist in the user-provided order; as a result, rank assignment was calculated differently on orted and mpirun. Consider the following example:
* The user launches mpirun on node cn2.
* The hostlist is cn1,cn2,cn3,cn4; ppn=1.
* mpirun passes the hostlist cn[2:2,1,3-4]@0(4) to the orteds.
So mpirun will assign rank 0 on cn1, while orted will assign rank 0 on cn2 (because orted sees cn2 as the first element in the node list).
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
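A minimal sketch of the scenario from the commit message above, assuming one slot per node on cn1-cn4 (the exact mpirun invocation is an assumption; the commit message only describes the hostlist and the HNP placement):

# Run from cn2, i.e. the HNP node, with a hostlist that does not start with cn2.
# Before the fix, mpirun ranked hosts in the user-supplied order (rank 0 on cn1),
# while the orteds received the reordered list cn[2:2,1,3-4]@0(4) and placed
# rank 0 on cn2, so the two sides disagreed about where rank 0 lives.
[user@cn2 ~]$ mpirun -np 4 --host cn1,cn2,cn3,cn4 ./a.out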
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v3.0.x
master
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
./configure --prefix=`pwd`/install --enable-orterun-prefix-by-default --with-slurm --with-pmi --with-ucx
Please describe the system on which you are running
RedHat 7.2
Intel dual socket Broadwell
IB
Details of the problem
Running on nodes node1,node2 works well, but if we change the order of the nodes to node2,node1, this results in a failure.
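For reference, a minimal sketch of the two invocations being compared (the executable and the exact option spelling are assumptions; the report only gives the two node orders):

# Works: node1 listed first.
mpirun -np 2 --host node1,node2 hostname

# Fails (both daemons start a rank=1 proc and no rank=0): order reversed.
mpirun -np 2 --host node2,node1 hostname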