By accident last night, I found that the following hangs:
head-node$ ssh A
Welcome to A!
A$ orterun --host B,A hostname
A.cluster.cisco.com
B.cluster.cisco.com
[hang]
Note the ordering of the hosts in the --host clause: it's B,A, where A is the host running orterun. And just to be clear, running with the hosts in the other order (A,B) does not seem to result in the same problem: the orterun always seems to complete successfully.
This is using the rsh/ssh launcher outside of SLURM/etc. For simplicity, hostname is the Linux hostname(1) command -- it's not an MPI application.
This behavior seems to be a race condition. It happens most of the time, but not always.
When it happens, it looks like:
mpirun is still running (obviously)
the ssh from A to B is still running
the orted on B is still running
all hostname processes have exited
Running git bisect overnight, it looks like 657e701 is the commit where this behavior was introduced. I can't say this with 100% certainty, because it is a race condition, after all. But my bisect script ran the orterun ... hostname test 100 times for each build, which seemed to be enough to cause the hang to occur if it was going to occur.
It's possible that git bisect got this wrong -- I'll defer to @karasevb, who probably cited the commit he did because he actually looked at the code, etc.
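For reference, a repeat-until-hang check of the kind described above could look roughly like the sketch below. This is not the actual bisect script, which is not shown here; the function names, the 30-second timeout, and the stop-on-first-hang behavior are all assumptions made for illustration. It relies only on the documented `git bisect run` convention that exit code 0 marks a commit good, 1 marks it bad, and 125 skips it.

```python
import subprocess
import sys


def run_once(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it hangs."""
    try:
        # Output is discarded; only whether the command terminates matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout. Remote daemons (e.g. the
        # orted on B) may be left behind and need separate cleanup.
        return False


def count_hangs(cmd, trials=100, timeout_s=30.0, stop_on_first=True):
    """Run cmd up to `trials` times and count how many runs timed out."""
    hangs = 0
    for _ in range(trials):
        if not run_once(cmd, timeout_s):
            hangs += 1
            if stop_on_first:
                break  # one hang is enough to call this build bad
    return hangs


def bisect_exit_code(hangs):
    """Map the hang count to a `git bisect run` exit code: 0 good, 1 bad."""
    return 1 if hangs else 0


def main():
    # Hypothetical invocation mirroring the failing case: remote host first.
    cmd = ["orterun", "--host", "B,A", "hostname"]
    try:
        return bisect_exit_code(count_hangs(cmd))
    except FileNotFoundError:
        return 125  # orterun missing (e.g. build failed): ask bisect to skip
```

To drive `git bisect run` with this, the script would end with `sys.exit(main())` and would rebuild and reinstall before running the test. Because the hang is a race, a clean pass over all trials only suggests that a commit is good; it does not prove it.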
@rhc54 This seems like a low priority.