'orterun --host B,A' when run from host A hangs #4726

Closed
jsquyres opened this issue Jan 17, 2018 · 2 comments

@jsquyres
Member

By accident last night, I found that the following hangs:

head-node$ ssh A
Welcome to A!
A$ orterun --host B,A hostname
A.cluster.cisco.com
B.cluster.cisco.com
[hang]

Note the ordering of the hosts in the --host clause: it's B,A, where A is the host running orterun. To be clear, running with the hosts in the other order (A,B) does not seem to trigger the problem: orterun always seems to complete successfully.

This is using the rsh/ssh launcher outside of SLURM/etc. For simplicity, hostname is the Linux hostname(1) command -- it's not an MPI application.

This behavior seems to be a race condition. It happens most of the time, but not always.

When it happens, it looks like:

  • mpirun is still running (obviously)
  • the ssh from A to B is still running
  • the orted on B is still running
  • all hostname processes have exited

After running git bisect overnight, it looks like 657e701 is the commit that introduced this behavior. I can't say this with 100% certainty because it is a race condition, after all. But my bisect script ran the orterun ... hostname test 100 times for each build, which seemed to be enough to trigger the hang if it was going to occur.
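
For anyone who wants to repeat the bisect, here is a minimal sketch of what such a script could look like. It is not the script actually used above: the function names, the 30-second timeout, and the choice of Python are all assumptions made for illustration. It treats any run that exceeds the timeout as a hang and maps the result onto the exit codes that git bisect run expects (0 for good, 1 for bad). Rebuilding and installing each commit before testing is left out.

```python
"""Hypothetical helper for `git bisect run`: mark a build bad if the test hangs."""
import subprocess

# Assumed values: 100 attempts matches the description above; the 30-second
# timeout is an arbitrary illustrative choice, not taken from the original script.
ATTEMPTS = 100
TIMEOUT_SECONDS = 30
TEST_COMMAND = ["orterun", "--host", "B,A", "hostname"]


def hangs(command, attempts, timeout):
    """Return True if any of `attempts` runs of `command` exceeds `timeout` seconds."""
    for _ in range(attempts):
        try:
            subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before re-raising.
            return True
    return False


def bisect_exit_code(command, attempts=ATTEMPTS, timeout=TIMEOUT_SECONDS):
    """Map the result onto `git bisect run` conventions: 0 = good, 1 = bad."""
    return 1 if hangs(command, attempts, timeout) else 0


def main():
    """Entry point for use as a bisect script; returns the exit code."""
    return bisect_exit_code(TEST_COMMAND)
```

Ending the script with sys.exit(main()) and passing it to git bisect run would let git drive the search. Note that subprocess.run only kills the direct child on timeout, so a leftover ssh or orted may need manual cleanup between runs.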

@rhc54 This seems like a low priority.

@jsquyres jsquyres added the bug label Jan 17, 2018
@rhc54
Contributor

rhc54 commented Jan 17, 2018

Looks like this is the same as #4516, though you cite a different source commit.

@jsquyres
Member Author

Ah! Ok.

It's possible that git bisect got this wrong -- I'll defer to @karasevb, who probably cited the commit he did because he actually looked at the code, etc.

Let's close this one as a dup, then.
