Fix tree spawn routed component issue #6944

Merged 1 commit into open-mpi:v4.0.x from v4/fix-tree-launch on Sep 9, 2019

Conversation

jjhursey
Member

  • Fixes #6618 (ORTE has lost communication with a remote daemon).
  • The plm/rsh component uses the highest priority routed component
    to construct the launch tree. The remote orteds activate all
    available routed components when updating routes. This allows
    the parent vpid on a remote orted to differ from the one that was
    expected in the tree launch. As a result, the remote orted tries to
    contact its parent with the wrong contact information, and orted
    wireup fails.
  • This fix forces the orteds to use the same routed component that
    the HNP used when constructing the tree, if tree launch is enabled.
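
As a rough illustration of the last bullet, here is a minimal sketch of pinning the launched daemon to one routed component by appending an MCA parameter to its command line. This is not the code from this PR: the function name, the fixed-capacity argv, and the exact `-mca routed <name>` spelling are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not an ORTE function.
 * Appends "-mca routed <name>" to the daemon argv when tree launch is
 * enabled, so the remote daemon activates only that routed component.
 * Returns the new argc, or -1 if argv has no room for the three entries.
 */
int toy_append_routed_selection(const char **argv, int argc, int capacity,
                                int tree_launch_enabled,
                                const char *hnp_routed_name)
{
    /* Nothing to pin if tree launch is off or no component name is known. */
    if (!tree_launch_enabled || hnp_routed_name == NULL) {
        return argc;
    }
    if (argc + 3 > capacity) {
        return -1;
    }
    argv[argc++] = "-mca";
    argv[argc++] = "routed";
    argv[argc++] = hnp_routed_name;  /* same component the launcher used */
    return argc;
}
```

With tree launch disabled the argv is returned unchanged, which matches the "if tree launch is enabled" condition above.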

Signed-off-by: Joshua Hursey jhursey@us.ibm.com

 * Fix open-mpi#6618
   - See comments on Issue open-mpi#6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orteds will activate all
   available `routed` components when updating routes. This allows the
   opportunity for the parent vpid on the remote `orted` to not match
   that which was expected in the tree launch. The result is that the
   remote orted tries to contact its parent with the wrong contact
   information and orted wireup will fail.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when constructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey
Member Author

Note that this is a PR directly to the v4.0.x branch. master does not have this problem, as described on the issue.

However, I could put this fix into master even though it doesn't have the problem being worked around here. It wouldn't hurt anything there and would provide a (super) edge case check if the remote nodes have a different set of routed components than the launch node. What do you all think?

@rhc54
Contributor

rhc54 commented Aug 30, 2019

> I could put this fix into master

Please don't - it will just confuse people. The routed framework in master is single select, so there is simply no way the orteds can pick something different unless the user forcibly does something to break it (e.g., coming up with a scheme to set a conflicting MCA param on just the remote nodes) - in which case, our philosophy is to "do what the user says, even if it is stupid". They will quickly discover why this is a bad idea.

@jjhursey
Member Author

I agree - let's not put this in master.

@hppritcha hppritcha added the NEWS label Aug 30, 2019
@rhc54
Contributor

rhc54 commented Sep 3, 2019

Just to be clear for anyone reading this in the future: the problem is not that the remote daemons were selecting a different routed module than mpirun. All daemons, including mpirun, were making the identical selection.

The problem is that the routed framework was converted to be multi-select, which means that multiple routed components were active. Upon launch, we execute an "update" to the routing plan, which means we cycle across all the active routed components, giving each one a chance to update its view of the routing tree. This is correct and works fine.

However, what we had overlooked is the fact that each routed component would update the global ORTE_PROC_MY_PARENT variable to point at the daemon it thought was its parent. Since it is a global variable, this meant that the routed components overwrote each other's value, leaving the value set to that of the lowest priority routed component. This only occurred on the remote daemons - mpirun does not have a parent since it is always at the top of the tree.
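
The overwrite described above can be shown with a toy model. This is only a sketch under stated assumptions: the struct, the function names, and the shared variable are hypothetical stand-ins rather than ORTE symbols, and the active list is assumed to be ordered from highest to lowest priority.

```c
#include <stddef.h>

/* Toy model only: none of these names are ORTE symbols. */
typedef struct {
    const char *name;
    int priority;     /* higher value = preferred component */
    int parent_vpid;  /* parent this component computes for the local daemon */
} toy_routed_t;

/* Stand-in for a single process-wide "my parent" value. */
static int toy_my_parent = -1;

/* Each active component updates its routes and stores its parent globally. */
static void toy_update_routing_plan(const toy_routed_t *component)
{
    toy_my_parent = component->parent_vpid;
}

/*
 * Multi-select update: give every active component a turn, in the order
 * given (assumed highest priority first). Returns the value left in the
 * shared global, which is whatever the last component wrote.
 */
int toy_update_all(const toy_routed_t *active, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        toy_update_routing_plan(&active[i]);
    }
    return toy_my_parent;
}
```

With two active components that disagree about the parent, the call returns the lower priority component's value. With only one active component, the value is the one the launch tree was built with.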

As a result, procs that had been connected to mpirun would suddenly disconnect as they thought that their parent had changed. mpirun, not realizing what was going on, would see the disconnect and interpret it as a daemon failure - and therefore terminate the job.

An alternative fix would have been to make the "parent" variable local to each routed component, and have ORTE request that parent value from the selected component before using it. However, since we changed routed back to single-select in master and going forward, it seemed that the easiest solution was just to force routed to be "single select" in this branch by just passing an MCA parameter so that only ONE component could be active.
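
For comparison, the alternative mentioned above (a parent value local to each component, queried from the selected one) might look roughly like the following. This is a sketch of the idea only and was not implemented in this PR; every name here is hypothetical, and "selected" is assumed to mean the highest priority active component.

```c
#include <stddef.h>

/* Toy model only: none of these names are ORTE symbols. */
typedef struct {
    const char *name;
    int priority;         /* higher value = preferred component */
    int computed_parent;  /* parent implied by this component's routing tree */
    int my_parent;        /* component-local copy, -1 until updated */
} toy_local_routed_t;

/* Every active component updates only its own copy of the parent. */
void toy_local_update_all(toy_local_routed_t *active, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        active[i].my_parent = active[i].computed_parent;
    }
}

/* Pick the highest priority component, or NULL if none are active. */
const toy_local_routed_t *toy_local_selected(const toy_local_routed_t *active,
                                             size_t count)
{
    const toy_local_routed_t *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (best == NULL || active[i].priority > best->priority) {
            best = &active[i];
        }
    }
    return best;
}

/* Callers ask the selected component for the parent instead of a global. */
int toy_local_get_parent(const toy_local_routed_t *active, size_t count)
{
    const toy_local_routed_t *selected = toy_local_selected(active, count);
    return selected != NULL ? selected->my_parent : -1;
}
```

Here the order in which components update no longer matters, because no component can overwrite another's value.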

Contributor

@rhc54 rhc54 left a comment

I added a comment on the PR correcting the explanation of the problem. Fix looks ok.

@jjhursey
Member Author

jjhursey commented Sep 4, 2019

Thanks for the clarifying comment. That is an accurate summary of my findings that I described in the corresponding issue.

@gpaulsen gpaulsen merged commit a482edc into open-mpi:v4.0.x Sep 9, 2019
@jjhursey jjhursey deleted the v4/fix-tree-launch branch September 9, 2019 18:12