Fix tree spawn routed component issue #6944
* Fix open-mpi#6618
  - See comments on Issue open-mpi#6618 for finer details.
* The `plm/rsh` component uses the highest priority `routed` component to construct the launch tree. The remote orteds activate all available `routed` components when updating routes. This makes it possible for the parent vpid on the remote `orted` to differ from the one expected in the tree launch. As a result, the remote orted tries to contact its parent with the wrong contact information, and orted wireup fails.
* This fix forces the orteds to use the same `routed` component that the HNP used when constructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
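To illustrate how two routing plans can disagree about a daemon's parent, here is a toy sketch in Python. The two tree shapes (a k-ary tree and a binomial tree) and all function names are assumptions chosen for illustration; they are not ORTE's actual `routed` component code.

```python
from typing import List, Optional


def kary_parent(vpid: int, k: int = 2) -> Optional[int]:
    """Parent of `vpid` in a k-ary tree rooted at vpid 0 (hypothetical layout)."""
    if vpid == 0:
        return None  # the root (the HNP) has no parent
    return (vpid - 1) // k


def binomial_parent(vpid: int) -> Optional[int]:
    """Parent of `vpid` in a binomial tree rooted at 0: clear the highest set bit."""
    if vpid == 0:
        return None
    return vpid - (1 << (vpid.bit_length() - 1))


def mismatched_vpids(num_daemons: int, k: int = 2) -> List[int]:
    """vpids whose parent differs between the two toy routing plans."""
    return [
        v for v in range(num_daemons)
        if kary_parent(v, k) != binomial_parent(v)
    ]


if __name__ == "__main__":
    for v in mismatched_vpids(8):
        print(v, kary_parent(v), binomial_parent(v))
```

In this toy setup, a daemon launched under one tree shape but consulting the other for its parent would try to contact the wrong daemon, which is the kind of mismatch the commit message describes.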
Note that this is a PR directly to the v4.0.x branch. However, I could put this fix into
Please don't - it will just confuse people. The routed framework in master is single select, so there is simply no way the orteds can pick something different unless the user forcibly does something to break it (e.g., coming up with a scheme to set a conflicting MCA param on just the remote nodes) - in which case, our philosophy is to "do what the user says, even if it is stupid". They will quickly discover why this is a bad idea.
I agree - let's not put this in master.
Just to be clear for anyone reading this in the future. The problem is not that the remote daemons were selecting a different routed module than mpirun - all daemons, including mpirun, were making the identical selection. The problem is that the routed framework was converted to be multi-select, which means that multiple routed components were active.

Upon launch, we execute an "update" to the routing plan, which means we cycle across all the active routed components, giving each one a chance to update its view of the routing tree. This is correct and works fine. However, what we had overlooked is the fact that each routed component would update the global ORTE_PROC_MY_PARENT variable to point at the daemon it thought was its parent. Since it is a global variable, this meant that the routed components overwrote each other's value, leaving the value set to that of the lowest priority routed component.

This only occurred on the remote daemons - mpirun does not have a parent since it is always at the top of the tree. As a result, procs that had been connected to mpirun would suddenly disconnect as they thought that their parent had changed. mpirun, not realizing what was going on, would see the disconnect and interpret it as a daemon failure - and therefore terminate the job.

An alternative fix would have been to make the "parent" variable local to each routed component, and have ORTE request that parent value from the selected component before using it. However, since we changed routed back to single-select in master and going forward, it seemed that the easiest solution was just to force routed to be "single select" in this branch by just passing an MCA parameter so that only ONE component could be active.
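The overwrite described above can be modelled with a small Python sketch. Everything in it is a hypothetical stand-in: `Daemon.my_parent` plays the role of the global ORTE_PROC_MY_PARENT, the component names `tree_a` and `tree_b`, their priorities, and their parent functions are invented, and the update order (highest priority first) is assumed from the description that the lowest priority component's value is the one left behind.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RoutedComponent:
    name: str
    priority: int
    parent_of: Callable[[int], Optional[int]]  # this component's idea of a vpid's parent


@dataclass
class Daemon:
    vpid: int
    # Shared by every component, like a global variable.
    my_parent: Optional[int] = None
    # The alternative fix: each component keeps its own answer.
    parent_by_component: Dict[str, Optional[int]] = field(default_factory=dict)


def update_routing_plan(daemon: Daemon, active: List[RoutedComponent]) -> None:
    """Let each active component update its view, highest priority first."""
    for comp in sorted(active, key=lambda c: c.priority, reverse=True):
        parent = comp.parent_of(daemon.vpid)
        daemon.my_parent = parent  # overwrites whatever the previous component set
        daemon.parent_by_component[comp.name] = parent


def highest_priority(components: List[RoutedComponent]) -> RoutedComponent:
    """The component a single-select framework would end up using."""
    return max(components, key=lambda c: c.priority)


# Two invented components that build different trees over the same vpids.
COMPONENTS = [
    RoutedComponent("tree_a", 70, lambda v: (v - 1) // 2 if v else None),
    RoutedComponent("tree_b", 30, lambda v: v - (1 << (v.bit_length() - 1)) if v else None),
]

if __name__ == "__main__":
    multi = Daemon(vpid=5)
    update_routing_plan(multi, COMPONENTS)
    single = Daemon(vpid=5)
    update_routing_plan(single, [highest_priority(COMPONENTS)])
    print(multi.my_parent, single.my_parent)
```

With both components active, the shared value ends up holding the lower priority component's answer, while restricting the active set to one component leaves the parent that the launch tree expected. The per-component dictionary corresponds to the alternative fix mentioned above, where the parent is requested from the selected component instead of being read from a shared variable.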
I added a comment on the PR correcting the explanation of the problem. Fix looks ok.
Thanks for the clarifying comment. That is an accurate summary of my findings that I described in the corresponding issue.