Original comment by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).
This issue is rooted in OMPI and is due to the forwarding of job-level constraints from the original job to all spawnees. In this particular case, adding "-npernode 1" prevents all future processes from sharing a node, across all jobids handled by the same HNP. In a normal MPI application such behavior might be desirable, but in the context of ULFM we need to be able to reuse nodes, that is, to respawn processes on a node where older processes failed.
Multiple solutions can be envisioned, but I think the cleanest one is to provide an info key that prevents inheritance of the original job's parameters. I have created an OMPI issue on this topic: open-mpi/ompi#5376.
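To make the effect of the inherited constraint concrete, here is a small toy model in Python. It is not Open MPI code. The `spawn` function, the node names, and the way placements are counted are all hypothetical, and the model assumes that the launcher keeps counting the original job's placements (including the slot of a failed process) when it maps later jobs.

```python
def spawn(candidates, procs_per_node, limit):
    """Place one process on the first candidate node below `limit`.

    `limit` stands in for a per-node cap such as "-npernode 1";
    None means no cap. Returns the chosen node, or None if no node fits.
    """
    for node in candidates:
        if limit is None or procs_per_node.get(node, 0) < limit:
            procs_per_node[node] = procs_per_node.get(node, 0) + 1
            return node
    return None


nodes = ["n0", "n1"]
original_limit = 1  # models "-npernode 1" on the original job

# Original job: one process per node. In this model the counts persist
# across jobs handled by the same launcher (an assumption), even after
# the process on n1 has failed.
counts = {}
first = [spawn(nodes, counts, original_limit) for _ in nodes]

# Respawn on n1 while inheriting the original limit: n1 is already "full".
inherited = spawn(["n1"], dict(counts), original_limit)

# Respawn on n1 without inheriting the limit: the node can be reused.
independent = spawn(["n1"], dict(counts), None)

print(first, inherited, independent)
```

In this sketch the inherited respawn finds no eligible node, while the respawn that ignores the original job's limit lands on `n1`, which is the behavior an opt-out info key would be meant to allow.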
Original report by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).
As reported on the ULFM mailing list, using a machinefile to restrict or drive the allocation of new processes is difficult.