hostfile ordering not honored when HNP is used in allocation #4327
Comments
Wow, thanks for the detailed report!! This is a bit of a head scratcher. There was never any intended correlation between hostfile ordering and process placement - if it is happening in some prior series, it is entirely by accident. The documentation made clear that hostfile simply specifies available resources. You could, if you choose, use the same hostfile as input to the sequential mapper - but that was a clearly delineated special case.

This is where the -host option differed from hostfile. We allowed -host to not only specify nodes, but to also specify precedence of usage. It was an intentional special case. It sounds like you are proposing to change that arrangement and force a similar correlation for hostfile. I'm not saying we shouldn't do that - just pointing out that it is an architectural change.

Your simplest solution would frankly be to take the hostfile entries and convert them to a -host list for each app_context, checking first to see if the user already provided a -host option so you don't unintentionally overwrite it. The -host list already gets sent back to all the backend daemons and used there to set order, so it should get you the desired behavior.
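For illustration, converting a hostfile into a comma-separated `-host` list can be sketched in plain C along these lines. This is a hypothetical helper, not ORTE code; it assumes a simple hostfile with one host per line (comments and `slots=N` annotations ignored), and a real implementation would first check whether the app_context already carries a user-supplied `-host`:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read a hostfile and build a comma-separated list
 * suitable for -host. Keeps only the hostname (first token) of each line,
 * skipping blank lines, comments, and any "slots=N" annotation. */
char *hostfile_to_hostlist(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (NULL == fp) {
        return NULL;
    }

    char line[1024];
    char *list = NULL;
    size_t len = 0;

    while (NULL != fgets(line, sizeof(line), fp)) {
        char *host = strtok(line, " \t\r\n");
        if (NULL == host || '#' == host[0]) {
            continue;
        }
        size_t need = strlen(host) + 2;          /* comma + NUL */
        char *tmp = realloc(list, len + need);
        if (NULL == tmp) {
            free(list);
            fclose(fp);
            return NULL;
        }
        list = tmp;
        if (len > 0) {
            list[len++] = ',';
        }
        strcpy(list + len, host);
        len += strlen(host);
    }

    fclose(fp);
    return list;    /* e.g. "c712f6n04,c712f6n01"; caller frees */
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <hostfile>\n", argv[0]);
        return 1;
    }
    char *list = hostfile_to_hostlist(argv[1]);
    if (NULL != list) {
        printf("-host %s\n", list);
        free(list);
    }
    return 0;
}
```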
Thanks for the suggestion. Converting the hostfile to a hostlist did work. I didn't look at the hostlist path in the code yet. That might be a workaround for this particular use case (they were focused on SPMD mode). I tested with v3.0.x below to confirm:

SPMD

Launching from

MPMD

Launching from

Discussion

I need to check, but I wonder if the batch environments are also impacted by this issue. If the order of the resources reported to ORTE is meaningful, we should make sure we honor that ordering. Yeah, I agree that changing the arrangement so that the hostfile order now has meaning would be an architectural change with some careful work to do. Maybe we can harness some of the --hostlist logic to make it easier, but I'd want to make sure that we are not unnecessarily making the code more complex or the launch slower for this use case.
Errr...I'm unaware of any "hostlist" option in OMPI, so maybe that is one of yours? We use -H or --host. Last I checked, we correctly parsed those options per-app.

I don't believe using --host is going to slow anything down, assuming you want to enforce ordering. I'm familiar with how we do that, and it does involve a nested search across two lists to reorder things. I'm certain someone can improve that algorithm! I'd focus my attention there - passing the ordering in the -host attribute is pretty simple and scalable. If necessary, you could pass -host as a regex to keep its size down.

If you decide instead to parse the hostfile on the backend, be aware that the hostfile isn't always on a shared file system. So you'd need to use the DFS framework (or include the hostfile contents in the launch message) to ensure the hostfile is available everywhere.
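For context, the nested two-list reorder described above might look roughly like the following. This is a simplified, self-contained sketch with hypothetical names and types, not the actual ORTE routine; it is O(n·m) like the nested search mentioned:

```c
#include <stdio.h>
#include <string.h>

/* Reorder `nodes` (n entries) in place so they follow the order given by
 * the -host list `order` (m entries). Nodes not named in `order` keep
 * their relative position at the tail. O(n*m) nested search. */
void reorder_by_hostlist(const char **nodes, int n,
                         const char **order, int m)
{
    int next = 0;   /* next slot to fill at the front of `nodes` */

    for (int i = 0; i < m && next < n; i++) {
        for (int j = next; j < n; j++) {
            if (0 == strcmp(nodes[j], order[i])) {
                /* rotate nodes[next..j] right by one so nodes[j] lands
                 * at `next` and the rest keep their relative order */
                const char *found = nodes[j];
                memmove(&nodes[next + 1], &nodes[next],
                        (size_t)(j - next) * sizeof(*nodes));
                nodes[next] = found;
                next++;
                break;
            }
        }
    }
}

int main(void)
{
    /* node pool as discovered, with the HNP node first */
    const char *nodes[] = { "c712f6n01", "c712f6n04" };
    /* desired ordering, as given by -host */
    const char *order[] = { "c712f6n04", "c712f6n01" };

    reorder_by_hostlist(nodes, 2, order, 2);
    for (int i = 0; i < 2; i++) {
        printf("%d: %s\n", i, nodes[i]);   /* c712f6n04, then c712f6n01 */
    }
    return 0;
}
```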
Ah yes - sorry about that. I got my MPI installs swapped when I was testing that prior comment. That was a drop of Spectrum, which has the hostlist option. Using a

I'll keep tinkering with it. Thanks for the suggestion. I might pursue packing the ordering into a -host regex that we send out with the launch message, instead of parsing the hostfile on the backend. Maybe have an MCA parameter to 'enable' this order preservation if it starts to get too involved.
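A rough sketch of what packing a host list into a compact range-style string could look like, purely for illustration - the function name and output format are assumptions, and ORTE's real regex component is far more general than this:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: pack hostnames of the form <prefix><digits>
 * (e.g. c712f6n01) into a compact "prefix[01,02,04]" form. Falls back to
 * a plain comma-separated list if the prefixes differ. Assumes n > 0 and
 * that `out` is large enough to hold the result. */
void pack_hosts(const char **hosts, int n, char *out, size_t outlen)
{
    const char *first = hosts[0];
    size_t plen = strlen(first);

    /* prefix = everything before the trailing digits of the first host */
    while (plen > 0 && isdigit((unsigned char)first[plen - 1])) {
        plen--;
    }

    /* only bracket the suffixes if every host shares that prefix */
    for (int i = 1; i < n; i++) {
        if (0 != strncmp(hosts[i], first, plen)) {
            plen = 0;
            break;
        }
    }

    size_t used = 0;
    if (plen > 0) {
        used += snprintf(out, outlen, "%.*s[", (int)plen, first);
    }
    for (int i = 0; i < n; i++) {
        used += snprintf(out + used, outlen - used, "%s%s",
                         i > 0 ? "," : "", hosts[i] + plen);
    }
    if (plen > 0) {
        snprintf(out + used, outlen - used, "]");
    }
}

int main(void)
{
    const char *hosts[] = { "c712f6n01", "c712f6n02", "c712f6n04" };
    char packed[256];

    pack_hosts(hosts, 3, packed, sizeof(packed));
    printf("%s\n", packed);   /* prints: c712f6n[01,02,04] */
    return 0;
}
```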
Keep in mind that we currently set -host as an attribute on the app_context. Specifying the attribute scope as GLOBAL means that it gets sent as part of the launch message. All strings in the launch msg greater than a cutoff size are automatically compressed using zlib, which yields a pretty big reduction. So you might be getting enough compression already - worth checking. There is an option that will output the size of the launch msg for you as an aid in debugging such things.

It sounds like there is a bug in the -host processing as it is supposed to preserve ordering, IIRC. However, I confess it has been long enough that I may be wrong about that - I can at least look if you need some assistance.
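The zlib compression mentioned above is easy to reproduce outside of ORTE to get a feel for the size reduction on a long host string. A small standalone example (the host names and sizes here are made up; the cutoff and framing ORTE uses are not shown):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Compress a long -host-style string with zlib and report the sizes.
 * Build with: cc demo.c -lz */
int main(void)
{
    /* stand-in for a long -host list */
    char hostlist[4096] = "";
    for (int i = 0; i < 100; i++) {
        char node[32];
        snprintf(node, sizeof(node), "%sc712f6n%02d", i ? "," : "", i);
        strcat(hostlist, node);
    }

    uLong srclen = (uLong)strlen(hostlist) + 1;   /* include the NUL */
    uLong destlen = compressBound(srclen);
    Bytef *dest = malloc(destlen);
    if (NULL == dest) {
        return 1;
    }

    if (Z_OK != compress(dest, &destlen, (const Bytef *)hostlist, srclen)) {
        fprintf(stderr, "compress failed\n");
        free(dest);
        return 1;
    }

    printf("uncompressed: %lu bytes, compressed: %lu bytes\n",
           (unsigned long)srclen, (unsigned long)destlen);
    free(dest);
    return 0;
}
```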
Re-tagging as an enhancement (but I'd still like to make the case to get it into the v3.0.x release branch), since this was not behavior that we should have been depending upon. I'm working on a PR, but it's gotten sidelined a bit. Probably won't be ready until later in the week.
As my inactivity here suggests, I haven't gotten the cycles to work on it, and probably won't for a few months. So if someone is curious and wants to take a pass at it before then, feel free; otherwise I'll get back to it later.
Given 2.5 years have passed, I think it is safe to say this isn't something that is going to happen.
We first noticed this in the `v3.0.x` release stream, as a difference in behavior from the `v2.x` release stream. I believe this also impacts `v3.1.x` and `master`.

This is fallout from pushing the mapping/ordering mechanism to the backend nodes, and likely some of the improvements for the DVM and comm_spawn. Any fix would need to be careful not to break or hinder those features.
SPMD case
We are launching `mpirun` from node `c712f6n01`, and using the following hostfile to land the first few ranks on a remote node (`c712f6n04`) before using the local node.

In v2.x you can see the order is preserved, with ranks 0,1 on `c712f6n04` followed by 2,3 on `c712f6n01` (where mpirun/HNP resides):

In v3.0.x you can see that the HNP node is always first in the list, followed by the ordered list from the hostfile, less the HNP node. This puts ranks 0,1 on `c712f6n01` instead of `c712f6n04`, where the user wanted them.

The expected result should match v2.x's behavior with this hostfile - preserve the ordering:
MPMD case
Again, we are launching `mpirun` from node `c712f6n01`. Consider these two hostfiles, containing different orderings of the same four machines.

The hello world program prints the argument it was given in its output to make it clear which app context it originated from (the `A` and `B` values in the output below).

In v2.x we get some odd behavior in the second app context's mapping (likely due to the bookmark not being reset quite right - notice the iteration step between the app context assignments of ranks 3 and 4):

In v3.0.x we get a more consistent pattern, but not quite what we want:

The expected result should be as follows - preserve the ordering of each app context's hostfile:
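For reference, a minimal hello world of the kind described above - one that echoes its rank, the node it landed on, and the `A`/`B` tag passed as `argv[1]` - might look like this; the actual test program and launch line may differ:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPMD test program: prints rank, host, and the tag passed on the
 * command line ("A" or "B") so output can be traced back to its app
 * context. Example launch (np counts and paths are illustrative):
 *   mpirun --hostfile hostfile-b -np 3 ./hello A : \
 *          --hostfile hostfile-c -np 3 ./hello B
 */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    printf("hello from rank %d of %d on %s (app %s)\n",
           rank, size, host, argc > 1 ? argv[1] : "?");

    MPI_Finalize();
    return 0;
}
```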
Discussion
The ordering in the `v3.0.x` series comes from the `orte_node_pool` ordering. In this list, the HNP node is always first, followed by hosts in the order they are discovered. In the MPMD case, the first time we see node `c712f6n04`, for example, is in the first hostfile (`hostfile-b`), so it is added to the `orte_node_pool` in the second position, just after the HNP - and so on through the first hostfile. When the second hostfile (`hostfile-c`) is encountered, the hosts are already in the `orte_node_pool`, so we don't re-add them.

The RMAPS mechanism works with the true ordering from the hostfile when it makes its mapping decisions (e.g., in `orte_rmaps_rr_map`), but that context is lost when ORTE moves into the `orte_plm_base_launch_apps` state. When we pack the job launch message, in `orte_util_encode_nodemap`, we use the `orte_node_pool` ordering to structure the launch message, and that ordering determines rank ordering. That ordering is incorrect with respect to the per-app-context hostfiles.

I think this is a legitimate bug to fix, as it prevents users from controlling rank ordering with respect to node order. However, after digging into this a bit, it looks like a pretty invasive change to make (and a delicate one at that, so we don't break any other expected behavior in the process). I need to reflect on this a bit more, but wanted to post the issue for discussion.
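To make the first-seen behavior concrete, here is a simplified, standalone model of a node pool that appends nodes only the first time they are seen. The hostfile orderings are hypothetical and the data structures are not ORTE's; it only illustrates why per-hostfile ordering is lost:

```c
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

/* Simplified model of the global node pool: nodes are appended the first
 * time they are seen and never re-added or re-ordered afterwards. */
static const char *pool[MAX_NODES];
static int pool_size = 0;

static void pool_add(const char *node)
{
    for (int i = 0; i < pool_size; i++) {
        if (0 == strcmp(pool[i], node)) {
            return;   /* already present: ordering from later hostfiles is lost */
        }
    }
    pool[pool_size++] = node;
}

int main(void)
{
    pool_add("c712f6n01");                      /* HNP is always first */

    /* hostfile-b for app context A (hypothetical ordering) */
    const char *hf_b[] = { "c712f6n04", "c712f6n03", "c712f6n02", "c712f6n01" };
    /* hostfile-c for app context B, a different ordering */
    const char *hf_c[] = { "c712f6n02", "c712f6n01", "c712f6n04", "c712f6n03" };

    for (int i = 0; i < 4; i++) pool_add(hf_b[i]);
    for (int i = 0; i < 4; i++) pool_add(hf_c[i]);   /* no effect: all seen already */

    /* The pool ends up as c712f6n01, c712f6n04, c712f6n03, c712f6n02 --
     * the launch message is encoded in this order, not per-hostfile order. */
    for (int i = 0; i < pool_size; i++) {
        printf("%d: %s\n", i, pool[i]);
    }
    return 0;
}
```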