hostfile ordering not honored when HNP is used in allocation #4327

Closed
jjhursey opened this issue Oct 11, 2017 · 9 comments

Comments

@jjhursey
Member

We first noticed this in the v3.0.x release stream, as a difference in behavior from the v2.x release stream. I believe this also impacts v3.1.x and master.

This is fallout from pushing the mapping/ordering mechanism to the backend nodes, and likely from some of the improvements for the DVM and comm_spawn. Any fix would need to be careful not to break or hinder those features.

SPMD case

We are launching mpirun from node c712f6n01, using the following hostfile to land the first few ranks on a remote node (c712f6n04) before using the local node.

shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2

In v2.x you can see the order is preserved, with ranks 0 and 1 on c712f6n04 followed by ranks 2 and 3 on c712f6n01 (where mpirun/the HNP resides):

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n04] 56846 Hello, world!
  1/  8) [c712f6n04] 56847 Hello, world!
  2/  8) [c712f6n01] 65059 Hello, world!
  3/  8) [c712f6n01] 65060 Hello, world!
  4/  8) [c712f6n03] 62352 Hello, world!
  5/  8) [c712f6n03] 62353 Hello, world!
  6/  8) [c712f6n02] 74593 Hello, world!
  7/  8) [c712f6n02] 74594 Hello, world!

In v3.0.x you can see that the HNP node is always first in the list, followed by the hostfile's ordering minus the HNP node. This puts ranks 0 and 1 on c712f6n01 instead of on c712f6n04, where the user wanted them.

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n01] 64629 Hello, world!
  1/  8) [c712f6n01] 64630 Hello, world!
  2/  8) [c712f6n04] 56447 Hello, world!
  3/  8) [c712f6n04] 56448 Hello, world!
  4/  8) [c712f6n03] 61943 Hello, world!
  5/  8) [c712f6n03] 61944 Hello, world!
  6/  8) [c712f6n02] 74189 Hello, world!
  7/  8) [c712f6n02] 74190 Hello, world!

The expected result should match v2.x's behavior with this hostfile, preserving the ordering:

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n04] 56846 Hello, world!
  1/  8) [c712f6n04] 56847 Hello, world!
  2/  8) [c712f6n01] 65059 Hello, world!
  3/  8) [c712f6n01] 65060 Hello, world!
  4/  8) [c712f6n03] 62352 Hello, world!
  5/  8) [c712f6n03] 62353 Hello, world!
  6/  8) [c712f6n02] 74593 Hello, world!
  7/  8) [c712f6n02] 74594 Hello, world!

MPMD case

Again, we are launching mpirun from node c712f6n01.

Consider these two hostfiles, which order the same four machines differently.

shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
shell$ cat hostfile-c
c712f6n04 slots=2
c712f6n02 slots=2
c712f6n03 slots=2
c712f6n01 slots=2

The hello world program prints the argument it was given in its output to make it clear which app context each process originated from (the A and B values in the output below).
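
For reference, a minimal hello_c along these lines would produce that output; this is a hypothetical reconstruction, not the exact test program used here.

/* Hypothetical reconstruction of the hello_c test program referenced above;
 * the actual source may differ, but the output format matches the transcripts. */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char host[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));

    /* e.g. "  0/  8) [c712f6n04] 56846 Hello, world! A" */
    printf("%3d/%3d) [%s] %d Hello, world!%s%s\n",
           rank, size, host, (int)getpid(),
           argc > 1 ? " " : "", argc > 1 ? argv[1] : "");

    MPI_Finalize();
    return 0;
}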

In v2.x we get some odd behavior in the second app context's mapping (likely due to the bookmark not being reset quite right; notice how the iteration continues between the app-context assignments of ranks 3 and 4):

shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --hostfile ./hostfile-c ./hello_c B | sort
  0/  8) [c712f6n04] 56926 Hello, world! A
  1/  8) [c712f6n01] 65108 Hello, world! A
  2/  8) [c712f6n03] 62435 Hello, world! A
  3/  8) [c712f6n02] 74671 Hello, world! A
  4/  8) [c712f6n02] 74672 Hello, world! B
  5/  8) [c712f6n03] 62436 Hello, world! B
  6/  8) [c712f6n01] 65109 Hello, world! B
  7/  8) [c712f6n04] 56927 Hello, world! B

In v3.0.x we get a more consistent pattern, but not quite what we want:

shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --map-by node --hostfile ./hostfile-c ./hello_c B | sort
  0/  8) [c712f6n01] 64736 Hello, world! A
  1/  8) [c712f6n04] 56615 Hello, world! A
  2/  8) [c712f6n03] 62110 Hello, world! A
  3/  8) [c712f6n02] 74355 Hello, world! A
  4/  8) [c712f6n01] 64737 Hello, world! B
  5/  8) [c712f6n04] 56616 Hello, world! B
  6/  8) [c712f6n03] 62111 Hello, world! B
  7/  8) [c712f6n02] 74356 Hello, world! B

The expected result should be as follows, preserving the ordering of each app context's hostfile:

  0/  8) [c712f6n04] 64736 Hello, world! A
  1/  8) [c712f6n01] 56615 Hello, world! A
  2/  8) [c712f6n03] 62110 Hello, world! A
  3/  8) [c712f6n02] 74355 Hello, world! A
  4/  8) [c712f6n04] 64737 Hello, world! B
  5/  8) [c712f6n02] 56616 Hello, world! B
  6/  8) [c712f6n03] 62111 Hello, world! B
  7/  8) [c712f6n01] 74356 Hello, world! B

Discussion

The ordering in the v3.0.x series comes from the orte_node_pool ordering. In that list the HNP node is always first, followed by hosts in the order they are discovered. In the MPMD case, for example, the first time we see node c712f6n04 is in the first hostfile (hostfile-b), so it is added to the orte_node_pool in the second position, just after the HNP, and so on through the rest of the first hostfile. When the second hostfile (hostfile-c) is encountered, the hosts are already in the orte_node_pool, so we don't re-add them.

The RMAPS mechanism works with the true ordering from the hostfile when it makes its mapping decisions (e.g., in orte_rmaps_rr_map), but that context is lost when ORTE moves into the orte_plm_base_launch_apps state. During application launch, when we pack the job launch message in orte_util_encode_nodemap, we use the orte_node_pool ordering to structure the launch message, and that ordering determines rank ordering. It is incorrect with respect to the per-app-context hostfile.
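
To make the ordering problem concrete, here is a small standalone sketch. It is an illustration only, using simplified stand-ins rather than the actual ORTE structures: it models seeding the pool with the HNP node and then appending hosts in first-seen, de-duplicated order, and shows how that diverges from the hostfile order.

/* Standalone model of the orte_node_pool ordering described above.
 * Illustration only; not the actual ORTE data structures. */
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

static const char *pool[MAX_NODES];   /* stand-in for orte_node_pool */
static int pool_len = 0;

/* Add a host only if it is not already in the pool (ORTE de-duplicates). */
static void add_if_new(const char *host)
{
    for (int i = 0; i < pool_len; i++) {
        if (0 == strcmp(pool[i], host)) {
            return;
        }
    }
    pool[pool_len++] = host;
}

int main(void)
{
    /* Hostfile order the user asked for (hostfile-b). */
    const char *hostfile_b[] = {"c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02"};

    /* The HNP node (where mpirun runs) always enters the pool first. */
    add_if_new("c712f6n01");

    for (int i = 0; i < 4; i++) {
        add_if_new(hostfile_b[i]);
    }

    /* Prints: c712f6n01 c712f6n04 c712f6n03 c712f6n02
     * Pool order drives the launch message (and hence rank placement),
     * so the HNP comes first instead of the hostfile's first entry. */
    for (int i = 0; i < pool_len; i++) {
        printf("%s%s", i ? " " : "", pool[i]);
    }
    printf("\n");
    return 0;
}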

I think this is a legitimate bug to fix, as it prevents users from controlling the rank ordering with respect to node order. However, after digging into this a bit, it looks like a pretty invasive change to make (and a delicate one at that so we don't break any other expected behavior in the process). I need to reflect on this a bit more, but wanted to post the issue for discussion.

@rhc54
Contributor

rhc54 commented Oct 11, 2017

Wow, thanks for the detailed report!!

This is a bit of a head scratcher. There was never any intended correlation between hostfile ordering and process placement - if it is happening in some prior series, it is entirely by accident. The documentation made clear that hostfile simply specifies available resources. You could, if you choose, use the same hostfile as input to the sequential mapper - but that was a clearly delineated special case.

This is where the -host option differed from hostfile. We allowed -host to not only specify nodes, but to also specify precedence of usage. It was an intentional special case.

It sounds like you are proposing to change that arrangement and force a similar correlation for hostfile. I'm not saying we shouldn't do that - just pointing out that it is an architectural change.

Your simplest solution would frankly be to take the hostfile entries and convert them to a -host list for each app_context, checking first to see if the user already provided a -host option so you don't unintentionally overwrite it. The -host list already gets sent back to all the backend daemons and used there to set order, so it should get you the desired behavior.
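
As a rough illustration of that conversion (a hypothetical helper, not an existing ORTE function), one could expand each parsed hostfile entry into repeated -host entries, one per slot, and only do so when the user has not already supplied -host:

/* Hypothetical helper: expand parsed hostfile entries into a -host style
 * comma-separated list (one repetition per slot). Not actual ORTE code. */
#include <stdio.h>
#include <string.h>

struct entry {
    const char *host;
    int slots;
};

static void hostfile_to_dash_host(const struct entry *e, int n,
                                  char *out, size_t outlen)
{
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        for (int s = 0; s < e[i].slots; s++) {
            if (out[0] != '\0') {
                strncat(out, ",", outlen - strlen(out) - 1);
            }
            strncat(out, e[i].host, outlen - strlen(out) - 1);
        }
    }
}

int main(void)
{
    /* hostfile-b, as parsed */
    struct entry hb[] = {
        {"c712f6n04", 2}, {"c712f6n01", 2}, {"c712f6n03", 2}, {"c712f6n02", 2},
    };
    char dash_host[512];

    /* In ORTE this would only run when the app_context has no user -host set. */
    hostfile_to_dash_host(hb, 4, dash_host, sizeof(dash_host));
    printf("-host %s\n", dash_host);
    return 0;
}

This produces the repeated-hostname form (c712f6n04,c712f6n04,c712f6n01,...) that is also used with the seq mapper later in this thread.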

@jjhursey
Member Author

Thanks for the suggestion. Converting the hostfile to a hostlist did work. I haven't looked at the hostlist path in the code yet.

That might be a workaround for this particular use case (they were focused on SPMD mode). I tested with v3.0.x below to confirm:

SPMD

Launching from c712f6n01

shell$ cat hostfile-b 
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
shell$  mpirun --hostfile ./hostfile-b hello_c | sort
  0/  8) [c712f6n01] 103355 Hello, world! 
  1/  8) [c712f6n01] 103356 Hello, world! 
  2/  8) [c712f6n04] 155678 Hello, world! 
  3/  8) [c712f6n04] 155679 Hello, world! 
  4/  8) [c712f6n03] 158810 Hello, world! 
  5/  8) [c712f6n03] 158811 Hello, world! 
  6/  8) [c712f6n02] 8876 Hello, world! 
  7/  8) [c712f6n02] 8877 Hello, world! 
shell$  mpirun --hostlist c712f6n04:2,c712f6n01:2,c712f6n03:2,c712f6n02:2  hello_c | sort
  0/  8) [c712f6n04] 155599 Hello, world! 
  1/  8) [c712f6n04] 155600 Hello, world! 
  2/  8) [c712f6n01] 103319 Hello, world! 
  3/  8) [c712f6n01] 103320 Hello, world! 
  4/  8) [c712f6n03] 158754 Hello, world! 
  5/  8) [c712f6n03] 158755 Hello, world! 
  6/  8) [c712f6n02] 8820 Hello, world! 
  7/  8) [c712f6n02] 8821 Hello, world! 

MPMD

Launching from c712f6n01. It looks like --hostlist is not accepted for the second app context (need to track that down; it might just be an arg-parsing issue in schizo).

mpirun --np 4 --map-by node -gmca pml ob1 -gmca btl tcp,vader,self  --hostlist c712f6n04:2,c712f6n01:2,c712f6n03:2,c712f6n02:2 ./hello_c A : --np 4 --map-by node --hostlist c712f6n04:2,c712f6n02:2,c712f6n03:2,c712f6n01:2 ./hello_c B 
mpirun: Error: unknown option "--hostlist"
Type 'mpirun --help' for usage.

Discussion

I need to check, but I wonder whether the batch environments are also impacted by this issue: if the order of the resources reported to ORTE is meaningful, do we honor that ordering?

Yeah, I agree that giving the hostfile ordering meaning would be an architectural change requiring some careful work. Maybe we can harness some of the --hostlist logic to make it easier, but I'd want to make sure we don't unnecessarily make the code more complex or the launch slower for this use case.

@rhc54
Contributor

rhc54 commented Oct 12, 2017

Errr...I'm unaware of any "hostlist" option in OMPI, so maybe that is one of yours? We use -H or --host. Last I checked, we correctly parsed those options per-app.

I don't believe using --host is going to slow anything down, assuming you want to enforce ordering. I'm familiar with how we do that, and it does involve a nested search across two lists to reorder things. I'm certain someone can improve that algorithm! I'd focus my attention there - passing the ordering in the -host attribute is pretty simple and scalable. If necessary, you could pass -host as a regex to keep its size down.
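
As a rough illustration of that nested search (a simplified standalone model; the real ORTE code works on lists of orte_node_t), reordering the allocated node list to match a -host ordering might look like:

/* Simplified model of reordering an allocated node list to match a -host
 * ordering. Illustration only, not the ORTE implementation. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Allocation as discovered (HNP first). */
    const char *alloc[]     = {"c712f6n01", "c712f6n04", "c712f6n03", "c712f6n02"};
    /* Desired order from -host. */
    const char *dash_host[] = {"c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02"};
    const char *ordered[4];
    int n = 0;

    /* O(n*m) nested search: for each -host entry, pull the matching node. */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            if (0 == strcmp(dash_host[i], alloc[j])) {
                ordered[n++] = alloc[j];
                break;
            }
        }
    }

    /* Prints: c712f6n04 c712f6n01 c712f6n03 c712f6n02 */
    for (int i = 0; i < n; i++) {
        printf("%s%s", i ? " " : "", ordered[i]);
    }
    printf("\n");
    return 0;
}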

If you decide instead to parse the hostfile on the backend, be aware that the hostfile isn't always on a shared file system. So you'd need to use the DFS framework (or include the hostfile contents in the launch message) to ensure the hostfile is available everywhere.

@jjhursey
Member Author

Ah yes - sorry about that. I got my MPI installs swapped when I was testing that prior comment. That was a drop of Spectrum MPI, which has the hostlist option.

Using -host with a list like c712f6n04,c712f6n04,c712f6n01,c712f6n01,c712f6n03,c712f6n03,c712f6n02,c712f6n02 worked if I specified the seq mapper (-gmca rmaps seq). Without that, it did not preserve ordering.

I'll keep tinkering with it. Thanks for the suggestion. I might pursue packing the ordering into a -host regex that we send out with the launch message, instead of parsing the hostfile on the backend. Maybe add an MCA parameter to enable this order preservation if it starts to get too involved.

@rhc54
Contributor

rhc54 commented Oct 12, 2017

Keep in mind that we currently set -host as an attribute on the app_context. Specifying the attribute scope as GLOBAL means that it gets sent as part of the launch message. All strings in the launch msg greater than a cutoff size are automatically compressed using zlib, which yields a pretty big reduction. So you might be getting enough compression already - worth checking. There is an option that will output the size of the launch msg for you as an aid in debugging such things.
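
As a quick way to see how much a long, repetitive -host string shrinks under zlib (illustrative only; ORTE's cutoff and launch-message framing are not modeled here, and the node names are synthetic), one can compress a generated host list directly:

/* Compress a synthetic -host style string with zlib to gauge the reduction.
 * Build with: cc zlib_check.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Build a -host style string for 1000 hypothetical nodes, 2 slots each. */
    char *src = malloc(1000 * 32);
    src[0] = '\0';
    for (int i = 0; i < 1000; i++) {
        char node[32];
        snprintf(node, sizeof(node), "%sc712f6n%03d,c712f6n%03d",
                 i ? "," : "", i, i);
        strcat(src, node);
    }

    uLong srclen = (uLong)strlen(src);
    uLongf dstlen = compressBound(srclen);
    Bytef *dst = malloc(dstlen);

    /* compress() is the stock zlib one-shot API. */
    if (Z_OK == compress(dst, &dstlen, (const Bytef *)src, srclen)) {
        printf("uncompressed: %lu bytes, compressed: %lu bytes\n",
               (unsigned long)srclen, (unsigned long)dstlen);
    }
    free(src);
    free(dst);
    return 0;
}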

It sounds like there is a bug in the -host processing as it is supposed to preserve ordering, IIRC. However, I confess it has been long enough that I may be wrong about that - I can at least look if you need some assistance.

@jjhursey jjhursey added enhancement and removed bug labels Oct 16, 2017
@jjhursey
Member Author

Re-tagging this as an enhancement, since it was not behavior that we should have been depending on (though I'd still like to make the case to get the fix into the v3.0.x release branch).

I'm working on a PR, but it's gotten sidelined a bit. Probably won't be ready until later in the week.

@jjhursey
Member Author

jjhursey commented Nov 7, 2017

As my inactivity here suggests, I haven't had the cycles to work on this, and I probably won't for a few months. So if someone is curious and wants to take a pass at it before then, feel free; otherwise I'll get back to it later.

@rhc54 rhc54 added the RTE label (issue likely is in RTE or PMIx areas) Jun 26, 2018
@jjhursey
Member Author

(might be related) Ref PR #6493 / Issue #6298 / Issue #6501

@rhc54
Contributor

rhc54 commented Apr 22, 2021

Given that 2.5 years have passed, I think it is safe to say this isn't something that is going to happen.

@rhc54 rhc54 closed this as completed Apr 22, 2021