hostfile ordering not honored when HNP is used in allocation #4327

Closed
jjhursey opened this issue Oct 11, 2017 · 9 comments

Comments

@jjhursey
Member

We first noticed this in the v3.0.x release stream, as a difference in behavior from the v2.x release stream. I believe this also impacts v3.1.x and master.

This is fallout from pushing the mapping/ordering mechanism to the backend nodes, and likely from some of the improvements for the DVM and comm_spawn. Any fix would need to be careful not to break or hinder those features.

SPMD case

We are launching mpirun from node c712f6n01, using the following hostfile to land the first few ranks on a remote node (c712f6n04) before using the local node.

shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2

In v2.x you can see the order is preserved, with ranks 0 and 1 on c712f6n04 followed by ranks 2 and 3 on c712f6n01 (where mpirun/the HNP resides):

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n04] 56846 Hello, world!
  1/  8) [c712f6n04] 56847 Hello, world!
  2/  8) [c712f6n01] 65059 Hello, world!
  3/  8) [c712f6n01] 65060 Hello, world!
  4/  8) [c712f6n03] 62352 Hello, world!
  5/  8) [c712f6n03] 62353 Hello, world!
  6/  8) [c712f6n02] 74593 Hello, world!
  7/  8) [c712f6n02] 74594 Hello, world!

In v3.0.x you can see that the HNP node is always first in the list, followed by the hostfile's ordering minus the HNP node. This puts ranks 0 and 1 on c712f6n01 instead of on c712f6n04, where the user wanted them.

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n01] 64629 Hello, world!
  1/  8) [c712f6n01] 64630 Hello, world!
  2/  8) [c712f6n04] 56447 Hello, world!
  3/  8) [c712f6n04] 56448 Hello, world!
  4/  8) [c712f6n03] 61943 Hello, world!
  5/  8) [c712f6n03] 61944 Hello, world!
  6/  8) [c712f6n02] 74189 Hello, world!
  7/  8) [c712f6n02] 74190 Hello, world!

The expected result should match v2.x's behavior with this hostfile, preserving the ordering:

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n04] 56846 Hello, world!
  1/  8) [c712f6n04] 56847 Hello, world!
  2/  8) [c712f6n01] 65059 Hello, world!
  3/  8) [c712f6n01] 65060 Hello, world!
  4/  8) [c712f6n03] 62352 Hello, world!
  5/  8) [c712f6n03] 62353 Hello, world!
  6/  8) [c712f6n02] 74593 Hello, world!
  7/  8) [c712f6n02] 74594 Hello, world!

MPMD case

Again, we are launching mpirun from node c712f6n01.

Consider these two hostfiles, which order the same four machines differently.

shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
shell$ cat hostfile-c
c712f6n04 slots=2
c712f6n02 slots=2
c712f6n03 slots=2
c712f6n01 slots=2

The hello world program prints the argument it was given in its output to make it clear which app context each process originated from (the A and B values in the output below).
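
For reference, a minimal hello_c along these lines would produce that output; this is a hypothetical reconstruction, not the exact test program used here.

/* Hypothetical reconstruction of the hello_c test program referenced above;
 * the actual source may differ, but the output format matches the transcripts. */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char host[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));

    /* e.g. "  0/  8) [c712f6n04] 56846 Hello, world! A" */
    printf("%3d/%3d) [%s] %d Hello, world!%s%s\n",
           rank, size, host, (int)getpid(),
           argc > 1 ? " " : "", argc > 1 ? argv[1] : "");

    MPI_Finalize();
    return 0;
}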

In v2.x we get some odd behavior in the second app context's mapping (likely due to the bookmark not being reset quite right; notice how the iteration continues between the app-context assignments of ranks 3 and 4):

shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --hostfile ./hostfile-c ./hello_c B | sort
  0/  8) [c712f6n04] 56926 Hello, world! A
  1/  8) [c712f6n01] 65108 Hello, world! A
  2/  8) [c712f6n03] 62435 Hello, world! A
  3/  8) [c712f6n02] 74671 Hello, world! A
  4/  8) [c712f6n02] 74672 Hello, world! B
  5/  8) [c712f6n03] 62436 Hello, world! B
  6/  8) [c712f6n01] 65109 Hello, world! B
  7/  8) [c712f6n04] 56927 Hello, world! B

In v3.0.x we get a more consistent pattern, but not quite what we want:

shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --map-by node --hostfile ./hostfile-c ./hello_c B | sort
  0/  8) [c712f6n01] 64736 Hello, world! A
  1/  8) [c712f6n04] 56615 Hello, world! A
  2/  8) [c712f6n03] 62110 Hello, world! A
  3/  8) [c712f6n02] 74355 Hello, world! A
  4/  8) [c712f6n01] 64737 Hello, world! B
  5/  8) [c712f6n04] 56616 Hello, world! B
  6/  8) [c712f6n03] 62111 Hello, world! B
  7/  8) [c712f6n02] 74356 Hello, world! B

The expected result should be as follows, preserving the ordering of each app context's hostfile:

  0/  8) [c712f6n04] 64736 Hello, world! A
  1/  8) [c712f6n01] 56615 Hello, world! A
  2/  8) [c712f6n03] 62110 Hello, world! A
  3/  8) [c712f6n02] 74355 Hello, world! A
  4/  8) [c712f6n04] 64737 Hello, world! B
  5/  8) [c712f6n02] 56616 Hello, world! B
  6/  8) [c712f6n03] 62111 Hello, world! B
  7/  8) [c712f6n01] 74356 Hello, world! B

Discussion

The ordering in the v3.0.x series comes from the orte_node_pool ordering. In that list the HNP node is always first, followed by hosts in the order they are discovered. In the MPMD case, for example, the first time we see node c712f6n04 is in the first hostfile (hostfile-b), so it is added to the orte_node_pool in the second position, just after the HNP, and so on through the rest of the first hostfile. When the second hostfile (hostfile-c) is encountered, the hosts are already in the orte_node_pool, so we don't re-add them.

The RMAPS mechanism works with the true ordering from the hostfile when it makes its mapping decisions (e.g., in orte_rmaps_rr_map), but that context is lost when ORTE moves into the orte_plm_base_launch_apps state. During application launch, when we pack the job launch message in orte_util_encode_nodemap, we use the orte_node_pool ordering to structure the launch message, and that ordering determines rank ordering. It is incorrect with respect to the per-app-context hostfile.
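
To make the ordering problem concrete, here is a small standalone sketch. It is an illustration only, using simplified stand-ins rather than the actual ORTE structures: it models seeding the pool with the HNP node and then appending hosts in first-seen, de-duplicated order, and shows how that diverges from the hostfile order.

/* Standalone model of the orte_node_pool ordering described above.
 * Illustration only; not the actual ORTE data structures. */
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

static const char *pool[MAX_NODES];   /* stand-in for orte_node_pool */
static int pool_len = 0;

/* Add a host only if it is not already in the pool (ORTE de-duplicates). */
static void add_if_new(const char *host)
{
    for (int i = 0; i < pool_len; i++) {
        if (0 == strcmp(pool[i], host)) {
            return;
        }
    }
    pool[pool_len++] = host;
}

int main(void)
{
    /* Hostfile order the user asked for (hostfile-b). */
    const char *hostfile_b[] = {"c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02"};

    /* The HNP node (where mpirun runs) always enters the pool first. */
    add_if_new("c712f6n01");

    for (int i = 0; i < 4; i++) {
        add_if_new(hostfile_b[i]);
    }

    /* Prints: c712f6n01 c712f6n04 c712f6n03 c712f6n02
     * Pool order drives the launch message (and hence rank placement),
     * so the HNP comes first instead of the hostfile's first entry. */
    for (int i = 0; i < pool_len; i++) {
        printf("%s%s", i ? " " : "", pool[i]);
    }
    printf("\n");
    return 0;
}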

I think this is a legitimate bug to fix, as it prevents users from controlling the rank ordering with respect to node order. However, after digging into this a bit, it looks like a pretty invasive change to make (and a delicate one at that so we don't break any other expected behavior in the process). I need to reflect on this a bit more, but wanted to post the issue for discussion.

@rhc54
Contributor

rhc54 commented Oct 11, 2017

Wow, thanks for the detailed report!!

This is a bit of a head scratcher. There was never any intended correlation between hostfile ordering and process placement - if it is happening in some prior series, it is entirely by accident. The documentation made clear that hostfile simply specifies available resources. You could, if you choose, use the same hostfile as input to the sequential mapper - but that was a clearly delineated special case.

This is where the -host option differed from hostfile. We allowed -host to not only specify nodes, but to also specify precedence of usage. It was an intentional special case.

It sounds like you are proposing to change that arrangement and force a similar correlation for hostfile. I'm not saying we shouldn't do that - just pointing out that it is an architectural change.

Your simplest solution would frankly be to take the hostfile entries and convert them to a -host list for each app_context, checking first to see if the user already provided a -host option so you don't unintentionally overwrite it. The -host list already gets sent back to all the backend daemons and used there to set order, so it should get you the desired behavior.
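
As a rough illustration of that conversion (a hypothetical helper, not an existing ORTE function), one could expand each parsed hostfile entry into repeated -host entries, one per slot, and only do so when the user has not already supplied -host:

/* Hypothetical helper: expand parsed hostfile entries into a -host style
 * comma-separated list (one repetition per slot). Not actual ORTE code. */
#include <stdio.h>
#include <string.h>

struct entry {
    const char *host;
    int slots;
};

static void hostfile_to_dash_host(const struct entry *e, int n,
                                  char *out, size_t outlen)
{
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        for (int s = 0; s < e[i].slots; s++) {
            if (out[0] != '\0') {
                strncat(out, ",", outlen - strlen(out) - 1);
            }
            strncat(out, e[i].host, outlen - strlen(out) - 1);
        }
    }
}

int main(void)
{
    /* hostfile-b, as parsed */
    struct entry hb[] = {
        {"c712f6n04", 2}, {"c712f6n01", 2}, {"c712f6n03", 2}, {"c712f6n02", 2},
    };
    char dash_host[512];

    /* In ORTE this would only run when the app_context has no user -host set. */
    hostfile_to_dash_host(hb, 4, dash_host, sizeof(dash_host));
    printf("-host %s\n", dash_host);
    return 0;
}

This produces the repeated-hostname form (c712f6n04,c712f6n04,c712f6n01,...) that is also used with the seq mapper later in this thread.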

@jjhursey
Member Author

Thanks for the suggestion. Converting the hostfile to a hostlist did work. I haven't looked at the hostlist path in the code yet.

That might be a workaround for this particular use case (they were focused on SPMD mode). I tested with v3.0.x below to confirm:

SPMD

Launching from c712f6n01

shell$ cat hostfile-b 
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
shell$  mpirun --hostfile ./hostfile-b hello_c | sort
  0/  8) [c712f6n01] 103355 Hello, world! 
  1/  8) [c712f6n01] 103356 Hello, world! 
  2/  8) [c712f6n04] 155678 Hello, world! 
  3/  8) [c712f6n04] 155679 Hello, world! 
  4/  8) [c712f6n03] 158810 Hello, world! 
  5/  8) [c712f6n03] 158811 Hello, world! 
  6/  8) [c712f6n02] 8876 Hello, world! 
  7/  8) [c712f6n02] 8877 Hello, world! 
shell$  mpirun --hostlist c712f6n04:2,c712f6n01:2,c712f6n03:2,c712f6n02:2  hello_c | sort
  0/  8) [c712f6n04] 155599 Hello, world! 
  1/  8) [c712f6n04] 155600 Hello, world! 
  2/  8) [c712f6n01] 103319 Hello, world! 
  3/  8) [c712f6n01] 103320 Hello, world! 
  4/  8) [c712f6n03] 158754 Hello, world! 
  5/  8) [c712f6n03] 158755 Hello, world! 
  6/  8) [c712f6n02] 8820 Hello, world! 
  7/  8) [c712f6n02] 8821 Hello, world! 

MPMD

Launching from c712f6n01. It looks like --hostlist is not accepted for the second app context (need to track that down; it might just be an arg-parsing issue in schizo).

mpirun --np 4 --map-by node -gmca pml ob1 -gmca btl tcp,vader,self  --hostlist c712f6n04:2,c712f6n01:2,c712f6n03:2,c712f6n02:2 ./hello_c A : --np 4 --map-by node --hostlist c712f6n04:2,c712f6n02:2,c712f6n03:2,c712f6n01:2 ./hello_c B 
mpirun: Error: unknown option "--hostlist"
Type 'mpirun --help' for usage.

Discussion

I need to check, but I wonder whether the batch environments are also impacted by this issue: if the order of the resources reported to ORTE is meaningful, do we honor that ordering?

Yeah, I agree that giving the hostfile ordering meaning would be an architectural change requiring some careful work. Maybe we can harness some of the --hostlist logic to make it easier, but I'd want to make sure we don't unnecessarily make the code more complex or the launch slower for this use case.

@rhc54
Contributor

rhc54 commented Oct 12, 2017

Errr...I'm unaware of any "hostlist" option in OMPI, so maybe that is one of yours? We use -H or --host. Last I checked, we correctly parsed those options per-app.

I don't believe using --host is going to slow anything down, assuming you want to enforce ordering. I'm familiar with how we do that, and it does involve a nested search across two lists to reorder things. I'm certain someone can improve that algorithm! I'd focus my attention there - passing the ordering in the -host attribute is pretty simple and scalable. If necessary, you could pass -host as a regex to keep its size down.
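
As a rough illustration of that nested search (a simplified standalone model; the real ORTE code works on lists of orte_node_t), reordering the allocated node list to match a -host ordering might look like:

/* Simplified model of reordering an allocated node list to match a -host
 * ordering. Illustration only, not the ORTE implementation. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Allocation as discovered (HNP first). */
    const char *alloc[]     = {"c712f6n01", "c712f6n04", "c712f6n03", "c712f6n02"};
    /* Desired order from -host. */
    const char *dash_host[] = {"c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02"};
    const char *ordered[4];
    int n = 0;

    /* O(n*m) nested search: for each -host entry, pull the matching node. */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            if (0 == strcmp(dash_host[i], alloc[j])) {
                ordered[n++] = alloc[j];
                break;
            }
        }
    }

    /* Prints: c712f6n04 c712f6n01 c712f6n03 c712f6n02 */
    for (int i = 0; i < n; i++) {
        printf("%s%s", i ? " " : "", ordered[i]);
    }
    printf("\n");
    return 0;
}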

If you decide instead to parse the hostfile on the backend, be aware that the hostfile isn't always on a shared file system. So you'd need to use the DFS framework (or include the hostfile contents in the launch message) to ensure the hostfile is available everywhere.

@jjhursey
Member Author

Ah yes - sorry about that. I got my MPI installs swapped when I was testing that prior comment. That was a drop of Spectrum MPI, which has the hostlist option.

Using -host with a list like c712f6n04,c712f6n04,c712f6n01,c712f6n01,c712f6n03,c712f6n03,c712f6n02,c712f6n02 worked if I specified the seq mapper (-gmca rmaps seq). Without that, it did not preserve ordering.

I'll keep tinkering with it. Thanks for the suggestion. I might pursue packing the ordering into a -host regex that we send out with the launch message, instead of parsing the hostfile on the backend. Maybe add an MCA parameter to enable this order preservation if it starts to get too involved.

@rhc54
Contributor

rhc54 commented Oct 12, 2017

Keep in mind that we currently set -host as an attribute on the app_context. Specifying the attribute scope as GLOBAL means that it gets sent as part of the launch message. All strings in the launch msg greater than a cutoff size are automatically compressed using zlib, which yields a pretty big reduction. So you might be getting enough compression already - worth checking. There is an option that will output the size of the launch msg for you as an aid in debugging such things.
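
As a quick way to see how much a long, repetitive -host string shrinks under zlib (illustrative only; ORTE's cutoff and launch-message framing are not modeled here, and the node names are synthetic), one can compress a generated host list directly:

/* Compress a synthetic -host style string with zlib to gauge the reduction.
 * Build with: cc zlib_check.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Build a -host style string for 1000 hypothetical nodes, 2 slots each. */
    char *src = malloc(1000 * 32);
    src[0] = '\0';
    for (int i = 0; i < 1000; i++) {
        char node[32];
        snprintf(node, sizeof(node), "%sc712f6n%03d,c712f6n%03d",
                 i ? "," : "", i, i);
        strcat(src, node);
    }

    uLong srclen = (uLong)strlen(src);
    uLongf dstlen = compressBound(srclen);
    Bytef *dst = malloc(dstlen);

    /* compress() is the stock zlib one-shot API. */
    if (Z_OK == compress(dst, &dstlen, (const Bytef *)src, srclen)) {
        printf("uncompressed: %lu bytes, compressed: %lu bytes\n",
               (unsigned long)srclen, (unsigned long)dstlen);
    }
    free(src);
    free(dst);
    return 0;
}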

It sounds like there is a bug in the -host processing as it is supposed to preserve ordering, IIRC. However, I confess it has been long enough that I may be wrong about that - I can at least look if you need some assistance.

@jjhursey jjhursey added enhancement and removed bug labels Oct 16, 2017
@jjhursey
Member Author

Re-tagging this as an enhancement, since it was not behavior that we should have been depending on (though I'd still like to make the case to get the fix into the v3.0.x release branch).

I'm working on a PR, but it's gotten sidelined a bit. Probably won't be ready until later in the week.

@jjhursey
Member Author

jjhursey commented Nov 7, 2017

As my inactivity here suggests, I haven't had the cycles to work on this, and I probably won't for a few months. So if someone is curious and wants to take a pass at it before then, feel free; otherwise I'll get back to it later.

@rhc54 rhc54 added the RTE label (issue likely is in RTE or PMIx areas) Jun 26, 2018
@jjhursey
Member Author

(might be related) Ref PR #6493 / Issue #6298 / Issue #6501

@rhc54
Contributor

rhc54 commented Apr 22, 2021

Given that 2.5 years have passed, I think it is safe to say this isn't something that is going to happen.

@rhc54 rhc54 closed this as completed Apr 22, 2021