Update ORTE to support PMIx v3 #4854

Merged: rhc54 merged 4 commits into open-mpi:master from topic/update on Mar 2, 2018

Conversation

@rhc54 (Contributor) commented Feb 23, 2018

This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

  • initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes the ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch.

  • IO forwarding for tools. Output of apps launched under tool control is directed to the tool and printed there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

  • Fabric integration for "instant on". Enables collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

  • Harvesting and forwarding of envars. Enables network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allow envars to be excluded. (A sketch of the tool-side usage follows below.)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
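
For illustration, here is a rough sketch (not part of this PR) of how a PMIx v3 tool might use these directives when spawning an app - the shim path, program name, and the particular choice of directives are placeholder assumptions, and error handling/cleanup are trimmed:

    /* Hedged sketch only - not from this PR.  A PMIx v3 tool (e.g. a debugger
     * front-end) spawns an app, asks the launcher to set an envar in the
     * spawned procs, and requests that their stdout be forwarded back. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>
    #include <pmix_tool.h>

    int main(void)
    {
        pmix_proc_t myproc;
        pmix_app_t *app;
        pmix_info_t info[2];
        pmix_envar_t envar;
        char nspace[PMIX_MAX_NSLEN + 1];
        pmix_status_t rc;
        bool flag = true;

        if (PMIX_SUCCESS != (rc = PMIx_tool_init(&myproc, NULL, 0))) {
            fprintf(stderr, "PMIx_tool_init failed: %s\n", PMIx_Error_string(rc));
            return 1;
        }

        /* illustrative envar: ask that LD_PRELOAD be set in the spawned procs */
        PMIX_ENVAR_LOAD(&envar, "LD_PRELOAD", "/path/to/dbg_shim.so", ':');
        PMIX_INFO_LOAD(&info[0], PMIX_SET_ENVAR, &envar, PMIX_ENVAR);
        /* ask that the spawned procs' stdout be forwarded to this tool */
        PMIX_INFO_LOAD(&info[1], PMIX_FWD_STDOUT, &flag, PMIX_BOOL);

        /* one app, one proc - the program name is a placeholder */
        PMIX_APP_CREATE(app, 1);
        app->cmd = strdup("./a.out");
        app->argv = (char **) calloc(2, sizeof(char *));
        app->argv[0] = strdup("./a.out");
        app->maxprocs = 1;

        rc = PMIx_Spawn(info, 2, app, 1, nspace);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "PMIx_Spawn failed: %s\n", PMIx_Error_string(rc));
        } else {
            printf("spawned job %s\n", nspace);
        }

        PMIX_APP_FREE(app, 1);
        (void) PMIx_tool_finalize();
        return 0;
    }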

rhc54 self-assigned this on Feb 23, 2018
@rhc54 (Contributor, Author) commented Feb 23, 2018

@ggouaillardet @jjhursey @jsquyres You guys might want to take a look at this one - it is a backport from the PMIx reference server. I didn't want to hold off until the full debugger support is completed, so I hoped to break it down into smaller, digestible chunks. Still, I don't want to let OMPI diverge too far!

I'm assuming OMPI wants the debugger support and the "instant on" features, and so this is likely something the community wants committed. If not, just close it.

@jladd-mlnx (Member) commented:
@artpol84 FYI

rhc54 requested a review from jjhursey on February 27, 2018 at 16:24
@rhc54 (Contributor, Author) commented Feb 27, 2018

MTT results:

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:38    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 00:25    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:03    | 2    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 07:52    | 392  |      |          |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 01:51    | 6    |      | 1        | 1    | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 10:01    |      | 1    |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | intel           | 4.0.0a1     | 11:55    | 242  |      |          | 2    | Test_Run-intel-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 4.0.0a1     | 08:34    | 222  |      |          | 22   | Test_Run-intel_skip-my_installation-4.0.0a1-my_installation.html         |
| Test Run    | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:40    | 19   |      |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+


    Total Tests:    892
    Total Failures: 2
    Total Passed:   890
    Total Duration: 2521 secs. (42:01)

@rhc54 (Contributor, Author) commented Feb 27, 2018

MTT without this patch - looks like the patch is impacting comm_spawn somehow:

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:38    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 00:26    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:02    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:02    | 2    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 07:45    | 392  |      |          |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 00:11    | 7    |      |          | 1    | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 10:07    | 1    |      |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | intel           | 4.0.0a1     | 11:55    | 242  |      |          | 2    | Test_Run-intel-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 4.0.0a1     | 08:34    | 222  |      |          | 22   | Test_Run-intel_skip-my_installation-4.0.0a1-my_installation.html         |
| Test Run    | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:40    | 19   |      |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+


    Total Tests:    892
    Total Failures: 0
    Total Passed:   892
    Total Duration: 2422 secs. (40:22)

@rhc54 (Contributor, Author) commented Feb 28, 2018

@ggouaillardet @jjhursey I'm not sure why no-disconnect is hanging - it may share a common cause with the loop-spawn failure. I only see it on multi-node jobs, so it appears to involve a race condition.

Unfortunately, I cannot chase it down myself, as it only appears in OMPI with comm_spawn. Can someone take a look so we can get this updated?

@ggouaillardet (Contributor) commented:
@rhc54 I cannot even run a simple MPI_Comm_spawn() on a remote node.
For example

n0 $mpirun -np 1 --host n0:1,n1:1 ./spawn

The difference I was able to spot is in pmix_server_connect():

    /* if all local contributions have been received,
     * let the local host's server know that we are at the
     * "fence" point - they will callback once the [dis]connect
     * across all participants has been completed */
    if (trk->def_complete &&
        pmix_list_get_size(&trk->local_cbs) == trk->nlocal) {
        rc = pmix_host_server.connect(trk->pcs, trk->npcs, trk->info, trk->ninfo, cbfunc, trk);
    } else {
        rc = PMIX_SUCCESS;
    }

trk->def_complete is true in master; in this branch, it is false and never gets updated.

FWIW, if I comment out the trk->def_complete test, it seems to work - see the snippet below.
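
That is, I effectively ran with the first check disabled - the same block as quoted above, just with the test commented out (a diagnostic hack, not a proposed fix):

    /* diagnostic hack only: skip the def_complete check so the host
     * connect callback fires as soon as all local contributions arrive */
    if (/* trk->def_complete && */
        pmix_list_get_size(&trk->local_cbs) == trk->nlocal) {
        rc = pmix_host_server.connect(trk->pcs, trk->npcs, trk->info, trk->ninfo, cbfunc, trk);
    } else {
        rc = PMIX_SUCCESS;
    }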

I just noticed you pushed a new commit; I will double-check whether it affects the current behavior.

@ggouaillardet (Contributor) commented:
The new commit did not fix the issue on multiple nodes.

@rhc54 (Contributor, Author) commented Mar 2, 2018

Hmmm...it works fine for me, so there must be some difference between our setups. Let me guess - your mpirun executes on a non-compute node (i.e., has no procs on it)? If so, I'll debug that scenario.

@ggouaillardet (Contributor) commented:
@rhc54 not really ...
mpirun is invoked on n0, and it creates one MPI task on that same n0. This task then calls MPI_Comm_spawn() to create a second MPI task on n1.

But if I run mpirun -np 2 --host n1:2 ./spawn from n0 (i.e., mpirun is alone on n0 and both MPI tasks are on n1), then it works just fine.

@rhc54 (Contributor, Author) commented Mar 2, 2018

Okay, so in your scenario mpirun and the "parent" are on one node, and the spawned child is on another? I suspect this is the scenario that is causing those tests to fail. They have a lot of spawns in them, and eventually they fill the local node and overflow to the other node - and then hang.

So it sounds like we may have an easy way to reproduce the problem. Let me poke at it a bit.

@ggouaillardet (Contributor) commented:
Also,

mpirun -np 1 --host n1:1,n2:1 ./spawn

from n0 works just fine
(i.e., mpirun is alone on n0; it starts a first MPI task on n1, which eventually calls MPI_Comm_spawn() to create a second MPI task on n2).

@rhc54 (Contributor, Author) commented Mar 2, 2018

Weird - makes me wonder again if perhaps this is just a race condition, and the placement of procs just causes you to fall on one side or another.

@ggouaillardet (Contributor) commented:
> in your scenario mpirun and the "parent" are on one node, and the spawned child is on another

Yes, this is my scenario.

@ggouaillardet (Contributor) commented:
All I can say is that I reproduce the issue 100% of the time.

So let me put it this way:

  • On mpirun, trk->def_complete is false in this PR - is this the expected behavior?
  • If yes, how and when is this value supposed to be updated? Also, when should pmix_host_server.connect() be invoked on mpirun?

@ggouaillardet (Contributor) commented:
@rhc54 I think I now have a better understanding of where things differ.

In master, pmix_server_connect() calls new_tracker(), which does not know the namespace for jobid=2; at the end, all_def is true, and so is trk->def_complete.

But in this branch, new_tracker() does know about jobid=2 (it was created by PMIx_server_setup_application()), yet its all_registered is false, so we end up with all_def false, and trk->def_complete false as well.

In master, PMIx_server_setup_application() is never invoked, and so the issue is avoided.

PMIx_server_register_nspace() could update the all_registered value, but it is never invoked on mpirun for jobid=2, so I do not see this as a race condition.
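
To put the difference in code terms, here is a toy model of the logic I'm describing (illustration only, not the actual PMIx sources - the struct and helper are made up for this example):

    /* Toy model: def_complete is derived from whether every nspace the
     * server knows about has completed registration. */
    #include <stdbool.h>
    #include <stdio.h>

    struct nspace {
        int jobid;
        bool all_registered;   /* set when PMIx_server_register_nspace() completes */
    };

    static bool compute_def_complete(const struct nspace *known, int n)
    {
        for (int i = 0; i < n; i++) {
            if (!known[i].all_registered) {
                return false;
            }
        }
        return true;
    }

    int main(void)
    {
        /* master: the jobid=2 nspace was never created on mpirun */
        struct nspace master_view[] = { { 1, true } };

        /* this branch: PMIx_server_setup_application() created the jobid=2
         * nspace, but PMIx_server_register_nspace() is never invoked for it
         * on mpirun, so all_registered stays false */
        struct nspace branch_view[] = { { 1, true }, { 2, false } };

        printf("master: def_complete = %d\n",
               compute_def_complete(master_view, 1));  /* 1 -> connect fires */
        printf("branch: def_complete = %d\n",
               compute_def_complete(branch_view, 2));  /* 0 -> connect never fires */
        return 0;
    }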

@rhc54 (Contributor, Author) commented Mar 2, 2018

Ah, excellent!! Thanks so much for digging into this! I'll address it.

Ralph Castain added 4 commits March 2, 2018 02:00
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes the ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch.

* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and printed there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

* Fabric integration for "instant on". Enables collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

* Harvesting and forwarding of envars. Enables network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allow envars to be excluded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.

Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the preg code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@rhc54 (Contributor, Author) commented Mar 2, 2018

Looks like that last commit got it:

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:01    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:42    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 00:25    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:03    | 2    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 07:53    | 392  |      |          |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 00:10    | 7    |      |          | 1    | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 10:07    | 1    |      |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | intel           | 4.0.0a1     | 11:57    | 242  |      |          | 2    | Test_Run-intel-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 4.0.0a1     | 08:46    | 222  |      |          | 22   | Test_Run-intel_skip-my_installation-4.0.0a1-my_installation.html         |
| Test Run    | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:43    | 19   |      |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+


    Total Tests:    892
    Total Failures: 0
    Total Passed:   892
    Total Duration: 2451 secs. (40:51)

@rhc54 (Contributor, Author) commented Mar 2, 2018

@ggouaillardet Thanks again!

@rhc54 (Contributor, Author) commented Mar 2, 2018

Committing per discussion on the teleconf of 2/26.

rhc54 merged commit f818284 into open-mpi:master on Mar 2, 2018
rhc54 deleted the topic/update branch on March 2, 2018 at 13:49