
WeeklyTelcon_20210622


Open MPI Weekly Telecon

Attendees (on WebEx)

  • Brendan Cunningham (Cornelis Networks)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (NVIDIA)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Thomas Naughton III (ORNL)
  • Sam Gutierrez (LANL)
  • Tomislav Janjusic (NVIDIA)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • Erik Zeiske (HPE)
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Joseph Schuchart (HLRS)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • Xin Zhao (NVIDIA)

New Items

v4.0.x

  • v4.0.6 shipped last week. Looking good.
  • Mpool PR is waiting for review; it needs to go into master first.
  • 8919: NVIDIA cannot link. Some users may have already hit this.
    • Tomislav will try to find someone to look at it.

v4.1.x

  • Planning on late August for accumulated bugfixes.

v5.0.x

  • PMIx / PRRTE plan to release in the next few weeks.
  • Need to do a v5.0 rc as soon as PRRTE v2 ships.

    • Need feedback on whether we've missed anything important.
  • PMIx Tools support is still not functional. Opened tickets in PRRTE.

    • Not a common case for most users.
    • This also impacts the MPIR shim.
      • PRRTE v2 will probably ship with broken tool support.
  • Is OMPI the driving force for PRRTE v2.0?

    • So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
    • Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
    • Or just fix it in PRRTE v2.0?
    • Is broken tool support a blocker for PRRTE v2.0?
      • Don't ship OMPI v5.0 with broken Tools support.
  • Are there any objections to delaying?

    • Either we resource this
  • https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665

    • Current state of PMIx tool support.
    • We'd like to get tool support into CI, but it needs to be working before we can enable the CI.
  • https://github.com/openpmix/prrte/issues/978#issuecomment-856205950

    • Blocking issue for Open MPI.
    • Brian
  • PR 9014 - new blocker.

    • The fix should just be a couple of lines of code; the hard part is deciding what we want.
    • Ralph, Jeff and Brian started talking.
    • Simplest solution was to have our own
  • Need people working on v5.0 stuff.

  • Need some configury changes in before we RC.

  • Issues 8850, 8990, and more.

  • Brian will file 3-ish issues

    • One is the PMIx configury.
  • Dynamic windows fix is in for UCX.

  • Any update on debugger support?

  • Need some documentation that Open MPI v5.0 supports PMIx-based debuggers, and that if

  • MPIR Shim - pushed up fixes, and enabled CI.

    • Could add it to some more CI, to ensure that PMIx changes don't break it.
    • IBM is working on some CI testing with MPIR (typically very brittle)
    • Need some guidance on the PMIx version.
    • Right now it's probably not a big deal, but in two years, when we have three release branches with different PMIx versions, it might make sense to do Open MPI CI testing.
      • Shouldn't be too much work to do.
  • UCC coll component is being updated to be the default when UCX is selected. PR 8969.

    • Intent is that this will eventually replace hcoll.

Documentation

  • Solid progress happening on ReadTheDocs.
  • Would these docs be on the readthedocs.io site or on our site?
    • Haven't decided either way yet.
    • No strong opinion yet.

Master

  • Issue 8884 - ROMIO detects CUDA differently.

    • Gilles proposed a quick fix for now.

MPI 4.0 API

  • Now released.

  • Virtual face-to-face.

  • Persistent Collectives

    • Would be so nice to get the MPIX_ rename into v5.0 (see the sketch below).
    • Don't think this was planned for v5.0.
    • Don't know if anyone asked them this. Might not matter to them.
      • Virtual face-to-face -
  • A bunch of stuff in the pipeline; then details.
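For context, a minimal sketch of the persistent-collective API under discussion, using the standardized MPI 4.0 names (pre-rename builds spell these MPIX_Barrier_init, etc.; this is background, not a statement about what v5.0 ships):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* MPI 4.0 persistent collective: initialize the operation once... */
    MPI_Request req;
    MPI_Barrier_init(MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    /* ...then start and complete it as many times as needed. */
    for (int i = 0; i < 10; i++) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);

    MPI_Finalize();
    return 0;
}
```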

  • Plan to open Sessions pull request.

    • Big, almost all in OMPI.
    • Some of it is impacted by the clang-format changes.
    • New functions (see the sketch after this item).
    • Considerably more functions can be called before MPI_Init / after MPI_Finalize.
    • Don't want to do Sessions in v5.0.
    • Hessam Mirsadeghi is interested in trying MPI Sessions.
      • Interested in a timeline of a release that will contain MPI Sessions.
    • The Sessions working group meets every Monday at noon Central time.
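As background for the Sessions discussion above, a minimal sketch of the MPI 4.0 Sessions API (the names come from the MPI 4.0 standard; availability in Open MPI depends on the Sessions pull request landing):

```c
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    /* Sessions model: no MPI_Init; initialize a session instead. */
    MPI_Session session;
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Derive a communicator from the standard "world" process set. */
    MPI_Group group;
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);

    /* The string tag is arbitrary, but must match across processes. */
    MPI_Comm comm;
    MPI_Comm_create_from_group(group, "example-tag", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);
    MPI_Group_free(&group);

    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("rank %d initialized via a session\n", rank);

    MPI_Comm_disconnect(&comm);
    MPI_Session_finalize(&session);
    return 0;
}
```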
  • We don't KNOW that OMPI v6.0 won't be an ABI break.

  • Would be NICE to get MPIX symbols into a separate library.

    • What's left in MPIX after persistent collectives?
      • Short Float
      • Pcall_req - persistent collective
      • Affinity
    • If they're NOT built by default, it's not too high a priority.
      • Should just be some code-shuffling.
        • On the surface, shouldn't be too much.
        • If they use wrapper compilers, or the official mechanism.
        • Top-level library, since app -> MPI and app -> MPIX lib (see the sketch below).
        • libmpi_x library can then be versioned differently.
  • Don't change to build MPIX by default.
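To illustrate the layering question (the application links against both libmpi and whatever library holds the MPIX_ symbols), this is how an application typically consumes Open MPI's MPIX_ extensions today; the CUDA-awareness query is just one example of an extension symbol:

```c
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* Open MPI's MPIX_ extensions are declared here */
#endif
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* Example MPIX_ symbol: run-time CUDA-awareness query. */
    printf("CUDA-aware at run time: %d\n", MPIX_Query_cuda_support());
#else
    printf("This build exposes no CUDA-awareness extension.\n");
#endif

    MPI_Finalize();
    return 0;
}
```

If the MPIX_ symbols move to a separately versioned top-level library, this source pattern stays the same; only the link line changes.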

  • Open an issue to track all of our MPI 4.0 items

    • The MPI Forum will want this, certainly before Supercomputing.
  • Do we want an MPI 4.0 design meeting in place of a Tuesday meeting?

    • An in-person meeting is off the table for many of us. We might want an out-of-sequence meeting.
    • Let's doodle something a couple of weeks out.
    • Doodle and send it out.
    • Trivial wiki page in the style of the other in-person meeting wikis.

MTT

  • A lot of failures in MPI_Finalize in Cisco's MTT.
  • A lot of segfaults in UCX one-sided in IBM's MTT.
  • Howard Pritchard: does someone at NVIDIA have a good set of tests for GPUs?
    • Can ask around.
    • Only tests we have are the OSC tests.
  • ECP - worried we're going to get so far behind MPICH, because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU-offloaded code with MPICH. Just a heads up.
    • A thread on the GPU can trigger something to happen in MPI.
    • CUDA_Async - not sure of
  • Jeff did some work on Cisco MTT.
    • There are a bunch of one-sided issues across node.
    • Austen and Jeff looking into.
    • Narrowed it down to strange results from MPI_Comm_split (see the sketch after this list).
      • Local peers value appears to be set wrong under PRRTE.
  • Joseph sees that hwloc gets installed in the installation path, which leads to warnings if using another hwloc.
    • We changed how all of this worked a few weeks ago.
    • We shouldn't be installing one unless we can't find an external one.
    • Problem is if you link the application to a different hwloc, it now complains.
    • This has always been true; we just warn now. Don't do this.
  • Austen filed a couple of issues from MTT.
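A hypothetical way to eyeball the local-peers problem above (this is not the actual failing MTT test): split MPI_COMM_WORLD by node and print each rank's local grouping, which should match the real per-node process counts.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Group the ranks that share a node (shared-memory domain). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank, local_size;
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);
    printf("world %d/%d -> local %d/%d\n",
           world_rank, world_size, local_rank, local_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```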

PMIx

  • No discussion

PRRTE v2.0

  • No update

Longer Term discussions

  • No discussion.