
WeeklyTelcon_20191217

Geoffrey Paulsen edited this page Dec 17, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Noah Evans (Sandia)
  • Brian Barrett (AWS)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Josh Hursey (IBM)
  • Michael Heinz (Intel)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • Jeff Squyres (Cisco)
  • George Bosilca (UTK)
  • Thomas Naughton (ORNL)
  • Artem Polyakov (Mellanox)
  • David Bernhold (ORNL)

Not there today (kept here for easy cut-n-paste into future notes)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Intel)
  • Akshay Venkatesh (NVIDIA)
  • Edgar Gabriel (UH)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Joshua Ladd (Mellanox)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Xin Zhao (Mellanox)
  • mohan (AWS)

!!! Next Open-MPI Web-ex will be January 7th, 2020 !!!

New Business

MPI Forum meeting last week.

  • Nothing specific for Open MPI
  • Big Count hit some issues; not sure if it's going in.
  • Targeting official MPI v4.0 at Supercomputing 2020
  • Jeff Squyres is working on a Python layer for the Standard to generate bindings
    • Will generate

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • 3.0.5 and 3.1.5 have shipped
  • Planning for no new fixes on 3.x, unless super critical
  • BUT, it looks like something was messed up with 3.1.5; not sure about the 3.0.x branch
    • Brian will read up on the issue and see if we need a release to address it.
    • May just be an issue with Fedora / RHEL 7.8 that we don't see on earlier RHEL.
    • Issue 7212: Private glibc reference
      • Why not just revert the original commit?
      • If the commit doesn't work on all systems, just remove it until we can.
      • On v3.0 as well.
      • May be an issue in UCX.

Review v4.0.x Milestones v4.0.3

  • v4.0.3 in the works.
    • Schedule: end of January.
    • There's a problem in Open MPI v4.0.2 that packagers will hit with UCX 1.7
      • PR 1752 may drive an earlier release if UCX is released sooner.
  • PR 7116
    • Ensure no backwards compat issues?
    • Howard will send email to ARM.
  • PR 7149 - Geoff will go look at it.
  • PRs currently open, either need reviews or
  • PR 7229 - Sandia is reviewing.

Do we want a v4.1.x release?

  • A few new enhancements desirable.

  • Added a Target v4.1.x label

    • Many new enhancements / features would be useful
    • 7151 - This is indeed a performance enhancement.
    • 7173
    • Should look into the amount of work involved in back-porting features to a release branch.
    • It would be a major thing. But we always say we don't take features into a release branch that's already out there.
      • People continue to open PRs with features.
    • Two issues:
      • One - we've really stalled out v5.0.0
      • Two - are performance features really an issue to pull in?
        • PR 7151 - seems to be borderline bugfix / feature / risky
  • PR 7151 - enhancement -

  • Recommend against slipping PRRTE to a v6.0… many people ex

    • JSM direct launch - could take it as a patch in Spack
  • Consensus to NOT do a v4.1.0 release

    • Geoff will put a message on those PRs and then close them later this week.
  • PRRTE effort.

    • Got PRTE integrated into OMPI.
    • Got ORTE out
    • Got OMPI calling PMIx Directly
    • Need to write mpirun - possibly one
    • Want to discuss HOW to bring this in to minimize disruption.
      • Questions: PVars and Tvars - currently apply to both MPI and Runtime layers.
        • Works because runtime calls orte_init.
        • PMIx doesn't have pvar / tvar support.
      • ORTE today uses the MCA system. Not aware of it explicitly using pvars/tvars.
        • Today ORTE implicitly creates pvars/tvars for each MCA parameter.
        • mpirun will be a link to prun (PRRTE):
          1. can attach to an existing system and request a new job. (Direct launch method)
          2. can start its own system.
          • We want the option from mpirun to do either.
          • Which way do we want to default this?
          • If you run with -host, but the existing DVM isn't on all those hosts... then the launch gets delayed.
        • How to discover? A rendezvous file gets dropped; look in known locations and find one you have permissions to read.
      • Currently haven't set a minimum PMIx version, but might want to set minimum PMIx v4.0
      • Ton of unprefixed symbols being spit out by Open MPI.
        • OMPI, OPAL, and ORTE prefixes are ours.
        • Everything that starts with mca is in there as a public symbol.
        • The problem is that if another library reuses the MCA system, you hit this.
      • Domain frameworks - adding MCA components to a list for auto-close, but the sequencing of closing needs to be very specific.
        • Want to strip this out, as it's causing problems.
        • Might need this for sessions.
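The rendezvous discovery mentioned above (a DVM drops a file; a launcher scans known locations for one it can read, else starts its own system) can be sketched roughly as follows. This is a minimal illustration only, not PRRTE's actual implementation; the candidate directory list and the `prte.rndz.*` file-name pattern are assumptions made up for the example:

```python
import glob
import os

def find_rendezvous_file(candidate_dirs):
    """Return the first readable rendezvous file found, else None.

    Mirrors the idea in the notes: a running DVM drops a rendezvous
    file in a well-known location; mpirun/prun looks through those
    locations and attaches to the first DVM whose file it has
    permission to read. If none is found, the caller would fall
    back to starting its own system (mode 2 above).
    """
    for d in candidate_dirs:
        for path in sorted(glob.glob(os.path.join(d, "prte.rndz.*"))):
            if os.access(path, os.R_OK):
                return path   # attach to this existing DVM
    return None               # no usable DVM found: start our own
```

A launcher might call this with, say, a per-user temporary directory followed by a system-wide one, defaulting to "attach if possible, otherwise self-start", which is exactly the default-behavior question raised in the notes.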

v5.0.0

  • Schedule: April 2020?

Face to face

  • It's official! Portland, Oregon, Feb 17, 2020.
    • Safe to begin booking travel now.
  • Please register on the Wiki page, since Jeff has to register you.
  • Date looks good: Feb 17th, right before the MPI Forum.
    • 2pm Monday, and maybe most of Tuesday
    • Cisco has a Portland facility and is happy to host.
    • About a 20-30 min drive from the MPI Forum; will probably need a car.

Infrastructure

Review Master Pull Requests

CI status

  • IBM's PGI test has NEVER worked. Is it a real issue, or local to IBM?
    • Austen is looking into it.
  • Absoft 32-bit Fortran failures.

Dependencies

PMIx Update

ORTE/PRRTE

  • No discussion this week.

MTT


Back to 2019 WeeklyTelcon-2019
