WeeklyTelcon_20220816

Geoffrey Paulsen edited this page Aug 23, 2022 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Brendan Cunningham (Cornelis Networks)
  • Edgar Gabriel (UoH)
  • Christoph Niethammer (HLRS)
  • David Bernhold (ORNL)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Jingyin Tang
  • Joseph Schuchart
  • Josh Fisher (Cornelis Networks)
  • Matthew Dosanjh (Sandia)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)

not there today (I keep this for easy cut-n-paste for future notes)

  • Howard Pritchard (LANL)
  • William Zhang (AWS)
  • Jan (Sandia -ULT support in Open MPI)
  • Josh Hursey (IBM)
  • Tommy Janjusic (nVidia)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Joshua Ladd (nVidia)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Sam Gutierrez (LLNL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia)

v4.1.x

  • v4.1.5
    • Schedule: targeting ~6 months (Sept, Oct? Don't remember)
    • No driver on schedule yet.
  • Potential CVE issue in libevent, but we might not need to do anything.
    • Worst case, we'd just update our libevent version.
    • CVE scanner doesn't find CVEs from Open MPI source or Open MPI with CVE fixes.
      • New scanner doesn't find any issues with libevent anymore...

v5.0.x

  • Any updates on SLURM failures we're currently blocking on?
    • Blocking on merging prte submodule pointer on SLURM.
  • Testing mpirun command line options.
  • Supposed to do automatic translations from old command line options to new options.
    • Are we planning to get rid of options at some point?
    • Not printing deprecated warning by default.
    • We've made new options (that are the new way), but if we're not encouraging people to go to them, why?
      • Can we even map old to new options one-to-one?
    • We "own" the schizo component, so we could ditch the new options and only use the old options if we want.
    • Before we force any change, we should get users' input.
    • Old ones had auto-completion.
    • If we have old options that are going to new options, weird that we don't print the messages.
    • v5.0 was supposed to be pretty disruptive. Going back and making it less disruptive is fine, but then we are kind of saying that the old options are the way.
  • PRRTE v2 and v3 testing today.
    • Where's the list that exists in general?
    • What is this list to check on?
  • It'd be pretty good to make a test suite that assumes 2-4 nodes with 4 ppr or so.
  • Schedule:
    • PMIx and PRRTE changes coming at end of August.
      • Try to have bugfix PRs posted by end of August, to give time to iterate and merge.
    • Still using Critical v5.0.x Issues (https://github.com/open-mpi/ompi/projects/3) yesterday
  • Issue 10641: Ralph changed the PRRTE branches (switching us to the v3.2 branch)
    • Lots of changes from PRRTE v2.1 -> v3.2
    • Still working to get CI working
      • MTT still failing with SLURM.
      • Gone from a segv in mpirun to a failure in resource detection.
    • Ralph doesn't have SLURM to help with.
    • Looking for someone with SLURM to help.
    • Austen will open an Issue for this.
  • Does ANYONE use Open MPI's Java Bindings?
  • Docs
    • mpirun --help is OUT OF DATE.
      • Have to do this relatively quickly, before PRRTE releases.
      • Austen, Geoff and Tomi will be
      • The reason for this is that the mpirun command line is in PRRTE.
  • mpirun manpage needs to be re-written.
    • Docs are online and can be updated asynchronously.
    • Jeff posted PR to document runpath vs rpath
      • Our configure checks some linker flags, but there might be default in linker or in system that really governs what happens.

PRRTE

  • Ralph is looking to release PRRTE v3.x by end of the month.
  • Java Binding discussion?
    • If Open MPI wants Java Bindings, we'd need to do some Java work in PRRTE before end of the month.
    • Small non-zero number of users, Howard may be interested.

Main branch

  • SLURM discussion
    • PRRTE won't run mpirun inside of slurm allocation for SLURM < 17.11
    • How many users will we hurt (require them to upgrade SLURM)?
    • Jeff still thinks his case might be out of the norm
  • HAN / Adapt runs.
    • Post to Devel. Summary, and link to results.
    • Want to make these the default.
    • August 18, 2-3pm Central
    • Geoff will send out web-ex to devel.
  • Incompatibilities in User Level threading that Jan
    • What's the schedule for fixes to get into v5.0.x?
    • Will try to get PRs in by end of August and then iterate.

Accelerator framework

  • William said yesterday that they wanted one more day of testing.
  • sm_cuda component was moved into framework.
    • nVidia has some issues building, and will try again to test
  • Accelerator framework is a good first step, but will need fixes (super high level)
    • Does this framework allow us to get rid of sm_cuda altogether?
  • Brian added some comments that William needs to address before merge.

Atomics PRs

  • Switching to builtin atomics
    • 10613 - Preferred PR. GCC / Clang should have that.
      • Fallback to C11 atomics if not available.
      • Had to do a bit in m4.
    • Builtin atomics are volatile.
    • Next step would be to refactor the atomics for post v5.0.
  • Joseph will post some additional info in the ticket.
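
For readers unfamiliar with the "builtin atomics, fall back to C11" idea discussed above, here is a minimal sketch. It is not the code from PR 10613. The `sketch_` names and the `SKETCH_USE_BUILTIN_ATOMICS` macro are hypothetical, and a simple compiler-macro check stands in for the configure/m4 detection mentioned in the notes. Only the `__atomic_*` builtins (GCC / Clang) and the `<stdatomic.h>` calls are real interfaces.

```c
#include <stdint.h>

/* Hypothetical feature test: assume GCC/Clang provide the __atomic
 * builtins; any other compiler takes the C11 <stdatomic.h> path.
 * A real build would decide this at configure time instead. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_USE_BUILTIN_ATOMICS 1
typedef int32_t sketch_atomic_int32_t;
#else
#define SKETCH_USE_BUILTIN_ATOMICS 0
#include <stdatomic.h>
typedef _Atomic int32_t sketch_atomic_int32_t;
#endif

/* Atomically add delta to *addr and return the previous value. */
static inline int32_t sketch_fetch_add_32(sketch_atomic_int32_t *addr,
                                          int32_t delta)
{
#if SKETCH_USE_BUILTIN_ATOMICS
    return __atomic_fetch_add(addr, delta, __ATOMIC_SEQ_CST);
#else
    return atomic_fetch_add_explicit(addr, delta, memory_order_seq_cst);
#endif
}

/* Strong compare-and-swap: returns 1 if *addr equalled *expected and was
 * set to desired; otherwise returns 0 and stores the observed value in
 * *expected. */
static inline int sketch_cas_32(sketch_atomic_int32_t *addr,
                                int32_t *expected, int32_t desired)
{
#if SKETCH_USE_BUILTIN_ATOMICS
    return __atomic_compare_exchange_n(addr, expected, desired,
                                       0 /* strong */,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#else
    return atomic_compare_exchange_strong_explicit(addr, expected, desired,
                                                   memory_order_seq_cst,
                                                   memory_order_seq_cst);
#endif
}
```

Both paths use sequentially consistent ordering to keep the sketch simple; a refactor like the one proposed for post v5.0 could expose weaker orderings where callers can tolerate them.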

MTT

Administrative tasks

Face-to-face
