
WeeklyTelcon_20170613

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Edgar Gabriel
  • Artem Polyakov
  • Jeff Squyres (Cisco)
  • Howard Pritchard
  • Josh Hursey
  • Joshua Ladd
  • Mohan
  • Todd Kordenbrock
  • David Bernholdt
  • Nathan Hjelm
  • Ralph
  • Brian Barrett (Amazon)
  • George
  • Ryan Grant

Agenda

2.0.3

  • Released June 1st.
  • No driver for a v2.0.4 at this time.

Review v2.x

  • v2.1.1 went out in May
  • No driver for v2.1.2 at this time.
  • PMIx - Looking good. Need Josh Hursey to confirm the threading issue is resolved.
    • Ralph finished a branch that Josh can build (just now), Josh will kick off build and test today.
  • Cisco tests had lots of issues (everything hung; some issue with SLURM on the cluster).
    • Yesterday the runs were switched over to rsh instead. Prefix may not be enabled by default, so LD_LIBRARY_PATH is not propagated by default.
    • libquadmath and libimf issues.
    • So many things are preloaded on the launching node before launch that rsh is problematic.
    • Cisco is using a somewhat older SLURM, 14.03 (just after 2.6, following the numbering scheme change).
  • We were planning to do a v3.0 RC last week, but that didn't happen.
  • We'll wait on an RC for two things:
    • Cisco MTT is a bit concerning. Doing an srun under the covers, it just hangs on all of master, 3.0, and 2.0.
      • Sporadic hangs (may be fixed now, but it's sporadic), and failures on ppc64le chips.
      • sbatch and mpirun both run fine when run manually.
      • Similar issue at UT: crank up the logging of slurmd or run it in the foreground.
      • Triple-check that it's not picking up daemons on the node.
      • Amazon is running under SLURM 16.?, though currently direct launching.
      • LLNL "gadget" is running okay.
    • Ralph's PR: sounds like a few more days.
  • PMIx 2.0
    • OMPI 2.0 has a bunch of event notification code in orte and opal.
    • OMPI 3.0 relies on PMIx for this, because PMIx now has event notification.
      • Debugger attach now flows through PMIx 2.0.
    • If Open MPI wants to support the PMIx 1.x series (via external) in OMPI 3.0, someone will have to write notification support (issue 3660).
      • It WILL compile with external PMIx 1.x, and will run many things, but without event notification support, you can't attach debuggers.
    • OMPI 3.0 (default configure) doesn't work with SLURM 17.x.
      • Default out of the box results in launch failures.
      • Don't know what it would take to fix this.
    • Direct Launch is all handled by SLURM, so PMIx isn't involved.
    • Brian will follow up with Ralph about running with earlier SLURM versions, but not SLURM 17.x. Might be a different issue.
  • Looks like we'll be a little late on v3.0.
  • Let's push back branching of the next release branch to July 13th, at the face-to-face.
  • v3.0 RC: will we hit Friday?
    • Ralph just has an issue with master PR 3696.
    • Failing in an munmap with an invalid pointer. Brian will try tonight.

MTT Dev status:


Exceptional topics

  • Face2Face Meeting-2017-07
    • Date: July 11-13 (9am Tuesday - noon on Thursday).
    • Cisco has booked space in Chicago.
      • Cisco has reserved some space right next to O'Hare (can get a shuttle to the hotel).
        • We have met there before.
      • Jeff will come in Monday evening.

Status Updates:

  • Amazon - bringing much more testing online, and CI processes.
    • v3.0.0 Release work
    • Improved Jenkins infrastructure. Hopefully some changes yesterday (in Jenkins setup at Amazon) will make it run a little faster.
  • Travis is now officially deactivated. No longer using Travis.

Status Update Rotation

  1. Amazon
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel
  4. LANL, Houston, IBM, Fujitsu

