WeeklyTelcon_20160105

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Geoffrey Vallee
  • Geoff Paulsen
  • Nathan Hjelm
  • Sylvain Jeaugey
  • Todd Kordenbrock
  • Ralph C.

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
  • Looks good in MTT world.
    • Jeff's cluster had some timeouts on the 10.10 network (no route to host), possibly a cluster configuration problem. Ignore for now. Odd, because "no route to host" should FAIL, not time out. This is in the TCP BTL and may be a multirail issue: it loops indefinitely rather than failing.
    • Don't block 1.10 release since probably not common use case.
  • Paul found some issues.
  • NAG Fortran support in configure isn't right.
    • NOT a regression (in 1.8). Do we care if this is a blocker?
  • mpirun hangs after a successful run. Only on SLES, and on Cray (which uses SLES).
    • proc is defunct / zombied.
    • IO Forwarding file descriptors may not be getting HANGUP.
    • Not a regression: present in 1.10, 2.0, and master. ORTED is in the event library. Pretty serious, but hoping it's just a SLES issue.
      • Set state machine verbosity to 5.
      • Nathan will look at SLES 11 and SLES 12 (very different, even the kernels differ).
      • Try to find out whether the cause is SIGCHLD or a file descriptor.
    • Hold 1.10.2 release until Nathan runs tests today.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • Howard is out, but talked to Jeff.
    • Would like to close the door on new features, and focus on bugfixes.
    • Mellanox has finished UCX to take advantage of Modex Stuff. Will be adding to 2.0 branch soon.
    • Nathan has a bugfix. Currently can't run with a power-of-two number of ORTEDs.
    • Debugger Attach - DDT works, because it uses a different mechanism.
      • One option is to restore RML and bring back the usock component solely for the one time a single message is sent to rank 0.
      • Or use PMI-x notification system.
      • Consensus was to move forward to PMI-x. Ralph is moving forward integrating this.
    • PMI-x 1.1.2 in 2.0 branch now. Would need PMI-x 1.2 in OMPI.
      • Geoff Paulsen will see if we can get some of Dave Solt's time to help expedite this.
    • Discuss OpenIB Progression - Issue 1252: https://github.com/open-mpi/ompi/issues/1252
      • Network atomics are not necessarily visible to / interchangeable with CPU atomics.
      • Progress issue: Nathan proposed adding a decay function to progress-function dispatch, so that components that are not progressing anything naturally drop in priority.
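The decay idea from the progress discussion can be illustrated with a small sketch. This is not Open MPI code: the notes do not say what form the decay would take, so exponential backoff is assumed here, and `ProgressEntry`, `progress_pass`, and `MAX_INTERVAL` are hypothetical names invented for the illustration.

```python
MAX_INTERVAL = 64  # hypothetical cap so an idle component is still polled sometimes


class ProgressEntry:
    """One registered progress callback plus its decay state (illustrative only)."""

    def __init__(self, name, callback):
        self.name = name
        self.callback = callback  # returns the number of events it completed
        self.interval = 1         # call the callback every `interval` passes
        self.countdown = 1        # passes left until the next call


def progress_pass(entries):
    """Run one pass over all entries and return the total events completed."""
    total = 0
    for entry in entries:
        entry.countdown -= 1
        if entry.countdown > 0:
            continue  # decayed entry: skip it on this pass
        completed = entry.callback()
        total += completed
        if completed > 0:
            entry.interval = 1  # active again: restore full priority
        else:
            # Idle: assumed exponential backoff, bounded by MAX_INTERVAL.
            entry.interval = min(entry.interval * 2, MAX_INTERVAL)
        entry.countdown = entry.interval
    return total
```

Under these assumptions, a component that never has work is polled less and less often, while a component that completes anything returns to being polled on every pass. The cap keeps a long-idle component from being starved entirely.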

Review Master

General Discussion

  • Debian/Ubuntu package support
    • Ubuntu doesn't have a maintainer for Open MPI anymore. The package is not officially "orphaned".
    • Then when it gets adopted, we could adopt it. Nathan has been
    • Old maintainer has a repo, and a bunch of patches, which no one in community has ever looked at.
    • Sent directions to Ralph on his directions, but quite complex.
    • Send a request that the package be correctly orphaned.
    • Geoffrey Vallee is willing to take over as official maintainer.

MTT status:

Status Updates:

  • Cisco - nothing OMPI specific to report. Please go sign up for face to face on wiki.
  • ORNL - MTT - running. Announced today that they'd be picking up Debian package maintenance of Open MPI.
  • NVIDIA - got MTT back to normal, or close to it. A couple of things fail when enabling GPU Direct RDMA. Has something to do with atomic operations.
    • Can turn off atomic operations via an MCA parameter. Look at the bit flags in the ompi_info output for the openib BTL.
    • Turn off the fetching ops and atomic ops (find the bit values, calculate new flags without those bits, and reset).
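The flag arithmetic described for NVIDIA's workaround can be sketched as follows. The bit values and the starting flags word below are hypothetical placeholders, not Open MPI's real values; the real ones have to be read from the ompi_info output for the openib BTL.

```python
# Hypothetical bit values for illustration only; take the real ones
# from the ompi_info output for the openib BTL.
FLAG_ATOMIC_OPS = 0x0100   # assumed bit for atomic operations
FLAG_ATOMIC_FOPS = 0x0200  # assumed bit for fetching atomic operations


def clear_flags(current, *bits):
    """Return `current` with every bit in `bits` turned off."""
    mask = 0
    for bit in bits:
        mask |= bit
    return current & ~mask


current_flags = 0x0317  # hypothetical value reported for the flags parameter
new_flags = clear_flags(current_flags, FLAG_ATOMIC_OPS, FLAG_ATOMIC_FOPS)
print(hex(new_flags))  # value to set the MCA flags parameter to
```

The resulting value would then be passed back as the new setting of the BTL's flags MCA parameter, leaving all other capability bits unchanged.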

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

Back to 2015 WeeklyTelcon-2015