WeeklyTelcon_20170214

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Brian Barrett
  • David Bernholdt
  • Geoffroy Vallee
  • Howard
  • Josh Hursey
  • Nathan Hjelm
  • Ralph
  • Ryan Grant (SNL)
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

  • Ralph put out an RC, but didn't release.
  • Ralph will release today.
  • Handful of pull requests, some have Do Not merge labels, some need reviews.
  • No schedule yet
  • The only possible candidate so far is the SGE bug.

PMIx update

  • PMIx 1.2.1rc1 was released last Friday.
  • Filed a PR so people could pick it up and test it.
  • Put a one-week timeout on it, but the PMIx community is happy with it.
  • One small code-cleanup item left.
  • Nathan would like to give it a try at scale again.
  • Is the OS X tempdir thing going to be in there?
    • No, the problem is that it's in the TCP piece, which brings in the MCA code. The only way to avoid that would be to port the TCP component into standalone code.
    • So for PMIx 2.0 it's also a big piece of work? Well, it's all contained within the PMIx component.
    • Howard - Thought we'd PR it into Open MPI v2.1 and then discuss how much code change it is.
    • In PMIx 2.0 there are a lot of new files in the PMIx component.
    • Proposal would be to let the PMIx 2.0 external component go into a new OMPI v2.2.0.
    • Sounds like it would be a whole lot of work on the 2.x branch to bring PMIx back to it.
    • One of the main differences is that in OMPI 2.x we handle events, so that doesn't get into the PMIx code. PMIx 2.0 doesn't handle events.
    • The Mac issue alone is probably not enough justification to backport PMIx 2.0 to OMPI 2.x, but IBM wants to do this for the 2.x stream.
      • Could improve the Mac error message, so it's easier to google and find the fix.
      • Orted doesn't know what the error was, but could put a show_help in PMIx code before it returns the error.
      • Ralph will backport the opal_show_help message to PMIx 1.2.1 today.
  • Should OMPI enable dstore before or after PMIx 1.2.1 update?
    • Really must enable dstore after PMIx 1.2.1 goes in.
  • PMIx 1.2.1 is the blocker, so if anyone else wants their content in, get them done.
  • Ralph killed a bunch of his 2.x PRs because he couldn't remember the details, especially around scaling.
  • hwloc performs horribly on KNL. Can we do something to tell if we're bound without loading hwloc?
    • MPICH is seeing this too, but fixed it internally on CH4.
    • hwloc 2.0 doesn't exist yet, it's just their master.
    • If we could get a solution for this, it'd be nice in 2.1 or 2.2
  • Reviewed a few open issues with the v2.x milestone.
    • Comm_Info - a few code changes needed, and a new proposal to talk about at the MPI Forum in 3 weeks.
      • The pushback from the MPI Forum has always been that this should be discovered through MPI_T (no Fortran interface).
  • IBM proposed ditching the idea of a v2.2, and instead starting our date-based releases (Feb, June, Oct) one cycle earlier for a v3.0.
    • This would mean we'd have to branch a v3.0 soon (this month), and stop accepting new features into v3.0. This would then put the new v3.0 release in October. This would also mean NOT needing to backport many of the master items to a v2.x branch.
    • The reaction was generally favorable, though we need time to think about it and discuss it some more next week.
    • We'll discuss again next week if we want to go the v2.2 route or if we want to go v3.0 route.
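
On the hwloc/KNL question above (telling whether we're bound without loading hwloc), one possible approach on Linux is to compare the process affinity mask against the kernel's list of online CPUs. The following is only a minimal sketch of that idea, not what Open MPI or MPICH actually does. The helper names (`parse_cpulist`, `is_bound`, `process_is_bound`) are hypothetical, and it assumes a Linux system exposing `/sys/devices/system/cpu/online`. Note that a cgroup/cpuset restriction would also show up as "bound" here.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def is_bound(allowed, online):
    """Assumption: 'bound' means the affinity mask excludes some online CPU."""
    return bool(set(online) - set(allowed))


def process_is_bound(pid=0):
    """Check the calling process (pid=0) without touching hwloc. Linux-only."""
    allowed = os.sched_getaffinity(pid)
    with open("/sys/devices/system/cpu/online") as f:
        online = parse_cpulist(f.read())
    return is_bound(allowed, online)


if __name__ == "__main__":
    print("bound" if process_is_bound() else "not bound")
```

This only reads one small sysfs file and makes one system call, so it avoids the full topology discovery that is slow on KNL; it does not say anything about which cores or NUMA domains the binding covers.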

  • Last snapshot tarball build was done on the 11th.
    • Jeff thought he fixed this over the weekend.

MTT Dev status:


Exceptional topics


Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu
