
WeeklyTelcon_20190416

Geoffrey Paulsen edited this page Apr 16, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen
  • Jeff Squyres
  • Howard Pritchard
  • Josh Hursey
  • Joshua Ladd
  • Matthew Dosanjh
  • Ralph Castain
  • Todd Kordenbrock
  • Michael Heinz (Intel)
  • Edgar Gabriel
  • Thomas Naughton

Not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (nVidia)
  • Brian Barrett
  • Dan Topa (LANL)
  • George Bosilca
  • Noah Evans (Sandia)
  • David Bernholdt
  • Mike Heinz (Intel)
  • Jake Hemstad
  • Xin Zhao
  • Nathan Hjelm
  • Geoffroy Vallee
  • Matias Cabral
  • Aravind Gopalakrishnan (Intel)
  • Arm (UTK)
  • Peter Gottesman (Cisco)
  • mohan

Agenda/New Business

  • Vector Datatype https://github.com/open-mpi/ompi/issues/5540

    • If you're using complicated data types for real things, it's important.
    • Should it be back-ported to release branches? Perhaps not, since only one customer has hit it.
    • Not a blocker for v4.0.1
  • Jeff opened a PR about StaleBot - https://github.com/apps/stale

  • We should have someone begin working on it.

  • Issue: some cases produce Co-Authored-By instead of Signed-off-by: https://github.com/pmix/pmix-standard/pull/180/commits
    • When someone proposes a change via a GitHub review and the author accepts it, the new commit carries a "Co-Authored-By" trailer.
    • We may want to update our policy to also allow Co-Authored-By.
    • ACTION: Jeff will email GitHub to ask how other teams deal with this.

  • Host Ordering fix to v3.0.x, v3.1.x, v4.0.x https://github.com/open-mpi/ompi/issues/6501

    • With --host (and hostfile) on the command line, the hosts were not used in the order specified.
    • The fix went into master. Do we want to bring it back to the release branches?
    • Everyone on the call liked PRing this to the release branches, but wants to see what Brian and Howard think.
    • Not a backwards-compatibility issue, since a specified ordering is a subset of a random ordering.
    • There is a PR for v4.0.x that Ralph and Jeff are iterating on. It is unexpectedly large. Would be good to do this first.
  • OLD Gilles openib issue: https://github.com/open-mpi/ompi/pull/6152

    • No one had any thoughts on it.
    • Would like Mellanox to chime in and let us know if it's needed in v4.0.x
    • No update.
  • Season of Documentation?

    • Due: April 23rd.
    • If anyone has any ideas on our website other than start from scratch, please let Howard know.
    • Everyone with experience from Open MPI's past work with professional tech writers is gone.
      • Sun wrote our man pages many years ago. Jeff Squyres did the proofreading.
    • Jeff did all of the Open MPI web pages.
    • Season of Documentation stipend. Would Open MPI want to accept the stipend?
      • Yes, we'd accept as long as it comes as a donation, not a pay for services.
      • Well, a stipend implies taxable income... which would be a no.
    • Ralph and Jeff are contacts for "Software in the Public Interest"
    • Jeff and Howard will share via box folder.
    • OLD:
      • In libfabric they write man pages in Markdown, and then in make dist they convert them to nroff.
      • For user-facing APIs, they use Sphinx - converting Markdown in comments to user-facing HTML man pages.
  • OLD Noah described a new thread framework

    • Two bits of cleverness: static initializers, and needing certain functions to be static inline.
    • Get an implementation-defined header that gets installed at configure time (similar to libevent).
    • Two components: Argobots and pthreads.
    • Currently exclusive (only one component, since it installs a header at configure time).
      • Probably permanently.
    • Need to look at thread local storage.
    • Had to implement TLS on top of pthreads, Argobots has this already.
    • Request completion would be the most sensitive to having oversubscription.
    • Posted a Work-in-progress PR: https://github.com/open-mpi/ompi/pull/6578
      • certain types of applications want to schedule a finer grained task
      • Ex: OpenMP groups, some apps have shown some improvements.
    • If you have a really unbalanced problem that you can't fix with a mesh rebalance, then with a traditional MPI task per core the imbalance becomes a big problem. Like the Charm++ approach: don't think about task/thread mapping yourself; let the system map it. To do this well, you need to implement the threads correctly. You really want some threads (like the libevent thread) to be OS threads, rather than user threads.
    • Do you have to use the same thread framework at all levels of the process?
      • You kinda want the MPI process to know how the user is using threads, but also want the libevent thread to be pre-emptively scheduled, not co-operatively scheduled (borrowing from Solaris terminology)

Minutes

Review v3.0.x Milestones v3.0.4

  • Shipped v3.0.4 Yesterday
  • Drivers for v3.0.5:
    • Fix hostlist ordering
    • Fix for persistent sends memory leak

Review v3.1.x Milestones v3.1.4

  • Shipped v3.1.4 Yesterday
  • Drivers for v3.1.5:
    • Fix hostlist ordering
    • Fix for persistent sends memory leak

Review v4.0.x Milestones v4.0.2

  • In the midst of the macOS reproducer issue.
  • Issue 6568:
    • Reproduced for Jeff on 4.0.1 and 4.0.x.
  • Jeff filed a PMIx issue and a tracker issue in OMPI (6595).
    • Only happening on ompi master, not the release branches.
  • MPI persistent sends cause a leak in ob1.
    • Jeff tracked it back to v2.x
    • Fujitsu root caused it.
    • We don't have a fix yet.
    • Not a blocker.
    • Should pull future fix to v3.0.x and v3.1.x, and maybe to v2.x (but not release)
  • Josh Ladd please review PR6152 - delay UCX warning to add_procs
  • Josh Hursey please review PR6508 - ensure nodes are always used in order.
    • This was being held for post v4.0.1 due to size
    • Ralph, was there any reason to add anything to PR6508?
      • Yes, it's not quite right.
    • This still
  • Vader - cleanup, fixes the problem. Intel folks have verified that.
  • PR6508 - Fixes host ordering, but is quite large.
    • Fixing this brings in quite a bit of other things.
    • Problem with this is that it's a significant patch
    • Action: Ralph will look at it.
    • This might force

v5.0.0

  • Schedule: Delaying post Summer ***
  • Discussion of schedule depends on scope discussion
    • If we want to separate ORTE out, that would be a bit past summer.
    • Gilles has a prototype of PRTE replacing ORTE.
  • Want to open up release-manager elections.
    • Now that we're delaying, will decide at face2face.
  • A v4.1 from master is now a possibility.
    • If we instead do a v4.1, some things would need to be fixed on master.
  • Will discuss more at the face-to-face.
  • Brian and Ralph are meeting on the 18th
  • Ralph is putting out a doodle to discuss

Master

  • Fortran Array

PMIx

  • v4.0.x will come sometime this summer.

    • No schedule as of April 16
  • A few bugfix releases for v2 and v3 series. RC this week, and release sometime in April.

  • New standardization issue is destined for v5

  • Take a look at Gilles's PRTE work. He may have done SOME of that already. He should have done it all in the PRTE layer, so maybe just some MPI-layer work remains.

MTT

  • IBM still has a 10% failure rate and a build issue. Please fix!!!

New topics

  • MPI Forum - nothing too substantial. MPI_Sessions is getting a lot of traction; the goal is to get it done by the next meeting. It needs a reading, then a vote, then another vote, so MPI Next would land in 2020. Also language bindings and some crazy proposals.
  • Read MPI Forum link here: https://www.mpi-forum.org/

face to face -

  • how do we get more participation, and make MTT more meaningful?

Review Master Pull Requests

  • didn't discuss today.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
