WeeklyTelcon_20190430

Geoffrey Paulsen edited this page May 7, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen
  • Jeff Squyres
  • Brian Barrett
  • George Bosilca
  • Michael Heinz (Intel) - Introducing Brandon
  • Brandon Yates (Intel)
  • Ralph Castain
  • Thomas Naughton
  • Todd Kordenbrock
  • Dan Topa (LANL)
  • Josh Hursey

not there today (I keep this for easy cut-n-paste for future notes)

  • Howard Pritchard [ABSENT! - Note for permanent record]
  • Matthew Dosanjh
  • Edgar Gabriel
  • Joshua Ladd
  • Akshay Venkatesh (nVidia)
  • Noah Evans (Sandia)
  • David Bernholdt
  • Jake Hemstad
  • Xin Zhao
  • Nathan Hjelm
  • Geoffroy Vallee
  • Matias Cabral
  • Aravind Gopalakrishnan (Intel)
  • Arm (UTK)
  • Peter Gottesman (Cisco)
  • mohan

Agenda/New Business

  • Introduce Brandon Yates (Intel) to cover while Michael Heinz is gone for 8 weeks.

  • Vector Datatype https://github.com/open-mpi/ompi/issues/5540

    • If you're using complicated data types for real things, it's important.
    • Should it be backported to release branches? Perhaps not, since only one customer has hit it.
    • Not a blocker for v4.0.1
    • George can push this fix.
  • Jeff opened a PR about ProBot - https://github.com/apps/stale

  • Should have someone begin working on

  • Issue: Some cases produce "Co-Authored-By" instead of "Signed-off-by": https://github.com/pmix/pmix-standard/pull/180/commits When someone proposes a change via a GitHub review and the author approves it, the new commit has a "Co-Authored-By". We may want to update our policy to also allow Co-Authored-By. ACTION: Jeff will email GitHub to ask how other teams deal with this.

  • Submodule Topic

    • Background - how to build the OMPI stack, moving to PRRTE. PMIx then becomes the infrastructure, but PRRTE needs to be able to stand alone.
    • Proposal to use submodules to implement
    • Concerns that we need to coordinate with the Allinea DDT and TotalView tools.
      • Reason that this doesn't work is that there's no MPIR interface in PRRTE.
      • So we need to either get PMIx interface into the tools and remove support for MPIR
      • This rips out an interface the tools depend on, but we cannot wait for them to catch up.
    • Concerns about using submodules:
      • One OPAL would move off to its own repo, and we'd have a reference.
        • A bot would watch that, and then it would file a PR, and a human would merge.
        • We MAY want to automate this at some point, but manually first.
      • Issue, someone locally makes a change to a submodule and commits locally, then bumps their parent repo's reference to point to that local change. If they push THAT, then other users won't have that submodule change.
        • CI catches this case, where someone accidentally pushes a submodule change.
      • Another challenge is that someone might not rev the submodule reference until right before a release.
      • For release branches, they should really point to a submodule release also.
    • New directory structure, will cause a lot of configury work.
      • Brian did some ugly prototyping in an hour, but not too bad.
    • How would this work for install?
      • just use --prefix and let each submodule install to the right place.
      • Passing --enable-debug across multiple projects is going to be a bit of a pain.
      • Since these projects share a similar lineage, the configure flags are similar for each component.
  • Figure 2 of document shows:

    • external -> opal -> HEAD, prrte -> HEAD, pmix -> HEAD, libevent -> v2.1.8-stable release, hwloc -> v2.0.3 release
    • In reality, opal depends on libevent. pmix depends on hwloc
    • How do we ensure that the dependencies are "compatible"?
    • If everyone has the same Jenkins driving them to update, issues should be transient.
    • PRRTE doesn't bundle libevent and hwloc, so OMPI is the only owner of bundling.
      • PRRTE only uses external
    • Don't have the "keep in sync" issue for anything but PMIx and OPAL.
    • OMPI currently uses HWLOC directly. Treematch code uses hwloc directly.
      • Most of code today doesn't use hwloc... just goes through pmix.
    • Two versions of OPAL, one for OMPI and one for PRRTE?
      • How do we ensure those are not incompatible?
      • Answer: Test a lot.
  • With submodules, if you have a patch that spans two pieces, you have to push the lower-level piece first, wait for that patch to get accepted (to get the hash, and for CI to finish), and then update the higher-level patch before pushing that.

  • Remember that, due to linkers, we need to keep OPAL's ABI stable.

  • Are we going to have official opal "releases", or just have everyone track master?

    • Yes, we want to do release branches of OPAL, and cut them at the same time.
    • This will make cherry-picking on release branches a bit tricky.
    • A fix that spans both ompi and opal will be complicated.
      • Brian will update document
    • Will there be a separate OPAL VERSION file?
      • Yes, and this is why release branches should be cut in both repos.
    • What to do about PMIx and PRRTE ? Do they get their own release branches?
      • That just triples the work, and doesn't mean we're converging on opal
      • No, just version the dependencies, and submodules will
  • If anyone has a problem with submodules fundamentally, please speak up now.

    • Just the normal knee-jerk reaction, but it looks like with good CI we can manage the risks.
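
The concern raised under "Concerns about using submodules" (a parent repo pointing at a submodule commit that only exists locally) can be sketched as a CI check. This is only an illustration, not the project's actual CI: the helper names `parse_gitlinks` and `find_unpublished` and the `external/...` paths are hypothetical. It relies on the fact that `git ls-tree -r` lists each submodule as a mode 160000 entry of type "commit"; how CI gathers the set of commits that exist upstream is left out.

```python
def parse_gitlinks(ls_tree_output):
    """Return {path: sha} for the submodule entries in `git ls-tree -r` output.

    Each line has the form "<mode> <type> <sha><TAB><path>". Submodules
    show up with mode 160000 and type "commit".
    """
    gitlinks = {}
    for line in ls_tree_output.splitlines():
        meta, _, path = line.partition("\t")
        fields = meta.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        mode, obj_type, sha = fields
        if mode == "160000" and obj_type == "commit":
            gitlinks[path] = sha
    return gitlinks


def find_unpublished(gitlinks, published):
    """Return the submodule paths whose pinned sha is not known upstream.

    `published` maps a submodule path to the set of commit shas that exist
    in that submodule's upstream repository (assumed to be collected by the
    CI job, for example after fetching the submodule's remote).
    """
    return sorted(
        path
        for path, sha in gitlinks.items()
        if sha not in published.get(path, set())
    )
```

A CI job could fail the PR whenever `find_unpublished` returns a non-empty list, which covers the "pushed the parent reference but not the submodule commit" case.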
  • Host Ordering fix to v3.0.x, v3.1.x, v4.0.x https://github.com/open-mpi/ompi/issues/6501

    • With --host (and hostfile) on the command line, the hosts were not kept in the order specified.
    • This fix went into master. Do we want to bring it back to release branches?
    • Everyone on call liked PRing this to release branches, but want to see what Brian and Howard think.
    • Not a backwards compatibility issue, since a specified ordering is a subset of a random ordering.
    • There is a PR for v4.0.x that Ralph and Jeff are iterating on. Unexpectedly large. Would be good to do this first.
    • Ralph is slammed but will try to find some time this week or next.
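
The compatibility argument above (a specified ordering is just one of the orderings an unordered implementation could already have produced) can be illustrated with a small sketch. This is not Open MPI's implementation: `hosts_in_given_order` is a hypothetical helper, and treating the argument as a plain comma-separated list of names is an assumption made for the example.

```python
def hosts_in_given_order(host_arg):
    """Return host names from a comma-separated string, in first-seen order."""
    seen = set()
    ordered = []
    for name in host_arg.split(","):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


requested = "nodeC,nodeA,nodeB,nodeA"
ordered = hosts_in_given_order(requested)

# The ordered result contains exactly the hosts an unordered set would,
# so code that never relied on any order sees nothing new.
unordered = {name for name in requested.split(",")}
same_hosts = set(ordered) == unordered
```

Here `ordered` keeps `nodeC` first because it was listed first, while `same_hosts` shows that the set of hosts is unchanged.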
  • OLD Gilles openib issue: https://github.com/open-mpi/ompi/pull/6152

    • No one had any thoughts on.
    • Would like Mellanox to chime in and let us know if it's needed in v4.0.x
    • No update.
  • Season of Documentation?

    • We have been denied. They received more than 200 applications.
  • OLD Noah described a new thread framework

    • Two bits of cleverness: static initializers, and the need for certain functions to be static inline.
    • Get an implementation-defined header that gets installed at configure time (similar to libevent's).
    • Two components: Argobots and pthreads.
    • Currently exclusive (only one component, since it installs a header at configure time).
      • Probably permanently.
    • Need to look at thread local storage.
    • Had to implement TLS on top of pthreads, Argobots has this already.
    • Request completion would be the most sensitive to having oversubscription.
    • Posted a Work-in-progress PR: https://github.com/open-mpi/ompi/pull/6578
      • certain types of applications want to schedule a finer grained task
      • Ex: OpenMP groups, some apps have shown some improvements.
    • If you have a really unbalanced problem that you can't fix with a mesh rebalance, then with a traditional MPI task per core the imbalance becomes a big problem. As in the Charm++ approach, don't think about task / thread mapping yourself; let the system map that. To do this well, you need to implement them correctly. Really want some threads (like libevent's) to be OS threads rather than user threads.
    • Do you have to use the same thread framework at all levels of the process?
      • You kinda want the MPI process to know how the user is using threads, but also want the libevent thread to be preemptively scheduled, not cooperatively scheduled (borrowing from Solaris terminology)
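
The preemptive versus cooperative point in the thread framework discussion can be illustrated with a toy scheduler. This sketch is unrelated to the actual framework in the work-in-progress PR, and all names in it (`run_cooperative`, `compute`, `progress`) are hypothetical. It uses Python generators as stand-ins for user-level threads: a task only gives up control when it yields, so a progress task gets a turn only at the compute task's yield points.

```python
from collections import deque


def run_cooperative(tasks, max_steps):
    """Round-robin over generator-based tasks; each task runs until it yields."""
    queue = deque(tasks.items())
    trace = []
    while queue and len(trace) < max_steps:
        name, task = queue.popleft()
        try:
            next(task)  # run the task up to its next yield
        except StopIteration:
            continue  # task finished; do not requeue it
        trace.append(name)
        queue.append((name, task))
    return trace


def compute(chunks):
    """Stand-in for application work split into chunks."""
    for _ in range(chunks):
        # A long computation would sit here; nothing else runs until the yield.
        yield


def progress(polls):
    """Stand-in for a progress task such as an event loop."""
    for _ in range(polls):
        yield


trace = run_cooperative({"compute": compute(2), "progress": progress(2)}, max_steps=10)
```

If `compute` never yielded inside a long chunk, `progress` would not run at all during that time, which is the reason given above for wanting a thread like libevent's to be a real OS thread.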

Minutes

Review v3.0.x Milestones v3.0.4

  • No new updates. A few more PRs went in.

Review v3.1.x Milestones v3.1.4

  • No new updates. A few more PRs went in.

Review v4.0.x Milestones v4.0.2

  • PR6508 - Fixes host ordering, but is quite large, and still not complete.
    • Fixing this brings in quite a bit of other things.
    • Problem with this is that it's a significant patch
    • Ralph won't have time in next few months.
    • Ask for help on this.

Master

  • PR6556 and 6621 should go to the release branches.
  • George sees regular deadlocks on vader for apps that send >2GB
  • PR6625 - Discussed if we want to take the pain of this PR.
    • Good PR to mop up removal.
    • In favor of cleanup, but nervous about changing the values of non-related enums and constants.
  • Good reminder that we now need to be careful about OPAL's ABI.

v5.0.0

  • Still don't have any release manager.

    • Need to identify someone in next few months.
    • Ralph is volunteering
    • Brian
    • Traditionally have one academic and one industry rep as release manager.
  • Still have one fundamental issue, do we do ORTE/PRRTE change for v5.0 or v6.0?

  • Schedule: Even if we want to do ORTE/PRRTE change NOW, it wouldn't get out until fall.

  • Meaning v6.0 wouldn't get out until summer of next year.

  • Schedule: May 2020 is Ralph's retirement.

  • If we do ORTE/PRRTE change in Open MPI v5.0 Fall of 2019, then we'll have more time from Ralph before he retires.

  • When will the MPI v4.0 standard be passed?

    • Next meeting is theoretically the last meeting, then 3 more meetings.
    • But one thing we WANT (Big Count) is not ready, so we're talking 5 meetings.
    • So possibly Sept 2020 (w/Big Count), but maybe May 2020 (without Big Count)
    • Don't need to couple our ORTE/PRRTE with MPI 4.0 standard
  • ORTE/PRRTE change does depend on new CI and submodule changes.

  • Submodule and new CI can be done before ORTE/PRRTE changes, and is in good shape.

    • Jeff, Brian and Howard have been discussing.
      • Need CI improvements first for safety-net.
  • Moving our Website to AWS

    • The SSL certificate that the University of Michigan bought us expires in June.
    • Will get new certificate from Amazon.
    • Email relay is changing from HostGator to an AWS service.
    • Shouldn't affect Documentation initiative.
    • AWS admin isn't too complicated.
  • Discussion of schedule depends on scope discussion

    • If we want to separate ORTE out for that, it would be a bit past summer.
    • Gilles has a prototype of PRRTE replacing ORTE
  • Want to open up release-manager elections.

    • Now that we're delaying, will decide at face2face.
  • A v4.1 from master is now a possibility

    • If we instead do a v4.1, some things we'd need fixed on master.
  • Will discuss more at the face-to-face.

  • Brian and Ralph are meeting on the 18th

  • Ralph is putting out a doodle to discuss

Master

  • Fortran Array

PMIx

  • v4.0.x is scheduled for sometime this summer
    • No schedule as of April 16
  • A few bugfix releases for v2 and v3 series. RC this week, and release sometime in April.
  • New standardization issue is destined for v5

ORTE/PRRTE

  • Take a look at Gilles's PRRTE work. He may have done SOME of that. He should have done that all in PRRTE layer, maybe just some MPI layer work remains.

MTT

  • IBM still has 10% failure rate and build issue. Please fix!!!

New topics

face to face -

  • Discussed next face to face
  • If we don't meet by Sept, it won't happen due to MPI Forum, EuroMPI, Supercomputing, holidays, etc.
  • Jeff will send out a doodle for Aug and Sept.

Review Master Pull Requests

  • didn't discuss today.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA
