WeeklyTelcon_20161129

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen IBM
  • Jeff Squyres Cisco
  • Artem Polyakov Mellanox
  • Josh Hursey IBM
  • Joshua Ladd Mellanox
  • Ralph
  • Ryan Grant
  • Sylvain Jeaugey
  • Howard

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.5
    • 10 open PRs on 1.10.5. (Newly changed in GitHub: look closely under the topic, which should say whether the PR has been approved.) 2 approved, 7 review required, and 1 pushed back.
    • The ones that are approved are urgent.
      • Schedule a release in January of 1.10.5.
    • Nathan's looking at a segv that occurs in PSM2, but not in PSM. He will create an issue after reproducing it.
    • This is not the known issue with PSM2; it is something about the interrupt handler.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

  • Known / ongoing issues to discuss

    • STAT Debugger: PR #2411.
      • Ralph added 2 more commits to his fork, but we need LLNL to test them (they're out for 1 week).
      • Not a blocker for 2.0.x (IBM can pull directly into Spectrum MPI).
    • Any other blockers for 2.0.2?
      • blocker: HColl Context Free (PR on 1.10.5, but Mellanox will PR to 2.0.x in next 2 days)
    • Coll_Lib_NBC - needs George's review. Adds thread protection for opal_lists. Josh says that George isn't sure if it's complete.
    • PR 2461 - in 2.x
    • If people are not testing with Async modex + _____, maybe they should.
      • For libraries that want all endpoints in Init, using PMIx_Dstore shows a 15% improvement.
  • Schedule - Looking to release 2.0.2 at the end of the week, if everything goes well.

  • PMIx update

    • Putting job data in the shared memory dstore.
    • There is a PR for this, which shows memory improvements.
    • Seeing some performance problems on Power Arch, where dstore is actually showing degradation.
    • Next week would be the earliest for a possible RC.
  • OMPI 2.1

    • THE blocking issue is PMIx.
    • The BSD patcher - Nathan's been asked to work on it. Graceful fail is fine.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0

Review Master MTT testing (https://mtt.open-mpi.org/)

  • No morning messages still. Need to pester Brian about it. Apparently not allowed to make changes until after the new year.
    • mail from our AWS instance is not getting to us.
  • Biggest failures we saw in 2.0.x and 2.1.x
    • OSHMEM - the BTL fix fixed a bunch of things, but there are still a few errors (segv, Put or Get to a location that is not registered).
      • Jeff will make a ticket for the few remaining OSHMEM failures.
  • Sylvain seeing a bunch of errors in master oob/ud components
    • Mostly timeouts. Not sure if they are hanging or just really slow.
  • Josh - turned on Jenkins testing at IBM, may result in timeouts. Using PGI on PPC64.

MTT Dev status:

  • Put up a PR for combinatorial executor. Still a bug in submitter.

  • Telcon tomorrow.

  • Face to Face in January - https://github.com/open-mpi/ompi/wiki/Meeting-2017-01

  • SC BOF

    • Should we do 2.2 or 3.0? Poll to the community.
      • 87% said go for 3.0.
    • Went way too long.
    • Bad time slot (not sure why): we only had half of the people we normally do.
  • PMIx update - Decided to do a PMIx 2.0 release (what was going to be PMIx 3.0) - January time frame.

  • libevent update - they have put out an RC for 2.1.7 (OMPI 2.x is on libevent 2.0)

    • 2 years of code changes, though most are not in our usage path.
    • Still some somewhat scary changes in the main path, so we need to test well and evaluate before adding to OMPI 2.x.
    • There is an external component for libevent, so there is that option.

Status Updates:


Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
