WeeklyTelcon_20161213

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Artem Polyakov
  • Jeff Squyres
  • Brian Barrett
  • Howard
  • Jimmy - SPI representative.
  • Josh Hursey
  • Josh Ladd
  • Nathan Hjelm
  • Ralph
  • Todd Kordenbrock (HPE @ Sandia)

Agenda

SPI (Jimmy)

  • Introductions.
  • Can we use their 501(c)(3) non-profit to gain some status?
    • One difference: with SPI, Open MPI would keep its current legal status and would just be associated with SPI.
      • With Conservancy, Open MPI would be an activity of the Conservancy.
    • Would be reasonable to request non-profit
    • GitHub may be willing to grant an organization non-profit status through SPI.
      • Jimmy doesn't see a meaningful difference between SPI and Conservancy.
    • If we join within 60 days of Nov 15th (Ralph is liaison).
  • Discussion
    • We probably only need SPI's services; Conservancy provides more.
    • When started this process, neither organization would reply to Ralph for 6 months.
    • Joining SPI does not make us part of their organization.
    • Conservancy would prefer that we have more formal processes.

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.5
    • Pressing need to release 1.10.5
      • Waiting on PR from Nathan, then will create RC.
        • The master fix is correct, but it has to be backported to 1.10.5.
        • Nathan's users want a release by the end of the week.
    • Added regression test for darray bug.
    • Mathias: PSM2 is not setting the one-sided bits correctly.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

  • Known / ongoing issues to discuss

    • Darray Datatype issue - 2.0.2 - do a minor point release
    • Early termination is not handled correctly - 2550 - Ralph fixed already. 2552, 2553 (Jeff will clean up).
    • osc_pt2pt wrong answer - 2505.
      • IBM has a 1 line fix. Mark thinks there is another issue in lock-all.
        • Nathan: that sounds like it could be it. Can call Fence, but either in an epoch or not in an epoch. When you try to do a true extent, we return the wrong extent, and wrong lower bound. OMPI was seeing true
  • PMIx update

    • Last changes went in. Josh is rolling a new RC.
    • Josh will update a PR for the v2.x branch.
    • Should improve memory usage, but not yet ideal.
    • Fuzzy estimate: end of January.
    • Strings on KNL are 40KB, and 80KB (per remote peer). This is not fixed in this RC.
      • If we do compression, then have to do changes in OMPI. Currently clients don't free it. If we return
      • Not sure if we want compression for all strings... for example hwloc output gets put into shared memory.
    • Josh and Artem feel that mid-January is realistic for PMIx 1.2 plus integration in Open MPI v2.1.0.
    • Fujitsu was excited about this change. Things should get much much better.
      • Fujitsu gets credit for investigating how bad this issue was. Thanks!
    • Artem has a PMIx perf tool (in contrib of the PMIx sources). It measures memory consumption.
      • Nathan is using an MPI memory usage test. It calls MPI_Init, does some collectives, and then reports process and node memory usage.
  • OMPI 2.1

    • THE blocking issue is PMIx.
    • Focus now is OMPI 2.0.2.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
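
For readers unfamiliar with the "true extent" and "lower bound" mentioned in the osc_pt2pt discussion above, here is a minimal sketch of how the MPI standard defines them for a datatype's type map: the true lower bound is the smallest displacement, the true upper bound is the largest displacement plus element size, and the true extent is their difference. This is not Open MPI's implementation; the function name `true_bounds` and the sample type map are illustrative assumptions.

```python
from typing import List, Tuple


def true_bounds(typemap: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (true_lb, true_extent) in bytes.

    `typemap` is a list of (displacement, size) pairs, one per basic
    element of the datatype. Explicit lb/ub markers are ignored, which
    is what distinguishes the *true* extent from the ordinary extent.
    """
    if not typemap:
        raise ValueError("type map must contain at least one entry")
    true_lb = min(disp for disp, _ in typemap)
    true_ub = max(disp + size for disp, size in typemap)
    return true_lb, true_ub - true_lb


# Hypothetical type map: three 8-byte elements at byte offsets 16, 40 and 64.
example = [(16, 8), (40, 8), (64, 8)]
lb, extent = true_bounds(example)
print(lb, extent)  # 16 56
```

A regression test for this class of bug could compare values like these against what MPI_Type_get_true_extent returns for the same layout.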

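The memory usage test mentioned in the PMIx update reports process and node memory after MPI_Init and some collectives. The sketch below covers only the reporting half, and it is not Nathan's or Artem's tool. It assumes a Linux system where /proc/self/status provides `VmRSS` and /proc/meminfo provides `MemTotal` and `MemAvailable`; the function names are hypothetical, and the MPI calls are omitted so the example runs without an MPI installation.

```python
import re
from typing import Dict


def parse_kb_fields(text: str) -> Dict[str, int]:
    """Parse lines of the form 'Name:   1234 kB' into {name: kilobytes}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s+kB\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def memory_report(status_text: str, meminfo_text: str) -> Dict[str, int]:
    """Combine per-process and per-node figures into one report (in kB)."""
    status = parse_kb_fields(status_text)
    meminfo = parse_kb_fields(meminfo_text)
    return {
        "process_rss_kb": status["VmRSS"],
        "node_used_kb": meminfo["MemTotal"] - meminfo["MemAvailable"],
    }


# Sample text standing in for the real files, so the sketch runs anywhere.
# On Linux, pass open("/proc/self/status").read() and
# open("/proc/meminfo").read() instead.
sample_status = "Name:\ttest\nVmRSS:\t   20480 kB\n"
sample_meminfo = "MemTotal:  1000000 kB\nMemAvailable:  750000 kB\n"
print(memory_report(sample_status, sample_meminfo))
```

In a real test, each rank would call this after its collectives and the per-rank values would be gathered for a per-node summary.
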
Review Master MTT testing (https://mtt.open-mpi.org/)

  • Still no morning messages. Need to pester Brian about it. Apparently not allowed to make changes until after the new year.
    • Mail from our AWS instance is not reaching us.
  • Biggest failures we saw in 2.0.x and 2.1.x
    • OSHMEM - the BTL fix resolved a number of things, but a few errors remain (segfaults; Put or Get to a location that is not registered).
      • Jeff will make a ticket for few remaining OSHMEM failures.
  • Sylvain seeing a bunch of errors in master oob/ud components
    • Mostly timeouts; not sure whether the tests are hanging or just really slow.
  • Josh - turned on Jenkins testing at IBM, may result in timeouts. Using PGI on PPC64.

MTT Dev status:

Next face-to-face

Status Updates:


Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

