WeeklyTelcon_20200128

Geoffrey Paulsen edited this page Jan 29, 2020 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Todd Kordenbrock (Sandia)
  • Jeff Squyres (Cisco)
  • Artem Polyakov (Mellanox)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Intel)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • Brian Barrett (AWS)
  • Michael Heinz (Intel)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Edgar Gabriel (UH)
  • Joseph Schuchart
  • Nathan Hjelm (Google)
  • Charles Shereda (LLNL)
  • David Bernhold (ORNL)
  • Noah Evans (Sandia)
  • George Bosilca (UTK)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Erik Zeiske
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Xin Zhao (Mellanox)
  • mohan (AWS)
  • Akshay Venkatesh (NVIDIA)

New Business

  • Coverity coverage for PRRTE

    • Brian is working on it, but it needs a current copy of PMIx as well.
  • Anything to do to make Cray CI more stable?

    • Some discussion last week.
    • Brian tried to update the Cray CI to do a shallow clone, but it ran into an issue with the submodule reference under a shallow clone, so he abandoned this effort.
  • Josh and Ralph have been working on PRRTE in general, stabilizing, etc.

    • finding the remaining issues.
    • Ralph is working on existing direct-modex issue.
    • Not blocking on OMPI info integration.
    • Only thing holding us back on committing.
  • Josh has been adding some PRRTE CI (PR49 in PMIX-TEST), which he wants to get in this week.

    • Adding some PRRTE tests (non-mpi)
    • Ralph mentioned adding a double get test.
  • In the future we can have some PRRTE tests that run in Open MPI.

    • Perhaps only run these when the PRRTE or PMIX submodule reference updates?
    • Those could be MPI based tests, and could do more.
  • Question: should we embed libfabric or make it a submodule?

    • Jeff gave history of why we embed libevent/hwloc/pmix
    • Cisco started distributing their own, which gives them more control over schedule and content.
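
The idea above of running PRRTE tests in Open MPI CI only when the PRRTE or PMIx submodule reference updates could be sketched as a simple path filter. This is a minimal illustration, not existing CI code: the submodule paths and the function name are hypothetical, and the list of changed paths is assumed to come from something like `git diff --name-only` against the target branch.

```python
# Hypothetical submodule locations; the real paths depend on the repository layout.
SUBMODULE_PATHS = ("prrte", "openpmix")


def should_run_prrte_tests(changed_paths, submodule_paths=SUBMODULE_PATHS):
    """Return True if any changed path is one of the watched submodule references."""
    # A submodule pointer update appears in a diff as a change to the
    # submodule's own path (a gitlink entry), not to files beneath it.
    watched = {p.rstrip("/") for p in submodule_paths}
    return any(path.rstrip("/") in watched for path in changed_paths)
```

A CI job could call `should_run_prrte_tests` on the changed paths of a pull request and skip the heavier MPI-based tests when it returns False.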

Release Branches

Review v3.0.x Milestones v3.0.6

Review v3.1.x Milestones v3.1.6

  • Jeff has another PR to put in and do another rc for both
  • Issue 7035 - may be a problem with old libfabric
  • Still need fix in ompio/api abstraction break. (7318)
    • RHEL 8 linker seems to be finding this.

Review v4.0.x Milestones v4.0.3

  • v4.0.3 in the works.

    • Put out v4.0.3rc1 over the weekend.
    • Schedule: End of January.
    • Try to get rc1 built this Friday
  • Howard PRed #7321 to v4.0.x

    • xpmem worked on v3.x, so don't think it needs cherry-picking back.
    • Nathan to see if these fixes are relevant on 3.0.x and 3.1.x
  • Issue 7220 - vader not cleaning up properly (vader backing files).

    • The v3.x series uses PMIx 2.x (can't register cleanup files).
      • Nathan: was the old workaround that all processes unlink after add-procs?
      • No longer doing this because the files moved from /tmp to /dev/shm (v3.0?)
        • Moving back would bring up more bugs for users with a very small /tmp.
    • v4.0.x uses PMIx 3.x, and CAN register files for cleanup.
      • The SIGTERM path forgets to call the PMIx interface to clean up registered files.
      • Files in the session directory are always cleaned up, but these files are in /dev/shm.
  • Issue 6960 (closed) had something cherry-picked to release branch, but it's still not fixed.

    • Configuring --enable-ipv6 shouldn't preclude IPv4.
    • Do we need to cherry-pick 6964 back into v4.0.x?
    • Fix this in PRRTE.
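
For Issue 7220 above, the old workaround Nathan mentioned (every process unlinks the backing file once all peers have attached) relies on standard POSIX behavior: an existing mapping stays valid after its file is unlinked, so nothing is left behind even on abnormal termination. A minimal single-process sketch of that behavior follows. The function name and file prefix are hypothetical, and this is not the vader code itself, where one process creates the file and its peers attach by path before anyone unlinks.

```python
import mmap
import os
import tempfile


def map_then_unlink(size=4096, directory=None):
    """Create a backing file, map it, then unlink it while keeping the mapping."""
    fd, path = tempfile.mkstemp(prefix="backing_sketch_", dir=directory)
    try:
        os.ftruncate(fd, size)         # give the file its final size
        segment = mmap.mmap(fd, size)  # shared, read/write mapping by default
    except BaseException:
        os.unlink(path)                # do not leak the file if setup fails
        raise
    finally:
        os.close(fd)                   # the mapping does not need the descriptor
    # In a multi-process setting, unlinking is only safe once every peer has
    # attached; after that, the name is no longer needed.
    os.unlink(path)
    return segment, path
```

The returned mapping remains readable and writable even though `path` no longer exists, which is why no cleanup registration is needed once the unlink has happened.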

v5.0.0

  • Schedule: April 2020?
    • Geoff will update the milestone.

Face to face

  • Portland, Oregon, Feb 17, 2020.
  • Please register on Wiki page, since Jeff has to register you.
  • Date looks good. Feb 17th right before MPI Forum
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • About a 20-30 minute drive from the MPI Forum; attendees will probably need a car.

Infrastructure

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • PMIx v3.1.5 is probably NOT in January.

ORTE/PRRTE

  • Been working on PRRTE

  • Strange issue: we pull libevent and hwloc into opal statically, but PMIx links against libopal to get access to these components. Even with name shifting (under opal names), it can call down into opal. A call to pmix_error_log ended up in opal_output with an uninitialized hostname, which segfaults.

    • Need to find a way to link directly to pmix, hwloc,
    • even have disable-dlopen set.
    • Problem: we want one process (separate from the MPI process, i.e. prrte) that calls prrte_init, and it ends up linking in opal, because that's the embedded code.
    • How should we split these out?
      • Make libtool convenience libraries of them.
      • prrte, rather than linking against libtool, links against the convenience libraries.
      • The convenience libraries then just get pulled into the code.
      • Where this fails is that you can't link against both these convenience libraries and libopal?
    • configury? doesn't prrte need to know if we're linking embedded or external?
    • Brian will write up some thoughts on this on Friday.
  • ORTE-removal/PRRTE PR is ready to be committed.

    • Mellanox CI is still failing on OSHMEM.
      • Yes, this got resolved. The segfault they were seeing is exactly the strange issue above.
    • Hand testing is looking fine.
    • using an ORTE parameter, and then OSHMEM then fails because dir doesn't exist or wrong permissions.
  • Still a bunch of things to do after this PR goes in.

  • Singleton comm-spawn... how do we make this work? - PMIx understands it.

    • Do we need to support singleton comm-spawn starting the PRRTEs?
    • Now that we will support a persistent infrastructure, maybe we just require users to start it first.
  • Address comm-spawn issues that have been raised.

MTT


Back to 2019 WeeklyTelcon-2019
