
WeeklyTelcon_20210427

Geoffrey Paulsen edited this page May 4, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Naughton III, Thomas (ORNL)
  • Sam Gutierrez (LANL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • David Bernhold (ORNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Joshua Ladd (nVidia/Mellanox)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia/Mellanox)

New Items

  • Tommy is taking over for Josh Ladd for short-term.
    • Please send Mellanox items to him.
    • He will also help with v5 RM work.
  • Howard was trying to build the most recent OSU benchmark; it doesn't build cleanly against master and v5.
    • Howard didn't have mpicxx or mpicpp
    • If this is an actual issue, assign this to Jeff.
    • Also, Joseph set CC but not CXX in the environment, and the C++ wrapper wasn't being built.
      • Didn't dig in...
      • This could be correct behavior even if it's unexpected.
  • Issue 8850 static linking blocker for v5
    • Need to talk to Brian
    • 8860 is related - Howard
  • Issue 8925: MPI apps hang if the runtime decides to kill the job.
    • The PMIx event is not processed properly, so the job is not torn down.
    • Need to talk this through with the Fault Tolerance folks.
    • Blocker for v5

4.0.x

  • We're still waiting on Datatype issues now reported in v4.1.1
    • Issue 8856
    • Howard took the DT fix and created a PR
  • Need an explanation for PR 8810
    • Hessam contacted Artem; it's a work in progress.
  • Follow up 8818 on datatypes
    • Is this also blocker?
    • No.

v4.1.x

  • Raghu has left AWS.
  • Brian is stepping up for v4.1.x RM work
  • v4.1.1
  • Released over the weekend. Got George's datatype fix.
  • Brian and Jeff did a bunch of testing, and were happy with it.
  • Unfortunately, two different folks reported a partial roundoff error (#8856)
  • George spent a lot of time trying
  • Holding off on merging v4.1.x PRs until we get a better understanding of #8856

v5.0.0

  • Still haven't done the alpha, and won't until we get the cherry-picks from master.

  • Austen, Tommy, and Geoff will cherry-pick the "easier" ones.

  • Issue #8652 RDMA performance problem.

  • This is more of an enhancement than a blocker-severity issue.

  • Not a blocker, just an issue with the way the user ran.

  • If there's a mode that we know has bad performance, useful to call out in UCX section of docs.

  • Issue 8776 - libevent confusion if running with external 3rd party tools

  • PR 8792 - Need to move this over to v5.0.x

  • Need to check with Brian if this is relevant on v4.0 or v4.1

  • compile with --disable-dlopen, or slurp in all of the plugins.

  • 3 line change, should be small work.

  • Not a linker error, job just hangs and fails, really might want on v4.0 and v4.1

  • PR 8799 - should probably be PRed to v5.0

  • Howard's concerned about how these package-specific config lookups feed into the way that mpicc is linked (for example, on Cray).

  • mpicc --show - shows some long dependencies.

  • Just let him know on the ticket.

  • Howard will update the ticket.

  • Docs - Man pages will be included in this effort.

  • Likely include nroff and HTML in the tarball (so users don't need Sphinx, and don't need internet)

  • If this doesn't make v5.0.0, it can go into a later release.

  • Packagers need some advice and a README; a few more weeks at minimum.

Master

  • 8808 - same memory backing file.
  • what is the failure profile for this?
  • Rare. If two users are sharing a node and a failed job leaves its backing files behind, another user who tries to create the same backing file can conflict with it. So we add the user-id to give a little more safety against conflicts.
  • This does mean that there's a cleanup issue for shared memory files.
  • The only reason this comes up is that we moved the backing file out of /dev/shm.
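
The conflict described above can be illustrated with a minimal sketch. This is not Open MPI's actual code: the function names, the file name layout, and the use of a job id and rank in the name are all hypothetical, chosen only to show why adding the user-id separates two users whose jobs would otherwise pick the same backing file name.

```python
import os


def backing_file_path(base_dir, job_id, rank, uid=None):
    """Build a per-user backing file path (hypothetical naming scheme)."""
    if uid is None:
        uid = os.getuid()
    # Including the uid keeps a stale file left behind by another user's
    # failed job from sharing a name with the file we want to create.
    return os.path.join(base_dir, f"shm_backing.{uid}.{job_id}.{rank}")


def create_backing_file(path, size):
    """Create the backing file, failing if a file with that name exists."""
    # O_EXCL turns a name collision into an error instead of silent reuse.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)
    return path
```

With this layout, two users with the same job id and rank get different paths, while a stale file from the same user still collides, which is the cleanup issue noted above.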

Reformatting master

  • PR 8816
    • Would like Nathan to rebase and merge to master.
    • Certain blocks we don't want to format (specifically some in datatype)
    • Joseph saw that in OPAL code, some copyright headers got scrambled.
      • clang format trips over

PMIx

  • Something going on in PMIx v4.x branch around tools interface
    • Relabel v4.x as v4.1 and then create a new v4.x without some tools interface.
    • Shouldn't

PRRTE v2.0

  • No update

Some outstanding work for the way that OMPI calls PRRTE configure.

  • Also some changes with libcurl, especially since this breaks the OMPI build.
  • PMIx can interface with REST interfaces (via libcurl)
    • JSON
    • Build system issue in PMIx when we changed to static DSOs.
    • Think this has been resolved

issue 8801 - mpirun and prefixing.

  • Jeff and Ralph and Yosi had a good conversation
  • Lengthy discussion; the summary is that it's a work in progress.
  • Ralph is working on this.

MTT

  • Need to look at the public tests repo for merging in both ULFM and Sessions tests.
  • Howard and Geoff will look at this this week.

Longer Term discussions

Doc update
