WeeklyTelcon_20211130

Geoffrey Paulsen edited this page Nov 30, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Todd Kordenbrock (Sandia)
  • Sam Gutierrez (LLNL)
  • Harumi Kuno (HPE)
  • Joseph Schuchart
  • Naughton III, Thomas (ORNL)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Brendan Cunningham (Cornelis Networks)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Matthew Dosanjh (Sandia)
  • William Zhang (AWS)
  • Austen Lauria (IBM)
  • Tomislav Janjusic
  • Josh Hursey (IBM)

not there today (I keep this for easy cut-n-paste for future notes)

  • Brian Barrett (AWS)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • David Bernhold (ORNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Joshua Ladd (nVidia/Mellanox)
  • Marisa Roman (Cornelis)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia/Mellanox)

New Items

  • Tommy is taking over for Josh Ladd. Please send Mellanox items to him.
    • He will also help with v5 RM work.

4.0.x

  • We're still waiting on Datatype issues now reported in v4.1.1
    • Issue 8856
    • Howard took the DT fix and created a PR
  • Need an explanation for PR 8810
    • Hessam contacted Artem, who said that it's a work in progress.
  • Follow up 8818 on datatypes
    • Is this also a blocker?
    • No.

v4.1.x

  • Raghu has left AWS.
    • Brian is stepping up for v4.1.x RM work
  • v4.1.1
    • Released over the weekend. Got George's datatype fix.
    • Brian and Jeff did a bunch of testing, and were happy with it.
      • Unfortunately two different folks reported partial roundoff error #8856
      • George spent a lot of time trying
  • Holding off on merging v4.1.x PRs until we get a better understanding of #8856

v5.0.0

  • Still haven't done the alpha, and won't do it until we get cherry-picks from master.
    • Austen, Tommy, and Geoff will Cherry-pick "easier"
  • Issue #8652 RDMA performance problem.
    • This is more of an enhancement than a severity: blocker
    • Not a blocker, just an issue with the way the user ran.
    • If there's a mode that we know has bad performance, useful to call out in UCX section of docs.
  • Pushing back the alpha build for v5.0.0 from this Friday to NEXT Friday.
  • Issue 8776 - libevent confusion if running with external 3rd party tools
    • PR 8792 - Need to move this over to v5.0.x
    • Need to check with Brian if this is relevant on v4.0 or v4.1
      • compile with --disable-dlopen, or slurp in all of the plugins.
      • 3 line change, should be small work.
      • Not a linker error, job just hangs and fails, really might want on v4.0 and v4.1
  • PR 8799 - should probably be PRed to v5.0
    • Howard's concerned about these package-specific config lookups and how they feed into the way that mpicc links (for example, on Cray)
      • mpicc --show - shows some long dependencies.
      • Just let him know on the ticket.
      • Howard will update the ticket.
  • Docs - Man pages will be included in this effort.
    • Likely include nroff and HTML in the tarball (so users don't need Sphinx, and don't need internet)
    • If this doesn't make v5.0.0, it can go into later.
  • Packagers need some advice, and need a README, few more weeks at minimum.

Master

  • 8808 - same memory backing file.
    • what is the failure profile for this?
    • Rare, but if two users are sharing a node and we leave backing files behind because a job fails, another user's attempt to create the backing file can conflict. So we add the user ID to give a little more safety against conflicts.
    • Does mean that there's a cleanup issue for shared memory files.
      • The only reason is that we moved the backing file out of /dev/shm.
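
A minimal sketch of the idea discussed above, putting the user ID into the backing file name so that leftovers from one user's failed job cannot collide with another user's file. The helper name, the `example_shmem` prefix, and the path layout are all hypothetical; this is not the code from 8808.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not Open MPI code): build a per-user backing file
 * path of the form <dir>/example_shmem.<uid>.<jobid>.
 * Returns 0 on success, -1 if jobid contains '/' or the result does not fit.
 */
static int example_backing_file_name(char *buf, size_t len, const char *dir,
                                     unsigned long uid, const char *jobid)
{
    assert(buf != NULL && dir != NULL && jobid != NULL);

    /* Reject job ids that could escape the target directory. */
    if (strchr(jobid, '/') != NULL) {
        return -1;
    }

    int n = snprintf(buf, len, "%s/example_shmem.%lu.%s", dir, uid, jobid);

    /* snprintf reports the length it wanted; n >= len means truncation. */
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

A caller would typically pass `(unsigned long) getuid()` from `<unistd.h>` as the `uid` argument. Two users running the same job ID on a shared node then get distinct paths, although stale files still need to be cleaned up separately.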

Reformatting master

  • PR 8816
  • One issue is LARGE macro formats.
    • More readable, or less readable with formatting is subjective.
    • There are some tokens to surround code with.
    • Always put a trailing comma on struct or array initializers (little things could be improved)
    • I wonder if we should optimize for time here?
      • Anything that's controversial, just surround it with these tokens
      • Once this is formatted, then we can run clang-tidy, and this can
  • Jeff has a CI script to enforce this.
    • You can turn this on before.
    • Don't want to
  • Need to do ompi and oshmem too.
  • Not touching 3rd party.
    • Large macros in tests (might need that same
  • clang-tidy is smaller - pretty small.
    • clang-format has to be completely included.
  • Some 300 forks of Open-MPI on GitHub.
    • Anyone else have long-standing branches?
    • Worried that we're not making a lot of friends here, that don't really help.
    • Code readability is important. We have coding standards, but haven't been
    • We should have had this convo before we merged into opal.
    • Right now we're in this horrible half-state.
    • If we jump through this pain, then we can automate it.
  • Not going to do clang-format on 3rdParty
    • Including ROMIO

PMIx

  • No update

PRRTE v2.0

  • No update

Some outstanding work for the way that OMPI calls PRRTE configure.

  • Also some changes with libcurl, especially since this breaks the OMPI build.
    • PMIx can interface with REST interfaces (using libcurl)
    • JSON
    • Build system issue in PMIx when we changed to static DSOs.
    • Think this has been resolved
    • Ralph was looking at this (private messaged Geoff)

issue 8801 - mpirun and prefixing.

  • Jeff and Ralph and Yosi had a good conversation
  • rhc has no strong issues either way.
  • We prepend LD_LIBRARY_PATH pointing to the PRRTE installation.
    • At the moment in OMPI, we overlay this with OMPI library location.
    • Seems like the best fix would be to make these two independent.
  • PREFIX - enable prefix by default.
    • In Open MPI, the PRRTE prefix happens to be the same as the OMPI prefix.
    • But PRRTE does this by default, because we want the daemons to match the commands.
      • OMPI doesn't want to do that. And that's okay
  • Instead of --enable prefix-by-default we need --enable mpi-prefix-by-default.
  • Looking at it from OMPI perspective
    • User asked for prefixing, user wants prefixing... don't care if same or not, just want it to work.
    • If user DOESN'T want prefixing, then don't want EITHER prefixing.
      • But if you have a global PRRTE, that might want to be prefixed.
  • PRRTE will prefix by default
  • What happens when I want MPI libs redirected?
    • Problem is if you build PRRTE INTERNAL, then you can't redirect MPI libraries.
  • Gotta set PATH and LD_LIBRARY_PATH correctly
    • One of those things, --enable-prefix is NOT default in < v4.0
  • There are times when you want to redirect OMPI to a different set of libraries.
    • Right now it's a configure / compile time setting, which is problematic: you have to redo all of the subcomponents.
    • What would be nice is if this was at runtime, so that ompi's mpirun can find all of the subcomponents at runtime.
  • Setting LD_LIBRARY_PATH is the way to point to another set of libraries.
    • This breaks because mpirun will overwrite LD_LIBRARY_PATH.
    • Personally Doesn't want this as a default.
    • Joseph doesn't want us setting LD_LIBRARY_PATH
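
To illustrate the difference between overwriting LD_LIBRARY_PATH and prepending to it, here is a hedged sketch in C. The helper name and the example directories are hypothetical, and this is not how mpirun or PRRTE actually builds the variable; it only shows how a launcher could keep a user-supplied value intact.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated "dir:existing" string,
 * or a copy of dir when existing is NULL or empty. The caller frees the
 * result. Returns NULL if allocation fails.
 */
static char *example_prepend_path(const char *dir, const char *existing)
{
    assert(dir != NULL);

    if (existing == NULL || existing[0] == '\0') {
        /* Nothing to preserve: the new value is just dir. */
        size_t n = strlen(dir) + 1;
        char *out = malloc(n);
        if (out != NULL) {
            memcpy(out, dir, n);
        }
        return out;
    }

    /* dir + ':' + existing + terminating NUL */
    size_t len = strlen(dir) + 1 + strlen(existing) + 1;
    char *out = malloc(len);
    if (out != NULL) {
        snprintf(out, len, "%s:%s", dir, existing);
    }
    return out;
}
```

A launcher could then call `setenv("LD_LIBRARY_PATH", value, 1)` with the value built from `example_prepend_path("/opt/example/lib", getenv("LD_LIBRARY_PATH"))`, so the user's own entries remain searchable after the launcher's directory.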

MTT

  • Need to look at the public tests repo for merging in both ULFM and Sessions tests.
    • Howard and Geoff will look at it this week.

Open-MPI v5.0

Longer Term discussions

Doc update

  • OMPI docs and manpages, but persistent problem that mpirun is really prrterun

  • PR 8329 - convert README, HACKING, and possibly Manpages to restructured text.

    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent is for this to be in v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work
  • No update - 3/16

    • Could be independent of PMIx and PRRTE.
    • PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.