WeeklyTelcon_20220222

Geoffrey Paulsen edited this page Mar 4, 2022 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

Didn't capture

not there today (I keep this for easy cut-n-paste for future notes)

  • Geoffrey Paulsen (IBM)
  • Austen Lauria (IBM)
  • Jeff Squyres (Cisco)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • Christoph Niethammer (HLRS)
  • David Bernhold (ORNL)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard (LANL)
  • Josh Hursey (IBM)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic (nVidia)
  • William Zhang (AWS)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Edgar Gabriel (UoH)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Joseph Schuchart
  • Joshua Ladd (nVidia)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Sam Gutierrez (LLNL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia)

NEW Discussion

MPI Forum related

  • Issue 10029

    • Should we allow MPI_NULL handles to be passed to get_name routines?
      • Today this returns an error.
      • Someone filed an issue claiming this is a bug.
      • The standard is ambiguous about whether these are allowed in the get_name routines.
    • If we do allow this, we should do it everywhere.
    • Consensus: we should clarify with the Forum before taking this.
  • A question came up for implementations.

    • Threads that are synchronized: multiple threads are sending on the same comm + tag.
      • Should ordering be across all threads (global ordering)?
        • Concern: this implies a single atomic, or one lock, that every thread contends on.
      • Some are arguing the text should specify per-thread ordering.
    • Don't actually need a global lock today (maybe, in practice probably do).
    • From a hardware implementer's point of view, per-thread ordering would be very expensive.
    • Text is not very clear.
    • For single threaded, ordering is very well defined.
    • Could you make use of this or not?
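
The two readings of the ordering question can be sketched with a small model. This is not Open MPI code: the `GlobalOrder` and `PerThreadOrder` classes and the `run` helper are hypothetical, and Python threads stand in for MPI threads sending on the same comm + tag. It only illustrates where the contention comes from: a global order needs one shared counter that every sender passes through, while a per-thread order needs only a counter private to each thread.

```python
import threading

class GlobalOrder:
    """One sequence-number stream shared by every sending thread."""
    def __init__(self):
        self._lock = threading.Lock()   # the single point of contention
        self._next = 0

    def stamp(self, tid):
        with self._lock:
            seq = self._next
            self._next += 1
        return seq

class PerThreadOrder:
    """Each thread numbers only its own sends; no shared counter."""
    def __init__(self, nthreads):
        self._next = [0] * nthreads     # slot i is only touched by thread i

    def stamp(self, tid):
        seq = self._next[tid]
        self._next[tid] = seq + 1
        return (tid, seq)

def run(order, nthreads, nsends):
    """Have nthreads threads each stamp nsends messages; return all stamps."""
    stamps = []
    stamps_lock = threading.Lock()      # protects the result list only

    def sender(tid):
        mine = [order.stamp(tid) for _ in range(nsends)]
        with stamps_lock:
            stamps.extend(mine)

    threads = [threading.Thread(target=sender, args=(tid,))
               for tid in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stamps
```

With `GlobalOrder`, every stamp is unique across all threads, at the cost of the shared lock. With `PerThreadOrder`, stamps are only comparable within one thread.
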
  • Two HWLOC issues

    • PRRTE/PMIx hwloc issue: https://github.com/openpmix/prrte/pull/1185 and https://github.com/openpmix/openpmix/pull/2445
    • hwloc, when built with CUDA support, hard links against it.
      • This doesn't work in the common case where CUDA isn't installed on login nodes.
    • hwloc v2.5 - v2.7.0 puts variables that live in read-only memory into environ; prrte tries to modify them and segfaults.
    • PMIx and PRRTE have block-listed a large range of hwloc versions: 2.5 - 2.7.0
      • putstr(env) is segv-ing.
    • Discussions about minimizing mpirun/mpicc to only link against subset of opal.
    • Makes things slightly better, but not really. Still have cuda on some nodes and not on others.
    • Projected solution is to use hwloc plugins (dlopen cuda libs)
      • A while back, hwloc changed default to NOT load components as plugins.
        • He did this for Open MPI (some cyclic dependencies).
        • This is no longer an issue for us.
      • Now hwloc has reasonable defaults for some things built as plugins (dlopened at runtime).
      • Usually customers install in local filesystems.
      • This gets us around the dependencies.
      • So whenever this is actually fixed, Jeff will write docs, and we can touch on points.
      • If there are any other suggestions or modifications for Josh's hwloc PR, please put them on the hwloc PR.
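
The two mitigations discussed above, refusing known-bad hwloc versions and loading CUDA at run time instead of hard linking against it, can be sketched as follows. This is a Python illustration only, not the PRRTE/PMIx configure logic and not hwloc's plugin loader. The function names are hypothetical, treating "2.5" as 2.5.0 and the range as inclusive is an assumption, and `libcuda.so.1` is the assumed name of the CUDA driver library on Linux.

```python
import ctypes

# Assumed inclusive range of affected hwloc releases ("2.5 - 2.7.0" in the notes).
BAD_HWLOC_MIN = (2, 5, 0)
BAD_HWLOC_MAX = (2, 7, 0)

def parse_version(text):
    """Turn '2.7.0' or '2.5' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])

def hwloc_is_blocked(text):
    """True if this hwloc version falls inside the block-listed range."""
    return BAD_HWLOC_MIN <= parse_version(text) <= BAD_HWLOC_MAX

def try_load(libname):
    """dlopen-style optional dependency: return a handle, or None if absent."""
    try:
        return ctypes.CDLL(libname)
    except OSError:
        # Library not installed on this node (e.g. a login node without CUDA).
        return None

if __name__ == "__main__":
    cuda = try_load("libcuda.so.1")
    print("CUDA support:", "available" if cuda is not None else "disabled")
```

The point of the second half is that a missing library becomes a run-time decision on each node rather than a link-time failure for every binary that pulls in hwloc.
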
  • Resuming MTT development - send email

    • Doodle.
    • Like to have a monthly call.
    • Christoph Niethammer is interested.
      • Might need a new cleanup mechanism when rolling out lots of versions.
    • Find out who's using python client, and what problems.
    • IU database plugin (what ends up getting data into MTT viewer) has a number of issues.

4.0.x

  • Schedule: No schedule for v4.0.8 yet
    • bugfixes case-by-case basis
    • PRs to merge Friday.
    • Github Action change.
  • Winding down v4.0.x; will stop after v5.0.x.
  • Really only want small changes reported by users.
  • Otherwise, point users to v4.1.x release.
  • Howard and Geoff will meet Jan 28th

v4.1.x

  • Schedule: Shooting for v4.1.3 end of March/Q1.
    • RC in 2 weeks or so.
  • No other update.

v5.0.x

  • CI is back.
  • https://github.com/openpmix/prrte/pull/1176
  • Sessions - https://github.com/open-mpi/ompi/pull/9097
    • Howard will rebase (again)
  • PRRTE has long had a schizo component that tries to provide an interface based on which implementation the user is using. The CLI was still centralized, and this was leading to difficulties. Example: disagreement about how ranks should be placed with the -N option. So some of these decisions were moved down into a framework that has an OMPI component.
    • Some questions if we should bring this into v5.0 for OMPI. There is a PRRTE PR up with some early work.
    • This would be backported to the PRRTE release branch for our OMPI v5
    • Blocker v5.0 items are in the Project/2
    • Schedule is Q1
  • Thinking about an RC before and after Sessions.
    • Well as far as tracking, we have nightly tarballs, and it'll be clear in git
  • Docs rework
    • Last round of automated manpage updates. Might be pretty close to commit, even though not complete.
      • Perhaps this weekend.
    • We made a lot of progress on revamping the docs with restructured text.
    • Might actually be able to get this done by v5.0.x
    • Don't go review yet, but lots of good progress.
    • Definitely want to have these docs for v5.0.0, but maybe not 100% complete,
      • But do want THIS is what's different in mpirun command line, etc.
  • PR 9996 - bug with current cuda common code.
    • Some user fixed a problem in smcuda.
      • Ask tommi to
    • Writing the API, and will try to port over code.
    • Ported this code to util to try to fix the bug, but there has been an ask to do a bit more.
    • An accelerator framework,
    • Need to figure out how we move forward here. Moving it into util is not the right place.
      • Don't need more things with gnarly dependencies in util.
        • this makes the mpicc problem worse.
    • William will take a stab at it, if it's not a lot of work.
      • four to six functions that datatype engine calls.
        • Is accelerator?
        • data movement functions.
        • need to figure out memory hooks stuff.
      • libfabric has this abstraction, so we could
      • No new code, just moving things around.
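
One way to picture the proposed accelerator abstraction is a small interface that the datatype engine calls instead of touching CUDA directly. The sketch below is hypothetical throughout: the class and method names are invented for illustration and are not Open MPI or libfabric APIs, the memory-hooks piece is left out because the notes say it still needs to be figured out, and Python stands in for what would be a C framework.

```python
from abc import ABC, abstractmethod

class Accelerator(ABC):
    """Hypothetical interface: the few calls the datatype engine would need."""

    @abstractmethod
    def is_accelerator_buffer(self, buf):
        """Answer the 'is accelerator?' question for a buffer."""

    @abstractmethod
    def copy(self, dst, src, nbytes):
        """Data movement: copy nbytes from src to dst."""

class HostOnly(Accelerator):
    """Fallback component, selected when no accelerator library is present."""

    def is_accelerator_buffer(self, buf):
        return False

    def copy(self, dst, src, nbytes):
        dst[:nbytes] = src[:nbytes]

def pack(accel, dst, src, nbytes):
    """Datatype-engine-style caller: it only ever sees the interface."""
    kind = "device" if accel.is_accelerator_buffer(src) else "host"
    accel.copy(dst, src, nbytes)
    return kind
```

Because callers depend only on the interface, a CUDA-backed component could live behind it without util, and therefore mpicc, picking up the CUDA dependency.
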

Master

  • No new Gnus

MTT

  • A fix is pending to work around the IBM XL MTT build failure (compiler abort)
  • Issue 9919 - Thinks this common component should still be built.
    • Commons get built when it's likely there is a dependency.
    • Commons self-select if they should be built or not.