
WeeklyTelcon_20201110

Geoffrey Paulsen edited this page Jan 19, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • NOT-YET-UPDATED

4.0.x

  • v4.0.6 toward the end of the month.
    • 7199 - ras based orted on headnode.
    • No other blockers currently on
    • Geoff will create milestone issue for v4.0.6
  • PR 8187 - deterministic build.
    • rhc will review.
  • Bug in ORTE that Ralph fixed last night for AWS. This hasn't gone back to v4.1. (PR8176)
    • It's technically more correct, so probably doesn't matter.
    • Weren't communicating the cpuset correctly. OFI MTL needed it in v4.1.
      • Jeff will see which BTL.
    • Prob not critical.
    • Ralph will pull it out and PR it for v4.0.x

v4.1

  • Did make another RC last night, Jeff will send out email this morning.
    • First to include HAN and Adapt.
      • Intent is that we'd activate both together.
    • Jeff will email with instructions on how to enable.
  • Warnings have gotten out of hand on v4.1 on macOS.
    • Some warnings would be fixed if we just fixed some macros. A bunch of unused variables.
    • In general not critical.
    • /bin/sh command substitution error (unexpected EOF), so that's probably a critical error.
  • Coverity only runs on master - Coverity then

Open-MPI v5.0

  • Ralph is working on updating PMIx / PRRTE submodule pointers.

    • Jeff is helping with configury issues.
  • Hoping these are one-time issues, and not every time we update submodule pointers.

    • All new configury.
    • If this isn't a one-time thing, we should consider no longer embedding this.
    • Submodules are good for a number of things, but you have to take a change in configury, go through CI, commit it to PMIx master, then go through OMPI CI, and find out that there's a bug. Then you have to go back to PMIx master.
  • Cisco has some of this.

    • Should figure out a keyword on PMIx side, that effectively does
  • Hit a problem in AWS Amazon-Linux1 trying to build a tarball.

  • Also hitting failures in Mellanox CI due to an older Python version.

  • IBM doesn't do dist check - bug is in there (real issue)

  • Some issues are caused by Embedding. (What do you mean by SRCDIR) (Which SRCDIR?)

  • But other issues are NOT caused by embedding pmix.

    • So getting rid of embedding would not solve these issues
  • At the moment it's looking like 2Q next year.

    • IBM's been pushing on PRRTE as well. We're testing the map/bind options.
    • Trying to push tickets up (some are clarification of expected behavior)
    • Community help with these issues would help move forward the prrte deadline.
  • What's going to be the state of the SM CUDA BTL and CUDA support in v5.0?

    • What's the general state? Any known issues?
    • AWS would like to get.
    • Josh Ladd - Will take internally to see what they have to say.
    • From NVIDIA/Mellanox, CUDA support is through UCX, SM CUDA isn't tested that much.
    • Hessam Mirsadeg - All CUDA awareness through UCX
    • May ask George Bosilca about this.
    • Don't want to remove a BTL if someone is interested in it.
    • UCX also supports TCP via CUDA
    • PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
  • PR 8191 - converted all OMPI READMEs to markdown.

    • Looks nicer when browsing on Github (can do formatting)
    • Do things in one markdown language (easy to edit)
    • Master only.
  • Should we consolidate the top-level README, the website FAQ (sorta googlable), and all the manpages?

    • If 2 of these are going to Markdown, maybe we should do FAQ in markdown, and put these all in one place like readthedocs.io
    • Unless someone has an allergic reaction, Jeff's interested in working on this.
    • They have a decent versioning scheme to version docs based on release(s).
    • When would the docs get pushed to readthedocs.io?
      • Master would be on a github hook after PRs are merged.
      • Stable Release branches would go out at release time.
    • LICENSE question - what license would the docs be available under? Open-MPI BSD license, or
    • readthedocs.io encourages reStructuredText format over markdown.
      • They also support a hybrid for projects that have both.
    • Thomas Naughton has done the restructured text, and it allows
  • Ralph tried Instant-On at scale:

    • 10,000 nodes x 32PPN
    • Ralph verified Open-MPI could do all of that in < 5 seconds, Instant-On.
    • Through MPI_Init() (if using Instant-On)
    • TCP and Slingshot (OFI provider private now)
    • PRRTE with PMIx v4.0 support
    • SLURM has some of the integration, but hasn't taken this patch yet.
  • Discussion on:

    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Talking about amending to request MCAs to know if it should be slurped in.
      • (if the component hard links or dlopens their libraries)
    • Roadrunner experiments: the bottleneck in launching was the I/O of loading all the .so files
      • spindle, and burst buffer reduce this, but still
    • Still going through function pointers, no additional inlining.
      • can do this today.
    • Still different than STATIC (sharing this image across processes), just not calling dlopen that many times.
    • New proposal is to have a 3rd option where the component decides its default is to be slurped into libmpi
      • It's nice to have fabric providers not bring their dependencies into libmpi so that the main libmpi can be run on nodes that may not have the provider's dependencies installed.
    • Low priority thing anyway, if we get it in for v5.0 it'd be nice, but not critical.
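
The trade-off in this discussion can be sketched as a small model. This is only an illustration, not Open MPI's build logic: the policy names, the `Component` type, the component names, and the rule that a component with an external library dependency defaults to a separate plugin are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical policy names; the notes do not define actual option names.
POLICIES = ("all-static", "all-dso", "component-default")


@dataclass(frozen=True)
class Component:
    name: str
    # Assumed flag: True if the component links directly against an external library.
    links_external_lib: bool


def build_mode(component: Component, policy: str) -> str:
    """Return "static" (slurped into libmpi) or "dso" (separate plugin opened with dlopen)."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if policy == "all-static":
        return "static"
    if policy == "all-dso":
        return "dso"
    # component-default (assumed rule): keep external dependencies out of libmpi.
    return "dso" if component.links_external_lib else "static"


def libmpi_external_deps(components, policy):
    """Names of components whose external dependencies libmpi would inherit."""
    return [
        c.name
        for c in components
        if c.links_external_lib and build_mode(c, policy) == "static"
    ]


# Hypothetical components, one without and one with an external dependency.
components = [
    Component("self_contained_component", links_external_lib=False),
    Component("fabric_provider_component", links_external_lib=True),
]

for policy in POLICIES:
    modes = {c.name: build_mode(c, policy) for c in components}
    print(policy, modes, "libmpi inherits deps from:", libmpi_external_deps(components, policy))
```

In this model, only the all-static policy makes libmpi depend on the fabric provider's library. The per-component default avoids that while still cutting the number of dlopen calls for components that have no external dependencies.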

Video Presentation

  • George and Jeff are leading
  • No new updates this week (see last week)