
WeeklyTelcon_20170307

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Brian Barrett
  • David Bernholdt
  • Edgar Gabriel
  • Geoffroy Vallee
  • Howard
  • Josh Hursey
  • Ralph
  • Joshua Ladd
  • Nathan Hjelm
  • Thomas Naughton
  • Todd Kordenbrock
  • Ryan Grant

Agenda

  • No plans for a v1.10.7.
  • rcache 3013 - rcache broken in all versions.
    • Big-hammer workaround for v2.0 and v2.1: don't hook madvise.
    • For 3.0 need to
    • Registration cache is a bottleneck, and will always hit issues like this if we hook madvise.
    • PR 3013 - might not be necessary if we create a new PR that does NOT hook madvise (remove a function pointer).
      • PR 3013 is an optimization, not bad, but we really need to remove the madvise hook. Nathan will merge PR 3013 now.
      • PR 3013 is not NEEDED for v2.1, so just going into master / v3.0.
    • Ask Nathan to create v2.x, v2.0.x, and master PRs that remove the madvise hook.
    • Does this necessitate a v2.0.3 release? To hit it, you have to be doing malloc and free in threads.
      • Failure mode is threads deadlocking, not silent memory corruption. Does not affect 1.10 (using pmalloc hooks).
  • 3 issues on BSD and various flavors.
    • 3 are PMIx related. Include file missing; Josh already put up a PR for it.
    • oob 3115 - dlopen failing to find files, but only happening 20% of the time. OpenBSD on i386. getcwd() is missing.
    • NetBSD on AMD64 - not sure how common this is.
    • Artem and Paul are looking into Issue 3117 - Waiting to hear if it's easy to fix.
  • Taking one from Edgar: https://github.com/open-mpi/ompi/pull/3105 since we're doing another RC (for rcache fix).
  • Only Blockers for v2.1: Issue 3117, and unhooking madvise.
  • PMIx - reason we're doing an accelerated v3.0
  • Whitelist Issue 3107
  • UCX has its own multithreading API that needs to be enabled. UCX is thread safe. Inside UCX PML
    • allocator will be inside of OSHMEM.
    • Sounds reasonable (component level stuff).
  • DELAYED TO v3.1 - Info Keys - IBM to do an audit against what was posted last week from the MPI Forum and rebase.
    • PR 2941.
    • Open MPI currently doesn't implement this.
      • Concern: don't want to implement something if it's NOT going to be solid.
    • Nathan would like to have it into v3.0, but not necessary.
    • Sounds like everyone is okay with delaying this to v3.1, but want to get it into Master soon.
  • Updated internal hwloc DELAYED to v3.1. Still support latest hwloc via external.
  • Big elephant in the room is PMIx v2.0: it's not released yet but is being whitelisted, and we need to branch v3.0 soon to make the June 15 release.
  • 43 issues against v3.0 out there. Feature / enhancement based ones should just be punted to v3.1.
  • Schedule for branching v3.0 after next week's meeting.

PMIx status

  • Planning on doing a v2.1.2 release in next week or so.
    • Don't want to hold up v2.1, would go into v2.1.1
  • PMIx v2.0 - looking good for early April release
  • Don't break the build!!!

  • OSHMEM testing in MTT via Cisco is greatly improved. (Look at what Cisco did if you're interested in OSHMEM)
    • Gone from many thousands of false failures to a few hundred.

MTT Dev status:


Exceptional topics

  • We should begin thinking about scheduling our next face to face.

Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu

