v1.13.0-aws

released this 19 Nov 05:37
cf7606e

(2024-11-18)

This release is intended only for use on AWS P* instances. A general release that supports other libfabric networks may be made in the near future.

With this release, building with platform-aws requires libfabric version 1.22.0amzn4.0 or greater. AWS customers are generally advised to track the latest available EFA Installer for performance improvements and bug fixes.

The 1.13.x release series supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New features:

  • AWS P5en platform support was added.

  • Support was added for the NCCL v3 tuner API. The tuner now supports multiple
    platforms and multiple collectives.

  • Scheduling improvements were made to the plugin RDMA protocol. In multirail
    configurations, this is expected to improve traffic balancing.

  • dmabuf memory registration support was added. Users facing problems with
    dmabuf may disable dmabuf with OFI_NCCL_DISABLE_DMABUF=1.
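
As a minimal sketch of the dmabuf opt-out described above, the following Python helper sets OFI_NCCL_DISABLE_DMABUF=1 in the environment of a launched job. Only the variable name and value come from these notes; the helper names `dmabuf_disabled_env` and `run_without_dmabuf` are hypothetical and are not part of the plugin.

```python
import os
import subprocess


def dmabuf_disabled_env(base_env=None):
    """Return a copy of the environment with dmabuf registration disabled.

    OFI_NCCL_DISABLE_DMABUF=1 is the setting named in the release notes;
    the caller's mapping (or os.environ) is copied, not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OFI_NCCL_DISABLE_DMABUF"] = "1"
    return env


def run_without_dmabuf(command):
    """Run `command` (a list of arguments) with dmabuf disabled for the plugin.

    Hypothetical convenience wrapper; raises CalledProcessError on failure.
    """
    return subprocess.run(command, env=dmabuf_disabled_env(), check=True)
```

A caller would pass its usual launch command as a list of arguments to `run_without_dmabuf`. Setting the variable only for the child process keeps the change scoped to that job.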

Breaking changes:

  • As mentioned above, building with support for platform-aws now requires
    libfabric version 1.22.0amzn4.0 or greater.

  • Under CUDA, the plugin now statically links the CUDA runtime by default.
    Packagers preferring to dynamically link CUDA may pass
    --enable-cudart-dynamic at configure time to disable this.
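
The libfabric requirement above can be checked before attempting a build. The sketch below compares a version string against 1.22.0amzn4.0. The ordering rule is an assumption, not something these notes specify: the upstream part (1.22.0) is compared first, then the amzn revision, and a string without an amzn suffix is treated as older than any Amazon build of the same upstream version. The function names are hypothetical.

```python
import re

# Minimum libfabric version stated in these release notes for platform-aws builds.
MIN_LIBFABRIC = "1.22.0amzn4.0"

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:amzn(\d+)\.(\d+))?$")


def parse_libfabric_version(text):
    """Split e.g. '1.22.0amzn4.0' into the tuple (1, 22, 0, 4, 0).

    Assumption: a version with no 'amznX.Y' suffix sorts before any
    Amazon build of the same upstream version, so it maps to (-1, -1).
    """
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized libfabric version: {text!r}")
    major, minor, patch, amzn_major, amzn_minor = match.groups()
    upstream = (int(major), int(minor), int(patch))
    if amzn_major is None:
        amzn = (-1, -1)
    else:
        amzn = (int(amzn_major), int(amzn_minor))
    return upstream + amzn


def meets_minimum(version, minimum=MIN_LIBFABRIC):
    """Return True if `version` is at least `minimum` under the assumed ordering."""
    return parse_libfabric_version(version) >= parse_libfabric_version(minimum)
```

Under these assumptions, `meets_minimum("1.22.0amzn3.0")` is false and `meets_minimum("1.23.0amzn1.0")` is true. The build system's own check remains authoritative.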

Supported Distributions

  • Amazon Linux 2
  • Amazon Linux 2023
  • Ubuntu 20.04 LTS and 22.04 LTS

For releases before v1.6.0, we generally created releases from two separate
branches, an AWS-specific branch and a general release branch. With v1.6.0, we
have unified the code into a single branch, and made the AWS-specific parts a
compile-time option. When a feature (or entire release) only supports one of
the two variants, we note that in the release notes.

What's Changed

  • ci: build oldest working EFA installer and latest by @aws-nslick in #522
  • api: fail when using connect/accept_v4 with RDMA protocol by @rauteric in #529
  • rdma: write topo file only for multi-rail platforms by @rauteric in #532
  • dist: set TAR_OPTIONS to remove ownership info by @rauteric in #523
  • Revert ".ci/aws: Add trainium tests to CI" by @a-szegel in #535
  • nvidia: Change default network name to "Libfabric" by @bwbarrett in #530
  • tuner: support tuner v3 API by @AmedeoSapio in #524
  • init: Avoid hang by forcing SENDRECV in case of neuron v4 API usage by @maxtmann in #537
  • Fix naming of array in nccl_net_ofi_plugin_init by @ryanhankins in #539
  • Revert "param: increase CQ read count to 16 for performance" by @maxtmann in #538
  • .ci/aws: Add g4dn testing to PR CI by @a-szegel in #527
  • .ci/aws: Make failures happen in correct stage by @a-szegel in #528
  • platform: Set RDMA protocol as default for trn1/trn1n platforms by @maxtmann in #540
  • Expose each libfabric NIC as one NIC device to the user in case of non-NVIDIA platforms by @maxtmann in #544
  • ci: cache efa installer by @aws-nslick in #545
  • ci: fix efa installer caching by @aws-nslick in #546
  • fix(rdma): endpont_per_comm: NULL ptr bug by @rauteric in #551
  • tuner: Enable tuner init msg on INFO logs by @arunkarthik-akkart in #549
  • .ci/aws: Decrease NCCL_TEST iterations to 5 by @a-szegel in #550
  • fix(tree): use correct __cplusplus guards by @aws-nslick in #554
  • Separate endpoint for control messages by @rajachan in #543
  • fix(tree): add spaces around PRIu64 by @aws-nslick in #555
  • feat(tree): add static_assert shim macro by @aws-nslick in #556
  • fix(aws): align declaration and init order by @aws-nslick in #557
  • fix(rdma): fi_{send,write}data: do arithmetic on uintptr by @aws-nslick in #558
  • fix(tuner): don't choose NVLSTree if nRanks==nNodes by @AmedeoSapio in #583
  • rdma: Eliminate unnecessary ctrl message waits in eager protocol by @rauteric in #553
  • fix(tracing): use header-only nvtx3 by @aws-nslick in #590
  • chore(.github/workflows): constrain push triggers to known branches by @aws-nslick in #582
  • feat(build): better --enable-debug defaults by @aws-nslick in #596
  • fix(freelist): use uintptr_t for pointer arithmetic by @aws-nslick in #560
  • Fix: access domain from ep during mr on device by @maxtmann in #602
  • Feature/v6 rma ops by @maxtmann in #541
  • platform: trn1 default protocol send receive by @hunnorth in #603
  • fix(tree): import libfabric's container_of macro by @aws-nslick in #605
  • fix(valgrind): fix autotools mistake by @aws-nslick in #607
  • feat(ci/github): use docker instead of codebuild by @aws-nslick in #608
  • CI updates by @rajachan in #612
  • util: Use FI_ENOPROTOOPT to check for a provider's support for option by @rajachan in #613
  • Fix log format string behavior by @bwbarrett in #615
  • Improve protocol selection logic by @bwbarrett in #610
  • .ci/aws: Unpin al2 p3dn ami by @a-szegel in #552
  • .ci/aws: re-Add trainium tests to CI by @a-szegel in #619
  • fix(m4): set redzone size to 0 by @rauteric in #616
  • Fully destroy endpoints when refcount is 0 by @bwbarrett in #617
  • feat: add DMA-BUF support by @aws-nslick in #618
  • Improve end of process cleanup and reporting by @bwbarrett in #620
  • fix(rdma): stop setting FI_ORDER_NONE by @aws-nslick in #621
  • fix(tree): use empty brace initializers for zero-initialization by @aws-nslick in #594
  • fix(build): ensure -pthread is passed by @aws-nslick in #623
  • fix(build): add missing AC_PROG_RANLIB by @aws-nslick in #622
  • fix(ci): prefer ecr to dockerhub by @aws-nslick in #628
  • feat(build): disable semantic interposition by @aws-nslick in #624
  • fix(init): fix sendrecv fallback logic by @aws-nslick in #629
  • fix: rdma: inverted print statement by @aws-nslick in #630
  • rdma: Use get_device_from_ep() accessor by @bwbarrett in #626
  • Combined -Wextra -Werror Commits by @aws-nslick in #627
  • Add platform data settings for TRN2N by @maxtmann in #638
  • tuner: add regions for AllGather/ReduceScatter in the one rank per node case by @AmedeoSapio in #641
  • fix(rdma): send periodic control messages to sync sender/receiver by @rauteric in #640
  • feat(build): add -fanalyzer when --enable-werror by @aws-nslick in #632
  • Add Multiplexed-round-robin scheduler by @arunkarthik-akkart in #604
  • fix : Fix flexible array member allocation by @arunkarthik-akkart in #649
  • Revert "neuron: Disable rdma eager messages by default" by @maxtmann in #650
  • .ci/aws: All CI use ami with EFA Installer by @a-szegel in #648
  • separate out 3rd-party headers by @aws-nslick in #634
  • Add a proper endpoint interface by @bwbarrett in #654
  • feat(ci): add workflow_dispatch to distcheck by @aws-nslick in #658
  • Fix use of uninitialized lock by @bwbarrett in #659
  • aws: Skip the WRITE_IN_ORDER_ALIGNED_128_BYTES check for P5en by @rajachan in #625
  • rdma: remove "request completed with error" message by @rauteric in #660
  • rdma: do local RDMA read on all NIC rails for flush() by @taeilum00 in #652
  • Fix abort when cache is disabled. by @bwbarrett in #662
  • feat: Make tuner platform specific by @arunkarthik-akkart in #657
  • Couple of accessor function / code cleanups by @bwbarrett in #661
  • rdma: Revert commits eliminating eager waits by @rauteric in #664
  • Cleanups from adding a domain interface by @bwbarrett in #670
  • fix: Change multiplexer scheduler to use two rails instead of three by @arunkarthik-akkart in #669
  • Add p5en platform_data and update default latency for undefined platforms by @rajachan in #672
  • Fix a number of duplicate definition names by @bwbarrett in #667
  • .ci/aws: Move p5 capacity to CGK by @sunkuamzn in #680
  • Fix device sorting on aws platforms by @bwbarrett in #679
  • rdma: add option to round robin the ctrl msg, and use shared CQs for control and data endpoints by @AmedeoSapio in #673
  • Add option to abort() on error by @bwbarrett in #683
  • Reduce repetitive INFO printing by @bwbarrett in #684
  • aws: Override libfabric link_attr for certain platforms by @rajachan in #686
  • Switch CI to persistent clusters with containers by @sunkuamzn in #687
  • cuda: build flag for dynamically or statically linking cudart by @aws-nslick in #688
  • Add platform data settings for TRN2 by @hunnorth in #693
  • .ci/Jenkins: General Cleanup and Remove Region/CI From CI by @a-szegel in #694
  • tuner: add model base tuner and refactor for co-exist by @taeilum00 in #692
  • defaults: make dmabuf opt-in by @aws-nslick in #695
  • .ci/aws: Improve CI Speed by @a-szegel in #701
  • fix: ep release in endpoint per comm by @AmedeoSapio in #706
  • rdma: Set FI_MORE when posting receive buffers by @bwbarrett in #705
  • rdma: Set LOW_LATENCY traffic class for control by @bwbarrett in #702
  • reenable dmabuf by default by @aws-nslick in #703
  • MR: Enforce page-aligned buffer registration for iovec and add corresponding test case by @mozarhua in #685
  • core: Leave endpoint created during init by @bwbarrett in #710
  • feat: Region-based tuner support for P5en by @arunkarthik-akkart in #704
  • fix: Fallback to internal tuner on NCCL-2.21.5 for PAT by @arunkarthik-akkart in #714
  • release: v1.13.x aws by @aws-nslick in #712

Full Changelog: v1.12.1-aws...v1.13.0-aws