[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common #1067

Open: wants to merge 23 commits into base: main

Commits on Sep 6, 2024

  1. moved userbuffers code to TE/common

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    e911bac
  2. moved comm+GEMM overlap code to TE/common

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    4842566
  3. removed PyTorch dependency from comm+GEMM overlap in TE/common

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    c587e76
  4. added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common
    
    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    4cc258b
  5. updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code
    
    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    b9370a0
  6. updated unit tests to work with refactored comm+GEMM overlap code

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    b03cf2d
  7. added a pylint exception to comm+GEMM overlap test runner

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    9994989
  8. [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
    pre-commit-ci[bot] authored and denera committed Sep 6, 2024
    8c54738
  9. fixing linting errors

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    82a18c0
  10. [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
    pre-commit-ci[bot] authored and denera committed Sep 6, 2024
    29fe3bd
  11. added documentation for te.initialize_ub

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    64ffbbf
  12. [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
    pre-commit-ci[bot] authored and denera committed Sep 6, 2024
    d840201
  13. fixed compile errors when building with NVTE_UB_WITH_MPI=1

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    69ee948
  14. [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
    pre-commit-ci[bot] authored and denera committed Sep 6, 2024
    f787c4b
  15. fixed default bootstrap backend

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    3237517
  16. switched default bootstrap backend priority to MPI > Gloo > NCCL

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    2e6da4d
  17. [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
    pre-commit-ci[bot] authored and denera committed Sep 6, 2024
    aaca26e
  18. updated bootstrap backend documentation

    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    a04d85a
  19. close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv
    
    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    d6f1225
  20. added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold
    
    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    271cbf7
  21. 4586653
  22. automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures
    
    Signed-off-by: Alp Dener <adener@nvidia.com>
    denera committed Sep 6, 2024
    23f7dca
  23. 620c1f9