[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common #1067

Open · wants to merge 23 commits into main from comm-gemm-overlap-refactor

Conversation

@denera (Collaborator) commented Jul 31, 2024

Description

This PR moves Userbuffers and comm+GEMM overlap algorithms from TE/PyTorch to TE/common with refactored interfaces to remove the PyTorch dependency.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • transformer_engine/pytorch/csrc/userbuffers moved to transformer_engine/common/comm_gemm_overlap/userbuffers.
  • transformer_engine/pytorch/csrc/comm_gemm_overlap.h split into transformer_engine/common/include/transformer_engine/comm_gemm_overlap.h and transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp, and refactored to remove the torch::Tensor dependency.
  • Added new TE/PyTorch wrappers around the refactored comm+GEMM overlap algorithms (see the sketch after this list).
  • Expanded unit tests to cover all overlap algorithms, including atomic GEMM overlaps (tested as AG+RS pairs).
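
As a rough illustration of the new layering (a sketch only: the include path is inferred from the header location above, and the class shape is abbreviated from the CommOverlap wrapper quoted later in this conversation; it is not the PR's exact code):

// TE/common: framework-agnostic comm+GEMM overlap interface with no torch types.
#include <transformer_engine/comm_gemm_overlap.h>

// TE/PyTorch: thin wrapper that derives from the common base class and keeps
// torch::Tensor views of the Userbuffers workspace on the framework side.
#include <torch/custom_class.h>
#include <torch/torch.h>

class CommOverlap : torch::CustomClassHolder,
                    public transformer_engine::CommOverlapBase {
 private:
  torch::Tensor _ubuf_torch;  // PyTorch-side view of the communication buffer
};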

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera denera force-pushed the comm-gemm-overlap-refactor branch from 7255ca5 to 7c0cc8d Compare July 31, 2024 18:44
@denera denera self-assigned this Jul 31, 2024
@denera denera added the enhancement New feature or request label Jul 31, 2024
transformer_engine/common/CMakeLists.txt (outdated review thread, resolved)
transformer_engine/pytorch/module/layernorm_linear.py (outdated review thread, resolved)
transformer_engine/common/util/pybind_helper.h (outdated review thread, resolved)
@timmoon10 timmoon10 self-requested a review August 1, 2024 00:36
_ubuf = torch::empty({(sample.size(0) / _tp_size) * _num_ubuf_chunks, sample.size(1)},
sample.options());
ubuf_ptr = _ubuf.data_ptr();
register_gpu_buffer(&ubuf_ptr, _ubuf_bytes, false);


(Just a reminder here) It seems your bugfix of the legacy IPC flow is included in this PR, but the P2P part is not.

Your bugfix: force TE/PyTorch to always let Userbuffers manually allocate its buff…

Collaborator (Author):


Thanks for catching this!

@timmoon10 (Collaborator) left a comment


Overall this looks pretty good. My suggestions are quibbles with the API.

@@ -26,7 +26,7 @@ extern "C" {
* \param[in] stream CUDA stream used for the operation.
*/

enum class NVTE_Activation_Type {
enum NVTE_Activation_Type {
Collaborator:


Good catch, this breaks the C API. That said, I think enum names like GELU and RELU are too common and will likely run into name conflicts. I don't see this used in the C API, so a better approach would be to put this inside an #ifdef __cplusplus. In the future it may be better to take advantage of C++ features (put it within the transformer_engine namespace and rename it to Activation_Type), but that's beyond the scope of this PR.
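
A minimal sketch of that approach, assuming the enum stays in the shared header; enumerator names other than GELU and RELU are placeholders:

#ifdef __cplusplus
// Only visible to C++ translation units, so the C API is unaffected and the
// scoped enum keeps names like GELU and RELU out of the global namespace.
enum class NVTE_Activation_Type {
  GELU,
  RELU,
  // ... remaining activation kinds ...
};
#endif  // __cplusplus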

Collaborator:


For now:

Suggested change
enum NVTE_Activation_Type {
enum class NVTE_Activation_Type {

Comment on lines 36 to 39
ub_algo: tex.NVTE_Comm_Overlap_Algo = None,
ub: Union[tex.CommOverlap, tex.CommOverlapP2P] = None,
extra_output_tensor: torch.Tensor = None,
bulk_ubuf_fp8_type: Optional[tex.DType] = None,
Collaborator:


While we're changing the API, it may be worth considering putting UB-specific options in a dict:

Suggested change
ub_algo: tex.NVTE_Comm_Overlap_Algo = None,
ub: Union[tex.CommOverlap, tex.CommOverlapP2P] = None,
extra_output_tensor: torch.Tensor = None,
bulk_ubuf_fp8_type: Optional[tex.DType] = None,
extra_output_tensor: torch.Tensor = None,
ub_options: Optional[Dict[str, Any]] = None,

UB's API is unstable, and this puts a burden on downstream users (see the versioning logic in Mcore). By wrapping everything in a dict, TE can take more responsibility for backward compatibility, e.g. by handling a dict with options from an older version.

That said, these functions are considered internal interfaces. This is more of a concern with external APIs like the modules.

transformer_engine/pytorch/cpp_extensions/gemm.py (outdated review thread, resolved)
@denera denera force-pushed the comm-gemm-overlap-refactor branch from 2e55bb2 to dd8cc21 Compare August 1, 2024 20:56
@@ -98,42 +98,38 @@ def initialize_ub(
assert _ub_communicators is None, "UB communicators are already initialized."
_ub_communicators = {}

Contributor:


PR #1088 fixes a bug with the DGRAD-RS overlap.
Please make sure that PR's changes are not reverted by this change.


NVTE_CHECK_CUDA(cudaEventRecord(_stop_send, _stream_send));
NVTE_CHECK_CUDA(cudaStreamWaitEvent(stream_main, _stop_send, 0));


It seems these 2 lines (NVIDIA/TransformerEngine) are missing:

NVTE_CHECK_CUDA(cudaEventRecord(_stop_recv, (cudaStream_t)_stream_recv));
NVTE_CHECK_CUDA(cudaStreamWaitEvent((cudaStream_t)stream_main, _stop_recv, 0));

The reduce part probably needs to wait for all ubuf::userbuffers_recv calls to finish receiving data before it can continue with the reduction.
Not sure if I missed something. Does this make sense?
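
A minimal sketch of the ordering described above, combining the quoted lines; the reduction launch is a placeholder:

// After the last userbuffers_recv for this chunk, record an event on the recv
// stream and make the main stream wait on it (mirroring the send-side sync above)
// so the reduction only starts once all data has arrived.
NVTE_CHECK_CUDA(cudaEventRecord(_stop_recv, (cudaStream_t)_stream_recv));
NVTE_CHECK_CUDA(cudaStreamWaitEvent((cudaStream_t)stream_main, _stop_recv, 0));
// ... launch the reduction kernel on stream_main ...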

@denera denera force-pushed the comm-gemm-overlap-refactor branch 8 times, most recently from 4847133 to 10feff5 Compare August 28, 2024 23:14
@denera denera force-pushed the comm-gemm-overlap-refactor branch 2 times, most recently from 1f06f3d to bbd8120 Compare August 28, 2024 23:47
denera and others added 20 commits September 6, 2024 14:17 (each signed off by Alp Dener <adener@nvidia.com>). Truncated commit messages include:

  • …/common
  • …ap code
  • …areable file handle send/recv
  • …ters so PyTorch can factor externally allocated memory into its garbage collection threshold

pre-commit-ci bot and others added 3 commits September 6, 2024 14:19, including:

  • …mmOverlapHelper to simplify Python function signatures (signed off by Alp Dener <adener@nvidia.com>)

CommOverlapHelper(c10d::ProcessGroup *world_group,
std::optional<c10d::ProcessGroup *> intra_node_group_holder,
std::optional<c10d::ProcessGroup *> inter_node_group_holde);


(Minor) I inadvertently spotted a small typo here: inter_node_group_holde. FYI.

@@ -529,6 +529,10 @@ class CommOverlapHelper : torch::CustomClassHolder {
};

class CommOverlap : torch::CustomClassHolder, public transformer_engine::CommOverlapBase {
private:
torch::Tensor _ubuf_torch;
torch::Tensor _ubuf_counter;
@anderson101866 commented Sep 9, 2024


Does _ubuf_counter become redundant now? It seems to be instantiated only in the constructor, with no further usage.

not available. Setting `NVTE_UB_WITH_MPI=1` when building TE overrides this
option and always initializes Userbuffers with direct MPI calls in C++,
which also requires `MPI_HOME=/path/to/mpi/root` to be set at compile time.
"""
if not tex.device_supports_multicast():
assert bool(os.getenv("UB_SKIPMC", "0")), (


Suggest using bool(os.getenv("UB_SKIPMC", None)) instead.

If UB_SKIPMC is not set, the default "0" is still treated as True because bool() of any non-empty string is True, so the assertion passes and multicast ends up enabled on a device without MC support.

Labels: enhancement (New feature or request)

5 participants