[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common #1067

Open · wants to merge 23 commits into main from comm-gemm-overlap-refactor

Conversation

@denera (Collaborator) commented Jul 31, 2024

Description

This PR moves Userbuffers and comm+GEMM overlap algorithms from TE/PyTorch to TE/common with refactored interfaces to remove the PyTorch dependency.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • transformer_engine/pytorch/csrc/userbuffers moved to transformer_engine/common/comm_gemm_overlap/userbuffers.
  • transformer_engine/pytorch/csrc/comm_gemm_overlap.h split into transformer_engine/common/include/transformer_engine/comm_gemm_overlap.h and transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp, and refactored to remove the torch::Tensor dependency.
  • Added new TE/PyTorch wrappers around the refactored comm+GEMM overlap algorithms (see the sketch after this list).
  • Expanded unit tests to cover all overlap algorithms, including atomic GEMM overlaps (tested as AG+RS pairs).
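
As a rough illustration of the new layering (a sketch only: the include path is inferred from the header location above, and the class shape is abbreviated from the CommOverlap wrapper quoted later in this conversation; it is not the PR's exact code):

// TE/common: framework-agnostic comm+GEMM overlap interface with no torch types.
#include <transformer_engine/comm_gemm_overlap.h>

// TE/PyTorch: thin wrapper that derives from the common base class and keeps
// torch::Tensor views of the Userbuffers workspace on the framework side.
#include <torch/custom_class.h>
#include <torch/torch.h>

class CommOverlap : torch::CustomClassHolder,
                    public transformer_engine::CommOverlapBase {
 private:
  torch::Tensor _ubuf_torch;  // PyTorch-side view of the communication buffer
};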

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera denera force-pushed the comm-gemm-overlap-refactor branch from 7255ca5 to 7c0cc8d Compare July 31, 2024 18:44
@denera denera self-assigned this Jul 31, 2024
@denera denera added the enhancement New feature or request label Jul 31, 2024
transformer_engine/common/CMakeLists.txt (outdated review thread, resolved)
transformer_engine/pytorch/module/layernorm_linear.py (outdated review thread, resolved)
transformer_engine/common/util/pybind_helper.h (outdated review thread, resolved)
@timmoon10 timmoon10 self-requested a review August 1, 2024 00:36
_ubuf = torch::empty({(sample.size(0) / _tp_size) * _num_ubuf_chunks, sample.size(1)},
sample.options());
ubuf_ptr = _ubuf.data_ptr();
register_gpu_buffer(&ubuf_ptr, _ubuf_bytes, false);


(Just a reminder here) It seems your bugfix of the legacy IPC flow is included in this PR, but the P2P part is not.

Your bugfix: force TE/PyTorch to always let Userbuffers manually allocate its buff…

Collaborator (Author):


Thanks for catching this!

@timmoon10 (Collaborator) left a comment


Overall this looks pretty good. My suggestions are quibbles with the API.

@@ -26,7 +26,7 @@ extern "C" {
* \param[in] stream CUDA stream used for the operation.
*/

enum class NVTE_Activation_Type {
enum NVTE_Activation_Type {
Collaborator:


Good catch, this breaks the C API. That said, I think enum names like GELU and RELU are too common and will likely run into name conflicts. I don't see this used in the C API, so a better approach would be to put this inside an #ifdef __cplusplus. In the future it may be better to take advantage of C++ features (put it within the transformer_engine namespace and rename it to Activation_Type), but that's beyond the scope of this PR.
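
A minimal sketch of that approach, assuming the enum stays in the shared header; enumerator names other than GELU and RELU are placeholders:

#ifdef __cplusplus
// Only visible to C++ translation units, so the C API is unaffected and the
// scoped enum keeps names like GELU and RELU out of the global namespace.
enum class NVTE_Activation_Type {
  GELU,
  RELU,
  // ... remaining activation kinds ...
};
#endif  // __cplusplus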

Collaborator:


For now:

Suggested change
enum NVTE_Activation_Type {
enum class NVTE_Activation_Type {

Comment on lines 36 to 39
ub_algo: tex.NVTE_Comm_Overlap_Algo = None,
ub: Union[tex.CommOverlap, tex.CommOverlapP2P] = None,
extra_output_tensor: torch.Tensor = None,
bulk_ubuf_fp8_type: Optional[tex.DType] = None,
Collaborator:


While we're changing the API, it may be worth considering putting UB-specific options in a dict:

Suggested change
ub_algo: tex.NVTE_Comm_Overlap_Algo = None,
ub: Union[tex.CommOverlap, tex.CommOverlapP2P] = None,
extra_output_tensor: torch.Tensor = None,
bulk_ubuf_fp8_type: Optional[tex.DType] = None,
extra_output_tensor: torch.Tensor = None,
ub_options: Optional[Dict[str, Any]] = None,

UB's API is unstable, and this puts a burden on downstream users (see the versioning logic in Mcore). By wrapping everything in a dict, TE can take more responsibility for backward compatibility, e.g. by handling a dict with options from an older version.

That said, these functions are considered internal interfaces. This is more of a concern with external APIs like the modules.

transformer_engine/pytorch/cpp_extensions/gemm.py (outdated review thread, resolved)
@denera denera force-pushed the comm-gemm-overlap-refactor branch from 2e55bb2 to dd8cc21 Compare August 1, 2024 20:56
@@ -98,42 +98,38 @@ def initialize_ub(
assert _ub_communicators is None, "UB communicators are already initialized."
_ub_communicators = {}

Contributor:


PR #1088 fixes a bug with the DGRAD-RS overlap.
Please make sure that PR's changes are not reverted by this change.


NVTE_CHECK_CUDA(cudaEventRecord(_stop_send, _stream_send));
NVTE_CHECK_CUDA(cudaStreamWaitEvent(stream_main, _stop_send, 0));


It seems these 2 lines (NVIDIA/TransformerEngine) are missing:

NVTE_CHECK_CUDA(cudaEventRecord(_stop_recv, (cudaStream_t)_stream_recv));
NVTE_CHECK_CUDA(cudaStreamWaitEvent((cudaStream_t)stream_main, _stop_recv, 0));

The reduce part probably needs to wait for all ubuf::userbuffers_recv calls to finish receiving data before it can continue with the reduction.
Not sure if I missed something. Does this make sense?
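
A minimal sketch of the ordering described above, combining the quoted lines; the reduction launch is a placeholder:

// After the last userbuffers_recv for this chunk, record an event on the recv
// stream and make the main stream wait on it (mirroring the send-side sync above)
// so the reduction only starts once all data has arrived.
NVTE_CHECK_CUDA(cudaEventRecord(_stop_recv, (cudaStream_t)_stream_recv));
NVTE_CHECK_CUDA(cudaStreamWaitEvent((cudaStream_t)stream_main, _stop_recv, 0));
// ... launch the reduction kernel on stream_main ...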

@denera denera force-pushed the comm-gemm-overlap-refactor branch 8 times, most recently from 4847133 to 10feff5 Compare August 28, 2024 23:14
@denera denera force-pushed the comm-gemm-overlap-refactor branch 2 times, most recently from 1f06f3d to bbd8120 Compare August 28, 2024 23:47
denera and others added 20 commits September 6, 2024 14:17 (each signed off by Alp Dener <adener@nvidia.com>). Truncated commit messages include:

  • …/common
  • …ap code
  • …areable file handle send/recv
  • …ters so PyTorch can factor externally allocated memory into its garbage collection threshold

pre-commit-ci bot and others added 3 commits September 6, 2024 14:19, including:

  • …mmOverlapHelper to simplify Python function signatures (signed off by Alp Dener <adener@nvidia.com>)

CommOverlapHelper(c10d::ProcessGroup *world_group,
std::optional<c10d::ProcessGroup *> intra_node_group_holder,
std::optional<c10d::ProcessGroup *> inter_node_group_holde);


(Minor) I inadvertently spotted a small typo here: inter_node_group_holde. FYI.

@@ -529,6 +529,10 @@ class CommOverlapHelper : torch::CustomClassHolder {
};

class CommOverlap : torch::CustomClassHolder, public transformer_engine::CommOverlapBase {
private:
torch::Tensor _ubuf_torch;
torch::Tensor _ubuf_counter;
@anderson101866 commented Sep 9, 2024


Does _ubuf_counter become redundant now? It seems to be instantiated only in the constructor, with no further usage.

not available. Setting `NVTE_UB_WITH_MPI=1` when building TE overrides this
option and always initializes Userbuffers with direct MPI calls in C++,
which also requires `MPI_HOME=/path/to/mpi/root` to be set at compile time.
"""
if not tex.device_supports_multicast():
assert bool(os.getenv("UB_SKIPMC", "0")), (


Suggest using bool(os.getenv("UB_SKIPMC", None)) instead.

If UB_SKIPMC is not set, the default "0" is still treated as True because bool() of any non-empty string is True, so the assertion passes and multicast ends up enabled on a device without MC support.

Labels: enhancement (New feature or request)

5 participants