[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common #1067
base: main
Conversation
Force-pushed from 7255ca5 to 7c0cc8d
_ubuf = torch::empty({(sample.size(0) / _tp_size) * _num_ubuf_chunks, sample.size(1)},
                     sample.options());
ubuf_ptr = _ubuf.data_ptr();
register_gpu_buffer(&ubuf_ptr, _ubuf_bytes, false);
(Just a reminder here) It seems your bugfix of the legacy IPC flow is included in this PR, but the P2P part is not. Your bugfix: force TE/PyTorch to always let Userbuffers manually allocate its buff…
Thanks for catching this!
Overall this looks pretty good. My suggestions are quibbles with the API.
transformer_engine/common/include/transformer_engine/comm_gemm_overlap.h (outdated comment, resolved)
@@ -26,7 +26,7 @@ extern "C" {
  * \param[in] stream CUDA stream used for the operation.
  */

-enum class NVTE_Activation_Type {
+enum NVTE_Activation_Type {
Good catch, this breaks the C API. That said, I think enums like `GELU` and `RELU` are too common and will likely run into name conflicts. I don't see this used in the C API, so a better approach would be to put it inside an `#ifdef __cplusplus`. In the future it may be better to take advantage of C++ features (put it within the `transformer_engine` namespace and rename it to `Activation_Type`), but that's beyond the scope of this PR.
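For illustration, a minimal sketch of what that guard could look like in the header (the enumerator list here is an assumption for the example, not copied from the actual TE header):

#ifdef __cplusplus
// Visible only to C++ translation units, so the plain-C API surface is
// unaffected and the enumerator names do not leak into C code.
enum class NVTE_Activation_Type {
  GELU,
  RELU,
  // ... remaining activation types
};
#endif  // __cplusplus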
For now:
-enum NVTE_Activation_Type {
+enum class NVTE_Activation_Type {
ub_algo: tex.NVTE_Comm_Overlap_Algo = None,
ub: Union[tex.CommOverlap, tex.CommOverlapP2P] = None,
extra_output_tensor: torch.Tensor = None,
bulk_ubuf_fp8_type: Optional[tex.DType] = None,
While we're changing the API, it may be worth considering putting UB-specific options in `dict`s:
-ub_algo: tex.NVTE_Comm_Overlap_Algo = None,
-ub: Union[tex.CommOverlap, tex.CommOverlapP2P] = None,
-extra_output_tensor: torch.Tensor = None,
-bulk_ubuf_fp8_type: Optional[tex.DType] = None,
+extra_output_tensor: torch.Tensor = None,
+ub_options: Optional[Dict[str, Any]] = None,
UB's API is unstable, and this puts a burden on downstream users (see the versioning logic in Mcore). By wrapping everything in a `dict`, TE can take more responsibility for backward compatibility, e.g. when a user provides a `dict` with options from an older version. That said, these functions are considered internal interfaces, so this is more of a concern for external APIs like the modules.
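As a rough sketch of the backward-compatibility handling this would allow, assuming a hypothetical `ub_options` dict (the key names below are illustrative, not the actual TE interface):

from typing import Any, Dict, Optional

# Hypothetical defaults; a newer TE version can add keys here without breaking
# callers that built their dict against an older version.
_UB_OPTION_DEFAULTS: Dict[str, Any] = {
    "algo": None,
    "obj": None,
    "bulk_ubuf_fp8_type": None,
}

def _resolve_ub_options(ub_options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge a possibly older, partial options dict with current defaults."""
    resolved = dict(_UB_OPTION_DEFAULTS)
    if ub_options is not None:
        resolved.update(ub_options)
    return resolved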
Force-pushed from 2e55bb2 to dd8cc21
Force-pushed from b3fb503 to 51519eb
@@ -98,42 +98,38 @@ def initialize_ub(
    assert _ub_communicators is None, "UB communicators are already initialized."
    _ub_communicators = {}
PR #1088 fixes a bug with DGRAD-RS overlap. Please make sure that those changes are not reverted by this change.
NVTE_CHECK_CUDA(cudaEventRecord(_stop_send, _stream_send));
NVTE_CHECK_CUDA(cudaStreamWaitEvent(stream_main, _stop_send, 0));
It seems these 2 lines (NVIDIA/TransformerEngine) are missing:
NVTE_CHECK_CUDA(cudaEventRecord(_stop_recv, (cudaStream_t)_stream_recv));
NVTE_CHECK_CUDA(cudaStreamWaitEvent((cudaStream_t)stream_main, _stop_recv, 0));
The reduce part probably needs to wait for all `ubuf::userbuffers_recv` calls to finish receiving data before it can continue with the reduction. Not sure if I'm missing something. Does this make sense?
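For context, a sketch of the combined stream ordering being suggested here (variable names follow the snippet above; this illustrates the pattern and is not the actual PR code):

// Main stream waits on both helper streams before the reduction, so every
// ubuf::userbuffers_recv issued on _stream_recv has finished landing data.
NVTE_CHECK_CUDA(cudaEventRecord(_stop_send, _stream_send));
NVTE_CHECK_CUDA(cudaStreamWaitEvent(stream_main, _stop_send, 0));
NVTE_CHECK_CUDA(cudaEventRecord(_stop_recv, (cudaStream_t)_stream_recv));
NVTE_CHECK_CUDA(cudaStreamWaitEvent((cudaStream_t)stream_main, _stop_recv, 0));
// ...the reduction is then launched on stream_main...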
Force-pushed from 4847133 to 10feff5
Force-pushed from 1f06f3d to bbd8120
Force-pushed from 281e1be to 271cbf7
CommOverlapHelper(c10d::ProcessGroup *world_group,
                  std::optional<c10d::ProcessGroup *> intra_node_group_holder,
                  std::optional<c10d::ProcessGroup *> inter_node_group_holde);
(Minor) I inadvertently spotted a small typo here: `inter_node_group_holde`. FYI.
@@ -529,6 +529,10 @@ class CommOverlapHelper : torch::CustomClassHolder {
};

class CommOverlap : torch::CustomClassHolder, public transformer_engine::CommOverlapBase {
 private:
  torch::Tensor _ubuf_torch;
  torch::Tensor _ubuf_counter;
Does `_ubuf_counter` become redundant now? It seems to only be instantiated in the constructor, with no further usage.
not available. Setting `NVTE_UB_WITH_MPI=1` when building TE overrides this
option and always initializes Userbuffers with direct MPI calls in C++,
which also requires `MPI_HOME=/path/to/mpi/root` to be set at compile time.
"""
if not tex.device_supports_multicast():
    assert bool(os.getenv("UB_SKIPMC", "0")), (
Suggest using `bool(os.getenv("UB_SKIPMC", None))` instead. If `UB_SKIPMC` is not set, the default `"0"` is still treated as `True` due to `bool("<any string>")`, so the assert passes and multicast can end up enabled on a device without MC support.
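A standalone illustration of the truthiness pitfall (plain Python, not TE code):

import os

os.environ.pop("UB_SKIPMC", None)          # simulate the variable being unset

# With a default of "0", an unset variable is still truthy, so the assert never fires:
print(bool(os.getenv("UB_SKIPMC", "0")))   # True, because "0" is a non-empty string

# With a default of None, an unset variable is falsy and the assert fires as intended:
print(bool(os.getenv("UB_SKIPMC", None)))  # False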
Description
This PR moves Userbuffers and comm+GEMM overlap algorithms from TE/PyTorch to TE/common with refactored interfaces to remove the PyTorch dependency.
Type of change
Changes
- `transformer_engine/pytorch/csrc/userbuffers` moved to `transformer_engine/common/comm_gemm_overlap/userbuffers`.
- `transformer_engine/pytorch/csrc/comm_gemm_overlap.h` split into `transformer_engine/common/include/transformer_engine/comm_gemm_overlap.h` and `transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp`, and refactored to remove the `torch::Tensor` dependency.

Checklist: