
[PyTorch] Prototype for operation-based API #707

Merged (74 commits) on Jul 9, 2024

Conversation

@timmoon10 (Collaborator) commented on Mar 9, 2024:

Currently, Transformer Engine exposes fused operations with custom modules like LayerNormLinear. These are highly tuned for certain workloads (especially GPT), but are not easy to generalize to other models. This approach is especially cumbersome when the forward and backward passes have different fusion opportunities (e.g. forward GEMM+bias+gelu and backward dgelu+dbias+cast+transpose).

This PR adds a new API for specifying Transformer Engine models. Instead of using large compound modules (e.g. LayerNormLinear), users can build up a Sequential module out of small FusibleOperations (e.g. LayerNorm, Linear). The Sequential module (with an API similar to torch.nn.Sequential) will internally attempt to fuse operations together (possibly differently in the forward and backward passes).
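
For illustration, here is a minimal usage sketch of the proposed API (the constructor arguments, shapes, and dtype choices are assumptions for this example, not part of the PR):

import torch
import transformer_engine.pytorch as te

# Build a model out of small fusible operations. The fuser may combine
# Linear + Bias into a single fused op, possibly differently in the
# forward and backward passes.
model = te.ops.Sequential(
    te.ops.Linear(1024, 1024),
    te.ops.Bias(1024),
)

x = torch.randn(32, 1024, device="cuda", requires_grad=True)
y = model(x)  # operation fusions are planned internally
y.sum().backward()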

Some of the more important components:

  • te.ops.FusibleOperation: A neural network operation that can be processed by the fuser. It has forward and backward functions similar to torch.autograd.Function.
  • te.ops.BasicOperation: A minimal FusibleOperation. Its forward and backward functions must be implemented, and it holds the model state and parameters (a rough sketch appears after this list).
  • te.ops.FusedOperation: A FusibleOperation that is interchangeable with multiple BasicOperations. If it implements a forward or backward function, it must save the same context as the corresponding BasicOperations.
  • te.ops.Sequential: A container module with an API similar to torch.nn.Sequential.
  • te.ops.OperationFuser: A helper class that manages autograd, performs the operation fusions, and keeps track of the corresponding BasicOperations and FusedOperations.
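
For context, a rough sketch of what a BasicOperation subclass could look like (the method names, signatures, and return convention below are inferred loosely from this PR and may not match the final API):

class Scale(te.ops.BasicOperation):
    """Toy op that multiplies its input by a fixed scalar"""

    def __init__(self, alpha: float) -> None:
        super().__init__()
        self.alpha = alpha

    def op_forward(self, ctx, input_, *args, **kwargs):
        # Nothing needs to be saved in ctx for the backward pass
        return input_ * self.alpha

    def op_backward(self, ctx, grad_output):
        # Gradient w.r.t. the input, plus gradients w.r.t. parameters (none here)
        return grad_output * self.alpha, ()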

As a proof of concept, I've been able to fuse Linear and Bias operations, both on a single GPU and with tensor parallelism. These modules have been implemented to support Float8Tensor, which simplifies the implementation and will be important for future work (e.g. FP8 attention). I've also added single-GPU and multi-GPU tests.

This work is heavily influenced by #377 from @janekb04.

Remaining tasks:

  • FP8 scaling factor updates
  • Checkpointing
  • Documentation

Future work:

  • Operations: layer norm, activations, attention
  • Fusions
  • Possibly reimplementing the existing modules using this infrastructure

@timmoon10 added the "enhancement" (New feature or request) label on Mar 9, 2024.

Commit notes from the development timeline:

  • Runs, but need to validate. Runtime errors with non-FP8 params and FP8 compute, or FP8 params and non-FP8 compute.
  • Test does not pass with FP8.
  • Not supported by cuBLAS.
  • Still need to implement amax reductions.
  • Add documentation for unfused ops.
  • Expand documentation.
@timmoon10 (author): /te-ci pytorch

timmoon10 and others added 2 commits on June 14, 2024 at 17:44.
@timmoon10 (author): /te-ci pytorch

@sudhakarsingh27 (Collaborator) left a review comment:

pass 1

transformer_engine/pytorch/ops/basic/basic_linear.py (outdated; resolved)
@property
@abc.abstractmethod
def is_fused_op(self) -> bool:
    """Whether this op is the fusion of one or more basic ops"""
Collaborator (reviewer) suggested change (add pass after the docstring):

    """Whether this op is the fusion of one or more basic ops"""
    pass

@timmoon10 (author): PyLint prefers just putting the docstring: 738df8a

"""Whether this op is the fusion of one or more basic ops"""

def pre_forward(self) -> None:
"""Preprocessing before forward pass"""
Collaborator (reviewer) suggested change (add pass after the docstring):

    """Preprocessing before forward pass"""
    pass

@timmoon10 (author): PyLint prefers just putting the docstring: 738df8a

curr_len = meta.amax_history.size(0)
if curr_len == amax_history_len:
    continue
with torch.no_grad():
Collaborator (reviewer): Just curious, why do we need torch.no_grad here?

@timmoon10 (author): I don't think it's needed, but I'm being paranoid about leaking the autograd graph. This code path is infrequent, but it is called outside the OperationFuser's autograd function.


Parameters
----------
mode: {"input", "param", "grad_output"}
Collaborator (reviewer): Is "name" a better fit for this arg?


for fp8_meta in self._fp8_metas.values():
    self._check_fp8_meta(fp8_meta)

# Register FP8 metadata for amax and scale update
Collaborator (reviewer): Is this part of the code (or in spirit) from prepare_for_forward in the original API?

@timmoon10 (author): Exactly:

if self.fp8 and not FP8GlobalStateManager.fp8_graph_capturing():
    FP8GlobalStateManager.add_fp8_tensors_to_global_buffer(
        self.fp8_meta, fp8_weights=self._get_fp8_params()
    )

Although now that you mention it, we should register "grad_output" in the backward pass instead of the forward.

@timmoon10 (author) on Jun 27, 2024: Actually, this matches the module API. The fp8_metas are registered in the forward pass, and we manually trigger an update in the backward pass:

https://github.com/timmoon10/TransformerEngine/blob/f4e6af92e8956d948fe1fbaefbc1b2dd6f32b457/transformer_engine/pytorch/ops/fuser.py#L169-L171

if ctx.reduce_and_update_bwd_fp8_tensors and not is_graph_capturing():
    FP8GlobalStateManager.reduce_and_update_fp8_tensors(forward=False)

torch.Tensor:
    Output tensor

"""
Collaborator (reviewer): Add pass.

@timmoon10 (author): PyLint prefers just putting the docstring: 738df8a

Iterable of torch.Tensor:
    Loss gradients w.r.t. parameters

"""
Collaborator (reviewer): Add pass.

@timmoon10 (author): PyLint prefers just putting the docstring: 738df8a

    self.append(module)

def add_module(self, name: str, module: Optional[torch.nn.Module]) -> None:
    self._module_groups = None
Collaborator (reviewer): self._module_groups is already set to None at the beginning of __init__. Why do we set it to None again?

@timmoon10 (author): If we add a module after calculating operation fusions, then we need to invalidate the operation fusions and recalculate.
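
A minimal sketch of the lazy-invalidation pattern being described (the _module_groups name comes from the snippet above; everything else is a simplified stand-in, not the real te.ops.Sequential):

from typing import Optional

import torch

class _SequentialSketch(torch.nn.Module):
    """Simplified container that caches a fusion plan and invalidates it on mutation"""

    def __init__(self) -> None:
        super().__init__()
        self._module_groups = None  # cached fusion plan

    def add_module(self, name: str, module: Optional[torch.nn.Module]) -> None:
        # Adding a module after fusions were computed invalidates the cached plan
        self._module_groups = None
        super().add_module(name, module)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The plan is recomputed lazily on the next forward pass
        if self._module_groups is None:
            # Stand-in for the real fusion pass, which would group ops for fusing
            self._module_groups = list(self.children())
        for module in self._module_groups:
            x = module(x)
        return x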


def _get_keys_by_idx(self, idx: int | slice) -> list[str]:
    """Get module keys corresponding to indices"""
    if isinstance(idx, slice):
Collaborator (reviewer): Should there be a slice-indices check as well?

@timmoon10 (author): In principle, but it's simpler to rely on the bounds checking in list. This implementation is similar to torch.nn.Sequential:
https://github.com/pytorch/pytorch/blob/389492e2640730b0a199ffe506582ed4fd2c4afc/torch/nn/modules/container.py#L140

# Reshape FP8 tensor
# Note: Preserve cached transpose if possible
if is_float8_tensor(tensor):
    out = Float8Tensor.make_like(
Member (reviewer): How does this preserve the cache?

@timmoon10 (author): The transpose is part of Float8Tensor._fp8_attrs:

_transpose = property(**_make_fp8_attr_property_funcs("transpose"))

This function is not quite equivalent to the Float8Tensor's view or reshape functions since typically reshaping a tensor changes its transpose, while this function tries to preserve the 2D transpose.

def op_forward(
    self,
    ctx: OperationContext,
    input: torch.Tensor,  # pylint: disable=redefined-builtin
Member (reviewer): These are worth changing IMO: input → inp

@timmoon10 (author) on Jun 27, 2024: I'd agree for internal implementations, but input feels much better for a user-facing API:

op = te.ops.AllGather(...)
y = op(input=x)

I suppose BasicOperation.op_forward can be considered internal implementation, so I've changed the arg name to input_. I feel strongly about keeping the input arg in other functions like FusableOperation.forward.

Comment on lines +354 to +358
basic_op_ctxs[0],
input_,
basic_op_prev_ops[0],
basic_op_next_ops[0],
**basic_op_kwargs[0],
Member (reviewer): Could you explain why we index 0 here?

@timmoon10 (author): OperationFuser doesn't make any distinction between BasicOperation and FusedOperation, but interacts with them via the base class (e.g. FusableOperation.fuser_forward). A FusableOperation consists of one or more BasicOperations, so a BasicOperation will receive just one ctx from OperationFuser while a FusedOperation may receive multiple.

Fix spelling of "fusible". Avoid "input" name in internal APIs.
@timmoon10 (author): /te-ci pytorch

@timmoon10 (author): /te-ci pytorch

@timmoon10 (author): Merging with approval from @ksivaman, @sudhakarsingh27, @ptrendx. This feature is still experimental and incomplete.

Labels: enhancement (New feature or request)

3 participants