aten::empty_like #2654

Merged
merged 2 commits into from
Apr 16, 2024

apbose
Collaborator

@apbose apbose commented Feb 23, 2024

No description provided.

@apbose apbose marked this pull request as draft February 23, 2024 01:23
@github-actions github-actions bot added component: tests Issues re: Tests component: conversion Issues re: Conversion stage component: api [Python] Issues re: Python API component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Feb 23, 2024
@github-actions github-actions bot requested a review from gs-olive February 23, 2024 01:23
@github-actions github-actions bot added component: lowering Issues re: The lowering / preprocessing passes and removed component: tests Issues re: Tests component: conversion Issues re: Conversion stage labels Feb 27, 2024
@apbose apbose changed the title aten::empty_like evaluator aten::empty_like Feb 27, 2024
@apbose
Collaborator Author

apbose commented Feb 27, 2024

I had a question about this one: does this require a test? Consider the following test:

    def test_lowering_empty_like(self):
        class emptyLike(torch.nn.Module):
            def __init__(self, *args, **kwargs) -> None:
                super().__init__(*args, **kwargs)

            def forward(self, x):
                y = torch.ops.aten.empty_like.default(x)
                return y

        # Operations expected to be removed in the traced graph after decompositions
        expected_ops = {}
        unexpected_ops = {torch.ops.aten.empty_like.default}

        inputs = [torch.randn(2, 3).cuda()]

        #inputs = [torch.empty((2,3), dtype=torch.int32, device = 'cuda')]

        fx_graph = torch.fx.symbolic_trace((emptyLike()))
        unexpected_ops_seen, expected_ops_unseen = lower_graph_testing(
            fx_graph,
            inputs,
            expected_ops=expected_ops,
            unexpected_ops=unexpected_ops,
            min_block_size=1,
        )

        torch._dynamo.reset()

        # Validate that the results between Torch and Torch-TRT are similar
        optimized_model = torch_tensorrt.compile(
            fx_graph,
            "torch_compile",
            inputs,
            min_block_size=1,
            pass_through_build_failures=True,
        )
        optimized_model_results = optimized_model(*inputs).detach().cpu()
        torch_model_results = fx_graph(*inputs).detach().cpu()

        max_diff = float(
            torch.max(torch.abs(optimized_model_results - torch_model_results))
        )
        self.assertAlmostEqual(
            max_diff,
            0,
            DECIMALS_OF_AGREEMENT,
            f"empty_like TRT outputs don't match with the original model.",
        )
  1. Is the above comparison required, given that both the Torch-TRT compiled model (optimized_model) and fx_graph have the same lowering pass applied?
  2. Also, when I compile the above, I see the following error:
  File "/home/abose/Documents/work/torchTRT_empty_2_26/TensorRT/tests/py/dynamo/testing_utilities.py", line 55, in fx_dynamo_testing_backend
    trt_compiled = custom_backend(
  File "/home/abose/Documents/work/torchTRT_empty_2_26/TensorRT/tests/py/dynamo/testing_utilities.py", line 73, in compile_module_testing
    partitioned_module, _ = partitioning.fast_partition(
  File "/home/abose/Documents/work/torchTRT/torch_trt/lib/python3.8/site-packages/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py", line 280, in partition
    partitioned_graph = partitioner.partition_graph()
  File "/home/abose/Documents/work/torchTRT/torch_trt/lib/python3.8/site-packages/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py", line 197, in partition_graph
    subgraphs = self.put_nodes_into_subgraphs()
  File "/home/abose/Documents/work/torchTRT/torch_trt/lib/python3.8/site-packages/torch/fx/passes/splitter_base.py", line 805, in put_nodes_into_subgraphs
    raise FxNetSplitterInternalError("Couldn't create subgraphs")
torch._dynamo.exc.BackendCompilerFailed: backend='functools.partial(<function fx_dynamo_testing_backend at 0x7f5c946045e0>, store_intermediate_graphs=[], min_block_size=1, torch_executed_ops=set(), use_fast_partitioner=True)' raised:
FxNetSplitterInternalError: Couldn't create subgraphs

Is this expected? Is it something to do with no splits happening for the above graph?

@apbose apbose marked this pull request as ready for review February 27, 2024 07:14
@gs-olive
Collaborator

I'm not sure what empty_like lowers to, but you could add another operation to the nn.Module so that the graph is non-empty. It is likely that the graph is completely empty, so the partitioning fails. Since this decomposition is Torch-provided, we shouldn't need a test. However, it is important to verify that whatever the operator is lowered to is also supported by Torch-TRT.

@apbose
Collaborator Author

apbose commented Mar 1, 2024

I do not think the graph would be empty: empty_like should lower to an aten::size call followed by the creation of a torch.Tensor() of the corresponding size, so the lowered graph should contain these operations, though I need to confirm.
OK, I will add another operation to the module and verify the lowering.

@apbose
Collaborator Author

apbose commented Mar 6, 2024

I checked the above test with three cases:

  1. Case 1:
class emptyLike(torch.nn.Module):
            def __init__(self, *args, **kwargs) -> None:
                super().__init__(*args, **kwargs)

            def forward(self, x):
                y = torch.ops.aten.empty_like.default(x)
                return y

Without decomposition of empty_like
a. Before AOT trace

%l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
%empty_like_default : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%l_x_,), kwargs = {})
 return (empty_like_default,)

b. After AOT trace

%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%clone : [num_users=1] = call_function[target=torch.ops.aten.clone.default](args = (%arg0_1,), kwargs = {})
%empty_like : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%clone,), kwargs = {})
return (empty_like,)

c. After lowering passes

%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%empty_like : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%arg0_1,), kwargs = {})
return (empty_like,)

This is the graph that is passed to the partitioner.

With the decomposition of empty_like
a. Before AOT trace

%l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
%empty_like_default : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%l_x_,), kwargs = {})
 return (empty_like_default,)

b. After AOT trace

%arg0_1 : [num_users=0] = placeholder[target=arg0_1]
%empty_like : [num_users=1] = call_function[target=torch.ops.aten.empty_permuted.default](args = ([2,3],[0,1]), kwargs = {})
return (empty_like,)

c. After lowering passes

%arg0_1 : [num_users=0] = placeholder[target=arg0_1]
%_frozen_param0 : [num_users=1] = get_attr[target=_frozen_param0]
return (_frozen_param0,)

Partitioning of the above graph errors out in put_nodes_into_subgraphs in torch/fx/passes/splitter_base.py, presumably because _frozen_param0 is the only node with users (that is my assumption).

  2. Case 2:
class emptyLike(torch.nn.Module):
            def __init__(self, *args, **kwargs) -> None:
                super().__init__(*args, **kwargs)

            def forward(self, x):
                c = torch.ops.aten.add(x, x)
                y = torch.ops.aten.empty_like.default(c)
                return y

As in the case above, if empty_like is included in the decompositions, the shape of x is extracted statically at compile time, and no subgraphs are created.

  3. Case 3:
class emptyLike(torch.nn.Module):
            def __init__(self, *args, **kwargs) -> None:
                super().__init__(*args, **kwargs)

            def forward(self, x):
                c = torch.ops.aten.add(x, x)
                y = torch.ops.aten.empty_like.default(c)
                d = y + c
                return d

With the decomposition of empty_like
a. Before AOT trace

   %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
   %add : [num_users=2] = call_function[target=torch.ops.aten.add](args = (%l_x_, %l_x_), kwargs = {})
   %empty_like_default : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%add,), kwargs = {})
   %add_1 : [num_users=1] = call_function[target=operator.add](args = (%empty_like_default, %add), kwargs = {})
   return (add_1,)

b. After AOT trace

 %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
 %clone : [num_users=1] = call_function[target=torch.ops.aten.clone.default](args = (%arg0_1,), kwargs = {})
 %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%clone, %clone), kwargs = {})
 %empty_permuted : [num_users=1] = call_function[target=torch.ops.aten.empty_permuted.default](args = ([2, 3], [0, 1]), kwargs = {dtype: torch.float32, layout: torch.strided, device: cuda:0, pin_memory: False})
  %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%empty_permuted, %add), kwargs = {})
    return (add_1,)

c. After lowering passes

    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, %arg0_1), kwargs = {})
    %_frozen_param0 : [num_users=1] = get_attr[target=_frozen_param0]
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%_frozen_param0, %add), kwargs = {})
    return (add_1,)

In the above case, there are add nodes in addition to the frozen_param node, so the subgraph is created.
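
The difference between Case 1 and Case 3 can be sketched with a toy model in plain Python. This is not the torch.fx splitter: `ToyNode`, `ToySplitterError`, `toy_put_nodes_into_subgraphs`, and `try_partition` are hypothetical names, and the sketch assumes the splitter only places callable nodes (call_function, call_module, call_method) into subgraphs and raises when it ends up with none.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Ops that the toy splitter is allowed to place into a subgraph.
# Assumption: placeholder, get_attr and output nodes are never grouped.
CALLABLE_OPS = {"call_function", "call_module", "call_method"}


@dataclass
class ToyNode:
    name: str
    op: str
    inputs: List[str] = field(default_factory=list)


class ToySplitterError(RuntimeError):
    """Hypothetical stand-in for FxNetSplitterInternalError."""


def toy_put_nodes_into_subgraphs(nodes: List[ToyNode]) -> List[List[str]]:
    """Group all callable nodes into one subgraph; fail if there are none."""
    callable_nodes = [n.name for n in nodes if n.op in CALLABLE_OPS]
    subgraphs = [callable_nodes] if callable_nodes else []
    if not subgraphs:
        raise ToySplitterError("Couldn't create subgraphs")
    return subgraphs


def try_partition(nodes: List[ToyNode]) -> Union[List[List[str]], str]:
    """Return the subgraphs, or the error message if partitioning fails."""
    try:
        return toy_put_nodes_into_subgraphs(nodes)
    except ToySplitterError as exc:
        return str(exc)


# Case 1 after lowering: placeholder, get_attr, output. Nothing callable.
case_1 = [
    ToyNode("arg0_1", "placeholder"),
    ToyNode("_frozen_param0", "get_attr"),
    ToyNode("output", "output", ["_frozen_param0"]),
]

# Case 3 after lowering: two add nodes remain around the frozen parameter.
case_3 = [
    ToyNode("arg0_1", "placeholder"),
    ToyNode("add", "call_function", ["arg0_1", "arg0_1"]),
    ToyNode("_frozen_param0", "get_attr"),
    ToyNode("add_1", "call_function", ["_frozen_param0", "add"]),
    ToyNode("output", "output", ["add_1"]),
]

print(try_partition(case_1))
print(try_partition(case_3))
```

Under these assumptions, Case 1 fails with the same message as the traceback, while Case 3 yields one subgraph containing both add nodes.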

From the above cases, it seems that the aten lowering happens during the AOT trace. As discussed, ideally a test case should not be required. I do not believe empty_permuted is supported, though.
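
For readers unfamiliar with the `([2, 3], [0, 1])` arguments in the traces above: the first list is the tensor size and the second is the physical layout, that is, the order of the logical dimensions in memory from outermost to innermost. The following is a minimal pure-Python sketch of the strides such a layout implies. `permuted_strides` is a hypothetical helper, not part of PyTorch, and it assumes a dense tensor with no padding.

```python
from typing import Sequence, Tuple


def permuted_strides(
    size: Sequence[int], physical_layout: Sequence[int]
) -> Tuple[int, ...]:
    """Strides of a dense tensor whose dims are stored in physical_layout order.

    physical_layout lists the logical dimensions from outermost to innermost
    in memory, so the last entry is the fastest-varying dimension.
    """
    if sorted(physical_layout) != list(range(len(size))):
        raise ValueError("physical_layout must be a permutation of the dims")
    strides = [0] * len(size)
    running = 1
    # Walk from the innermost dimension outwards, accumulating element counts.
    for dim in reversed(physical_layout):
        strides[dim] = running
        running *= size[dim]
    return tuple(strides)


print(permuted_strides([2, 3], [0, 1]))
print(permuted_strides([2, 3, 5, 7], [0, 2, 3, 1]))
```

For the trace above, `permuted_strides([2, 3], [0, 1])` returns `(3, 1)`, which is the ordinary row-major layout, so in this case the decomposition describes a plain contiguous allocation of the same shape as the input.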

@gs-olive
Collaborator

gs-olive commented Mar 9, 2024

Thanks for the analysis @apbose - this is very helpful. It looks like the constant_folding lowering pass is freezing the memory for the empty_like operator and storing it as an attribute of the model.

Regarding empty_permuted: it seems it would be necessary in the dynamic shape case, since we would not be able to freeze the parameter there. Based on the Core ATen IR, prims.empty_permuted appears to be a core op, so I do think its conversion/evaluation would be helpful here, but it could go in a separate PR.
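
The folding behaviour described in this comment can be illustrated with a toy pass in plain Python. This is only a sketch and not the Torch-TRT constant_folding pass: the tuple-based graph format, `toy_constant_fold`, and the `sym_size` node in the dynamic example are all hypothetical. It assumes that a call_function node is frozen exactly when none of its inputs depend on a placeholder.

```python
from typing import List, Tuple

# A toy node is (name, op, input_names); this is not the torch.fx data model.
Node = Tuple[str, str, List[str]]


def toy_constant_fold(nodes: List[Node]) -> List[Node]:
    """Freeze call_function nodes whose inputs do not depend on a placeholder."""
    runtime = set()  # names whose value is only known at run time
    rename = {}      # folded node name -> frozen parameter name
    folded: List[Node] = []
    for name, op, inputs in nodes:
        inputs = [rename.get(i, i) for i in inputs]
        depends_on_input = any(i in runtime for i in inputs)
        if op == "placeholder":
            runtime.add(name)
            folded.append((name, op, inputs))
        elif op == "call_function" and not depends_on_input:
            # Everything feeding this node is known at compile time.
            frozen = f"_frozen_param{len(rename)}"
            rename[name] = frozen
            folded.append((frozen, "get_attr", []))
        else:
            if depends_on_input:
                runtime.add(name)
            folded.append((name, op, inputs))
    return folded


# Static shapes (Case 3): empty_permuted has only literal arguments.
static_case_3: List[Node] = [
    ("arg0_1", "placeholder", []),
    ("add", "call_function", ["arg0_1", "arg0_1"]),
    ("empty_permuted", "call_function", []),
    ("add_1", "call_function", ["empty_permuted", "add"]),
    ("output", "output", ["add_1"]),
]

# Dynamic shapes (hypothetical): the size comes from a runtime sym_size node.
dynamic_case_3: List[Node] = [
    ("arg0_1", "placeholder", []),
    ("add", "call_function", ["arg0_1", "arg0_1"]),
    ("sym_size", "call_function", ["add"]),
    ("empty_permuted", "call_function", ["sym_size"]),
    ("add_1", "call_function", ["empty_permuted", "add"]),
    ("output", "output", ["add_1"]),
]

for node in toy_constant_fold(static_case_3):
    print(node)
```

In the static case, the toy pass replaces empty_permuted with a `_frozen_param0` get_attr node, mirroring the "After lowering passes" trace for Case 3. In the dynamic case, nothing is folded and empty_permuted stays in the graph, which is why a converter or evaluator for it would be needed there.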

@apbose
Collaborator Author

apbose commented Mar 12, 2024

OK, I will go ahead and make a separate PR for empty_permuted. Can this PR be merged for now, then?

Collaborator

@gs-olive gs-olive left a comment

Looks good to me

@apbose apbose merged commit c5b8909 into main Apr 16, 2024
16 of 21 checks passed
peri044 pushed a commit that referenced this pull request Apr 19, 2024
laikhtewari pushed a commit that referenced this pull request May 24, 2024
Labels
cla signed component: api [Python] Issues re: Python API component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths component: lowering Issues re: The lowering / preprocessing passes
3 participants