[Core ATen Opset] Lower aten_pixel_shuffle #5886
As @zpcore mentioned in an offline sync, the behavior of this op seems a bit odd: for the aten op pixel_shuffle, when dispatched in XLA, the op shows different behavior when called through C++ (xla/test/cpp/test_aten_xla_tensor_4.cpp, line 1224 at 657b692) versus through Python.
Below is the sample code I use for testing pixel_shuffle in Python (import torch ...). The output is always a tensor of all zeros, which is different from running it through C++.
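A minimal sketch of that kind of Python repro (the shapes and the exact call pattern here are assumptions, not the code from the report):

```python
import torch
import torch_xla.core.xla_model as xm

# Hypothetical repro: an (N, C*r^2, H, W) input with upscale factor r=2 on the XLA device.
device = xm.xla_device()
x = torch.randn(1, 4, 3, 3, device=device)

out = torch.pixel_shuffle(x, upscale_factor=2)

# Copying back to CPU materializes the result; the report says the Python path
# returns an all-zero tensor here, unlike the C++ test.
print(out.cpu())
```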
We got the following HLO (from the C++ path and the Python path, respectively):
As we can see, the C++ and Python paths generate different HLO graphs. The op
This has something to do with
@bdhirsh to see why. Update 03/19: I forgot to mention that, for some reason, the input tensor is dropped, as shown in
Any updates on this issue?
No, let me follow up with @bdhirsh.
cc @jiawenliu64, as this op appears to fail tests when the functionalization flag is enabled.
Hey! The API runs at xla/torch_xla/csrc/aten_xla_type.cpp, lines 3639 to 3645 (at 5a113af); pixel_shuffle is registered there.
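One way to double-check what the dispatcher has registered for this op from Python (a debugging sketch; torch._C._dispatch_dump is an internal, unstable helper, and using it here is a suggestion rather than something from the thread):

```python
import torch
import torch_xla  # importing torch_xla loads the C++ extension that registers the XLA kernels

# Dump the dispatch table for pixel_shuffle; if the op is registered in
# aten_xla_type.cpp, an XLA entry should show up in this output.
print(torch._C._dispatch_dump("aten::pixel_shuffle"))
```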
My guess for the python vs C++ difference is that there's a C++ decomposition registered in core here (that secretly runs some view ops like And when you use python, In theory... both of those decomps should be correct. Does one of the IR's look wrong? Alternatively: you can definitely remove the |
Thanks for the response, @bdhirsh! The original reason we wanted to lower this was that it is marked as a "core" op in PyTorch's native_functions.yaml, and we wanted torch_xla to support lowerings for all core ops. My initial understanding was that a "core" op would not be decomposed into any further ops; is that the case? According to https://github.com/pytorch/pytorch/blob/58047205ed098c04ec045e66fc39dcc70b60600b/torch/_refs/nn/functional/__init__.py#L1169, it appears to have some decompositions.
@wonjoolee95 the way I would categorize the decomps we have:

(1) There are a ton of decomps, for many (most?) ops in ATen. A compiler backend to torch.compile can specify which of those decomps it does or doesn't want to run, and lower the remaining primitive ops that show up in the graph directly. For example, inductor has its own set of decomps that it uses (basically core ATen + a few other decomps: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/decomposition.py#L75).

(2) There is a canonical "core ATen opset" that backends can choose to target, and you can get out a graph of core ATen IR by specifying that you only want to run the core ATen decomps. So if you e.g. use torch.export and run with the core ATen decomps set, you'll get a graph of core ATen IR. But the eager-mode XLA integration doesn't necessarily run the same set of decomps as core ATen (although you can change which ops you choose to decompose vs. lower directly).
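As a concrete illustration of point (2), here is a sketch of getting core ATen IR out of torch.export; it assumes a recent PyTorch where ExportedProgram.run_decompositions() defaults to the core ATen decomposition table:

```python
import torch
from torch.export import export

class PixelShuffleModule(torch.nn.Module):
    def forward(self, x):
        return torch.pixel_shuffle(x, upscale_factor=2)

ep = export(PixelShuffleModule(), (torch.randn(1, 4, 3, 3),))

# With no explicit table, run_decompositions applies the default (core ATen)
# decompositions; the printed graph shows what pixel_shuffle ends up as.
core_ep = ep.run_decompositions()
print(core_ep.graph_module.code)
```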
Thanks for the explanation, that makes a lot of sense. @zpcore, I'll remove this issue from the scope of the "core aten opset". In
Hi @bdhirsh, @wonjoolee95, thanks for following up. I checked the decomposition trace, and it turns out that if we move the tensor to the XLA device, it will use a different decomposition. Below is the example code.
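A minimal sketch of such a CPU-vs-XLA comparison (the shapes are assumptions; the IR dump uses the internal torch_xla._XLAC._get_xla_tensors_text helper):

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

x = torch.randn(1, 4, 3, 3)

# CPU path: pixel_shuffle goes through the eager CPU kernel.
print(torch.pixel_shuffle(x, 2))

# XLA path: the same call may be routed through a different decomposition.
y = torch.pixel_shuffle(x.to(xm.xla_device()), 2)

# Internal debugging helper: dumps the lazy IR recorded for the given tensors.
print(torch_xla._XLAC._get_xla_tensors_text([y]))
```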
It will enter _meta_registrations.py. If we remove
Either way, I didn't see the Python registration being called.
In order for PyTorch/XLA to support the PyTorch core ATen opset, each core ATen op needs a lowering in PyTorch/XLA. This issue tracks the PyTorch/XLA lowering for aten_pixel_shuffle.
Here are some general guidelines to lowering this op:
- Remove the @unittest.skip or @unittest.expectedFailure decorator and run the unit test at test_core_aten_ops.py, e.g.: pytest test/test_core_aten_ops.py -k test_aten_pixel_shuffle_0
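For reference, pixel_shuffle for an (N, C*r^2, H, W) input can be expressed with reshape/permute ops; the sketch below is a plain-PyTorch reference, not the actual decomposition or lowering used in this repo:

```python
import torch

def pixel_shuffle_reference(x: torch.Tensor, r: int) -> torch.Tensor:
    """Rearrange (N, C*r*r, H, W) -> (N, C, H*r, W*r), matching torch.pixel_shuffle."""
    n, c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(n, c, r, r, h, w)    # split channels into (C, r, r)
    x = x.permute(0, 1, 4, 2, 5, 3)    # (N, C, H, r, W, r)
    return x.reshape(n, c, h * r, w * r)

x = torch.randn(2, 8, 3, 3)
assert torch.allclose(pixel_shuffle_reference(x, 2), torch.pixel_shuffle(x, 2))
```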
For any questions, feel free to leave a comment in this PR.