
[Dynamo] Refine CPU fallback for TD+XLA #5000

Merged · 26 commits · May 30, 2023
Conversation

@wonjoolee95 (Collaborator) commented May 11, 2023

Picks up #4935


Allows unsupported ops to fall back to CPU in PyTorch/XLA + Dynamo by utilizing CapabilityBasedPartitioner.
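
For context, here is a minimal sketch of how such a capability-based partitioner can be wired up. This is an illustrative assumption rather than the PR's actual bridge code; XLA_SUPPORTED_OPS and the helper names are hypothetical.

import torch
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch.fx.passes.operator_support import OperatorSupport

# Hypothetical allow-list, for illustration only.
XLA_SUPPORTED_OPS = {torch.mul, torch.add}

class XlaOperatorSupport(OperatorSupport):

  def is_node_supported(self, submodules, node):
    # Keep a node on the XLA side only if its target is on the allow-list.
    return node.op in ("call_function", "call_method", "call_module") and \
        node.target in XLA_SUPPORTED_OPS

def partition_for_xla(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
  partitioner = CapabilityBasedPartitioner(
      gm, XlaOperatorSupport(), allows_single_node_partition=True)
  partitions = partitioner.propose_partitions()
  # Supported nodes are fused into submodules that XLA can compile; anything
  # unsupported stays in the outer graph and falls back to CPU execution.
  return partitioner.fuse_partitions(partitions)

In the partitioned module, only the fused submodules would then be handed to the XLA backend.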

@wonjoolee95 (Collaborator Author)

Okay, so simply removing the fallback assertions (on the master branch) does not cause failures, but it may produce wrong results. Take a look at this example:

import torch
import torch._dynamo as dynamo
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

@dynamo.optimize("torchxla_trace_once")
def fn_fallback_unsupported(t):
  # xla currently does not lower aten::median
  return unsupported(t)

def unsupported(t):
  ret = torch.mul(t, 2) # torch.mul is supported by XLA
  final_ret = torch.median(ret) # torch.median is not supported by XLA
  return final_ret

def print_metrics():
  print('CompileTime:', met.metric_data('CompileTime')[0])
  print('ExecuteTime:', met.metric_data('ExecuteTime')[0])
  print('CounterNames:', met.counter_names())

device = xm.xla_device()

# initial trace
a = torch.tensor([1, 2, 3, 4, 5])
a_cpu = unsupported(a)
print('a_cpu:', a_cpu)
a_xla = a.to(device=device)
a_xla_ret = fn_fallback_unsupported(a_xla)
print(a_xla_ret)
print_metrics()

met.clear_counters()
print('ClearedCounterNames:', met.counter_names())
print('-----')

# second time
a_2 = torch.tensor([1, 2, 3, 4, 5])
a_cpu_2 = unsupported(a_2)
print('a_cpu_2:', a_cpu_2)
a_xla_2 = a_2.to(device=device)
a_xla_ret_2 = fn_fallback_unsupported(a_xla_2)
print(a_xla_ret_2)
print_metrics()

met.clear_counters()
print('ClearedCounterNames:', met.counter_names())
print('-----')

# third time
a_3 = torch.tensor([2, 3, 4, 5, 6])
a_cpu_3 = unsupported(a_3)
print('a_cpu_3:', a_cpu_3)
a_xla_3 = a_3.to(device=device)
a_xla_ret_3 = fn_fallback_unsupported(a_xla_3)
print(a_xla_ret_3)
print_metrics()

On the master branch, this produces a wrong result on the third run:

a_cpu: tensor(6)
tensor(6, device='xla:1')
CompileTime: 4
ExecuteTime: 4
CounterNames: ['CreateXlaTensor', 'DestroyLtcTensor', 'DestroyXlaTensor', 'DeviceDataCacheMiss', 'UncachedCompile', 'xla::_copy_from', 'xla::_propagate_xla_data', 'xla::_to_copy', 'xla::_to_cpu', 'xla::copy', 'xla::empty_symint', 'xla::mul', 'CreateCompileHandles', 'CreateDataHandles', 'DestroyDataHandles', 'ReleaseDataHandles', 'XRTAllocateFromTensor_Empty', 'aten::median']
ClearedCounterNames: []
-----
a_cpu_2: tensor(6)
tensor(6, device='xla:1')
CompileTime: 4
ExecuteTime: 5
CounterNames: ['CreateXlaTensor', 'xla::_copy_from', 'xla::_to_copy', 'xla::empty_symint', 'CreateDataHandles']
ClearedCounterNames: []
-----
a_cpu_3: tensor(8)
tensor(6, device='xla:1')
CompileTime: 4
ExecuteTime: 6
CounterNames: ['CreateXlaTensor', 'xla::_copy_from', 'xla::_to_copy', 'xla::empty_symint', 'CreateDataHandles']

Note the result tensor(6, device='xla:1') on the third run. The correct result should be tensor(8). The root cause seems to be the graph being wrongly hashed.

Now with this PR, the result is correct:

a_cpu: tensor(6)
tensor(6, device='xla:1')
CompileTime: 2
ExecuteTime: 4
CounterNames: ['CachedCompile', 'CreateXlaTensor', 'DestroyLtcTensor', 'DestroyXlaTensor', 'xla::_copy_from', 'xla::_to_copy', 'xla::_to_cpu', 'xla::empty_symint', 'xla::mul', 'CreateDataHandles', 'DestroyDataHandles', 'ReleaseDataHandles', 'XrtCompile_Empty', 'XrtExecuteChained_Empty', 'XrtExecute_Empty', 'XrtMemoryInfo_Empty', 'XrtRead_Empty', 'XrtReleaseAllocationHandle_Empty', 'XrtReleaseCompileHandle_Empty', 'XrtSessionCount', 'XrtSubTuple_Empty', 'aten::median']
ClearedCounterNames: []
-----
a_cpu_2: tensor(6)
tensor(6, device='xla:1')
CompileTime: 2
ExecuteTime: 5
CounterNames: ['CachedCompile', 'CreateXlaTensor', 'DestroyLtcTensor', 'DestroyXlaTensor', 'xla::_copy_from', 'xla::_to_copy', 'xla::_to_cpu', 'xla::empty_symint', 'xla::mul', 'CreateDataHandles', 'ReleaseDataHandles', 'aten::median']
ClearedCounterNames: []
-----
a_cpu_3: tensor(8)
tensor(8, device='xla:1')
CompileTime: 2
ExecuteTime: 6
CounterNames: ['CachedCompile', 'CreateXlaTensor', 'DestroyLtcTensor', 'DestroyXlaTensor', 'DeviceDataCacheMiss', 'xla::_copy_from', 'xla::_to_copy', 'xla::_to_cpu', 'xla::empty_symint', 'xla::mul', 'CreateDataHandles', 'DestroyDataHandles', 'ReleaseDataHandles', 'aten::median']

As for next steps, let me review and update the metrics in the failing unit tests. Per our last discussion, the CompileTime metric should increase because we compile once more on the initial trace to fetch all the unsupported ops.

cc @seanlatias, let me know if this makes sense.

@seanlatias (Collaborator)

@wonjoolee95 Thanks Wonjoo. This is an interesting finding. It also explains why my previous accuracy run was correct: I didn't try different inputs. Please go ahead and add those metrics for testing. I have some new local changes that I'd like to push to further optimize the process. I'll do that once you finish your editing.

@wonjoolee95 (Collaborator Author)

@seanlatias, just updated the metrics for the failing unit tests and added some comments. Please take a look to see if they make sense.

I also just realized that the newly added test DynamoInPlaceTest.test_inplace_update_correctness is failing for a real reason with this PR. I'll look into this. Meanwhile, feel free to push your changes.

@wonjoolee95 (Collaborator Author)

Ah okay, the problem seems to be that for the in-place tests, the execution we run to fetch the fallback ops actually updates the tensors.

@wonjoolee95 (Collaborator Author)

The latest commit should fix the DynamoInPlaceTest.test_inplace_update_correctness test. The fix is a bit ugly because it just duplicates the code in the extract_internal function, but I'll let this be for now (left a TODO with my name to make it cleaner).
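
Roughly, the shape of the fix looks like the following sketch (not the PR's exact code; collect_fallback_ops is a hypothetical stand-in for the probe that fetches the fallback ops): the probe run may mutate in-place arguments, so the tensor arguments are cloned up front and restored afterwards.

import torch

def probe_fallback_ops(xla_model, xla_args):
  # Clone tensor inputs so the probe execution cannot leak in-place updates.
  cloned_xla_args = [
      arg.clone() if isinstance(arg, torch.Tensor) else arg for arg in xla_args
  ]
  fallback_ops = collect_fallback_ops(xla_model, xla_args)  # hypothetical probe
  # Restore the original values in case the probe mutated them in place.
  for xla_arg, cloned_xla_arg in zip(xla_args, cloned_xla_args):
    if isinstance(xla_arg, torch.Tensor):
      xla_arg.copy_(cloned_xla_arg)
  return fallback_ops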

@wonjoolee95 (Collaborator Author) commented May 17, 2023

The CPU CI is green, but the GPU CI fails for some reason due to precision.

wonjoolee95 marked this pull request as ready for review on May 17, 2023 17:26
wonjoolee95 self-assigned this on May 17, 2023
@seanlatias (Collaborator)

@wonjoolee95 I'll push my fix today. I'm facing some issues setting up the environment with the new code. Will let you know once I'm done.

@wonjoolee95 (Collaborator Author)

> @wonjoolee95 I'll push my fix today. I'm facing some issues setting up the environment with the new code. Will let you know once I'm done.

Sounds good, thanks @seanlatias. Just curious, what is your fix about? I wanted to update some fallback unit tests too, so just want to make sure our changes don't conflict.

@seanlatias (Collaborator)

My fix is about adding the metric checks in the unit tests. Also, in the dynamo bridge, we should also check call_method; previously we only checked call_module and call_function.
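
A small sketch of the node check being described (a hypothetical helper, not the actual bridge code):

import torch.fx

def nodes_to_check(gm: torch.fx.GraphModule):
  # call_method nodes (e.g. t.median()) must be inspected for fallback as
  # well, not only call_function and call_module nodes.
  return [
      node for node in gm.graph.nodes
      if node.op in ("call_function", "call_method", "call_module")
  ]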

@seanlatias (Collaborator)

BTW, it seems I don't have access to push to the branch. Am I missing something?

@wonjoolee95 (Collaborator Author)

> BTW, it seems I don't have access to push to the branch. Am I missing something?

Just sent a collaborator invite to your account. You should be able to push directly to this branch/PR after accepting the invitation.

@wonjoolee95 (Collaborator Author)

> For the slowdown, I have some solutions in mind. We can create a separate PR for that. One is that if users are certain the whole model is supported, they can turn off the CPU fallback check. The second is to create a cache to avoid checking repeated FX ops.

Great points. We're mostly okay with this amount of slowdown for now; we can move the perf improvements to future PRs as you said.
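
For the second idea, a minimal sketch under assumptions (the cache and is_op_lowerable are hypothetical placeholders for however the bridge actually probes an op):

# Hypothetical memoization of per-op support checks.
_SUPPORT_CACHE = {}

def is_supported_cached(target):
  if target not in _SUPPORT_CACHE:
    _SUPPORT_CACHE[target] = is_op_lowerable(target)  # hypothetical probe
  return _SUPPORT_CACHE[target]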

@wonjoolee95 (Collaborator Author)

> This is the error I see when trying to import torch_xla.

Traceback (most recent call last):
  File "pytorch/xla/test/dynamo/test_fallback.py", line 2, in <module>
    import torch_xla
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/torch_xla-2.1.0-py3.8-linux-x86_64.egg/torch_xla/__init__.py", line 134, in <module>
    import _XLAC
ImportError: /opt/conda/envs/py38/lib/python3.8/site-packages/torch_xla-2.1.0-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv

This undefined symbol usually happens when there is a mismatch between the PyTorch and PyTorch/XLA versions. Can you verify that your local PyTorch/XLA is also up to date?

self.assertEqual(met.metric_data('ExecuteTime')[0], sample_count)
# One graph for fetching the fallback ops.
# Another graph for the resnet18 inference.
self.assertEqual(met.metric_data('CompileTime')[0], 2)
Collaborator:

Hmm, I haven't read through the code yet, but we don't need to compile the HLO in order to determine whether there is a fallback.

Collaborator Author:

Reverted this back with the latest commit that adds ClearPendingIrs.

# Another graph for the resnet18 inference.
self.assertEqual(met.metric_data('CompileTime')[0], 2)
# Again, +1 offset in ExecuteTime for fetching the fallback ops.
self.assertEqual(met.metric_data('ExecuteTime')[0], sample_count + 1)
Collaborator:

Ditto, we should not introduce additional execution for non-fallback graphs.

Collaborator Author:

Reverted this back with the latest commit that adds ClearPendingIrs.

xla_mat2 = mat2.to(xm.xla_device())

cpu_res = fn_fallback(M, mat1, mat2)
xla_res = dynamo_fn(xla_M, xla_mat1, xla_mat2)
Collaborator:

Can we check the CompileTime and ExecuteTime counters here?

Collaborator Author:

Handling this with the comment below; working on adding metric checks for all our fallback unit tests.

cpu_res = fn_fallback(M, mat1, mat2, 0.5)
xla_res = dynamo_fn(M, mat1, mat2, 0.5)

self.assertTrue(torch.allclose(cpu_res, xla_res.cpu()))
Collaborator:

Ditto, add counter checks for compilation and execution. I think we also want to check that no aten:: counter is accumulated. Let me keep reading and figure out how the fallback op is being executed.

@JackCaoG (Collaborator), May 18, 2023:

After reading through the PR, it is still unclear to me how the fallback op is being handled. Will it be executed on CPU on the PyTorch end, or will it be executed lazily and go through our fallback-op handling logic? I hope it is the former.

Collaborator:

This is a good point. I think currently it is the latter. We do not move the tensors back and forth between the CPU and XLA devices in our fallback logic, so we still get aten:: counters when executing the partitioned graph. Or do you think we should move the tensors explicitly instead of letting the lazy execution do it?

cc @wonjoolee95

Collaborator:

To verify, we do see the xla::_to_cpu counter.
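
For example, a check along these lines (a sketch using the metrics helpers already used in this thread) would confirm the fallback path is exercised:

import torch_xla.debug.metrics as met

# The lazy CPU fallback copies tensors to the CPU, so both counters should
# be present after running a graph that contains a fallback op.
assert 'xla::_to_cpu' in met.counter_names()
assert 'aten::median' in met.counter_names()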

Collaborator:

I think that also explains why the CompileTime metric increases by one when seeing a fallback op.

@seanlatias (Collaborator), May 18, 2023:

To clarify:

  • CompileTime increases by one per unsupported op when testing whether an op goes through the CPU fallback or not
  • ExecuteTime increases
    • Whenever an unsupported op is executed through lazy CPU fallback (e.g., in FallBackNodeCollector and InputCollector)
    • Whenever a compiled subgraph is executed

But with these, the final metric numbers still do not match. I'm still looking into the problem.

@wonjoolee95 (Collaborator Author), May 18, 2023:

Yeah, we should see the aten:: counters when we execute the unsupported ops.

@seanlatias, which final metric numbers do not match, the metric numbers in the fallback unit tests?

Collaborator:

For test_operator_fallback, I see CompileTime is 2, which makes sense. However, ExecuteTime is 5, which I still can't explain.

Collaborator Author:

I was playing a bit with the unit test and simplifying it. For what it's worth, with a simpler unit test like the one below:

  def test_operator_fallback(self):

    def fn_fallback(t):
      # As of 05/18/2023, torch.median is not lowered by PyTorch/XLA
      return torch.median(t)

    torch._dynamo.reset()
    met.clear_counters()
    device = xm.xla_device()

    dynamo_fn = torch.compile(fn_fallback, backend="torchxla_trace_once")
    t = torch.randn(5)
    t_xla = t.to(device)

    cpu_res = fn_fallback(t)
    xla_res = dynamo_fn(t_xla)

    print('CompileTime:', met.metric_data('CompileTime')[0])
    print('ExecuteTime:', met.metric_data('ExecuteTime')[0])

    self.assertTrue(torch.allclose(cpu_res, xla_res.cpu()))

I was able to see:

CompileTime: 2
ExecuteTime: 2

Let me look into the existing test_operator_fallback with the cummin op.

for xla_arg, cloned_xla_arg in zip(xla_args, cloned_xla_args):
  if isinstance(xla_arg, torch.Tensor):
    xla_arg.copy_(cloned_xla_arg)

Collaborator:

I think if you call ClearPendingIr here, you will avoid the unnecessary CompileTime and ExecuteTime increases.

Collaborator Author:

Updated the code to call ClearPendingIr here, and also moved the xm.mark_step to the beginning of this function. Now the metrics in the unit tests are left unchanged, as expected.
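
A minimal sketch of that ordering (the function and get_fallback_ops are hypothetical; only xm.mark_step() and torch_xla._XLAC._clear_pending_irs are taken from this thread):

import torch_xla
import torch_xla.core.xla_model as xm

def extract_fallback_ops(xla_model, xla_args):
  # Materialize any pending input computations first.
  xm.mark_step()
  fallback_ops = get_fallback_ops(xla_model, xla_args)  # hypothetical probe
  # Drop the IR accumulated by the probe so it doesn't trigger an extra
  # compilation/execution later.
  torch_xla._XLAC._clear_pending_irs(str(xm.xla_device()))
  return fallback_ops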

@JackCaoG (Collaborator) left a comment:

Mostly LGTM. I think if we call ClearPendingIr in the right place, it should not regress test_dynamo.py.

@wonjoolee95 (Collaborator Author)

@JackCaoG, could you take a look at this again? I addressed the comments from the last review: added ClearPendingIrs to fix the test_dynamo.py regressions and added asserts/metric checks to the DynamoCpuFallbackTest unit tests. These DynamoCpuFallbackTest tests should be enough to cover the correctness of the fallback mechanism, although I still do not completely understand the reasoning behind the excessively increased Execute counters. I'll look into that, but meanwhile this PR should be reviewable.

Comment on lines +142 to +143
self.assertEqual(met.metric_data('CompileTime')[0], 3)
self.assertEqual(met.metric_data('ExecuteTime')[0], 3)
Collaborator:

Hmm, why is it 3 here instead of 4? Wouldn't dynamo_fn fall back and run 2 executions?

Collaborator:

Oh OK, I think I understand why t_xla * 3 would trigger a separate compilation: t_xla * 3 is actually a pending execution, and we call mark_step to materialize the input. If that's the case, I don't understand why ExecuteTime is 3 then; it should be 5, I guess?

Collaborator Author:

So here's what the IR dump looks like when I run DynamoCpuFallbackTest.test_operator_fallback locally: https://gist.github.com/wonjoolee95/1426859ef0a9203dca71ad455e4badc8. This is also what I'm trying to figure out 😢

Another odd thing is that the fallback tests pass when I run them individually, as in python test/dynamo/test_dynamo.py DynamoCpuFallbackTest.test_operator_fallback and python test/dynamo/test_dynamo.py DynamoCpuFallbackTest.test_fallback_multiple_submodules. However, when I run them in a single run via python test/dynamo/test_dynamo.py, they fail on the metric assertions (the same failure as in the CI). This makes me think there are possibly some pending IRs somewhere, but I tried manually invoking torch_xla._XLAC._clear_pending_irs(str(xm.xla_device())) at the end of each unit test and I'm still seeing the error.

Collaborator:

@wonjoolee95 I think we need to reset the metrics for each test. Similar to here:

met.clear_all()
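
For instance (a sketch, assuming the unittest-style tests used in this file), the reset could live in setUp:

import unittest
import torch_xla.debug.metrics as met

class DynamoCpuFallbackTest(unittest.TestCase):

  def setUp(self):
    # Reset counters and metrics so tests don't observe each other's
    # CompileTime/ExecuteTime values.
    met.clear_all()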

Collaborator:

Hmm... but I also hit a similar problem when adding clear_all(). The behavior of a single test and a set of tests is different.

Collaborator:

I am still a bit confused about:

> on the 2nd tracing, we can see that both the CompileTime and ExecuteTime remain the same as the 1st tracing because the graph with the fallback op is already captured

ExecuteTime should increase whenever a Dynamo execution happens. It should increase both during the first Dynamo run and during subsequent executions.

Collaborator:

  1. Let's call met.clear_all() instead of met.clear_counters(); clear_counters won't reset the metrics.

I would expect that every call to dynamo_fn will trigger at least 2 executions, since there is a fallback op in the middle and XLA needs to execute the graph before and after the fallback op.

Collaborator Author:

I'm also trying to completely understand why these metrics have these numbers, but here is what I understand so far; please let me know if anything doesn't sound right.

> ExecuteTime should increase whenever a Dynamo execution happens. It should increase both during the first Dynamo run and during subsequent executions.

ExecuteTime increases only when there is an XLA execution. However, in this example there is only a single aten::median op, and it is executed on the CPU, so ExecuteTime doesn't increase on the 2nd tracing.

> I would expect that every call to dynamo_fn will trigger at least 2 executions, since there is a fallback op in the middle and XLA needs to execute the graph before and after the fallback op.

On the 2nd and 3rd tracing, torchxla.py already has a compiled_graph from the 1st tracing that looks like:

class GraphModule(torch.nn.Module):
    def forward(self, L_t_ : torch.Tensor):
        l_t_ = L_t_
        
        # File: wonjoo_2.py:28, code: return torch.median(t)
        median = torch.median(l_t_);  l_t_ = None
        return (median,)

Now, since the only op in this graph is median, which is executed on the CPU, executing this graph with compiled_graph(*args) does not increase ExecuteTime. In the 3rd tracing with t_xla * 3, the only graph that XLA needs to execute before the aten::median op is t_xla * 3, hence ExecuteTime increases only by 1.

This is what I think is happening... let me know if that makes sense, @JackCaoG. And @seanlatias, also let me know if this aligns with your understanding, just to make sure we're on the same page.

Collaborator:

Mostly agree with @wonjoolee95. The following is my explanation for each trace.

  • The first trace includes two parts: checking which ops need the CPU fallback, and running the fallback op with the compiled graph. Both parts involve compiling and executing torch.median(). Thus, the initial CompileTime and ExecuteTime are both 2.
  • The second trace does not introduce any changes: same input and same module. Thus, nothing needs to be compiled or executed again; this is handled by the cached computation in XLA.
  • The third trace does not change the input module, so the module does not need to be recompiled. However, it changes the input for the fallback op, so the fallback op needs to be recompiled and re-executed (remember that the fallback op goes through lazy tensor execution instead of being executed by PyTorch directly). That's why we see an increase in both CompileTime and ExecuteTime. A rough sketch of the three calls follows below.
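
For concreteness, a rough sketch of the call pattern being discussed, reconstructed from this thread (not the exact test body; t_xla and dynamo_fn are as in the simplified test above):

# 1st call: compiles both the fallback-op probe and the surrounding XLA graph,
# and runs torch.median on CPU -> CompileTime and ExecuteTime each reach 2.
xla_res = dynamo_fn(t_xla)
# 2nd call: same input, same module; cached computations are reused, so the
# metrics stay unchanged.
xla_res = dynamo_fn(t_xla)
# 3rd call: the input expression changes, so the lazy graph feeding the
# fallback op is recompiled and re-executed -> both metrics increase by 1.
xla_res = dynamo_fn(t_xla * 3)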

Collaborator:

Makes sense, thanks.

@JackCaoG (Collaborator) left a comment:

Mostly LGTM, I have one question regarding the test.


@wonjoolee95 (Collaborator Author)

@JackCaoG, updated the comments above, should be ready for one more review.

@JackCaoG (Collaborator) left a comment:

Thanks a lot @wonjoolee95 and @seanlatias. This is great work!
