
Install torch wheel from pytorch to unblock TPU CI #6691

Merged 7 commits into master on Mar 11, 2024

Conversation

@vanbasten23 (Collaborator) commented Mar 7, 2024

#6681 helped uncover the silent failures in the TPU Integration Test / tpu-test (push) workflow. The failure is:

+ python3 test/test_operations.py -v
Traceback (most recent call last):
  File "test/test_operations.py", line 47, in <module>
    import torchvision
  File "/usr/local/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/usr/local/lib/python3.8/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/home/runner/.local/lib/python3.8/site-packages/torch/library.py", line [46](https://github.com/pytorch/xla/actions/runs/8180398097/job/22368345687#step:5:47)7, in inner
    handle = entry.abstract_impl.register(func_to_register, source)
  File "/home/runner/.local/lib/python3.8/site-packages/torch/_library/abstract_impl.py", line 30, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist

The failure occurs because the official torchvision wheel is not compatible with the torch wheel built by us.
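
To make this concrete, one way to avoid the mismatch is to install torch and torchvision as a matching pair from the same PyTorch nightly index, rather than mixing our own torch build with the official torchvision wheel. A minimal sketch (the exact index URL and the choice of the CPU variant are assumptions for illustration, not necessarily what the workflow ends up doing):

# Hypothetical sketch: pull torch and torchvision from the same nightly index
# so the prebuilt torchvision ops link against a compatible torch.
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu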

@vanbasten23 (Collaborator, Author)

Is there a way to test the change before it is merged?

@mbzomowski (Collaborator)

Actions > TPU Integration Test > Run workflow > then select your branch, and it'll start the run from your edited workflow
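
The same run can also be kicked off from the GitHub CLI, assuming the workflow has a workflow_dispatch trigger (which the "Run workflow" button implies). The workflow and branch names below are placeholders:

# Hypothetical sketch: trigger the workflow on a branch without merging.
gh workflow run "TPU Integration Test" --ref my-feature-branch --repo pytorch/xla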

FYI I opened a PR for this: #6690, which failed: https://github.com/pytorch/xla/actions/runs/8193035259

@vanbasten23 changed the title from "Reinstall torch wheel to unblock TPU CI" to "Build torchvision from source to unblock TPU CI" on Mar 7, 2024
@vanbasten23 (Collaborator, Author)

Building torchvision from source is working: https://github.com/pytorch/xla/actions/runs/8195031279. We no longer hit the nms error, and we don't have to rely on nightly torch wheels, which would break every time we make a companion change in pytorch. cc @will-cromar @mbzomowski
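
For reference, a minimal sketch of what building torchvision from source against the locally installed torch might look like (the --no-build-isolation flag is an assumption here, used so the build compiles against our torch instead of pulling one from PyPI):

# Hypothetical sketch: compile torchvision's C++ ops against the torch wheel we built.
pip install --no-build-isolation git+https://github.com/pytorch/vision.git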

@vanbasten23 (Collaborator, Author)

Per offline discussion, it's less ideal to compile torchvision from source, and it's better to stick with the original plan.

@vanbasten23 changed the title from "Build torchvision from source to unblock TPU CI" to "Install torch wheel from pytorch to unblock TPU CI" on Mar 8, 2024
@vanbasten23 (Collaborator, Author)

Current failure:

+ python3 test/test_operations.py -v
2024-03-08T19:03:57.5462724Z Traceback (most recent call last):
2024-03-08T19:03:57.5463798Z   File "test/test_operations.py", line 31, in <module>
2024-03-08T19:03:57.5464676Z     import torch_xla
2024-03-08T19:03:57.5467031Z   File "/home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/torch_xla/__init__.py", line 7, in <module>
2024-03-08T19:03:57.5469219Z     import _XLAC
2024-03-08T19:03:57.5479640Z ImportError: /home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch4lazy13MetricFnValueB5cxx11Ed
2024-03-08T19:03:57.8749327Z ##[error]Process completed with exit code 1.

@vanbasten23 (Collaborator, Author) commented Mar 8, 2024

torch has the C++11 ABI disabled. I think we enabled it when building torch_xla.
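
A quick way to check for this kind of mismatch (a sketch using the standard torch helper; torch_xla's own setting has to be inferred from its build flags) is to print whether the installed torch was compiled with the C++11 ABI:

# Hypothetical check: the _XLAC extension can only load if torch_xla was built
# with the same C++11 ABI setting as the torch wheel it imports.
python -c "import torch; print(torch.compiled_with_cxx11_abi())"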

@vanbasten23 (Collaborator, Author)

OK, some progress: we can import torch_xla now. However, it fails with:

======================================================================
FAIL: test_resnet18 (__main__.DynamoTrainingOptimizerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/dynamo/test_dynamo.py", line 518, in test_resnet18
    self.assertEqual(met.metric_data('CompileTime')[0], 6)
AssertionError: 7 != 6

after I rebased onto origin/master. This is because the companion upstream PR pytorch/pytorch#115621 was just reverted. All the other tests should be good.

@will-cromar (Collaborator) left a comment

This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.

I'm interested to see if #6704 works as a more stable solution.

@vanbasten23 (Collaborator, Author)

This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.

Why CPU wheels? The old TPU CI uses the CUDA wheel, and our GitHub index page also suggests installing the CUDA torch wheel. So shouldn't our CI be consistent?

@will-cromar (Collaborator)

The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.
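
For illustration, the difference is just which nightly index the wheel comes from; the CPU index serves a much smaller wheel than the CUDA one (the cu121 URL is an example, assuming that is the CUDA version in use):

# CPU-only nightly wheel (smaller, faster to download in CI):
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
# CUDA nightly wheel, e.g. for CUDA 12.1:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121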

@vanbasten23 (Collaborator, Author)

The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.

If our GitHub index page suggests that our users use the CUDA torch wheel, should we do the same? Or does it not matter much?

@will-cromar (Collaborator)

If our GitHub index page suggests that our users use the CUDA torch wheel, should we do the same? Or does it not matter much?

These are just nightly builds, so in my mind it doesn't matter.

@vanbasten23 (Collaborator, Author)

Pending a new TPU CI run. Stay tuned.

@mbzomowski (Collaborator)

Pending a new TPU CI run. Stay tuned.

https://github.com/pytorch/xla/actions/runs/8240857381

Looks good

@vanbasten23 (Collaborator, Author)

The new TPU CI passed. Thanks for the review.

@vanbasten23 merged commit 6630287 into master on Mar 11, 2024
3 of 4 checks passed