Bring back PyTorch/XLA GPU tests/builds #8577

tengyifei · 2025-01-15T01:45:55Z

🐛 Bug

PyTorch/XLA GPU builds have been failing since Oct 21, 2024.

To bring back GPU builds and tests, the first challenge is to build PyTorch/XLA with clang and hermetic CUDA. After that, there may be more tests to fix.

This bug tracks work/discussions needed to bring back GPU builds.

tengyifei · 2025-01-21T18:15:01Z

Chengji has a WIP branch for hermetic CUDA: https://github.com/yaochengji/xla/tree/chengji/clang-herm

tengyifei · 2025-01-21T18:15:07Z

cc @ysiraichi

miladm · 2025-01-21T18:46:15Z

cc @tengyifei to work with @ysiraichi. Ideally, we would like to use this bug to bring back the GPU wheel for the 2.6 release.

ysiraichi · 2025-01-27T22:42:57Z

Update: I have tried Chengji's branch, but the build kept failing with:

failed: undeclared inclusion(s) in rule '@zlib//:zlib':
this rule is missing dependency declarations for the following files included by 'zutil.c':
  '/usr/lib/clang/17/include/stddef.h'
  '/usr/lib/clang/17/include/__stddef_max_align_t.h'
  '/usr/lib/clang/17/include/limits.h'
  '/usr/lib/clang/17/include/stdarg.h'

Still investigating it.

ysiraichi · 2025-01-28T23:09:45Z

Here's a more verbose update on transitioning PyTorch/XLA to use OpenXLA hermetic CUDA. In summary, these are the things I have added to the build system (branch diff):

Followed the documentation, and added the necessary code to WORKSPACE and .bazelrc
Added a few more things that I could find in the JAX repo
Explicitly set clang-17 as CC and CXX

Even after all these steps, I am still hitting the error above.

Reproducing the Error

Get my hermetic-cuda branch
Go to the following directory: plugins/cuda
Run the following command:

$ bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so --symlink_prefix=$(pwd)/bazel- --config=cuda

My Thoughts

As far as I understand it, we shouldn't need to specify those dependencies, since they are system dependencies (from clang itself).
These errors are occurring outside PyTorch/XLA (I believe they are OpenXLA dependencies).
I believe I am building an OpenXLA target similarly to how JAX builds it. I just don't know what I'm doing differently.
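One way to narrow this down is to ask clang where its builtin headers live. The sketch below uses clang's `-print-resource-dir` flag; the helper name `clang_builtin_include_dir` is hypothetical, and the idea that different spellings of the compiler path report different directories is an assumption to test, not something confirmed in this thread.

```shell
# Print the builtin header directory that the given clang binary reports.
# "$1" is the compiler to query, e.g. clang-17 or an absolute path to clang.
# Returns non-zero if the compiler cannot be run.
clang_builtin_include_dir() {
  resource_dir="$("$1" -print-resource-dir)" || return 1
  printf '%s/include\n' "$resource_dir"
}

# Hypothetical comparison (both paths are assumptions about the machine):
#   clang_builtin_include_dir clang-17
#   clang_builtin_include_dir /usr/lib/llvm-17/bin/clang
```

If the two invocations print different directories, that would be a hint that the path Bazel records for the toolchain and the path the headers are actually read from do not match.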

Question

How can I fix this error?

miladm · 2025-01-31T15:52:51Z

Thanks @ysiraichi - I've pinged OpenXLA partners to share their input on this issue.

beckerhe · 2025-01-31T16:58:24Z

Can you try setting CC and CXX to the full absolute path of clang?

CC=/usr/lib/llvm-17/bin/clang
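A minimal sketch of that suggestion as a pre-build check, assuming a Debian/Ubuntu-style layout where clang 17 lives under /usr/lib/llvm-17/bin. The helper name `check_compiler_path` is hypothetical, and the CXX path is an assumption (only the CC path is given above). The helper only verifies that a compiler path is absolute and executable; it does not run the build.

```shell
# Succeed only if "$1" is an absolute path to an executable file.
check_compiler_path() {
  case "$1" in
    /*) ;;  # absolute path, keep going
    *) echo "compiler path is not absolute: $1" >&2; return 1 ;;
  esac
  if [ ! -x "$1" ]; then
    echo "compiler not found or not executable: $1" >&2
    return 1
  fi
}

# Hypothetical usage from plugins/cuda (the CXX value is an assumption):
#   CC=/usr/lib/llvm-17/bin/clang
#   CXX=/usr/lib/llvm-17/bin/clang++
#   check_compiler_path "$CC" && check_compiler_path "$CXX" || exit 1
#   export CC CXX
#   bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so \
#     --symlink_prefix="$(pwd)/bazel-" --config=cuda
```

Failing fast here avoids waiting for Bazel to reach @zlib before finding out that CC was still a bare name such as clang-17.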

ybaturina · 2025-01-31T19:44:17Z

The fix openxla/xla#22165 is merged. Can you try building PyTorch/XLA again, please? (Without changing the clang symlink to an absolute path.)

ysiraichi · 2025-02-03T18:43:22Z

Thanks, @ybaturina. Using the absolute path did work! I still haven't tested your patch.

ysiraichi · 2025-02-03T18:47:09Z

@tengyifei @will-cromar What's the recommended process for installing something (e.g. clang-17) inside the current dev VM that runs on CI? I saw that the install_deps role is being skipped by default.

tengyifei · 2025-02-03T23:11:21Z

@ysiraichi I believe you need to install clang-17 into the development image by sending a PR to xla/infra/ansible/development.Dockerfile (line 15 in 3578940):

ENV TAGS="bazel,configure_env,install_deps"

That builds a dev Docker image that will be accessible at https://console.cloud.google.com/artifacts/docker/tpu-pytorch-releases/us-central1/docker/development

ysiraichi linked a pull request Jan 21, 2025 that will close this issue

Fix CUDA plugin CI. #8593

Open

miladm assigned ysiraichi Jan 21, 2025

miladm added the xla:gpu label Jan 21, 2025

copybara-service bot pushed a commit to google/tsl that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

c4e0d8f

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot mentioned this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository. google/tsl#3174

Merged

copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

2a68e15

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot mentioned this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository. openxla/xla#22165

Closed

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

fc41ff2

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot mentioned this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository. tensorflow/tensorflow#86309

Merged

copybara-service bot pushed a commit to google/tsl that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

ec5a37d

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

936712b

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

fcf2e20

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot pushed a commit to google/tsl that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

e6f9aa9

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

8118161

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

301eeb5

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721803568

copybara-service bot pushed a commit to google/tsl that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

2e92560

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721838742

copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

e0f9930

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721838742

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 31, 2025

Set absolute compiler path for hermetic CUDA repository.

64227f1

Addressed the bug pytorch/xla#8577. PiperOrigin-RevId: 721838742

ysiraichi mentioned this issue Feb 4, 2025

Install clang-17 on development Dockerfile. #8673

Merged
