Bring back PyTorch/XLA GPU tests/builds #8577

Open
tengyifei opened this issue Jan 15, 2025 · 11 comments · May be fixed by #8593

@tengyifei
Collaborator

🐛 Bug

PyTorch/XLA GPU builds have been failing since Oct 21, 2024.

To bring back GPU builds and tests, the first challenge is to build PyTorch/XLA with clang and hermetic CUDA. After that, there may be more tests to fix.

This bug tracks work/discussions needed to bring back GPU builds.

@ysiraichi ysiraichi linked a pull request Jan 21, 2025 that will close this issue
@tengyifei
Collaborator Author

Chengji has a WIP branch for hermetic CUDA: https://github.com/yaochengji/xla/tree/chengji/clang-herm

@tengyifei
Collaborator Author

cc @ysiraichi

@miladm
Collaborator

miladm commented Jan 21, 2025

cc @tengyifei to work with @ysiraichi. Ideally, we would like to use this bug to bring back GPU whl for 2.6 release.

@ysiraichi
Collaborator

Update: I have tried Chengji's branch, but the build kept failing with:

failed: undeclared inclusion(s) in rule '@zlib//:zlib':
this rule is missing dependency declarations for the following files included by 'zutil.c':
  '/usr/lib/clang/17/include/stddef.h'
  '/usr/lib/clang/17/include/__stddef_max_align_t.h'
  '/usr/lib/clang/17/include/limits.h'
  '/usr/lib/clang/17/include/stdarg.h'

Still investigating it.

@ysiraichi
Collaborator

ysiraichi commented Jan 28, 2025

Here's a more verbose update on transitioning PyTorch/XLA to use OpenXLA hermetic CUDA. In summary, these are the things I have added to the build system (branch diff):

  • Followed the documentation, and added the necessary code to WORKSPACE and .bazelrc
  • Added a few more things that I could find in the JAX repo
  • Explicitly set clang-17 as CC and CXX

Even after all these steps, I am still hitting the error above.

Reproducing the Error

$ bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so --symlink_prefix=$(pwd)/bazel- --config=cuda

My Thoughts

  • As far as I understand it, we wouldn't need to specify those dependencies, since they are system dependencies (from clang itself)
  • These errors are occurring outside PyTorch/XLA (I believe they are OpenXLA dependencies)
  • I believe I am building an OpenXLA target similarly to how JAX builds it. I just don't know what I'm doing differently...
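One way to check the first point is to ask clang where it keeps its builtin headers and compare that with the paths in the error message. The sketch below uses clang's `-print-resource-dir` flag; the helper name `builtin_include_dir` is made up for illustration, and the compiler name `clang-17` is an assumption about how the binary is installed.

```shell
# Hypothetical helper: print the builtin header directory of a clang binary.
# Returns 1 if the named compiler is not on PATH.
builtin_include_dir() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "not found: $1" >&2
    return 1
  fi
  # -print-resource-dir prints clang's resource directory;
  # the builtin headers (stddef.h, limits.h, ...) live in its include/ subdirectory.
  echo "$("$1" -print-resource-dir)/include"
}

# Assumed compiler name; adjust to match the local install.
builtin_include_dir clang-17 || true
```

If the printed directory differs from the `/usr/lib/clang/17/include` prefix in the error (for example because one of the two is a symlink), that mismatch may be worth mentioning when reporting the failure.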

Question

How to fix this error?

@miladm
Collaborator

miladm commented Jan 31, 2025

Thanks @ysiraichi - I've pinged OpenXLA partners to share their input on this issue.

@beckerhe

> Update: I have tried Chengji's branch, but the build kept failing with:
>
> failed: undeclared inclusion(s) in rule '@zlib//:zlib':
> this rule is missing dependency declarations for the following files included by 'zutil.c':
>   '/usr/lib/clang/17/include/stddef.h'
>   '/usr/lib/clang/17/include/__stddef_max_align_t.h'
>   '/usr/lib/clang/17/include/limits.h'
>   '/usr/lib/clang/17/include/stdarg.h'
>
> Still investigating it.

Can you try setting CC and CXX to the full absolute path of clang?

CC=/usr/lib/llvm-17/bin/clang
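A minimal sketch of that suggestion, for anyone reproducing it: resolve the compilers to absolute paths, export them, then rerun the failing target. The helper `resolve_tool` is hypothetical, the binary names `clang-17` and `clang++-17` are assumptions, and the `clang++` fallback path is assumed by analogy with the `clang` path above rather than taken from this thread.

```shell
# Hypothetical helper: resolve a tool name on PATH to an absolute path
# with symlinks followed. Returns 1 if the tool is not found.
resolve_tool() {
  command -v "$1" >/dev/null 2>&1 || return 1
  readlink -f "$(command -v "$1")"
}

# Fall back to the path suggested above when the tool is not on PATH.
# The clang++ fallback is an assumption, not something stated in this issue.
CC="$(resolve_tool clang-17 || echo /usr/lib/llvm-17/bin/clang)"
CXX="$(resolve_tool clang++-17 || echo /usr/lib/llvm-17/bin/clang++)"
export CC CXX

echo "CC=$CC"
echo "CXX=$CXX"

# Then rerun the failing target from the reproduction step:
# bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so --symlink_prefix=$(pwd)/bazel- --config=cuda
```

The bazel command is left commented out so the snippet only sets up the environment; uncomment it inside the checkout where the original build was run.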

copybara-service bot pushed a commit to google/tsl that referenced this issue Jan 31, 2025
Addressed the bug pytorch/xla#8577.

PiperOrigin-RevId: 721803568
copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 31, 2025
Addressed the bug pytorch/xla#8577.

PiperOrigin-RevId: 721803568
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 31, 2025
Addressed the bug pytorch/xla#8577.

PiperOrigin-RevId: 721803568
copybara-service bot pushed a commit to google/tsl that referenced this issue Jan 31, 2025
Addressed the bug pytorch/xla#8577.

PiperOrigin-RevId: 721838742
copybara-service bot pushed a commit to openxla/xla that referenced this issue Jan 31, 2025
Addressed the bug pytorch/xla#8577.

PiperOrigin-RevId: 721838742
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Jan 31, 2025
Addressed the bug pytorch/xla#8577.

PiperOrigin-RevId: 721838742
@ybaturina

ybaturina commented Jan 31, 2025

The fix, openxla/xla#22165, is merged. Can you please try building PyTorch/XLA again (without changing the Clang symlink to an absolute path)?

@ysiraichi
Collaborator

Thanks, @ybaturina. Using the absolute path did work! I still haven't tested your patch.

@ysiraichi
Collaborator

@tengyifei @will-cromar What's the recommended process for installing something (e.g. clang-17) inside the current dev VM that runs on CI? I saw that the install_deps role is being skipped by default.

@tengyifei
Collaborator Author

@ysiraichi I believe you need to install clang-17 into the development image by sending a PR to the file that contains this line:

ENV TAGS="bazel,configure_env,install_deps"

That builds a dev docker image that will be accessible at https://console.cloud.google.com/artifacts/docker/tpu-pytorch-releases/us-central1/docker/development
