
[Pallas] Introduce jax_import_guard #6747

Merged
merged 7 commits into from
Mar 20, 2024

Conversation

alanwaketan
Collaborator

Summary:
Importing JAX will lock the TPU devices and prevent any of PyTorch/XLA's TPU computations. To address this, we need to acquire the TPU first. Somehow, calling xm.xla_device() is enough to acquire the TPU device.

Test Plan:
python test/test_pallas.py

@alanwaketan
Collaborator Author

@will-cromar Do you have better ideas?


def jax_import_guard():
  # Somehow, this could grab the TPU before JAX locks it. Otherwise, any
  # PyTorch/XLA TPU operations will hang.
  xm.xla_device()
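To illustrate the ordering contract behind the guard, here is a minimal, TPU-free sketch; `acquire_device` and the recorded event strings are stand-ins for `xm.xla_device()` and the real imports, so this runs anywhere:

```python
# Sketch of the ordering jax_import_guard enforces: PyTorch/XLA must
# touch the device before the first `import jax`, otherwise JAX locks
# the TPU and PyTorch/XLA computations hang.
events = []

def jax_import_guard_sketch(acquire_device):
    # Grab the accelerator for PyTorch/XLA before JAX gets a chance to.
    acquire_device()

jax_import_guard_sketch(lambda: events.append("torch_xla owns device"))
events.append("import jax")  # safe only after the guard has run
```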
Collaborator

torch_xla._XLAC._init_computation_client()

Collaborator Author

When will this be called in general?

Collaborator

We don't usually call this API directly, I believe; runtime init usually happens when we try to get the device. However, if all you need is to init the runtime, this API is cleaner. I used it in https://github.com/pytorch/pytorch/blob/a04e7fca8eddde492b239da6ac23d6a056666a0e/benchmarks/dynamo/common.py#L91 as well.

Collaborator

Would this API work correctly in a multi-pod environment? @JackCaoG

Collaborator

Hmm, it should. We just need to make sure that PyTorch/XLA inits the runtime first and grabs the libtpu; JAX initing the runtime after PyTorch doesn't cause any issue on a single pod, at least for us. We don't use JAX to execute any device program, so it is OK.

In multi-pod, it sounds like initing the runtime twice will cause issues? I never looked into that too much.

Collaborator

@jonb377 any thoughts on why this issue may happen on multi-pod?

@JackCaoG
Collaborator

This is funny: @will-cromar and I were discussing how to handle the TPU ownership conflict between PyTorch/XLA and JAX just this morning.

Collaborator

@will-cromar left a comment

Will this break multiprocess? If you import this, then the current process will init the runtime, and you would not be able to call xmp.spawn after that point.

@alanwaketan
Collaborator Author

alanwaketan commented Mar 14, 2024

Will this break multiprocess? If you import this, then the current process will init the runtime, and you would not be able to call xmp.spawn after that point.

What do you mean? xmp.spawn is supposed to be called before any torch-xla code, right? I hope they don't import pallas in the launcher...

@will-cromar
Collaborator

What do you mean? xmp.spawn is supposed to be called before any torch-xla code, right? I hope they don't import pallas in the launcher...

This would be a problem if custom_kernel is imported at the global scope at all, e.g.

# Inits TPU
from torch_xla.experimental import custom_kernel

def main():
    # uses `custom_kernel`
    ...

if __name__ == "__main__":
    # Would fail because the TPU is already initialized
    xmp.spawn(main)

IMO it would be safer to import jax inside make_kernel_from_pallas where it is used.
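A minimal sketch of that suggestion (hypothetical names; `colorsys`, a rarely preloaded stdlib module, stands in for `jax` so the sketch runs without JAX or a TPU):

```python
def make_kernel_from_pallas_sketch(kernel_fn):
    # Deferring the heavy import to call time means that merely importing
    # this module has no side effects: no TPU is grabbed, so xmp.spawn can
    # still fork cleanly afterwards. In custom_kernel.py the deferred
    # module would be `jax`; colorsys is only a stand-in here.
    import colorsys  # noqa: F401  (real code: `import jax`)
    return kernel_fn.__name__
```

With the import moved inside the function, `from torch_xla.experimental import custom_kernel` at module scope no longer initializes the runtime, which avoids the failing `xmp.spawn` example above.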

@alanwaketan
Collaborator Author

@will-cromar That works for me as well. Let me update it.

@JackCaoG
Collaborator

Hmm, the test failed with

Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_pallas.py", line 10, in <module>
    from torch_xla.experimental.custom_kernel import jax_import_guard
  File "/opt/conda/lib/python3.8/site-packages/torch_xla-2.3.0+git22e6548-py3.8-linux-x86_64.egg/torch_xla/experimental/custom_kernel.py", line 18, in <module>
    import jax
ModuleNotFoundError: No module named 'jax'

@alanwaketan
Collaborator Author

hmm test failed with

Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_pallas.py", line 10, in <module>
    from torch_xla.experimental.custom_kernel import jax_import_guard
  File "/opt/conda/lib/python3.8/site-packages/torch_xla-2.3.0+git22e6548-py3.8-linux-x86_64.egg/torch_xla/experimental/custom_kernel.py", line 18, in <module>
    import jax
ModuleNotFoundError: No module named 'jax'

Yeah, the CPU and GPU CI are not configured with JAX installed. Will fix it now.
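One common way to keep an import like this from crashing CI machines without JAX is an optional-import guard; this is a sketch of that pattern, not necessarily what the PR ended up doing, and `no_such_module_xyz` is a deliberately fake name:

```python
import importlib

def optional_import(name):
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# On CPU/GPU CI without JAX this yields None instead of raising
# ModuleNotFoundError at import time, so callers (e.g. test_pallas.py)
# can skip the Pallas tests rather than fail to import.
jax = optional_import("jax")
```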

@alanwaketan
Collaborator Author

@JackCaoG @will-cromar I think this is ready for the new round of reviews.

@alanwaketan alanwaketan merged commit 7cf9f10 into master Mar 20, 2024
18 checks passed