
Enable persistent compilation caching #5804

Merged 5 commits into master on Dec 7, 2023

Conversation
@jonb377 (Collaborator) commented Nov 15, 2023

This change enables persistent caching by combining #5800 and #5803. It uses the serialization functionality in PjRtLoadedExecutable to convert the executables to/from strings, which are written to disk by the persistent cache.

The persistent cache is enabled by setting the environment variable XLA_PERSISTENT_CACHE_PATH to the desired compilation cache path. An additional environment variable, XLA_PERSISTENT_CACHE_READ_ONLY, can be used to control whether the cache is read-only, which is useful when the cache is shared across workers in an SPMD setting.

Note that the persistent cache does not perform any eviction, so it is currently up to the user to clean the cache.
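As a minimal sketch, enabling the cache from Python might look like this (the cache directory is illustrative, and the variables must be set before the XLA runtime initializes):

```python
import os

# Illustrative cache directory; any writable path works.
os.environ['XLA_PERSISTENT_CACHE_PATH'] = '/tmp/xla_compile_cache'

# Optional: mark this worker read-only so that a single writer populates
# a shared cache in an SPMD setting.
os.environ['XLA_PERSISTENT_CACHE_READ_ONLY'] = '1'

# import torch_xla  # must happen after the variables above are set
```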

@jonb377 jonb377 force-pushed the jonbolin/use-persistent-cache branch from 0307d7a to 01fabcd Compare November 15, 2023 06:21
@jonb377 (Collaborator, Author) commented Nov 15, 2023

Ah bummer, it's not running CI since I'm targeting a different branch... I was hoping to see results for GPU. I guess we'll need to wait until the others land so I can target master.

@jonb377 jonb377 mentioned this pull request Nov 15, 2023
@jonb377 jonb377 self-assigned this Nov 15, 2023


@unittest.skipUnless(xr.device_type() in {'TPU', 'GPU'},
'Device type does not support persistent caching')
Collaborator

why not cpu?

Collaborator Author

I tested it, but CPU isn't supported (deserialization fails). JAX has a similar restriction: https://github.com/google/jax/blob/234be736c4cdd8da4197078278d35a6a1cde3767/tests/compilation_cache_test.py#L69C41-L69C41

Comment on lines 165 to 169

using ComputationCache =
    runtime::util::AbstractCache<torch::lazy::hash_t, CachedComputation,
                                 torch::lazy::HashReducer>;
using MemoryCache =
    runtime::util::Cache<torch::lazy::hash_t, CachedComputation,
                         torch::lazy::HashReducer>;
Collaborator

I'm confused about the difference between ComputationCache and MemoryCache.

Collaborator Author

ComputationCache is the umbrella return type; MemoryCache and PersistentCache are subtypes of it.
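An illustrative Python sketch of the relationship described above (the real types are C++ templates split out in #5800; these class names and methods are only stand-ins for the actual interface):

```python
from abc import ABC, abstractmethod

class AbstractCache(ABC):
    """Plays ComputationCache's role: the umbrella interface callers see."""

    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def add(self, key, value): ...

class MemoryCache(AbstractCache):
    """In-memory subtype, like the pre-existing Cache."""

    def __init__(self):
        self._entries = {}

    def get(self, key):
        return self._entries.get(key)

    def add(self, key, value):
        self._entries[key] = value

class PersistentCache(MemoryCache):
    """Subtype that would additionally serialize entries to disk (elided here)."""
```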

Collaborator Author

The types were split out in #5800

@jonb377 jonb377 force-pushed the jonbolin/use-persistent-cache branch 2 times, most recently from f124553 to e3ae16c Compare November 16, 2023 03:02
@JackCaoG (Collaborator) commented Dec 1, 2023

@jonb377 do you still want to merge this one?

@jonb377 (Collaborator, Author) commented Dec 1, 2023

> @jonb377 do you still want to merge this one?

@JackCaoG Yes, just pending reviews. I'll ping some folks.

Contributor

@yeounoh yeounoh left a comment

LGTM

@jonb377 jonb377 force-pushed the jonbolin/comp-env-hash branch from 52781bc to eeee59e Compare December 5, 2023 20:54
@jonb377 jonb377 force-pushed the jonbolin/use-persistent-cache branch from e3ae16c to 3686f29 Compare December 5, 2023 20:54
@jonb377 jonb377 force-pushed the jonbolin/comp-env-hash branch from eeee59e to d2190e4 Compare December 5, 2023 20:56
@jonb377 jonb377 force-pushed the jonbolin/use-persistent-cache branch from 3686f29 to ab037fc Compare December 5, 2023 20:56
@jonb377 jonb377 force-pushed the jonbolin/comp-env-hash branch from d2190e4 to aee73ba Compare December 6, 2023 01:37
@jonb377 jonb377 force-pushed the jonbolin/use-persistent-cache branch from ab037fc to f009baa Compare December 6, 2023 01:40
Base automatically changed from jonbolin/comp-env-hash to master December 6, 2023 16:47
@jonb377 jonb377 force-pushed the jonbolin/use-persistent-cache branch from f009baa to 6041867 Compare December 6, 2023 16:49
Collaborator

@will-cromar will-cromar left a comment

LGTM. Thanks!

Comment on lines +58 to +59
def _single_device_test(metrics):
t = torch.randn(16)
Collaborator

small suggestion: you can move these test cases into your test class by making them @staticmethods

Collaborator Author

I tried that initially, but for the ProcessPoolExecutor it complained that they weren't pickleable 🥲

Collaborator

Hmm, are you sure you didn't use @classmethod? We have other tests that use @staticmethod to create pickleable test cases

Collaborator Author

Just retried the single device test:

class PersistentCacheTest(parameterized.TestCase):
  """
  Test suite to verify compilation cache across processes. Tests will run
  multiple Python subprocesses which use the XLA runtime to populate the cache
  and perform assertions on the metrics generated.
  """

  @staticmethod
  def _single_device_test(metrics):
    t = torch.randn(16)
    xt = t.to(xm.xla_device())
    _assert_correctness_and_metrics(t, xt, metrics)

It hit this:

Traceback (most recent call last):
  File "/home/ptxla/.local/lib/python3.8/site-packages/absl/testing/parameterized.py", line 321, in bound_param_test
    return test_method(self, *testcase_params)
  File "test_persistent_cache.py", line 23, in run
    f(*args, **kwargs)
  File "test_persistent_cache.py", line 103, in test_persistent_cache
    launch_method(test_fn, ({
  File "test_persistent_cache.py", line 31, in _test_spawn
    pool.submit(fn, *args).result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/local/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle 'staticmethod' object

The _mp_test also hit this when using staticmethod... I wonder what's going wrong lol

@jonb377 (Collaborator, Author) commented Dec 6, 2023

MP GPU looks good, but single-device seems to be broken by the same issue addressed in #6023 (cc @vanbasten23). I've disabled single-device and SPMD GPU tests.

@jonb377 jonb377 merged commit fae0166 into master Dec 7, 2023
@jonb377 jonb377 deleted the jonbolin/use-persistent-cache branch December 7, 2023 00:11
jonb377 added a commit that referenced this pull request Dec 7, 2023
jonb377 added a commit that referenced this pull request Dec 8, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
4 participants