Question regarding performance of jax.lax.scan #16106

tong-xin · 2023-05-24T07:43:35Z

tong-xin
May 24, 2023

Hi all, from the scan document(https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.scan.html), it's expected to have some perf improvements by using scan instead of simple python loop. However interestingly I found the T5X model didn't use scan(https://github.com/google-research/t5x/blob/36e5f02f87669e3c38a9699001a4a154b514a115/t5x/examples/decoder_only/network.py#LL197C9-L197C9), so I did some experiment on scan and the result is surprising. Scan version is slower than simple loop and memory consumption is the same for scan and simple loop. Can some one help to take a look what I'm missing or my understanding is wrong? Is it because the scale of the test is too small, or sharding is not use?

The code I'm using is like following:

# for testing python loop
import jax
from jax import lax
import jax.numpy as jnp

vec=jax.random.uniform(jax.random.PRNGKey(0),(512,1024,1024))
repeat=16

@jax.jit
def mat_loop(tensor):
  x=tensor
  for i in range(repeat):
    x=jnp.sqrt(jnp.matmul(x,x))
  return x

%timeit mat_loop(vec).block_until_ready()

jax.profiler.save_device_memory_profile("/tmp/loop.prof")

# timeit result: 84.9 ms ± 94.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

# for testing lax.scan
import jax
from jax import lax
import jax.numpy as jnp

vec=jax.random.uniform(jax.random.PRNGKey(0),(512,1024,1024))
repeat=16

def mat(tensor, nouse):
  return jnp.sqrt(jnp.matmul(tensor, tensor)),None

%timeit lax.scan(mat, vec, xs=None,length=repeat)[0].block_until_ready()

jax.profiler.save_device_memory_profile("/tmp/scan.prof")


# timeit result: 202 ms ± 529 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Memory profile shows scan and loop used same amount of memory. (Both 2GB if using timeit, 4GB if not using timeit)

mattjj · 2023-05-24T17:20:07Z

mattjj
May 24, 2023
Maintainer

Thanks for the question!

Are you running on GPU?

The reason to use scan is to improve compile time (under a jit), not execution time. On GPU scans can significantly degrade execution performance relative to expressing the same thing with a Python for loop. The reason is that the Python loop gets unrolled into the staged-out and compiled computation, effectively inlining its operations and allowing all XLA optimizations. In contrast, the scan computation gets staged out to a (rolled) loop operation, and loops have high overhead on GPU because each iteration corresponds to a kernel launch. (On CPU there are no kernels being launched, and on TPU the entire program is compiled into one device program, so neither of those backends have the same issue; it's specific to GPU.)

0 replies

tong-xin · 2023-05-25T00:42:20Z

tong-xin
May 25, 2023
Author

Hi Matthew, thanks for your reply!

I'm running the test in a notebook which runs on TPU VM. jax.local_devices() shows following result. Jax version is 0.4.6

jax.local_devices()
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0),
 TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0),
 TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0),
 TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]

2 replies

mattjj May 25, 2023
Maintainer

Interesting! The scan "rolled" loop can still result in lower performance on TPU, basically because the loop prevents some XLA optimizations. But I'm not sure what would explain the performance difference in the code you wrote. Maybe memory layouts? That is, the rolled-loop version may have to keep the input and output of the matmuls in the same layout, whereas the unrolled version wouldn't have that constraint. That's just wild speculation though; a TPU compiler expert would know. Or we could try inspecting the optimized HLO for any clues...

mattjj May 25, 2023
Maintainer

I changed the second block of code to instead time this function:

@jax.jit
def mat_scan(tensor):
  return lax.scan(mat, vec, xs=None,length=repeat)[0]

Then I ran print(mat_loop.lower(vec).compile().as_text()) and print(mat_scan.lower(vec).compile().as_text()). It looks like the former might keep intermediate results in bf16? Probably because it can tell they're being passed into a matmul which itself will cast inputs down to bf16 even when the output results are accumulated in f32. But the loop is forced to keep intermediate results in f32.

yliu120 · 2024-04-06T22:35:12Z

yliu120
Apr 6, 2024

If your loop body jnp.sqrt(jnp.matmul(tensor, tensor)) is basically a matmul + element-wise, a copy needs to be added every iteration because it needs to copy the output buffer of the previous iteration to the input buffer so that the matmul can still read from the same input buffer every iteration.

To think it simple, you at least need two buffers to complete the computation:
(input_buffer, output_buffer)

The size of each will be (512,1024,1024). At each iteration,

The matmul op will read rows and columns of the input_buffer and write to the output_buffer. Matmul is not able to do in-place update to the input buffers as it is a matmul.
The element-wise op sqrt will in-place update the output-buffer.
Then you need to swap the input buffer and output buffer to make output as the input of the next iteration.

Unfortunately, for some reason, 3) was done via a copy which takes as long as the matmul. However, based on the above theory, you can engineer the loop body a bit to manually "swap" the input and output in case that XLA doesn't do it for you:

def _loop(x, _):
      y = jnp.sqrt(jnp.matmul(x, x))
      return jnp.sqrt(jnp.matmul(y, y)), None

@jax.jit
def stacked_unrolled(x):
    x = jnp.astype(x, jnp.bfloat16)
    x = jax.lax.scan(_loop, x, xs=None, length=k / 2)
    return jnp.astype(x[0], jnp.float32)

The first jnp.matmul reads from input_buffer to output_buffer and the second one reads from output_buffer and writes to input_buffer. In that case, each iteration will start out from the same input_buffer address and won't get confused. The way _loop was written helps XLA's buffer assignment work correctly behind the scene. The execution time of this version should be similar to the unrolled version given that the redundant copy is eliminated.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question regarding performance of jax.lax.scan #16106

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Question regarding performance of jax.lax.scan #16106

tong-xin May 24, 2023

Replies: 3 comments · 2 replies

mattjj May 24, 2023 Maintainer

tong-xin May 25, 2023 Author

mattjj May 25, 2023 Maintainer

mattjj May 25, 2023 Maintainer

yliu120 Apr 6, 2024

tong-xin
May 24, 2023

Replies: 3 comments 2 replies

mattjj
May 24, 2023
Maintainer

tong-xin
May 25, 2023
Author

mattjj May 25, 2023
Maintainer

mattjj May 25, 2023
Maintainer

yliu120
Apr 6, 2024