[Pallas] Support segment ids in flash attention #6943

Merged: 11 commits merged into master from alanwaketan/fa_segment_ids on May 1, 2024

Conversation

@alanwaketan (Collaborator) commented Apr 19, 2024

Summary:
This PR adds segment ids to the flash attention wrapper. Segment ids are a way to create an attention mask in which each token can only attend to other tokens within the same segment; the mask is therefore a block-diagonal matrix.

To support this, we further split the flash attention forward pass into tracing and execution parts, and implement all the shape operations needed to make the inputs compatible with the kernel.
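For illustration (this is not the kernel code in this PR), a minimal sketch of how segment ids induce that mask: a query position may only attend to key/value positions carrying the same segment id.

import torch

# Hypothetical segment ids of shape [batch, seq_len]; two segments per sequence.
q_segment_ids = torch.tensor([[0, 0, 1, 1]])
kv_segment_ids = torch.tensor([[0, 0, 1, 1]])

# A token attends to another token only when their segment ids match,
# which yields a block-diagonal attention mask of shape [batch, q_len, kv_len].
mask = q_segment_ids[:, :, None] == kv_segment_ids[:, None, :]
print(mask.int())
# tensor([[[1, 1, 0, 0],
#          [1, 1, 0, 0],
#          [0, 0, 1, 1],
#          [0, 0, 1, 1]]], dtype=torch.int32)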

Test Plan:
PJRT_DEVICE=TPU python test/test_pallas.py

@alanwaketan alanwaketan self-assigned this Apr 19, 2024
@alanwaketan alanwaketan force-pushed the alanwaketan/fa_segment_ids branch from b6a8ed8 to b9cfc67 Compare April 26, 2024 01:52
@JackCaoG (Collaborator): Is this ready for review?

if not save_residuals:
o = o[0]
@JackCaoG (Collaborator): What's this for?

@alanwaketan (Collaborator, Author): _xla_tpu_custom_call always returns an array.
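A minimal, self-contained illustration of that point (the stand-in function below is hypothetical, not the real binding): the custom call hands back a list of outputs even when the kernel produces only the attention output, so the caller unwraps it.

def tpu_custom_call_stub(save_residuals: bool):
    # Stand-in for the TPU custom call, which always returns a list of outputs.
    out = "attention_output"   # placeholder attention result
    residuals = ["l", "m"]     # placeholder softmax statistics
    return [out] + residuals if save_residuals else [out]

save_residuals = False
o = tpu_custom_call_stub(save_residuals)
if not save_residuals:
    o = o[0]  # still a one-element list, so unwrap the single output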

@alanwaketan (Collaborator, Author), replying to "Is this ready for review?": I still need to add SPMD and Dynamo support, so not yet.

@alanwaketan alanwaketan marked this pull request as ready for review April 30, 2024 22:46
@alanwaketan (Collaborator, Author): @JackCaoG Do you think we can do the SPMD and Dynamo parts later, since the customer is not using either of them now?

@JackCaoG (Collaborator): Yeah, don't worry about SPMD and Dynamo for this PR; let's do that in a separate PR.


@unittest.skipIf(xr.device_type() != 'TPU' or tpu.version() < 3,
"This test only works on TPUv3+.")
def test_flash_attention_wrapper_segment_ids_2(self):
@JackCaoG (Collaborator): So you have two tests, one comparing against native torch and one against JAX?

@alanwaketan (Collaborator, Author): Yeah, the JAX test was written before I figured out how to build the non-kernel mask.
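For context, a sketch of what a non-kernel reference can look like in plain PyTorch (the function name and the 1/sqrt(d) scaling are illustrative, not the actual test code): build the segment mask, mask the scores before the softmax, and compare the result against the kernel output within a numerical tolerance.

import math
import torch
import torch.nn.functional as F

def reference_attention(q, k, v, q_segment_ids=None, kv_segment_ids=None):
    # q, k, v: [batch, heads, seq_len, head_dim]; segment ids: [batch, seq_len].
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if q_segment_ids is not None:
        # Broadcast to [batch, 1, q_len, kv_len] so every head shares the mask.
        mask = q_segment_ids[:, None, :, None] == kv_segment_ids[:, None, None, :]
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v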

Comment on lines +695 to +703
torch.manual_seed(42)
q = torch.randn(4, 2, 128, 8, requires_grad=True).to("xla")
k = torch.randn(4, 2, 128, 8, requires_grad=True).to("xla")
v = torch.randn(4, 2, 128, 8, requires_grad=True).to("xla")
q_segment_ids = torch.zeros(4, 128).to("xla")
kv_segment_ids = torch.zeros(4, 128).to("xla")
q.retain_grad()
k.retain_grad()
v.retain_grad()
@JackCaoG (Collaborator): Can we refactor this part out into a helper function in this test?

@JackCaoG (Collaborator): Actually, just refactor this part out and use it in all the tests; it is the same for all of them.

@alanwaketan (Collaborator, Author): You mean the tensor initializations? Those are kind of expected paperwork. I don't think it's necessary to improve...

@JackCaoG (Collaborator): I will leave that to you. When I see two large chunks of code that look similar, I usually try to find out how they differ. It confused me a bit when I realized it is the same code repeated over and over.

@alanwaketan (Collaborator, Author): Yeah, for testing it's sometimes hard to avoid... haha
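For reference, a sketch of the kind of helper being suggested (the name and return convention are hypothetical); it factors the repeated setup out of the individual tests:

import torch

def _make_attention_inputs(seed=42):
    # Repeated test setup from the tests above; requires torch_xla so the "xla" device exists.
    torch.manual_seed(seed)
    q = torch.randn(4, 2, 128, 8, requires_grad=True).to("xla")
    k = torch.randn(4, 2, 128, 8, requires_grad=True).to("xla")
    v = torch.randn(4, 2, 128, 8, requires_grad=True).to("xla")
    q_segment_ids = torch.zeros(4, 128).to("xla")
    kv_segment_ids = torch.zeros(4, 128).to("xla")
    for t in (q, k, v):
        t.retain_grad()
    return q, k, v, q_segment_ids, kv_segment_ids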

@@ -357,18 +418,22 @@ def backward(ctx, grad_output):
grad_v = xs.disable_manual_sharding(
    grad_v, partition_spec, full_shape, mesh=mesh).global_tensor

- return grad_q, grad_k, grad_v, None, None, None
+ return grad_q, grad_k, grad_v, None, None, None, None, None
@JackCaoG (Collaborator): Why do we need to return these Nones?

@alanwaketan (Collaborator, Author): It's a rule of autograd.Function: every input passed to forward needs a corresponding gradient returned from backward. For inputs that we don't differentiate with respect to, we return None.
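A minimal, self-contained illustration of that rule (unrelated to the kernel itself): backward returns exactly one value per forward input, with None for inputs that are not differentiated.

import torch

class ScaleByConstant(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, scale, some_flag):
        ctx.scale = scale
        return x * scale

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input: a gradient for x,
        # None for the non-differentiated `scale` and `some_flag`.
        return grad_output * ctx.scale, None, None

x = torch.randn(3, requires_grad=True)
y = ScaleByConstant.apply(x, 2.0, True)
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])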

@JackCaoG JackCaoG added the tpuci label Apr 30, 2024
@alanwaketan (Collaborator, Author): Thanks, Jack.

@alanwaketan alanwaketan merged commit 400bd0c into master May 1, 2024
21 checks passed
@alanwaketan alanwaketan deleted the alanwaketan/fa_segment_ids branch May 1, 2024 18:38