Add varlen support to AOTriton's Flash Attention #31
Conversation
…_q/k" (Note this is not the exact SQL statement)
I took a quick look and realized your implementation should work for the THD qkv_layout in NVTE.
How about the BSHD/BHSD layouts with a padding mask and cu_seqlen, as we discussed before?
It's in the Triton kernel but not exposed as a C++ API. They will be added in a separate PR.
Do you mind pointing me to the Triton source code for `padded_varlen`?
aotriton/tritonsrc/fwd_kernel.py Lines 73 to 85 in c8551b1
LGTM
Looking to add support for more layouts later, but that will be a different piece of work. This looks to meet the needs for varlen.
@@ -31,6 +31,25 @@ attn_fwd(T4 q, // batch_size x num_heads x seqlen_q x head_size
          bool is_causal,
          aotriton::Stream stream);

 hipError_t
 attn_fwd_compact_varlen(T4 q, // 1 x num_heads x total_q x head_size, total_q := \sum_{i=0}^{b} s_i
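To make the `total_q := \sum_{i=0}^{b} s_i` comment above concrete, here is a minimal sketch (not code from this PR; the names and values are illustrative) of the packed-length bookkeeping a compact-varlen caller typically builds:

```python
import torch

# Hypothetical per-batch sequence lengths s_0 .. s_{b-1}
seqlens = torch.tensor([3, 7, 5], dtype=torch.int32)

# total_q is the sum of all per-sequence lengths, matching the
# "total_q := \sum s_i" comment in the signature above.
total_q = int(seqlens.sum())

# Cumulative sequence lengths (prefix sums with a leading 0), the usual
# "cu_seqlens" convention in varlen flash-attention front ends.
cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
# cu_seqlens == tensor([0, 3, 10, 15]); sequence i occupies rows
# cu_seqlens[i]:cu_seqlens[i+1] along the packed total_q dimension.
```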
Fine for this PR but will we be adding some layout flag or defaulting to this for other layouts?
What do you mean by a layout flag? The current plan is for all inputs to be in BHSD layout.
If new layouts are needed, we will add new APIs instead of changing existing ones.
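For reference, a sketch (my own illustration, not code from this PR) of reordering a BSHD tensor into the BHSD layout the current plan assumes:

```python
import torch

batch, seqlen, num_heads, head_size = 2, 128, 8, 64

# BSHD: batch x seqlen x num_heads x head_size (a common framework layout)
q_bshd = torch.randn(batch, seqlen, num_heads, head_size)

# BHSD: batch x num_heads x seqlen x head_size (layout the API expects)
q_bhsd = q_bshd.transpose(1, 2).contiguous()
assert q_bhsd.shape == (batch, num_heads, seqlen, head_size)
```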
Varlen Flash Attention is implemented by two new APIs: `attn_fwd_compact_varlen` and `attn_bwd_compact_varlen`, with the same set of kernels. Check `include/aotriton/flash.h` for their details. Note: `b` is reserved; users should pass `TensorView<4>::get_null_tensor()` for this argument for now. Any other inputs are not supported nor tested. Use `torch.transpose` and `torch.unsqueeze` to match the API.
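As an illustration of the `torch.transpose`/`torch.unsqueeze` remark (a sketch under assumed shapes, not code from this PR): a caller holding already-packed tensors in `total_q x num_heads x head_size` order could match the `1 x num_heads x total_q x head_size` layout from the signature above like this:

```python
import torch

total_q, num_heads, head_size = 15, 8, 64

# Packed queries for all sequences, concatenated along the first dim.
q_packed = torch.randn(total_q, num_heads, head_size)

# Swap the sequence and head dims, then add the leading batch-of-1 dim
# to obtain 1 x num_heads x total_q x head_size.
q_api = torch.unsqueeze(torch.transpose(q_packed, 0, 1), 0)
assert q_api.shape == (1, num_heads, total_q, head_size)
```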