Additional mask support on FA2 #19

Merged: 20 commits, Sep 22, 2023
Conversation

@umiswing (Member) commented Sep 13, 2023

Support additional mask on FA2.

.so size: 66M -> 96M

@CLAassistant commented Sep 13, 2023

CLA assistant check
All committers have signed the CLA.

@@ -86,8 +86,12 @@ void set_params_fprop(Flash_fwd_params &params,
void * const softmax_lse_d,
float p_dropout,
float softmax_scale,
float softmax_unscale,
Collaborator:
Why does this need to be passed in as a parameter? Couldn't it be computed inside the kernel?

Member Author:
> Why does this need to be passed in as a parameter? Couldn't it be computed inside the kernel?

The idea is mainly to replace division with multiplication, for the reasons below, though I haven't compared the performance of the two implementations:

  1. Someone on the NVIDIA forums benchmarked this on V100 and found division to be much slower than multiplication. https://forums.developer.nvidia.com/t/speed-comparison-of-division-compared-to-other-arithmetic-operations-perhaps-something-like-clock-cycles/168371
  2. I haven't tested on A100, but from experience multiplication should be faster than division. Even on CPUs, division instructions implemented with the trial-quotient (long-division) method take a large number of clock cycles.

Collaborator:
If it's stored in Params, the whole Params struct has to be transferred from CPU to GPU. What I mean is: after the CUDA kernel has been launched, it could be computed from softmax_scale for later use?
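For illustration, a minimal standalone CUDA sketch of the reviewer's suggestion (the kernel and parameter names here are hypothetical, not the PR's actual code): derive the reciprocal once inside the kernel from softmax_scale and multiply by it afterwards, so nothing extra needs to travel through the params struct.

```cpp
// Hypothetical sketch, not the FA2 kernel: compute the unscale factor inside
// the kernel instead of passing it from the host, then multiply by it.
__global__ void unscale_in_kernel(const float *in, float *out, int n,
                                  float softmax_scale) {
    const float softmax_unscale = 1.f / softmax_scale;   // computed in-kernel
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * softmax_unscale;  // multiply instead of dividing per element
    }
}
```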

ASSERT_CHECK(head_size <= 256);
ASSERT_CHECK(num_heads == num_heads_k);

if (attn_mask) {
Collaborator:
Please also add a check on mask_dims[0].

Member Author:
done
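Presumably the added check follows the existing ASSERT_CHECK pattern visible elsewhere in this diff; a minimal sketch (the batch_size name and the allowed values are my assumptions):

```cpp
// Hypothetical sketch of the requested batch-dimension check on the mask.
ASSERT_CHECK(mask_dims[0] == 1 || mask_dims[0] == batch_size);
```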

@@ -295,8 +325,12 @@ bool flash_attn_fwd(const void * const q,
softmax_lse_ptr,
p_dropout,
softmax_scale,
softmax_unscale,
Collaborator:
The indentation is a bit off again.

Member Author:
done

ASSERT_CHECK(mask_dims[1] == 1 || mask_dims[1] == num_heads);
ASSERT_CHECK(mask_dims[2] == 1 || mask_dims[2] == seqlen_q);
#if 0
ASSERT_CHECK(softmax_scale == 1.0f);
Collaborator:
This can be removed.

@@ -370,8 +416,12 @@ bool flash_attn_varlen_fwd(const void * const q,
softmax_lse_ptr,
p_dropout,
softmax_scale,
1, // just hack
Collaborator:
What does this do?

@umiswing changed the title from "[WIP] Additional mask support on FA2" to "Additional mask support on FA2" on Sep 21, 2023
@@ -64,6 +64,30 @@ const char *flash_attn_error() {
#define FLASHATTNLIB_BEGIN_FUNC try {
#define FLASHATTNLIB_END_FUNC } catch (::std::exception &__e) { flash_attn_set_error(__e.what()); return false; } catch (...) { flash_attn_set_error(nullptr); return false; }

#define CHECK_FWD_EXECTUABLE(__seqlen_q, __seqlen_k) \
Collaborator:
I still think it would be better to define this as a function.
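A minimal sketch of what the function form could look like (the body below is a placeholder assembled from checks visible elsewhere in this PR, not the macro's actual contents):

```cpp
// Hypothetical sketch: an inline function in place of CHECK_FWD_EXECTUABLE.
// ASSERT_CHECK is the repo's existing assertion macro; the specific checks
// here are only illustrative.
inline void check_fwd_executable(const int seqlen_q, const int seqlen_k,
                                 const int head_size, const int num_heads,
                                 const int num_heads_k) {
    ASSERT_CHECK(seqlen_q > 0);
    ASSERT_CHECK(seqlen_k > 0);
    ASSERT_CHECK(head_size <= 256);
    ASSERT_CHECK(num_heads == num_heads_k);
}
```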

num_splits,
const_cast<void *>(attn_mask),
mask_head_mod_size,
mask_seq_q_mod_size);
Collaborator:
I feel the variable names might be better as mask_num_heads and mask_seqlen_q.

@@ -101,6 +102,11 @@ struct Flash_fwd_params : public Qkv_params {

bool is_bf16;
bool is_causal;

// The attn mask matrix
void * __restrict__ attn_mask_ptr;
Collaborator:
There is an int-typed blockmask pointer at L85; what is that used for?

@@ -448,7 +448,13 @@ inline __device__ void compute_dq_dk_dv_1colblock(const Params &params, const in
const BlockInfo</*Varlen=*/!Is_even_MN> binfo(params, bidb);
if (n_block * kBlockN >= binfo.actual_seqlen_k || binfo.actual_seqlen_q == 0) return;

int m_block_max = cute::ceil_div(binfo.actual_seqlen_q, kBlockM);
// umiswing: residue is for predication of additional mask gmem access.
Collaborator:
Could these extra computations be wrapped in an if (Is_attn_mask) check? Also, I think the additional-mask gmem access logic would be better encapsulated; otherwise it is too intrusive to the kernel, and merging the latest upstream code later will be difficult.
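A minimal sketch of the kind of guard and encapsulation being suggested (the helper name and the residue formula are my assumptions, not the PR's code): hide the mask-only residue arithmetic behind the compile-time Is_attn_mask flag so the non-mask instantiation pays nothing.

```cpp
// Hypothetical helper: compute the leftover rows/columns of the last tile only
// when the additional-mask path is compiled in; otherwise return the full
// sequence length so the caller's predication becomes a no-op.
template <bool Is_attn_mask>
__forceinline__ __device__ int last_block_residue(const int actual_seqlen,
                                                  const int block_size,
                                                  const int num_blocks) {
    if constexpr (Is_attn_mask) {
        return actual_seqlen - (num_blocks - 1) * block_size;
    } else {
        return actual_seqlen;
    }
}
```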

m_block == m_block_max - 1 ? m_residue : params.seqlen_q,
n_block == n_block_max - 1 ? n_residue : params.seqlen_k,
params.unscale_softmax);
tPgMask.data() = tPgMask.data() + (-kBlockM * params.seqlen_k);
Collaborator:
Is this line doing a pointer transformation? Could you add a comment?
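For what it's worth, my reading of that line written out as a comment (the direction of traversal is an assumption on my part): it rewinds the global-memory mask pointer by one kBlockM row block, the additional mask apparently being row-major with seqlen_k elements per row.

```cpp
// Suggested comment (my interpretation, not the author's): step the gmem mask
// pointer back by kBlockM rows of the row-major mask, i.e. kBlockM *
// params.seqlen_k elements, so the next iteration of the m-loop reads the
// preceding row block of the additional mask.
tPgMask.data() = tPgMask.data() + (-kBlockM * params.seqlen_k);
```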

}
kernel<<<grid_n, Kernel_traits::kNThreads, smem_size_dq_dk_dv, stream>>>(params);
C10_CUDA_KERNEL_LAUNCH_CHECK();
BOOL_SWITCH(is_attn_mask, Is_attn_mask, [&] {
Collaborator:
Won't this introduce compiling the combination where Is_causal and Is_attn_mask are both true?
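For context, a self-contained sketch of why nesting two boolean switches instantiates all four (Is_causal, Is_attn_mask) combinations, plus one way to avoid the unwanted one; the BOOL_SWITCH below is written out locally and may differ in detail from the repo's static_switch.h.

```cpp
// Local stand-in for a compile-time boolean switch, written out so the sketch
// is self-contained; the repo's own BOOL_SWITCH may differ.
#define BOOL_SWITCH(COND, CONST_NAME, ...)     \
    [&] {                                      \
        if (COND) {                            \
            constexpr bool CONST_NAME = true;  \
            return __VA_ARGS__();              \
        } else {                               \
            constexpr bool CONST_NAME = false; \
            return __VA_ARGS__();              \
        }                                      \
    }()

template <bool Is_causal, bool Is_attn_mask>
void launch() { /* one kernel instantiation per (Is_causal, Is_attn_mask) pair */ }

// Nesting the switches compiles all four combinations, including the one where
// both Is_causal and Is_attn_mask are true.
void dispatch(bool is_causal, bool is_attn_mask) {
    BOOL_SWITCH(is_causal, Is_causal, [&] {
        BOOL_SWITCH(is_attn_mask, Is_attn_mask, [&] {
            launch<Is_causal, Is_attn_mask>();
        });
    });
}

// If the (true, true) combination is unsupported, an explicit dispatch that
// simply never names it avoids that instantiation altogether.
void dispatch_pruned(bool is_causal, bool is_attn_mask) {
    if (is_causal)          launch</*Is_causal=*/true,  /*Is_attn_mask=*/false>();
    else if (is_attn_mask)  launch</*Is_causal=*/false, /*Is_attn_mask=*/true>();
    else                    launch</*Is_causal=*/false, /*Is_attn_mask=*/false>();
}
```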

// TODO(umiswing): support cu_attn_mask
// This kernel should work after dealing with input cu_seq indicating mask position.
template <typename Engine, typename Layout, typename T>
inline __device__ void apply_cu_attn_mask(Tensor<Engine, Layout> &tensor, const T* const mask, const float unscale_softmax, const uint32_t col_idx_offset_,
Collaborator:
This function isn't used?

Member Author:
> This function isn't used?

This is the kernel for the staircase-shaped mask requirement raised earlier, i.e. deciding whether to mask out based on row_idx and col_idx. The kernel isn't fully finished yet; should it be kept in this PR?

@Xreki (Collaborator) left a comment:
LGTM! Great work~

@Xreki merged commit b74460b into PaddlePaddle:main on Sep 22, 2023
1 check passed