
Add attn_mask supported for FlashAttnKernel. #55969

Merged
merged 14 commits into PaddlePaddle:develop on Aug 7, 2023

Conversation

iosmers
Contributor

@iosmers iosmers commented Aug 3, 2023

PR types

Others

PR changes

Others

Description

Pcard-70459

Add support for a Tensor-type attn_mask input to FlashAttnKernel.

#55758 already integrated FlashAttention-2 with support for the regular causal mask into the framework. Since a Tensor-type mask requires kernel-level enhancements to FlashAttention-2, this PR integrates Tensor-mask support based on FlashAttention-1 for now. Performance numbers are as follows:

  • Known issue: for large seqlen, the current FlashAttention-1-based Tensor-mask implementation performs poorly; it will be optimized in a follow-up PR.

| Shape | Pass | Native | Causal FlashAttn-1 | Causal FlashAttn-2 | Masked FlashAttn-1 |
| --- | --- | --- | --- | --- | --- |
| [2, 1024, 40, 128] | Forward | 27.17 | 3.77 | 1.65 | 12.20 |
| [2, 1024, 40, 128] | Forward + backward | 56.52 | 14.23 | 8.41 | 36.53 |
| [1, 8192, 8, 128] | Forward | 168.36 | 20.95 | 8.44 | 671.29 |
| [1, 8192, 8, 128] | Forward + backward | 351.07 | 62.29 | 30.81 | 1950.40 |

@paddle-bot

paddle-bot bot commented Aug 3, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

)

out = scaled_dot_product_attention(
q, k, v, m, self.dropout, self.causal, fixed_seed_offset=None
Contributor

scaled_dot_product_attention is called with the Tensor mask m, but attention_naive does not apply m.

@@ -293,6 +293,48 @@ def test_all(self):
fetches_result[0], out_, rtol=5e-03, atol=1e-03
)

def test_dot_scale_product(self):
Contributor

It would be better to implement a separate unit-test class: several later unit tests use TestFlashAttentionAPI as a base class and change configurations such as shape and return_softmax, so with the current modification they would all end up testing the mask version as well.

Contributor

Add a check like the following to skip unsupported CUDA versions and hardware:

if not core.is_compiled_with_cuda() or get_cuda_version() < 11030 or not is_sm_supported:
    pass

}
} else {
succ =
phi::dynload::flash_attn_fwd(q.data(),
Contributor

Please refactor this code into smaller helper functions; this function is too long.

@@ -818,8 +818,9 @@
inplace : (out_grad -> x_grad)

- backward_op : flash_attn_grad
forward : flash_attn (Tensor q, Tensor k, Tensor v, Tensor fixed_seed_offset, float dropout = 0.0, bool causal = false, bool return_softmax = false, bool is_test = false, str rng_name = "") -> Tensor(out), Tensor(softmax), Tensor(softmax_lse), Tensor(seed_offset)
args : (Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor seed_offset, Tensor out_grad, float dropout = 0.0, bool causal = false)
forward : flash_attn (Tensor q, Tensor k, Tensor v, Tensor fixed_seed_offset,Tensor attn_mask, float dropout = 0.0, bool causal = false, bool return_softmax = false, bool is_test = false, str rng_name = "") -> Tensor(out), Tensor(softmax), Tensor(softmax_lse), Tensor(seed_offset)
Contributor

Add a space after the comma in `,Tensor attn_mask`.

args : (Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor seed_offset, Tensor out_grad, float dropout = 0.0, bool causal = false)
forward : flash_attn (Tensor q, Tensor k, Tensor v, Tensor fixed_seed_offset,Tensor attn_mask, float dropout = 0.0, bool causal = false, bool return_softmax = false, bool is_test = false, str rng_name = "") -> Tensor(out), Tensor(softmax), Tensor(softmax_lse), Tensor(seed_offset)
args : (Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor seed_offset, Tensor attn_mask, Tensor out_grad, float dropout = 0.0, bool causal = false)
optional : attn_mask
Contributor

Shouldn't fixed_seed_offset also be added to optional?

Contributor Author

fixed_seed_offset is a pre-existing parameter of type const Tensor; it is not optional.

Contributor

The original way of writing this does not feel very reasonable either, but let's keep it as is for now.

Contributor

No problem then: the backward pass takes seed_offset as input, which is an output of the forward pass and is required.

@@ -33,6 +34,24 @@ DECLARE_bool(cudnn_deterministic);

namespace phi {

// template <typename T, typename Context>
Contributor

Remove the unused code.

SimleScaleWithMaskKernel<<<gpu_config.block_per_grid,
gpu_config.thread_per_block,
0,
ctx.stream()>>>(q_size, scale, q_ptr);
Contributor

Line 163 defines a temporary Tensor. Shouldn't SimleScaleWithMaskKernel read the data of the original input q and write the scaled result into q_?

Contributor Author

The purpose of this function is to scale in place.
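For illustration, a minimal sketch of an in-place elementwise scale kernel matching the (q_size, scale, q_ptr) launch shown above; the merged SimleScaleWithMaskKernel is not shown in full in this diff, so the name and body below are assumptions:

```cpp
// Sketch only: a generic in-place elementwise scale. The merged kernel may differ.
template <typename T>
__global__ void InplaceScaleKernel(int64_t numel, float scale, T* data) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < numel) {
    data[idx] = static_cast<T>(static_cast<float>(data[idx]) * scale);
  }
}
```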


auto gpu_config = phi::backends::gpu::GetGpuLaunchConfig1D(ctx, q_size, 1);
DenseTensor q_(q);
T* q_ptr = static_cast<T*>(q_.data<T>());
Contributor

A Tensor first needs Resize to set its dimensions and then Alloc to allocate memory; otherwise no device memory is allocated.
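A minimal sketch of the Resize-then-Alloc pattern being described, assuming the usual phi DenseTensor API; the variable name is illustrative and the shape follows the surrounding diff:

```cpp
// Sketch: set the dims first, then allocate device memory via the context.
DenseTensor scaled_q;
scaled_q.Resize(phi::make_ddim({total_q, num_heads, head_size}));
T* scaled_q_ptr = ctx.template Alloc<T>(&scaled_q);
```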

int64_t q_size = total_q * num_heads * head_size;

auto gpu_config = phi::backends::gpu::GetGpuLaunchConfig1D(ctx, q_size, 1);
DenseTensor q_(q);
Contributor

Rename q_ to scaled_q.

fixed_seed_offset=None,
return_softmax=False,
training=True,
rng_name="",
Contributor

Only add the new training parameter; do not add the other parameters for now.

bool succ;

if (attn_mask.get_ptr()) {
scale = 1.0f;
Contributor

Add a PADDLE_ENFORCE check so that when attn_mask is passed in, is_causal cannot be true.

Contributor

Pay attention to how the error message is written; use:

PADDLE_ENFORCE_NE(causal, true, phi::errors::InvalidArgument(...));
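Putting the two suggestions together, a minimal sketch of the check; the message wording mirrors the string that appears later in this diff:

```cpp
// Sketch: reject a Tensor attn_mask combined with causal=true.
if (attn_mask.get_ptr()) {
  PADDLE_ENFORCE_NE(causal,
                    true,
                    phi::errors::InvalidArgument(
                        "attn_mask is not nullptr, causal can not be true"));
}
```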

Contributor

scale is passed in as a parameter, so the input scale value should not be modified. Instead, pass 1 as the scale argument of flash_attn, and pass this input scale to the later SimleScaleWithMaskKernel call.
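In other words, roughly the following; the launch configuration mirrors the diff above, while the constant name and comments are illustrative:

```cpp
// Sketch: leave the incoming `scale` parameter untouched.
// Apply the caller's scale in the scale/mask kernel ...
SimleScaleWithMaskKernel<<<gpu_config.block_per_grid,
                           gpu_config.thread_per_block,
                           0,
                           ctx.stream()>>>(q_size, scale, q_ptr);
// ... and pass a constant 1.0f as the scale argument of the flash_attn call,
// so the attention kernel does not scale a second time.
const float flash_attn_softmax_scale = 1.0f;  // illustrative name
```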





int64_t q_size = total_q * num_heads * head_size;
DenseTensor scale_q;
scale_q.ShareDataWith(q).Resize({total_q, num_heads, head_size});
Contributor

ShareDataWith is not appropriate here, because ComputeScaleQ effectively modifies the input values and it is unclear what problems that may cause. scaled_q should allocate its own memory, and ComputeScaleQ should be changed to a non-inplace version.
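A sketch of the non-inplace variant being requested; the kernel name, signature, and launch are assumptions for illustration, not the code that was merged:

```cpp
// Sketch: q stays const; scaled_q owns its own storage and receives the result.
template <typename T>
__global__ void ScaleQKernel(int64_t numel, float scale,
                             const T* src, T* dst) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < numel) {
    dst[idx] = static_cast<T>(static_cast<float>(src[idx]) * scale);
  }
}

// Usage inside the kernel function (illustrative):
//   DenseTensor scaled_q;
//   scaled_q.Resize(phi::make_ddim({total_q, num_heads, head_size}));
//   T* dst = ctx.template Alloc<T>(&scaled_q);
//   ScaleQKernel<T><<<gpu_config.block_per_grid, gpu_config.thread_per_block,
//                     0, ctx.stream()>>>(q_size, scale, q.data<T>(), dst);
```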

// compute scale Q
ComputeScaleQ(ctx, q_size, scale_q.data<T>(), scale);

scale = 1.0f;
Contributor

Do not modify the value of a function input parameter.

}
DenseTensor workspace;
if (workspace_size > 0) {
workspace = Empty<float>(ctx, {int64_t(workspace_size / sizeof(float))});
Contributor

Use static_cast<int64_t> for the type conversion.
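That is, roughly the following; only the cast changes, the rest of the line is as in the diff:

```cpp
workspace = Empty<float>(
    ctx, {static_cast<int64_t>(workspace_size / sizeof(float))});
```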

temp_rand_mask_dim.data() ? temp_rand_mask_dim.data() : nullptr,
nullptr);
PADDLE_ENFORCE_EQ(
succ, true, phi::errors::External(phi::dynload::flash_attn_error()));
Contributor

Wrap this in a helper function and make the error message more informative, something like "Error in Flash-Attention, detail information is xxx", where xxx is phi::dynload::flash_attn_error().
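A minimal sketch of such a check; phi's error helpers accept a printf-style format string, and the exact message wording is illustrative:

```cpp
// Sketch: attach the Flash-Attention error string to the enforce message.
PADDLE_ENFORCE_EQ(
    succ,
    true,
    phi::errors::External("Error in Flash-Attention, detail information is %s.",
                          phi::dynload::flash_attn_error()));
```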

"attn_mask is not nullptr, causal can not be true"));

int64_t q_size = total_q * num_heads * head_size;
DenseTensor* scale_q = new DenseTensor;
Contributor

Define the Tensor directly as `DenseTensor scale_q;` rather than with new.

attn_mask=None,
dropout_p=0.0,
is_causal=False,
training=True,
Contributor

Since new parameters are added, the parameter documentation at line 453 needs to be updated as well; you can refer to other APIs as examples.

true,
phi::errors::InvalidArgument(
"attn_mask is not nullptr, causal can not be true"));

Contributor

See the function below: Flash-Attention with Tensor-mask support has a restriction, namely if (head_dim == 32 || head_dim == 64 || head_dim == 128), so a PADDLE_ENFORCE check needs to be added here as well (a sketch follows after the quoted function):

bool CanUseFlashAttn() const {
#ifdef PADDLE_WITH_FLASHATTN
  if (!std::is_same<T, phi::dtype::bfloat16>::value &&
      !std::is_same<T, phi::dtype::float16>::value) {
    return false;
  }
  if (merge_qkv && batch_size == 1) {
    if (head_dim == 32 || head_dim == 64 || head_dim == 128) {
      return use_flash_attn;
    }
  }
#endif
  return false;
}
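A sketch of the kind of check being requested, using the head_size variable that appears elsewhere in this kernel diff; the message wording is illustrative:

```cpp
// Sketch: restrict the Tensor-mask path to the head dims supported by the
// FlashAttention-1 based kernel.
PADDLE_ENFORCE_EQ(
    head_size == 32 || head_size == 64 || head_size == 128,
    true,
    phi::errors::InvalidArgument(
        "attn_mask is only supported when head_size is 32, 64 or 128, "
        "but got %d.",
        head_size));
```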

@@ -19,6 +19,74 @@

namespace phi {

template <typename T, typename Context>
Contributor

This function declaration does not need to be added to this header file.

const int64_t* mask_dims);

template <typename T, typename Context>
void FlashAttnFwd(
Contributor

This function declaration does not need to be added to this header file.


PADDLE_ENFORCE_EQ(succ,
true,
"Error in Flash-Attention, detail information is ",
Contributor

The error message still needs an error type, i.e. phi::errors::External(...).

@Xreki
Contributor

Xreki commented Aug 7, 2023

Local unit-test results (40GB A100, CUDA 11.8) are as follows:

test 1307
    Start 1307: test_flash_attention

1307: Test command: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/cmake-3.18.0-Linux-x86_64/bin/cmake "-E" "env" "PYTHONPATH=/root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python" "/usr/bin/python3.8" "/root/paddlejob/workspace/work/liuyiqun/Paddle/tools/test_runner.py" "test_flash_attention"
1307: Test timeout computed to be: 10000000
1307: W0807 15:55:55.975852 139275 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.8, Runtime API Version: 11.8
1307: W0807 15:55:56.008502 139275 gpu_resources.cc:149] device: 0, cuDNN Version: 8.6.
1307: I0807 15:55:59.264534 139275 program_interpreter.cc:173] New Executor is Running.
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/framework.py:2832: UserWarning: The Attr(force_cpu) of Op(fill_constant) will be deprecated in the future, please use 'device_guard' instead. 'device_guard' has higher priority when they are used at the same time.
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/data_feeder.py:177: UserWarning: The data type of 'x' in reshape only support float16 in GPU now. 
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/executor.py:1243: UserWarning: The variable k is not found in program. It is not declared or is pruned.
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/executor.py:1243: UserWarning: The variable v is not found in program. It is not declared or is pruned.
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/data_feeder.py:177: UserWarning: The data type of 'x' in transpose only support float16 in GPU now. 
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/data_feeder.py:177: UserWarning: The data type of 'x' in matmul only support float16 in GPU now. 
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/data_feeder.py:177: UserWarning: The data type of 'y' in matmul only support float16 in GPU now. 
1307:   warnings.warn(
1307: /root/paddlejob/workspace/work/liuyiqun/Paddle/build_paddle/build_cuda11.8_gcc8.2.0_py3.8/python/paddle/fluid/data_feeder.py:177: UserWarning: The data type of 'x' in softmax only support float16 in GPU now. 
1307:   warnings.warn(
1307: Test case shape (2, 128, 8, 16) dtype float16 causal False
1307: Test unpadded case shape (2, 128, 8, 16) dtype float16 causal False
1307: Test case shape (2, 128, 8, 16) dtype paddle.float16 causal False
1307: Test unpadded case shape (2, 128, 8, 16) dtype paddle.float16 causal False
1307: Test case shape (2, 256, 8, 16) dtype paddle.float16 causal False
1307: Test unpadded case shape (2, 256, 8, 16) dtype paddle.float16 causal False
1307: Test case shape (2, 512, 8, 16) dtype paddle.float16 causal True
1307: Test unpadded case shape (2, 512, 8, 16) dtype paddle.float16 causal True
1307: Test case shape (8, 1024, 16, 128) dtype paddle.float16 causal False
1307: Test unpadded case shape (8, 1024, 16, 128) dtype paddle.float16 causal False
1307: Test case shape (8, 1024, 16, 128) dtype paddle.float16 causal False
1307: Test unpadded case shape (8, 1024, 16, 128) dtype paddle.float16 causal False
1307: Test case shape (8, 1024, 16, 128) dtype paddle.float16 causal False
1307: Test unpadded case shape (8, 1024, 16, 128) dtype paddle.float16 causal False
1/1 Test #1307: test_flash_attention .............   Passed   56.05 sec

The following tests passed:
        test_flash_attention

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  56.15 sec

Contributor

@JamesLim-sy JamesLim-sy left a comment

LGTM

Contributor

@ZzSean ZzSean left a comment

LGTM for skipIf

@Xreki Xreki changed the title add mask Add attn_mask suupprted for FlashAttnKernel. Aug 7, 2023
@Xreki Xreki changed the title Add attn_mask suupprted for FlashAttnKernel. Add attn_mask supported for FlashAttnKernel. Aug 7, 2023
Contributor

@lanxianghit lanxianghit left a comment

LGTM for new args

Contributor

@chenwhql chenwhql left a comment

LGTM for yaml change

Contributor

@sunzhongkai588 sunzhongkai588 left a comment

LGTM

  • The Doc-Preview CI has some issues and keeps failing; the cause has not been identified yet. Approving for now.

@Xreki Xreki merged commit 42e0c6b into PaddlePaddle:develop Aug 7, 2023