
[TRT] Transpose layernorm fusion with different input format #50082

Merged

17 commits merged on Feb 9, 2023

Conversation

Contributor

@wwbitejotunn wwbitejotunn commented Jan 30, 2023

PR types

Performance optimization

PR changes

OPs

Describe

This PR adds a fusion of transpose (NCHW -> NHWC) + layernorm. The fusion speeds up the Stable Diffusion model by about 0.01 s (1.25%) on an A100 40G GPU.
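For readers skimming the diff, a minimal CPU reference of the unfused semantics that this PR collapses into one op: transpose NCHW to NHWC, then layernorm over the last (channel) dimension. The function name, shapes, and epsilon here are illustrative assumptions, not the Paddle API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference (unfused) semantics: transpose NCHW -> NHWC, then
// layernorm over the trailing C dimension of the NHWC result.
std::vector<float> transpose_layernorm_ref(const std::vector<float>& in,
                                           int N, int C, int H, int W,
                                           float eps = 1e-5f) {
  std::vector<float> out(in.size());
  for (int n = 0; n < N; ++n)
    for (int h = 0; h < H; ++h)
      for (int w = 0; w < W; ++w) {
        // Mean and variance over the C values at this (n, h, w),
        // read from the NCHW input layout.
        float mean = 0.f, var = 0.f;
        for (int c = 0; c < C; ++c)
          mean += in[((n * C + c) * H + h) * W + w];
        mean /= C;
        for (int c = 0; c < C; ++c) {
          float d = in[((n * C + c) * H + h) * W + w] - mean;
          var += d * d;
        }
        var /= C;
        float inv_std = 1.f / std::sqrt(var + eps);
        // Write normalized values in NHWC layout.
        for (int c = 0; c < C; ++c)
          out[((n * H + h) * W + w) * C + c] =
              (in[((n * C + c) * H + h) * W + w] - mean) * inv_std;
      }
  return out;
}
```

The fused TRT plugin avoids materializing the transposed intermediate, which is where the speedup comes from.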

@wwbitejotunn wwbitejotunn changed the title Trans layernorm [TRT] Transpose layernorm fusion with different input format Feb 2, 2023
code clean
@@ -0,0 +1,188 @@
/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Contributor

2022 -> 2023

Contributor Author

@wwbitejotunn wwbitejotunn Feb 4, 2023

The license date has been updated, thanks for the review~

mean,
var,
stream);
#endif
Contributor

This pass could also be used in models like GPT, but it looks like P-series cards cannot satisfy CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH).

Contributor Author

@wwbitejotunn wwbitejotunn Feb 4, 2023

The CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH) compile guard has been restored, thanks for the review~

return;
}
std::unordered_set<const Node *> del_node_set;
// Create a preln_groupnorm_act op node
Contributor

Please add a comment here.

Contributor Author

Detailed comments describing the pass have been added, thanks for the review.

begin_norm_axis,
eps);

} else if (input_desc[0].format == nvinfer1::PluginFormat::kHWC8) {
Contributor

@zhoutianzi666 zhoutianzi666 Feb 3, 2023

When the input is float, the code above only supports kLINEAR, so can this part be removed?
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac3e115b1a2b1e578e8221ef99d27cd45; the TRT docs also show that C8 is limited to fp16.

Contributor Author

The redundant C8-format branch for float input has been removed and the float computation logic has been completed, thanks for the review.

float val_2 = 0.0f;
half2 tmp;
{
tmp = __ldg(&src[index]);
Contributor

This instruction fails to compile on older cards.

Contributor Author

A compile guard has been added to prevent compilation failures on older cards, thanks for the review.

}
tmp.x = __float2half_rn(val_1);
tmp.y = __float2half_rn(val_2);
output[index] = tmp;
Contributor

@zhoutianzi666 zhoutianzi666 Feb 3, 2023

Should output[index] = tmp; be moved before tmp.x = __float2half_rn(val_1);?
The logic above seems to convert half to float and then back to half?

Contributor

It should indeed be moved before that.

Contributor Author

Right, this computation is redundant; the computation logic has been partially revised, please review again.

}
tmp.x = __float2half_rn(val_1);
tmp.y = __float2half_rn(val_2);
output[index] = tmp;
Contributor

It should indeed be moved before that.

phi::funcs::BlockReduceSumV2<float, 2>(sums);

if (threadIdx.x == 0) {
s_mean = sums[0] / n / 2;
Contributor

I think the 2 here could be hoisted out into a

constexpr int Half2VecSize = 2; 

which would be clearer.

Contributor Author

Thanks for the review; a constexpr is now used and the logic has been straightened out.
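The reviewer's naming suggestion, sketched in simplified form (the function name and surrounding code are assumptions; only the constexpr pattern is from the discussion):

```cpp
#include <cassert>

// Name the half2 vector width instead of a bare literal 2:
// each half2 packs two half values, so the block-reduced sum
// covers n * kHalf2VecSize scalar elements.
constexpr int Half2VecSize = 2;

// Illustrative mean computation, assuming `sum` already holds the
// block-reduced total over n half2 elements.
float block_mean(float sum, int n) {
  return sum / n / Half2VecSize;
}
```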


#endif

// using half = phi::dtype::float16;
Contributor

remove

Contributor Author

The leftover debug comment has been removed, thanks for the review.

// |
// trans_layernorm
// | |
// out layernorm_out
Contributor

Please add detailed comments here noting that this pass only takes effect for transpose(0, 2, 3, 1) followed by a norm over the last dimension.

Contributor Author

Detailed comments describing the pass have been added, thanks for the review.

float val_2 = 0.0f;
half2 tmp;
{
tmp = __ldg(&src[index]);
Contributor

With const __restrict__ qualifiers on the pointer, __ldg is actually unnecessary; the compiler will detect the read-only access and apply the same optimization.
From: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x

Marking pointers used for loading such data with both the 
const and __restrict__ qualifiers increases the likelihood that 
the compiler will detect the read-only condition.

Contributor Author

Note that the documentation only says this increases the likelihood; since the impact here would be significant, __ldg is left unchanged for now, and a compile guard filters out GPU architectures that could be problematic.
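A CPU analogue of the qualifier pattern the reviewer cites (function and names are assumptions): marking a pointer both const and restrict tells the compiler the data is read-only and non-aliased, which on CUDA sm_35+ lets it route loads through the read-only data cache, the same path an explicit __ldg() requests. The qualifiers carry over directly to kernel parameters:

```c
#include <assert.h>

/* With const + restrict the compiler can prove src is read-only and
 * does not alias dst; in a CUDA kernel the equivalent qualifiers
 * (const T* __restrict__) enable read-only-cache loads. */
void scale(const float* restrict src, float* restrict dst,
           int n, float k) {
  for (int i = 0; i < n; ++i) dst[i] = src[i] * k;
}
```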

Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

@heavengate heavengate merged commit b2bb7ec into PaddlePaddle:develop Feb 9, 2023