[TRT] Transpose layernorm fusion with different input format #50082
Conversation
@@ -0,0 +1,188 @@
/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
2022 -> 2023
The license date has been updated, thanks for the review~
mean,
var,
stream);
#endif
This pass can also be used in models like GPT, but it looks like P-series cards cannot satisfy CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH).
The CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH) compile guard has been restored, thanks for the review~
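For reference, a minimal sketch of the guard pattern under discussion, assuming Paddle's CUDA_ARCH_FP16_SUPPORTED macro compiles the half2 fast path only on new enough architectures (the sm_53 fallback threshold and the stub kernel below are illustrative, not the PR's code):

#include <cuda_fp16.h>

// Illustrative fallback; Paddle's headers define the real threshold.
#ifndef CUDA_ARCH_FP16_SUPPORTED
#define CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH) ((CUDA_ARCH) >= 530)
#endif

__global__ void half2_copy_stub(const __half2 *src, __half2 *dst, int n) {
#if defined(__CUDA_ARCH__) && CUDA_ARCH_FP16_SUPPORTED(__CUDA_ARCH__)
  // half2 fast path: only compiled where the FP16 intrinsics exist.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = __ldg(&src[i]);
#endif  // older architectures get an empty body instead of a build error
}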
return;
}
std::unordered_set<const Node *> del_node_set;
// Create an preln_groupnorm_act op node
Update this comment.
Detailed comments describing the pass have been added, thanks for the review.
begin_norm_axis,
eps);

} else if (input_desc[0].format == nvinfer1::PluginFormat::kHWC8) {
When the input is float, it looks like only kLINEAR is supported above; can this part be deleted? The TRT docs (https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac3e115b1a2b1e578e8221ef99d27cd45) also show that HWC8 is limited to fp16.
The redundant HWC8-format branch for float inputs has been removed and the float computation logic improved, thanks for the review.
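For reference, a hedged sketch of the resulting format check using the standard TensorRT supportsFormatCombination hook (written as a free function here for brevity; the PR's plugin member may differ). Per the linked docs, kHWC8 is an FP16-only format, so float inputs stay on kLINEAR:

#include <NvInfer.h>

bool supportsFormatCombination(int pos,
                               const nvinfer1::PluginTensorDesc *in_out,
                               int nb_inputs, int nb_outputs) {
  const nvinfer1::PluginTensorDesc &desc = in_out[pos];
  if (desc.type == nvinfer1::DataType::kFLOAT) {
    // float tensors only support the plain linear layout
    return desc.format == nvinfer1::TensorFormat::kLINEAR;
  }
  if (desc.type == nvinfer1::DataType::kHALF) {
    // fp16 tensors may additionally use the packed HWC8 layout
    return desc.format == nvinfer1::TensorFormat::kLINEAR ||
           desc.format == nvinfer1::TensorFormat::kHWC8;
  }
  return false;
}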
float val_2 = 0.0f;
half2 tmp;
{
tmp = __ldg(&src[index]);
Instructions like this fail to compile on older cards.
A compile guard has been added to avoid build failures on older cards, thanks for the review.
}
tmp.x = __float2half_rn(val_1);
tmp.y = __float2half_rn(val_2);
output[index] = tmp;
Should output[index] = tmp; be moved before tmp.x = __float2half_rn(val_1);? The logic above seems to convert half to float and then back to half.
Yes, it should be moved before that.
Right, this computation was redundant; the logic has been revised, please take another look.
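A minimal sketch of the revised logic under discussion (illustrative names; the real kernel uses phi::funcs::BlockReduceSumV2 where this sketch takes an atomicAdd shortcut): the loaded half2 is stored as-is, and the float conversions feed only the statistics, removing the half -> float -> half round trip:

#include <cuda_fp16.h>

__global__ void copy_and_accumulate(const __half2 *__restrict__ src,
                                    __half2 *output, float *sums, int n) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= n) return;
  __half2 tmp = __ldg(&src[index]);
  output[index] = tmp;                // store before any conversion
  float val_1 = __half2float(tmp.x);  // floats feed the reduction only
  float val_2 = __half2float(tmp.y);
  atomicAdd(&sums[0], val_1 + val_2);                  // sum(x)
  atomicAdd(&sums[1], val_1 * val_1 + val_2 * val_2);  // sum(x^2)
}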
phi::funcs::BlockReduceSumV2<float, 2>(sums);

if (threadIdx.x == 0) {
s_mean = sums[0] / n / 2;
The 2 here could be hoisted out front as a constexpr int Half2VecSize = 2; which would be clearer.
Thanks for the review; a constexpr has been used to tidy up the logic.
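A short sketch of the suggested cleanup (constant name from the review; the surrounding kernel is omitted): the named constexpr makes it clear that the extra division accounts for the two scalars packed in each half2 element:

constexpr int Half2VecSize = 2;  // scalars packed per half2 element

__device__ float row_mean(const float *sums, int n) {
  // n counts half2 loads, so the row holds n * Half2VecSize scalars
  return sums[0] / (n * Half2VecSize);
}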
#endif

// using half = phi::dtype::float16;
remove
The leftover debug comment has been removed, thanks for the review.
//         |
//   trans_layernorm
//     |         |
//    out   layernorm_out
Please add more detailed comments here, noting that this pass only takes effect when the transpose is (0, 2, 3, 1) and the norm is applied to the last dimension.
Detailed comments describing the pass have been added, thanks for the review.
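For reference, a sketch of the kind of pattern note requested (wording is illustrative, not the exact comment merged in the PR):

// The pass matches only this exact pattern: the transpose permutation
// must be (0, 2, 3, 1) and layer_norm must normalize the resulting
// last dimension, since the fused kernel computes mean/variance per
// NHWC row in a single pass.
//
//   x (NCHW)
//      |
//   transpose(0, 2, 3, 1)   -> NHWC
//      |
//   layer_norm(last axis)
//    |        |
//   out   layernorm_out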
float val_2 = 0.0f;
half2 tmp;
{
tmp = __ldg(&src[index]);
With const __restrict__ qualifiers, __ldg is actually unnecessary; the compiler will detect the same optimization.
From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x: "Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition."
Note that the docs only say this increases the likelihood, and the impact here is significant, so __ldg is kept unchanged for now; a compile guard filters out GPU architectures that could be problematic.
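A hedged sketch contrasting the two options (illustrative kernels, not the PR's code): __ldg forces the read-only-cache load explicitly, while const plus __restrict__ only raises the likelihood that the compiler emits the same load, per the docs quoted above:

#include <cuda_fp16.h>

__global__ void load_explicit(const __half2 *__restrict__ src,
                              __half2 *dst, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = __ldg(&src[i]);  // explicit read-only-cache load
}

__global__ void load_hinted(const __half2 *__restrict__ src,
                            __half2 *dst, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];  // compiler may emit the same LDG thanks
                               // to the const + __restrict__ qualifiers
}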
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
This PR adds a fusion of transpose (NCHW -> NHWC) + layernorm. The fusion speeds up the Stable Diffusion model by about 0.01 s on an A100 40G GPU (1.25%).