
[TRT] Transpose layernorm fusion with different input format #50082

Merged

17 commits merged on Feb 9, 2023

Conversation

Contributor

@wwbitejotunn wwbitejotunn commented Jan 30, 2023

PR types

Performance optimization

PR changes

OPs

Describe

This PR adds a fusion of transpose (NCHW -> NHWC) + layernorm. The fusion speeds up the Stable Diffusion model by about 0.01 s (1.25%) on an A100 40G GPU.
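For readers skimming the diff, a minimal CPU reference of the unfused semantics that this PR collapses into one op: transpose NCHW to NHWC, then layernorm over the last (channel) dimension. The function name, shapes, and epsilon here are illustrative assumptions, not the Paddle API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference (unfused) semantics: transpose NCHW -> NHWC, then
// layernorm over the trailing C dimension of the NHWC result.
std::vector<float> transpose_layernorm_ref(const std::vector<float>& in,
                                           int N, int C, int H, int W,
                                           float eps = 1e-5f) {
  std::vector<float> out(in.size());
  for (int n = 0; n < N; ++n)
    for (int h = 0; h < H; ++h)
      for (int w = 0; w < W; ++w) {
        // Mean and variance over the C values at this (n, h, w),
        // read from the NCHW input layout.
        float mean = 0.f, var = 0.f;
        for (int c = 0; c < C; ++c)
          mean += in[((n * C + c) * H + h) * W + w];
        mean /= C;
        for (int c = 0; c < C; ++c) {
          float d = in[((n * C + c) * H + h) * W + w] - mean;
          var += d * d;
        }
        var /= C;
        float inv_std = 1.f / std::sqrt(var + eps);
        // Write normalized values in NHWC layout.
        for (int c = 0; c < C; ++c)
          out[((n * H + h) * W + w) * C + c] =
              (in[((n * C + c) * H + h) * W + w] - mean) * inv_std;
      }
  return out;
}
```

The fused TRT plugin avoids materializing the transposed intermediate, which is where the speedup comes from.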

@wwbitejotunn wwbitejotunn changed the title Trans layernorm [TRT] Transpose layernorm fusion with different input format Feb 2, 2023
code clean
@@ -0,0 +1,188 @@
/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Contributor

2022 -> 2023

Contributor Author

@wwbitejotunn wwbitejotunn Feb 4, 2023

The license date has been updated, thanks for the review~

mean,
var,
stream);
#endif
Contributor

This pass could also be used in models like GPT, but it looks like P-series cards cannot satisfy CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH).

Contributor Author

@wwbitejotunn wwbitejotunn Feb 4, 2023

The CUDA_ARCH_FP16_SUPPORTED(CUDA_ARCH) compile guard has been restored, thanks for the review~

return;
}
std::unordered_set<const Node *> del_node_set;
// Create a preln_groupnorm_act op node
Contributor

Please add a comment here.

Contributor Author

Detailed comments describing the pass have been added, thanks for the review.

begin_norm_axis,
eps);

} else if (input_desc[0].format == nvinfer1::PluginFormat::kHWC8) {
Contributor

@zhoutianzi666 zhoutianzi666 Feb 3, 2023

When the input is float, the code above only supports kLINEAR, so can this part be removed?
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac3e115b1a2b1e578e8221ef99d27cd45; the TRT docs also show that C8 is limited to fp16.

Contributor Author

The redundant C8-format branch for float input has been removed and the float computation logic has been completed, thanks for the review.

float val_2 = 0.0f;
half2 tmp;
{
tmp = __ldg(&src[index]);
Contributor

This instruction fails to compile on older cards.

Contributor Author

A compile guard has been added to prevent compilation failures on older cards, thanks for the review.

}
tmp.x = __float2half_rn(val_1);
tmp.y = __float2half_rn(val_2);
output[index] = tmp;
Contributor

@zhoutianzi666 zhoutianzi666 Feb 3, 2023

Should output[index] = tmp; be moved before tmp.x = __float2half_rn(val_1);?
The logic above seems to convert half to float and then back to half?

Contributor

It should indeed be moved before that.

Contributor Author

Right, this computation is redundant; the computation logic has been partially revised, please review again.

}
tmp.x = __float2half_rn(val_1);
tmp.y = __float2half_rn(val_2);
output[index] = tmp;
Contributor

It should indeed be moved before that.

phi::funcs::BlockReduceSumV2<float, 2>(sums);

if (threadIdx.x == 0) {
s_mean = sums[0] / n / 2;
Contributor

I think the 2 here could be hoisted out into a

constexpr int Half2VecSize = 2; 

which would be clearer.

Contributor Author

Thanks for the review; a constexpr is now used and the logic has been straightened out.
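The reviewer's naming suggestion, sketched in simplified form (the function name and surrounding code are assumptions; only the constexpr pattern is from the discussion):

```cpp
#include <cassert>

// Name the half2 vector width instead of a bare literal 2:
// each half2 packs two half values, so the block-reduced sum
// covers n * kHalf2VecSize scalar elements.
constexpr int Half2VecSize = 2;

// Illustrative mean computation, assuming `sum` already holds the
// block-reduced total over n half2 elements.
float block_mean(float sum, int n) {
  return sum / n / Half2VecSize;
}
```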


#endif

// using half = phi::dtype::float16;
Contributor

remove

Contributor Author

The leftover debug comment has been removed, thanks for the review.

// |
// trans_layernorm
// | |
// out layernorm_out
Contributor

Please add detailed comments here noting that this pass only takes effect for transpose(0, 2, 3, 1) followed by a norm over the last dimension.

Contributor Author

Detailed comments describing the pass have been added, thanks for the review.

float val_2 = 0.0f;
half2 tmp;
{
tmp = __ldg(&src[index]);
Contributor

With const __restrict__ qualifiers on the pointer, __ldg is actually unnecessary; the compiler will detect the read-only access and apply the same optimization.
From: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x

Marking pointers used for loading such data with both the 
const and __restrict__ qualifiers increases the likelihood that 
the compiler will detect the read-only condition.

Contributor Author

Note that the documentation only says this increases the likelihood; since the impact here would be significant, __ldg is left unchanged for now, and a compile guard filters out GPU architectures that could be problematic.
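A CPU analogue of the qualifier pattern the reviewer cites (function and names are assumptions): marking a pointer both const and restrict tells the compiler the data is read-only and non-aliased, which on CUDA sm_35+ lets it route loads through the read-only data cache, the same path an explicit __ldg() requests. The qualifiers carry over directly to kernel parameters:

```c
#include <assert.h>

/* With const + restrict the compiler can prove src is read-only and
 * does not alias dst; in a CUDA kernel the equivalent qualifiers
 * (const T* __restrict__) enable read-only-cache loads. */
void scale(const float* restrict src, float* restrict dst,
           int n, float k) {
  for (int i = 0; i < n; ++i) dst[i] = src[i] * k;
}
```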

Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

@heavengate heavengate merged commit b2bb7ec into PaddlePaddle:develop Feb 9, 2023