
【PaddlePaddle Hackathon 4 No.34】Optimize Lerp OP performance on GPU for Paddle #53154

Merged · 19 commits · May 16, 2023

Conversation

WintersMontagne10335 (Contributor)

PR types

Performance optimization

PR changes

OPs

Description

Paddle's lerp operator is currently implemented as a composition of third-party library calls, and its performance is lacking. A solid optimization can be achieved by building on Paddle's internal Broadcast Kernel.
Design doc: PaddlePaddle/community#513

  • Development environment:
  1. Device: GTX 960
  2. Environment: CUDA 10.2, cuDNN 7.6
  • Optimization approach (a minimal sketch follows the list):
  1. Primarily built on the Broadcast Kernel plus a custom Functor
  2. Since weight is very often a scalar, the scalar-weight and tensor-weight cases are handled separately
Forward-inference performance of Paddle after the optimization, compared with Paddle before it:

| Case No. | Device | Input type | x_shape | y_shape | Origin Paddle perf (ms) | Current Paddle perf (ms) | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | GeForce GTX 960 | float32 | [-1L, 102400L] | [-1L, 102400L] | 0.6911145 | 0.6568878 | 5.2% faster |
| 2 | GeForce GTX 960 | float32 | [16L, 1L, 1L, 1L] | [16L, 3L, 224L, 224L] | 3.1153775 | 0.9911732 | 214.3% faster |
| 3 | GeForce GTX 960 | float16 | [-1L, 102400L] | [-1L, 102400L] | 0.5005047 | 0.3446356 | 45.2% faster |
| 4 | GeForce GTX 960 | float16 | [16L, 1L, 1L, 1L] | [16L, 3L, 224L, 224L] | 2.8568278 | 0.5562943 | 413.5% faster |

As the table shows, average performance improves by at least 20%, and the worst-performing case is now roughly 5x faster than before. Overall, the optimization yields a substantial performance gain.

@paddle-bot
paddle-bot bot commented Apr 21, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added contributor External developers status: proposed labels Apr 21, 2023
@paddle-bot
paddle-bot bot commented Apr 21, 2023

❌ This PR was not created using the PR template. You can refer to this Demo.
Please use the PR template; it saves our maintainers' time so that more developers can get help.

"The number of dimensions for LerpOp must be "
"greater than or equal to 0, but the value received is %d.",
rank));
PADDLE_ENFORCE_LE(
Contributor

Since the core computation uses BroadcastKernel, which has its own built-in validity checks, there is no need to keep the rank <= 6 restriction here; that limit existed to serve Eigen and can be removed.

Contributor Author

Done

broadcast_min_functor);
inputs.emplace_back(&x);
inputs.emplace_back(&b_min);
inputs.emplace_back(&weight);
Contributor

As I understand it, the computation here first pads each input's dimensions to match out_tensor's dimensions and then calls BroadcastKernel. BroadcastKernel has built-in logic that pads the dimension info directly; the only thing to take care of is setting the axis for the padding, so two separate calls are unnecessary. For how axis is set during dimension padding, refer to numpy.
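For illustration, here is how that axis-based padding behaves (the shapes and the default-axis rule below are assumptions drawn from numpy-style broadcasting, not taken from the PR itself):

```cpp
// Assumed illustration of axis-based dimension padding (numpy-style):
// with out.dims() = [16, 3, 224, 224] and an input of dims [3, 224],
// axis = 1 aligns the input at dimension 1, so it is padded to
// [1, 3, 224, 1]; the default axis = -1 right-aligns instead, i.e.
// axis becomes out_rank - in_rank = 2, giving [1, 1, 3, 224].
int axis = out_dims.size() - in_dims.size();  // default right alignment
```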

Contributor Author

Hello, and thank you very much for your guidance!
Do you mean that calling BroadcastKernel with a suitable axis value can perform both the dimension alignment and the computation, or that the data should first be preprocessed to align dimensions and BroadcastKernel then called just for the computation?
If it is the former, there is a special case I cannot handle.
When calling BroadcastKernel with the ET parameter set to ElementwiseType::kBinary and the three tensors in ins each having different dimensions, one tensor in ins always fails to broadcast no matter what value the axis parameter takes, because axis is a single number.
(screenshot of the example shapes, lost in extraction)
In that example, y cannot broadcast correctly.
The ExtendInputDimensions function in paddle/phi/kernels/funcs/dims_simplifier.h shows that the tensors in ins are broadcast one by one against outs[0]. Using the example above, if axis is 1, x cannot broadcast correctly; if axis is 2, y cannot broadcast correctly.
Am I calling the wrong kernel?
If it is the latter, is there any reference code for the preprocessing part?
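For concreteness, here is a hypothetical set of shapes (standing in for the lost screenshot) that reproduces the axis = 1 / axis = 2 failure described above:

```cpp
// Hypothetical shapes (not from the original screenshot):
//   outs[0]: [2, 3, 4, 5]
//   x:       [4, 5]   -> needs axis = 2 (right-aligned)
//   y:       [3, 4]   -> needs axis = 1
// With axis = 1, x is padded to [1, 4, 5, 1], mismatching [2, 3, 4, 5];
// with axis = 2, y is padded to [1, 1, 3, 4], also mismatching.
// A single scalar axis therefore cannot align both inputs at once.
```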

Contributor

I heard from the PM that you disagree with this suggested change of mine; may I ask why? If the reasoning holds, I will merge it.

Contributor Author

@JamesLim-sy Hello! There has been a misunderstanding; I am not rejecting your suggestion. I ran into a difficulty I could not solve on my own and was asking you for further guidance.
I fully agree with your point that BroadcastKernel should not need two calls; that is how I originally wrote it, but I hit a problem while testing (see my reply above). I tried several other approaches, all of which failed, so in the end I chose to call BroadcastKernel twice for the special case.
Do you have a better solution?

Contributor
@JamesLim-sy JamesLim-sy May 4, 2023

What I mean is that Paddle's Broadcast computation supports the pattern (input_0.broadcast + input_1.broadcast + input_2.broadcast) = (output_0, output_1), so there is no need to run a separate kUnary broadcast first and then do the computation. Please test locally whether a single ternary Broadcast call can complete the computation in one pass.

Contributor Author

@JamesLim-sy Do you mean replacing one single-output Broadcast::kUnary call plus one single-output Broadcast::kTernary call with a single multi-output Broadcast::kTernary call? If so, after reading the source I found this is not feasible.
Line 49 of '/paddle/phi/kernels/funcs/broadcast_function.h' shows that in the multi-output case all outputs must have the same dims(). Line 974 of the same file shows that in the multi-output case each input's broadcast is determined by (*outs)[0]->dims() together with the int axis parameter, which is exactly the same broadcast process as the single-output Broadcast::kTernary, so the problem described above would still occur.
Also, all Required CI checks passed earlier, but paddle-ci-bot now says 'Sorry to inform you that 81c86105e84f03cbc635fc247e050a20da1d96b1's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.' Re-running the failed jobs does not succeed either; what could be causing this?

Contributor Author

@JamesLim-sy Hello, could you please take another look?

Contributor Author

@JamesLim-sy I have been waiting quite a while; please review again when you have a moment.

Contributor Author

Changes done.

Contributor Author

@JamesLim-sy Could mingshu find time to review this?

@paddle-ci-bot
paddle-ci-bot bot commented May 4, 2023

Sorry to inform you that 81c8610's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

Contributor
@JamesLim-sy JamesLim-sy left a comment

LGTM

#include "paddle/phi/kernels/funcs/broadcast_function.h"
#include "paddle/phi/kernels/funcs/common_shape.h"
#include "paddle/phi/kernels/funcs/math_function.h"

Contributor

#include "paddle/phi/kernels/empty_kernel.h"
#include "paddle/phi/kernels/funcs/broadcast_function.h"
#include "paddle/phi/kernels/funcs/common_shape.h"
#include "paddle/phi/kernels/funcs/math_function.h"

All of these headers are already pulled in by #include "paddle/phi/kernels/funcs/broadcast_function.h"; please follow up with another PR to remove the redundant includes.

Contributor Author

Done

@JamesLim-sy JamesLim-sy merged commit e592534 into PaddlePaddle:develop May 16, 2023
@WintersMontagne10335 WintersMontagne10335 deleted the winters000 branch May 17, 2023 23:58