
[PaddlePaddle Hackathon 4 No.34] Optimize Lerp OP performance on GPU for Paddle #53154

Merged: 19 commits, May 16, 2023

Commits
d6dbd12
modify lerp_kernel.cu
WintersMontagne10335 Apr 20, 2023
60e071f
pre-commit
WintersMontagne10335 Apr 21, 2023
ab512b8
fix some CI issues
WintersMontagne10335 Apr 21, 2023
1d63732
fix some CI issues
WintersMontagne10335 Apr 21, 2023
2d43b55
fix some CI issues
WintersMontagne10335 Apr 21, 2023
f791c5d
fix some CI issues
WintersMontagne10335 Apr 21, 2023
09c0042
fix some CI issues
WintersMontagne10335 Apr 24, 2023
c73530b
fix some CI issues
WintersMontagne10335 Apr 25, 2023
5d02d8a
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
WintersMontagne10335 Apr 25, 2023
8fdb5d1
Merge branch 'PaddlePaddle:develop' into winters000
WintersMontagne10335 Apr 25, 2023
1ad8a27
Merge branch 'winters000' of github.com:WintersMontagne10335/Paddle i…
WintersMontagne10335 Apr 25, 2023
ac7d1f2
fix some CI issues
WintersMontagne10335 Apr 25, 2023
fa83ab1
fix some CI issues
WintersMontagne10335 Apr 26, 2023
1bbdeb6
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
WintersMontagne10335 Apr 26, 2023
45823ac
Merge branch 'PaddlePaddle:develop' into winters000
WintersMontagne10335 Apr 26, 2023
81c8610
Merge branch 'winters000' of github.com:WintersMontagne10335/Paddle i…
WintersMontagne10335 Apr 26, 2023
4171029
Merge branch 'PaddlePaddle:develop' into winters000
WintersMontagne10335 May 5, 2023
ff52b7c
Add files via upload
WintersMontagne10335 May 6, 2023
895db15
Merge branch 'PaddlePaddle:develop' into winters000
WintersMontagne10335 May 6, 2023
109 changes: 108 additions & 1 deletion paddle/phi/kernels/gpu/lerp_kernel.cu
@@ -15,8 +15,115 @@
#include "paddle/phi/kernels/lerp_kernel.h"

#include "paddle/phi/backends/gpu/gpu_context.h"
#include "paddle/phi/common/amp_type_traits.h"
#include "paddle/phi/core/kernel_registry.h"
#include "paddle/phi/kernels/impl/lerp_kernel_impl.h"
#include "paddle/phi/kernels/empty_kernel.h"
#include "paddle/phi/kernels/funcs/broadcast_function.h"
#include "paddle/phi/kernels/funcs/common_shape.h"
#include "paddle/phi/kernels/funcs/math_function.h"

Contributor:

#include "paddle/phi/kernels/empty_kernel.h"
#include "paddle/phi/kernels/funcs/broadcast_function.h"
#include "paddle/phi/kernels/funcs/common_shape.h"
#include "paddle/phi/kernels/funcs/math_function.h"

These headers are all already pulled in by #include "paddle/phi/kernels/funcs/broadcast_function.h"; please follow up with another PR to remove them.

Contributor Author:

Done

namespace phi {

template <typename T>
struct BroadcastMinElementWiseDirectCUDAFunctor {
HOSTDEVICE inline T operator()(const T min) const { return min; }
};

template <typename T>
struct LerpElementWiseDirectCUDAFunctor {
HOSTDEVICE inline T operator()(const T x, const T y, const T weight) const {
return x + weight * (y - x);
}
};

template <typename T>
struct LerpScalarDirectCUDAFunctor {
const T *weight_;

HOSTDEVICE inline LerpScalarDirectCUDAFunctor(const T *weight)
: weight_(weight) {}

HOSTDEVICE inline T operator()(const T x, const T y) const {
return x + weight_[0] * (y - x);
}
};

template <typename T, typename Context>
void LerpKernel(const Context &ctx,
const DenseTensor &x,
const DenseTensor &y,
const DenseTensor &weight,
DenseTensor *out) {
int rank = out->dims().size();
PADDLE_ENFORCE_GE(
rank,
0,
phi::errors::InvalidArgument(
"The number of dimensions for LerpOp must be "
"greater than or equal to 0, but the value received is %d.",
rank));

ctx.template Alloc<T>(out);
std::vector<DenseTensor *> outputs = {out};

std::vector<const DenseTensor *> inputs;
if (weight.numel() == 1) {
const T *weight_ptr = weight.data<T>();
inputs.reserve(2);
inputs.emplace_back(&x);
inputs.emplace_back(&y);
auto functor = LerpScalarDirectCUDAFunctor<T>(weight_ptr);
phi::funcs::BroadcastKernel<T>(ctx, inputs, &outputs, functor);
} else {
inputs.reserve(3);
auto functor = LerpElementWiseDirectCUDAFunctor<T>();
DenseTensor b_min = phi::EmptyLike<T>(ctx, *out);
if (x.dims().size() != y.dims().size() &&
weight.dims().size() != y.dims().size()) {
std::vector<const DenseTensor *> broadcast_min_inputs;
broadcast_min_inputs.reserve(1);
std::vector<DenseTensor *> broadcast_min_outputs = {&b_min};
auto broadcast_min_functor =
BroadcastMinElementWiseDirectCUDAFunctor<T>();
if (x.dims().size() < y.dims().size() &&
x.dims().size() < weight.dims().size()) {
broadcast_min_inputs.emplace_back(&x);
phi::funcs::BroadcastKernel<T>(ctx,
broadcast_min_inputs,
&broadcast_min_outputs,
broadcast_min_functor);
inputs.emplace_back(&b_min);
inputs.emplace_back(&y);
inputs.emplace_back(&weight);
} else if (y.dims().size() < weight.dims().size()) {
broadcast_min_inputs.emplace_back(&y);
phi::funcs::BroadcastKernel<T>(ctx,
broadcast_min_inputs,
&broadcast_min_outputs,
broadcast_min_functor);
inputs.emplace_back(&x);
inputs.emplace_back(&b_min);
inputs.emplace_back(&weight);
Contributor:

My understanding of the computation logic here: the inputs are first padded to the rank of the output tensor, and then BroadcastKernel is called again. BroadcastKernel already has built-in logic to pad rank information directly; the only thing to note is setting the alignment axis appropriately, so two separate calls are not needed. For how axis is set during rank padding, refer to NumPy.
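The axis-based rank padding described above can be sketched host-side as follows. `pad_to_rank` is a hypothetical helper written for illustration (it is not Paddle API): it inserts size-1 dimensions so a lower-rank input aligns at `axis` within the output rank, which is how the axis argument is described here.

```python
import numpy as np

def pad_to_rank(t, out_rank, axis):
    # Pad t's shape with 1s so it aligns at `axis` within the output
    # rank (hypothetical helper for illustration, not Paddle's API).
    shape = [1] * axis + list(t.shape)
    shape += [1] * (out_rank - len(shape))
    return t.reshape(shape)

x = np.ones((2, 3, 4))
y = np.arange(3.0)                     # shape (3,)
y_aligned = pad_to_rank(y, 3, axis=1)  # shape (1, 3, 1)
print((x * y_aligned).shape)           # (2, 3, 4)
```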

Contributor Author:

Hello! Thank you very much for the guidance!
Do you mean that calling BroadcastKernel with a suitable axis value can do the rank alignment and the computation in one go, or that the data should first be preprocessed for rank alignment and then BroadcastKernel called for the computation?
If the former, there is a special case I cannot handle.
When using BroadcastKernel with the template parameter ET set to ElementwiseType::kTernary, if the three tensors in ins all have different ranks, then no matter what value axis takes, there is always one tensor in ins that cannot broadcast correctly, because axis is a single number.
[Screenshot: Pasted image 20230421221204]
y cannot broadcast correctly.
The ExtendInputDimensions function in paddle/phi/kernels/funcs/dims_simplifier.h shows that the tensors in ins are broadcast one by one against outs[0]. In the example above, with axis = 1, x cannot broadcast correctly; with axis = 2, y cannot broadcast correctly.
Am I calling the wrong kernel?
If it is the latter, is there any reference code for the preprocessing part?
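For intuition, NumPy-style broadcasting right-aligns trailing dimensions, so three inputs of pairwise-different ranks combine without any explicit axis. This is a minimal sketch of the semantics the kernel has to reproduce (illustrative NumPy code, not Paddle):

```python
import numpy as np

# Three lerp inputs with pairwise-different ranks. NumPy right-aligns
# trailing dimensions, so no explicit axis argument is needed.
x = np.ones((2, 3, 4))      # rank 3
y = np.full((3, 4), 2.0)    # rank 2
w = np.full((4,), 0.5)      # rank 1

out = x + w * (y - x)       # lerp; broadcasts to (2, 3, 4)
print(out.shape)            # (2, 3, 4)
print(float(out[0, 0, 0]))  # 1.5
```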

Contributor:

I heard from the PM that you disagree with my suggested change here — may I ask why? If the reasoning holds, I will go ahead and merge.

Contributor Author:

@JamesLim-sy Hello! There is a misunderstanding — I am not rejecting the suggestion. I ran into a difficulty I could not solve on my own and was asking for further guidance.
I agree with your point that BroadcastKernel should not need to be called twice; that was how I wrote it originally, but I hit problems in testing (see the reply above for details). I tried several other approaches, all of which failed, so in the end I chose to call BroadcastKernel twice for the special case.
Do you have a better solution?

Contributor (@JamesLim-sy), May 4, 2023:

What I mean is that Paddle's broadcast computation supports the pattern (input_0.broadcast + input_1.broadcast + input_2.broadcast) = (output_0, output_1); there is no need to do a separate kUnary broadcast first and then run the computation. You can test locally whether a single ternary BroadcastKernel call completes the whole computation.

Contributor Author:

@JamesLim-sy Do you mean replacing one single-output Broadcast::kUnary plus one single-output Broadcast::kTernary with a single multi-output Broadcast::kTernary? If so, after reading the source I found it is not feasible.
Line 49 of 'paddle/phi/kernels/funcs/broadcast_function.h' shows that in the multi-output case all outputs must have the same dims(). Line 974 of the same file shows that in the multi-output case the broadcast of each input is determined by (*outs)[0]->dims() together with the int axis argument, which is exactly the same broadcast process as the single-output Broadcast::kTernary, so the problem described above would also occur.
Separately, all Required CI checks had passed, but paddle-ci-bot now reports "Sorry to inform you that 81c86105e84f03cbc635fc247e050a20da1d96b1's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually." Re-running the failed parts does not succeed either — what could be the cause?

Contributor Author:

@JamesLim-sy Hello, could you please take another look?

Contributor Author:

@JamesLim-sy I have been waiting quite a while — please find time to review again.

Contributor Author:

Changes complete.

Contributor Author:

@JamesLim-sy Does mingshu have time to review this?

} else {
broadcast_min_inputs.emplace_back(&weight);
phi::funcs::BroadcastKernel<T>(ctx,
broadcast_min_inputs,
&broadcast_min_outputs,
broadcast_min_functor);
inputs.emplace_back(&x);
inputs.emplace_back(&y);
inputs.emplace_back(&b_min);
}
} else {
inputs.emplace_back(&x);
inputs.emplace_back(&y);
inputs.emplace_back(&weight);
}
phi::funcs::BroadcastKernel<T>(ctx, inputs, &outputs, functor);
}
}

} // namespace phi

PD_REGISTER_KERNEL(lerp,
GPU,
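As a host-side model of the kernel's dispatch logic above (a sketch under NumPy semantics, not the actual CUDA path): the kernel takes a fast scalar-weight branch when weight holds a single element, and otherwise runs the elementwise functor, with the CUDA code pre-broadcasting the lowest-rank input to the output shape (the b_min tensor) when all three ranks differ.

```python
import numpy as np

def lerp(x, y, weight):
    # Host-side model of LerpKernel's dispatch (illustrative only).
    if weight.size == 1:
        # Scalar-weight path: a single binary broadcast over (x, y),
        # mirroring LerpScalarDirectCUDAFunctor.
        return x + weight.reshape(()) * (y - x)
    # Elementwise path, mirroring LerpElementWiseDirectCUDAFunctor.
    # NumPy aligns ranks implicitly; the CUDA kernel instead first
    # materializes the lowest-rank input at the output shape.
    return x + weight * (y - x)

x = np.zeros((2, 3))
y = np.ones((2, 3))
print(float(lerp(x, y, np.array(0.25))[0, 0]))        # 0.25
print(float(lerp(x, y, np.full((2, 3), 0.5))[1, 2]))  # 0.5
```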