optimize elementwise_mul_grad using new interfaces #37728
Conversation
Update forked PaddlePaddle
Update my fork
update from PaddlePaddle
Update forked paddle repo
Update USERNAME/paddle
update Paddle USERNAME repo
update username repo
update local paddlepaddle
update paddlepaddle
… elem_mul_grad
Thanks for your contribution!
std::vector<int> reduce_dims = GetReduceDim(x->dims(), out->dims(), axis);
gpuStream_t stream = ctx.cuda_device_context().stream();

framework::Tensor wayto_dx;
The variable name could be improved.
Done
wayto_dx.Resize(dout->dims());
default_elementwise_mul<DeviceContext, T>(ctx, dout, y, &wayto_dx);

const framework::Tensor* const_to_dx =
This line can be merged with the one below.
Done
@@ -48,6 +49,17 @@ template <typename T>
struct MulFunctor {
  inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};

template <typename T>
struct MulFunctor<paddle::platform::complex<T>> {
This function can be removed: for the complex case you can simply construct y_conj(y.real, -y.imag) from the original y(y.real, y.imag) and pass it into the multiply.
Agreed, this function does not belong here; it conflicts with the semantics of MulFunctor. Fixed.
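A minimal sketch of the reviewer's suggestion (hypothetical helper, not the PR's final code): instead of specializing MulFunctor for complex types, construct the conjugate from y and feed it to the unspecialized multiply.

#include "paddle/fluid/platform/complex.h"

// Hypothetical helper: build y_conj(y.real, -y.imag) from y, so the plain
// MulFunctor can be reused for the complex gradient.
template <typename T>
inline paddle::platform::complex<T> MakeConj(
    const paddle::platform::complex<T>& y) {
  return paddle::platform::complex<T>(y.real, -y.imag);
}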
  inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};
template <typename T>
struct MulDxDyFunctor<paddle::platform::complex<T>> {
MulGradDY takes 4 arguments. To use it, the caller would have to be switched to ElemwiseGradCompute, whose underlying kernel is the pre-optimization one, and the semantics differ: what is needed here is an elementwise compute, not ElemwiseGradCompute. So I believe MulGradDY cannot be reused here, which is why MulDxDyFunctor is provided.
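To illustrate the mismatch described above (signatures assumed; the MulGradDY body is approximate, not quoted from the PR): the grad functor consumed by ElemwiseGradCompute takes four arguments, while the optimized elementwise kernel expects a plain binary functor.

// Four-argument grad functor, usable only through ElemwiseGradCompute
// (body approximate):
template <typename T>
struct MulGradDY {
  HOSTDEVICE T operator()(T x, T y, T out, T dout) const { return dout * x; }
};

// Binary functor expected by the optimized elementwise compute path:
template <typename T>
struct MulDxDyFunctor {
  inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};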
  inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};
template <typename T>
struct MulDxDyFunctor<paddle::platform::complex<T>> {
The name could be improved, and the functor should be extracted into the shared elementwise_functor.h file.
std::vector<int> reduce_dims = GetReduceDim(x->dims(), out->dims(), axis);
gpuStream_t stream = ctx.cuda_device_context().stream();

framework::Tensor dx_tmp;
Suggest changing the naming style.
done. Renamed to dx_origin_dims, which denotes the dx result before the reduce.
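In context, the renamed temporary reads roughly as follows (assembled from the diff excerpts above; the final reduce step is paraphrased, not quoted):

// dx_origin_dims holds dx at dout's shape, before it is reduced to x's shape.
framework::Tensor dx_origin_dims;
dx_origin_dims.Resize(dout->dims());
default_elementwise_mul<DeviceContext, T>(ctx, dout, y, &dx_origin_dims);
// ...then sum dx_origin_dims over reduce_dims (from GetReduceDim) into dx.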
template <typename DeviceContext, typename T>
typename std::enable_if<
    std::is_same<DeviceContext, platform::CPUDeviceContext>::value>::type
default_elementwise_mul_grad(const framework::ExecutionContext& ctx,
default_elementwise_mul_grad and elementwise_mul_grad contain duplicated code.
done.
@@ -114,6 +116,73 @@ __global__ void SimpleElemwiseMulGradCUDAKernel<plat::complex<double>>(
  }
}

template <typename T>
struct MulDxDyFunctor {
  inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
The parameter names a, b here are inconsistent with the x, y used below.
done
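A sketch of the functor after the renaming (assumed final form, not quoted from the PR), with the complex specialization multiplying by the conjugate as discussed above:

// Real-valued case: plain multiply, parameters named x and y for consistency.
template <typename T>
struct MulDxDyFunctor {
  inline HOSTDEVICE T operator()(const T& x, const T& y) const { return x * y; }
};

// Complex case: the gradient of a complex multiply uses the conjugate of y.
template <typename T>
struct MulDxDyFunctor<paddle::platform::complex<T>> {
  inline HOSTDEVICE paddle::platform::complex<T> operator()(
      const paddle::platform::complex<T>& x,
      const paddle::platform::complex<T>& y) const {
    paddle::platform::complex<T> y_conj(y.real, -y.imag);
    return x * y_conj;
  }
};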
… elem_mul_grad
… elem_mul_grad
… elem_mul_grad
… elem_mul_grad
… elem_mul_grad
    }
  }
}
*/
Delete the dead commented-out code.
done.
template <typename DeviceContext, typename T>
typename std::enable_if<
    std::is_same<DeviceContext, platform::CUDADeviceContext>::value>::type
default_elementwise_mul_grad(const framework::ExecutionContext& ctx,
Please update this part of the code based on Zjq9409's most recently merged PR.
done.
@@ -113,6 +114,181 @@ __global__ void SimpleElemwiseMulGradCUDAKernel<plat::complex<double>>(
}
The SimpleElemwiseMulGradCUDAKernel function can be deleted.
done
I agree with this PR; if the other reviewers also agree, it can be merged.
#include "paddle/fluid/operators/elementwise/elementwise_op_broadcast.cu.h"
#include "paddle/fluid/platform/complex.h"
#include "paddle/fluid/platform/float16.h"
The header files above can be removed from elementwise_mul_op.cu; this can be done in the next PR.
PR types
Performance optimization
PR changes
OPs
Describe
Function
This PR optimizes the backward computation of elementwise_mul using the new interfaces.
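For reference, the rule being optimized, as a self-contained illustration in plain C++ (an illustrative reference, not the PR's kernel code):

// out = x * y  =>  dx = dout * y,  dy = dout * x  (elementwise).
// When x or y was broadcast, its gradient is additionally reduce-summed back
// to the original shape, which is what GetReduceDim computes in the diff above.
void ElementwiseMulGradRef(const float* x, const float* y, const float* dout,
                           float* dx, float* dy, int n) {
  for (int i = 0; i < n; ++i) {
    dx[i] = dout[i] * y[i];
    dy[i] = dout[i] * x[i];
  }
}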
OP benchmark results
Regarding the 3 configurations whose CI-opbenchmark performance regressed relative to develop: after the initial optimization with the reduce interface, those 3 configurations were about 15% slower than dev; after adopting the multi-output code optimization, those 3 configurations now match dev, and the remaining configurations either surpass it (1.85x~12.16x) or match it. Overall, performance is no worse than dev or the competing framework.