【PTen】Add dot and matmul grad kernel in pten #38713
Conversation
Thanks for your contribution!
@@ -560,17 +586,19 @@ static void PreparedOpRunPtImpl(
   pt_kernel_context->ClearData();

   // TODO(chenweihang): add debug flags later
   // TODO(chenweihang): deal with complex cases later
   if (framework::IsComplexType(kernel_type.data_type_)) {
Could the pten kernel's data type be used here instead?
Since neither the KernelSignature nor the Kernel data structures passed in carry data_type information, we have to use the data type from kernel_type.
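The point above can be illustrated with a minimal sketch. The types below (OpKernelType, KernelSignature, IsComplexType, NeedsComplexHandling) are hypothetical, simplified stand-ins for the Paddle structures under discussion, not the real API; the sketch only shows that when the kernel signature carries no dtype, the decision must fall back to the framework-side kernel_type:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the structures discussed above.
enum class DataType { kFloat32, kComplex64, kComplex128 };

struct OpKernelType {  // stand-in for the framework-side kernel_type
  DataType data_type_;
};

struct KernelSignature {  // stand-in: note it carries no data_type field
  std::string name;
};

bool IsComplexType(DataType t) {
  return t == DataType::kComplex64 || t == DataType::kComplex128;
}

// Since the signature has no dtype, the complex-type check can only
// consult kernel_type.data_type_.
bool NeedsComplexHandling(const KernelSignature& /*sig*/,
                          const OpKernelType& kernel_type) {
  return IsComplexType(kernel_type.data_type_);
}
```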
  if (current_vector_size > start_idx) {
    pt_kernel_context_->SetOutputWithoutSetRange(start_idx, {nullptr});
  } else {
    pt_kernel_context_->EmplaceBackOutputWithoutSetRange({nullptr});
  }
  end_idx = start_idx + 1;
Please add some comments here.
Done
  } else {
    kernel_ctx->SetOutputWithoutSetRange(
        start_idx + offset,
        experimental::MakePtenTensorBaseFromVar(
            outs_vector[offset]->MutableVar(), out_def));
  }
Is this branch actually reached?
Yes, it is reached in dynamic graph (dygraph) mode.
  } else {
    if (current_vector_size > start_idx) {
      kernel_ctx->SetOutputWithoutSetRange(start_idx, {nullptr});
    } else {
      kernel_ctx->EmplaceBackOutputWithoutSetRange(
          experimental::MakePtenTensorBaseFromVar(
              outs_vector[offset]->MutableVar(), out_def));
      kernel_ctx->EmplaceBackOutputWithoutSetRange({nullptr});
    }
    kernel_ctx->AssignOutputRange(std::make_pair(start_idx, start_idx + 1),
                                  i);
Suggestion: move this logic to the beginning, check iter == outs.end(), and continue directly after handling it. That optimizes the code structure, reduces nested if/else logic, and makes the code easier to maintain and understand.
Done
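The early-continue restructuring suggested above can be sketched as follows. This is a simplified illustration with hypothetical stand-in types (plain strings and a map instead of the Paddle output containers), not the actual refactored Paddle code:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch of the suggested refactor: handle the "output not found" case
// first and `continue`, instead of burying it in an else branch. The names
// below are hypothetical stand-ins, not Paddle types.
std::vector<std::string> CollectOutputs(
    const std::vector<std::string>& names,
    const std::map<std::string, int>& outs) {
  std::vector<std::string> collected;
  for (const auto& name : names) {
    auto iter = outs.find(name);
    if (iter == outs.end()) {
      // Output absent: record a placeholder and move on immediately,
      // avoiding one level of if/else nesting in the main path below.
      collected.push_back(name + ":<null>");
      continue;
    }
    // Main path: output present.
    collected.push_back(name + ":" + std::to_string(iter->second));
  }
  return collected;
}
```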
    paddle::platform::complex<float>,
    paddle::platform::complex<double>) {}

PT_REGISTER_CTX_KERNEL(matmul_grad_grad,
Suggestion: keep the name consistent with the function, i.e. matmul_double_grad; the same goes for alias_name.
Done
LGTM
We suspect this PR caused the linear backward pass to become twice as slow. The nvprof results for linear_2 are as follows:
The new linear backward computation has one extra kernel.
Got it, I will look into it.
PR types
Others
PR changes
Others
Describe
Migrate the first-, second-, and third-order backward computation kernels of dot and matmul to pten.
To adapt the PTen backward kernels to the framework, this PR also includes the following adjustments:
- Unlike forward Ops, grad Ops carry no OpProto information, so this PR adjusts the corresponding handling logic and configures GetExpectedPtenKernelArgs for the Op of each migrated backward kernel; this solution may be replaced later.
- Some backward kernel inputs may be empty, so paddle::optional<const DenseTensor&> is used to wrap such possibly-empty input variables; support for the paddle::optional<const DenseTensor&> input type was also added to pten.
- Added the move assignment operator DenseTensor& operator=(DenseTensor&& other).
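The possibly-empty-input pattern described above can be sketched with standard-library analogues. The DenseTensor struct and SumIfPresent function below are hypothetical; std::optional<std::reference_wrapper<const DenseTensor>> is used here as a stand-in for paddle::optional<const DenseTensor&>, whose actual API may differ:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical, minimal stand-in for pten's DenseTensor.
struct DenseTensor {
  std::vector<float> data;
};

// Sketch of a kernel that tolerates a possibly-empty input, mirroring the
// paddle::optional<const DenseTensor&> pattern. std::optional of a
// reference_wrapper plays the role of an optional const reference.
float SumIfPresent(
    std::optional<std::reference_wrapper<const DenseTensor>> maybe_t) {
  if (!maybe_t.has_value()) {
    return 0.0f;  // Input absent: backward kernels must handle empty inputs.
  }
  float sum = 0.0f;
  for (float v : maybe_t->get().data) sum += v;
  return sum;
}
```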