
Transpose optimization for AlphaFold2 #45230

Merged

Conversation

@JamesLim-sy (Contributor) commented Aug 18, 2022

PR types

Performance optimization

PR changes

OPs

Describe

  • Features (see the dimension-merging sketch after the table below):
  1. Split TransposeTypeClassifier into two parts, DimsSimplifier and PermTypeClassifier, to reduce the CPU cost of the Eigen kernel;
  2. Added a dimension-merging trick to the Eigen transpose kernel;
  3. Optimized the transpose kernel and extended the optimization to the swap-transpose kernel, i.e. permuting [n, h, w] to [w, h, n].
  • Performance:
  1. With this PR, the batch_cost of paddle_helix (AlphaFold2) decreases from 4.28 s to 4.17 s, a performance gain of about 2.7%.

  2. In the transpose cases below (measured with autotune enabled):

| no. | x.shape | perm | dtype | WithSimple (us) | WithEigen (us) | WithBatchKernel (us) | Fastest impl |
|-----|---------|------|-------|-----------------|----------------|----------------------|--------------|
| 1 | [16, 40, 256] | [0, 2, 1] | float32 | 6.384 | 5.554 | 5.178 | WithBatchKernel |
| 2 | [16, 256, 40] | [0, 2, 1] | float32 | 5.791 | 5.624 | 4.969 | WithBatchKernel |
| 3 | [2, 65536, 256] | [0, 2, 1] | float32 | 203.143 | 628.098 | 197.912 | WithBatchKernel |
| 4 | [2, 19, 1024, 1024] | [0, 2, 3, 1] | float32 | 246.71 | 263.0472 | 234.3242 | WithBatchKernel |
| 5 | [2, 64, 64, 160] | [0, 3, 1, 2] | float32 | 9.38 | 20.9499 | 8.4838 | WithBatchKernel |
| 6 | [2, 65536, 32] | [0, 2, 1] | float32 | 26.15 | 61.246 | 22.7879 | WithBatchKernel |
| 7 | [8, 2, 64, 64] | [0, 2, 3, 1] | float32 | 5.28 | 5.3258 | 4.696 | WithBatchKernel |
| 8 | [2, 65536, 32] | [0, 2, 1] | float32 | 26.62 | 61.439 | 22.528 | WithBatchKernel |
| 9 | [2744, 12, 8, 64] | [0, 1, 3, 2] | float32 | 141.8301 | 112.9146 | 147.7764 | WithEigen |
| 10 | [112, 12, 197, 64] | [0, 1, 3, 2] | float32 | 103.4949 | 133.1661 | 99.3877 | WithBatchKernel |
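
The core of the DimsSimplifier change is dimension merging. Below is a minimal sketch of the idea in plain C++ (not the PR's actual code; all names are illustrative): input axes that remain adjacent and in order under the permutation are fused into one, shrinking the effective rank before any kernel runs.

```cpp
#include <cstdint>
#include <vector>

// Fuse input axes that stay adjacent and in order under the permutation.
// E.g. dims [2, 3, 4, 5] with perm [0, 1, 3, 2] -> dims [6, 4, 5], perm [0, 2, 1].
void SimplifyPermute(const std::vector<int64_t>& dims,
                     const std::vector<int>& perm,
                     std::vector<int64_t>* new_dims,
                     std::vector<int>* new_perm) {
  const int rank = static_cast<int>(perm.size());
  std::vector<int> pos_in_perm(rank);
  for (int i = 0; i < rank; ++i) pos_in_perm[perm[i]] = i;

  // Axis d starts a new run unless axis d-1 sits immediately before it in perm.
  std::vector<int> starts_run(rank, 1);
  for (int d = 1; d < rank; ++d) {
    if (pos_in_perm[d] == pos_in_perm[d - 1] + 1) starts_run[d] = 0;
  }

  // Merge each run into a single output axis, multiplying the sizes.
  std::vector<int> new_axis_of(rank);
  for (int d = 0; d < rank; ++d) {
    if (starts_run[d]) {
      new_axis_of[d] = static_cast<int>(new_dims->size());
      new_dims->push_back(dims[d]);
    } else {
      new_axis_of[d] = new_axis_of[d - 1];
      (*new_dims)[new_axis_of[d]] *= dims[d];
    }
  }

  // Rebuild the permutation over run heads, preserving their order.
  for (int i = 0; i < rank; ++i) {
    if (starts_run[perm[i]]) new_perm->push_back(new_axis_of[perm[i]]);
  }
}
```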

@paddle-bot commented Aug 18, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy JamesLim-sy changed the title Optimization for index calculation [WIP]: Optimization for index calculation Aug 18, 2022
@JamesLim-sy JamesLim-sy changed the title [WIP]: Optimization for index calculation Transpose optimization for AlphaFold2 Nov 29, 2022
} else {
int dim_idx = 0;
std::vector<int> new_dim_pos(shape.size(), -1);
std::vector<int64_t> combined_dims(shape.size(), 0);
Contributor:

If you don't need any of std::vector's advanced features, phi::Dim or phi::Array may be faster: std::vector uses heap space and has to dynamically allocate and free memory, while the latter use stack space with static allocation.
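A hedged illustration of this point in plain C++ (std::array standing in for phi's stack-based containers; kMaxRank is an assumed bound on tensor rank):

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kMaxRank = 9;  // assumed upper bound on tensor rank

// Heap version: each call pays for a dynamic allocation and deallocation.
int64_t NumelHeap(const std::vector<int64_t>& shape) {
  std::vector<int64_t> dims(shape.begin(), shape.end());  // heap allocation
  int64_t n = 1;
  for (int64_t d : dims) n *= d;
  return n;
}

// Stack version: fixed capacity, no allocator traffic, trivially destroyed.
int64_t NumelStack(const std::vector<int64_t>& shape) {
  std::array<int64_t, kMaxRank> dims{};  // lives entirely on the stack
  const int rank = static_cast<int>(shape.size());
  for (int i = 0; i < rank; ++i) dims[i] = shape[i];
  int64_t n = 1;
  for (int i = 0; i < rank; ++i) n *= dims[i];
  return n;
}
```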

inline std::vector<int> CombineTransposeDim3(const framework::DDim& shape,
                                             const std::vector<int>& perm,
                                             std::vector<int>* new_perm,
                                             framework::DDim* new_dims) {
Contributor:

This function effectively has two return values; it would be cleaner to pass both uniformly as pointer output parameters. Also, wouldn't it be better to use int64_t for dims throughout?

Contributor Author:

Originally everything was passed in as pointers; I added a return value for the dims result. I can change it back to passing everything via pointers.

Contributor Author:

Switching to int64_t would touch phi::Dim3, phi::kps::Dim3, and related types across the codebase; the scope is too large, so I'll handle it in a later PR.

// Only use tile copy GPU kernel when dimension is 2 or 3.
int dims = new_dims.size();
std::vector<int> new_dim_vec = phi::vectorize<int>(new_dims);
if (dims < 2 || dims > 3) return false;
Contributor:

Can this check be kept?

Contributor Author:

Kept.

@@ -772,9 +750,9 @@ class IdxHelper {
}
Contributor:

N -> Rank, T -> IndexT? Please unify the template parameter names across the different kernels as much as possible.

Contributor Author:

Unified them as suggested.

const int tile_tail = tile_y * ReadSize + i;
const int major_share_idx = share_tile + tile_tail;
const IndexT row_in_mat =
(blockIdx.x * kColTile + tile_tail) * col_stride;
Contributor:

These variables change on every loop iteration; there's no need to declare them const, is there?

Contributor Author:

Right, they don't all need to be const; I'll change this in the next commit.

int ReadSize,
int WriteSize = (IsVecWrite && (sizeof(T) < sizeof(float)))
? sizeof(float) / sizeof(T)
: 1>
__global__ void BatchTransposeKernel(const T* __restrict__ src_data,
Contributor:

The implementations of SwapTransposeKernel and BatchTransposeKernel differ very little; can they be merged?

Contributor Author:

I've already extracted a shared __device__ function used by both kernels; they can't be merged further, because the data stride is col in one and col * row in the other.
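A minimal sketch of what such a shared __device__ helper can look like (names and tile size are illustrative, not the PR's actual code; assumes a 32x32 thread block). The two callers differ only in the strides they pass in, e.g. col for the batched transpose versus col * row for the swap transpose:

```cuda
template <typename T, typename IndexT>
__device__ void TiledTransposeCore(const T* __restrict__ src,
                                   T* __restrict__ dst,
                                   IndexT rows, IndexT cols,
                                   IndexT src_stride, IndexT dst_stride) {
  // Padded column avoids shared-memory bank conflicts on the transposed read.
  __shared__ T tile[32][33];
  IndexT x = blockIdx.x * 32 + threadIdx.x;
  IndexT y = blockIdx.y * 32 + threadIdx.y;
  if (x < cols && y < rows) {
    tile[threadIdx.y][threadIdx.x] = src[y * src_stride + x];
  }
  __syncthreads();
  // Swap block coordinates so writes to dst are coalesced.
  x = blockIdx.y * 32 + threadIdx.x;
  y = blockIdx.x * 32 + threadIdx.y;
  if (x < rows && y < cols) {
    dst[y * dst_stride + x] = tile[threadIdx.x][threadIdx.y];
  }
}
```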


// Simplify the input dims and permute dims if possible.
template <typename T>
class TranposeTypeClassifier {
struct DimsSimplifier {
Contributor:

Rename to PermuteDimsSimplifier? Also consider whether it should live in the dims_simplifier.h file.

Contributor Author:

Changed as suggested.

explicit DimsSimplifier(const int rank,
const int64_t numel,
const std::vector<int32_t>& perm,
const std::vector<int64_t>& dims)
Contributor:

Now that src and dst have been removed, the template is no longer needed.

Contributor Author:

Changed as suggested.

kGeneralPermute = 4
kSwapTranspose = 2,
kGeneralTranspose = 3,
kVecPermute = 4,
Contributor:

Move kVecPermute right after kCopy, so that the enum values reflect the order in which kernels are selected?
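A sketch of the ordering the reviewer proposes (values are illustrative; the actual enum may differ): a larger value corresponds to a more general, later-tried kernel in the dispatch.

```cpp
enum PermuteType {
  kCopy = 0,
  kVecPermute = 1,      // moved up, directly after kCopy
  kSwapTranspose = 2,
  kGeneralTranspose = 3,
  kGeneralPermute = 4   // the most general fallback
};
```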

};

constexpr int kBlockRows = 16;
constexpr int kTileSize = 32;
constexpr int kShareCol = (kTileSize + 1);

#define GETTILESIZE(LEN, ALIGN) ((LEN + (ALIGN - 1)) & ~(ALIGN - 1)) / ALIGN
Contributor:

Separate the words with underscores.

Contributor Author:

Changed as suggested.
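For reference, the macro (shown here with the underscore-separated name the review converges on) rounds LEN up to the next multiple of ALIGN and then divides, i.e. it yields the number of ALIGN-sized tiles needed to cover LEN, assuming ALIGN is a power of two:

```cpp
#define GET_TILE_SIZE(LEN_, ALIGN_) \
  ((((LEN_) + ((ALIGN_)-1)) & ~((ALIGN_)-1)) / (ALIGN_))

// 65 elements need 3 tiles of 32: (65 + 31) & ~31 == 96, and 96 / 32 == 3.
static_assert(GET_TILE_SIZE(65, 32) == 3, "ceil-div via alignment mask");
```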

ctx,
in,
rank,
const_cast<phi::DenseTensor*>(&in),
Contributor:

There's no need to use const_cast here, is there?

Contributor Author:

It is needed: in_tensor itself is const, and to modify its dims information in_tensor has to be cast to non-const.

Contributor:

Modifying the input tensor is still not recommended. You can follow other kernels: define a temporary tensor and share the underlying storage via ShareDataWith.

Contributor Author:

I considered ShareDataWith at first, but since const_cast is resolved at compile time, I thought it might save some CPU work.
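A minimal sketch of the pattern the reviewer suggests (assuming phi::DenseTensor's ShareDataWith/Resize API; simplified_dims is a placeholder for the simplified shape): the input stays logically const and only the temporary's metadata is mutated.

```cpp
// Share the input's allocation in a temporary, then reshape the temporary.
phi::DenseTensor tmp_in;
tmp_in.ShareDataWith(in);                        // no copy: same storage
tmp_in.Resize(phi::make_ddim(simplified_dims));  // mutate dims on the copy
// The kernel now reads through tmp_in; `in` itself is never modified.
```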

@Xreki (Contributor) left a review:
LGTM

@JamesLim-sy JamesLim-sy merged commit a0f4388 into PaddlePaddle:develop Dec 5, 2022
@@ -243,5 +243,106 @@ struct BroadcastDimsSimplifier {
}
};

// Simplify the input dims and permute dims if possible.
struct DimsSimplifier {
Contributor:
DimsSimplifier -> PermuteDimsSimplifier

// Simplify the input dims and permute dims if possible.
struct DimsSimplifier {
public:
explicit DimsSimplifier(const int rank,
Contributor:

explicit is only needed when the constructor takes a single argument; in all other cases it can be dropped.

private:
int rank_{1};
int64_t count_{0};
bool is_seq_perm_{true};
Contributor:

What does is_seq_perm_ mean — that the perm is consecutive? Spell the word out in full; seq is easily misread as sequence.

@@ -652,7 +652,7 @@ struct SwapDim0And2InTranspose {
inline void CombineTransposeDim3(const DDim& shape,
const std::vector<int>& perm,
std::vector<int>* new_perm,
DDim* new_dims) {
std::vector<int>* new_dims) {
Contributor:

Change it to std::vector<int64_t>; otherwise extreme cases will fail here.
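Why int can fail here: element counts and strides of large tensors can exceed INT_MAX (2^31 - 1). A tiny illustration:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // e.g. a tensor of shape [2, 65536, 65536] has 2^33 elements.
  int64_t numel = int64_t{2} * 65536 * 65536;
  int32_t truncated = static_cast<int32_t>(numel);    // overflows: wraps to 0
  std::cout << numel << " vs " << truncated << "\n";  // 8589934592 vs 0
  return 0;
}
```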

return;
}
std::vector<int> new_dim_pos(shape.size(), -1);
std::vector<int64_t> combined_dims(shape.size(), 0);
Contributor:

The original code used DDim and std::vector<int64_t>, both of which are int64_t-based.

template <typename T>
inline void PermuteAndTranspose(const phi::GPUContext& ctx,
                                const int& rank,
Contributor:

rank doesn't need to be passed in as a parameter either; it isn't used in the function.

@@ -27,161 +27,115 @@ enum { kTransposeMKLDNNFP32 = 1, kTransposeMKLDNNINT8 = 2 };

Contributor:

This enum type definition is unused.

@@ -27,161 +27,115 @@ enum { kTransposeMKLDNNFP32 = 1, kTransposeMKLDNNINT8 = 2 };

enum PermuteType {
Contributor:

This content no longer needs its own file; it can be moved into transpose_function.cu.h. Besides, there's no real need for the .cu.h suffix; plain .h is fine.

};

constexpr int kBlockRows = 16;
constexpr int kTileSize = 32;
constexpr int kShareCol = (kTileSize + 1);

#define GETTILESIZE(LEN_, ALIGN_) \
Contributor:
GETTILESIZE -> GET_TILE_SIZE

Contributor Author:

Some of these review suggestions have been addressed in PR33051.

simplifier.GetRank(), ctx, in, out, simplifier.GetPerm());
}
}

template <typename T>
void TransposeGPUKernelDriver(const phi::GPUContext& ctx,
Contributor:

Rename the function to PermuteKernel.

@JamesLim-sy (Contributor Author):

The remaining unfinished review suggestions have been addressed in PR33051.

lxsbupt pushed a commit to lxsbupt/Paddle that referenced this pull request Dec 17, 2022
* first commit

* fix bugs according to ci

* add some changes

* change file name into function.cu.h

* remove const_cast