[HIP] Fix hipMemcpy failing to overlap; AMD GPU performance improves by more than 10% #33982
Conversation
…hip computing and communication overlap
Thanks for your contribution!
LGTM
LGTM
@@ -222,6 +223,9 @@ static inline void GetBlockDims(const platform::CUDADeviceContext& context,
   *grid_dims = dim3(grid_cols, grid_rows, 1);
 }

+int has_been_malloc_input = 0;
+int has_been_malloc_output = 0;
These two variables don't appear to be used anywhere; they can be removed.
done
feaf09b
#ifdef PADDLE_WITH_HIP
  auto* data_alloc_released = data_alloc.release();
  auto* col_alloc_released = col_alloc.release();
  context.AddStreamCallback([data_alloc_released, col_alloc_released] {
Is this because the pinned memory would otherwise be destructed, so that before the GPU side actually executes, another op on the CPU side could change its contents?
Please add a comment explaining why StreamCallback is needed here.
Yes, that's the reason; comment added.
 int* dev_ins_col_data = static_cast<int*>(tmp_dev_ins_col_data->ptr());

 ConcatKernel<<<grid_dims, block_dims, 0, context.stream()>>>(
-    dev_ins_data, dev_ins_col_data, static_cast<int>(inputs_col.size()),
+    dev_ins_data, dev_ins_col_data, static_cast<int>(inputs_col_num),
Actually, this CUDA kernel could be further optimized for small in_num:
template <typename T, int NUM>
struct ConcatArgs {
  T* inputs_data[NUM];
  T* inputs_col[NUM];
  ...
};
Instantiate the template by in_num in steps of 1, 2, 4, 8, 16, 32, 64:
ConcatArgs<T, 1> <T, 2> <T, 4> <T, 8> <T, 16> <T, 32> <T, 48> <T, 64>
Only fall back to the current copy approach when in_num exceeds 64.
The HIP runtime does not apply the 64 KB rule: in HIP, hipMemcpyAsync is asynchronous only for pinned memory; for pageable memory, hipMemcpyAsync does not execute asynchronously.
> HIP does not apply the 64 KB rule: in HIP, hipMemcpyAsync is asynchronous only for pinned memory; for pageable memory, hipMemcpyAsync does not execute asynchronously.

Right. What I posted above passes the data as kernel arguments by packing them into a struct. With 4 inputs, for example, you can pass a ConcatArgs<T, 4> as a kernel argument, and passing by argument involves no Memcpy at all.
Just a suggestion; this kind of optimization is available.
Understood, thanks.
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
The HIP and CUDA runtimes are implemented differently. Under CUDA, a host-to-device async copy of pageable memory is performed asynchronously when the block is 64 KB or less; under HIP, hipMemcpyAsync is asynchronous only for pinned memory.
See the CUDA documentation: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device
3.2.6.1. Concurrent Execution between Host and Device
Memory copies from host to device of a memory block of 64 KB or less
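To illustrate the difference described above, here is a minimal HIP sketch (illustrative only; it requires a ROCm/HIP toolchain and GPU, and is not part of the PR). Allocating the host buffer with hipHostMalloc makes it pinned, so hipMemcpyAsync can actually overlap with other stream work; with a plain malloc'd (pageable) buffer, the same call would behave synchronously under HIP.

```cpp
#include <hip/hip_runtime.h>

int main() {
  const size_t n = 1 << 20;
  float* pinned = nullptr;
  hipHostMalloc(&pinned, n * sizeof(float));  // pinned host buffer
  float* dev = nullptr;
  hipMalloc(&dev, n * sizeof(float));
  hipStream_t stream;
  hipStreamCreate(&stream);
  // Asynchronous because the source is pinned; with a pageable buffer
  // HIP would fall back to a blocking copy, preventing overlap.
  hipMemcpyAsync(dev, pinned, n * sizeof(float), hipMemcpyHostToDevice, stream);
  // ... kernels / communication on other streams can overlap here ...
  hipStreamSynchronize(stream);
  hipFree(dev);
  hipHostFree(pinned);
  hipStreamDestroy(stream);
  return 0;
}
```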
Results and benefits
1) Performance: BERT training speed improves by more than 10%.
[screenshot: before the change]
[screenshot: after the change]
2) Computation and communication now overlap.
[timeline screenshot: before the change]
[timeline screenshot: after the change]