[HIP] Fix hipMemcpy failing to overlap; AMD GPU performance improves by more than 10% #33982
Conversation
…hip computing and communication overlap
Thanks for your contribution!
LGTM
LGTM
@@ -222,6 +223,9 @@ static inline void GetBlockDims(const platform::CUDADeviceContext& context,
   *grid_dims = dim3(grid_cols, grid_rows, 1);
 }

+int has_been_malloc_input = 0;
+int has_been_malloc_output = 0;
These two variables don't appear to be used anywhere; they can be removed.
done
feaf09b
#ifdef PADDLE_WITH_HIP
  auto* data_alloc_released = data_alloc.release();
  auto* col_alloc_released = col_alloc.release();
  context.AddStreamCallback([data_alloc_released, col_alloc_released] {
Is this because the pinned memory would otherwise be destructed, so that before the GPU side actually executes, another op on the CPU side could change its contents?
Please add a comment explaining why StreamCallback is needed here.
Yes, that's the reason; comment added.
 int* dev_ins_col_data = static_cast<int*>(tmp_dev_ins_col_data->ptr());

 ConcatKernel<<<grid_dims, block_dims, 0, context.stream()>>>(
-    dev_ins_data, dev_ins_col_data, static_cast<int>(inputs_col.size()),
+    dev_ins_data, dev_ins_col_data, static_cast<int>(inputs_col_num),
Actually, this CUDA kernel could be further optimized for small in_num:
template <typename T, int NUM>
struct ConcatArgs {
  T* inputs_data[NUM];
  T* inputs_col[NUM];
  ...
};
Instantiate the template by in_num in steps of 1, 2, 4, 8, 16, 32, 64:
ConcatArgs<T, 1> <T, 2> <T, 4> <T, 8> <T, 16> <T, 32> <T, 48> <T, 64>
Only fall back to the current copy approach when in_num exceeds 64.
The HIP runtime does not apply the 64 KB rule: in HIP, hipMemcpyAsync is asynchronous only for pinned memory; for pageable memory, hipMemcpyAsync does not execute asynchronously.
> HIP does not apply the 64 KB rule: in HIP, hipMemcpyAsync is asynchronous only for pinned memory; for pageable memory, hipMemcpyAsync does not execute asynchronously.

Right. What I posted above passes the data as kernel arguments by packing them into a struct. With 4 inputs, for example, you can pass a ConcatArgs<T, 4> as a kernel argument, and passing by argument involves no Memcpy at all.
Just a suggestion; this kind of optimization is available.
Understood, thanks.
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
The HIP and CUDA runtimes are implemented differently. Under CUDA, a host-to-device async copy of pageable memory is performed asynchronously when the block is 64 KB or less; under HIP, hipMemcpyAsync is asynchronous only for pinned memory.
See the CUDA documentation: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device
3.2.6.1. Concurrent Execution between Host and Device
Memory copies from host to device of a memory block of 64 KB or less
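To illustrate the difference described above, here is a minimal HIP sketch (illustrative only; it requires a ROCm/HIP toolchain and GPU, and is not part of the PR). Allocating the host buffer with hipHostMalloc makes it pinned, so hipMemcpyAsync can actually overlap with other stream work; with a plain malloc'd (pageable) buffer, the same call would behave synchronously under HIP.

```cpp
#include <hip/hip_runtime.h>

int main() {
  const size_t n = 1 << 20;
  float* pinned = nullptr;
  hipHostMalloc(&pinned, n * sizeof(float));  // pinned host buffer
  float* dev = nullptr;
  hipMalloc(&dev, n * sizeof(float));
  hipStream_t stream;
  hipStreamCreate(&stream);
  // Asynchronous because the source is pinned; with a pageable buffer
  // HIP would fall back to a blocking copy, preventing overlap.
  hipMemcpyAsync(dev, pinned, n * sizeof(float), hipMemcpyHostToDevice, stream);
  // ... kernels / communication on other streams can overlap here ...
  hipStreamSynchronize(stream);
  hipFree(dev);
  hipHostFree(pinned);
  hipStreamDestroy(stream);
  return 0;
}
```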
Results and benefits
1) Performance: BERT training speed improves by more than 10%.
[screenshot: before the change]
[screenshot: after the change]
2) Computation and communication now overlap.
[timeline screenshot: before the change]
[timeline screenshot: after the change]