Use StridedMemCpy in Concat/Split Kernel #4188
Conversation
float* dest = b + b_offset * after * i;
cudaMemcpy(dest, src, len, cudaMemcpyDeviceToDevice);
  }
}
There is a memory::Copy interface; can it be used here directly?
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memcpy.h
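For illustration, a minimal sketch of what that substitution might look like. It assumes the (dst_place, dst, src_place, src, num[, stream]) overloads declared in paddle/memory/memcpy.h; the enclosing function, gpu_place, and stream are placeholder names, not code from the PR.

```cpp
// Sketch only: route the device-to-device copy through memory::Copy instead of
// calling cudaMemcpy directly. gpu_place and stream are placeholders; the
// overload signature is assumed from paddle/memory/memcpy.h.
#include <cuda_runtime.h>
#include "paddle/memory/memcpy.h"
#include "paddle/platform/place.h"

void CopyRowSketch(float* b, const float* src, size_t b_offset, size_t after,
                   size_t i, size_t len, paddle::platform::GPUPlace gpu_place,
                   cudaStream_t stream) {
  float* dest = b + b_offset * after * i;
  paddle::memory::Copy(gpu_place, static_cast<void*>(dest), gpu_place,
                       static_cast<const void*>(src), len, stream);
}
```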
Done.
void copy_matrix<platform::CPUPlace, float>(const float* a, size_t a_offset,
                                            float* b, size_t b_offset,
                                            size_t len, size_t before,
                                            size_t after) {
What exactly do `before` and `after` mean here? If this function is meant to be a general `Copy` over a `Matrix`, the parameter names should probably be more matrix-like, e.g. `row`, `col`, `width`, `height`.

Looking purely at the implementation of `copy_matrix`, my understanding is that the function encodes the following:

- matrix `a` has size X x `after`
- matrix `b` has size Y x `after`
- the submatrix being copied has size `before` x (`len` / sizeof(T))
- the start position in `a` is (`a_offset`, 0)
- the start position in `b` is (`b_offset`, 0)

Is this similar to the old subMatrix? caffe2's CopyMatrix.

Also, @qingqing01, I don't think this function really counts as a `math` operation; would it be better to implement it in a separate file such as `matrix.cc`? `matrix.cc` could also hold other functions that are generic over `Matrix`.
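To make the discussion above concrete, here is a sketch of the copy pattern implied by the CUDA fragment earlier in the thread. The source-pointer line is an assumption mirrored from the visible destination line; this is not the actual PR implementation.

```cpp
// Sketch of the presumed copy_matrix semantics: each of `before` iterations
// copies `len` bytes, with the source advancing by a_offset * after elements
// and the destination by b_offset * after elements per iteration.
#include <cstring>

template <typename T>
void copy_matrix_sketch(const T* a, size_t a_offset, T* b, size_t b_offset,
                        size_t len, size_t before, size_t after) {
  for (size_t i = 0; i < before; ++i) {
    const T* src = a + a_offset * after * i;  // assumed, by symmetry with dest
    T* dest = b + b_offset * after * i;       // as in the CUDA fragment above
    std::memcpy(dest, src, len);              // CPU specialization; the GPU one used cudaMemcpy
  }
}
```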
Thanks @Xreki. `copy_matrix` is similar to `CopyMatrix` in caffe2; making it a separate functor is mainly to simplify the code so that the CPU and GPU can share the same Kernel.
Following @qingqing01's suggestion, perhaps we don't need `copy_matrix` at all and can call `memory::Copy` directly inside the Kernel to do the host/device memory copy.
According to https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/, we could use cudaMemcpyAsync to copy device memory asynchronously and improve performance. However, that would require a separate GPU Kernel, because cudaMemcpyAsync needs a cudaStream, which would make the CPU and GPU Kernel code diverge. Perhaps this optimization can be done in a follow-up PR.
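A minimal illustration of the asynchronous copy the comment refers to; the buffer names and sizes are made up for the example, and this is not code from the PR.

```cpp
// cudaMemcpyAsync queues the copy on a stream and returns immediately; overlap
// with other work happens across streams (and, for host<->device transfers,
// requires pinned host memory).
#include <cuda_runtime.h>

void async_copy_example(float* dst, const float* src, size_t bytes) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // Device-to-device copy queued on `stream`.
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, stream);
  // ... other kernels or copies can be issued here and may overlap ...
  cudaStreamSynchronize(stream);  // wait for the copy before reusing the buffers
  cudaStreamDestroy(stream);
}
```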
paddle/operators/split_op.h (Outdated)
const T* src =
    in->data<T>() + input_offset + input_axis_dim * after * j;
memcpy(dest, src, len);
paddle::memory::Copy<Place, Place>(
    boost::get<Place>(ctx.GetPlace()), static_cast<void*>(dst),
Maybe we can add an interface to get Place #4203
Please see #4205, let's give a general function for
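The note above is truncated, but in line with the PR title, the general helper being discussed is presumably a strided copy shared by the Concat and Split kernels. A possible shape for such a helper, purely as an illustration (the name and signature are assumptions, not the API from #4205):

```cpp
// Illustrative only: a generic strided copy that both concat and split could
// share on the CPU path. Names and signature are assumed, not from #4205.
#include <cstddef>
#include <cstring>

template <typename T>
void StridedMemcpySketch(const T* src, size_t src_stride,  // elements between source rows
                         T* dst, size_t dst_stride,        // elements between destination rows
                         size_t rows, size_t row_elems) {  // rows to copy, elements per row
  for (size_t r = 0; r < rows; ++r) {
    std::memcpy(dst + r * dst_stride, src + r * src_stride,
                row_elems * sizeof(T));
  }
}
```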
Fixed #4166
Fixed #3929
Fixed #3772