Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StridedMemCpy in Concat/Split Kernel #4188

Merged
merged 9 commits into from
Sep 28, 2017

Conversation

Yancey1989
Copy link
Contributor

@Yancey1989 Yancey1989 commented Sep 19, 2017

  1. Add GradKernel for ConcatOP
  2. Use a general functor to copy memory in Split/Concat CPU/GPU Kernel.

Fixed #4166
Fixed #3929
Fixed #3772

float* dest = b + b_offset * after * i;
cudaMemcpy(dest, src, len, cudaMemcpyDeviceToDevice);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

void copy_matrix<platform::CPUPlace, float>(const float* a, size_t a_offset,
float* b, size_t b_offset,
size_t len, size_t before,
size_t after) {
Copy link
Contributor

@Xreki Xreki Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beforeafter分别是什么概念呢?这个函数如果要作为一个通用于Matrix的一个Copy函数,在参数命名上可能需要更Matrix化一些,比如row, col, width, height之类的。
单纯从copy_matrix的实现来看,我理解这个函数包含以下信息:

  • a矩阵size为X x after
  • b矩阵的size为Y x after
  • 要拷贝的子矩阵size为before x (len / sizeof(T))
  • a矩阵起始位置(a_offset, 0)
  • b矩阵起始位置(b_offset, 0)

是否类似原来的subMatrix ?
caffe2的CopyMatrix

另外,@qingqing01 我觉得这个函数不算是math操作,是否单独实现在一个比如matrix.cc的文件里比较好?matrix.cc也可以定义一些其他的通用于Matrix的函数。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

多谢 @Xreki , copy_matrix这个函数是和caffe2中的CopyMatrix类似的,把它作为一个单独的functor主要也是为了简化代码,使CPU和GPU能够公用同一个Kernel。

根据 @qingqing01提醒, 或许我们可以不需要这个copy_matrix,直接在Kernel里调用memory::Copy来实现内存/显存的Copy。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

根据https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/ ,可以使用cudaMemcpyAsync, 来实现异步的拷贝显存来优化性能,但这需要单独实现GPU的Kernel,这是因为cudaMemcpyAsync需要cudaStream,会造成CPU和GPU Kernel的代码不一致,不过或许可以在另外的PR来实现这个优化。

const T* src =
in->data<T>() + input_offset + input_axis_dim * after * j;
memcpy(dest, src, len);
paddle::memory::Copy<Place, Place>(
boost::get<Place>(ctx.GetPlace()), static_cast<void*>(dst),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add an interface to get Place #4203

@reyoung reyoung self-requested a review September 19, 2017 21:11
@reyoung
Copy link
Collaborator

reyoung commented Sep 19, 2017

Please see #4205, let's give a general function for concat and crop

@Yancey1989 Yancey1989 changed the title add copy_matrix functor Use general function in Concat/Split CPU/GPU Kernel Sep 20, 2017
@Yancey1989
Copy link
Contributor Author

Yancey1989 commented Sep 26, 2017

From @reyoung

Please see #4205, let's give a general function for concat and crop

Done.

@Yancey1989 Yancey1989 changed the title Use general function in Concat/Split CPU/GPU Kernel Use StridedMemCpy in Concat/Split Kernel Sep 27, 2017
@Yancey1989 Yancey1989 merged commit d7db15f into PaddlePaddle:develop Sep 28, 2017
@Yancey1989 Yancey1989 deleted the copy_matrix_functor branch September 28, 2017 03:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants