add block and grid loop for index_sample kernel to deal with a large-shape tensor #37816
Conversation
Thanks for your contribution!
Sorry to inform you that 9861e2f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
unsigned int index_i = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int index_j = blockDim.y * blockIdx.y + threadIdx.y;
for (; index_j < batch_size; index_j += blockDim.y * gridDim.y) {
  index_i = blockDim.x * blockIdx.x + threadIdx.x;
Are you sure this isn't redundant? 😂 There's really no need to recompute it, is there?
Deleted, thanks~
unsigned int index_j = blockDim.y * blockIdx.y + threadIdx.y;

for (; index_j < batch_size; index_j += blockDim.y * gridDim.y) {
  index_i = blockDim.x * blockIdx.x + threadIdx.x;
Same as above: there's no need to recompute index_i.
Deleted, thanks~
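For context, here is a minimal sketch of the resulting two-level grid-stride loop; the kernel body and the parameter names (in_data, index_data, out_data, index_length, input_length, batch_size) are assumptions for illustration and may differ from the PR's actual kernel.

// Sketch of a 2D grid-stride loop for index_sample (illustrative only).
template <typename T, typename IndexT>
__global__ void IndexSampleSketch(const IndexT* index_data, const T* in_data,
                                  T* out_data, size_t index_length,
                                  size_t input_length, size_t batch_size) {
  unsigned int index_i = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int index_j = blockDim.y * blockIdx.y + threadIdx.y;
  // Outer grid-stride loop over rows (batch dimension).
  for (; index_j < batch_size; index_j += blockDim.y * gridDim.y) {
    // Inner grid-stride loop over columns; using a fresh loop variable means
    // index_i never needs to be recomputed per row.
    for (unsigned int i = index_i; i < index_length;
         i += blockDim.x * gridDim.x) {
      unsigned int sample_pos = index_j * index_length + i;
      IndexT src_col = index_data[sample_pos];
      out_data[sample_pos] = in_data[index_j * input_length + src_col];
    }
  }
}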
@@ -153,9 +166,16 @@ class IndexSampleGradKernel<platform::CUDADeviceContext, T>
auto block_height =
    platform::RoundToPowerOfTwo(index_length * batch_size) / block_width;
dim3 block_dim(block_width, block_height);
unsigned int threads = 512;
This is duplicated code; it could be extracted into a helper:
void CheckLaunchParamValid(const framework::ExecutionContext& ctx, dim3* block_dim, dim3* grid_dim) {
unsigned int threads = 512;
block_dim->x = block_dim->x < threads ? block_dim->x : threads;
block_dim->y = block_dim->y < threads ? block_dim->y : threads;
dim3 max_grid_dim =
ctx.template device_context<platform::CUDADeviceContext>()
.GetCUDAMaxGridDimSize();
grid_dim->x = grid_dim->x < max_grid_dim.x ? grid_dim->x : max_grid_dim.x;
grid_dim->y = grid_dim->y < max_grid_dim.y ? grid_dim->y : max_grid_dim.y;
}
Then call
CheckLaunchParamValid(ctx, &block_dim, &grid_dim);
instead of writing it out twice.
Defined a MIN function to check the block dim and a LimitGridDim function to check the grid dim. Thanks~
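For reference, a hedged sketch of what such helpers might look like, reconstructed from the suggestion above; the names MIN and LimitGridDim come from this reply, but their exact definitions in the PR may differ.

// Sketch only: reconstructed from the review discussion, not the exact PR code.
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Clamp the launch grid to the device's maximum grid dimensions so oversized
// inputs are handled by the in-kernel loop instead of an invalid launch.
inline void LimitGridDim(const framework::ExecutionContext& ctx,
                         dim3* grid_dim) {
  dim3 max_grid_dim =
      ctx.template device_context<platform::CUDADeviceContext>()
          .GetCUDAMaxGridDimSize();
  grid_dim->x = MIN(grid_dim->x, max_grid_dim.x);
  grid_dim->y = MIN(grid_dim->y, max_grid_dim.y);
}

The block dims would then be clamped in place, e.g. MIN(block_dim.x, threads), before launching the kernel.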
LGTM
PR types
Bug fixes
PR changes
OPs
Describe
When the length of the input tensor is larger than block_dim * grid_dim, the index_sample kernel would not process the exceeding part. So we add a block and grid loop in the kernel.
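To illustrate how the pieces fit together, here is a hypothetical host-side fragment (not the PR's exact code): IndexSampleSketch and LimitGridDim refer to the sketches above, and block_width, block_height, stream, and the data pointers are assumed to be set up by the surrounding kernel class.

// Hypothetical launch configuration: the grid is clamped to the device limit,
// and any elements beyond block_dim * grid_dim are covered by the
// grid-stride loops inside the kernel.
dim3 block_dim(block_width, block_height);
dim3 grid_dim((index_length + block_dim.x - 1) / block_dim.x,
              (batch_size + block_dim.y - 1) / block_dim.y);
LimitGridDim(ctx, &grid_dim);
IndexSampleSketch<T, IndexT><<<grid_dim, block_dim, 0, stream>>>(
    index_data, in_data, out_data, index_length, input_length, batch_size);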