[Hackathon4 No.40] Optimize the GPU performance of the kthvalue op for Paddle #452
Merged
Commits
- 233daf8 erfinv (thunder95)
- 36f32a3 false commit (thunder95)
- 33a8a1f Merge branch 'master' of https://github.com/PaddlePaddle/community (thunder95)
- 23da46b erfinv (thunder95)
- e3dc6b8 false commit (thunder95)
- 04fe24d Merge branch 'master' of https://github.com/PaddlePaddle/community (thunder95)
- 93db7c1 merge (thunder95)
- 5deae72 maerge (thunder95)
- e7f8eba add rfc for kth_value (thunder95)
- 8f8fed7 add fp16 perf (thunder95)
@@ -0,0 +1,89 @@
# Kthvalue OP Performance Optimization Design Document

| Basic Information | Content |
|---|---|
| Author | thunder95 |
| Submission date | 2023-03-12 |
| Version | V1.0 |
| Paddle version dependency | PaddleDevelop |
| File name | 20230312_kthvalue_op_optimization.md |

# 1 Background and Motivation

The GPU implementation of the kthvalue operator in Paddle is currently built on the cub library, and its performance still has clear room for improvement.

## 1.1 Current State in Paddle

The current implementation leaves room for optimization, and several performance techniques can be applied to it. Current forward performance, measured on the PaddlePaddle develop branch, is shown below:

| Case No. | device | input_shape | input_type | k | Paddle Perf (ms) |
|---|---|---|---|---|---|
| 1 | RTX 2070s | [16L, 10000L] | float32 | 5 | 0.29134 |
| 2 | RTX 2070s | [16L, 3000L] | float32 | 1 | 0.13398 |

API documentation: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/kthvalue_cn.html

## 1.2 Survey of Existing Solutions

PyTorch implements the kthvalue operator on the GPU. Its overall forward performance (based on PyTorch v1.12) is as follows:

| Case No. | device | input_shape | input_type | k | PyTorch Perf (ms) |
|---|---|---|---|---|---|
| 1 | RTX 2070s | [16L, 10000L] | float32 | 5 | 0.08037 |
| 2 | RTX 2070s | [16L, 3000L] | float32 | 1 | 0.041758 |

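## 1.3 Comparative Analysis

Paddle and PyTorch have similar API designs, but PyTorch is faster in both test cases. The main difference is that Paddle computes the result with cub, whereas PyTorch uses a radix-based selection (RadixSelect), which greatly improves performance.
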
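The gap between the two tables can be quantified directly. The following Python sketch divides the Paddle latency by the PyTorch latency for each case, using only the figures reported in sections 1.1 and 1.2. The dictionary and function names are illustrative and not part of either framework.

```python
# Forward latencies in milliseconds, copied from the tables in 1.1 and 1.2.
paddle_ms = {1: 0.29134, 2: 0.13398}
pytorch_ms = {1: 0.08037, 2: 0.041758}


def speedup_needed(case: int) -> float:
    """Return how many times faster Paddle must become to match PyTorch."""
    return paddle_ms[case] / pytorch_ms[case]


if __name__ == "__main__":
    for case in sorted(paddle_ms):
        print(f"case {case}: PyTorch is {speedup_needed(case):.2f}x faster")
```

With these figures PyTorch is roughly 3.6x faster in case 1 and roughly 3.2x faster in case 2.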
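# 2 Design and Expected Performance

## 2.1 Key Modules and Performance Improvements

Following the topk operator, the current cub-based sort can be replaced with the RadixSearch already implemented inside Paddle. The expected performance improvement is at least 2.7x.
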
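To make the idea behind a radix-based selection concrete, here is a minimal CPU sketch in Python. It is not Paddle's RadixSearch or PyTorch's RadixSelect code; it only illustrates the general technique: map each float32 to an unsigned key that sorts in the same order, then fix the key of the k-th smallest element one digit at a time, counting candidates per digit instead of sorting. The 4-bit digit width and all function names are assumptions made for illustration.

```python
import struct

RADIX_BITS = 4                    # digit width per pass (illustrative choice)
RADIX_SIZE = 1 << RADIX_BITS
KEY_BITS = 32


def float_to_key(x: float) -> int:
    """Map a float32 to a uint32 whose unsigned order matches the float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return (~bits & 0xFFFFFFFF) if bits & 0x80000000 else (bits | 0x80000000)


def key_to_float(key: int) -> float:
    """Invert float_to_key."""
    bits = (key & 0x7FFFFFFF) if key & 0x80000000 else (~key & 0xFFFFFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def radix_select_kth(values, k):
    """Return the k-th smallest value (k is 1-based) without a full sort."""
    if not 1 <= k <= len(values):
        raise ValueError("k out of range")
    keys = [float_to_key(v) for v in values]
    prefix, prefix_mask, remaining = 0, 0, k
    # Walk the key from the most significant digit to the least significant.
    for shift in range(KEY_BITS - RADIX_BITS, -1, -RADIX_BITS):
        counts = [0] * RADIX_SIZE
        for key in keys:
            # Only keys that agree with the digits fixed so far are candidates.
            if (key & prefix_mask) == prefix:
                counts[(key >> shift) & (RADIX_SIZE - 1)] += 1
        # Find the digit bucket that contains the k-th remaining candidate.
        for digit, count in enumerate(counts):
            if remaining <= count:
                prefix |= digit << shift
                prefix_mask |= (RADIX_SIZE - 1) << shift
                break
            remaining -= count
    return key_to_float(prefix)
```

Each pass reads the data once and keeps only a small histogram, which is why this style of selection avoids the cost of sorting the whole row. On a GPU the histogram would typically be built cooperatively by the threads of a block; that part is not shown here.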
## 2.2 Host-Side Computation Flow

Transpose the input so that the target axis becomes the last dimension, then tune the grid and block configuration accordingly.

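The host-side preparation can be sketched in Python as follows. The design does not spell out the launch policy, so everything here is an assumption for illustration: the helper name `plan_launch` is hypothetical, and the policy of one block per row with the thread count rounded up to a whole warp and capped at 1024 is only one plausible choice.

```python
from math import prod


def plan_launch(shape, axis, max_threads=1024, warp_size=32):
    """Plan a kthvalue launch along `axis` (hypothetical helper, not Paddle API)."""
    ndim = len(shape)
    if not -ndim <= axis < ndim:
        raise ValueError("axis out of range")
    axis %= ndim
    # Permutation that moves `axis` to the end; identity if it is already last.
    perm = [d for d in range(ndim) if d != axis] + [axis]
    cols = shape[axis]                            # elements searched per row
    rows = prod(shape[d] for d in perm[:-1])      # number of independent searches
    # Assumed policy: one block per row, threads rounded up to a whole warp.
    warps = max(1, -(-cols // warp_size))         # ceiling division
    threads = min(max_threads, warps * warp_size)
    return {"perm": perm, "rows": rows, "cols": cols,
            "grid": rows, "block": threads}


if __name__ == "__main__":
    # The first benchmark case from section 1.1, selecting along the last axis.
    print(plan_launch([16, 10000], axis=-1))
```

When the target axis is already the last one, the permutation is the identity and the transpose can be skipped.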
## 2.4 Device-Side Computation Flow

For kthvalue, kernels are nested so that the k-th value along each slice of the target dimension is computed with a radix-sort-based method.

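The per-row structure of the device-side work can be sketched in Python, with a plain loop standing in for the GPU blocks. This is an illustrative sketch, not the kernel itself. It assumes each row already holds non-negative integer keys below 2^32 (for example, produced by an order-preserving transform of the floats), it fixes one bit per pass, and it reports the first position holding the k-th key, which is an arbitrary tie-breaking choice. All names are hypothetical.

```python
def kth_key(keys, k, key_bits=32):
    """Binary radix select: fix one bit per pass, most significant bit first."""
    prefix, mask, remaining = 0, 0, k
    for shift in range(key_bits - 1, -1, -1):
        bit = 1 << shift
        # Candidates share the bits fixed so far; count those with a 0 here.
        zeros = sum(1 for key in keys
                    if (key & mask) == prefix and not (key & bit))
        if remaining > zeros:       # the target lies among keys with a 1 bit
            remaining -= zeros
            prefix |= bit
        mask |= bit
    return prefix


def kthvalue_rows(rows, k):
    """Return (values, indices), one result per row."""
    values, indices = [], []
    for row in rows:                # on a GPU, each row would map to one block
        value = kth_key(row, k)
        values.append(value)
        indices.append(row.index(value))   # first position holding that key
    return values, indices


if __name__ == "__main__":
    print(kthvalue_rows([[5, 1, 9, 3], [7, 7, 2, 8]], k=2))
```

Returning both the value and an index mirrors the two outputs of `paddle.kthvalue`; which index is returned among equal elements is left unspecified here.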
# 3 Testing and Acceptance Considerations

Reference: [Operator performance optimization acceptance criteria](http://agroup.baidu.com/paddle-perf/md/article/4892913)

# 4 Feasibility Analysis and Schedule

Development schedule and main milestones:

| No. | Task | Expected date |
|---|---|---|
| 1 | Understand the OP design approach in Paddle and the best designs in comparable products | 2023-02-22 |
| 2 | Complete the design document | 2023-03-12 |
| 3 | Implement the kthvalue optimization | 2023-03-31 |
| 4 | Complete code development and pass the CI tests | 2023-04-15 |

# 5 Impact

The operator being optimized runs independently. No other operators or modules need to change, and the API design stays the same as before.

# Glossary

# Appendix and References

[1]. [OP Benchmark User Guide](https://github.com/PaddlePaddle/benchmark/blob/master/api/README.md)

Please add performance data for FP16.
@JamesLim-sy This operator does not support float16 yet, so running the float16 benchmark raises an error:
RuntimeError: (NotFound) The kernel with key (GPU, Undefined(AnyLayout), float16) of kernel `kthvalue` is not registered and fail to fallback to CPU one. Selected wrong DataType `float16`. Paddle support following DataTypes: float32, int64, float64, int32. [Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /paddle/paddle/phi/core/kernel_factory.cc:219)
You could try adding a type registration for `phi::dtype::float16` at https://github.com/PaddlePaddle/Paddle/blob/6bd5b7cec0ae2dc86e5683f89880bf22528a311e/paddle/phi/kernels/gpu/kthvalue_kernel.cu#L259-L266 and then test FP16 performance.
@JamesLim-sy FP16 performance tests have been added.