Add multiclass_nms3 GPU kernel #52401
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Force-pushed from 8ffd9d3 to 78255fa.
Sorry to inform you that the CI runs for 2c8891f passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.
Force-pushed from 2c8891f to 82779c7.
Sorry to inform you that the CI runs for 82779c7 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.
Could you share some test cases with performance numbers?
nmsedBoxes[i * 4 + 3] = clipBoxes ? saturate(yMax) : yMax;
nmsedIndices[i] = bboxId >> 2;
nmsedValidMask[i] = 1;
atomicAdd(&numDetections[i / keepTopK], 1);
TODO: we may also need a deterministic version.
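The atomicAdd is performed on an integer, so there is no determinism issue: integer addition is associative and commutative, so the final count does not depend on the order in which threads execute.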
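The point about integer accumulation can be checked on the host. The following is a minimal sketch, not code from this PR: `AccumulateInOrder` and `SumIsOrderIndependent` are hypothetical helpers, and each permutation of the inputs stands in for one possible interleaving of concurrent `atomicAdd` calls on a single counter such as `numDetections`.

```cpp
#include <algorithm>
#include <vector>

// Adds the values one by one in the order given. Each ordering models one
// possible interleaving of concurrent atomic additions to a single counter.
template <typename T>
T AccumulateInOrder(const std::vector<T>& values) {
  T total = T(0);
  for (const T& v : values) {
    total = total + v;
  }
  return total;
}

// Returns true if every permutation of `values` sums to the same total.
template <typename T>
bool SumIsOrderIndependent(std::vector<T> values) {
  std::sort(values.begin(), values.end());
  const T reference = AccumulateInOrder(values);
  while (std::next_permutation(values.begin(), values.end())) {
    if (AccumulateInOrder(values) != reference) {
      return false;
    }
  }
  return true;
}
```

Under these assumptions, `SumIsOrderIndependent<int>({1, 2, 3, 4})` holds, while a float input such as `{1e8f, 1.0f, -1e8f}` does not, because rounding makes float addition order-dependent. That is why only floating-point atomics raise a determinism concern.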
We benchmarked PP-YOLOE+ evaluation with the ppyoloe_plus_crn_l_80e_coco.yml config.
Setting: A100-PCIE-80GB; batch_size = 32; evaluation size = 640 x 640.
Problem size of the NMS op: bbox shape [32, 8400, 4]; scores shape [32, 80, 8400].
Other parameters:
- 'nms_top_k': 1000,
- 'keep_top_k': 300,
- 'score_threshold': 0.01,
- 'nms_threshold': 0.7
Benchmark result:
NMS op time: 2295 ms (CPU) -> 0.267 ms (GPU); speedup: 8595.5x
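For readers unfamiliar with the four parameters listed above, here is a minimal single-image CPU sketch of greedy multi-class NMS. It is an illustration under stated assumptions, not the kernel from this PR: `Box`, `Detection`, `IoU`, `NmsOneClass`, and `MultiClassNms` are hypothetical names, boxes are assumed to be `(x1, y1, x2, y2)` corners, and the real operator has further attributes that this sketch ignores.

```cpp
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2; };
struct Detection { int class_id; int box_id; float score; };

// Intersection-over-union of two axis-aligned boxes.
inline float IoU(const Box& a, const Box& b) {
  const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  const float inter = iw * ih;
  const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                    (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS for one class. Returns kept box indices, best score first.
inline std::vector<int> NmsOneClass(const std::vector<Box>& boxes,
                                    const std::vector<float>& scores,
                                    float score_threshold, int nms_top_k,
                                    float nms_threshold) {
  std::vector<int> order;
  for (int i = 0; i < static_cast<int>(scores.size()); ++i) {
    if (scores[i] > score_threshold) order.push_back(i);  // score_threshold
  }
  std::stable_sort(order.begin(), order.end(),
                   [&](int a, int b) { return scores[a] > scores[b]; });
  if (nms_top_k >= 0 && static_cast<int>(order.size()) > nms_top_k) {
    order.resize(nms_top_k);  // nms_top_k: candidates entering suppression
  }
  std::vector<int> kept;
  for (int idx : order) {
    bool suppressed = false;
    for (int k : kept) {
      if (IoU(boxes[idx], boxes[k]) > nms_threshold) {  // nms_threshold
        suppressed = true;
        break;
      }
    }
    if (!suppressed) kept.push_back(idx);
  }
  return kept;
}

// Runs per-class NMS, then keeps the keep_top_k best detections overall.
// scores[c][i] is the score of box i for class c.
inline std::vector<Detection> MultiClassNms(
    const std::vector<Box>& boxes,
    const std::vector<std::vector<float>>& scores, float score_threshold,
    int nms_top_k, float nms_threshold, int keep_top_k) {
  std::vector<Detection> dets;
  for (int c = 0; c < static_cast<int>(scores.size()); ++c) {
    for (int idx : NmsOneClass(boxes, scores[c], score_threshold, nms_top_k,
                               nms_threshold)) {
      dets.push_back({c, idx, scores[c][idx]});
    }
  }
  std::stable_sort(dets.begin(), dets.end(),
                   [](const Detection& a, const Detection& b) {
                     return a.score > b.score;
                   });
  if (keep_top_k >= 0 && static_cast<int>(dets.size()) > keep_top_k) {
    dets.resize(keep_top_k);  // keep_top_k: final detections per image
  }
  return dets;
}
```

With the benchmark shapes, such a routine would run once per image over 80 classes and 8400 boxes, which gives a sense of why the sequential CPU path is slow compared with a parallel GPU kernel.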
> The atomicAdd is performed on integer, so there is no determinism issue.

Sure, got it.
> We benchmarked in PP-YOLOE+ evaluation, with ppyoloe_plus_crn_l_80e_coco.yml config.
> Setting: A100-PCIE-80GB; batch_size=32; evaluate size = 640 x 640.
> Problem size of NMS OP: shape of bbox: [32, 8400, 4]; shape of scores: [32, 80, 8400]
> Other parameters:
> - 'nms_top_k': 1000,
> - 'keep_top_k': 300,
> - 'score_threshold': 0.01,
> - 'nms_threshold': 0.7
>
> Benchmark result: NMS OP time: 2295 ms (CPU) -> 0.267 ms (GPU); speedup: 8595.5x

Could you put this into the PR description?
Done.
lgtm
@@ -614,6 +614,13 @@ class MultiClassNMS3Op : public MultiClassNMS2Op {
const framework::VariableNameMap& outputs,
const framework::AttributeMap& attrs)
: MultiClassNMS2Op(type, inputs, outputs, attrs) {}

protected:
phi::KernelKey GetExpectedKernelType(
Under the phi framework, for how to specify the data type used for kernel selection, see https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/legacy_ops.yaml#L129
This is because
platform::CPUPlace());
The kernel under the phi directory is multiclass_nms3. Overriding GetExpectedKernelType for multiclass_nms3 here likewise serves to specify which input's data type is used to select the kernel.
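The idea of selecting a kernel from one designated input's data type can be sketched with a self-contained mock. None of the types below are Paddle's: `DataType`, `Backend`, `KernelKey`, `InputDTypes`, and `ExpectedKernelKey` are hypothetical stand-ins, and which input drives the choice is left as a caller-supplied assumption.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; these are not Paddle types.
enum class DataType { kFloat32, kFloat64, kInt32 };
enum class Backend { kCPU, kGPU };

struct KernelKey {
  Backend backend;
  DataType dtype;
};

// Data types of an operator's inputs, keyed by input name.
using InputDTypes = std::map<std::string, DataType>;

// Builds the kernel key from the data type of one designated input, so that
// other inputs with different types (for example an integer count tensor)
// do not influence which kernel is chosen.
inline KernelKey ExpectedKernelKey(const InputDTypes& inputs,
                                   const std::string& key_input,
                                   Backend backend) {
  const auto it = inputs.find(key_input);
  if (it == inputs.end()) {
    throw std::invalid_argument("missing input: " + key_input);
  }
  return KernelKey{backend, it->second};
}
```

For instance, with inputs named `bboxes`, `scores`, and `rois_num` (assumed names), passing `"scores"` as `key_input` would select the kernel by the score tensor's type even when `rois_num` is an integer tensor.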
index->Resize({valid_samples, 1});
ctx.template Alloc<int>(index);
phi::funcs::GPUGatherNd<int, int64_t>(
ctx, nmsed_indices, valid_indices, index);
This function is 197 lines long, which hurts readability. Please consider factoring it further.
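Most of it is actually input preparation and output post-processing. It looks long because there are many parameters, and I don't think it can be factored much further. I added some comments; please check whether that is acceptable.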
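Sorry, after repeated discussion, your PR does not yet meet the merge criteria. Please read the PaddlePaddle native operator development specification. You may submit a new PR; we are closing this one for now. Thank you for your contribution.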
lgtm
LGTM
LGTM for @unittest.skipIf
LGTM
@XiaoguangHu01 Would you please review this PR?
LGTM
@Xreki I think we are OK to merge here.
* Add GPU kernel for multiclass_nms3 op
* Make multiclass_nms3 gpu kernel output consistent with cpu kernel
* Fix API incompatibility
* Fix unittests on builds without CUDA
* Fix ROCM build
* Remove fluid headers; Use default atol for unittest
* Change function and variable naming
* Add comments; Reduce redundant code
* Use paddle test framework
PR types
New features
PR changes
OPs
Description
In this PR, we add a GPU kernel for multiclass_nms3 op, which could greatly speed up model evaluation for detection models.
We benchmarked PP-YOLOE+ evaluation with the ppyoloe_plus_crn_l_80e_coco.yml config.
Setting: A100-PCIE-80GB; batch_size = 32; evaluation size = 640 x 640.
Problem size of the NMS op: bbox shape [32, 8400, 4]; scores shape [32, 80, 8400].
Other parameters:
- 'nms_top_k': 1000,
- 'keep_top_k': 300,
- 'score_threshold': 0.01,
- 'nms_threshold': 0.7
Benchmark result:
NMS op time: 2295 ms (CPU) -> 0.267 ms (GPU); speedup: 8595.5x