-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TOPI, Relay] A new NMS op variant for ONNX NMS / TF Combined NMS #7796
Conversation
da3eaf9
to
b16f457
Compare
Exciting! I won't review too much because it's still in draft, but the overall structure looks good to me. I agree that that relay-level loop was a problem since we couldn't parallelize it. I'm seeing some removed attributes from the normal NMS op, I didn't add those as part of the ONNX NMS PR, I imagine they're used by other frontends? |
awesome. it would be great to also discuss the naming of the API(by checking with the naming in the existing libraries e.g. TF, Pytorch). e.g. shall we call it |
@mbrookhart These attributes are now runtime arguments and they are not used anymore, see tvm/src/relay/op/vision/nms.cc Lines 106 to 113 in 8131364
|
Moreover, in TF the other NMS op is the standard, single class NMS, so for them combined or not is really single vs multiclass distinction. In contrast, our existing NMS also does multiclass NMS where a single box is associated with a single class. This variant of multiclass NMS has a different notion of "multiclass" and can support PyTorch and MXNet. So I don't think
|
b04d4dc
to
af5f821
Compare
Thanks @masahi , can you do a quick round of search of pytorch, onnx, tensorrt and others and summarize the naming rationale? The main thing is to write down why we pick the name, I am less attached to a particular name. |
ok I see roughly three categories of NMS used in frameworks:
So the bottom line is, I think it is reasonable to say that NMS, without adjectives, should refer to the single class variant (category 2 above), but there isn't a consensus on what to call the third category one which this PR is about. I think |
I think that the name |
By |
I think that sounds good. |
af5f821
to
2361321
Compare
f71e619
to
6d314de
Compare
Really nice! Just did first pass through the code and overall it looks good. A couple of thoughts/questions:
|
Thanks @trevor-m
|
@mbrookhart @jwfromm @trevor-m @Laurawly ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I have two minor nitpicks:
- The ONNX op has a center_point_box attribute to switch between TF-style boxes and Pytorch-style boxes. I never implemented the pytorch style, but I don't think we need to change this kernel, we should be able to transform them in the future.
- It seems like we are now using exactly the same NMS tests for:
The ONNX Node tests I'm automatically importing
The manual ONNX importer tests
The relay tests
The topi tests
This feels...redundant. Do we need the same test at every level?
Yes it is redundant, but it's simply too tedious to come up with good test cases for this op. Do you have any suggestion on how to go about this? Ideally I want to test on real workload, a small artificial test case like that doesn't really exercise all the tricky aspects in our NMS kernel. |
Honestly, I'd be okay with just dropping the topi tests, they seem especially redundant after relay. Perhaps there's a TF test or two we could include with the TF importer? I'm planning on dropping the onnx tests once I get the imported node tests working on more backends. Again, this a nitpick, I don't think it needs to be in this PR, I just think we generally have too much redundancy in testing and not enough coverage |
Yes, the way it always works is we first implement topi op with tests, and then relay op. So topi tests always come first. Two tests are indeed redundant and ideally we should have difference test cases. But I'd say it is also weird to delete topi tests on purpose after we have Relay tests. It doesn't hurt to leave topi tests anyway. |
Thanks @masahi |
…ache#7796) * initial import * add c++ boilarplate * add python boilarpolate * update onnx frontend * fixing * update onnx frontend * fix shape * minor update * fix * fix shape func * fix for no box * more fix * made things 64 bit * int64 tweak * max_output_size doesn't need to be a callback * remove all_class_nms schedule * minor simplify * remove expand_dim * refactoring * simplify nms loop * cpu all_class_nms stub * updating ir for cpu * working with cpu * update cpu strategy, relay op also working * fix cpplint * fixing pylint * enable gpu test for onnx nms * tweak parallel * pyformat and lint * fix relay nms test * doc update for cpp relay * updating tests * updated tests * fix converting score_threshold to Expr * update doc * doc fix Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
…ache#7796) * initial import * add c++ boilarplate * add python boilarpolate * update onnx frontend * fixing * update onnx frontend * fix shape * minor update * fix * fix shape func * fix for no box * more fix * made things 64 bit * int64 tweak * max_output_size doesn't need to be a callback * remove all_class_nms schedule * minor simplify * remove expand_dim * refactoring * simplify nms loop * cpu all_class_nms stub * updating ir for cpu * working with cpu * update cpu strategy, relay op also working * fix cpplint * fixing pylint * enable gpu test for onnx nms * tweak parallel * pyformat and lint * fix relay nms test * doc update for cpp relay * updating tests * updated tests * fix converting score_threshold to Expr * update doc * doc fix Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
…ache#7796) * initial import * add c++ boilarplate * add python boilarpolate * update onnx frontend * fixing * update onnx frontend * fix shape * minor update * fix * fix shape func * fix for no box * more fix * made things 64 bit * int64 tweak * max_output_size doesn't need to be a callback * remove all_class_nms schedule * minor simplify * remove expand_dim * refactoring * simplify nms loop * cpu all_class_nms stub * updating ir for cpu * working with cpu * update cpu strategy, relay op also working * fix cpplint * fixing pylint * enable gpu test for onnx nms * tweak parallel * pyformat and lint * fix relay nms test * doc update for cpp relay * updating tests * updated tests * fix converting score_threshold to Expr * update doc * doc fix Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
…ache#7796) * initial import * add c++ boilarplate * add python boilarpolate * update onnx frontend * fixing * update onnx frontend * fix shape * minor update * fix * fix shape func * fix for no box * more fix * made things 64 bit * int64 tweak * max_output_size doesn't need to be a callback * remove all_class_nms schedule * minor simplify * remove expand_dim * refactoring * simplify nms loop * cpu all_class_nms stub * updating ir for cpu * working with cpu * update cpu strategy, relay op also working * fix cpplint * fixing pylint * enable gpu test for onnx nms * tweak parallel * pyformat and lint * fix relay nms test * doc update for cpp relay * updating tests * updated tests * fix converting score_threshold to Expr * update doc * doc fix Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
…ache#7796) * initial import * add c++ boilarplate * add python boilarpolate * update onnx frontend * fixing * update onnx frontend * fix shape * minor update * fix * fix shape func * fix for no box * more fix * made things 64 bit * int64 tweak * max_output_size doesn't need to be a callback * remove all_class_nms schedule * minor simplify * remove expand_dim * refactoring * simplify nms loop * cpu all_class_nms stub * updating ir for cpu * working with cpu * update cpu strategy, relay op also working * fix cpplint * fixing pylint * enable gpu test for onnx nms * tweak parallel * pyformat and lint * fix relay nms test * doc update for cpp relay * updating tests * updated tests * fix converting score_threshold to Expr * update doc * doc fix Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
This PR adds a new variant of NMS that better supports NMS in ONNX and combined NMS in TF than the existing NMS operator in TOPI/Relay. For now, I'm calling this variant of NMS "All class NMS".
https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonMaxSuppression
https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression
The biggest difference between our NMS and "All class NMS" is that in our NMS, a single box is associated with a single class, while in the latter case a single box has scores for all classes, and NMS is performed for each class separately.
max_out_size
parameter is also applied per class, rather than to all boxes as in our implementation.Until now, we've been supporting this variant of NMS via complicated encodings using our implementation of NMS. It kind of "works" in practice, but there are many problems to it:
max_out_size
handling. Since in our NMSmax_out_size
is applied to all boxes, we cannot translate "All class NMS" into a single call to our NMS.For these reasons, I decided it is better to introduce a new variant of NMS to overcome these pains. This breaks our general "one operator to support all frameworks" philosophy, but the two variants of NMS are so different it doesn't make sense to call them the same op.
The result is significant: using the new NMS, I got speedup of 1 second on mlperf SSD resnet34, running on vk + amd. This is an extreme case in that the existing approach that calls
get_valid_counts
andnon_maximum_surpression
80 times in a while loop is extremely slow on vk + amd for some reason, taking literally 1 second. Now it is only 5.7 milli second.Implementation details
The new NMS implementation consists of the following steps:
score_threshold
. This gives what we callvalid_count[i]
in the existing NMS, computed byget_valid_counts
.(batch * num_class, num_boxes)
and a tensornum_detections
of size(batch * num_class,)
holding the number of survived boxes per row. We need to copynum_detections[i]
indices from each row into a one linear output. This is a perfect application of exclusive scan: Doing ex scan onnum_detections
gives row offsets to write into for each row.The efficiency of the new implementation comes from:
get_valid_count
call is replaced with binary searchnum_class
x speedup over the existing encoding in ONNX / TF frontend.