
TensorrtExecutionProvider slower than CUDAExecutionProvider: Faster-rcnn [Performance] #17434

Open
datinje opened this issue Sep 6, 2023 · 113 comments
Labels: ep:CUDA (issues related to the CUDA execution provider), ep:TensorRT (issues related to the TensorRT execution provider)

Comments

datinje commented Sep 6, 2023

Describe the issue

On my Faster-RCNN-RPN models doing detection of patterns, after considerable effort to infer with the TensorRT EP (see #16886 — I simplified the model and inferred the shapes of the model nodes before submitting it to TRT), I found that the TRT EP is about 30% slower than the CUDA EP in FP32 (and in TF32); only with FP16 does the TRT EP almost catch up.

I am only reporting the second inference here, not the warm-up one (which is considerably slower, which is normal).

After looking at the VERBOSE mode logs, I found out that not all the nodes are running on TRT: one is still on the CPU and five are on the CUDA EP. That causes many memory transfers between host and GPU, which I suppose is the reason. So my question is: why are there still nodes on the CPU and CUDA EPs? Can this be fixed?

Here are the logs :
2023-09-06 16:45:59.604024060 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] Node placements
2023-09-06 16:45:59.604038849 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
2023-09-06 16:45:59.604042765 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_0 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_0_0)
2023-09-06 16:45:59.604046398 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_1 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_1_1)
2023-09-06 16:45:59.604049385 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_2 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_2_2)
2023-09-06 16:45:59.604052381 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_3 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_3_3)
2023-09-06 16:45:59.604055213 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_4 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_4_4)
2023-09-06 16:45:59.604057978 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_5 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_5_5)
2023-09-06 16:45:59.604060720 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_6 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_6_6)
2023-09-06 16:45:59.604063521 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyFromHost (Memcpy)
2023-09-06 16:45:59.604066111 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_422)
2023-09-06 16:45:59.604068754 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_423)
2023-09-06 16:45:59.604078119 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_424)
2023-09-06 16:45:59.604081367 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1
2023-09-06 16:45:59.604086459 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] RoiAlign (/model/roi_heads/box_pooler/level_poolers.0/RoiAlign)
2023-09-06 16:45:59.604093948 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 5
2023-09-06 16:45:59.604099017 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/proposal_generator/NonZero)
2023-09-06 16:45:59.604103942 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_497)
2023-09-06 16:45:59.604108777 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/roi_heads/NonZero)
2023-09-06 16:45:59.604113159 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_796)
2023-09-06 16:45:59.604117903 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/NonZero)

I get the same issue with both the C++ and Python runtime APIs.
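For reference, here is a minimal Python sketch of how the two EPs were compared (assumptions: the model file name and the input name/shape are placeholders; the real model takes a dynamic-size image):

    # Time the second inference only; the first run includes engine build / autotuning.
    import time
    import numpy as np
    import onnxruntime as ort

    def second_run_latency(providers, model_path="model-inferred.onnx"):
        sess = ort.InferenceSession(model_path, providers=providers)
        feed = {"image": np.zeros((3, 1024, 2048), dtype=np.float32)}  # hypothetical input name/shape
        sess.run(None, feed)                  # warm-up (considerably slower, expected)
        t0 = time.perf_counter()
        sess.run(None, feed)                  # second inference, the one compared above
        return time.perf_counter() - t0

    trt = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    cuda = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    print("TRT EP :", second_run_latency(trt))
    print("CUDA EP:", second_run_latency(cuda))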

To reproduce

I can't share my model for IP reasons, but I see similar issues with the public Detectron2 model zoo faster-rcnn-rpn (see #16886 for how to run it) — with that model even more nodes fall back to CPU and CUDA, among them the nodes listed above. So fixes found by investigating that one will likely apply here too.

Urgency

I have been blocked for several months trying to run the model on the TRT EP (see #16886 — thanks to the ORT staff who helped me), only to find out that it may not be worth it. It looks like I am not far off — only 3 operators/nodes left to move onto the TRT EP — but time is up: in a couple of months I will need to freeze the model to certify the results, with no second chance to certify with TRT FP16 or, better, INT8. I am expecting a 2x performance improvement with TRT FP16 and another 2x with INT8 (accuracy is still excellent in FP16).

Platform

Linux

OS Version

SLES15 SP4

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15.1+ (using latest main for a fix needed to build the TRT EP)

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 8.6.1

Model File

I can't share it, but you could use faster-rcnn-rpn from the Detectron2 model zoo (see #16886).

Is this a quantized model?

No

github-actions bot added the ep:CUDA and ep:TensorRT labels Sep 6, 2023
chilo-ms (Contributor) commented Sep 6, 2023

The onnx-tensorrt parser filters out NonMaxSuppression, NonZero, and RoiAlign, so that's why you see those nodes placed on the CUDA/CPU EP. I also think that the many memcpys between CPU and GPU cause this long-latency issue.
This is a general issue for RCNN models, meaning you will see many TRT/CUDA/CPU partitions.
We will discuss with Nvidia whether it's possible/beneficial to incorporate those nodes (I doubt it), or whether we can find other ways to improve the performance.

chilo-ms (Contributor) commented Sep 6, 2023

TensorRT provides an nmsPlugin and a roiAlignPlugin. We could probably replace the ONNX NonMaxSuppression and RoiAlign nodes with those two TRT plugins and measure the latency.

skottmckay (Contributor) commented:
Typically the nodes from NonMaxSuppression onwards are selecting the best bounding boxes. These are relatively cheap operations for which it's more efficient to stay on the CPU than to go back to the GPU. In the NNAPI EP we have an option to set an operator after which NNAPI is not used, and we do that for NonMaxSuppression. Maybe something similar would also work for TRT/CUDA for this type of model.

datinje (Author) commented Sep 7, 2023

onnx-trt parser filters out NonMaxSuppression, NonZero, and RoiAlign, so that's why you saw those nodes are placed on CUDA/CPU EP. i also think that many memcpy between CPU/GPU causes this long latency issue. It's the general issue for rcnn models meaning you will see many TRT/CUDA/CPU partitions. We will discuss with Nvidia to see whether it's possible/beneficial to incorporate those nodes (i doubt this) or we can find other ways to improve the performance.

So, even though, according to @skottmckay, these 3 operators are cheaper on CPU, can we try to keep them on the GPU to avoid the overhead of moving the data between CPU and GPU (in my case, images of 13 MB)? Is that the goal/capability of the nmsPlugin and roiAlignPlugin? I am ready to try. Any example of how to do that? Shall I modify the model code or the resulting ONNX, or is it a mere declaration in the onnxruntime TensorRT EP configuration? And what about the third operator, NonZero, for which I could not find a plugin — is there any possibility to keep it on the GPU to avoid the memory transfers caused by the extra subgraph split?

datinje (Author) commented Sep 7, 2023

If I want to test the performance I get by not filtering out these operators, by commenting out the lines at https://github.com/onnx/onnx-tensorrt/blob/main/ModelImporter.cpp#L377, where shall I modify the ModelImporter.cpp file before recompiling onnxruntime?

I am recompiling onnxruntime with NVIDIA GPU and TensorRT EP support in my Docker image with:
RUN git clone https://github.com/microsoft/onnxruntime
WORKDIR /tmp/onnxruntime
RUN CC=gcc-11 CXX=g++-11 ./build.sh --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root
(I am using latest main because a fix needed to build the TensorRT EP landed after ORT 1.15.1)

datinje (Author) commented Sep 7, 2023

What if I compile onnxruntime with --use_tensorrt_builtin_parser: will the nodes be filtered out?

datinje (Author) commented Sep 7, 2023

No change if I recompile onnxruntime with --use_tensorrt_builtin_parser: the nodes are still placed on the CPU.

chilo-ms (Contributor) commented Sep 7, 2023

Here are the steps to build OSS onnx-tensorrt parser with not filtering out those operators:

  1. add --use_tensorrt_oss_parser as one of the ORT build arguments and start building.
  2. At the early stage of the ORT build, you will find the onnx-tensorrt repo downloaded to ./build/Linux/Debug/_deps/onnx_tensorrt-src; simply comment out those node-filtering lines in ModelImporter.cpp.
  3. Resume build.
    Note: you might encounter build error of CUDA_INCLUDE_DIR not found. Modify here to
    set(CUDA_INCLUDE_DIR ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})

I tested the onnx-tensorrt parser without the filtering, using Faster R-CNN from the ONNX model zoo: it does include those nodes for TRT, but it failed to build the TRT engine. I need to investigate further, but you can try your faster-rcnn model.

Update: Checked with Nvidia — those nodes only work with the TRT API enqueueV3, and TRT EP currently uses enqueueV2, so the enqueue error is expected. As for the engine build error that I saw, I will follow up with Nvidia. TensorRT EP is planning to move to the latest TRT APIs, but it's going to take some time.

chilo-ms (Contributor) commented Sep 7, 2023

I think we can try the TRT plugins; please see the doc here. You need to modify the graph and replace the RoiAlign and NonMaxSuppression nodes with custom ops that will later map to the TRT plugins (remember to set the name and domain of the custom node correctly). Unfortunately, there is no NonZero plugin for now.
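For illustration, a minimal Python sketch of the kind of graph surgery meant here (assumptions: the plugin name "EfficientNMS_TRT" and the custom-op domain "trt.plugins" are taken from the TRT plugin registry and the ORT docs respectively and may differ; the plugin also expects different attributes/inputs than the ONNX op, which a real conversion would have to adapt):

    # Rename ONNX NonMaxSuppression nodes so the TRT EP can map them to a TRT plugin.
    import onnx

    model = onnx.load("faster-rcnn-inferred.onnx")
    for node in model.graph.node:
        if node.op_type == "NonMaxSuppression":
            node.op_type = "EfficientNMS_TRT"   # assumed plugin name
            node.domain = "trt.plugins"         # assumed custom-op domain

    # Register the custom domain so the runtime accepts the node.
    model.opset_import.append(onnx.helper.make_opsetid("trt.plugins", 1))
    onnx.save(model, "faster-rcnn-trt-plugins.onnx")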

datinje (Author) commented Sep 8, 2023

Thanks a lot @chilo-ms: I will try to integrate the 2 plugins in my model to test the performance improvement, hoping that the onnxruntime TRT EP moves to the TRT enqueueV3 API asap.
Expect some time before my next post as I am OOO next week.

datinje (Author) commented Sep 25, 2023

After discussing with NVIDIA how to integrate the plugins, we found out that NMS and NonZero ARE implemented in TensorRT, cf.

For RoiAlign, the only way is via the TRT plugin, but is there a way to have the TRT EP call the native TRT implementation to avoid the data transfer between CPU and GPU?

datinje (Author) commented Sep 25, 2023

In 1.16.0 there is a new session option, disable_cpu_ep_fallback. How can we set it? And will it prevent NonZero and NMS from falling back to the CPU EP?

chilo-ms (Contributor) commented Sep 29, 2023

@datinje
Last time we checked with Nvidia, they mentioned NMS and NonZero are natively supported only through enqueueV3 (TRT EP currently uses enqueueV2).
I am currently working on a dev branch to use enqueueV3. Before that branch is merged to main, I think you can only try the TRT NMS/NonZero plugins; please see my previous reply for how to use them. (Note: I encountered an engine build error, so I might have to update the engine build API as well. I will let you know once the dev branch is ready and merged to main.)

Please see here for how to use disable_cpu_ep_fallback. But in your case, you still need the CUDA EP or CPU to run those three nodes if you don't want to use the TRT plugins.
If you use the TRT plugins, then because the whole model can be run by TRT (regardless of native TRT or TRT plugins), there should be no data transfer between CPU and GPU except for the model input/output.
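For completeness, a small Python sketch of how that session option can be set (assumptions: the session config key "session.disable_cpu_ep_fallback" is the one documented by ORT, and the model path is a placeholder):

    import onnxruntime as ort

    so = ort.SessionOptions()
    # Fail session creation if any node would be assigned to the CPU EP.
    so.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

    providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    sess = ort.InferenceSession("model-inferred.onnx", sess_options=so, providers=providers)

As noted above, with NonZero/NMS/RoiAlign unsupported by ORT TRT, this check will simply make session creation fail.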

datinje (Author) commented Oct 31, 2023

As stated above by @chilo-ms, I tried in 1.16 to disable CPU EP fallback, to avoid ONNX operators being moved to the CPU when the onnxruntime parser decides so, but the effect is not to keep the operators on the GPU with TRT as expected — it prevents the program from continuing.

Then what is the purpose of this option? The main interest for me would be for onnxruntime to keep the operators on the GPU even if they are faster on CPU, because the overhead of transferring the data offsets the benefit.

2023-10-31 11:27:23.916547026 [E:onnxruntime:, inference_session.cc:1678 Initialize] This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.

Traceback (most recent call last):
  File "/cad-engine/run-onnx-pytorch.model.py", line 299, in <module>
    main()
  File "/cad-engine/run-onnx-pytorch.model.py", line 60, in main
    sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 471, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.

datinje (Author) commented Oct 31, 2023

something wrong in the copy paste above , sorry. forget about the "File ..." lines.

chilo-ms (Contributor) commented Nov 7, 2023

@datinje

Then what is the purpose of this option ?

One of the purposes of disable_cpu_ep_fallback is to make sure all the nodes are placed on the GPU before ORT starts to run inference. ORT may place some nodes on the CPU for performance, but in some cases that might not be what you want, so this option works as a check.

However, in your case the error you got is expected, because current ORT TRT doesn't support NonZero, NMS and RoiAlign, and the CPU is the only EP that can run these nodes. So you should only use disable_cpu_ep_fallback if all the nodes in your model are supported by ORT TRT; otherwise, you will get this error.

As I mentioned previously, you can try following steps:

  • Use the branch of this PR

  • Replace the line in deps.txt as below:

          - onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/a43ce67187bab219520fd80f21af8bbd4354bc8c.zip;572535aefef477050f86744dfab1fef840198035
          + onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/bacfaaa951653cd4e72efe727a543567cb38f7de.zip;26434329612e804164ab7baa6ae629ada56c1b26
    
  • Build ORT TRT using this branch with --use_tensorrt_oss_parser

  • Run the model (Don't set disable_cpu_ep_fallback)

then you can see that ORT TRT can run all the nodes of your FasterRCNN model except RoiAlign.

chilo-ms added a commit that referenced this issue Nov 17, 2023
It's possible that subgraph of the "If" control flow op has no nodes.
TRT EP should consider this kind of subgraph is fully supported by TRT.

The faster rcnn model mentioned in this issue
#17434 is the case.
jcdatin commented Mar 16, 2024

Closing, since I realized that with ORT 1.16.3 I succeeded in running my model with TRT, and it is faster than the CUDA EP in TF32.

kleiti pushed a commit to kleiti/onnxruntime that referenced this issue Mar 22, 2024
…osoft#18449)

It's possible that subgraph of the "If" control flow op has no nodes.
TRT EP should consider this kind of subgraph is fully supported by TRT.

The faster rcnn model mentioned in this issue
microsoft#17434 is the case.
jcdatin commented Apr 11, 2024

I tested my model again with the latest onnxruntime 1.17.1 and got the same performance results between the TRT EP and the CUDA EP.
I would have expected the TRT EP to use the NMS and NonZero TRT operators, since onnxruntime 1.17.1 now supports the TRT enqueueV3 API which would allow calling them. Is there any date planned for integrating these TRT ops in the onnxruntime TRT EP?

jcdatin commented Apr 11, 2024

Even the NonZero op seems implemented in TRT: could it be used in the onnxruntime TRT EP?
With these 3 operators, ALL of the Faster R-CNN would run on TRT and avoid host-to-device memory transfers!

chilo-ms (Contributor) commented Apr 11, 2024

I tested again my model with latest onnxrt 1.17.1 and got same performance results between TRT EP and CUDA EP. I would have expected that TRT EP would have used NMS and nonzero TRT operator since onnxrt 1.17.1 now supports TRT API enqueue V3 which would allow to call them . Is there any date planeed for integrating these TRT op in onnxrt TRT EP ?

@jcdatin
Unfortunately, for ORT 1.17.x, the TRT EP doesn't include those DDS operators (NMS/NonZero/RoiAlign).
But current ORT main branch + OSS onnx-tensorrt parser will make the TRT EP use the NMS/NonZero/RoiAlign TRT operators.
You can simply build ORT main with --use_tensorrt_oss_parser to achieve this.

We are testing TRT EP + TRT DDS output support (meaning including the NMS/NonZero/RoiAlign operators) to see the performance, and will then decide whether to enable this feature in the ORT official release.

If you could help test it and provide feedback, that would be great! Thank you!

jcdatin commented Apr 11, 2024

Sure! I will help.
Definition: DDS ops means Data-Dependent (dynamic) Shape operators; see https://forums.developer.nvidia.com/t/data-dependent-tensor-shapes-in-tensorrt/194988

jcdatin commented Apr 12, 2024

Shall --use_tensorrt_oss_parser REPLACE --use_tensorrt_builtin_parser, or simply complement it?

chilo-ms (Contributor) commented Aug 16, 2024

@jcdatin

What is not working any more compared to build bb19722 is the embedded context mode of TRT which is nice to speed up (x10) the onnx model load time (from 3s to 300ms).

That's weird; there is no change in the EPContext/embedded engine feature between ORT 1.17 and ORT 1.18.
What's the error you saw?

my build above was still using TRT 10.0.1 (w/ cudnn 9), retrying with TRT 10.3.0.29 (and cudnn 9)

Yes, please use the latest TRT 10.3 which fixes issues when running Faster-RCNN.

jcdatin commented Aug 17, 2024

Rebuilt ORT 1.18.1 with TRT 10.3.0.26 (and cudnn 9.3.0.75) - with cuda 12.2

First observation (when not using the TRT embedded context):
I still see not only the NonZero, NonMaxSuppression and RoiAlign nodes on the CPU EP, but now also a ScatterND node (compared to build bb19722).
This is causing a 2x execution slowdown compared to when all these nodes were on TRT.
I used the ORT 1.18.1 build command:
CC=gcc-11 CXX=g++-11 ./build.sh --nvcc_threads 2 --config $ORT_BUILD_MODE --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"
Why is --use_tensorrt_oss_parser not assigning these nodes to the TRT EP like build bb19722 did?

First things first: do you know why I still have these nodes on the CPU EP? Shall I remove the option --use_tensorrt_oss_parser? (I am going to try.)

Second observation, when using the TRT embedded context with the config above: I am getting the same error as with TRT 10.0:
terminate called after throwing an instance of 'Ort::Exception'
what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input.

Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0
Aborted (core dumped)

This used to work with TRT 8.6/cuDNN 8.9 and ORT build bb19722.
As I said, I ran onnxruntime's symbolic_shape_infer.py on my Faster R-CNN model before running ORT with the TRT EP (the only way to run the TRT EP anyway).
This is another big issue, since it multiplies the ONNX model load time by 10.

So far these are too big regressions for me to use ORT 1.18.1 and beyond.

jcdatin commented Aug 17, 2024

Another question: which ONNX Runtime graph optimization level should be used in conjunction with the TRT EP (which has its own optimizations)?
sessionOptions.SetGraphOptimizationLevel(optiLevel);
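(The thread does not settle which level works best alongside TRT's own optimizer; for reference, a minimal Python sketch of where the level is set — the model path is a placeholder:)

    import onnxruntime as ort

    so = ort.SessionOptions()
    # ORT_ENABLE_ALL shown only as the default; ORT_DISABLE_ALL / ORT_ENABLE_BASIC are the alternatives.
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess = ort.InferenceSession(
        "model-inferred.onnx",
        sess_options=so,
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    )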

jcdatin commented Aug 17, 2024

I tried to build ORT 1.18.1 with TRT 10.3 without --use_tensorrt_oss_parser, and the following nodes are still on the CPU EP:
NonZero, NonMaxSuppression, RoiAlign and ScatterND.

  1. It seems ORT 1.18.1 does not work optimally with DDS on TRT 10.3 (for my Faster R-CNN).

  2. Similarly, the embedded TRT context is not working with ORT 1.18.1 and TRT 10.3; it crashes with the error above.
    I noted the following warnings with TRT 10, though, which may indicate the problem is a regression in TRT 10 — do you confirm?
    2024-08-17 09:50:08.899501193 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 3 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:08.899536800 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 4 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:08.899550675 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 3 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:08.899562173 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 4 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:20.218509763 [W:onnxruntime:iInference,tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:20 WARNING] Profile kMIN values are not self-consistent. IShuffleLayer /model/my_model/rpn/Reshape: reshaping failed for tensor: /model/my_model/rpn/head/cls_logits/Conv_output_0 Reshape placeholder 0 has no corresponding input dimension. Instruction: RESHAPE_ZERO_IS_PLACEHOLDERinput dims{1 13 0 0} reshape dims{1 -1 1 0 0}.

First things first: can you investigate why the DDS nodes are not on the TRT EP?

chilo-ms (Contributor) commented Aug 27, 2024

Rebuilt ORT 1.18.1 with TRT 10.3.0.26 (and cudnn 9.3.0.75) - with cuda 12.2

First observation (when not using embedded context of TRT)= -I still see not only nodes NonZero, NonMaxSuppression and RoiAlign on CPU EP , but now also node ScatterND (compared to build bb19722). This is causing execution slow down compared to when all these nodes were on TRT. (x2). I used ort 1.18.1 build command = CC=gcc-11 CXX=g++-11 ./build.sh --nvcc_threads 2 --config $ORT_BUILD_MODE --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11" Why is --use_tensorrt_oss_parser not choosing TRT EP nodes for allocation like build bb19722 ?

first thing first , do you know why I am still having these nodes on CPU EP ? Shall I remove option --use_tensorrt_oss_parser ? (I am going to try).

Let me reply to the first question.
ORT 1.18.x and current main with --use_tensorrt_oss_parser do not enable TRT DDS nodes.
The build bb19722 (dating back to April) did enable DDS nodes; however, TRT 10 has some DDS-related issues, so we have disabled TRT DDS nodes since then.

I agree it's a bit complicated to enable DDS, like I mentioned here.
Please use this branch to build ORT with --use_tensorrt_oss_parser against TRT 10.3. You don't need to modify additional files; you can then run the TRT EP with DDS enabled, meaning NonZero, NMS and RoiAlign should be run by TRT.

One thing to note: when running the NMS node, TRT EP + TRT 10.3 takes much longer to finish (compared to TRT 8.6). We are still investigating the issue. If possible, could you share your model with us to test? Or could you help test from your side?

chilo-ms (Contributor) commented Sep 3, 2024

Second Observation when using TRT embedded context with config above, I am getting the same error as with TRT 10.0=
terminate called after throwing an instance of 'Ort::Exception'
what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input.
Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0
Aborted (core dumped)

@jcdatin
In order to use the embedded context, the whole model must be TRT-eligible, meaning the whole model should be placed on the TRT EP.
https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#more-about-embedded-engine-model--epcontext-model
In your case, some nodes are placed on CPU (please see my previous reply to fix this issue).

jcdatin commented Sep 10, 2024

@chilo-ms: thanks for your answer, I was on vacation. I will try DDS with your branch and investigate the TRT EP with TRT 10.3 for the NMS node. I will also check that the TRT embedded context works once all nodes are on the TRT EP.

jcdatin commented Oct 3, 2024

Nvidia informed me that the NMS performance issue is a known problem that will be fixed in TRT 10.6

chilo-ms (Contributor) commented Oct 3, 2024

Nvidia informed me that the NMS performance issue is a known problem that will be fixed in TRT 10.6

Yeah, the NMS regression in TRT 10 is a known issue and Nvidia has been investigating it.
We have been tracking this issue with them and hopefully it can be fixed in TRT 10.6.

jcdatin commented Nov 6, 2024

TRT 10.6 is out, as well as onnxruntime 1.20. But I see some restrictions:

  • ORT 1.20 only supports TRT 10.4 and 10.5 (and I need TRT 10.6)
  • TRT 10.6 still states: "performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations. The amount of regression is roughly proportional to the number of such layers in the network"
    Previous ORT and TRT 10.x could not dispatch NMS nor NonZero ops to TRT, so I have to take TRT 10.6: will ORT still dispatch NMS/NonZero to TRT? I prefer the TRT performance limitation over ORT still dispatching these DDS ops to the CPU.

What version of CUDA is supported by ORT? I am using 12.2 and TRT 10.6 seems to require 12.6.
But first things first: what about ORT 1.20 supporting TRT 10.6 and DDS?

chilo-ms (Contributor) commented Nov 13, 2024

Re: ORT 1.20 only supports TRt 10.4 and 10.5 (and I need TRT10.6)

ORT 1.20 supporting TRT 10.4 and 10.5 means our CIs tested against those TRT versions and the prebuilt package was built against them.
But you can still run the ORT TRT prebuilt library with TRT 10.6. (Note: add the TRT 10.6 lib path to LD_LIBRARY_PATH.)

Re: Previous ORT and TRT 10.x could not dispatch aNMS nor nonZero ops to TRT tree, so I have to take TRT10.6 : will ORT still dispatch NMS/NonZero to TRT , I prefer TRT perf limitation than ORT displating these DNS ops still to CPU.

Starting from TRT 10.7 (which is not released yet), TRT will completely enable DDS ops, i.e. ORT will dispatch NMS/NonZero/RoiAlign to TRT by default. Before TRT 10.7, users need to build ORT with the open-source parser to achieve this. But please be aware of the known DDS perf issue from TRT 10.0 to 10.7 (Nvidia likely won't fix the issue in TRT 10.7).
ORT TRT has a PR (which will be included in the ORT 1.20.1 patch release) that adds a new provider option, trt_op_types_to_exclude, which excludes certain ops from running on TRT. This PR also adds NMS/NonZero/RoiAlign to the exclude list by default due to the perf issue. Users can pass an empty string, i.e. trt_op_types_to_exclude="", to override this so that all ops are considered for TRT.
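A minimal Python sketch of how that override can be passed (assumption: it requires an ORT build that already contains the PR above, i.e. 1.20.1 or later; the model path is a placeholder):

    import onnxruntime as ort

    providers = [
        # Empty exclude list overrides the default exclusion of NMS/NonZero/RoiAlign.
        ("TensorrtExecutionProvider", {"trt_op_types_to_exclude": ""}),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
    sess = ort.InferenceSession("model-inferred.onnx", providers=providers)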

Re: what is the version of Cuda supported by ORT : I am using 12.2 and TRT 10.6 seems to require 12.6.
ORT should be compatible with CUDA 12.x.
Did you find any issue of running ORT with CUDA 12.6?

jcdatin commented Nov 14, 2024

Thank you @chilo-ms, I am building and testing 1.20.0 with TRT 10.6 and the OSS TRT parser. I will report on the TRT 10.6 DDS operator performance degradation. When TRT 10.7 is available, I will test it with ORT 1.20.1, an empty trt_op_types_to_exclude list, and the default TRT parser. Keep posted.

jcdatin commented Nov 14, 2024

I am getting an ORT 1.20.0 compilation error when building with TRT 10.6 (TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz) over CUDA 12.2
with build command :
CC=gcc-11 CXX=g++-11 ./build.sh --skip_submodule_sync --nvcc_threads 2 --config ${ORT_BUILD_MODE} --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=89" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

cf
[ 31%] Building CXX object _deps/onnx_tensorrt-build/CMakeFiles/nvonnxparser_static.dir/onnxErrorRecorder.cpp.

In file included from /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:5:
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:32:38: error: ‘ILogger’ in namespace ‘nvinfer1’ does not name a type
32 | using ILogger = nvinfer1::ILogger;
| ^~~~~~~
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:39:9: error: ‘ILogger’ has not been declared
39 | ILogger* logger, IErrorRecorder* otherRecorder = nullptr);
| ^~~~~~~
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:70:36: error: expected ‘)’ before ‘’ token
70 | ONNXParserErrorRecorder(ILogger
logger, IErrorRecorder* otherRecorder = nullptr);
| ~ ^
| )
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:74:26: error: ‘ILogger’ has not been declared
74 | static void logError(ILogger* logger, const char* str);
| ^~~~~~~
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:103:5: error: ‘ILogger’ does not name a type
103 | ILogger* mLogger{nullptr};
| ^~~~~~~
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:12:26: error: ‘onnx2trt::ONNXParserErrorRecorder* onnx2trt::ONNXParserErrorRecorder::create’ is not a static data member of ‘class onnx2trt::ONNXParserErrorRecorder’
12 | ONNXParserErrorRecorder* ONNXParserErrorRecorder::create(
| ^~~~~~~~~~~~~~~~~~~~~~~
/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:13:15: error: ‘ILogger’ is not a member of ‘nvinfer1’
13 | nvinfer1::ILogger* logger, nvinfer1::IErrorRecorder* otherRecorder)
| ^~~~~~~
gmake[2]: *** Waiting for unfinished jobs....

chilo-ms (Contributor) commented Nov 14, 2024

Please specify the correct onnx-tensorrt commit in cmake/deps.txt of your ORT repo.
https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt#L40

jcdatin commented Nov 27, 2024

I am using nightly build ORT_VERSION=bb1972264b, which is roughly based on 1.18.1 (this is the only onnxruntime version I can use with full inference speed on my Faster R-CNN model, with TRT 8.6.1).

jcdatin commented Nov 27, 2024

I am using nightly build ORT_VERSION=bb1972264b which is based on somewhat 1.18.1 (this is the only onnrt version I can use with full inference speed on my faster-rcnn model with TRT 8.6.1

My bad, forget the above; I am using ORT 1.20.0 and TRT 10.6, as I said above.

  1. How can I know what to put in place of line 40 of the deps.txt above?
    I just downloaded the TRT 10.6 binaries tarball from the Nvidia TRT download site; I am not building from TRT sources as I always did when ORT only supported TRT 8.6.
    I never had to change this line before with the standard version, except when you advised me to do so and provided the equivalent line 40 to use the previous TRT 8.6 instead of TRT 10.x with new ORT > 1.19.
    Can you tell me which line to use, so that I can test onnxruntime 1.20 with TRT 10.6 and DDS?

  2. The ORT 1.20.1 release notes say: "TensorRT EP Exclude DDS ops from running on TRT ([TensorRT EP] Exclude DDS ops from running on TRT #22875) - @chilo-ms". However, I NEED DDS on TRT to get full speed on my Faster R-CNN model, which ORT build bb19722 does pretty well, but with TRT 8.6.
    So I am completely lost: how am I supposed to upgrade my ORT version from the bb19722 build?

chilo-ms (Contributor) commented Dec 4, 2024

How can I know how to replace line40 in the deps.txt above ?
...
Can you tell me which line to use . So that I can test ONNRT 1.20 with TRT 10.6 and DDS ?

ORT TRT works with either the built-in TensorRT parser or the OSS TensorRT parser.

Currently, the built-in TensorRT parser (from version 10.0 to 10.7) disables DDS.
The only way to use DDS is to use the OSS TensorRT parser, which is why line 40 of deps.txt points to the commit of 10.6-GA-ORT-DDS. And you need to manually build ORT TRT with --use_tensorrt_oss_parser.

Line 40 in deps.txt (which points to a specific commit/branch of onnx-tensorrt; you can change it to use a different TRT version) is only used when you manually build ORT with --use_tensorrt_oss_parser, meaning ORT TRT will work with the OSS TensorRT parser.
In all other cases, ORT TRT works with the built-in TensorRT parser.

chilo-ms (Contributor) commented Dec 4, 2024

ORT 1.20.2 as the RL says : "TensorRT EP Exclude DDS ops from running on TRT (#22875) - @chilo-ms" However , I NEED DDS on TRT to get full speed on my faster-rcnn model -whic ORT build bb19722 does pretty well but with TRT8.6

In your case (placing DDS ops on TRT), please don't use the ORT 1.20.1 patch release.
You can use ORT 1.20.0, or any commit before the snapshot commit in main shown below, and build from source with --use_tensorrt_oss_parser.
[screenshot of the snapshot commit]

Then you will be able to run ORT TRT + TRT 10 with DDS ops run by TRT.

chilo-ms (Contributor) commented Dec 4, 2024

Sorry for the confusion and inconvenience. Nvidia has root-caused the perf issue of running DDS ops and they are working on a better solution now.
Once they fix the perf issue in a new TRT release, ORT TRT can enable DDS by default and there won't be any hassle for users to use DDS.

jcdatin commented Jan 28, 2025

Actually, I got the info that Nvidia will NOT implement an official fix in TRT for this "regression" in DDS ops, not even in TRT 10.8.
This is due to a regression in DDS operators when TRT 10 switched to asynchronous CUDA malloc, instead of the synchronous CUDA malloc used in 8.6. And Nvidia estimates it is "too risky to include a fix that only proves to improve a subset of models with DDS operators".

Here is Nvidia's recommendation for DDS operators when used with ORT:
modify the cudaMemPoolAttrReleaseThreshold attribute. Here are the detailed steps to make sure this has minimal impact on other CUDA kernels:

  • You should implement a custom IGpuAllocator.
  • Inside the custom IGpuAllocator, modify cudaMemPoolAttrReleaseThreshold to UINT64_MAX, with the code that I shared on Dec 4.
  • Associate the custom allocator with the TensorRT runtime using setGpuAllocator.

I then have no other way than to try.
So I will use ORT 1.20.1, TRT 10.7 and cuDNN 9.6 (the latest).

jcdatin commented Jan 28, 2025

Unfortunately I am getting regression with TRT on my faster-rcnn model = terminate called after throwing an instance of 'Ort::Exception' what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options. Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input. Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0

the same model used to work well with ORT 1.18.0 and TRT 8.6.1 (I used onnxruntime tool symbolic_shape_infer.py to infer dimensions) as python /usr/local/lib/python3.10/dist-packages/onnxruntime/tools/symbolic_shape_infer.py --input=faster-rcnn.onnx --output=faster-rcnn-inferred.onnx --auto_merge

Update:
The error above is obtained when using the TRT EP V2 provider-options API (OrtTensorRTProviderOptionsV2):
// WRAPPED TRT EP

    const auto& api = Ort::GetApi();
    OrtTensorRTProviderOptionsV2* tensorrt_options;
    Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&tensorrt_options));
    
    std::vector<const char*> option_keys = {
        "trt_fp16_enable",
        "trt_timing_cache_enable",
        "trt_timing_cache_path",
        "trt_force_timing_cache",
        "trt_engine_cache_enable",
        "trt_engine_cache_path",
        "trt_dump_ep_context_model",
        "trt_ep_context_file_path",
        "trt_profile_min_shapes",
        "trt_profile_max_shapes",
        "trt_profile_opt_shapes",
    };
    std::vector<const char*> option_values = {
        useFP16 ? "1" : "0",         // trt_fp16_enable
        "1",                         // use timing cache to be deterministic
        timingCachePath.c_str(),     // trt timing cache relative path
        "1",                         // accept slight GPU mismatch within CC
        "1",                         // trt_engine_cache_enable : create the embedded profile and engine in cache path (trt_ep_context_file_path)
        cachePath.c_str(),           // trt_engine_cache_path : relative path to the embedded profile and engine cache
        "1",                         // trt_dump_ep_context_model : create the embedded model context (_ctx.onnx) file that contains names of profile and engine
        contextPath.c_str(),         // trt_ep_context_file_path : path to the embedded context files
        "image:0x0",                 // trt_profile_min_shapes
        "image:3072x2400",           // trt_profile_max_shapes
        "image:2048x1024",           // trt_profile_opt_shapes
    };
    Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(tensorrt_options,
                                                        option_keys.data(), option_values.data(), option_keys.size()));
    
    sessionOptions.AppendExecutionProvider_TensorRT_V2(*tensorrt_options);

jcdatin commented Jan 30, 2025

Unfortunately I am getting regression with TRT10 on my faster-rcnn model = terminate called after throwing an instance of 'Ort::Exception' what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options. Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input. Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0
the same model used to work well with ORT 1.18.0 and TRT 8.6.1 (I used onnxruntime tool symbolic_shape_infer.py to infer dimensions) as python /usr/local/lib/python3.10/dist-packages/onnxruntime/tools/symbolic_shape_infer.py --input=faster-rcnn.onnx --output=faster-rcnn-inferred.onnx --auto_merge

Update 0:
I am using the ORT built-in parser.
Update 1:
My faster-rcnn model can be inferred with trtexec from TRT 10.7, so the problem seems to be in the ORT front end (parser).
Update 2:
My faster-rcnn model can be inferred with ORT and the CUDA EP, so it seems to be an ORT TRT EP parser problem.
Update 3:
Now, when I use the old TRT API
// Non WRAPPED TRT EP
OrtTensorRTProviderOptions tensorrt_options{};
tensorrt_options.trt_fp16_enable = useFP16;
if (!cachePath.empty())
{
tensorrt_options.trt_engine_cache_enable = true;
tensorrt_options.trt_engine_cache_path = (const char *)cachePath.c_str();
}
sessionOptions.AppendExecutionProvider_TensorRT(tensorrt_options);

then the error is:

2025-01-30 13:12:57.087270906 [V:onnxruntime:, execution_frame.cc:563 AllocateMLValueTensorSelfOwnBufferHelper] For ort_value with index: 72, block in memory pattern size is: 14450688 but the actual size is: 2809856, fall back to default allocation behavior
2025-01-30 13:12:57.098338607 [E:onnxruntime:ivpSelectorInference, tensorrt_execution_provider.h:88 log] [2025-01-30 13:12:57 ERROR] IExecutionContext::enqueueV3: Error Code 1: Cask ( Failed to update runtime arguments.)
2025-01-30 13:12:57.098391153 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_5350550926050970765_7 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_5350550926050970765_7_7' Status Message: TensorRT EP execution context enqueue failed.
terminate called after throwing an instance of 'Ort::Exception'
what(): Non-zero status code returned while running TRTKernel_graph_main_graph_5350550926050970765_7 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_5350550926050970765_7_7' Status Message: TensorRT EP execution context enqueue failed.

Update 4:
I see all DDS ops being allocated to the CPU.

jcdatin commented Jan 30, 2025

Update 5:
When using --use_tensorrt_oss_parser, I am getting the following compilation errors:
[ 41%] Building CXX object _deps/onnx_tensorrt-build/CMakeFiles/nvonnxparser_static.dir/onnxErrorRecorder.cpp.o
In file included from /tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:5:
/tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:32:38: error: ‘ILogger’ in namespace ‘nvinfer1’ does not name a type
32 | using ILogger = nvinfer1::ILogger;
| ^~~~~~~
/tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:39:9: error: ‘ILogger’ has not been declared
39 | ILogger* logger, IErrorRecorder* otherRecorder = nullptr);
| ^~~~~~~

here is the build command:
CC=gcc-11 CXX=g++-11 ./build.sh
--skip_submodule_sync
--nvcc_threads 2
--config $ORT_BUILD_MODE
--use_cuda --cuda_home /usr/local/cuda/
--cudnn_home /usr/local/cuda/lib64
--use_tensorrt_oss_parser --use_tensorrt --tensorrt_home /usr/local/TensorRT
--build_shared_lib --parallel --skip_tests
--allow_running_as_root
--cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=89"
--cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

CUDA_VERSION=12.4
CUDNN_VERSION=9.6.0.74
TRT_VERSION=10.7.0.23
ORT_VERSION=v1.20.1
GCC_VERSION=11.5.0

To recap: with ORT 1.20.1,
either I use the built-in TRT parser and my ONNX model cannot be loaded (but the model works with trtexec and the CUDA EP),
or
I use the tensorrt_oss_parser and the ORT build fails on the ILogger definition.

Can you help?
I am stuck with old ORT 1.18.0 (actually nightly build bb19722) and TRT 8.6,
so I can't fix any critical bugs or cyber vulnerabilities that might show up in these "old" versions.

yf711 (Contributor) commented Jan 31, 2025

@jcdatin I saw that in 1.20.1 the oss parser was using version 10.4-GA-ORT-DDS, while your tensorrt_home is using version 10.7.
The mismatch might be the reason. Could you try updating the line in deps.txt with the version-matched onnx-tensorrt
(i.e. for 10.7: onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/9c69a24bc2e20c8a511a4e6b06fd49639ec5300a.zip;ff1fe9af78eb129b4a4cdcb7450b7390b4436dd3) and rebuild?

jcdatin commented Feb 2, 2025

Now, with the update to deps.txt, ORT 1.20.1 builds with the TRT OSS parser.
However, the DDS operators are still assigned to the CPU!

2025-02-02 12:24:01.107927206 [V:onnxruntime:ivpSelectorInference, tensorrt_execution_provider.cc:2479 GetCapability] There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) in TRT 10. TRT EP automatically excludes DDS ops from running on TRT, if applicable

2025-02-02 12:24:03.521702534 [V:onnxruntime:, session_state.cc:1154 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 13
2025-02-02 12:24:03.521707716 [V:onnxruntime:, session_state.cc:1156 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/my_model/rpn/NonZero)
2025-02-02 12:24:03.521725971 [V:onnxruntime:, session_state.cc:1156 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (/model/my_model/rpn/NonMaxSuppression)
2025-02-02 12:24:03.521731248 [V:onnxruntime:, session_state.cc:1156 VerifyEachNodeIsAssignedToAnEp] ScatterND (/model/my_model/rpn/ScatterND)
2025-02-02 12:24:03.521740969 [V:onnxruntime:, session_state.cc:1156 VerifyEachNodeIsAssignedToAnEp] RoiAlign (/model/my_model/roi_heads/box_roi_pool/RoiAlign)

So not only are NonZero, NMS and RoiAlign allocated to the CPU, but also ScatterND.

What shall I do? I would like to test the NVIDIA workaround above to get back the performance I have with the 1.18.0 nightly build.

here is the build command
CC=gcc-11 CXX=g++-11 ./build.sh
--skip_submodule_sync
--nvcc_threads 2
--config $ORT_BUILD_MODE
--use_cuda --cuda_home /usr/local/cuda/
--cudnn_home /usr/local/cuda/lib64
--use_tensorrt_oss_parser --use_tensorrt --tensorrt_home /usr/local/TensorRT
--build_shared_lib --parallel --skip_tests
--allow_running_as_root
--cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=89"
--cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

chilo-ms (Contributor) commented Feb 4, 2025

Re: What shall I do? I would like to test the NVIDIA workaround above to get back the performance I have with the 1.18.0 nightly build.
ORT 1.20.1 implicitly filters out DDS nodes, as you saw above.
Please use ORT 1.20.0!

Here is the PR that works around the potential DDS node perf issue. Feel free to give it a try as well.
The PR is close to merging, and once it's merged we can finally get rid of the implicit filtering of DDS nodes and make DDS nodes run on TRT.

jcdatin commented Feb 4, 2025

Using your PR branch (whichever parser I use, built-in or OSS),
the DDS operators are not assigned to the CPU any more.

BUT I am crashing out of memory :
2025-02-04 12:16:47.838986200 [E:onnxruntime:ivpSelectorInference, tensorrt_execution_provider.h:88 log] [2025-02-04 12:16:47 ERROR] [engine.cpp::readEngineFromArchive::1091] Error Code 2: OutOfMemory (Requested size was 162785332 bytes.)

Also, I noticed that the ScatterND op produces an error when parsed by ORT (I don't have the problem with trtexec on the same model):

2025-02-04 12:16:36.012390098 [E:onnxruntime:ivpSelectorInference, tensorrt_execution_provider.h:88 log] [2025-02-04 12:16:36 ERROR] In node 302 with name: /model/my_model/rpn/ScatterND and operator: ScatterND (importScatterND): UNSUPPORTED_NODE_ATTR: Assertion failed: !attrs.count("reduction"): Attribute reduction is not supported.

then ScatterND is allocated to the CPU:

2025-02-04 12:16:36.371363603 [V:onnxruntime:, session_state.cc:1249 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
2025-02-04 12:16:36.371366748 [V:onnxruntime:, session_state.cc:1251 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_main_graph_2899925923520386793_0
..
2025-02-04 12:16:36.371382235 [V:onnxruntime:, session_state.cc:1251 VerifyEachNodeIsAssignedToAnEp] MemcpyFromHost (Memcpy)
..
2025-02-04 12:16:36.371417932 [V:onnxruntime:, session_state.cc:1249 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 2
2025-02-04 12:16:36.371420066 [V:onnxruntime:, session_state.cc:1251 VerifyEachNodeIsAssignedToAnEp] ScatterND (/model/my_model/rpn/ScatterND)
2025-02-04 12:16:36.371422497 [V:onnxruntime:, session_state.cc:1251 VerifyEachNodeIsAssignedToAnEp] ScatterND (/model/my_model/roi_heads/ScatterND)

Did you make a change to the ScatterND node? Is there any regression in this operator's autotest?

chilo-ms (Contributor) commented Feb 5, 2025

Thanks for reporting this.
Our Windows CI also encountered the OOM issue; we are reporting it to Nvidia, as the workaround was suggested by them.
In the meantime, please use ORT 1.20.0 to enable DDS nodes on TRT.

jcdatin commented Feb 5, 2025

Thanks for reporting this. Our Windows CI also encountered OOM issue, we are reporting to Nvidia about this as the workaround is suggested by Nvidia. At the same time, please use ORT 1.20.0 to enable DDS nodes on TRT.

With the built-in parser or the OSS parser in the build?
Plus, I am afraid I will still have the ScatterND issue above.
