TensorrtExecutionProvider slower than CUDAExecutionProvider: Faster-rcnn [Performance] #17434
Comments
The onnx-trt parser filters out the NonMaxSuppression, NonZero and RoiAlign nodes.
TensorRT supports the nmsPlugin and roiAlignPlugin. Probably we can replace the ONNX NonMaxSuppression and RoiAlign nodes with those two TRT plugins to see the latency?
Typically the nodes from NonMaxSuppression onwards are selecting the best bounding boxes. These are relatively cheap operations where it's more efficient to stay on CPU than to go back to the GPU. In the NNAPI EP we have the option to set an operator after which NNAPI is not used, and we do that for NonMaxSuppression. Maybe something similar would also work for TRT/CUDA for this type of model.
So, even if, according to @skottmckay, these 3 operators are cheaper on CPU, can we try to keep them on the GPU to avoid the overhead of moving the data between CPU and GPU (in my case images of 13 MB)? Is that the goal/capability of the nmsPlugin and roiAlignPlugin? I am ready to try. Any example of how to do that? Shall I modify the model code, the resulting ONNX, or is it merely a declaration in the onnxruntime TensorRT EP configuration? What about the third operator, NonZero? I could not find a plugin for it; is there any possibility to keep it on GPU to avoid memory transfers caused by another subgraph split?
If I want to test the performance I get by not filtering out these operators, by commenting out the lines at https://github.com/onnx/onnx-tensorrt/blob/main/ModelImporter.cpp#L377, then where shall I place the modified ModelImporter.cpp file before recompiling onnxruntime? I am recompiling onnxruntime with the NVIDIA GPU and TensorRT EPs in my docker image with:
What if I compile onnxruntime with --use_tensorrt_builtin_parser: will the nodes be filtered out?
No change if I recompile onnxruntime with --use_tensorrt_builtin_parser.
Here are the steps to build the OSS onnx-tensorrt parser without filtering out those operators:
I tested the non-filtering onnx-tensorrt parser with Faster R-CNN from the ONNX model zoo, and it can include those nodes for TRT, but it failed to build the TRT engine. I need to investigate further, but you can try your Faster-RCNN model. Update: checked with Nvidia, those nodes should only work with the TRT API.
I think we can try the TRT plugins. Please see the doc here. You need to modify the graph and replace the ONNX nodes with the corresponding TRT plugin nodes.
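A minimal sketch of what that graph edit could look like with onnx-graphsurgeon. The plugin op type ("EfficientNMS_TRT"), its attribute values, and the assumption that the existing node inputs/outputs can be kept unchanged are all illustrative; the real plugin expects its own input/output layout, so follow the TRT plugin doc referenced above for the exact contract:

```python
# Hypothetical sketch: rewrite ONNX NonMaxSuppression nodes into a TRT NMS plugin op.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("faster-rcnn-inferred.onnx"))  # placeholder path

for node in graph.nodes:
    if node.op == "NonMaxSuppression":
        node.op = "EfficientNMS_TRT"      # TRT plugin op type (illustrative)
        node.attrs = {
            "plugin_version": "1",
            "score_threshold": 0.05,      # example values only
            "iou_threshold": 0.5,
            "max_output_boxes": 100,
        }

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "faster-rcnn-trt-plugins.onnx")
```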
Thanks a lot @chilo-ms: I will try to integrate the 2 plugins in my model to test the performance improvement. Hoping that the ONNX Runtime TRT EP uses the TRT API enqueueV3 asap.
After discussing with NVIDIA how to integrate the plugins, we found out that NMS and NonZero ARE implemented natively in TensorRT, cf.
For RoiAlign, the only way is via the TRT plugin, but is there a way to have the TRT EP call the native TRT layers, to avoid data transfers between CPU and GPU?
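For reference, recent TensorRT versions do expose these as native network layers in the Python API. A rough sketch, assuming TRT 8.5+ and purely illustrative input shapes; verify the exact layer methods and expected tensor shapes against your TRT version:

```python
# Sketch only: show that TensorRT has native NonZero and NMS network layers (TRT 8.5+).
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

mask = network.add_input("mask", trt.float32, (1, 1000))
nonzero = network.add_non_zero(mask)                             # native NonZero layer

boxes = network.add_input("boxes", trt.float32, (1, 1000, 4))
scores = network.add_input("scores", trt.float32, (1, 1000, 1))
max_boxes = network.add_constant((), np.array(100, dtype=np.int32))
nms = network.add_nms(boxes, scores, max_boxes.get_output(0))    # native NMS layer
```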
In 1.16.0 there is a new session option, disable_cpu_ep_fallback. How can we set it? And will this prevent NonZero and NMS from falling back to the CPU EP?
@datinje Please see here for how to use disable_cpu_ep_fallback. But in your case, you still need the CUDA EP or CPU to run those three nodes if you don't want to use TRT plugins.
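For the "how to set it" part, a minimal sketch with the Python API, assuming the session config key `session.disable_cpu_ep_fallback`; the model path and provider list are placeholders:

```python
# Minimal sketch: disable CPU EP fallback via a session config entry.
# If any node can only run on the CPU EP (e.g. NonZero/NMS/RoiAlign with ORT TRT),
# session creation will fail with the error discussed later in this thread.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

sess = ort.InferenceSession(
    "faster-rcnn-inferred.onnx",            # placeholder model path
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
```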
As stated above by @chilo-ms, I tried in 1.16 to disable CPU EP fallback, to avoid moving ONNX operators to the CPU when the onnxruntime parser decides so, but the effect is not to keep the operators on the GPU with TRT as expected; it prevents the program from continuing. Then what is the purpose of this option? The main interest for me would be for onnxruntime to keep the operators on the GPU even if they are faster on CPU, because the overhead of transferring the data would offset the benefit.
2023-10-31 11:27:23.916547026 [E:onnxruntime:, inference_session.cc:1678 Initialize] This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.
One of the purposes of disable_cpu_ep_fallback is to make sure all the nodes are placed on GPU before ORT starts to run inference. ORT may place some nodes on CPU for performance, but in some cases that might not be what you want, so this option works as a check. However, in your case the error you got is expected, because the current ORT TRT doesn't support NonZero, NMS and RoiAlign, and the CPU is the only EP that can run these nodes. So you should only use disable_cpu_ep_fallback if all the nodes in your model are supported by ORT TRT; otherwise you will get this error. As I mentioned previously, you can try the following steps:
Then you can see that ORT TRT can run all the nodes of your Faster-RCNN model except RoiAlign.
It's possible that a subgraph of the "If" control flow op has no nodes. The TRT EP should consider this kind of subgraph fully supported by TRT. The Faster-RCNN model mentioned in this issue (#17434) is such a case.
Closing, since I realized that with ORT 1.16.3 I succeeded in running my model with TRT and it is faster than the CUDA EP in TF32.
I tested my model again with the latest onnxruntime 1.17.1 and got the same performance results between the TRT EP and the CUDA EP.
Even the NonZero op seems to be implemented in TRT: could it be supported in the ONNX Runtime TRT EP?
@jcdatin We are testing TRT EP + TRT DDS output support (meaning including the NMS/NonZero/RoiAlign operators) to see the performance and then decide whether to enable this feature in the ORT official release. If you could help test it and provide feedback, that would be great! Thank you!
Sure! I will help.
Shall --use_tensorrt_oss_parser REPLACE --use_tensorrt_builtin_parser or simply complement it?
That's weird, there is no change in terms of EPContext/Embedded engine feature between ORT 1.17 and ORT 1.18.
Yes, please use the latest TRT 10.3, which fixes issues when running Faster-RCNN.
Rebuilt ORT 1.18.1 with TRT 10.3.0.26 (and cuDNN 9.3.0.75), with CUDA 12.2.
First observation (when not using the TRT embedded context): first things first, do you know why I still have these nodes on the CPU EP? Shall I remove the option --use_tensorrt_oss_parser? (I am going to try.)
Second observation, when using the TRT embedded context with the config above: I am getting the same error as with TRT 10.0. This used to work with TRT 8.6/cuDNN 8.9 and ORT build bb19722.
So far these are too big regressions for me to use ORT 1.18.1 and beyond.
Other question: what is the onnxruntime graph optimization level to use in conjunction with the TRT EP (which has its own optimizations)?
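In case it helps, a sketch of how the graph optimization level is set alongside the TRT EP. Whether to lower the level is a judgment call: TRT performs its own graph optimizations, so ORT-side optimizations that introduce fused ops can shrink what the TRT parser is able to take. Paths and providers below are placeholders:

```python
# Sketch: run with reduced ORT graph optimizations so the graph handed to the
# TRT parser stays close to the original ONNX.
import onnxruntime as ort

so = ort.SessionOptions()
# ORT_ENABLE_BASIC keeps only basic, provider-independent optimizations;
# ORT_ENABLE_ALL is the default.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC

sess = ort.InferenceSession(
    "faster-rcnn-inferred.onnx",
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
```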
Tried to build ORT 1.18.1 with TRT 10.3 without --use_tensorrt_oss_parser, and the following nodes are still on the CPU EP:
First things first, can you investigate why the DDS nodes are not on the TRT EP?
Let me reply to the first question. I agree it's a bit complicated to enable DDS, as I mentioned here. One thing to note: when running the NMS node, TRT EP + TRT 10.3 takes much longer to finish (compared to TRT 8.6). We are still investigating the issue. If possible, could you share your model with us to test? Or could you help test from your side?
@jcdatin
@chilo-ms: thanks for your answer. I was on vacation. I will try DDS with your branch and investigate the TRT EP with TRT 10.3 for the NMS node. I will also check that the TRT embedded context works once all nodes are on the TRT EP.
Nvidia informed me that the NMS performance issue is a known problem that will be fixed in TRT 10.6.
Yeah, the NMS regression in TRT 10 is a known issue and Nvidia has been investigating it.
TRT 10.6 is out, as well as onnxruntime 1.20. But I see some restrictions:
What is the version of CUDA supported by ORT? I am using 12.2, and TRT 10.6 seems to require 12.6.
Re: ORT 1.20 only supports TRT 10.4 and 10.5 (and I need TRT 10.6). ORT 1.20 supporting TRT 10.4 and 10.5 means our CIs are tested against those TRT versions and the prebuilt packages are built against them.
Re: Previous ORT and TRT 10.x could not dispatch NMS or NonZero ops to TRT, so I have to take TRT 10.6: will ORT still dispatch NMS/NonZero to TRT? I prefer the TRT perf limitation to ORT dispatching these DDS ops to the CPU. Starting from TRT 10.7 (which is not released yet), TRT will completely enable DDS ops, i.e. ORT will dispatch NMS/NonZero/RoiAlign to TRT by default. Before TRT 10.7, users need to build ORT with the open-source parser to achieve this. But please be aware of the known DDS perf issue from TRT 10.0 to 10.7 (Nvidia likely won't fix the issue in TRT 10.7).
Re: what is the version of CUDA supported by ORT: I am using 12.2 and TRT 10.6 seems to require 12.6.
Thank you @chilo-ms, I am building and testing 1.20.0 with TRT 10.6 and the OSS TRT parser. I will report on the TRT 10.6 DDS operator performance degradation. When TRT 10.7 is available I will test it with ORT 1.20.1, its empty trt_op_types_to_exclude list and the default TRT parser. Will keep you posted.
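When that option is available in your build, a hedged sketch of how clearing the exclusion list would look through the TRT EP provider options; the option name is taken from the discussion above, so confirm your ORT release actually supports it:

```python
# Sketch: pass an empty trt_op_types_to_exclude so DDS ops are not excluded from TRT.
import onnxruntime as ort

trt_options = {
    "device_id": 0,
    "trt_op_types_to_exclude": "",   # per the thread, the default excludes the DDS ops
}

sess = ort.InferenceSession(
    "faster-rcnn-inferred.onnx",     # placeholder model path
    providers=[("TensorrtExecutionProvider", trt_options),
               "CUDAExecutionProvider"],
)
```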
I am getting an ORT 1.20.0 compilation error when building with TRT 10.6 (TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz) over CUDA 12.2, cf.: In file included from /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:5:
Please specify the correct onnx-tensorrt commit in the cmake/deps.txt of your ORT repo.
I am using the nightly build ORT_VERSION=bb1972264b, which is roughly based on 1.18.1 (this is the only onnxruntime version I can use with full inference speed on my Faster-RCNN model with TRT 8.6.1).
My bad, forget the above. I am using ORT 1.20.0 and TRT 10.6, as I said above.
ORT TRT works with either the built-in TensorRT parser or the OSS TensorRT parser. Currently, the built-in TensorRT parser (from version 10.0 to 10.7) disables DDS. Line 40 in deps.txt (which points to a specific commit/branch of onnx-tensorrt; you can change it to use a different TRT version) is only used when you manually build ORT with --use_tensorrt_oss_parser.
In your case (placing DDS ops on TRT), please don't use ORT patch release 1.20.1. Then you will be able to run ORT TRT + TRT 10 with the DDS ops run by TRT.
Sorry for the confusion and inconvenience. Nvidia has root-caused the perf issue of running DDS ops and they are now working on a better solution.
Actually, I got the info that Nvidia will NOT implement an official fix in TRT for this "regression" in DDS ops, not even in TRT 10.8. Here is Nvidia's recommendation for DDS operators when used with ORT:
I then have no other way than to try.
Unfortunately I am getting a regression with TRT on my Faster-RCNN model:
terminate called after throwing an instance of 'Ort::Exception'
what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options. Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input. Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0
The same model used to work well with ORT 1.18.0 and TRT 8.6.1 (I used the onnxruntime tool symbolic_shape_infer.py to infer dimensions), as:
python /usr/local/lib/python3.10/dist-packages/onnxruntime/tools/symbolic_shape_infer.py --input=faster-rcnn.onnx --output=faster-rcnn-inferred.onnx --auto_merge
Update:
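For reference, a sketch of how explicit shape profiles can be passed through the TRT EP provider options. The input name and the min/opt/max ranges below are illustrative placeholders built from the error message; they must match your model's actual dynamic-shape inputs:

```python
# Sketch: provide min/opt/max shape profiles for the dynamic-shape inputs listed
# in the error. Names and ranges are illustrative only.
import onnxruntime as ort

trt_options = {
    "trt_profile_min_shapes": "/model/my_model/rpn/NonZero_output_0:1x1",
    "trt_profile_opt_shapes": "/model/my_model/rpn/NonZero_output_0:1x500",
    "trt_profile_max_shapes": "/model/my_model/rpn/NonZero_output_0:1x1000",
}

sess = ort.InferenceSession(
    "faster-rcnn-inferred.onnx",     # placeholder model path
    providers=[("TensorrtExecutionProvider", trt_options),
               "CUDAExecutionProvider"],
)
```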
Update 0: the error is then
2025-01-30 13:12:57.087270906 [V:onnxruntime:, execution_frame.cc:563 AllocateMLValueTensorSelfOwnBufferHelper] For ort_value with index: 72, block in memory pattern size is: 14450688 but the actual size is: 2809856, fall back to default allocation behavior
Update 4:
Update 5: here is the build command: CUDA_VERSION=12.4. To recap, in ORT 1.20.1: can you help?
@jcdatin I saw that in 1.20.1 the OSS parser was using version 10.4-GA-ORT-DDS, and your
Now, with the update to deps.txt, ORT 1.20.1 builds with the TRT OSS parser.
2025-02-02 12:24:01.107927206 [V:onnxruntime:ivpSelectorInference, tensorrt_execution_provider.cc:2479 GetCapability] There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) in TRT 10. TRT EP automatically excludes DDS ops from running on TRT, if applicable
2025-02-02 12:24:03.521702534 [V:onnxruntime:, session_state.cc:1154 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 13
So not only did I get NonZero, NMS and RoiAlign allocated to CPU, but also ScatterND. What shall I do? I would like to test the NVIDIA workaround above to get back the performance I have with the 1.18.0 nightly build. Here is the build command:
Re: What shall I do? I would like to test the NVIDIA workaround above to get back the performance I have in the 1.18.0 nightly build. Here is the PR that works around the potential DDS node perf issue. Feel free to give it a try as well.
Using your PR branch (whichever parser, built-in or OSS, I use), I am crashing out of memory.
Also, I noticed that the ScatterND op gets an error when parsed by ORT (I don't have the problem with trtexec on the same model):
2025-02-04 12:16:36.012390098 [E:onnxruntime:ivpSelectorInference, tensorrt_execution_provider.h:88 log] [2025-02-04 12:16:36 ERROR] In node 302 with name: /model/my_model/rpn/ScatterND and operator: ScatterND (importScatterND): UNSUPPORTED_NODE_ATTR: Assertion failed: !attrs.count("reduction"): Attribute reduction is not supported.
Then ScatterND is allocated to a CPU node:
2025-02-04 12:16:36.371363603 [V:onnxruntime:, session_state.cc:1249 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
Did you make a change to the ScatterND node? Any regression in its operator autotest?
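To check whether the exported model actually carries the `reduction` attribute that the parser assertion complains about, a small inspection sketch (model path is a placeholder):

```python
# Sketch: list every ScatterND node and print its attributes, to see whether a
# "reduction" attribute is present in the exported model.
import onnx

model = onnx.load("faster-rcnn-inferred.onnx")
for node in model.graph.node:
    if node.op_type == "ScatterND":
        attrs = {a.name: onnx.helper.get_attribute_value(a) for a in node.attribute}
        print(node.name, attrs)
```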
Thanks for reporting this.
With the built-in parser or the OSS parser in the build?
Describe the issue
On my Faster-RCNN-RPN models doing pattern detection, after considerable effort to infer with the TensorRT EP (see #16886, which shows that I have simplified the model and inferred the shapes of the model nodes before submitting to TRT), I found that the TRT EP is about 30% slower than the CUDA EP in FP32 (and in TF32); only with FP16 does the TRT EP almost catch up.
I only report here the second inference, not the warm-up one (which is considerably slower, which is normal).
After looking at the VERBOSE mode logs, I found out that not all the nodes are running on TRT: one is still on CPU and 5 are on the CUDA EP. That causes many memory transfers between host and GPU, which I suppose is the reason. So my question is: why are there still nodes on the CPU and CUDA EPs? Can this be fixed?
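For context, the placement log below comes from verbose session logging; a minimal sketch of how to reproduce it with the Python API (the model path is a placeholder):

```python
# Sketch: enable verbose logging so VerifyEachNodeIsAssignedToAnEp prints which EP
# each node is placed on (TensorRT, CUDA or CPU).
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0   # 0 = VERBOSE

sess = ort.InferenceSession(
    "faster-rcnn-inferred.onnx",
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider",
               "CPUExecutionProvider"],
)
```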
Here are the logs :
2023-09-06 16:45:59.604024060 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] Node placements
2023-09-06 16:45:59.604038849 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
2023-09-06 16:45:59.604042765 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_0 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_0_0)
2023-09-06 16:45:59.604046398 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_1 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_1_1)
2023-09-06 16:45:59.604049385 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_2 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_2_2)
2023-09-06 16:45:59.604052381 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_3 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_3_3)
2023-09-06 16:45:59.604055213 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_4 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_4_4)
2023-09-06 16:45:59.604057978 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_5 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_5_5)
2023-09-06 16:45:59.604060720 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_6 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_6_6)
2023-09-06 16:45:59.604063521 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyFromHost (Memcpy)
2023-09-06 16:45:59.604066111 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_422)
2023-09-06 16:45:59.604068754 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_423)
2023-09-06 16:45:59.604078119 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_424)
2023-09-06 16:45:59.604081367 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1
2023-09-06 16:45:59.604086459 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] RoiAlign (/model/roi_heads/box_pooler/level_poolers.0/RoiAlign)
2023-09-06 16:45:59.604093948 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 5
2023-09-06 16:45:59.604099017 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/proposal_generator/NonZero)
2023-09-06 16:45:59.604103942 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_497)
2023-09-06 16:45:59.604108777 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/roi_heads/NonZero)
2023-09-06 16:45:59.604113159 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_796)
2023-09-06 16:45:59.604117903 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/NonZero)
I got the same issue with both the C++ and Python runtime APIs.
To reproduce
I can't share my model for IP reasons, but I see similar issues with the public Detectron2 model zoo faster-rcnn-rpn (see #16886 for how to run it), though with that one even more nodes fall back to CPU and CUDA, among which are the nodes in bold above. So maybe the fixes found by investigating that one will lead to the same fixes here.
Urgency
I have been blocked for several months trying to run the model on the TRT EP (see #16886; thanks to the ORT staff who helped me), only to find out that it may not be worth it. It looks like I am not far off: only 3 operators/nodes remain to move onto the TRT EP. But time is running out; in a couple of months I will need to freeze the model to certify the results, with no second chance to certify with TRT FP16 or, better, INT8. I am expecting a 2x perf improvement in TRT FP16 and another 2x improvement in INT8 (accuracy is still excellent in FP16).
Platform
Linux
OS Version
SLES15 SP4
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.15.1+ (using main latest for a fix to build TRT EP)
ONNX Runtime API
Python
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
TensorRT 8.6.1
Model File
I can't, but one could use faster-rcnn-rpn from the Detectron2 model zoo (see #16886).
Is this a quantized model?
No