I used trtexec to convert my ONNX network to a TensorRT engine. The model performs object detection on 640x640 RGB images, and I converted it with this command: !/usr/src/tensorrt/bin/trtexec --onnx=best.onnx --saveEngine=best_engine.engine --fp16
Everything seems to go well and the engine file I need is created. However, deserialize_cuda_engine() cannot deserialize the engine I just built. I also tried various other ONNX models and got the same result every time.
engine_path = "/content/best_engine.engine"
# trt.init_libnvinfer_plugins(None, "")  # I also tried this line; it did not help
with open(engine_path, 'rb') as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime:
    engine_data = f.read()
    engine = runtime.deserialize_cuda_engine(engine_data)
In the code above, runtime.deserialize_cuda_engine returns None.
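For reference, a silent None from deserialize_cuda_engine is often a TensorRT version mismatch: a serialized engine is generally only loadable by the same major.minor TensorRT version that built it, and on Colab the trtexec binary and the Python tensorrt package can differ. A purely illustrative check (the helper name is my own invention):

```python
def versions_compatible(build_version: str, runtime_version: str) -> bool:
    """Serialized TensorRT engines are tied to the TensorRT version that
    built them; a differing major.minor version typically makes
    deserialize_cuda_engine return None."""
    return build_version.split(".")[:2] == runtime_version.split(".")[:2]

# trtexec reported v10.0.1; compare it against the Python package, e.g.:
#   import tensorrt as trt
#   versions_compatible("10.0.1", trt.__version__)
print(versions_compatible("10.0.1", "10.0.1"))  # True
print(versions_compatible("10.0.1", "8.6.1"))   # False
```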
# Import necessary libraries
import numpy as np
import cv2
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

engine_path = "/content/best_engine.engine"

with open(engine_path, 'rb') as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime:
    engine_data = f.read()
    engine = runtime.deserialize_cuda_engine(engine_data)

# Check if engine loading was successful
if engine is None:
    print("Error: Failed to load the TensorRT engine.")
else:
    print("SUCCESS: Successfully loaded the TensorRT engine.")
Output of trtexec command:
&&&& RUNNING TensorRT.trtexec [TensorRT v100001] # /usr/src/tensorrt/bin/trtexec --onnx=/content/drive/MyDrive/yolov9/runs/train/exp5/weights/best.onnx --saveEngine=best_engine.engine --fp16
[04/29/2024-07:10:49] [I] === Model Options ===
[04/29/2024-07:10:49] [I] Format: ONNX
[04/29/2024-07:10:49] [I] Model: /content/drive/MyDrive/yolov9/runs/train/exp5/weights/best.onnx
[04/29/2024-07:10:49] [I] Output:
[04/29/2024-07:10:49] [I] === Build Options ===
[04/29/2024-07:10:49] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[04/29/2024-07:10:49] [I] avgTiming: 8
[04/29/2024-07:10:49] [I] Precision: FP32+FP16
[04/29/2024-07:10:49] [I] LayerPrecisions:
[04/29/2024-07:10:49] [I] Layer Device Types:
[04/29/2024-07:10:49] [I] Calibration:
[04/29/2024-07:10:49] [I] Refit: Disabled
[04/29/2024-07:10:49] [I] Strip weights: Disabled
[04/29/2024-07:10:49] [I] Version Compatible: Disabled
[04/29/2024-07:10:49] [I] ONNX Plugin InstanceNorm: Disabled
[04/29/2024-07:10:49] [I] TensorRT runtime: full
[04/29/2024-07:10:49] [I] Lean DLL Path:
[04/29/2024-07:10:49] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/29/2024-07:10:49] [I] Exclude Lean Runtime: Disabled
[04/29/2024-07:10:49] [I] Sparsity: Disabled
[04/29/2024-07:10:49] [I] Safe mode: Disabled
[04/29/2024-07:10:49] [I] Build DLA standalone loadable: Disabled
[04/29/2024-07:10:49] [I] Allow GPU fallback for DLA: Disabled
[04/29/2024-07:10:49] [I] DirectIO mode: Disabled
[04/29/2024-07:10:49] [I] Restricted mode: Disabled
[04/29/2024-07:10:49] [I] Skip inference: Disabled
[04/29/2024-07:10:49] [I] Save engine: best_engine.engine
[04/29/2024-07:10:49] [I] Load engine:
[04/29/2024-07:10:49] [I] Profiling verbosity: 0
[04/29/2024-07:10:49] [I] Tactic sources: Using default tactic sources
[04/29/2024-07:10:49] [I] timingCacheMode: local
[04/29/2024-07:10:49] [I] timingCacheFile:
[04/29/2024-07:10:49] [I] Enable Compilation Cache: Enabled
[04/29/2024-07:10:49] [I] errorOnTimingCacheMiss: Disabled
[04/29/2024-07:10:49] [I] Preview Features: Use default preview flags.
[04/29/2024-07:10:49] [I] MaxAuxStreams: -1
[04/29/2024-07:10:49] [I] BuilderOptimizationLevel: -1
[04/29/2024-07:10:49] [I] Calibration Profile Index: 0
[04/29/2024-07:10:49] [I] Weight Streaming: Disabled
[04/29/2024-07:10:49] [I] Debug Tensors:
[04/29/2024-07:10:49] [I] Input(s)s format: fp32:CHW
[04/29/2024-07:10:49] [I] Output(s)s format: fp32:CHW
[04/29/2024-07:10:49] [I] Input build shapes: model
[04/29/2024-07:10:49] [I] Input calibration shapes: model
[04/29/2024-07:10:49] [I] === System Options ===
[04/29/2024-07:10:49] [I] Device: 0
[04/29/2024-07:10:49] [I] DLACore:
[04/29/2024-07:10:49] [I] Plugins:
[04/29/2024-07:10:49] [I] setPluginsToSerialize:
[04/29/2024-07:10:49] [I] dynamicPlugins:
[04/29/2024-07:10:49] [I] ignoreParsedPluginLibs: 0
[04/29/2024-07:10:49] [I]
[04/29/2024-07:10:49] [I] === Inference Options ===
[04/29/2024-07:10:49] [I] Batch: Explicit
[04/29/2024-07:10:49] [I] Input inference shapes: model
[04/29/2024-07:10:49] [I] Iterations: 10
[04/29/2024-07:10:49] [I] Duration: 3s (+ 200ms warm up)
[04/29/2024-07:10:49] [I] Sleep time: 0ms
[04/29/2024-07:10:49] [I] Idle time: 0ms
[04/29/2024-07:10:49] [I] Inference Streams: 1
[04/29/2024-07:10:49] [I] ExposeDMA: Disabled
[04/29/2024-07:10:49] [I] Data transfers: Enabled
[04/29/2024-07:10:49] [I] Spin-wait: Disabled
[04/29/2024-07:10:49] [I] Multithreading: Disabled
[04/29/2024-07:10:49] [I] CUDA Graph: Disabled
[04/29/2024-07:10:49] [I] Separate profiling: Disabled
[04/29/2024-07:10:49] [I] Time Deserialize: Disabled
[04/29/2024-07:10:49] [I] Time Refit: Disabled
[04/29/2024-07:10:49] [I] NVTX verbosity: 0
[04/29/2024-07:10:49] [I] Persistent Cache Ratio: 0
[04/29/2024-07:10:49] [I] Optimization Profile Index: 0
[04/29/2024-07:10:49] [I] Weight Streaming Budget: Disabled
[04/29/2024-07:10:49] [I] Inputs:
[04/29/2024-07:10:49] [I] Debug Tensor Save Destinations:
[04/29/2024-07:10:49] [I] === Reporting Options ===
[04/29/2024-07:10:49] [I] Verbose: Disabled
[04/29/2024-07:10:49] [I] Averages: 10 inferences
[04/29/2024-07:10:49] [I] Percentiles: 90,95,99
[04/29/2024-07:10:49] [I] Dump refittable layers:Disabled
[04/29/2024-07:10:49] [I] Dump output: Disabled
[04/29/2024-07:10:49] [I] Profile: Disabled
[04/29/2024-07:10:49] [I] Export timing to JSON file:
[04/29/2024-07:10:49] [I] Export output to JSON file:
[04/29/2024-07:10:49] [I] Export profile to JSON file:
[04/29/2024-07:10:49] [I]
[04/29/2024-07:10:49] [I] === Device Information ===
[04/29/2024-07:10:49] [I] Available Devices:
[04/29/2024-07:10:49] [I] Device 0: "Tesla T4" UUID: GPU-0343faab-652b-1b38-5fe7-4a515f0ab8a4
[04/29/2024-07:10:49] [I] Selected Device: Tesla T4
[04/29/2024-07:10:49] [I] Selected Device ID: 0
[04/29/2024-07:10:49] [I] Selected Device UUID: GPU-0343faab-652b-1b38-5fe7-4a515f0ab8a4
[04/29/2024-07:10:49] [I] Compute Capability: 7.5
[04/29/2024-07:10:49] [I] SMs: 40
[04/29/2024-07:10:49] [I] Device Global Memory: 15102 MiB
[04/29/2024-07:10:49] [I] Shared Memory per SM: 64 KiB
[04/29/2024-07:10:49] [I] Memory Bus Width: 256 bits (ECC enabled)
[04/29/2024-07:10:49] [I] Application Compute Clock Rate: 1.59 GHz
[04/29/2024-07:10:49] [I] Application Memory Clock Rate: 5.001 GHz
[04/29/2024-07:10:49] [I]
[04/29/2024-07:10:49] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/29/2024-07:10:49] [I]
[04/29/2024-07:10:49] [I] TensorRT version: 10.0.1
[04/29/2024-07:10:49] [I] Loading standard plugins
[04/29/2024-07:10:49] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 16, GPU 339 (MiB)
[04/29/2024-07:10:51] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +945, GPU +180, now: CPU 1097, GPU 519 (MiB)
[04/29/2024-07:10:51] [I] Start parsing network model.
[04/29/2024-07:11:08] [I] [TRT] ----------------------------------------------------------------
[04/29/2024-07:11:08] [I] [TRT] Input filename: /content/drive/MyDrive/yolov9/runs/train/exp5/weights/best.onnx
[04/29/2024-07:11:08] [I] [TRT] ONNX IR version: 0.0.7
[04/29/2024-07:11:08] [I] [TRT] Opset version: 12
[04/29/2024-07:11:08] [I] [TRT] Producer name: pytorch
[04/29/2024-07:11:08] [I] [TRT] Producer version: 2.2.1
[04/29/2024-07:11:08] [I] [TRT] Domain:
[04/29/2024-07:11:08] [I] [TRT] Model version: 0
[04/29/2024-07:11:08] [I] [TRT] Doc string:
[04/29/2024-07:11:08] [I] [TRT] ----------------------------------------------------------------
[04/29/2024-07:11:08] [I] Finished parsing network model. Parse time: 17.0463
[04/29/2024-07:11:08] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[04/29/2024-07:11:08] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[04/29/2024-07:11:08] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[04/29/2024-07:18:23] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[04/29/2024-07:18:33] [I] [TRT] Total Host Persistent Memory: 1203408
[04/29/2024-07:18:33] [I] [TRT] Total Device Persistent Memory: 3570176
[04/29/2024-07:18:33] [I] [TRT] Total Scratch Memory: 852480
[04/29/2024-07:18:33] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 363 steps to complete.
[04/29/2024-07:18:33] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 68.1444ms to assign 21 blocks to 363 nodes requiring 82434560 bytes.
[04/29/2024-07:18:33] [I] [TRT] Total Activation Memory: 82432000
[04/29/2024-07:18:33] [I] [TRT] Total Weights Memory: 121699856
[04/29/2024-07:18:33] [I] [TRT] Engine generation completed in 444.887 seconds.
[04/29/2024-07:18:33] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 5 MiB, GPU 400 MiB
[04/29/2024-07:18:33] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 2074 MiB
[04/29/2024-07:18:33] [I] Engine built in 445.314 sec.
[04/29/2024-07:18:33] [I] Created engine with size: 120.008 MiB
[04/29/2024-07:18:34] [I] [TRT] Loaded engine size: 120 MiB
[04/29/2024-07:18:34] [I] Engine deserialized in 0.186017 sec.
[04/29/2024-07:18:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +82, now: CPU 1, GPU 201 (MiB)
[04/29/2024-07:18:34] [I] Setting persistentCacheLimit to 0 bytes.
[04/29/2024-07:18:34] [I] Created execution context with device memory size: 78.6133 MiB
[04/29/2024-07:18:34] [I] Using random values for input images
[04/29/2024-07:18:34] [I] Input binding for images with dimensions 1x3x640x640 is created.
[04/29/2024-07:18:34] [I] Output binding for output0 with dimensions 1x5x8400 is created.
[04/29/2024-07:18:34] [I] Output binding for 1686 with dimensions 1x5x8400 is created.
[04/29/2024-07:18:34] [I] Starting inference
[04/29/2024-07:18:37] [I] Warmup completed 12 queries over 200 ms
[04/29/2024-07:18:37] [I] Timing trace has 175 queries over 3.04668 s
[04/29/2024-07:18:37] [I]
[04/29/2024-07:18:37] [I] === Trace details ===
[04/29/2024-07:18:37] [I] Trace averages of 10 runs:
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.0329 ms - Host latency: 17.5236 ms (enqueue 2.34825 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.0618 ms - Host latency: 17.5495 ms (enqueue 2.44615 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.2715 ms - Host latency: 17.753 ms (enqueue 2.56733 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.1086 ms - Host latency: 17.5893 ms (enqueue 2.65072 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.6873 ms - Host latency: 18.1707 ms (enqueue 2.3817 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.4858 ms - Host latency: 17.9664 ms (enqueue 2.29066 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.2406 ms - Host latency: 17.7227 ms (enqueue 2.2957 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.2266 ms - Host latency: 17.7164 ms (enqueue 2.38251 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.2058 ms - Host latency: 17.6895 ms (enqueue 2.30498 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.2516 ms - Host latency: 17.7397 ms (enqueue 2.42463 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.5162 ms - Host latency: 17.9998 ms (enqueue 2.37827 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.3992 ms - Host latency: 17.8952 ms (enqueue 2.42373 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.272 ms - Host latency: 17.7574 ms (enqueue 2.3646 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.3744 ms - Host latency: 17.8614 ms (enqueue 2.48599 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.3539 ms - Host latency: 17.8584 ms (enqueue 2.47695 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.3235 ms - Host latency: 17.8096 ms (enqueue 2.43191 ms)
[04/29/2024-07:18:37] [I] Average on 10 runs - GPU latency: 17.4615 ms - Host latency: 17.945 ms (enqueue 2.37036 ms)
[04/29/2024-07:18:37] [I]
[04/29/2024-07:18:37] [I] === Performance summary ===
[04/29/2024-07:18:37] [I] Throughput: 57.4397 qps
[04/29/2024-07:18:37] [I] Latency: min = 17.1947 ms, max = 18.9179 ms, mean = 17.7985 ms, median = 17.6521 ms, percentile(90%) = 18.0736 ms, percentile(95%) = 18.319 ms, percentile(99%) = 18.7384 ms
[04/29/2024-07:18:37] [I] Enqueue Time: min = 2.16687 ms, max = 3.4939 ms, mean = 2.40903 ms, median = 2.34253 ms, percentile(90%) = 2.64905 ms, percentile(95%) = 2.84106 ms, percentile(99%) = 3.42578 ms
[04/29/2024-07:18:37] [I] H2D Latency: min = 0.431519 ms, max = 0.519669 ms, mean = 0.442469 ms, median = 0.438232 ms, percentile(90%) = 0.452332 ms, percentile(95%) = 0.466064 ms, percentile(99%) = 0.515625 ms
[04/29/2024-07:18:37] [I] GPU Compute Time: min = 16.7149 ms, max = 18.4381 ms, mean = 17.3119 ms, median = 17.1631 ms, percentile(90%) = 17.592 ms, percentile(95%) = 17.8182 ms, percentile(99%) = 18.2589 ms
[04/29/2024-07:18:37] [I] D2H Latency: min = 0.0322266 ms, max = 0.0551758 ms, mean = 0.0440975 ms, median = 0.0437012 ms, percentile(90%) = 0.0457764 ms, percentile(95%) = 0.0506592 ms, percentile(99%) = 0.0546875 ms
[04/29/2024-07:18:37] [I] Total Host Walltime: 3.04668 s
[04/29/2024-07:18:37] [I] Total GPU Compute Time: 3.02959 s
[04/29/2024-07:18:37] [W] * GPU compute time is unstable, with coefficient of variance = 1.62079%.
[04/29/2024-07:18:37] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[04/29/2024-07:18:37] [I] Explanations of the performance metrics are printed in the verbose logs.
[04/29/2024-07:18:37] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # /usr/src/tensorrt/bin/trtexec --onnx=/content/drive/MyDrive/yolov9/runs/train/exp5/weights/best.onnx --saveEngine=best_engine.engine --fp16
Environment
I am running all my code on Google Colab (I also tried everything on my local machine, with the same result)
TensorRT Version: 10.0.1
NVIDIA GPU: Tesla T4
NVIDIA Driver Version: 535.104.05
CUDA Version: 12.2
CUDNN Version: 8.9.2.26
Operating System:
Python Version: 3.10.12
Tensorflow Version: 2.15.0
PyTorch Version: 2.2.1+cu121
Steps To Reproduce
1- Download the ONNX model: https://github.com/onnx/models/blob/main/Computer_Vision/bat_resnext26ts_Opset16_timm/bat_resnext26ts_Opset16.onnx
2- Run: !/usr/src/tensorrt/bin/trtexec --onnx=Path/to/Onnx --saveEngine=best_engine.engine --fp16
3- Run the Python code shown above to check whether the engine can be deserialized.