
GroupNormalization plugin failure of TensorRT 10.0.1.6 when running trtexec on GPU A4000 #3950

Closed
appearancefnp opened this issue Jun 18, 2024 · 11 comments
Labels
Module:Plugins (Issues when using TensorRT plugins), triaged (Issue has been triaged by maintainers)

Comments

@appearancefnp

Description

Hey guys!
I wanted to upgrade from TensorRT 8.6 to 10.0. I have an ONNX model that uses the GroupNormalization plugin. It builds a serialized engine, but fails when deserializing the model because the plugin tries to load cuDNN 8 instead of cuDNN 9.

Environment

Using docker: nvcr.io/nvidia/tensorrt:24.05-py3

TensorRT Version: 10.0.1

NVIDIA GPU: A4000

NVIDIA Driver Version: 550.67

CUDA Version: 12.4

CUDNN Version: 9.1 (per container documentation)

Operating System:

Python Version (if applicable): -

Tensorflow Version (if applicable): -

PyTorch Version (if applicable): -

Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:24.05-py3

Relevant Files

Model link: https://drive.google.com/file/d/1vmGZpWJ_1sfz2ejbZoO3fFaR5udxOLTi/view?usp=sharing

Steps To Reproduce

  1. Run trtexec: trtexec --onnx=model.onnx
  2. trtexec builds the engine
...
[06/17/2024-14:57:28] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 3 MiB, GPU 1984 MiB
[06/17/2024-14:57:28] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 3059 MiB
[06/17/2024-14:57:28] [I] Engine built in 886.712 sec.
[06/17/2024-14:57:28] [I] Created engine with size: 55.3649 MiB
[06/17/2024-14:57:28] [I] [TRT] Loaded engine size: 55 MiB
[06/17/2024-14:57:28] [I] Engine deserialized in 0.0301295 sec.
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

...
[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +156, now: CPU 1, GPU 199 (MiB)
[06/17/2024-14:57:28] [I] Setting persistentCacheLimit to 0 bytes.
[06/17/2024-14:57:28] [I] Created execution context with device memory size: 155.537 MiB
[06/17/2024-14:57:28] [I] Using random values for input images
[06/17/2024-14:57:28] [I] Input binding for images with dimensions 1x500x1000x3 is created.
[06/17/2024-14:57:28] [I] Output binding for class_heatmaps with dimensions 1x5x125x250 is created.
[06/17/2024-14:57:28] [I] Starting inference
[06/17/2024-14:57:28] [F] [TRT] Validation failed: mBnScales != nullptr && mBnScales->mPtr != nullptr
plugin/groupNormalizationPlugin/groupNormalizationPlugin.cpp:132

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [E] Error[2]: [pluginV2DynamicExtRunner.cpp::execute::115] Error Code 2: Internal Error (Assertion pluginUtils::isSuccess(status) failed. )
[06/17/2024-14:57:28] [E] Error occurred during inference

Commands or scripts:
trtexec --onnx=model.onnx

Have you tried the latest release?: yes

@lix19937

Can you upload the full log from trtexec --onnx=model.onnx --verbose?

@appearancefnp
Author

@lix19937
trtexec.log

@lix19937

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

Make sure libcudnn.so loads successfully. Add its directory to LD_LIBRARY_PATH.
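
For example, a quick way to check which cuDNN sonames the dynamic loader can actually resolve (a minimal Python sketch that mirrors the dlopen the plugin performs; nothing here is TensorRT-specific):

    # Check which libcudnn sonames the dynamic loader can resolve.
    import ctypes

    for soname in ("libcudnn.so.8", "libcudnn.so.9"):
        try:
            ctypes.CDLL(soname)
            print(soname, "-> found")
        except OSError as err:
            print(soname, "-> NOT found:", err)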

@appearancefnp
Author

[06/17/2024-14:57:28] [E] [TRT] std::exception
[06/17/2024-14:57:28] [F] [TRT] Validation failed: Failed to load libcudnn.so.8.
plugin/common/cudnnWrapper.cpp:90

Make sure libcudnn.so loads successfully. Add its directory to LD_LIBRARY_PATH.

The problem is that the NVIDIA container ships cuDNN 9.1.0, but the plugin is trying to load libcudnn.so.8. This is a version mismatch, not a missing cuDNN.

@lix19937

lix19937 commented Jul 1, 2024

You should make sure your environment has only one cuDNN installed. Also, why does your nvinfer plugin load cuDNN 8.0?

@appearancefnp
Author

This is not my plugin - this is the plugin provided in this repo - https://github.com/NVIDIA/TensorRT/tree/release/10.1/plugin/groupNormalizationPlugin

And it loads cuDNN 8, not 9, because the wrong macro is defined here: https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/common/cudnnWrapper.cpp#L26

@lix19937

lix19937 commented Jul 1, 2024

Per https://github.com/NVIDIA/TensorRT/tree/release/10.0, for TensorRT 10.0.1.6 the recommended cuDNN versions are the following:

TensorRT GA build
  TensorRT v10.0.1.6
  Available from direct download links listed below

System Packages
  CUDA, recommended versions:
    cuda-12.2.0 + cuDNN-8.9
    cuda-11.8.0 + cuDNN-8.9
  GNU make >= v4.1
  cmake >= v3.13
  python >= v3.8, <= v3.10.x
  pip >= v19.0
  Essential utilities: git, pkg-config, wget

This maps to https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/common/cudnnWrapper.cpp#L26-L42

You can try creating a soft link: ln -s libcudnn.so.9 libcudnn.so.8.
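
If you try that, here is a rough Python sketch of the same workaround (the library directory is an assumption for this container; verify it on your system first):

    # Point a libcudnn.so.8 compatibility symlink at the installed cuDNN 9 library.
    import glob, os

    libdir = "/usr/lib/x86_64-linux-gnu"   # assumed cuDNN location, adjust as needed
    cudnn9 = sorted(glob.glob(os.path.join(libdir, "libcudnn.so.9*")))
    assert cudnn9, "no cuDNN 9 library found in " + libdir

    link = os.path.join(libdir, "libcudnn.so.8")
    if not os.path.exists(link):
        os.symlink(cudnn9[0], link)        # the plugin's dlopen of .so.8 now resolves

Note this only satisfies the dlopen; if the plugin calls cuDNN 8 symbols that were removed or changed in cuDNN 9, it can still fail at run time.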

@appearancefnp
Author

Why does the container include cudnn 9 then?
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html#rel-24-06

If TensorRT doesn't work in an NVIDIA container with cudnn 9, why does it ship with it?

@ttyio
Collaborator

ttyio commented Aug 7, 2024

@appearancefnp , we now use native groupnorm support in the onnx parser, see https://github.com/onnx/onnx-tensorrt/blob/f161f95883b4ebd8cb789de5efc67b73c0a6e694/onnxOpImporters.cpp#L2151

could you replace the groupnormplugin with groupnorm in your model? thanks!
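
For anyone who needs it, a rough sketch of that rewrite with the onnx Python package (the custom op type and its attribute names, "GroupNormalizationPlugin", "eps", and "num_groups", are assumptions; inspect your model, e.g. in Netron, and adjust before running):

    # Rewrite a custom group-norm plugin node into the native opset-18
    # GroupNormalization op, keeping its inputs (X, scale, bias) as-is.
    import onnx
    from onnx import helper

    model = onnx.load("model.onnx")

    for node in model.graph.node:
        if node.op_type == "GroupNormalizationPlugin":    # assumed custom-op type
            attrs = {a.name: a for a in node.attribute}
            node.op_type = "GroupNormalization"           # native op since opset 18
            node.domain = ""                              # default ai.onnx domain
            del node.attribute[:]
            node.attribute.extend([
                helper.make_attribute("epsilon", attrs["eps"].f),
                helper.make_attribute("num_groups", attrs["num_groups"].i),
            ])

    # The native op requires opset >= 18 in the default domain.
    for opset in model.opset_import:
        if opset.domain in ("", "ai.onnx"):
            opset.version = max(opset.version, 18)

    onnx.save(model, "model_native_groupnorm.onnx")

One caveat: the expected scale/bias shapes for GroupNormalization changed in later opsets (per-group in opset 18, per-channel from opset 21), so check which convention your weights follow.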

@ttyio added the Module:Plugins and triaged labels on Aug 7, 2024
@moraxu
Collaborator

moraxu commented Sep 7, 2024

@appearancefnp , I will be closing this ticket due to our policy to close tickets with no activity for more than 21 days after a reply had been posted. Please reopen a new ticket if you still need help.

@moraxu closed this as completed Sep 7, 2024
@toothache

@appearancefnp , we now use native groupnorm support in the onnx parser, see https://github.com/onnx/onnx-tensorrt/blob/f161f95883b4ebd8cb789de5efc67b73c0a6e694/onnxOpImporters.cpp#L2151

could you replace the groupnormplugin with groupnorm in your model? thanks!

I was able to run the native GroupNorm with opset 18, but I ran into an issue when running GroupNorm at the latest op version.
See #4336
