Accuracy with TensorRT 10.7 and self-attention #4328
Try to use:

polygraphy run /model/ViT-SO400M-14-SigLIP-384.onnx --trt --onnxrt \
    --trt-outputs mark all \
    --onnx-outputs mark all
Yes, this fixes the accuracy issue (it prevents layer fusion), but performance is terrible (as expected).
It is often useful to reduce the model to the smallest possible subgraph that still triggers the failure; that makes it easier to pinpoint the cause. You can bisect/split the ONNX graph, and see polygraphy debug for an automated way to do it.
I understand. I am just reporting that the self-attention fusion in this case appears to have a bug which results in large errors even in fp32.
Use the bisection just to find which layer has the compute error.
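As a rough illustration of that bisection (not a command from this thread), one way to cut out a candidate subgraph is onnx.utils.extract_model; the tensor names below are hypothetical placeholders that would need to be read from the real graph (e.g. with Netron or polygraphy inspect model):

# Sketch: extract the subgraph between two intermediate tensors so the
# TRT-vs-ONNX-Runtime comparison can be rerun on a much smaller model.
# The tensor names are hypothetical placeholders, not from this issue.
import onnx.utils

onnx.utils.extract_model(
    "ViT-SO400M-14-SigLIP-384.onnx",                    # full model
    "attn_block.onnx",                                   # reduced model to test
    input_names=["/blocks.0/attn/MatMul_output_0"],      # placeholder name
    output_names=["/blocks.0/attn/Softmax_output_0"],    # placeholder name
)

The reduced model can then be fed back into the same polygraphy run comparison; polygraphy debug reduce can also automate this search.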
Seems like open_clip uses torch.nn.functional.scaled_dot_product_attention.
BTW, when you use trtexec, you can add the flag
@ohadravid Thank you so much! With the help of your reproducer I was able to fix the open_clip problem by monkey-patching scaled_dot_product_attention:

import math
import torch

def naive_sdpa(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None):
    # Unfused reference implementation of scaled dot-product attention.
    # dropout_p, is_causal and scale are accepted only for signature
    # compatibility and are ignored here.
    _, _B, _Nt, E = q.shape
    q = q * math.sqrt(1.0 / float(E))
    attn = q @ k.transpose(-2, -1)
    if attn_mask is not None:
        attn += attn_mask
    attn = attn.softmax(dim=-1)
    return attn @ v

# Swap in the naive implementation globally, before open_clip is imported.
torch.nn.functional.scaled_dot_product_attention = naive_sdpa

import open_clip
[...]
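For completeness, here is a hedged sketch of how the patched model might then be exported to ONNX; the model name, input shape, opset and export arguments are assumptions based on the file name mentioned in this issue, not the exact script used above:

# Sketch only: export the image tower after the SDPA monkey-patch above.
# Model name, input size and opset are assumptions, not taken from the issue;
# in practice the pretrained weights would be loaded via the pretrained= argument.
import torch
import open_clip  # imported after the monkey-patch, as in the snippet above

model, _, _ = open_clip.create_model_and_transforms("ViT-SO400M-14-SigLIP-384")
model.eval()

dummy = torch.randn(1, 3, 384, 384)  # SigLIP-384 expects 384x384 images
torch.onnx.export(
    model.visual,                     # image encoder only
    dummy,
    "ViT-SO400M-14-SigLIP-384.onnx",
    input_names=["image"],
    output_names=["embedding"],
    opset_version=17,
)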
Thanks for root-causing this, @ohadravid! Closing this issue; let's track the proper scaled dot-product attention fix in #4333.
I am trying to convert an open-clip (pip install open_clip_torch==2.30.0) model to TensorRT. Exporting the model produces a valid ONNX file, such that ONNX Runtime execution matches the PyTorch output of the original model.

To convert the ONNX model to a TensorRT engine, I do:

Note the magnitude of the relative error (p90=6.266!!). This happens on my RTX A4500 Laptop GPU (driver 560) and on my V100 (but there I use the tensorrt:24.06-py3 container, as TensorRT 10.7 no longer supports Volta). The FP16/BF16 case is even worse.

When I do the same conversion with --fp8, the error vanishes (note that the A4500 and V100 do not support FP8 kernels). I compared the trtexec verbose logs and found that in the fp32 case TensorRT recognizes the self-attention pattern, but in the FP8 case it does not.

This observation got me thinking... When I replace the /attn/Softmax nodes with a custom TensorRT softmax plugin, the TensorRT optimizer can no longer do the self-attention optimization, and the result is that I get TensorRT engines with acceptable accuracy (even in fp16).

My conclusion: somehow, for this model, the Myelin self-attention fusion is buggy.
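As a side note, the following sketch shows the kind of per-output relative-error summary being quoted above (p90 of |out − ref| / |ref|); the arrays are placeholders for outputs dumped by polygraphy or a custom harness, and this is not necessarily the exact metric polygraphy reports:

# Sketch: p90 of the element-wise relative error between a TensorRT output
# and an ONNX Runtime / PyTorch reference. Inputs here are placeholders.
import numpy as np

def rel_err_p90(out: np.ndarray, ref: np.ndarray, eps: float = 1e-12) -> float:
    rel = np.abs(out - ref) / (np.abs(ref) + eps)
    return float(np.percentile(rel, 90))

# e.g. rel_err_p90(trt_embedding, onnxrt_embedding)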