Accuracy with TensorRT 10.7 and self-attention #4328

Closed · mrjackbo opened this issue Jan 18, 2025 · 9 comments
Labels: Accuracy (Output mismatch between TensorRT and other frameworks), duplicate (This issue or pull request already exists), triaged (Issue has been triaged by maintainers)

Comments

@mrjackbo

I am trying to convert an open-clip (pip install open_clip_torch==2.30.0) model to TensorRT:

import open_clip
import torch

model, _, _ = open_clip.create_model_and_transforms("ViT-SO400M-14-SigLIP-384", pretrained="webli")

image_input = torch.randn((1, 3, 384, 384), dtype=torch.float32)
onnx = torch.onnx.export(
    model,
    (image_input),
    "ViT-SO400M-14-SigLIP-384.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamo=False,
    train=False,
    do_constant_folding=True,
    opset_version=18,
    export_params=True,
    dynamic_axes={"input": {0: "N"}, "output": {0: "N"}},
)

This produces a valid ONNX file, and ONNX Runtime execution matches the original PyTorch model.
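For reference, a minimal parity check along these lines can confirm the match (an illustrative sketch reusing model from the snippet above, not the exact script behind this report; it assumes the exported "output" tensor corresponds to the normalized image features from model.encode_image):

# Illustrative parity check; assumes "output" holds the normalized image features.
import numpy as np
import onnxruntime as ort
import torch

x = torch.randn((1, 3, 384, 384), dtype=torch.float32)

model.eval()
with torch.no_grad():
    ref = model.encode_image(x, normalize=True)

sess = ort.InferenceSession("ViT-SO400M-14-SigLIP-384.onnx", providers=["CPUExecutionProvider"])
out = sess.run(["output"], {"input": x.numpy()})[0]

print("max abs diff:", np.abs(out - ref.numpy()).max())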

To convert the model to TensorRT, I do:

docker run --gpus=all --rm -it -v $(pwd):/model \
    --env POLYGRAPHY_AUTOINSTALL_DEPS=1 \
    nvcr.io/nvidia/tensorrt:24.12-py3 \
    polygraphy run /model/ViT-SO400M-14-SigLIP-384.onnx --onnxrt --trt

[...]

[I]         Error Metrics: output
[I]             Minimum Required Tolerance: elemwise error | [abs=0.088555] OR [rel=1639.2] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.017181, std-dev=0.013923, var=0.00019386, median=0.014335, min=3.429e-05 at (0, 516), max=0.088555 at (0, 1013), avg-magnitude=0.017181, p90=0.035568, p95=0.043743, p99=0.065308
[I]                 ---- Histogram ----
                    Bin Range           |  Num Elems | Visualization
                    (3.43e-05, 0.00889) |        381 | ########################################
                    (0.00889 , 0.0177 ) |        323 | #################################
                    (0.0177  , 0.0266 ) |        208 | #####################
                    (0.0266  , 0.0354 ) |        122 | ############
                    (0.0354  , 0.0443 ) |         61 | ######
                    (0.0443  , 0.0531 ) |         29 | ###
                    (0.0531  , 0.062  ) |         11 | #
                    (0.062   , 0.0709 ) |          8 | 
                    (0.0709  , 0.0797 ) |          5 | 
                    (0.0797  , 0.0886 ) |          4 | 
[I]             Relative Difference | Stats: mean=5.9796, std-dev=51.921, var=2695.8, median=1.1482, min=0.0033405 at (0, 722), max=1639.2 at (0, 375), avg-magnitude=5.9796, p90=6.266, p95=17.039, p99=72.345
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0.00334 , 164     ) |       1147 | ########################################
                    (164     , 328     ) |          3 | 
                    (328     , 492     ) |          1 | 
                    (492     , 656     ) |          0 | 
                    (656     , 820     ) |          0 | 
                    (820     , 984     ) |          0 | 
                    (984     , 1.15e+03) |          0 | 
                    (1.15e+03, 1.31e+03) |          0 | 
                    (1.31e+03, 1.48e+03) |          0 | 
                    (1.48e+03, 1.64e+03) |          1 | 
[E]         FAILED | Output: 'output' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)

Note the magnitude of the relative error (p90 = 6.266!). This happens on my RTX A4500 Laptop GPU (driver 560) and on my V100 (though there I use tensorrt:24.06-py3, since TensorRT 10.7 no longer supports Volta). The FP16/BF16 case is even worse.

When I do the same conversion with --fp8, the error vanishes (note that the A4500 and V100 do not support FP8 kernels). I compared the trtexec verbose logs, and found that in the fp32 case, TensorRT recognizes the self-attention pattern, but in the FP8 case it does not:

trtexec --onnx=ViT-SO400M-14-SigLIP-384.onnx --verbose
[...]
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/MatMul to be part of self-attention pattern.                                                                                                                                                                                                        
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/Softmax to be part of self-attention pattern.                                                                                                                                                                                                       
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/MatMul_1 to be part of self-attention pattern.                                                                                                                                                                                                      
[01/18/2025-15:16:59] [V] [TRT] Found and reassigned Myelin backends for Self-Attention nodes  
[...]

This observation got me thinking: when I replace the /attn/Softmax nodes with a custom TensorRT softmax plugin, the optimizer can no longer apply the self-attention fusion, and the resulting engines have acceptable accuracy (even in FP16).
My conclusion: for this model, the Myelin self-attention fusion appears to be buggy.
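For the record, the graph rewrite can be sketched roughly as follows with ONNX GraphSurgeon. This is only an illustration, not the exact script used here; "CustomSoftmaxPlugin" is a placeholder op type and must match a softmax plugin actually registered with TensorRT when the engine is built:

# Illustrative only: retarget the attention Softmax nodes to a custom plugin op
# so that TensorRT's self-attention pattern no longer matches.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("ViT-SO400M-14-SigLIP-384.onnx"))

for node in graph.nodes:
    if node.op == "Softmax" and "/attn/" in node.name:
        # Placeholder op type: the ONNX parser falls back to a registered plugin creator of this name.
        node.op = "CustomSoftmaxPlugin"

onnx.save(gs.export_onnx(graph), "ViT-SO400M-14-SigLIP-384.custom_softmax.onnx")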

@lix19937

Try marking all tensors as outputs so the comparison runs layer by layer:

 polygraphy run /model/ViT-SO400M-14-SigLIP-384.onnx --trt --onnxrt \
     --trt-outputs mark all \
     --onnx-outputs mark all

@mrjackbo (Author)

Yes, this fixes the accuracy issue (it prevents layer fusion), but performance is terrible (as expected).

@lix19937 commented Jan 21, 2025

It is often useful to reduce the model to the smallest possible subgraph that still triggers the failure; that makes it easier to pinpoint the cause.

Bisect/split the ONNX model, and see polygraphy debug for details:
https://github.com/NVIDIA/TensorRT/blob/release/10.7/tools/Polygraphy/how-to/debug_accuracy.md
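As a concrete starting point for that reduction, one could carve out a single attention block with onnx.utils.extract_model and compare only that subgraph. The tensor names below are placeholders; take the real names from the graph (e.g. with Netron or polygraphy inspect model):

# Illustrative reduction: extract one attention block into a standalone ONNX file.
# The tensor names are placeholders and must be replaced with real ones from the graph.
import onnx.utils

onnx.utils.extract_model(
    "ViT-SO400M-14-SigLIP-384.onnx",
    "attn_block_18.onnx",
    input_names=["/visual/trunk/blocks/blocks.18/attn/input"],     # placeholder
    output_names=["/visual/trunk/blocks/blocks.18/attn/output"],   # placeholder
)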

@mrjackbo
Copy link
Author

I understand. I am just reporting that the self-attention fusion in this case appears to have a bug which results in large errors even in fp32.

@lix19937

Use bisection just to find which layer has the compute error.

@ohadravid

It seems open_clip uses F.scaled_dot_product_attention (https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/transformer.py#L159), so the issue I opened (#4333) might be related.

@lix19937

BTW, when you use trtexec, you can add the --noTF32 flag to improve accuracy at a small performance cost.

@mrjackbo (Author)

@ohadravid Thank you so much! With the help of your reproducer I was able to fix the open_clip problem by monkey-patching torch.nn.functional.scaled_dot_product_attention like so:

import math
import torch

def naive_sdpa(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None):
    # Naive scaled dot-product attention. dropout_p, is_causal, and scale are
    # accepted only for signature compatibility and are not used here.
    _, _B, _Nt, E = q.shape
    q = q * math.sqrt(1.0 / float(E))   # scale queries by 1/sqrt(head_dim)
    attn = q @ k.transpose(-2, -1)      # attention scores
    if attn_mask is not None:
        attn += attn_mask
    attn = attn.softmax(dim=-1)
    return attn @ v

# Patch the PyTorch SDPA entry point before exporting so open_clip's attention
# traces through the naive implementation above.
torch.nn.functional.scaled_dot_product_attention = naive_sdpa

import open_clip
[...]

@kevinch-nv added the triaged, duplicate, and Accuracy labels on Jan 31, 2025
@kevinch-nv (Collaborator)

Thanks for root-causing this, @ohadravid! Closing this issue; let's track the proper scaled dot-product attention fix in #4333.
