# Incorrect outputs of TensorRT 10.7 when compiling F.scaled_dot_product_attention on GPU L4 #4333
**Labels:** `Accuracy` (output mismatch between TensorRT and other frameworks), `internal-bug-tracked` (tracked internally, will be fixed in a future release), `triaged` (issue has been triaged by maintainers)
## Description

Compiling an attention layer that uses torch's `scaled_dot_product_attention` into a TRT engine results in incorrect outputs.

## Environment

Tested against `nvcr.io/nvidia/pytorch:24.12-py3` on an L4 GPU.

**TensorRT Version**: 10.7.0.23
**NVIDIA GPU**: L4
**NVIDIA Driver Version**:
**CUDA Version**: 12.6.3
**CUDNN Version**: 9.6.0.74
**Operating System**: Ubuntu 24.04
**Python Version (if applicable)**: 3.12
**Tensorflow Version (if applicable)**:
**PyTorch Version (if applicable)**: 2.6.0a0+df5bbc09d1
**Baremetal or Container (if so, version)**: Container, `nvcr.io/nvidia/pytorch:24.12-py3`
## Relevant Files

**Model link:** a gist to reproduce the issue
At a high level, we have two equivalent versions of the attention layer: one calling `F.scaled_dot_product_attention` and one spelled out with explicit matmul/softmax operations. The `scaled_dot_product_attention` version produces an incorrect TRT engine.
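Since the gist is not inlined here, the following is a minimal NumPy stand-in for the equivalence involved (shapes and names are illustrative, not taken from the gist): `scaled_dot_product_attention(q, k, v)` computes `softmax((q @ k^T) * scale) @ v`, and for head dimension 64 the default scale is `1/sqrt(64) = 0.125` — the value that shows up on the Mul node in the exported ONNX. Mathematically it does not matter whether the scale is applied to the scores or folded into `q`, but its value does matter:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def explicit_attention(q, k, v, scale):
    # The "explicit" variant: softmax((q @ k^T) * scale) @ v
    scores = (q @ k.transpose(0, 1, 3, 2)) * scale
    return softmax(scores) @ v

rng = np.random.default_rng(0)
# (batch, heads, seq_len, head_dim) -- illustrative shapes
q, k, v = (rng.standard_normal((1, 8, 16, 64)) for _ in range(3))

# For head_dim=64 the default SDPA scale is 1/sqrt(64) = 0.125
scale = 1.0 / np.sqrt(64.0)
out_scaled_scores = explicit_attention(q, k, v, scale)       # scale on the scores
out_prescaled_q = explicit_attention(q * scale, k, v, 1.0)   # scale folded into q

assert np.allclose(out_scaled_scores, out_prescaled_q)       # same result
assert not np.allclose(out_scaled_scores,
                       explicit_attention(q, k, v, 1.0))     # wrong scale differs
```

This is why the B=0.125 constant discussed below is the one that has to survive in the compiled graph.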
This is an image comparing the two ONNX files: explicit version (left) and scaled-dot-product version (right).
Interestingly, patching the bad ONNX so that B=0.125 on the right node (and B=1 on the left) fixes the issue, but not the other way around.
## Steps To Reproduce

Prints:

Commands or scripts:
**Have you tried the latest release?** Yes
**Can this model run on other frameworks?** For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): Yes, running with ONNX Runtime works as expected.