Performance Issue when using tools/llm #3803

@ChiikawaSama

Description

❓ Question

What you have already tried

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • PyTorch Version (e.g., 1.0): 2.8.0
  • CPU Architecture: AMD (x86_64)
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Build command you used (if compiling from source): N/A (installed prebuilt wheel)
  • Are you using local sources or building from archives: No (prebuilt wheel)
  • Python version: 3.10
  • CUDA version: 12.8
  • GPU models and configuration: NVIDIA
  • Any other relevant information: directly using the torch-tensorrt 2.8.0 wheel (installed via pip) to run tools/llm from the GitHub 2.8.0 tag

Additional context

Hi there, I tried to use tools/llm with static_cache_v2 to run the Qwen2.5 model, using the following command:

python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark

When I profiled with Nsight Systems, I found that using static_cache_v2 adds launch overhead to the TensorRT engine in every prefill/decode block. Do you see this problem too? I think this overhead is too large; it brings Torch-TensorRT down to roughly the same speed as just enabling torch.compile.
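
For anyone trying to reproduce the trace, a capture along these lines should work (the trace flags below are just a suggested nsys invocation, not necessarily the exact one used here):

nsys profile -o qwen_static_v2 --trace=cuda,nvtx,osrt python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark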

Here is the nsys profiling result: the red line marks approximately 1.7 ms of overhead with no GPU activity at all (when static_cache_v2 is disabled there are no such bubbles; perhaps it comes from shape copies or other operators introduced by static_cache_v2?).

[nsys timeline screenshot: ~1.7 ms gap with no GPU activity between prefill/decode blocks]
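
As a cross-check outside of nsys, the per-step cost (including any host-side launch gap) can also be measured with plain wall-clock timing around each decode call. A minimal sketch, where model, input_ids, and the greedy next-token step are placeholders rather than the actual run_llm.py internals:

import time
import torch

@torch.no_grad()
def avg_step_ms(model, input_ids, num_steps=32):
    # Average wall-clock milliseconds per step; this includes host-side launch
    # overhead, which is what the ~1.7 ms bubble in the nsys timeline would add.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(num_steps):
        logits = model(input_ids)                  # one call into the compiled prefill/decode graph
        input_ids = logits[:, -1:].argmax(dim=-1)  # placeholder greedy step; the real loop also updates the KV cache
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000.0 / num_steps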

Looking forward to your reply, thanks a lot!
