Integrated VLM benchmark code for Eagle2 #3698

Open · wants to merge 9 commits into base: main
63 changes: 61 additions & 2 deletions docsrc/tutorials/compile_hf_models.rst
@@ -18,6 +18,7 @@ Overview of tools/llm Directory
The ``tools/llm`` directory provides the following tools to compile LLM models from Huggingface:

* **run_llm.py**: Main entry point for model compilation, generating outputs, and benchmarking
* **run_vlm.py**: Entry point for compiling and benchmarking Visual Language Models (VLMs)
* **Static Cache Utilities**: ``static_cache_v1.py`` and ``static_cache_v2.py`` for KV cache optimization
* **SDPA Attention**: ``sdpa_converter.py`` and ``register_sdpa.py`` for registering the scaled dot-product attention converter and lowering pass.
* **Testing Components**: Model-specific test files for validation
@@ -60,6 +61,30 @@ We have officially verified support for the following LLM families:
- FP16, FP32
- Yes

Supported VLM Models
--------------------
We have officially verified support for the following Visual Language Models (VLMs):

.. list-table::
:widths: 20 40 20 20 20
:header-rows: 1

* - Model Series
- HuggingFace Model Card
- Precision
- KV Cache Support?
- Component Support
* - Qwen 2.5 VL
- Qwen/Qwen2.5-VL-3B-Instruct
- FP16, FP32
- Yes (static_v1 only)
- Language Model only (Image Encoder not supported)
* - Eagle2
- nvidia/Eagle2-2B
- FP16, FP32
- Yes (static_v1 only)
- Language Model and Image Encoder both supported

Getting Started with run_llm.py
-------------------------------

@@ -112,6 +137,36 @@ Other Usage Examples
python tools/llm/run_llm.py --model Qwen/Qwen2.5-1.5B-Instruct --precision FP32 --benchmark


Getting Started with run_vlm.py
-------------------------------

For Visual Language Models (VLMs), use ``run_vlm.py`` to compile and benchmark models that process both text and images.

Basic Usage
^^^^^^^^^^^

.. code-block:: bash

python tools/llm/run_vlm.py \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--precision FP16 \
--num_tokens 128 \
--cache static_v1 \
--enable_pytorch_run \
--benchmark

Key Arguments
^^^^^^^^^^^^^

* ``--model``: Name or path of the HuggingFace VLM
* ``--prompt``: Input prompt for generation
* ``--image_path``: (Optional) Path to the input image file. If not provided, a sample image is used
* ``--precision``: Precision mode (``FP16``, ``FP32``)
* ``--num_tokens``: Number of output tokens to generate
* ``--cache``: KV cache type (``static_v1`` or empty for no KV caching)
* ``--benchmark``: Enable benchmarking mode
* ``--enable_pytorch_run``: Also run and compare PyTorch baseline
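
For example, to benchmark Eagle2 with a user-supplied image, a similar invocation can be used. The command below is a sketch that only reuses the flags documented above; ``./sample_image.jpg`` is a placeholder path for illustration.

.. code-block:: bash

    python tools/llm/run_vlm.py \
      --model nvidia/Eagle2-2B \
      --image_path ./sample_image.jpg \
      --precision FP16 \
      --num_tokens 128 \
      --cache static_v1 \
      --enable_pytorch_run \
      --benchmark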

KV Caching in Torch-TensorRT
---------------------------------

@@ -122,7 +177,7 @@ The length of KV cache = input sequence length + output sequence length (specifi
Static Cache v1
^^^^^^^^^^^^^^^^

``static_cache_v1.py`` implements the KV cache in the model graph as follows:

.. code-block:: python

@@ -210,9 +265,13 @@ Limitations and Known Issues

* Sliding window attention (used in Gemma3 and Qwen 3 models) is not yet supported
* Some model architectures (e.g. Phi-4) have issues with exporting the torch model.
* For VLMs, Qwen2.5-VL image encoder compilation is not supported due to dynamic operations that are incompatible with ``torch.export``.

Requirements
^^^^^^^^^^^^

* Torch-TensorRT 2.8.0 or later
* Transformers v4.52.3
* For VLM models (``run_vlm.py``):

  - ``pip install qwen-vl-utils`` (for the Qwen2.5-VL-3B-Instruct model)
  - ``pip install flash-attn --no-build-isolation -v`` (for the Eagle2-2B model)
25 changes: 21 additions & 4 deletions tools/llm/README.md
@@ -1,10 +1,11 @@
# Optimizing LLMs in Torch-TensorRT

This directory provides utilities and scripts for compiling, optimizing, and benchmarking Large Language Models (LLMs) and Visual Language Models (VLMs) using Torch-TensorRT, with a focus on efficient inference on NVIDIA GPUs. The main entry points are `run_llm.py` for text-only LLMs and `run_vlm.py` for vision-language models. Note that this is an **experimental release** and APIs may change in future versions.

### Key Features

- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **KV Cache:** Supports static and dynamic KV cache for efficient autoregressive decoding.
- **Benchmarking:** Measures and compares throughput and latency for PyTorch and TensorRT backends.
@@ -24,20 +25,33 @@ We have officially verified support for the following models:
| Qwen 2.5 | Qwen/Qwen2.5-0.5B-Instruct<br>Qwen/Qwen2.5-1.5B-Instruct<br>Qwen/Qwen2.5-4B-Instruct<br>Qwen/Qwen2.5-7B-Instruct | FP16, FP32 | Yes |
| Qwen 3 | Qwen/Qwen3-0.6B<br>Qwen/Qwen3-1.7B<br>Qwen/Qwen3-4B<br>Qwen/Qwen3-8B | FP16, FP32 | Yes |

### Supported VLM Models

| Model Series | HF Model Card | Precision | KV Cache Supported? |
|--------------|---------------|-----------|-------------------|
| Qwen 2.5 VL | Qwen/Qwen2.5-VL-3B-Instruct | FP16, FP32 | Yes |
| Eagle2 | nvidia/Eagle2-2B | FP16, FP32 | Yes |

### Usage

#### Text-only LLMs: `run_llm.py`

```bash
python run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
```

#### Vision Language Models: `run_vlm.py`

```bash
python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```
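
For Eagle2, a similar command should work. The example below is illustrative only and reuses just the flags documented under Key Arguments below; replace `./sample_image.jpg` with the path to a real image:

```bash
python run_vlm.py --model nvidia/Eagle2-2B --image_path ./sample_image.jpg --precision FP16 --num_tokens 128 --cache static_v1 --benchmark
```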

#### Key Arguments

- `--model`: Name or path of the HuggingFace LLM/VLM.
- `--tokenizer`: (Optional) Tokenizer name; defaults to model.
- `--prompt`: Input prompt for generation.
- `--image_path`: (Optional) Path to the input image file for VLM models. If not provided, a sample image is used.
- `--precision`: Precision mode (`FP16`, `FP32`).
- `--num_tokens`: Number of output tokens to generate.
- `--cache`: KV cache type (`static_v1`, `static_v2`, or empty for no KV caching).
@@ -64,4 +78,7 @@ This codebase can be extended to
## Requirements

- Torch-TensorRT 2.8.0
- Transformers v4.52.3
- For VLM models (`run_vlm.py`):
  - `pip install qwen-vl-utils` (for the Qwen2.5-VL-3B-Instruct model)
  - `pip install flash-attn --no-build-isolation -v` (for the Eagle2-2B model)