
Accelerating a 4-bit Quantized Llama Model

Overview

This guide demonstrates how to accelerate a 4-bit quantized Llama model using the TensorRTLLM engine. TensorRTLLM is a high-performance inference engine that leverages NVIDIA's TensorRT library to optimize and accelerate models for deployment on NVIDIA GPUs.

Table of Contents

  • Introduction
  • Requirements
  • Installation
  • Configuration
  • Running the Engine Build
  • Performance Evaluation
  • Conclusion

Introduction

In this example, we'll demonstrate how to accelerate a 4-bit quantized model using the TensorRTLLM engine. The process involves quantizing the model using the AWQ quantization technique and then optimizing it for deployment on NVIDIA GPUs using TensorRTLLM.
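To make the target format concrete, here is a minimal, hypothetical PyTorch sketch of group-wise INT4 weight quantization, the representation AWQ-style methods produce. It is an illustration only, not Nyuntam's or TensorRT-LLM's implementation; real AWQ additionally rescales salient weight channels based on activation statistics before quantizing.

# int4_groupwise_sketch.py -- illustrative only, not part of Nyuntam
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Quantize a 2-D weight matrix to signed INT4 with one scale per group."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Symmetric quantization: map the largest |weight| in each group to 7.
    scales = groups.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales  # 4-bit codes (packed two-per-byte in practice) + scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(32, 256)
q, s = quantize_int4_groupwise(w)
print((dequantize(q, s) - w).abs().mean())  # mean round-trip error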

Requirements

Before you begin, ensure that you have the following:

  • A GPU-enabled environment with CUDA support.
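You can quickly confirm that the environment sees a CUDA-capable GPU:

nvidia-smi # should list your GPU(s) along with the driver and CUDA versions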

Installation

Step 1: Clone the Nyuntam Repository

Clone the repository and navigate to the TensorRT-LLM example directory:

git clone https://github.com/nyunAI/nyuntam.git
cd nyuntam/examples/text-generation/tensorrtllm_engine/

Step 2: Set Up the Workspace

Create and activate an environment for the AWQ quantization example:

conda create -n tensorrtllm_engine python=3.10 -y # or use virtualenv if preferred
conda activate tensorrtllm_engine

Install the required dependencies:

pip install git+https://github.com/nyunAI/nyunzero-cli.git

Set up the nyun workspace:

mkdir workspace && cd workspace
nyun init -e kompress-text-generation # wait for the extensions to be installed

Configuration

Prepare the YAML configuration file specific to AWQ quantization. Use the following template as a starting point:

# tensorrtllm_engine.yaml

# Model configuration
MODEL: "meta-llama/Llama-2-7b-hf"

# Data configuration
DATASET_NAME: "wikitext"
DATASET_SUBNAME: "wikitext-2-raw-v1"
TEXT_COLUMN: "text"                     
SPLIT: "train"

DATA_PATH:
FORMAT_STRING:

# Acceleration configuration
llm:
  TensorRTLLM:
    to_quantize: true # quantize the model first, then build the engine (supported only for llama, gptj, & falcon models)
    dtype: float16

    # quantization parameters
    quant_method: "int4_awq" # one of 'fp8', 'int4_awq', 'smoothquant', 'int8'
    smoothquant: 0.5 # used only when quant_method is 'smoothquant'
    calib_size: 32

    # ...other params

# Job configuration
CUDA_ID: "0"
ALGORITHM: "TensorRTLLM"
JOB_SERVICE: "Kompress"
USER_FOLDER: "/user_data/example"
JOB_ID: "tensorrtllm_engine"
CACHE_PATH: "/user_data/example/.cache"
JOB_PATH: "/user_data/example/jobs/tensorrtllm_engine"
LOGGING_PATH: "/user_data/example/logs/tensorrtllm_engine"
ALGO_TYPE: "llm"
TASK: "llm"
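For comparison, switching the same template to a SmoothQuant engine build should only require changing the quantization keys. A sketch of that variant (same keys as above; values are assumptions, adjust to your setup):

# Acceleration configuration (SmoothQuant variant)
llm:
  TensorRTLLM:
    to_quantize: true
    dtype: float16
    quant_method: "smoothquant"
    smoothquant: 0.5 # migration-strength alpha in [0, 1]; 0.5 is the common default
    calib_size: 32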

Running the Engine Build

With your YAML file configured (e.g., saved as tensorrtllm_engine.yaml in the directory containing workspace/), initiate the process from inside the workspace by running:

nyun run ../tensorrtllm_engine.yaml

Monitor the process to ensure that the quantization completes successfully.
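For example, you can follow the job logs while the build runs (the path mirrors LOGGING_PATH from the config; exact file names may vary):

tail -f workspace/example/logs/tensorrtllm_engine/* # run from the directory containing workspace/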

Once the job starts, you'll find the following directory structure in the workspace folder:

workspace/
├── custom_data
└── example
    ├── datasets
    │   └── wikitext
    ├── jobs
    │   └── Kompress
    │       └── tensorrtllm_engine
    ├── logs
    │   └── tensorrtllm_engine
    └── models
        └── meta-llama
            └── Llama-2-7b-hf
                ...

The output model will be saved in the workspace/example/jobs/Kompress/tensorrtllm_engine directory.
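As a rough sketch of consuming the built engine, TensorRT-LLM ships a Python runtime (ModelRunner). The snippet below follows the pattern of TensorRT-LLM's bundled run examples, but signatures vary between releases, so treat it as an outline rather than a drop-in script:

# hypothetical sketch -- verify against your installed tensorrt_llm version
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "workspace/example/jobs/Kompress/tensorrtllm_engine"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
runner = ModelRunner.from_dir(engine_dir)

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
out = runner.generate(batch_input_ids=[ids[0]], max_new_tokens=32,
                      end_id=tokenizer.eos_token_id, pad_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][0], skip_special_tokens=True))  # output is [batch, beam, tokens]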

Performance Evaluation

The table below compares the quantized variants* against the original FP16 model to assess the impact of quantization on memory, inference speed, and accuracy. WM is weight memory and RM is running memory, both in GB; accuracy is reported as perplexity (lower is better).

| Model | Optimised with | Quantization Type | WM (GB) | RM (GB) | Tokens/s | Perplexity |
| --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7b-hf | TensorRT-LLM | AWQ GEMM 4bit (quant_method=int4_awq) | 3.42 | 5.69 | 194.86 | 6.02 |
| meta-llama/Llama-2-7b-hf | TensorRT-LLM | INT8 (quant_method=int8) | 6.53 | 8.55 | 143.57 | 5.89 |
| meta-llama/Llama-2-7b-hf | TensorRT-LLM | FP16 (to_quantize=false) | 12.55 | 14.61 | 83.43 | 5.85 |
| meta-llama/Llama-2-7b-hf | Text-Generation-Inference | AWQ GEMM 4bit | 3.62 | 36.67 | 106.84 | 6.02 |
| meta-llama/Llama-2-7b-hf | Text-Generation-Inference | FP16 | 12.55 | 38.03 | 74.19 | 5.85 |

*Source: Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward (arXiv:2402.01799)
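Reading off the TensorRT-LLM rows: INT4 AWQ cuts weight memory roughly 3.7x (12.55 GB → 3.42 GB) and raises throughput roughly 2.3x (83.43 → 194.86 tokens/s) relative to FP16, at a perplexity cost of only 0.17 (5.85 → 6.02).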

Conclusion

In this example, we demonstrated how to accelerate a 4-bit quantized Llama-2-7b model using the TensorRTLLM engine. By leveraging the nyun CLI, we optimized the model for deployment on NVIDIA GPUs, achieving significant improvements in inference speed and memory efficiency.


Author: Kushwaha, Shubham

Additional Examples