This repository shows how to build ChatGLM2-6B and run inference with TensorRT-LLM (TRT-LLM). This document explains how to build the ChatGLM2-6B model using TensorRT-LLM and how to run it on a single GPU.
The TensorRT-LLM ChatGLM2-6B implementation can be found in `examples/chatglm2-6b/model.py`. The TensorRT-LLM ChatGLM2-6B example code is located in `examples/chatglm2-6b`. There are several main files in that folder:
* `build.py` loads a checkpoint from the HuggingFace (HF) Transformers format into the TensorRT-LLM ChatGLM2-6B network and builds the TensorRT engine(s) needed to run the ChatGLM2-6B model.
* `run.py` runs the inference on an input text.
The next sections describe how to build the engine and run the inference demo.
Install `git-lfs` and download the HuggingFace checkpoint into the `pyTorchModel` directory:

```bash
apt-get update
apt-get install git-lfs
# Set up the git-lfs filters so that the large weight files are fetched during the clone.
git lfs install
git clone https://huggingface.co/THUDM/chatglm2-6b pyTorchModel
```
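To make sure the multi-gigabyte weight shards were actually fetched by git-lfs (and are not just small pointer files), you can check their sizes. The shard file names below are assumed from the current layout of the `THUDM/chatglm2-6b` repository on the Hugging Face Hub and may change:

```bash
# Each weight shard should be on the order of gigabytes, not a few hundred bytes.
du -h pyTorchModel/pytorch_model-*.bin
```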
TensorRT-LLM builds TensorRT engine(s) after loading the weights from the HuggingFace PyTorch model. The `build.py` script requires a single GPU to build the TensorRT engine(s).

Example of a build invocation:
```bash
python3 build.py --model_dir=./pyTorchModel \
                 --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16
```
You can enable INT8 weight-only quantization by adding `--use_weight_only`; this significantly lowers the latency and memory footprint. An example invocation is shown below.
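For example, adding the flag to the baseline invocation above (only `--use_weight_only` is new; everything else is unchanged):

```bash
# Build with INT8 weight-only quantization enabled.
python3 build.py --model_dir=./pyTorchModel \
                 --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_weight_only
```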
You can enable the FMHA kernels for ChatGLM2-6B by adding `--enable_context_fmha` to the invocation of `build.py`, as shown below. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.
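For example (again, only the extra flag differs from the baseline invocation):

```bash
# Build with the fused multi-head attention (FMHA) context kernels enabled.
python3 build.py --model_dir=./pyTorchModel \
                 --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --enable_context_fmha
```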
To run a TensorRT-LLM ChatGLM2-6B model on a single GPU, you can use `python3`:

```bash
# Run the ChatGLM2-6B model on a single GPU.
python3 run.py
```
(TODO)
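The full `run.py` invocation is still marked as TODO above. As a rough sketch only, assuming `run.py` accepts arguments similar to other TensorRT-LLM example scripts (the flag names below are assumptions, not confirmed by this folder; check `python3 run.py --help` for the actual interface):

```bash
# Hypothetical invocation; --input_text and --max_output_len are assumed flag names.
python3 run.py --input_text "What is deep learning?" \
               --max_output_len 128
```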