This quickstart is based on the official vLLM guide, adapted to deploy Qwen2.5-72B-Instruct with single-node multi-GPU (tensor parallel) inference on AWS g6e.12xlarge (4 NVIDIA L40S Tensor Core GPUs, 192 GiB GPU memory, 3800 GB NVMe instance store SSD).
- Launch an EC2 g6e.12xlarge instance with the AMI
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6.0 (Ubuntu 22.04)
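If you prefer the AWS CLI to the console, the launch can be scripted. The sketch below uses placeholder values: the AMI ID is region-specific (look up the ID of the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6.0 (Ubuntu 22.04) for your region), and the key pair and security group are assumptions.
# sketch only: replace the placeholder AMI ID, key pair, and security group
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type g6e.12xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx \
    --count 1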
- Install vLLM. First verify the NVIDIA driver and CUDA toolkit:
nvidia-smi
nvcc --version
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:38:00.0 Off | 0 |
| N/A 26C P8 22W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S On | 00000000:3A:00.0 Off | 0 |
| N/A 26C P8 22W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S On | 00000000:3C:00.0 Off | 0 |
| N/A 27C P8 22W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S On | 00000000:3E:00.0 Off | 0 |
| N/A 26C P8 19W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
# setup python environment
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
# install vllm
uv pip install vllm
uv pip show vllm
uv pip show pynvml
uv pip show torch
# validate that PyTorch reports the right CUDA version and CUDA is available
python -c "import torch; print(torch.version.cuda)"
python -c 'import torch; print(torch.cuda.is_available())'
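The tensor parallel examples below assume 4 GPUs, so it also helps to confirm that PyTorch sees all four L40S cards; these one-liners only use standard torch.cuda calls.
# confirm all 4 L40S GPUs are visible
python -c 'import torch; print(torch.cuda.device_count())'
python -c 'import torch; print(torch.cuda.get_device_name(0))'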
- Quick test based on the offline_inference basic guide
git clone https://github.com/vllm-project/vllm.git
cd vllm
python examples/offline_inference/basic/basic.py
- Run single-node multi-GPU (tensor parallel) inference with --tensor-parallel-size 4
from vllm import LLM
import torch._dynamo

torch._dynamo.config.suppress_errors = True

llm = LLM("mistralai/Mistral-7B-Instruct-v0.3", device="cuda", tensor_parallel_size=4)
print(llm.generate("What is batch inference?"))
Or from the command line:
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --device cuda \
    --tensor-parallel-size 4
- Troubleshooting
- RuntimeError: Failed to infer device type
The error "RuntimeError: Failed to infer device type" in vLLM on Ubuntu indicates that the system cannot determine whether to run vLLM on the CPU or the GPU. Specify the device explicitly:
# specify the device
from vllm import LLM
llm = LLM(model="Qwen/Qwen2-7B-Instruct", device="cuda")  # for GPU
# or
# llm = LLM(model="Qwen/Qwen2-7B-Instruct", device="cpu")  # for CPU
Or set the target device via an environment variable before running your script:
export VLLM_TARGET_DEVICE="cuda"  # for GPU
# or: export VLLM_TARGET_DEVICE="cpu"  # for CPU
python your_script.py
- raise NotImplementedError
The NotImplementedError indicates that a specific method, in this case is_async_output_supported(), lacks a concrete implementation for the environment where vLLM is running. vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
- Select an NVIDIA deep learning AMI such as
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6.0 (Ubuntu 22.04)
and follow this blog: self-host-llm-with-ec2-vllm-langchain-fastapi-llm
- Ensure CUDA is correctly installed. Check the CUDA version with
nvcc --version; nvidia-smi
- If nvidia-smi returns nothing or nvcc reports that it is not installed:
sudo apt install nvidia-driver-570
sudo apt install nvidia-cuda-toolkit
pip install torch torchvision
sudo reboot
- Check the known issue: is_async_output_supported raises NotImplementedError
pip install --upgrade pynvml
- Check whether you have GPU support
python -c "import torch; print(torch.version.cuda)"
python -c 'import torch; print(torch.cuda.is_available())'
If torch.cuda.is_available() returns False, PyTorch did not detect CUDA. Possible causes include an incorrectly installed CUDA or an incompatible CUDA version; try reinstalling matching CUDA and PyTorch versions.
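A common fix is to reinstall a CUDA-enabled PyTorch wheel that matches the installed driver. The sketch below assumes a CUDA 12.x driver and the cu124 wheel index; check the PyTorch install matrix for the index that matches your setup.
# sketch: reinstall a CUDA-enabled PyTorch build (cu124 index assumed)
pip uninstall -y torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# verify again
python -c 'import torch; print(torch.version.cuda, torch.cuda.is_available())'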
- Cannot access gated repo for url https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/resolve/main/config.json
Access to mistralai/Mixtral-8x7B-Instruct-v0.1 is restricted: you must have been granted access to the model and be authenticated. Log in first, for example from Python:
from huggingface_hub import login
login(token="your_access_token")
or
pip install --upgrade huggingface_hub
huggingface-cli login --token YOUR_ACCESS_TOKEN
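Alternatively, the token can be supplied through the environment so that downloads of gated weights work non-interactively; HF_TOKEN is the variable huggingface_hub reads, and the value below is a placeholder.
# HF_TOKEN is picked up by huggingface_hub; the value is a placeholder
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
# verify authentication
huggingface-cli whoami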
- start hosting
vllm serve Qwen/Qwen2.5-1.5B-Instruct
- testing with curl
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0.7
    }'
- testing with python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)
print("Chat response:", chat_response)
- Single GPU hosting
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --max-model-len 32768
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Please reason step by step, and put your final answer within \boxed{}. 用 4、1、9 组成的三位数造减法塔，最后一层的算式是什么？"}],
        "temperature": 0.7
    }'
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Please reason step by step. 甲、乙、丙、丁四个人上大学的时候在一个宿舍住，毕业10年后他们又约好回母校相聚。老朋友相见分外热情。四个人聊起来，知道了这么一些情况：只有三个人有自己的车；只有两个人有自己喜欢的工作；只有一个人有了自己的别墅；每个人至少具备一样条件；甲和乙对自己的工作条件感觉一样；乙和丙的车是同一牌子的；丙和丁中只有一个人有车。如果有一个人三种条件都具备，那么，你知道他是谁吗？"}],
        "temperature": 0.7
    }'
- multi GPU on Single Node
VLLM_LOG_LEVEL=debug vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max_num_seqs 2048 \
    --enforce-eager \
    --gpu_memory_utilization=0.8 \
    --enable-chunked-prefill
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Please reason step by step, and put your final answer within \boxed{}. AI是否取代人类？AI和人类如何共处？"}],
        "temperature": 0.7
    }'
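The introduction targets Qwen2.5-72B-Instruct; a sketch of serving it the same way is below. The reduced --max-model-len and the raised --gpu-memory-utilization are assumptions to leave room for KV cache next to the roughly 144 GB of BF16 weights across the 4 x 48 GB L40S GPUs; tune both for your workload.
# sketch: serve Qwen2.5-72B-Instruct across the 4 L40S GPUs
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95
Downloading the 72B weights also takes well over 100 GB of disk, which is where the large NVMe instance store helps.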
Chatting with DeepSeek from the Linux command line is not very user-friendly, so we can install Open WebUI as a web interface.
- Install Open WebUI on the same g6e.12xlarge.
# install Conda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
# Install Open WebUI
source ~/.bashrc
pip install open-webui
open-webui serve
- Access http://localhost:8080 or http://<instance-IP>:8080
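To have Open WebUI talk to the local vLLM server, add vLLM's OpenAI-compatible endpoint (http://localhost:8000/v1) as an OpenAI API connection in the admin settings, or restart Open WebUI with the connection preset through environment variables. The variable names below come from Open WebUI's OpenAI connection settings; treat them as assumptions to verify against your Open WebUI version.
# restart Open WebUI with the vLLM OpenAI-compatible endpoint preconfigured
export OPENAI_API_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY   # vLLM does not require a real key by default
open-webui serve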