vLLM Quickstart

This quickstart is based on the official vLLM guide, adapted to deploy Qwen2.5-72B-Instruct with single-node multi-GPU (tensor parallel) inference on an AWS g6e.12xlarge instance (4 NVIDIA L40S Tensor Core GPUs, 192 GiB GPU memory, 3,800 GB NVMe instance store SSD).
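
As a preview of the target setup, here is a minimal sketch of loading Qwen2.5-72B-Instruct across the four L40S GPUs with the vLLM Python API. The memory settings are illustrative assumptions: the BF16 weights are roughly 145 GB, so the context length is kept small to leave room for the KV cache.

    # A minimal sketch of the target deployment (illustrative settings, not tuned).
    # Qwen2.5-72B-Instruct in BF16 is roughly 145 GB of weights, so the context
    # length is kept small to leave room for the KV cache on 4 x 48 GB L40S GPUs.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-72B-Instruct",
        tensor_parallel_size=4,          # shard the model across the 4 L40S GPUs
        max_model_len=8192,              # assumed value; raise or lower to fit memory
        gpu_memory_utilization=0.95,
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Summarize tensor parallelism in one sentence."], params)
    print(outputs[0].outputs[0].text)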

Installation

  1. Launch an EC2 g6e.12xlarge instance with the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6.0 (Ubuntu 22.04) AMI
  2. Install vLLM. First verify the GPU driver and CUDA toolkit:
nvidia-smi
nvcc --version

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:38:00.0 Off |                    0 |
| N/A   26C    P8             22W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    On  |   00000000:3A:00.0 Off |                    0 |
| N/A   26C    P8             22W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    On  |   00000000:3C:00.0 Off |                    0 |
| N/A   27C    P8             22W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    On  |   00000000:3E:00.0 Off |                    0 |
| N/A   26C    P8             19W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

# setup python environment
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv myenv --python 3.12 --seed
source myenv/bin/activate

# install vllm
uv pip install vllm

uv pip show vllm
uv pip show pynvml
uv pip show torch

# validate that PyTorch reports the expected CUDA version and can see the GPUs
python -c "import torch; print(torch.version.cuda)"
python -c 'import torch; print(torch.cuda.is_available())'
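
Optionally, confirm that all four GPUs are visible to PyTorch, since the tensor parallel examples later in this guide assume tensor_parallel_size=4. A small sketch:

    # Sanity check: PyTorch should see all four L40S GPUs before we rely on
    # tensor_parallel_size=4 later in this guide.
    import torch

    assert torch.cuda.is_available(), "CUDA is not available; recheck the driver install"
    count = torch.cuda.device_count()
    print(f"Visible GPUs: {count}")
    for i in range(count):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")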

Offline Batched Inference

  1. Quick test based on the offline_inference basic example (a minimal equivalent sketch appears after this list)

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    python examples/offline_inference/basic/basic.py
    
  2. Run Single-Node Multi-GPU (tensor parallel inference) with --tensor-parallel-size 4

    from vllm import LLM
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
    
    llm = LLM("mistralai/Mistral-7B-Instruct-v0.3", device="cuda", tensor_parallel_size=4)
    
    print(llm.generate("What is batch inference?"))

    Or from the command line:

    python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --device cuda \
    --tensor-parallel-size 4
  3. Troubleshooting

  • RuntimeError: Failed to infer device type. This error in vLLM on Ubuntu means the system cannot determine whether to run vLLM on the CPU or the GPU. Specify the device explicitly:

    # Option 1: specify the device in Python
    from vllm import LLM

    llm = LLM(model="Qwen/Qwen2-7B-Instruct", device="cuda")  # for GPU
    # llm = LLM(model="Qwen/Qwen2-7B-Instruct", device="cpu")  # for CPU

    # Option 2: set the target device via an environment variable, then run the script
    export VLLM_TARGET_DEVICE="cuda"  # for GPU (or "cpu" for CPU)
    python your_script.py
  • raise NotImplementedError. This error indicates that a specific method, here is_async_output_supported(), has no concrete implementation for the environment where vLLM is running. vLLM ships pre-compiled C++ and CUDA (12.1) binaries, so the runtime environment must match.

    • Select an NVIDIA deep learning AMI such as Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.6.0 (Ubuntu 22.04). See the blog post self-host-llm-with-ec2-vllm-langchain-fastapi-llm.
    • Ensure CUDA is correctly installed. Check the CUDA version with nvcc --version; nvidia-smi
      • If nvidia-smi returns nothing or nvcc reports that it is not installed:
        sudo apt install nvidia-driver-570
        sudo apt install nvidia-cuda-toolkit
        pip install torch torchvision
        sudo reboot
        
    • Check the known issue is_async_output_supported raises NotImplementedError and upgrade pynvml: pip install --upgrade pynvml
    • Check that you have GPU support: python -c "import torch; print(torch.version.cuda)"; python -c 'import torch; print(torch.cuda.is_available())'. If torch.cuda.is_available() returns False, PyTorch did not detect CUDA; possible causes include an incorrectly installed or incompatible CUDA version, so try reinstalling matching CUDA and PyTorch versions.
  • Cannot access gated repo for url https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/resolve/main/config.json. Access to model mistralai/Mixtral-8x7B-Instruct-v0.1 is restricted; you must have been granted access and be authenticated to download it. Request access on the model page, then log in:

    from huggingface_hub import login
    login(token="your_access_token")

    or

    pip install --upgrade huggingface_hub
    huggingface-cli login
    YOUR_ACCESS_TOKEN
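
As referenced in step 1 above, the following is a minimal sketch in the spirit of examples/offline_inference/basic/basic.py; the official example may differ in details, and facebook/opt-125m is just a small test model.

    # Minimal offline batched inference, in the spirit of the official basic example.
    from vllm import LLM, SamplingParams

    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Small model so it fits on a single GPU; swap in a larger model as needed.
    llm = LLM(model="facebook/opt-125m")

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")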
    

OpenAI-Compatible Server

  1. Start the server:
    vllm serve Qwen/Qwen2.5-1.5B-Instruct
  2. Test with curl:
    curl http://localhost:8000/v1/models
    
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen2.5-1.5B-Instruct",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0.7
        }'
  3. Test with Python:
    from openai import OpenAI
    # Set OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    chat_response = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."},
        ]
    )
    print("Chat response:", chat_response)
    
    chat_response = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    )
    print("Chat response:", chat_response)

Run DeepSeek-R1

  1. Single-GPU hosting:
    vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B  --max-model-len 32768
    
    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
            "messages": [{"role": "user", "content": "Please reason step by step, and put your final answer within \boxed{}. 用 4、1、9 组成的三位数造减法塔,最后一层的算式是什么?"} ],
            "temperature": 0.7
        }'

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Please reason step by step. 甲、乙、丙、丁四个人上大学的时候在一个宿舍住,毕业10年后他们又约好回母校相聚。老朋友相见分外热情。四个人聊起来,知道了这么一些情况:只有三个人有自己的车 ;只有两个人有自己喜欢的工作;只有一个人有了自己的别墅;每个人至少具备一样条件;甲和乙对自己的工作条件感觉一样;乙和丙的车是同一牌子的;丙和丁中只有一个人有车。如果有一个人三种条件都具备,那么,你知道他是谁吗 ?"} ],
        "temperature": 0.7 
        }'
    
  2. Multi-GPU on a single node:
    VLLM_LOG_LEVEL=debug vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --tensor-parallel-size 4 --max-model-len 32768 --max_num_seqs 2048 --enforce-eager --gpu_memory_utilization=0.8 --enable-chunked-prefill
    
    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
            "messages": [{"role": "user", "content": "Please reason step by step, and put your final answer within \boxed{}. AI是否取代人类?AI和人类如何共处?"} ],
            "temperature": 0.7
        }'
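
The distilled R1 models usually wrap their chain of thought in <think>...</think> tags before the final answer. Assuming that format, the reasoning can be separated from the answer on the client side; a minimal sketch using the OpenAI client against the server started above:

    # Split DeepSeek-R1's <think>...</think> reasoning from the final answer.
    # Assumes the server started above and the usual R1 output format.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        messages=[{"role": "user", "content": "Please reason step by step: what is 17 * 24?"}],
        temperature=0.7,
    )
    text = resp.choices[0].message.content
    reasoning, sep, answer = text.partition("</think>")
    if sep:  # the expected <think>...</think> format was found
        print("Reasoning:", reasoning.replace("<think>", "").strip())
        print("Answer:", answer.strip())
    else:
        print(text)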

Open WebUI

Chatting with DeepSeek from the Linux command line is not very user-friendly, so we can install Open WebUI as a web interface.

  1. Install Open WebUI on the same g6e.12xlarge.
# install Conda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash

# Install Open WebUI
source ~/.bashrc
pip install open-webui
open-webui serve
  2. Access http://localhost:8080 or http://IP:8080
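  3. Point Open WebUI at the vLLM server. In Open WebUI's connection settings, add an OpenAI-compatible API connection with base URL http://localhost:8000/v1 and any non-empty API key (for example EMPTY), so the web UI chats with the model served by vLLM above; the exact menu location may vary between Open WebUI versions.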