The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang also supports MLA optimization and DP attention, making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official DeepSeek team.
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
For SGLang's optimizations for the DeepSeek model series, please refer to DeepSeek Model Optimizations in SGLang.
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded.
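For example, one way to pre-download the weights into the local Hugging Face cache (a sketch assuming the huggingface_hub CLI is installed; the cache directory is the same one mounted into the container below):
# Pre-download the checkpoint into ~/.cache/huggingface so the server can start without waiting
huggingface-cli download deepseek-ai/DeepSeek-V3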
# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest
# Launch
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
If you are using RDMA, please note that:
- --network host and --privileged are required by RDMA. If you don't need RDMA, you can remove them.
- You may need to set NCCL_IB_GID_INDEX if you are using RoCE, for example: export NCCL_IB_GID_INDEX=3.
Add performance optimization options as needed.
# Installation
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
# Launch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
Add performance optimization options as needed.
MLA optimizations are enabled by default. The following optional optimizations can be enabled as needed.
- Data Parallelism Attention: For high-QPS scenarios, add the --enable-dp-attention argument to boost throughput.
- Torch.compile Optimization: Add the --enable-torch-compile argument to enable it. This adds some time to server startup. The maximum batch size for torch.compile optimization can be controlled with --torch-compile-max-bs; it's recommended to set it between 1 and 8 (e.g., --torch-compile-max-bs 8). A combined launch example follows this list.
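As a sketch, a single-node launch that combines both options (the flag values here are illustrative, not tuned) could look like:
# Launch with DP attention and torch.compile enabled
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention --enable-torch-compile --torch-compile-max-bs 8
Once the server is up, it can be queried through its OpenAI-compatible API, as in the client example below.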
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
For example, there are two H20 nodes, each with 8 GPUs. The first node's IP is 10.0.0.1, and the second node's IP is 10.0.0.2. Please use the first node's IP for both commands.
If the command fails, try setting the GLOO_SOCKET_IFNAME parameter. For more information, see Common Environment Variables.
If the nodes are connected with NVIDIA InfiniBand and the startup hangs, consider adding export NCCL_IB_GID_INDEX=3. For more information, see this.
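For example, a hedged sketch of both settings, assuming the host NIC is named eth0 (replace with your actual interface name):
# Interface for the Gloo rendezvous; the name is machine-specific
export GLOO_SOCKET_IFNAME=eth0
# Often needed on RoCE / InfiniBand setups, as noted above
export NCCL_IB_GID_INDEX=3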
# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code
# node 2
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
If you have two H100 nodes, the usage is similar to the aforementioned H20.
Note that the launch command here does not enable Data Parallelism Attention or torch.compile optimization. For optimal performance, please refer to the command options in Performance Optimization Options.
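Once both ranks are up, you can sanity-check the deployment from any machine that can reach the first node. A sketch using the OpenAI-compatible endpoint, assuming the server listens on the default port 30000:
# Simple smoke test against the first node
curl http://10.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'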
There are two H200 nodes, each with 8 GPUs. The first node's IP is 192.168.114.10, and the second node's IP is 192.168.114.11. Configure the endpoint to expose it to another Docker container using --host 0.0.0.0 and --port 40000, and set up communications with --dist-init-addr 192.168.114.10:20000.
A single H200 node with 8 GPUs can run DeepSeek V3; the dual-H200 setup here is only to demonstrate multi-node usage.
# node 1
docker run --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode1 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# node 2
docker run --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode2 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
To ensure functionality, we include a test from a client Docker container.
docker run --gpus all \
--shm-size 32g \
--network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--name sglang_multinode_client \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 --output-file "deepseekv3_multinode.jsonl"
Note that the launch command here does not enable Data Parallelism Attention or torch.compile optimization. For optimal performance, please refer to the command options in Performance Optimization Options.
To serve DeepSeek-V3 with A100 GPUs, we need to first convert the FP8 model checkpoints to BF16 with the script mentioned here.
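For reference, a hedged sketch of that conversion step using the converter script from the DeepSeek-V3 repository (the script location, flags, and the local FP8 checkpoint path /path/to/DeepSeek-V3-FP8 are assumptions; check the repository for the current usage):
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
# Convert the FP8 checkpoints to BF16 with the repo's inference/fp8_cast_bf16.py script
python3 DeepSeek-V3/inference/fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/DeepSeek-V3-FP8 \
  --output-bf16-hf-path /path/to/DeepSeek-V3-BF16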
Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8x 80 GB GPUs. Assuming the first node's IP is 10.0.0.1 and the converted model path is /path/to/DeepSeek-V3-BF16, we can launch the server with the following commands.
# node 1
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 30000
# node 2
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 --trust-remote-code
# node 3
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 --trust-remote-code
# node 4
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 --trust-remote-code
Note that the launch command here does not enable Data Parallelism Attention or torch.compile optimization. For optimal performance, please refer to the command options in Performance Optimization Options.
Then we can benchmark the accuracy and latency by accessing the first node's exposed port with the following example commands.
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000
# bench latency
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
AWQ does not support BF16, so add the --dtype half flag if AWQ is used for quantization. One example is as follows:
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to Huggingface. One example is as follows:
Assuming the master node's IP is MASTER_IP, the checkpoint path is /path/to/DeepSeek-R1-INT8, and the port is 5000, we can launch the server with the following commands:
#master
python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
MASTER_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
#cluster
python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
MASTER_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
Note that the launch command here enables torch.compile optimization. For optimal performance, please refer to the command options in Performance Optimization Options.
Then on the master node, supposing the ShareGPT data is located at /path/to/ShareGPT_V3_unfiltered_cleaned_split.json, you can run the following commands to benchmark the launched server:
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319
# bench serving
python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --random-input 128 --random-output 128 --num-prompts 1000 --request-rate 128 --random-range-ratio 1.0
Note: using --parallel 200 can accelerate accuracy benchmarking.
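For example, the accuracy benchmark above with parallel requests enabled:
# bench accuracy with parallel requests
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 200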
SkyPilot helps find the cheapest available GPUs across any cloud or existing Kubernetes cluster and launches distributed serving with a single command. See details here.
To serve on multiple nodes:
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot
# Serve on 2 H100/H200x8 nodes
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B.yaml --retry-until-up
# Serve on 4 A100x8 nodes
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B-A100.yaml --retry-until-up
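After launching, the cluster can be inspected and torn down with SkyPilot's standard commands (a brief sketch):
# Check cluster status
sky status r1
# Tear down the cluster when done
sky down r1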
If you encounter the following error with an fp16/bf16 checkpoint:
ValueError: Weight output_partition_size = 576 is not divisible by weight quantization block_n = 128.
edit your config.json and remove the quantization_config block. For example:
"quantization_config": {
"activation_scheme": "dynamic",
"fmt": "e4m3",
"quant_method": "fp8",
"weight_block_size": [128, 128]
},
Removing this block typically resolves the error. For more details, see the discussion in sgl-project/sglang#3491.
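If you prefer not to edit the file by hand, a minimal sketch that drops the block, assuming jq is installed (back up config.json first):
# Remove the quantization_config block in place
jq 'del(.quantization_config)' config.json > config.json.tmp && mv config.json.tmp config.json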