DeepSeek V3 Support

The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang also supports MLA optimization and DP attention, making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official DeepSeek team.

Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.

For the optimizations SGLang applies to the DeepSeek model series, please refer to DeepSeek Model Optimizations in SGLang.

Installation & Launch

If the server fails to start, first check that the weights have finished downloading. We recommend downloading them beforehand; otherwise, restart the server until all weights are downloaded.
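One way to check whether the download has finished is to compare the shard files on disk with the checkpoint's index file. The sketch below assumes the checkpoint directory follows the usual Hugging Face sharded layout, with a `model.safetensors.index.json` file whose `weight_map` lists every shard. The function name `missing_shards` and the directory you pass in are illustrative, not part of SGLang.

```python
import json
from pathlib import Path


def missing_shards(snapshot_dir):
    """Return the shard filenames named in the index but absent on disk.

    Assumes `snapshot_dir` contains `model.safetensors.index.json` with a
    `weight_map` that maps tensor names to shard filenames.
    """
    snapshot = Path(snapshot_dir)
    index = json.loads((snapshot / "model.safetensors.index.json").read_text())
    # Many tensors share one shard, so deduplicate the filenames.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (snapshot / name).exists()]
```

An empty list means every shard referenced by the index is present. It does not verify file contents, so a corrupted shard would still pass this check.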

Using Docker (Recommended)

# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest

# Launch
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

If you are using RDMA, please note that:

  1. --network=host and --privileged are required for RDMA. If you don't need RDMA, you can remove them.
  2. You may need to set NCCL_IB_GID_INDEX if you are using RoCE, for example: export NCCL_IB_GID_INDEX=3.

Add performance optimization options as needed.

Using pip

# Installation
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

# Launch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Add performance optimization options as needed.

Performance Optimization Options

MLA optimizations are enabled by default. The following optional optimizations can be enabled as needed.

  • Data Parallelism Attention: For high QPS scenarios, add the --enable-dp-attention argument to boost throughput.
  • Torch.compile Optimization: Add the --enable-torch-compile argument to enable it. Compilation adds some time to server startup. The maximum batch size for torch.compile optimization is controlled with --torch-compile-max-bs; a value between 1 and 8 is recommended (e.g., --torch-compile-max-bs 8).
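If you generate launch commands from a script, the options above can be collected into a list of arguments. This is a minimal sketch: `optimization_flags` is a hypothetical helper, and only the three flags named in this section are taken from the document. It warns rather than fails when the batch size falls outside the recommended range, since the range is a recommendation.

```python
import warnings


def optimization_flags(dp_attention=False, torch_compile=False,
                       torch_compile_max_bs=None):
    """Build the optional performance flags described in this section."""
    flags = []
    if dp_attention:
        flags.append("--enable-dp-attention")
    if torch_compile:
        flags.append("--enable-torch-compile")
        if torch_compile_max_bs is not None:
            if not 1 <= torch_compile_max_bs <= 8:
                # The document recommends 1 to 8; other values are not rejected here.
                warnings.warn("--torch-compile-max-bs outside the recommended 1-8 range")
            flags += ["--torch-compile-max-bs", str(torch_compile_max_bs)]
    return flags
```

The returned list can be appended to the arguments of `python3 -m sglang.launch_server`.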

Example: Sending requests with OpenAI API

import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

Example: Serving with two H20*8 nodes

Suppose there are two H20 nodes, each with 8 GPUs. The first node's IP is 10.0.0.1, and the second node's IP is 10.0.0.2. Use the first node's IP in --dist-init-addr for both commands.

If the command fails, try setting the GLOO_SOCKET_IFNAME environment variable. For more information, see Common Environment Variables.

If the nodes are connected with NVIDIA InfiniBand and the server hangs during startup, consider setting export NCCL_IB_GID_INDEX=3. For more information, see this.

# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code

If you have two H100 nodes, the usage is similar to the aforementioned H20.

Note that the launch command here does not enable Data Parallelism Attention or torch.compile Optimization. For optimal performance, please refer to the command options in Performance Optimization Options.
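The two commands above differ only in --node-rank, and the --tp value equals the number of nodes times the GPUs per node (2 × 8 = 16). A small generator makes that pattern explicit. This is a sketch based on the examples in this document: `launch_commands` is a hypothetical helper, and the rule tp = nodes × GPUs per node is inferred from the examples here rather than stated as a requirement.

```python
import shlex


def launch_commands(model_path, nnodes, gpus_per_node, init_addr):
    """Return one launch command per node, ordered by node rank."""
    tp = nnodes * gpus_per_node  # matches the examples: 2 x 8 -> 16, 4 x 8 -> 32
    commands = []
    for rank in range(nnodes):
        argv = [
            "python3", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(tp),
            "--dist-init-addr", init_addr,  # same address on every node
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),       # the only per-node difference
            "--trust-remote-code",
        ]
        commands.append(shlex.join(argv))
    return commands
```

Calling `launch_commands("deepseek-ai/DeepSeek-V3", 2, 8, "10.0.0.1:5000")` reproduces the two commands shown above.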

Example: Serving with two H200*8 nodes and docker

There are two H200 nodes, each with 8 GPUs. The first node's IP is 192.168.114.10, and the second node's IP is 192.168.114.11. Configure the endpoint to expose it to another Docker container using --host 0.0.0.0 and --port 40000, and set up communications with --dist-init-addr 192.168.114.10:20000. A single H200 node with 8 devices can run DeepSeek V3; the dual-H200 setup here only demonstrates multi-node usage.

# node 1
docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# node 2
docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

To verify that the deployment works, run a test from a client Docker container.

docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode_client \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 --output-file "deepseekv3_multinode.jsonl"

Note that the launch command here does not enable Data Parallelism Attention or torch.compile Optimization. For optimal performance, please refer to the command options in Performance Optimization Options.

Example: Serving with four A100*8 nodes

To serve DeepSeek-V3 on A100 GPUs, first convert the FP8 model checkpoints to BF16 with the script mentioned here.

Since the BF16 model is over 1.3 TB, we need four A100 nodes, each with eight 80 GB GPUs. Assuming the first node's IP is 10.0.0.1 and the converted model path is /path/to/DeepSeek-V3-BF16, launch the server with the following commands.

# node 1
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 30000

# node 2
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 --trust-remote-code

# node 3
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 --trust-remote-code

# node 4
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 --trust-remote-code

Note that the launch command here does not enable Data Parallelism Attention or torch.compile Optimization. For optimal performance, please refer to the command options in Performance Optimization Options.

Then we can benchmark the accuracy and latency by accessing the first node's exposed port with the following example commands.

# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000

# bench latency
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
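The four-node sizing at the start of this section can be checked with rough arithmetic. The sketch below assumes DeepSeek-V3's 671B total parameters, 2 bytes per BF16 parameter, and 1 GB = 10^9 bytes. It counts weights only, so the remaining memory is what is left for KV cache, activations, and runtime overhead. The function names are illustrative.

```python
PARAMS = 671_000_000_000   # DeepSeek-V3 total parameter count (671B)
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 80         # A100 80 GB
GPUS_PER_NODE = 8


def weights_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Size of the model weights in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


def cluster_gb(nnodes, gpus_per_node=GPUS_PER_NODE, gpu_gb=GPU_MEMORY_GB):
    """Aggregate GPU memory across the cluster in GB."""
    return nnodes * gpus_per_node * gpu_gb


for nnodes in (2, 4):
    total = cluster_gb(nnodes)
    headroom = total - weights_gb()
    print(f"{nnodes} nodes: {total} GB total, {headroom:.0f} GB left after weights")
```

Under these assumptions the weights take about 1342 GB, which exceeds the 1280 GB of two nodes and fits within the 2560 GB of four nodes.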

Example: Serving with 8 A100/A800 with AWQ Quantization

AWQ does not support BF16, so add the --dtype half flag if AWQ is used for quantization. One example is as follows:

python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half

Example: Serving with 16 A100/A800 with int8 Quantization

Both block-wise and per-channel INT8 quantization methods are available, and the quantization parameters have already been uploaded to Hugging Face.

Assuming the master node's IP is MASTER_IP and the port is 5000, launch the server with the following commands. They load meituan/DeepSeek-R1-Block-INT8 from Hugging Face; to use a local checkpoint instead, pass its path (e.g., /path/to/DeepSeek-R1-INT8) to --model:

# master
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 \
    --dist-init-addr MASTER_IP:5000 --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
# cluster
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 \
    --dist-init-addr MASTER_IP:5000 --nnodes 2 --node-rank 1 \
    --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8

Note that the launch command here enables torch.compile Optimization. For optimal performance, please refer to the command options in Performance Optimization Options.

Then on the master node, supposing the ShareGPT data is located at /path/to/ShareGPT_V3_unfiltered_cleaned_split.json, you can run the following commands to benchmark the launched server:

# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319

# bench serving
python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random  --random-input 128 --random-output 128 --num-prompts 1000 --request-rate 128 --random-range-ratio 1.0

Note: using --parallel 200 can accelerate accuracy benchmarking.

Example: Serving on any cloud or Kubernetes with SkyPilot

SkyPilot helps find the cheapest available GPUs across any cloud or existing Kubernetes cluster and launches distributed serving with a single command. See details here.

To serve on multiple nodes:

git clone https://github.com/skypilot-org/skypilot.git
# The YAML paths below are relative to the repository root
cd skypilot
# Serve on 2 H100/H200x8 nodes
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B.yaml --retry-until-up
# Or serve on 4 A100x8 nodes
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B-A100.yaml --retry-until-up

Troubleshooting

If you encounter the following error with an fp16/bf16 checkpoint:

ValueError: Weight output_partition_size = 576 is not divisible by weight quantization block_n = 128.

edit your config.json and remove the quantization_config block, which looks like this:

"quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128]
},

Removing this block typically resolves the error. For more details, see the discussion in sgl-project/sglang#3491.
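The edit can also be scripted. The sketch below removes the `quantization_config` key from a `config.json` and keeps a backup copy next to it. The function name `strip_quantization_config` and the `.bak` suffix are illustrative choices, and the path you pass in is your own checkpoint's config file.

```python
import json
import shutil
from pathlib import Path


def strip_quantization_config(config_path):
    """Remove `quantization_config` from a config.json.

    Returns True if the key was present and removed, False otherwise.
    A backup named `<file>.bak` is written before the file is changed.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if "quantization_config" not in config:
        return False
    # Keep the original so the change can be reverted.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    del config["quantization_config"]
    path.write_text(json.dumps(config, indent=2) + "\n")
    return True
```

Run it once against the `config.json` of the fp16/bf16 checkpoint, then restart the server.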

DeepSeek V3 Optimization Plan

#2591