This tutorial covers how to use Arctic with vLLM and what performance you should expect when running it. We are actively working with the vLLM community to upstream Arctic support, but until then please use the repos detailed below.
Hardware assumptions: we use a single 8xH100 instance (an AWS p5.48xlarge) for this tutorial, but similar hardware should provide similar results.
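If you want to confirm the machine you are on matches these assumptions, you can list the GPUs and their memory with a standard nvidia-smi query (this check is ours, not part of the original instructions):
nvidia-smi --query-gpu=name,memory.total --format=csv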
We strongly recommend building and using the following Dockerfile to stand up an environment for running vLLM with Arctic. The system performance and memory utilization of Arctic on vLLM can be sensitive to the runtime environment, so we provide a short Dockerfile that closely aligns with Snowflake's internal testing environment and is verified to deliver good performance and stability.
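Once you have the Dockerfile locally, building the image and launching a container might look like the sketch below; the image tag, shared-memory size, and Hugging Face cache mount are illustrative choices on our part rather than the verified configuration.
docker build -t arctic-vllm .
docker run --gpus all -it --shm-size=10g -v $HOME/.cache/huggingface:/root/.cache/huggingface arctic-vllm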
For the steps going forward we highly recommend that you use hf_transfer when downloading any of the Arctic checkpoints from Hugging Face to get the best throughput. On an AWS instance we see the checkpoint download in about 20-30 minutes. In vLLM this should be enabled by default if the package is installed (vllm-project/vllm#3817).
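If you prefer to pre-download the checkpoint before running anything, a minimal sketch with huggingface-cli looks like the following; the local directory is a placeholder, and the environment variable is what enables hf_transfer for the CLI.
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download snowflake/snowflake-arctic-instruct --local-dir ./snowflake-arctic-instruct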
If you are using a Docker image based on the Dockerfile above, you can skip right to step 2.
We recommend setting up a virtual environment to get all of your dependencies isolated to avoid potential conflicts.
# we recommend setting up a virtual environment for this
virtualenv arctic-venv
source arctic-venv/bin/activate
# faster ckpt download speed
pip install "huggingface_hub[hf_transfer]"
# clone vllm repo and checkout arctic branch
git clone -b arctic https://github.com/Snowflake-Labs/vllm.git
cd vllm
pip install -e .
# clone the Hugging Face transformers fork and check out the arctic branch
git clone -b arctic https://github.com/Snowflake-Labs/transformers.git
pip install -e ./transformers
# install deepspeed
pip install "deepspeed>=0.14.2"
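As a purely optional sanity check (ours, not part of the original steps), you can confirm that the key packages import cleanly and report their versions:
python -c "import vllm, deepspeed; print(vllm.__version__, deepspeed.__version__)"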
cd vllm/examples
# Make sure the arctic_model_path points to the folder path we provided.
USE_DUMMY=True python offline_inference_arctic.py
cd vllm/benchmarks
# Run the following
USE_DUMMY=True python3 benchmark_batch.py \
--warmup 1 \
-n 1,2,4,8 \
-l 2048 \
--max_new_tokens 256 \
-tp 8 \
--framework vllm \
--model "snowflake/snowflake-arctic-instruct"
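As the flag names suggest, this sweeps batch sizes 1, 2, 4, and 8 with 2048-token prompts, generates up to 256 new tokens per request, and runs with 8-way tensor parallelism after a single warmup iteration; the precise semantics are defined by benchmark_batch.py itself, so treat this as our reading of the flags rather than official documentation.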
cd vllm/benchmarks
# Start an OpenAI-like server
USE_DUMMY=True python -m vllm.entrypoints.api_server --model="snowflake/snowflake-arctic-instruct" -tp=8 --quantization deepspeedfp
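# Optional sanity check (ours, not part of the original steps): once the server is up,
# you can hit the demo /generate endpoint from a separate terminal before benchmarking.
curl http://localhost:8000/generate -d '{"prompt": "Snowflake Arctic is", "max_tokens": 32}'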
# Run the benchmark
python benchmark_online.py --prompt_length 2048 \
-tp 8 \
--max_new_tokens 256 \
-c 1024 \
-qps 1.0 \
--model "snowflake/snowflake-arctic-instruct" \
--framework vllm
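While the online benchmark runs, it can also be helpful to watch GPU utilization and memory from another terminal; this is plain nvidia-smi and nothing Arctic-specific:
watch -n 1 nvidia-smi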
Currently you should be seeing a throughput of 70+ tokens/sec with batch_size=1. We are actively working on improving this performance, so stay tuned!