diff --git a/README.md b/README.md
index 2447173799dc..2652422c36cf 100644
--- a/README.md
+++ b/README.md
@@ -1,66 +1,54 @@
-# vLLM
+# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
 
-## Build from source
+| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |
 
-```bash
-pip install -r requirements.txt
-pip install -e .  # This may take several minutes.
-```
+vLLM is a fast and easy-to-use library for LLM inference and serving.
 
-## Test simple server
+## Latest News 🔥
 
-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model
+- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April. Check out our [blog post]().
 
-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model
-```
+## Getting Started
 
-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
+Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
+- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
+- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
+- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
 
-## FastAPI server
+## Key Features
 
-To start the server:
-```bash
-ray start --head
-python -m vllm.entrypoints.fastapi_server # --model
-```
+vLLM comes with many powerful features that include:
 
-To test the server:
-```bash
-python test_cli_client.py
-```
+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
 
-## Gradio web server
+## Performance
 
-Install the following additional dependencies:
-```bash
-pip install gradio
-```
+vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
+For details, check out our [blog post]().
 
-Start the server:
-```bash
-python -m vllm.http_frontend.fastapi_frontend
-# At another terminal
-python -m vllm.http_frontend.gradio_webserver
-```
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n1.png">
+  <img src="./assets/figures/perf_a100_n1.png">
+  <br>
+  <em>Serving throughput when each request asks for 1 output completion.</em>
+</p>
 
-## Load LLaMA weights
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n3.png">
+  <img src="./assets/figures/perf_a100_n3.png">
+  <br>
+  <em>Serving throughput when each request asks for 3 output completions.</em>
+</p>
 
-Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
+## Contributing
 
-1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-    ```bash
-    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-        --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-    ```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-    ```bash
-    python simple_server.py --model /output/path/llama-7b
-    python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
-    ```
+We welcome and value any contributions and collaborations.
+Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
diff --git a/assets/figures/perf_a100_n1.png b/assets/figures/perf_a100_n1.png
new file mode 100644
index 000000000000..4fe782b685bb
Binary files /dev/null and b/assets/figures/perf_a100_n1.png differ
diff --git a/assets/figures/perf_a100_n3.png b/assets/figures/perf_a100_n3.png
new file mode 100644
index 000000000000..22a02c24c6ab
Binary files /dev/null and b/assets/figures/perf_a100_n3.png differ
diff --git a/assets/figures/perf_a10g_n1.png b/assets/figures/perf_a10g_n1.png
new file mode 100644
index 000000000000..43f71bef52a6
Binary files /dev/null and b/assets/figures/perf_a10g_n1.png differ
diff --git a/assets/figures/perf_a10g_n3.png b/assets/figures/perf_a10g_n3.png
new file mode 100644
index 000000000000..a0ab9e3c43ea
Binary files /dev/null and b/assets/figures/perf_a10g_n3.png differ
diff --git a/docs/source/getting_started/installation.rst b/docs/source/getting_started/installation.rst
index b5a245b327d0..824a74126371 100644
--- a/docs/source/getting_started/installation.rst
+++ b/docs/source/getting_started/installation.rst
@@ -3,17 +3,20 @@
 Installation
 ============
 
-vLLM is a Python library that includes some C++ and CUDA code.
-vLLM can run on systems that meet the following requirements:
+vLLM is a Python library that also contains some C++ and CUDA code.
+This additional code requires compilation on the user's machine.
+
+Requirements
+------------
 
 * OS: Linux
 * Python: 3.8 or higher
 * CUDA: 11.0 -- 11.8
-* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
+* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
 
 .. note::
     As of now, vLLM does not support CUDA 12.
-    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
+    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.
 
 .. tip::
     If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
@@ -45,7 +48,7 @@ You can install vLLM using pip:
 Build from source
 -----------------
 
-You can also build and install vLLM from source.
+You can also build and install vLLM from source:
 
 .. code-block:: console
 
diff --git a/docs/source/index.rst b/docs/source/index.rst
index ff51ae6264a6..ecb32f482000 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,7 +1,21 @@
 Welcome to vLLM!
 ================
 
-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+Its core features include:
+
+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+For more information, please refer to our `blog post <>`_.
+
 
 Documentation
 -------------
 
diff --git a/docs/source/models/supported_models.rst b/docs/source/models/supported_models.rst
index 5901390283c4..cdbca3788259 100644
--- a/docs/source/models/supported_models.rst
+++ b/docs/source/models/supported_models.rst
@@ -3,7 +3,7 @@
 Supported Models
 ================
 
-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers `_.
+vLLM supports a variety of generative Transformer models in `HuggingFace Transformers `_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.
 
@@ -18,7 +18,7 @@
   * - :code:`GPTNeoXForCausalLM`
     - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
   * - :code:`LlamaForCausalLM`
-    - LLaMA, Vicuna, Alpaca, Koala
+    - LLaMA, Vicuna, Alpaca, Koala, Guanaco
   * - :code:`OPTForCausalLM`
     - OPT, OPT-IML
diff --git a/setup.py b/setup.py
index eb320f2cd703..1b27aeed236d 100644
--- a/setup.py
+++ b/setup.py
@@ -165,7 +165,7 @@ def get_requirements() -> List[str]:
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
     packages=setuptools.find_packages(
-        exclude=("benchmarks", "csrc", "docs", "examples", "tests")),
+        exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
     python_requires=">=3.8",
     install_requires=get_requirements(),
     ext_modules=ext_modules,
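
The Getting Started and Key Features sections added above point readers to `pip install vllm` and a quickstart for running HuggingFace models. As a rough illustration of that offline-inference workflow, the sketch below is not part of this diff; the `LLM`/`SamplingParams` entry points, the `facebook/opt-125m` model choice, and the `tensor_parallel_size` parameter are assumptions based on the quickstart the README links to.

```python
# Minimal offline-inference sketch; assumes `pip install vllm` succeeded and
# that the vllm package exposes the LLM / SamplingParams interface described
# in the linked quickstart.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any architecture listed under Supported Models should work here; passing
# tensor_parallel_size=2 (for example) would shard the model across two GPUs.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r} -> Generated: {output.outputs[0].text!r}")
```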
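The feature lists also advertise an OpenAI-compatible API server. The client-side sketch below only illustrates what "OpenAI-compatible" implies for callers; the port, endpoint path, and served model name are assumptions for illustration and the server launch command is documented in the docs this diff links to, not here.

```python
# Hypothetical client for an OpenAI-style completions endpoint.
# Assumes a vLLM API server is already running locally on port 8000.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # whichever model the server was started with
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
# OpenAI-compatible responses place generated text under choices[0].text.
print(response.json()["choices"][0]["text"])
```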