Write README and front page of doc #147

Merged: 63 commits, Jun 18, 2023
Commits (63)
0648331
Write new README
WoosukKwon Jun 11, 2023
56cd729
Minor
WoosukKwon Jun 11, 2023
69cc609
Minor
WoosukKwon Jun 11, 2023
7560a8a
Intelligent -> advanced
WoosukKwon Jun 11, 2023
f383031
Add Contributing
WoosukKwon Jun 11, 2023
f08d197
Minor
WoosukKwon Jun 11, 2023
be1fad7
Minor
WoosukKwon Jun 11, 2023
4da3392
News -> Latest News
WoosukKwon Jun 11, 2023
ef5aaf6
Add Guanaco
WoosukKwon Jun 11, 2023
bbe916b
Add front page of doc
WoosukKwon Jun 11, 2023
5f3dbe5
Merge branch 'doc-front' into readme
WoosukKwon Jun 11, 2023
d7de269
Minor
WoosukKwon Jun 11, 2023
5de5333
Add slides
WoosukKwon Jun 11, 2023
aecb1a5
Address comments
WoosukKwon Jun 15, 2023
f01acc3
Remove .
WoosukKwon Jun 15, 2023
b8ca6b5
Fix
WoosukKwon Jun 15, 2023
c6ae832
Fix
WoosukKwon Jun 15, 2023
9e52850
roll back
WoosukKwon Jun 15, 2023
ca274b3
Add URL
WoosukKwon Jun 15, 2023
939835a
Minor
WoosukKwon Jun 15, 2023
d87cddc
Merge branch 'main' into readme
WoosukKwon Jun 16, 2023
01f7c70
Minor
WoosukKwon Jun 16, 2023
4ea1ef1
Minor
WoosukKwon Jun 16, 2023
fdf23f4
Add URL
WoosukKwon Jun 17, 2023
8eb0257
Merge branch 'main' into readme
WoosukKwon Jun 17, 2023
7900568
CacheFlow -> vLLM
WoosukKwon Jun 17, 2023
f136523
CacheFlow -> vLLM
WoosukKwon Jun 17, 2023
136ab7d
LMSys -> LMSYS
WoosukKwon Jun 17, 2023
1543e45
Minor
WoosukKwon Jun 17, 2023
14d9681
Merge branch 'main' into readme
WoosukKwon Jun 17, 2023
25ebb22
Fix installation doc
WoosukKwon Jun 17, 2023
5ab4894
Minor
WoosukKwon Jun 17, 2023
ef9bb06
Add fire emoji
WoosukKwon Jun 17, 2023
4d3a226
Add perf figures
WoosukKwon Jun 17, 2023
9cb814c
Remove titles in figures
WoosukKwon Jun 17, 2023
405510c
Change URL
WoosukKwon Jun 17, 2023
7ecaf9c
Minor
WoosukKwon Jun 17, 2023
b023a2e
Add PagedAttention
WoosukKwon Jun 17, 2023
5e7696f
Minor
WoosukKwon Jun 17, 2023
3071102
Minor
WoosukKwon Jun 18, 2023
6d4c7ac
Fix title & contributing
WoosukKwon Jun 18, 2023
e09daa9
Add links & Fix key features
WoosukKwon Jun 18, 2023
289f613
Bold
WoosukKwon Jun 18, 2023
97c4b86
Minor
WoosukKwon Jun 18, 2023
2b99c2f
Add pip install
WoosukKwon Jun 18, 2023
6a9a0f7
Fix figures
WoosukKwon Jun 18, 2023
40d9fe3
Minor fix
WoosukKwon Jun 18, 2023
e1a38da
Fix
WoosukKwon Jun 18, 2023
9163438
Remove table
WoosukKwon Jun 18, 2023
99d1f85
Numeric
WoosukKwon Jun 18, 2023
1aee527
bullets
WoosukKwon Jun 18, 2023
6c9bc40
Fix front page
WoosukKwon Jun 18, 2023
5ffaafc
Address comments
WoosukKwon Jun 18, 2023
a920670
Multi-GPU -> distributed
WoosukKwon Jun 18, 2023
852c090
Address comments:
WoosukKwon Jun 18, 2023
1821a80
Fix
WoosukKwon Jun 18, 2023
92b653d
Use figure
WoosukKwon Jun 18, 2023
4faf03a
Reduce width
WoosukKwon Jun 18, 2023
5eb4577
Remove align
WoosukKwon Jun 18, 2023
510d9b8
Use p with br
WoosukKwon Jun 18, 2023
898d5f9
Fix docs
WoosukKwon Jun 18, 2023
60eaff8
Increase image resolution
WoosukKwon Jun 18, 2023
ea6180f
cached -> memory
WoosukKwon Jun 18, 2023
README.md (90 changes: 39 additions & 51 deletions)
@@ -1,66 +1,54 @@
# vLLM
# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone

## Build from source
| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |

```bash
pip install -r requirements.txt
pip install -e . # This may take several minutes.
```
vLLM is a fast and easy-to-use library for LLM inference and serving.

## Test simple server
## Latest News 🔥

```bash
# Single-GPU inference.
python examples/simple_server.py # --model <your_model>
- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April. Check out our [blog post]().

# Multi-GPU inference (e.g., 2 GPUs).
ray start --head
python examples/simple_server.py -tp 2 # --model <your_model>
```
## Getting Started

The detailed arguments for `simple_server.py` can be found by:
```bash
python examples/simple_server.py --help
```
Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
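
For illustration only (not part of this PR's diff), a minimal offline-inference sketch along the lines of the Quickstart; the `LLM` and `SamplingParams` names are assumed from the vLLM Python API:

```python
# Illustrative sketch, not part of the diff. Assumes the vLLM package
# exposes `LLM` and `SamplingParams` as described in the Quickstart.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # any supported HuggingFace model name
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the original prompt and one or more completions.
    print(output.prompt, output.outputs[0].text)
```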

## FastAPI server
## Key Features

To start the server:
```bash
ray start --head
python -m vllm.entrypoints.fastapi_server # --model <your_model>
```
vLLM comes with many powerful features that include:

To test the server:
```bash
python test_cli_client.py
```
- State-of-the-art performance in serving throughput
Review comment (Member), suggested change: replace "State-of-the-art performance in serving throughput" with "State-of-the-art serving throughput".
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
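
To make the decoding-algorithm and distributed-inference bullets concrete, here is an illustrative sketch (again, not part of this PR's diff); the parameter names `tensor_parallel_size`, `n`, `best_of`, and `use_beam_search` are assumed from the vLLM API of this period:

```python
# Illustrative sketch, not part of the diff. Parameter names are assumed
# from the vLLM API (tensor_parallel_size, n, best_of, use_beam_search).
from vllm import LLM, SamplingParams

# Tensor parallelism: shard one model across 2 GPUs.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)

# Parallel sampling: 4 independent completions per prompt.
parallel = SamplingParams(n=4, temperature=0.8, top_p=0.95)

# Beam search: explore 4 beams, return the single best sequence.
beam = SamplingParams(use_beam_search=True, best_of=4, n=1, temperature=0.0)

for params in (parallel, beam):
    for request_output in llm.generate(["LLM serving is"], params):
        print([completion.text for completion in request_output.outputs])
```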

## Gradio web server
## Performance
WoosukKwon marked this conversation as resolved.

Install the following additional dependencies:
```bash
pip install gradio
```
vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
For details, check out our [blog post]().

Start the server:
```bash
python -m vllm.http_frontend.fastapi_frontend
# At another terminal
python -m vllm.http_frontend.gradio_webserver
```
<p align="center">
<img src="./assets/figures/perf_a10g_n1.png" width="45%">
<img src="./assets/figures/perf_a100_n1.png" width="45%">
<br>
<em> Serving throughput when each request asks for 1 output completion. </em>
</p>

## Load LLaMA weights
<p align="center">
<img src="./assets/figures/perf_a10g_n3.png" width="45%">
<img src="./assets/figures/perf_a100_n3.png" width="45%">
<br>
<em> Serving throughput when each request asks for 3 output completions. </em>
</p>

Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
## Contributing

1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
```
2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
```bash
python simple_server.py --model /output/path/llama-7b
python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
```
We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
Binary file added assets/figures/perf_a100_n1.png
Binary file added assets/figures/perf_a100_n3.png
Binary file added assets/figures/perf_a10g_n1.png
Binary file added assets/figures/perf_a10g_n3.png
docs/source/getting_started/installation.rst (13 changes: 8 additions & 5 deletions)
@@ -3,17 +3,20 @@
Installation
============

vLLM is a Python library that includes some C++ and CUDA code.
vLLM can run on systems that meet the following requirements:
vLLM is a Python library that also contains some C++ and CUDA code.
This additional code requires compilation on the user's machine.

Requirements
------------

* OS: Linux
* Python: 3.8 or higher
* CUDA: 11.0 -- 11.8
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
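
As a quick illustration (not part of this diff), the CUDA and GPU requirements above can be checked with PyTorch, assuming `torch` is already installed:

```python
# Illustrative check, not part of the diff. Assumes PyTorch is installed
# with CUDA support.
import torch

print(torch.version.cuda)                   # expect a value in the 11.0-11.8 range
print(torch.cuda.get_device_capability(0))  # expect (7, 0) or higher, e.g. (8, 0) on A100
```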

.. note::
As of now, vLLM does not support CUDA 12.
If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.

.. tip::
If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
@@ -45,7 +48,7 @@ You can install vLLM using pip:
Build from source
-----------------

You can also build and install vLLM from source.
You can also build and install vLLM from source:

.. code-block:: console

docs/source/index.rst (16 changes: 15 additions & 1 deletion)
@@ -1,7 +1,21 @@
Welcome to vLLM!
================

vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
**vLLM** is a fast and easy-to-use library for LLM inference and serving.
Its core features include:

- State-of-the-art performance in serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

For more information, please refer to our `blog post <>`_.


Documentation
-------------
docs/source/models/supported_models.rst (4 changes: 2 additions & 2 deletions)
@@ -3,7 +3,7 @@
Supported Models
================

vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.

@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
* - :code:`GPTNeoXForCausalLM`
- GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
* - :code:`LlamaForCausalLM`
- LLaMA, Vicuna, Alpaca, Koala
- LLaMA, Vicuna, Alpaca, Koala, Guanaco
* - :code:`OPTForCausalLM`
- OPT, OPT-IML

setup.py (2 changes: 1 addition & 1 deletion)
@@ -165,7 +165,7 @@ def get_requirements() -> List[str]:
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
packages=setuptools.find_packages(
exclude=("benchmarks", "csrc", "docs", "examples", "tests")),
exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
python_requires=">=3.8",
install_requires=get_requirements(),
ext_modules=ext_modules,