Write README and front page of doc #147

Merged: 63 commits, Jun 18, 2023
Commits (63)
0648331
Write new README
WoosukKwon Jun 11, 2023
56cd729
Minor
WoosukKwon Jun 11, 2023
69cc609
Minor
WoosukKwon Jun 11, 2023
7560a8a
Intelligent -> advanced
WoosukKwon Jun 11, 2023
f383031
Add Contributing
WoosukKwon Jun 11, 2023
f08d197
Minor
WoosukKwon Jun 11, 2023
be1fad7
Minor
WoosukKwon Jun 11, 2023
4da3392
News -> Latest News
WoosukKwon Jun 11, 2023
ef5aaf6
Add Guanaco
WoosukKwon Jun 11, 2023
bbe916b
Add front page of doc
WoosukKwon Jun 11, 2023
5f3dbe5
Merge branch 'doc-front' into readme
WoosukKwon Jun 11, 2023
d7de269
Minor
WoosukKwon Jun 11, 2023
5de5333
Add slides
WoosukKwon Jun 11, 2023
aecb1a5
Address comments
WoosukKwon Jun 15, 2023
f01acc3
Remove .
WoosukKwon Jun 15, 2023
b8ca6b5
Fix
WoosukKwon Jun 15, 2023
c6ae832
Fix
WoosukKwon Jun 15, 2023
9e52850
roll back
WoosukKwon Jun 15, 2023
ca274b3
Add URL
WoosukKwon Jun 15, 2023
939835a
Minor
WoosukKwon Jun 15, 2023
d87cddc
Merge branch 'main' into readme
WoosukKwon Jun 16, 2023
01f7c70
Minor
WoosukKwon Jun 16, 2023
4ea1ef1
Minor
WoosukKwon Jun 16, 2023
fdf23f4
Add URL
WoosukKwon Jun 17, 2023
8eb0257
Merge branch 'main' into readme
WoosukKwon Jun 17, 2023
7900568
CacheFlow -> vLLM
WoosukKwon Jun 17, 2023
f136523
CacheFlow -> vLLM
WoosukKwon Jun 17, 2023
136ab7d
LMSys -> LMSYS
WoosukKwon Jun 17, 2023
1543e45
Minor
WoosukKwon Jun 17, 2023
14d9681
Merge branch 'main' into readme
WoosukKwon Jun 17, 2023
25ebb22
Fix installation doc
WoosukKwon Jun 17, 2023
5ab4894
Minor
WoosukKwon Jun 17, 2023
ef9bb06
Add fire emoji
WoosukKwon Jun 17, 2023
4d3a226
Add perf figures
WoosukKwon Jun 17, 2023
9cb814c
Remove titles in figures
WoosukKwon Jun 17, 2023
405510c
Change URL
WoosukKwon Jun 17, 2023
7ecaf9c
Minor
WoosukKwon Jun 17, 2023
b023a2e
Add PagedAttention
WoosukKwon Jun 17, 2023
5e7696f
Minor
WoosukKwon Jun 17, 2023
3071102
Minor
WoosukKwon Jun 18, 2023
6d4c7ac
Fix title & contributing
WoosukKwon Jun 18, 2023
e09daa9
Add links & Fix key features
WoosukKwon Jun 18, 2023
289f613
Bold
WoosukKwon Jun 18, 2023
97c4b86
Minor
WoosukKwon Jun 18, 2023
2b99c2f
Add pip install
WoosukKwon Jun 18, 2023
6a9a0f7
Fix figures
WoosukKwon Jun 18, 2023
40d9fe3
Minor fix
WoosukKwon Jun 18, 2023
e1a38da
Fix
WoosukKwon Jun 18, 2023
9163438
Remove table
WoosukKwon Jun 18, 2023
99d1f85
Numeric
WoosukKwon Jun 18, 2023
1aee527
bullets
WoosukKwon Jun 18, 2023
6c9bc40
Fix front page
WoosukKwon Jun 18, 2023
5ffaafc
Address comments
WoosukKwon Jun 18, 2023
a920670
Multi-GPU -> distributed
WoosukKwon Jun 18, 2023
852c090
Address comments:
WoosukKwon Jun 18, 2023
1821a80
Fix
WoosukKwon Jun 18, 2023
92b653d
Use figure
WoosukKwon Jun 18, 2023
4faf03a
Reduce width
WoosukKwon Jun 18, 2023
5eb4577
Remove align
WoosukKwon Jun 18, 2023
510d9b8
Use p with br
WoosukKwon Jun 18, 2023
898d5f9
Fix docs
WoosukKwon Jun 18, 2023
60eaff8
Increase image resolution
WoosukKwon Jun 18, 2023
ea6180f
cached -> memory
WoosukKwon Jun 18, 2023
README.md (90 changes: 39 additions & 51 deletions)
@@ -1,66 +1,54 @@
# vLLM
# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone

## Build from source
| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |

```bash
pip install -r requirements.txt
pip install -e . # This may take several minutes.
```
vLLM is a fast and easy-to-use library for LLM inference and serving.

## Test simple server
## Latest News 🔥

```bash
# Single-GPU inference.
python examples/simple_server.py # --model <your_model>
- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April. Check out our [blog post]().

# Multi-GPU inference (e.g., 2 GPUs).
ray start --head
python examples/simple_server.py -tp 2 # --model <your_model>
```
## Getting Started

The detailed arguments for `simple_server.py` can be found by:
```bash
python examples/simple_server.py --help
```
Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
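
For illustration only (not part of this PR's diff), a minimal offline-inference sketch along the lines of the Quickstart; the `LLM` and `SamplingParams` names are assumed from the vLLM Python API:

```python
# Illustrative sketch, not part of the diff. Assumes the vLLM package
# exposes `LLM` and `SamplingParams` as described in the Quickstart.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # any supported HuggingFace model name
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the original prompt and one or more completions.
    print(output.prompt, output.outputs[0].text)
```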

## FastAPI server
## Key Features

To start the server:
```bash
ray start --head
python -m vllm.entrypoints.fastapi_server # --model <your_model>
```
vLLM comes with many powerful features that include:

To test the server:
```bash
python test_cli_client.py
```
- State-of-the-art performance in serving throughput
Review comment (Member), suggested change: replace "State-of-the-art performance in serving throughput" with "State-of-the-art serving throughput".
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
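
To make the decoding-algorithm and distributed-inference bullets concrete, here is an illustrative sketch (again, not part of this PR's diff); the parameter names `tensor_parallel_size`, `n`, `best_of`, and `use_beam_search` are assumed from the vLLM API of this period:

```python
# Illustrative sketch, not part of the diff. Parameter names are assumed
# from the vLLM API (tensor_parallel_size, n, best_of, use_beam_search).
from vllm import LLM, SamplingParams

# Tensor parallelism: shard one model across 2 GPUs.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)

# Parallel sampling: 4 independent completions per prompt.
parallel = SamplingParams(n=4, temperature=0.8, top_p=0.95)

# Beam search: explore 4 beams, return the single best sequence.
beam = SamplingParams(use_beam_search=True, best_of=4, n=1, temperature=0.0)

for params in (parallel, beam):
    for request_output in llm.generate(["LLM serving is"], params):
        print([completion.text for completion in request_output.outputs])
```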

## Gradio web server
## Performance
WoosukKwon marked this conversation as resolved.

Install the following additional dependencies:
```bash
pip install gradio
```
vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
For details, check out our [blog post]().

Start the server:
```bash
python -m vllm.http_frontend.fastapi_frontend
# At another terminal
python -m vllm.http_frontend.gradio_webserver
```
<p align="center">
<img src="./assets/figures/perf_a10g_n1.png" width="45%">
<img src="./assets/figures/perf_a100_n1.png" width="45%">
<br>
<em> Serving throughput when each request asks for 1 output completion. </em>
</p>

## Load LLaMA weights
<p align="center">
<img src="./assets/figures/perf_a10g_n3.png" width="45%">
<img src="./assets/figures/perf_a100_n3.png" width="45%">
<br>
<em> Serving throughput when each request asks for 3 output completions. </em>
</p>

Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
## Contributing

1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
```
2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
```bash
python simple_server.py --model /output/path/llama-7b
python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
```
We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
Binary file added assets/figures/perf_a100_n1.png
Binary file added assets/figures/perf_a100_n3.png
Binary file added assets/figures/perf_a10g_n1.png
Binary file added assets/figures/perf_a10g_n3.png
docs/source/getting_started/installation.rst (13 changes: 8 additions & 5 deletions)
@@ -3,17 +3,20 @@
Installation
============

vLLM is a Python library that includes some C++ and CUDA code.
vLLM can run on systems that meet the following requirements:
vLLM is a Python library that also contains some C++ and CUDA code.
This additional code requires compilation on the user's machine.

Requirements
------------

* OS: Linux
* Python: 3.8 or higher
* CUDA: 11.0 -- 11.8
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
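
As a quick illustration (not part of this diff), the CUDA and GPU requirements above can be checked with PyTorch, assuming `torch` is already installed:

```python
# Illustrative check, not part of the diff. Assumes PyTorch is installed
# with CUDA support.
import torch

print(torch.version.cuda)                   # expect a value in the 11.0-11.8 range
print(torch.cuda.get_device_capability(0))  # expect (7, 0) or higher, e.g. (8, 0) on A100
```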

.. note::
As of now, vLLM does not support CUDA 12.
If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.

.. tip::
If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
@@ -45,7 +48,7 @@ You can install vLLM using pip:
Build from source
-----------------

You can also build and install vLLM from source.
You can also build and install vLLM from source:

.. code-block:: console

docs/source/index.rst (16 changes: 15 additions & 1 deletion)
@@ -1,7 +1,21 @@
Welcome to vLLM!
================

vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
**vLLM** is a fast and easy-to-use library for LLM inference and serving.
Its core features include:

- State-of-the-art performance in serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

For more information, please refer to our `blog post <>`_.


Documentation
-------------
docs/source/models/supported_models.rst (4 changes: 2 additions & 2 deletions)
@@ -3,7 +3,7 @@
Supported Models
================

vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.

@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
* - :code:`GPTNeoXForCausalLM`
- GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
* - :code:`LlamaForCausalLM`
- LLaMA, Vicuna, Alpaca, Koala
- LLaMA, Vicuna, Alpaca, Koala, Guanaco
* - :code:`OPTForCausalLM`
- OPT, OPT-IML

setup.py (2 changes: 1 addition & 1 deletion)
@@ -165,7 +165,7 @@ def get_requirements() -> List[str]:
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
packages=setuptools.find_packages(
exclude=("benchmarks", "csrc", "docs", "examples", "tests")),
exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
python_requires=">=3.8",
install_requires=get_requirements(),
ext_modules=ext_modules,