Write README and front page of doc #147
Merged
Commits (63, all by WoosukKwon):

- 0648331 Write new README
- 56cd729 Minor
- 69cc609 Minor
- 7560a8a Intelligent -> advanced
- f383031 Add Contributing
- f08d197 Minor
- be1fad7 Minor
- 4da3392 News -> Latest News
- ef5aaf6 Add Guanaco
- bbe916b Add front page of doc
- 5f3dbe5 Merge branch 'doc-front' into readme
- d7de269 Minor
- 5de5333 Add slides
- aecb1a5 Address comments
- f01acc3 Remove .
- b8ca6b5 Fix
- c6ae832 Fix
- 9e52850 roll back
- ca274b3 Add URL
- 939835a Minor
- d87cddc Merge branch 'main' into readme
- 01f7c70 Minor
- 4ea1ef1 Minor
- fdf23f4 Add URL
- 8eb0257 Merge branch 'main' into readme
- 7900568 CacheFlow -> vLLM
- f136523 CacheFlow -> vLLM
- 136ab7d LMSys -> LMSYS
- 1543e45 Minor
- 14d9681 Merge branch 'main' into readme
- 25ebb22 Fix installation doc
- 5ab4894 Minor
- ef9bb06 Add fire emoji
- 4d3a226 Add perf figures
- 9cb814c Remove titles in figures
- 405510c Change URL
- 7ecaf9c Minor
- b023a2e Add PagedAttention
- 5e7696f Minor
- 3071102 Minor
- 6d4c7ac Fix title & contributing
- e09daa9 Add links & Fix key features
- 289f613 Bold
- 97c4b86 Minor
- 2b99c2f Add pip install
- 6a9a0f7 Fix figures
- 40d9fe3 Minor fix
- e1a38da Fix
- 9163438 Remove table
- 99d1f85 Numeric
- 1aee527 bullets
- 6c9bc40 Fix front page
- 5ffaafc Address comments
- a920670 Multi-GPU -> distributed
- 852c090 Address comments:
- 1821a80 Fix
- 92b653d Use figure
- 4faf03a Reduce width
- 5eb4577 Remove align
- 510d9b8 Use p with br
- 898d5f9 Fix docs
- 60eaff8 Increase image resolution
- ea6180f cached -> memory
Diff of `README.md` (changes from all commits):

````diff
@@ -1,66 +1,54 @@
-# vLLM
+# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone

-## Build from source
+| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |

-```bash
-pip install -r requirements.txt
-pip install -e .  # This may take several minutes.
-```
+vLLM is a fast and easy-to-use library for LLM inference and serving.

-## Test simple server
+## Latest News 🔥

-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model <your_model>
+- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April. Check out our [blog post]().

-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model <your_model>
-```
+## Getting Started

-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
+Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
+- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
+- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
+- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)

-## FastAPI server
+## Key Features

-To start the server:
-```bash
-ray start --head
-python -m vllm.entrypoints.fastapi_server # --model <your_model>
-```
+vLLM comes with many powerful features that include:

-To test the server:
-```bash
-python test_cli_client.py
-```
+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
````
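Among the features the diff adds, PagedAttention is the central new idea: KV-cache memory is allocated in fixed-size blocks through a per-sequence block table rather than reserved up front. The toy sketch below is purely illustrative, not vLLM's actual implementation; the class names and the block size of 4 are invented for the example.

```python
# Toy sketch of the PagedAttention memory-management idea: the KV cache is
# carved into fixed-size blocks, and each sequence maps its logical token
# positions to physical blocks on demand. Illustration only, not vLLM code.

BLOCK_SIZE = 4  # tokens per block; arbitrary value chosen for this example


class BlockAllocator:
    """Pool of free physical cache blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        # Hand out any free block; physical contiguity is not required.
        return self.free.pop()

    def release(self, blocks):
        # Blocks freed by a finished sequence become reusable immediately.
        self.free.extend(blocks)


class Sequence:
    """Maps a growing token sequence onto physical cache blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(10):  # 10 tokens occupy ceil(10 / 4) = 3 blocks
    seq.append_token()
print(len(seq.block_table))   # 3
print(len(allocator.free))    # 5 blocks still free for other sequences
```

Because blocks are allocated only as tokens arrive and returned to the pool when a sequence finishes, over-reservation and fragmentation of KV-cache memory are avoided, which is the mechanism behind vLLM's memory-efficiency claims.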
````diff
-## Gradio web server
+## Performance

-Install the following additional dependencies:
-```bash
-pip install gradio
-```
+vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
+For details, check out our [blog post]().

-Start the server:
-```bash
-python -m vllm.http_frontend.fastapi_frontend
-# At another terminal
-python -m vllm.http_frontend.gradio_webserver
-```
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n1.png" width="45%">
+  <img src="./assets/figures/perf_a100_n1.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 1 output completion. </em>
+</p>

-## Load LLaMA weights
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n3.png" width="45%">
+  <img src="./assets/figures/perf_a100_n3.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 3 output completions. </em>
+</p>

-Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
+## Contributing

-1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-```bash
-python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-```bash
-python simple_server.py --model /output/path/llama-7b
-python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
-```
+We welcome and value any contributions and collaborations.
+Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
````
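The new README's feature list mentions an OpenAI-compatible API server. As a hedged sketch of what a client call could look like — the port 8000, the endpoint path `/v1/completions`, and the model name `facebook/opt-125m` are assumptions chosen for illustration, not details stated in this PR — such a request can be built with only the standard library:

```python
import json
from urllib import request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> request.Request:
    # Payload follows the OpenAI completions convention implied by the
    # README's "OpenAI-compatible API server" feature.
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    return request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Build (but do not send) a request; sending it requires a running server,
# e.g. one started on localhost:8000 (an assumption for this example).
req = build_completion_request("http://localhost:8000",
                               "facebook/opt-125m",
                               "Hello, my name is")
print(req.get_method())                     # POST
print(json.loads(req.data)["max_tokens"])   # 64
```

Dispatching it would then be a single `request.urlopen(req)` call; any client that already speaks the OpenAI completions protocol should work unchanged against such a server.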