
Write README and front page of doc #147

Merged 63 commits on Jun 18, 2023

Commits:
0648331 Write new README (WoosukKwon, Jun 11, 2023)
56cd729 Minor (WoosukKwon, Jun 11, 2023)
69cc609 Minor (WoosukKwon, Jun 11, 2023)
7560a8a Intelligent -> advanced (WoosukKwon, Jun 11, 2023)
f383031 Add Contributing (WoosukKwon, Jun 11, 2023)
f08d197 Minor (WoosukKwon, Jun 11, 2023)
be1fad7 Minor (WoosukKwon, Jun 11, 2023)
4da3392 News -> Latest News (WoosukKwon, Jun 11, 2023)
ef5aaf6 Add Guanaco (WoosukKwon, Jun 11, 2023)
bbe916b Add front page of doc (WoosukKwon, Jun 11, 2023)
5f3dbe5 Merge branch 'doc-front' into readme (WoosukKwon, Jun 11, 2023)
d7de269 Minor (WoosukKwon, Jun 11, 2023)
5de5333 Add slides (WoosukKwon, Jun 11, 2023)
aecb1a5 Address comments (WoosukKwon, Jun 15, 2023)
f01acc3 Remove . (WoosukKwon, Jun 15, 2023)
b8ca6b5 Fix (WoosukKwon, Jun 15, 2023)
c6ae832 Fix (WoosukKwon, Jun 15, 2023)
9e52850 roll back (WoosukKwon, Jun 15, 2023)
ca274b3 Add URL (WoosukKwon, Jun 15, 2023)
939835a Minor (WoosukKwon, Jun 15, 2023)
d87cddc Merge branch 'main' into readme (WoosukKwon, Jun 16, 2023)
01f7c70 Minor (WoosukKwon, Jun 16, 2023)
4ea1ef1 Minor (WoosukKwon, Jun 16, 2023)
fdf23f4 Add URL (WoosukKwon, Jun 17, 2023)
8eb0257 Merge branch 'main' into readme (WoosukKwon, Jun 17, 2023)
7900568 CacheFlow -> vLLM (WoosukKwon, Jun 17, 2023)
f136523 CacheFlow -> vLLM (WoosukKwon, Jun 17, 2023)
136ab7d LMSys -> LMSYS (WoosukKwon, Jun 17, 2023)
1543e45 Minor (WoosukKwon, Jun 17, 2023)
14d9681 Merge branch 'main' into readme (WoosukKwon, Jun 17, 2023)
25ebb22 Fix installation doc (WoosukKwon, Jun 17, 2023)
5ab4894 Minor (WoosukKwon, Jun 17, 2023)
ef9bb06 Add fire emoji (WoosukKwon, Jun 17, 2023)
4d3a226 Add perf figures (WoosukKwon, Jun 17, 2023)
9cb814c Remove titles in figures (WoosukKwon, Jun 17, 2023)
405510c Change URL (WoosukKwon, Jun 17, 2023)
7ecaf9c Minor (WoosukKwon, Jun 17, 2023)
b023a2e Add PagedAttention (WoosukKwon, Jun 17, 2023)
5e7696f Minor (WoosukKwon, Jun 17, 2023)
3071102 Minor (WoosukKwon, Jun 18, 2023)
6d4c7ac Fix title & contributing (WoosukKwon, Jun 18, 2023)
e09daa9 Add links & Fix key features (WoosukKwon, Jun 18, 2023)
289f613 Bold (WoosukKwon, Jun 18, 2023)
97c4b86 Minor (WoosukKwon, Jun 18, 2023)
2b99c2f Add pip install (WoosukKwon, Jun 18, 2023)
6a9a0f7 Fix figures (WoosukKwon, Jun 18, 2023)
40d9fe3 Minor fix (WoosukKwon, Jun 18, 2023)
e1a38da Fix (WoosukKwon, Jun 18, 2023)
9163438 Remove table (WoosukKwon, Jun 18, 2023)
99d1f85 Numeric (WoosukKwon, Jun 18, 2023)
1aee527 bullets (WoosukKwon, Jun 18, 2023)
6c9bc40 Fix front page (WoosukKwon, Jun 18, 2023)
5ffaafc Address comments (WoosukKwon, Jun 18, 2023)
a920670 Multi-GPU -> distributed (WoosukKwon, Jun 18, 2023)
852c090 Address comments (WoosukKwon, Jun 18, 2023)
1821a80 Fix (WoosukKwon, Jun 18, 2023)
92b653d Use figure (WoosukKwon, Jun 18, 2023)
4faf03a Reduce width (WoosukKwon, Jun 18, 2023)
5eb4577 Remove align (WoosukKwon, Jun 18, 2023)
510d9b8 Use p with br (WoosukKwon, Jun 18, 2023)
898d5f9 Fix docs (WoosukKwon, Jun 18, 2023)
60eaff8 Increase image resolution (WoosukKwon, Jun 18, 2023)
ea6180f cached -> memory (WoosukKwon, Jun 18, 2023)
78 changes: 24 additions & 54 deletions README.md
@@ -1,66 +1,36 @@
-# CacheFlow
+# FluentFlow
 
-## Build from source
+FluentFlow is a fast and easy-to-use library for LLM inference and serving.
+Using efficient memory management techniques, FluentFlow delivers x-x higher throughput than state-of-the-art systems.
+FluentFlow has powered [LMSys Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April, significantly reducing its operational costs.
 
-```bash
-pip install -r requirements.txt
-pip install -e .  # This may take several minutes.
-```
+## Latest News
 
-## Test simple server
+- [2023/06] FluentFlow was officially released! Please check out our [blog post](), [slides](), and [paper]().
 
-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model <your_model>
+## Getting Started
 
-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model <your_model>
-```
+Visit our [documentation]() to get started.
+- [Installation]()
+- [Quickstart]()
+- [OpenAI-compatible API]()
+- [Supported Models]()
 
-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
+## Key Features
 
-## FastAPI server
+FluentFlow comes with many powerful features that include:
 
-To start the server:
-```bash
-ray start --head
-python -m cacheflow.entrypoints.fastapi_server # --model <your_model>
-```
+- Seamless integration with popular HuggingFace models
+- Efficient block-based management for KV cache
+- Advanced batching mechanism
+- Optimized CUDA kernels
+- Tensor parallelism support for multi-GPU inference
+- OpenAI-compatible API
 
-To test the server:
-```bash
-python test_cli_client.py
-```
+## Performance
 
-## Gradio web server
-
-Install the following additional dependencies:
-```bash
-pip install gradio
-```
+## Contributing
 
-Start the server:
-```bash
-python -m cacheflow.http_frontend.fastapi_frontend
-# At another terminal
-python -m cacheflow.http_frontend.gradio_webserver
-```
-
-## Load LLaMA weights
-
-Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
-
-1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-```bash
-python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-```bash
-python simple_server.py --model /output/path/llama-7b
-python -m cacheflow.http_frontend.fastapi_frontend --model /output/path/llama-7b
-```
+As an open-source project in a fast-evolving field, we welcome any contributions and collaborations.
+For guidance on how to contribute, please check out [CONTRIBUTING.md](./CONTRIBUTING.md).
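The "efficient block-based management for KV cache" bullet above refers to the PagedAttention-style allocation that a later commit in this PR names explicitly ("Add PagedAttention"). As a toy sketch of the idea only (class name, block size, and the free-block bookkeeping are illustrative assumptions, not this project's API): logical token positions map to fixed-size physical blocks that are allocated on demand, so a sequence only occupies the blocks it actually touches.

```python
class BlockTable:
    """Toy sketch of block-based KV-cache bookkeeping.

    Logical token slots are mapped to fixed-size physical blocks that
    are allocated lazily, so memory grows in whole-block steps instead
    of one large contiguous reservation per sequence.
    """

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = []           # physical block ids, in logical order
        self.next_block_id = 0     # stand-in for a real free-block pool

    def append_token(self, position):
        # Allocate a new physical block only when crossing a block boundary.
        if position // self.block_size >= len(self.blocks):
            self.blocks.append(self.next_block_id)
            self.next_block_id += 1
        block = self.blocks[position // self.block_size]
        offset = position % self.block_size
        return block, offset       # where this token's KV entries would live

table = BlockTable(block_size=4)
slots = [table.append_token(i) for i in range(6)]
print(slots)              # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
print(len(table.blocks))  # 2 blocks cover 6 tokens
```

The design point this sketch illustrates is why block-based management raises throughput: unused tail slots waste at most one block per sequence, so more sequences fit in GPU memory and can be batched together.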
18 changes: 18 additions & 0 deletions docs/source/index.rst
@@ -1,6 +1,24 @@
 Welcome to CacheFlow!
 =====================
 
+**CacheFlow** is a fast and easy-to-use library for LLM inference and serving.
+Using efficient memory management techniques, CacheFlow delivers x-x higher throughput than state-of-the-art systems.
+Its core features include:
+
+- Seamless integration with popular HuggingFace models
+- Efficient block-based management for KV cache
+- Advanced batching mechanism
+- Optimized CUDA kernels
+- Tensor parallelism support for multi-GPU inference
+- OpenAI-compatible API
+
+For more information, please refer to:
+
+* Blog post: ``
+* Slides: ``
+* Paper: `FluentFlow: Efficient Memory Management for Large Language Model Serving <https:>`_
+
 
 Documentation
 -------------

Review comment on the feature list: "Ditto on comments in README."
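Both the README and this front page list an "OpenAI-compatible API" as a feature. A minimal client-side sketch, under the assumption that the server exposes the standard OpenAI `/v1/completions` interface (the URL, port, and model name below are illustrative, not taken from this PR):

```python
import json
import urllib.request

# Illustrative endpoint; an OpenAI-compatible server accepts the same
# JSON body as the official /v1/completions API.
API_URL = "http://localhost:8000/v1/completions"

def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completion request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def post_completion(body: dict) -> bytes:
    """POST the body to the server (requires a running server)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

body = build_completion_request("facebook/opt-125m", "Hello, my name is")
print(json.dumps(body))
```

Because the wire format matches OpenAI's, existing OpenAI client libraries can usually be pointed at such a server by overriding only the base URL.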
2 changes: 1 addition & 1 deletion docs/source/models/supported_models.rst
@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
 * - :code:`GPTNeoXForCausalLM`
   - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
 * - :code:`LlamaForCausalLM`
-  - LLaMA, Vicuna, Alpaca, Koala
+  - LLaMA, Vicuna, Alpaca, Koala, Guanaco
 * - :code:`OPTForCausalLM`
   - OPT, OPT-IML
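The supported-models table is keyed by HuggingFace architecture strings (the `architectures` field of a model's `config.json`), which is why fine-tunes like Guanaco can be added under an existing row. A sketch of how a serving engine might dispatch on that key; the data is transcribed from the diff above, while the function and variable names are illustrative:

```python
# Architecture string -> popular model families using it (from the table).
SUPPORTED_ARCHITECTURES = {
    "GPTNeoXForCausalLM": ["GPT-NeoX", "Pythia", "OpenAssistant", "Dolly V2", "StableLM"],
    "LlamaForCausalLM": ["LLaMA", "Vicuna", "Alpaca", "Koala", "Guanaco"],
    "OPTForCausalLM": ["OPT", "OPT-IML"],
}

def is_supported(architecture: str) -> bool:
    """Check whether a HuggingFace architecture string has a backend."""
    return architecture in SUPPORTED_ARCHITECTURES

print(is_supported("LlamaForCausalLM"))  # True
print(is_supported("BertForMaskedLM"))   # False
```

This is also why the PR's one-line change suffices: Guanaco is a LLaMA fine-tune, so it shares `LlamaForCausalLM` and needs no new backend code.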