
Write README and front page of doc #147

Merged 63 commits on Jun 18, 2023

Commits:
0648331 Write new README (WoosukKwon, Jun 11, 2023)
56cd729 Minor (WoosukKwon, Jun 11, 2023)
69cc609 Minor (WoosukKwon, Jun 11, 2023)
7560a8a Intelligent -> advanced (WoosukKwon, Jun 11, 2023)
f383031 Add Contributing (WoosukKwon, Jun 11, 2023)
f08d197 Minor (WoosukKwon, Jun 11, 2023)
be1fad7 Minor (WoosukKwon, Jun 11, 2023)
4da3392 News -> Latest News (WoosukKwon, Jun 11, 2023)
ef5aaf6 Add Guanaco (WoosukKwon, Jun 11, 2023)
bbe916b Add front page of doc (WoosukKwon, Jun 11, 2023)
5f3dbe5 Merge branch 'doc-front' into readme (WoosukKwon, Jun 11, 2023)
d7de269 Minor (WoosukKwon, Jun 11, 2023)
5de5333 Add slides (WoosukKwon, Jun 11, 2023)
aecb1a5 Address comments (WoosukKwon, Jun 15, 2023)
f01acc3 Remove . (WoosukKwon, Jun 15, 2023)
b8ca6b5 Fix (WoosukKwon, Jun 15, 2023)
c6ae832 Fix (WoosukKwon, Jun 15, 2023)
9e52850 roll back (WoosukKwon, Jun 15, 2023)
ca274b3 Add URL (WoosukKwon, Jun 15, 2023)
939835a Minor (WoosukKwon, Jun 15, 2023)
d87cddc Merge branch 'main' into readme (WoosukKwon, Jun 16, 2023)
01f7c70 Minor (WoosukKwon, Jun 16, 2023)
4ea1ef1 Minor (WoosukKwon, Jun 16, 2023)
fdf23f4 Add URL (WoosukKwon, Jun 17, 2023)
8eb0257 Merge branch 'main' into readme (WoosukKwon, Jun 17, 2023)
7900568 CacheFlow -> vLLM (WoosukKwon, Jun 17, 2023)
f136523 CacheFlow -> vLLM (WoosukKwon, Jun 17, 2023)
136ab7d LMSys -> LMSYS (WoosukKwon, Jun 17, 2023)
1543e45 Minor (WoosukKwon, Jun 17, 2023)
14d9681 Merge branch 'main' into readme (WoosukKwon, Jun 17, 2023)
25ebb22 Fix installation doc (WoosukKwon, Jun 17, 2023)
5ab4894 Minor (WoosukKwon, Jun 17, 2023)
ef9bb06 Add fire emoji (WoosukKwon, Jun 17, 2023)
4d3a226 Add perf figures (WoosukKwon, Jun 17, 2023)
9cb814c Remove titles in figures (WoosukKwon, Jun 17, 2023)
405510c Change URL (WoosukKwon, Jun 17, 2023)
7ecaf9c Minor (WoosukKwon, Jun 17, 2023)
b023a2e Add PagedAttention (WoosukKwon, Jun 17, 2023)
5e7696f Minor (WoosukKwon, Jun 17, 2023)
3071102 Minor (WoosukKwon, Jun 18, 2023)
6d4c7ac Fix title & contributing (WoosukKwon, Jun 18, 2023)
e09daa9 Add links & Fix key features (WoosukKwon, Jun 18, 2023)
289f613 Bold (WoosukKwon, Jun 18, 2023)
97c4b86 Minor (WoosukKwon, Jun 18, 2023)
2b99c2f Add pip install (WoosukKwon, Jun 18, 2023)
6a9a0f7 Fix figures (WoosukKwon, Jun 18, 2023)
40d9fe3 Minor fix (WoosukKwon, Jun 18, 2023)
e1a38da Fix (WoosukKwon, Jun 18, 2023)
9163438 Remove table (WoosukKwon, Jun 18, 2023)
99d1f85 Numeric (WoosukKwon, Jun 18, 2023)
1aee527 bullets (WoosukKwon, Jun 18, 2023)
6c9bc40 Fix front page (WoosukKwon, Jun 18, 2023)
5ffaafc Address comments (WoosukKwon, Jun 18, 2023)
a920670 Multi-GPU -> distributed (WoosukKwon, Jun 18, 2023)
852c090 Address comments (WoosukKwon, Jun 18, 2023)
1821a80 Fix (WoosukKwon, Jun 18, 2023)
92b653d Use figure (WoosukKwon, Jun 18, 2023)
4faf03a Reduce width (WoosukKwon, Jun 18, 2023)
5eb4577 Remove align (WoosukKwon, Jun 18, 2023)
510d9b8 Use p with br (WoosukKwon, Jun 18, 2023)
898d5f9 Fix docs (WoosukKwon, Jun 18, 2023)
60eaff8 Increase image resolution (WoosukKwon, Jun 18, 2023)
ea6180f cached -> memory (WoosukKwon, Jun 18, 2023)
78 changes: 24 additions & 54 deletions README.md
@@ -1,66 +1,36 @@
-# CacheFlow
+# FluentFlow
 
-## Build from source
+FluentFlow is a fast and easy-to-use library for LLM inference and serving.
+Using efficient memory management techniques, FluentFlow delivers x-x higher throughput than state-of-the-art systems.
+FluentFlow has powered [LMSys Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April, significantly reducing its operational costs.
 
-```bash
-pip install -r requirements.txt
-pip install -e .  # This may take several minutes.
-```
+## Latest News
 
-## Test simple server
+- [2023/06] FluentFlow was officially released! Please check out our [blog post](), [slides](), and [paper]().
 
-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model <your_model>
+## Getting Started
 
-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model <your_model>
-```
+Visit our [documentation]() to get started.
+- [Installation]()
+- [Quickstart]()
+- [OpenAI-compatible API]()
+- [Supported Models]()
 
-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
+## Key Features
 
-## FastAPI server
+FluentFlow comes with many powerful features that include:
 
-To start the server:
-```bash
-ray start --head
-python -m cacheflow.entrypoints.fastapi_server # --model <your_model>
-```
+- Seamless integration with popular HuggingFace models
+- Efficient block-based management for KV cache
+- Advanced batching mechanism
+- Optimized CUDA kernels
+- Tensor parallelism support for multi-GPU inference
+- OpenAI-compatible API
 
-To test the server:
-```bash
-python test_cli_client.py
-```
+## Performance
 
-## Gradio web server
-
-Install the following additional dependencies:
-```bash
-pip install gradio
-```
+## Contributing
 
-Start the server:
-```bash
-python -m cacheflow.http_frontend.fastapi_frontend
-# At another terminal
-python -m cacheflow.http_frontend.gradio_webserver
-```
-
-## Load LLaMA weights
-
-Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
-
-1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-```bash
-python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-```bash
-python simple_server.py --model /output/path/llama-7b
-python -m cacheflow.http_frontend.fastapi_frontend --model /output/path/llama-7b
-```
+As an open-source project in a fast-evolving field, we welcome any contributions and collaborations.
+For guidance on how to contribute, please check out [CONTRIBUTING.md](./CONTRIBUTING.md).
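The "efficient block-based management for KV cache" bullet above refers to the PagedAttention-style allocation that a later commit in this PR names explicitly ("Add PagedAttention"). As a toy sketch of the idea only (class name, block size, and the free-block bookkeeping are illustrative assumptions, not this project's API): logical token positions map to fixed-size physical blocks that are allocated on demand, so a sequence only occupies the blocks it actually touches.

```python
class BlockTable:
    """Toy sketch of block-based KV-cache bookkeeping.

    Logical token slots are mapped to fixed-size physical blocks that
    are allocated lazily, so memory grows in whole-block steps instead
    of one large contiguous reservation per sequence.
    """

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = []           # physical block ids, in logical order
        self.next_block_id = 0     # stand-in for a real free-block pool

    def append_token(self, position):
        # Allocate a new physical block only when crossing a block boundary.
        if position // self.block_size >= len(self.blocks):
            self.blocks.append(self.next_block_id)
            self.next_block_id += 1
        block = self.blocks[position // self.block_size]
        offset = position % self.block_size
        return block, offset       # where this token's KV entries would live

table = BlockTable(block_size=4)
slots = [table.append_token(i) for i in range(6)]
print(slots)              # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
print(len(table.blocks))  # 2 blocks cover 6 tokens
```

The design point this sketch illustrates is why block-based management raises throughput: unused tail slots waste at most one block per sequence, so more sequences fit in GPU memory and can be batched together.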
18 changes: 18 additions & 0 deletions docs/source/index.rst
@@ -1,6 +1,24 @@
 Welcome to CacheFlow!
 =====================
 
+**CacheFlow** is a fast and easy-to-use library for LLM inference and serving.
+Using efficient memory management techniques, CacheFlow delivers x-x higher throughput than state-of-the-art systems.
+Its core features include:
+
+- Seamless integration with popular HuggingFace models
+- Efficient block-based management for KV cache
+- Advanced batching mechanism
+- Optimized CUDA kernels
+- Tensor parallelism support for multi-GPU inference
+- OpenAI-compatible API
+
+For more information, please refer to:
+
+* Blog post: ``
+* Slides: ``
+* Paper: `FluentFlow: Efficient Memory Management for Large Language Model Serving <https:>`_
+
 
 Documentation
 -------------

Review comment on the feature list: "Ditto on comments in README."
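Both the README and this front page list an "OpenAI-compatible API" as a feature. A minimal client-side sketch, under the assumption that the server exposes the standard OpenAI `/v1/completions` interface (the URL, port, and model name below are illustrative, not taken from this PR):

```python
import json
import urllib.request

# Illustrative endpoint; an OpenAI-compatible server accepts the same
# JSON body as the official /v1/completions API.
API_URL = "http://localhost:8000/v1/completions"

def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completion request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def post_completion(body: dict) -> bytes:
    """POST the body to the server (requires a running server)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

body = build_completion_request("facebook/opt-125m", "Hello, my name is")
print(json.dumps(body))
```

Because the wire format matches OpenAI's, existing OpenAI client libraries can usually be pointed at such a server by overriding only the base URL.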
2 changes: 1 addition & 1 deletion docs/source/models/supported_models.rst
@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
 * - :code:`GPTNeoXForCausalLM`
   - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
 * - :code:`LlamaForCausalLM`
-  - LLaMA, Vicuna, Alpaca, Koala
+  - LLaMA, Vicuna, Alpaca, Koala, Guanaco
 * - :code:`OPTForCausalLM`
   - OPT, OPT-IML
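The supported-models table is keyed by HuggingFace architecture strings (the `architectures` field of a model's `config.json`), which is why fine-tunes like Guanaco can be added under an existing row. A sketch of how a serving engine might dispatch on that key; the data is transcribed from the diff above, while the function and variable names are illustrative:

```python
# Architecture string -> popular model families using it (from the table).
SUPPORTED_ARCHITECTURES = {
    "GPTNeoXForCausalLM": ["GPT-NeoX", "Pythia", "OpenAssistant", "Dolly V2", "StableLM"],
    "LlamaForCausalLM": ["LLaMA", "Vicuna", "Alpaca", "Koala", "Guanaco"],
    "OPTForCausalLM": ["OPT", "OPT-IML"],
}

def is_supported(architecture: str) -> bool:
    """Check whether a HuggingFace architecture string has a backend."""
    return architecture in SUPPORTED_ARCHITECTURES

print(is_supported("LlamaForCausalLM"))  # True
print(is_supported("BertForMaskedLM"))   # False
```

This is also why the PR's one-line change suffices: Guanaco is a LLaMA fine-tune, so it shares `LlamaForCausalLM` and needs no new backend code.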