Change the name to vLLM (vllm-project#150)
WoosukKwon authored Jun 17, 2023
1 parent 4830540 commit 79af1eb
Showing 90 changed files with 342 additions and 339 deletions.
12 changes: 6 additions & 6 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contributing to CacheFlow
# Contributing to vLLM

Thank you for your interest in contributing to CacheFlow!
Thank you for your interest in contributing to vLLM!
Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large.
There are several ways you can contribute to the project:

@@ -11,9 +11,9 @@ There are several ways you can contribute to the project:
However, remember that contributions aren't just about code.
We believe in the power of community support; thus, answering queries, assisting others, and enhancing the documentation are highly regarded and beneficial contributions.

Finally, one of the most impactful ways to support us is by raising awareness about CacheFlow.
Finally, one of the most impactful ways to support us is by raising awareness about vLLM.
Talk about it in your blog posts, highlighting how it's driving your incredible projects.
Express your support on Twitter if CacheFlow aids you, or simply offer your appreciation by starring our repository.
Express your support on Twitter if vLLM aids you, or simply offer your appreciation by starring our repository.


## Setup for development
@@ -70,5 +70,5 @@ If a comment isn't clear or you disagree with a suggestion, feel free to ask for

### Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to CacheFlow.
Your contributions make CacheFlow a great tool for everyone!
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
Your contributions make vLLM a great tool for everyone!
10 changes: 5 additions & 5 deletions README.md
@@ -1,4 +1,4 @@
# CacheFlow
# vLLM

## Build from source

@@ -28,7 +28,7 @@ python examples/simple_server.py --help
To start the server:
```bash
ray start --head
python -m cacheflow.entrypoints.fastapi_server # --model <your_model>
python -m vllm.entrypoints.fastapi_server # --model <your_model>
```

To test the server:
@@ -45,9 +45,9 @@ pip install gradio

Start the server:
```bash
python -m cacheflow.http_frontend.fastapi_frontend
python -m vllm.http_frontend.fastapi_frontend
# At another terminal
python -m cacheflow.http_frontend.gradio_webserver
python -m vllm.http_frontend.gradio_webserver
```

## Load LLaMA weights
@@ -62,5 +62,5 @@ Since LLaMA weight is not fully public, we cannot directly download the LLaMA we
2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
```bash
python simple_server.py --model /output/path/llama-7b
python -m cacheflow.http_frontend.fastapi_frontend --model /output/path/llama-7b
python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
```
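To sanity-check the renamed entrypoints above, here is a minimal Python sketch of querying the server's `/generate` endpoint. The endpoint path, JSON fields, and port 8001 are assumptions borrowed from the benchmark clients changed later in this commit, not commands taken from the README itself.

```python
# Minimal sketch: query the renamed vLLM server over HTTP.
# Assumptions (not stated in the README above): the server listens on
# localhost:8001 and exposes a /generate endpoint accepting the same JSON
# fields used by the benchmark clients in this commit.
import json
import urllib.request

api_url = "http://localhost:8001/generate"  # adjust host/port to your server
payload = {
    "prompt": "San Francisco is a",
    "n": 1,
    "max_tokens": 16,
}
request = urllib.request.Request(
    api_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "User-Agent": "vLLM Benchmark Client",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```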
2 changes: 1 addition & 1 deletion benchmarks/README.md
@@ -1,4 +1,4 @@
# Benchmarking CacheFlow
# Benchmarking vLLM

## Downloading the ShareGPT dataset

2 changes: 1 addition & 1 deletion benchmarks/benchmark_async_llm_server.py
@@ -11,7 +11,7 @@ def main(args: argparse.Namespace):
for i in range(args.n_threads)]

api_url = f"http://{args.host}:{args.port}/generate"
headers = {"User-Agent": "CacheFlow Benchmark Client"}
headers = {"User-Agent": "vLLM Benchmark Client"}
ploads = [{
"prompt": p,
"max_tokens": args.max_tokens,
2 changes: 1 addition & 1 deletion benchmarks/benchmark_latency.py
@@ -6,7 +6,7 @@
import torch
from tqdm import tqdm

from cacheflow import LLM, SamplingParams
from vllm import LLM, SamplingParams


def main(args: argparse.Namespace):
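Because the import above is the main user-facing change in this file, a minimal offline-generation sketch with the renamed package may help. It assumes the `LLM`/`SamplingParams` interface behaves the way the benchmark scripts in this commit use it; the model name is simply the benchmarks' default placeholder.

```python
# Minimal sketch of offline generation with the renamed package.
# Assumptions: the LLM / SamplingParams interface matches how the benchmark
# scripts in this commit use it; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # default model used by the benchmarks
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    # Each output carries the prompt and its generated completion(s).
    print(output.prompt, output.outputs[0].text)
```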
10 changes: 5 additions & 5 deletions benchmarks/benchmark_serving.py
@@ -1,8 +1,8 @@
"""Benchmark online serving throughput.
On the server side, run one of the following commands:
(CacheFlow backend)
python -m cacheflow.entrypoints.api_server \
(vLLM backend)
python -m vllm.entrypoints.api_server \
--disable-log-requests --model <your_model>
(TGI backend)
@@ -114,7 +114,7 @@ async def send_request(
request_start_time = time.time()

headers = {"User-Agent": "Benchmark Client"}
if backend == "cacheflow":
if backend == "vllm":
pload = {
"prompt": prompt,
"n": 1,
@@ -213,8 +213,8 @@ def main(args: argparse.Namespace):
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Benchmark the online serving throughput.")
parser.add_argument("--backend", type=str, default="cacheflow",
choices=["cacheflow", "tgi"])
parser.add_argument("--backend", type=str, default="vllm",
choices=["vllm", "tgi"])
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8001)
parser.add_argument("--dataset", type=str, required=True,
15 changes: 8 additions & 7 deletions benchmarks/benchmark_throughput.py
@@ -5,12 +5,13 @@
import time
from typing import List, Tuple

from cacheflow import LLM, SamplingParams
import torch
from transformers import (AutoConfig, AutoTokenizer, AutoModelForCausalLM,
PreTrainedTokenizerBase)
from tqdm import tqdm

from vllm import LLM, SamplingParams


def get_tokenizer(model_name: str) -> PreTrainedTokenizerBase:
config = AutoConfig.from_pretrained(model_name)
@@ -70,7 +71,7 @@ def sample_requests(
return sampled_requests


def run_cacheflow(
def run_vllm(
requests: List[Tuple[str, int, int]],
model: str,
tensor_parallel_size: int,
@@ -172,8 +173,8 @@ def main(args: argparse.Namespace):
tokenizer = get_tokenizer(args.model)
requests = sample_requests(args.dataset, args.num_prompts, tokenizer)

if args.backend == "cacheflow":
elapsed_time = run_cacheflow(
if args.backend == "vllm":
elapsed_time = run_vllm(
requests, args.model, args.tensor_parallel_size, args.seed, args.n,
args.use_beam_search)
elif args.backend == "hf":
@@ -192,8 +193,8 @@

if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Benchmark the throughput.")
parser.add_argument("--backend", type=str, choices=["cacheflow", "hf"],
default="cacheflow")
parser.add_argument("--backend", type=str, choices=["vllm", "hf"],
default="vllm")
parser.add_argument("--dataset", type=str, required=True,
help="Path to the dataset.")
parser.add_argument("--model", type=str, default="facebook/opt-125m")
@@ -207,7 +208,7 @@ def main(args: argparse.Namespace):
parser.add_argument("--hf-max-batch-size", type=int, default=None,
help="Maximum batch size for HF backend.")
args = parser.parse_args()
if args.backend == "cacheflow":
if args.backend == "vllm":
if args.hf_max_batch_size is not None:
raise ValueError("HF max batch size is only for HF backend.")
elif args.backend == "hf":
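Since the `--backend` choice above decides whether the renamed `run_vllm` or the HF path runs, a small sketch of calling the runner directly may clarify the request format. It assumes the snippet is executed from the `benchmarks/` directory so the script can be imported, and all values are placeholders; the argument order mirrors the call in `main()` shown above.

```python
# Sketch: exercising the renamed runner directly, mirroring the call in main().
# Assumptions: run from the benchmarks/ directory so the script is importable;
# each request tuple is (prompt, prompt_len, output_len) per the type hint above.
from benchmark_throughput import run_vllm

requests = [("Hello, my name is", 5, 32)]  # placeholder request
elapsed_time = run_vllm(
    requests,
    "facebook/opt-125m",  # the script's default --model
    1,      # tensor_parallel_size
    0,      # seed
    1,      # n
    False,  # use_beam_search
)
print(f"Finished {len(requests)} request(s) in {elapsed_time:.2f} s")
```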
18 changes: 0 additions & 18 deletions cacheflow/__init__.py

This file was deleted.

10 changes: 0 additions & 10 deletions cacheflow/model_executor/__init__.py

This file was deleted.

12 changes: 0 additions & 12 deletions cacheflow/model_executor/models/__init__.py

This file was deleted.

6 changes: 3 additions & 3 deletions csrc/activation_kernels.cu
@@ -1,7 +1,7 @@
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

namespace cacheflow {
namespace vllm {

template<typename T>
__device__ __forceinline__ T silu(const T& x) {
@@ -22,7 +22,7 @@ __global__ void silu_and_mul_kernel(
}
}

} // namespace cacheflow
} // namespace vllm

void silu_and_mul(
torch::Tensor& out, // [num_tokens, d]
@@ -40,7 +40,7 @@ void silu_and_mul(
input.scalar_type(),
"silu_and_mul_kernel",
[&] {
cacheflow::silu_and_mul_kernel<scalar_t><<<grid, block, 0, stream>>>(
vllm::silu_and_mul_kernel<scalar_t><<<grid, block, 0, stream>>>(
out.data_ptr<scalar_t>(),
input.data_ptr<scalar_t>(),
d);
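For reference, the renamed `silu_and_mul` kernel above can be described by a short PyTorch sketch. The `[num_tokens, 2 * d]` input layout is an assumption about how gated-SiLU activations are conventionally laid out; the hunk itself only shows the `[num_tokens, d]` output shape.

```python
# Reference sketch of the silu_and_mul operation in plain PyTorch.
# Assumption: the input is laid out as [num_tokens, 2 * d] and the output
# is [num_tokens, d], i.e. out = silu(x[:, :d]) * x[:, d:].
import torch
import torch.nn.functional as F

def silu_and_mul_reference(x: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

x = torch.randn(4, 2 * 8)        # [num_tokens=4, 2*d=16]
out = silu_and_mul_reference(x)  # [4, 8]
print(out.shape)
```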
6 changes: 3 additions & 3 deletions csrc/attention/attention_generic.cuh
@@ -1,6 +1,6 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The CacheFlow team.
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
@@ -19,7 +19,7 @@

#include <stdint.h>

namespace cacheflow {
namespace vllm {

// A vector type to store Q, K, V elements.
template<typename T, int VEC_SIZE>
@@ -61,4 +61,4 @@ inline __device__ void zero(T& dst) {
dst = tmp.raw;
}

} // namespace cacheflow
} // namespace vllm
8 changes: 4 additions & 4 deletions csrc/attention/attention_kernels.cu
@@ -1,6 +1,6 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* Copyright (c) 2023, The CacheFlow team.
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
@@ -27,7 +27,7 @@
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

namespace cacheflow {
namespace vllm {

// Utility function for attention softmax.
template<int NUM_WARPS>
@@ -315,10 +315,10 @@ __global__ void single_query_cached_kv_attention_kernel(
}
}

} // namespace cacheflow
} // namespace vllm

#define LAUNCH_ATTENTION_KERNEL(T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS) \
cacheflow::single_query_cached_kv_attention_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS> \
vllm::single_query_cached_kv_attention_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS> \
<<<grid, block, shared_mem_size, stream>>>( \
out_ptr, \
query_ptr, \
6 changes: 3 additions & 3 deletions csrc/attention/attention_utils.cuh
@@ -1,6 +1,6 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* Copyright (c) 2023, The CacheFlow team.
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
@@ -22,7 +22,7 @@
#include <float.h>
#include <type_traits>

namespace cacheflow {
namespace vllm {

// Q*K^T operation.
template<int THREAD_GROUP_SIZE, typename Vec, int N>
@@ -52,4 +52,4 @@ struct Qk_dot {
}
};

} // namespace cacheflow
} // namespace vllm
6 changes: 3 additions & 3 deletions csrc/attention/dtype_bfloat16.cuh
@@ -1,7 +1,7 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The CacheFlow team.
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
@@ -25,7 +25,7 @@
#include <cuda_fp16.h>
#include <stdint.h>

namespace cacheflow {
namespace vllm {

// Define custom BF16 vector data types.
struct bf16_4_t {
@@ -420,4 +420,4 @@ inline __device__ void from_float(bf16_8_t& dst, Float8_ src) {
#endif
}

} // namespace cacheflow
} // namespace vllm
6 changes: 3 additions & 3 deletions csrc/attention/dtype_float16.cuh
@@ -1,7 +1,7 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The CacheFlow team.
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
@@ -23,7 +23,7 @@

#include <stdint.h>

namespace cacheflow {
namespace vllm {

// FP16 vector types for Q, K, V.
template<>
@@ -441,4 +441,4 @@ inline __device__ Float8_ to_float(uint4 u) {
return tmp;
}

} // namespace cacheflow
} // namespace vllm
6 changes: 3 additions & 3 deletions csrc/attention/dtype_float32.cuh
@@ -1,7 +1,7 @@
/*
* Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
* and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
* Copyright (c) 2023, The CacheFlow team.
* Copyright (c) 2023, The vLLM team.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
@@ -22,7 +22,7 @@

#include <stdint.h>

namespace cacheflow {
namespace vllm {

// Define custom FP32 vector data types.
struct Float4_ {
@@ -265,4 +265,4 @@ inline __device__ Float8_ to_float(Float8_ u) {
return u;
}

} // namespace cacheflow
} // namespace vllm