
Commit 0b98ba1

Change the name to vLLM (vllm-project#150)
1 parent e5464ee · commit 0b98ba1

90 files changed: +342 -339 lines
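The rename touches every public surface at once: the `cacheflow` Python package becomes `vllm`, the C++/CUDA `namespace cacheflow` becomes `namespace vllm`, and the module entrypoints and user-facing strings follow. For downstream code, the visible change is the import path. A minimal before/after sketch, purely illustrative and mirroring the benchmark diffs below:

```python
# Before this commit (CacheFlow):
# from cacheflow import LLM, SamplingParams

# After this commit (vLLM):
from vllm import LLM, SamplingParams

# Entrypoint modules move the same way, e.g.
#   python -m cacheflow.entrypoints.api_server  ->  python -m vllm.entrypoints.api_server
```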

CONTRIBUTING.md (+6 -6)

@@ -1,6 +1,6 @@
-# Contributing to CacheFlow
+# Contributing to vLLM
 
-Thank you for your interest in contributing to CacheFlow!
+Thank you for your interest in contributing to vLLM!
 Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large.
 There are several ways you can contribute to the project:
 
@@ -11,9 +11,9 @@ There are several ways you can contribute to the project:
 However, remember that contributions aren't just about code.
 We believe in the power of community support; thus, answering queries, assisting others, and enhancing the documentation are highly regarded and beneficial contributions.
 
-Finally, one of the most impactful ways to support us is by raising awareness about CacheFlow.
+Finally, one of the most impactful ways to support us is by raising awareness about vLLM.
 Talk about it in your blog posts, highlighting how it's driving your incredible projects.
-Express your support on Twitter if CacheFlow aids you, or simply offer your appreciation by starring our repository.
+Express your support on Twitter if vLLM aids you, or simply offer your appreciation by starring our repository.
 
 
 ## Setup for development
@@ -70,5 +70,5 @@ If a comment isn't clear or you disagree with a suggestion, feel free to ask for
 
 ### Thank You
 
-Finally, thank you for taking the time to read these guidelines and for your interest in contributing to CacheFlow.
-Your contributions make CacheFlow a great tool for everyone!
+Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
+Your contributions make vLLM a great tool for everyone!

README.md (+5 -5)

@@ -1,4 +1,4 @@
-# CacheFlow
+# vLLM
 
 ## Build from source
 
@@ -28,7 +28,7 @@ python examples/simple_server.py --help
 To start the server:
 ```bash
 ray start --head
-python -m cacheflow.entrypoints.fastapi_server # --model <your_model>
+python -m vllm.entrypoints.fastapi_server # --model <your_model>
 ```
 
 To test the server:
@@ -45,9 +45,9 @@ pip install gradio
 
 Start the server:
 ```bash
-python -m cacheflow.http_frontend.fastapi_frontend
+python -m vllm.http_frontend.fastapi_frontend
 # At another terminal
-python -m cacheflow.http_frontend.gradio_webserver
+python -m vllm.http_frontend.gradio_webserver
 ```
 
 ## Load LLaMA weights
@@ -62,5 +62,5 @@ Since LLaMA weight is not fully public, we cannot directly download the LLaMA we
 2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
 ```bash
 python simple_server.py --model /output/path/llama-7b
-python -m cacheflow.http_frontend.fastapi_frontend --model /output/path/llama-7b
+python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
 ```

benchmarks/README.md (+1 -1)

@@ -1,4 +1,4 @@
-# Benchmarking CacheFlow
+# Benchmarking vLLM
 
 ## Downloading the ShareGPT dataset
 

benchmarks/benchmark_async_llm_server.py (+1 -1)

@@ -11,7 +11,7 @@ def main(args: argparse.Namespace):
                for i in range(args.n_threads)]
 
     api_url = f"http://{args.host}:{args.port}/generate"
-    headers = {"User-Agent": "CacheFlow Benchmark Client"}
+    headers = {"User-Agent": "vLLM Benchmark Client"}
     ploads = [{
         "prompt": p,
         "max_tokens": args.max_tokens,

benchmarks/benchmark_latency.py (+1 -1)

@@ -6,7 +6,7 @@
 import torch
 from tqdm import tqdm
 
-from cacheflow import LLM, SamplingParams
+from vllm import LLM, SamplingParams
 
 
 def main(args: argparse.Namespace):
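Only the import changes here; the offline API that the latency benchmark drives keeps the same shape. A minimal usage sketch under the new name (the model name and sampling values are illustrative placeholders, not part of this commit):

```python
from vllm import LLM, SamplingParams

# facebook/opt-125m is the default model used elsewhere in these benchmarks.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)

# Generate completions for a single placeholder prompt and print the text.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```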

benchmarks/benchmark_serving.py (+5 -5)

@@ -1,8 +1,8 @@
 """Benchmark online serving throughput.
 
 On the server side, run one of the following commands:
-    (CacheFlow backend)
-    python -m cacheflow.entrypoints.api_server \
+    (vLLM backend)
+    python -m vllm.entrypoints.api_server \
         --disable-log-requests --model <your_model>
 
     (TGI backend)
@@ -114,7 +114,7 @@ async def send_request(
     request_start_time = time.time()
 
     headers = {"User-Agent": "Benchmark Client"}
-    if backend == "cacheflow":
+    if backend == "vllm":
         pload = {
             "prompt": prompt,
             "n": 1,
@@ -213,8 +213,8 @@ def main(args: argparse.Namespace):
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
         description="Benchmark the online serving throughput.")
-    parser.add_argument("--backend", type=str, default="cacheflow",
-                        choices=["cacheflow", "tgi"])
+    parser.add_argument("--backend", type=str, default="vllm",
+                        choices=["vllm", "tgi"])
     parser.add_argument("--host", type=str, default="localhost")
     parser.add_argument("--port", type=int, default=8001)
     parser.add_argument("--dataset", type=str, required=True,

benchmarks/benchmark_throughput.py (+8 -7)

@@ -5,12 +5,13 @@
 import time
 from typing import List, Tuple
 
-from cacheflow import LLM, SamplingParams
 import torch
 from transformers import (AutoConfig, AutoTokenizer, AutoModelForCausalLM,
                           PreTrainedTokenizerBase)
 from tqdm import tqdm
 
+from vllm import LLM, SamplingParams
+
 
 def get_tokenizer(model_name: str) -> PreTrainedTokenizerBase:
     config = AutoConfig.from_pretrained(model_name)
@@ -70,7 +71,7 @@ def sample_requests(
     return sampled_requests
 
 
-def run_cacheflow(
+def run_vllm(
     requests: List[Tuple[str, int, int]],
     model: str,
     tensor_parallel_size: int,
@@ -172,8 +173,8 @@ def main(args: argparse.Namespace):
     tokenizer = get_tokenizer(args.model)
     requests = sample_requests(args.dataset, args.num_prompts, tokenizer)
 
-    if args.backend == "cacheflow":
-        elapsed_time = run_cacheflow(
+    if args.backend == "vllm":
+        elapsed_time = run_vllm(
             requests, args.model, args.tensor_parallel_size, args.seed, args.n,
             args.use_beam_search)
     elif args.backend == "hf":
@@ -192,8 +193,8 @@ def main(args: argparse.Namespace):
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Benchmark the throughput.")
-    parser.add_argument("--backend", type=str, choices=["cacheflow", "hf"],
-                        default="cacheflow")
+    parser.add_argument("--backend", type=str, choices=["vllm", "hf"],
+                        default="vllm")
     parser.add_argument("--dataset", type=str, required=True,
                         help="Path to the dataset.")
     parser.add_argument("--model", type=str, default="facebook/opt-125m")
@@ -207,7 +208,7 @@ def main(args: argparse.Namespace):
     parser.add_argument("--hf-max-batch-size", type=int, default=None,
                         help="Maximum batch size for HF backend.")
     args = parser.parse_args()
-    if args.backend == "cacheflow":
+    if args.backend == "vllm":
         if args.hf_max_batch_size is not None:
             raise ValueError("HF max batch size is only for HF backend.")
     elif args.backend == "hf":

cacheflow/__init__.py (-18)

This file was deleted.

cacheflow/model_executor/__init__.py (-10)

This file was deleted.

cacheflow/model_executor/models/__init__.py (-12)

This file was deleted.

csrc/activation_kernels.cu (+3 -3)

@@ -1,7 +1,7 @@
 #include <torch/extension.h>
 #include <ATen/cuda/CUDAContext.h>
 
-namespace cacheflow {
+namespace vllm {
 
 template<typename T>
 __device__ __forceinline__ T silu(const T& x) {
@@ -22,7 +22,7 @@ __global__ void silu_and_mul_kernel(
   }
 }
 
-} // namespace cacheflow
+} // namespace vllm
 
 void silu_and_mul(
   torch::Tensor& out,    // [num_tokens, d]
@@ -40,7 +40,7 @@ void silu_and_mul(
     input.scalar_type(),
     "silu_and_mul_kernel",
     [&] {
-      cacheflow::silu_and_mul_kernel<scalar_t><<<grid, block, 0, stream>>>(
+      vllm::silu_and_mul_kernel<scalar_t><<<grid, block, 0, stream>>>(
         out.data_ptr<scalar_t>(),
         input.data_ptr<scalar_t>(),
         d);

csrc/attention/attention_generic.cuh (+3 -3)

@@ -1,6 +1,6 @@
 /*
  * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
- * Copyright (c) 2023, The CacheFlow team.
+ * Copyright (c) 2023, The vLLM team.
  * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
@@ -19,7 +19,7 @@
 
 #include <stdint.h>
 
-namespace cacheflow {
+namespace vllm {
 
 // A vector type to store Q, K, V elements.
 template<typename T, int VEC_SIZE>
@@ -61,4 +61,4 @@ inline __device__ void zero(T& dst) {
   dst = tmp.raw;
 }
 
-} // namespace cacheflow
+} // namespace vllm

csrc/attention/attention_kernels.cu (+4 -4)

@@ -1,6 +1,6 @@
 /*
  * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
- * Copyright (c) 2023, The CacheFlow team.
+ * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
@@ -27,7 +27,7 @@
 #define MAX(a, b) ((a) > (b) ? (a) : (b))
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
 
-namespace cacheflow {
+namespace vllm {
 
 // Utility function for attention softmax.
 template<int NUM_WARPS>
@@ -315,10 +315,10 @@ __global__ void single_query_cached_kv_attention_kernel(
   }
 }
 
-} // namespace cacheflow
+} // namespace vllm
 
 #define LAUNCH_ATTENTION_KERNEL(T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS) \
-  cacheflow::single_query_cached_kv_attention_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS> \
+  vllm::single_query_cached_kv_attention_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS> \
   <<<grid, block, shared_mem_size, stream>>>( \
     out_ptr, \
     query_ptr, \

csrc/attention/attention_utils.cuh (+3 -3)

@@ -1,6 +1,6 @@
 /*
  * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
- * Copyright (c) 2023, The CacheFlow team.
+ * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
@@ -22,7 +22,7 @@
 #include <float.h>
 #include <type_traits>
 
-namespace cacheflow {
+namespace vllm {
 
 // Q*K^T operation.
 template<int THREAD_GROUP_SIZE, typename Vec, int N>
@@ -52,4 +52,4 @@ struct Qk_dot {
   }
 };
 
-} // namespace cacheflow
+} // namespace vllm

csrc/attention/dtype_bfloat16.cuh (+3 -3)

@@ -1,7 +1,7 @@
 /*
  * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
  * and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
- * Copyright (c) 2023, The CacheFlow team.
+ * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
@@ -25,7 +25,7 @@
 #include <cuda_fp16.h>
 #include <stdint.h>
 
-namespace cacheflow {
+namespace vllm {
 
 // Define custom BF16 vector data types.
 struct bf16_4_t {
@@ -420,4 +420,4 @@ inline __device__ void from_float(bf16_8_t& dst, Float8_ src) {
 #endif
 }
 
-} // namespace cacheflow
+} // namespace vllm

csrc/attention/dtype_float16.cuh (+3 -3)

@@ -1,7 +1,7 @@
 /*
  * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
  * and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
- * Copyright (c) 2023, The CacheFlow team.
+ * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
@@ -23,7 +23,7 @@
 
 #include <stdint.h>
 
-namespace cacheflow {
+namespace vllm {
 
 // FP16 vector types for Q, K, V.
 template<>
@@ -441,4 +441,4 @@ inline __device__ Float8_ to_float(uint4 u) {
   return tmp;
 }
 
-} // namespace cacheflow
+} // namespace vllm

csrc/attention/dtype_float32.cuh (+3 -3)

@@ -1,7 +1,7 @@
 /*
  * Adapted from https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp
  * and https://github.com/NVIDIA/FasterTransformer/blob/release/v5.3_tag/src/fastertransformer/kernels/decoder_masked_multihead_attention_utils.h
- * Copyright (c) 2023, The CacheFlow team.
+ * Copyright (c) 2023, The vLLM team.
 * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
@@ -22,7 +22,7 @@
 
 #include <stdint.h>
 
-namespace cacheflow {
+namespace vllm {
 
 // Define custom FP32 vector data types.
 struct Float4_ {
@@ -265,4 +265,4 @@ inline __device__ Float8_ to_float(Float8_ u) {
   return u;
 }
 
-} // namespace cacheflow
+} // namespace vllm
