This repository was archived by the owner on Oct 11, 2024. It is now read-only.

Commit 0b0a588

Add sparsity support based on magic_wand GPU kernels

1 parent 5265631, commit 0b0a588

14 files changed, +497 -116 lines

README.md (+31 -102)

@@ -1,112 +1,41 @@
-<p align="center">
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
-  <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
-</picture>
-</p>
+## Neural Magic vLLM

-<h3 align="center">
-Easy, fast, and cheap LLM serving for everyone
-</h3>
+Fork of vLLM with sparsity.

-<p align="center">
-| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> |
+### To Run

-</p>
-
----
-
-**The Second vLLM Bay Area Meetup (Jan 31st 5pm-7:30pm PT)**
-
-We are thrilled to announce our second vLLM Meetup!
-The vLLM team will share recent updates and roadmap.
-We will also have vLLM collaborators from IBM coming up to the stage to discuss their insights on LLM optimizations.
-Please register [here](https://lu.ma/ygxbpzhl) and join us!
-
----
-
-*Latest News* 🔥
-- [2023/12] Added ROCm support to vLLM.
-- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
-- [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
-- [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
-- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
-- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
-- [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the clouds.
-- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
-
----
-## About
-vLLM is a fast and easy-to-use library for LLM inference and serving.
-
-vLLM is fast with:
-
-- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
-- Continuous batching of incoming requests
-- Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
-- Optimized CUDA kernels
-
-vLLM is flexible and easy to use with:
-
-- Seamless integration with popular Hugging Face models
-- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
-- Tensor parallelism support for distributed inference
-- Streaming outputs
-- OpenAI-compatible API server
-- Support NVIDIA GPUs and AMD GPUs
-
-vLLM seamlessly supports many Hugging Face models, including the following architectures:
-
-- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
-- Baichuan & Baichuan2 (`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.)
-- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
-- ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
-- DeciLM (`Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.)
-- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
-- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
-- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
-- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
-- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
-- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
-- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
-- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
-- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, etc.)
-- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
-- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
-- Phi (`microsoft/phi-1_5`, `microsoft/phi-2`, etc.)
-- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
-- Qwen2 (`Qwen/Qwen2-7B-beta`, `Qwen/Qwen-7B-Chat-beta`, etc.)
-- StableLM (`stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.)
-- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)
-
-Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
+Clone and install magic_wand:

 ```bash
-pip install vllm
+git clone https://github.com/neuralmagic/magic_wand.git
+cd magic_wand
+export TORCH_CUDA_ARCH_LIST=8.6
+pip install -e .
 ```

-## Getting Started
-
-Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
-- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
-- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
-- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
-
-## Contributing
+Then install this fork of vLLM:
+```bash
+cd ../
+pip install -e .
+```

-We welcome and value any contributions and collaborations.
-Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
+### Run Sample

-## Citation
+Run a 50% sparse model:

-If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
-```bibtex
-@inproceedings{kwon2023efficient,
-  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
-  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
-  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
-  year={2023}
-}
-```
+```python
+from vllm import LLM, SamplingParams
+
+model = LLM(
+    "nm-testing/Llama-2-7b-pruned50-retrained",
+    sparsity="sparse_w16a16",   # If left off, the model is loaded as dense
+    enforce_eager=True,         # Does not work with CUDA graphs yet
+    dtype="float16",
+    tensor_parallel_size=1,
+    max_model_len=1024
+)
+
+sampling_params = SamplingParams(max_tokens=100, temperature=0)
+outputs = model.generate("Hello my name is", sampling_params=sampling_params)
+print(outputs[0].outputs[0].text)
+```
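As a quick sanity check after the two editable installs above, both packages should be importable. A minimal sketch, assuming the magic_wand package installs a module named `magic_wand` (the module name is not confirmed by this diff):

```python
# Hypothetical smoke test after both `pip install -e .` steps.
import vllm
import magic_wand  # module name assumed from the neuralmagic/magic_wand repo name

print("vLLM version:", vllm.__version__)
```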

examples/offline_bench.py (+111)

@@ -0,0 +1,111 @@
+import random
+import time
+import argparse
+
+from vllm import LLM, SamplingParams
+
+NUM_REQUESTS_DEFAULT = 256
+MAX_SEQ_LEN_DEFAULT = 1024
+MAX_TOKENS_DEFAULT = 128
+SAMPLE_PROMPTS = [
+    # "Hello, my name is",
+    # "The president of the United States is",
+    # "The capital of France is",
+    "The future of AI is",
+]
+
+
+def run_bench(model_name,
+              model_revision,
+              is_sparse,
+              quant_method,
+              max_seq_len,
+              max_tokens,
+              num_requests,
+              num_gpus,
+              num_warmup_iters=1,
+              num_bench_iters=5,
+              possible_prompts=SAMPLE_PROMPTS,
+              enforce_eager=True):
+    print("Run bench with:")
+    print(f" model_name = {model_name}")
+    print(f" model_revision = {model_revision}")
+    print(f" is_sparse = {is_sparse}")
+    print(f" quant_method = {quant_method}")
+    print(f" max_seq_len = {max_seq_len}")
+    print(f" max_tokens = {max_tokens}")
+    print(f" num_requests = {num_requests}")
+    print(f" num_gpus = {num_gpus}")
+    print(f" num_warmup_iters = {num_warmup_iters}")
+    print(f" num_bench_iters = {num_bench_iters}")
+
+    prompts = []
+    for _ in range(num_requests):
+        index = random.randint(0, len(possible_prompts) - 1)
+        prompts.append(possible_prompts[index])
+
+    # Create sampling params
+    sampling_params = SamplingParams(temperature=0.8,
+                                     top_p=0.95,
+                                     max_tokens=max_tokens)
+
+    # Create LLM
+    llm = LLM(
+        model=model_name,
+        revision=model_revision,
+        sparsity="sparse_w16a16" if is_sparse else None,
+        enforce_eager=enforce_eager,
+        # dtype=torch.bfloat16,
+        tensor_parallel_size=num_gpus,
+        gpu_memory_utilization=0.9,
+        max_model_len=max_seq_len,
+        quantization=quant_method,
+    )
+
+    for i in range(num_warmup_iters):
+        start_time = time.time()
+        outputs = llm.generate(prompts, sampling_params)
+        elapsed_time = time.time() - start_time
+        print(f"Warmup iter {i} time: {elapsed_time} [secs]")
+
+    iter_times = []
+    for i in range(num_bench_iters):
+        start_time = time.time()
+        outputs = llm.generate(prompts, sampling_params)
+        iter_times.append(time.time() - start_time)
+        print(f"Bench iter {i} time: {iter_times[-1]} [secs]")
+
+    average_iter_time = sum(iter_times) / num_bench_iters
+    print(f"Average per iter time: {average_iter_time} [secs]")
+
+    # Print outputs of the last iter
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+    return average_iter_time
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("--model_name", type=str, required=True)
+    parser.add_argument("--model_revision", type=str, default=None)
+    parser.add_argument('--is_sparse', action='store_true')
+    parser.add_argument("--quant_method", type=str, default=None)
+    parser.add_argument("--max_seq_len", type=int, default=MAX_SEQ_LEN_DEFAULT)
+    parser.add_argument("--max_tokens", type=int, default=MAX_TOKENS_DEFAULT)
+    parser.add_argument("--num_requests",
+                        type=int,
+                        default=NUM_REQUESTS_DEFAULT)
+    parser.add_argument("--num_gpus", type=int, default=1)
+    parser.add_argument("--num_warmup_iters", type=int, default=1)
+    parser.add_argument("--num_bench_iters", type=int, default=5)
+
+    args = parser.parse_args()
+
+    run_bench(args.model_name, args.model_revision, args.is_sparse,
+              args.quant_method, args.max_seq_len, args.max_tokens,
+              args.num_requests, args.num_gpus, args.num_warmup_iters,
+              args.num_bench_iters)
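The script can also be driven from Python rather than its CLI. A minimal sketch, assuming it is run from the repository root so that `examples` is importable; the checkpoint is the one from the README and the keyword values mirror the argparse defaults above:

```python
# Sketch: call the benchmark helper directly instead of through its CLI.
# Assumes the repository root is the working directory so `examples` imports.
from examples.offline_bench import run_bench

avg_iter_secs = run_bench(
    model_name="nm-testing/Llama-2-7b-pruned50-retrained",  # README checkpoint
    model_revision=None,
    is_sparse=True,      # becomes sparsity="sparse_w16a16" inside run_bench
    quant_method=None,   # sparsity and quantization are mutually exclusive
    max_seq_len=1024,    # MAX_SEQ_LEN_DEFAULT
    max_tokens=128,      # MAX_TOKENS_DEFAULT
    num_requests=256,    # NUM_REQUESTS_DEFAULT
    num_gpus=1,
)
print(f"Average iteration time: {avg_iter_secs:.2f} s")
```

The equivalent command line is along the lines of `python examples/offline_bench.py --model_name nm-testing/Llama-2-7b-pruned50-retrained --is_sparse`.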

vllm/config.py (+27)

@@ -72,6 +72,7 @@ def __init__(
         tokenizer_revision: Optional[str] = None,
         max_model_len: Optional[int] = None,
         quantization: Optional[str] = None,
+        sparsity: Optional[str] = None,
         enforce_eager: bool = False,
         max_context_len_to_capture: Optional[int] = None,
     ) -> None:
@@ -85,6 +86,7 @@ def __init__(
         self.revision = revision
         self.tokenizer_revision = tokenizer_revision
         self.quantization = quantization
+        self.sparsity = sparsity
         self.enforce_eager = enforce_eager
         self.max_context_len_to_capture = max_context_len_to_capture

@@ -106,6 +108,7 @@ def __init__(
         self._verify_load_format()
         self._verify_tokenizer_mode()
         self._verify_quantization()
+        self._verify_sparsity()
         self._verify_cuda_graph()

     def _verify_load_format(self) -> None:
@@ -144,6 +147,30 @@ def _verify_tokenizer_mode(self) -> None:
                 "either 'auto' or 'slow'.")
         self.tokenizer_mode = tokenizer_mode

+    def _verify_sparsity(self) -> None:
+        supported_sparsity = ["sparse_w16a16"]
+
+        if self.sparsity is not None and self.quantization is not None:
+            raise ValueError("Both sparsity and quantization detected. Only "
+                             "one or the other is supported at a time.")
+
+        if self.sparsity is not None and self.sparsity not in supported_sparsity:
+            raise ValueError(f"Unknown sparse method: {self.sparsity}. Must "
+                             f"be one of {supported_sparsity}.")
+
+        hf_sparsity_config = getattr(self.hf_config, "sparsity_config", None)
+        if hf_sparsity_config is not None:
+            hf_sparsity_method = str(
+                hf_sparsity_config["sparse_method"]).lower()
+            if self.sparsity is None:
+                self.sparsity = hf_sparsity_method
+            elif self.sparsity != hf_sparsity_method:
+                raise ValueError(
+                    "Sparsity method specified in the model config "
+                    f"({hf_sparsity_method}) does not match the sparsity "
+                    f"method specified in the `sparsity` argument "
+                    f"({self.sparsity}).")
+
     def _verify_quantization(self) -> None:
         supported_quantization = ["awq", "gptq", "squeezellm"]
         rocm_not_supported_quantization = ["awq"]
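To spell out the precedence `_verify_sparsity` implements: an explicit `sparsity` argument must name a supported method and cannot be combined with `quantization`; when no argument is given, a `sparsity_config` entry in the Hugging Face model config supplies the method; a mismatch between the two raises. The following standalone sketch restates that resolution order for illustration only (it is not the method above):

```python
# Illustrative sketch of the sparsity resolution order used by
# ModelConfig._verify_sparsity; not the vLLM code itself.
from typing import Optional

SUPPORTED_SPARSITY = ["sparse_w16a16"]


def resolve_sparsity(arg_sparsity: Optional[str],
                     hf_sparsity_config: Optional[dict],
                     quantization: Optional[str]) -> Optional[str]:
    if arg_sparsity is not None and quantization is not None:
        raise ValueError("Both sparsity and quantization detected. Only "
                         "one or the other is supported at a time.")
    if arg_sparsity is not None and arg_sparsity not in SUPPORTED_SPARSITY:
        raise ValueError(f"Unknown sparse method: {arg_sparsity}.")
    if hf_sparsity_config is not None:
        hf_method = str(hf_sparsity_config["sparse_method"]).lower()
        if arg_sparsity is None:
            return hf_method                 # checkpoint decides
        if arg_sparsity != hf_method:
            raise ValueError("Sparsity method in the model config "
                             f"({hf_method}) does not match the `sparsity` "
                             f"argument ({arg_sparsity}).")
    return arg_sparsity                      # explicit argument, or None = dense


# Example: no explicit argument, checkpoint declares sparse_w16a16.
assert resolve_sparsity(None, {"sparse_method": "sparse_w16a16"}, None) == "sparse_w16a16"
```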

vllm/engine/arg_utils.py (+17 -7)

@@ -33,6 +33,7 @@ class EngineArgs:
     revision: Optional[str] = None
     tokenizer_revision: Optional[str] = None
     quantization: Optional[str] = None
+    sparsity: Optional[str] = None
     enforce_eager: bool = False
     max_context_len_to_capture: int = 8192
     enable_lora: bool = False
@@ -197,6 +198,16 @@ def add_cli_args(
                             'None, we assume the model weights are not '
                             'quantized and use `dtype` to determine the data '
                             'type of the weights.')
+        parser.add_argument(
+            '--sparsity',
+            '-s',
+            type=str,
+            choices=['sparse_w16a16', None],
+            default=None,
+            help='Method used to compress sparse weights. If '
+            'None, we first check the `sparsity_config` attribute '
+            'in the model config file. If that is None, we assume '
+            'the model weights are dense.')
         parser.add_argument('--enforce-eager',
                             action='store_true',
                             help='Always use eager-mode PyTorch. If False, '
@@ -255,13 +266,12 @@ def create_engine_configs(
         self,
     ) -> Tuple[ModelConfig, CacheConfig, ParallelConfig, SchedulerConfig,
                Optional[LoRAConfig]]:
-        model_config = ModelConfig(self.model, self.tokenizer,
-                                   self.tokenizer_mode, self.trust_remote_code,
-                                   self.download_dir, self.load_format,
-                                   self.dtype, self.seed, self.revision,
-                                   self.tokenizer_revision, self.max_model_len,
-                                   self.quantization, self.enforce_eager,
-                                   self.max_context_len_to_capture)
+        model_config = ModelConfig(
+            self.model, self.tokenizer, self.tokenizer_mode,
+            self.trust_remote_code, self.download_dir, self.load_format,
+            self.dtype, self.seed, self.revision, self.tokenizer_revision,
+            self.max_model_len, self.quantization, self.sparsity,
+            self.enforce_eager, self.max_context_len_to_capture)
         cache_config = CacheConfig(self.block_size,
                                    self.gpu_memory_utilization,
                                    self.swap_space,
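Because `EngineArgs.create_engine_configs()` now forwards `sparsity` into `ModelConfig`, the flag can be exercised without the CLI. A minimal sketch, reusing the checkpoint from the README; the unpacked tuple follows the return annotation above:

```python
# Sketch: build engine configs programmatically with sparsity enabled.
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="nm-testing/Llama-2-7b-pruned50-retrained",  # README checkpoint
    sparsity="sparse_w16a16",
    enforce_eager=True,      # CUDA graphs are not supported with sparsity yet
    max_model_len=1024,
)
(model_config, cache_config, parallel_config, scheduler_config,
 lora_config) = engine_args.create_engine_configs()
print(model_config.sparsity)  # -> "sparse_w16a16"
```

On the command line, the equivalent setting is `--sparsity sparse_w16a16` (or `-s sparse_w16a16`).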

vllm/engine/llm_engine.py (+1)

@@ -83,6 +83,7 @@ def __init__(
             f"load_format={model_config.load_format}, "
             f"tensor_parallel_size={parallel_config.tensor_parallel_size}, "
             f"quantization={model_config.quantization}, "
+            f"sparsity={model_config.sparsity}, "
             f"enforce_eager={model_config.enforce_eager}, "
             f"seed={model_config.seed})")
         # TODO(woosuk): Print more configs in debug mode.

vllm/entrypoints/llm.py (+7)

@@ -43,6 +43,11 @@ class LLM:
             the `quantization_config` attribute in the model config file. If
             that is None, we assume the model weights are not quantized and use
             `dtype` to determine the data type of the weights.
+        sparsity: The format of the sparse model weights. Currently,
+            we support "sparse_w16a16". If None, we first check the
+            `sparsity_config` attribute in the model config file. If that is
+            None, we assume the model weights are dense and use `dtype` to
+            determine the data type of the weights.
         revision: The specific model version to use. It can be a branch name,
             a tag name, or a commit id.
         tokenizer_revision: The specific tokenizer version to use. It can be a
@@ -75,6 +80,7 @@ def __init__(
         tensor_parallel_size: int = 1,
         dtype: str = "auto",
         quantization: Optional[str] = None,
+        sparsity: Optional[str] = None,
         revision: Optional[str] = None,
         tokenizer_revision: Optional[str] = None,
         seed: int = 0,
@@ -94,6 +100,7 @@ def __init__(
             tensor_parallel_size=tensor_parallel_size,
             dtype=dtype,
             quantization=quantization,
+            sparsity=sparsity,
             revision=revision,
             tokenizer_revision=tokenizer_revision,
             seed=seed,
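A short sketch of the two loading paths the docstring describes, reusing the README checkpoint: passing `sparsity` explicitly forces the sparse_w16a16 weight format, while leaving it as None defers to the checkpoint's `sparsity_config` and otherwise loads the weights dense:

```python
# Sketch of the docstring behavior; in practice only one LLM would be created.
from vllm import LLM

# Explicit: load the weights in the sparse_w16a16 format.
llm_sparse = LLM("nm-testing/Llama-2-7b-pruned50-retrained",
                 sparsity="sparse_w16a16",
                 enforce_eager=True,
                 dtype="float16")

# Implicit: sparsity=None (the default) falls back to the checkpoint's
# sparsity_config; if none is declared, the weights are loaded dense.
llm_auto = LLM("nm-testing/Llama-2-7b-pruned50-retrained",
               enforce_eager=True,
               dtype="float16")
```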
