This repository was archived by the owner on Dec 1, 2024. It is now read-only.
README.md
+43 −43
@@ -1,6 +1,6 @@
-# FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU [[paper](https://arxiv.org/abs/2303.06865)]
+# FlexLLMGen: High-throughput Generative Inference of Large Language Models with a Single GPU [[paper](https://arxiv.org/abs/2303.06865)]

-FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.
+FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexLLMGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.

## Motivation
@@ -18,15 +18,15 @@ Throughput is a measure of tokens processed per second over the job's entire run
Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which
makes it easier to take advantage of low-cost commodity GPUs.

-The goal of FlexGen is to create a high-throughput system to enable new and exciting applications of
+The goal of FlexLLMGen is to create a high-throughput system to enable new and exciting applications of
foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU
instead of expensive systems.

-Check out the [examples](#examples) of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.
+Check out the [examples](#examples) of what you can run on a single commodity GPU with FlexLLMGen, including benchmarking and data wrangling.

-❌ **Limitation**. As an offloading-based system running on weak GPUs, FlexGen also has its limitations.
-FlexGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases.
-FlexGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches), on single GPUs.
+❌ **Limitation**. As an offloading-based system running on weak GPUs, FlexLLMGen also has its limitations.
+FlexLLMGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases.
+FlexLLMGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches), on single GPUs.

----------
@@ -45,8 +45,8 @@ This project was made possible thanks to a collaboration with
- [Installation](#installation)
- [Usage and Examples](#usage-and-examples)
  - [Get Started with a Single GPU](#get-started-with-a-single-gpu)
-  - [Run HELM Benchmark with FlexGen](#run-helm-benchmark-with-flexgen)
-  - [Run Data Wrangling Tasks with FlexGen](#run-data-wrangling-tasks-with-flexgen)
+  - [Run HELM Benchmark with FlexLLMGen](#run-helm-benchmark-with-flexllmgen)
+  - [Run Data Wrangling Tasks with FlexLLMGen](#run-data-wrangling-tasks-with-flexllmgen)
  - [Scaling to Distributed GPUs](#scaling-to-distributed-gpus)
You should see some text generated by OPT-1.3B and the benchmark results.

#### OPT-30B

To run large models like OPT-30B, you will need to use CPU offloading. You can try the commands below.
The `--percent` argument specifies the offloading strategy for parameters, attention cache, and hidden states separately.
-The exact meaning of this argument can be found [here](https://github.com/FMInference/FlexGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/flexgen/flex_opt.py#L1271-L1279).
+The exact meaning of this argument can be found [here](https://github.com/FMInference/FlexLLMGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/flexllmgen/flex_opt.py#L1271-L1279).
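For orientation, a command of this shape can be used. This is a minimal sketch only: the `flexllmgen.flex_opt` module path and the specific percentage split are assumptions about the renamed package, not values taken from this diff.

```
# Illustrative sketch only. --percent takes six integers:
# weights GPU%, weights CPU%, cache GPU%, cache CPU%, activations GPU%, activations CPU%;
# whatever remains in each pair is placed on disk.
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
```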
To run OPT-175B, you need to download the weights from [metaseq](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT) and convert the weights into Alpa [format](https://alpa.ai/tutorials/opt_serving.html#convert-opt-175b-weights-into-alpa-formats).
You can then try offloading all weights to disk by
-### Run HELM Benchmark with FlexGen
-FlexGen can be integrated into [HELM](https://crfm.stanford.edu/helm), a language model benchmark framework, as its execution backend.
+### Run HELM Benchmark with FlexLLMGen
+FlexLLMGen can be integrated into [HELM](https://crfm.stanford.edu/helm), a language model benchmark framework, as its execution backend.

You can use the commands below to run a Massive Multitask Language Understanding (MMLU) [scenario](https://crfm.stanford.edu/helm/latest/?group=mmlu) with a single T4 (16GB) GPU and 200GB of DRAM.
-Note that only a subset of HELM scenarios is tested. See more tested scenarios [here](flexgen/apps/helm_passed_30b.sh).
+Note that only a subset of HELM scenarios is tested. See more tested scenarios [here](flexllmgen/apps/helm_passed_30b.sh).
-### Run Data Wrangling Tasks with FlexGen
-You can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexgen/apps/data_wrangle).
+### Run Data Wrangling Tasks with FlexLLMGen
+You can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexllmgen/apps/data_wrangle).
### Scaling to Distributed GPUs

-If you have multiple machines with GPUs, FlexGen can combine offloading with pipeline parallelism to allow scaling.
-For example, if you have 2 GPUs but the aggregated GPU memory is less than the model size, you still need offloading. FlexGen allows you to do pipeline parallelism with these 2 GPUs to accelerate the generation.
+If you have multiple machines with GPUs, FlexLLMGen can combine offloading with pipeline parallelism to allow scaling.
+For example, if you have 2 GPUs but the aggregated GPU memory is less than the model size, you still need offloading. FlexLLMGen allows you to do pipeline parallelism with these 2 GPUs to accelerate the generation.

But to have scaled performance, you should have GPUs on distributed machines.
-See examples [here](https://github.com/FMInference/FlexGen/tree/main/benchmark/flexgen#distributed-gpus).
+See examples [here](https://github.com/FMInference/FlexLLMGen/tree/main/benchmark/flexllmgen#distributed-gpus).
### API Example

-We demonstrate the usage of the FlexGen API in [completion.py](flexgen/apps/completion.py).
+We demonstrate the usage of the FlexLLMGen API in [completion.py](flexllmgen/apps/completion.py).
This example shows how to run generation for two sentences.
-To get the best throughput out of FlexGen, you typically need to batch more sentences.
+To get the best throughput out of FlexLLMGen, you typically need to batch more sentences.

#### Generation API

-FlexGen has a generation API following the style of Hugging Face's transformers.
+FlexLLMGen has a generation API following the style of Hugging Face's transformers.
```python
output_ids = model.generate(
    input_ids,
@@ -138,25 +138,25 @@ If you do not have enough GPU/CPU memory, see the [Handle Out-Of-Memory](#handle
```
# Complete with OPT-6.7B. You need at least 15GB of GPU memory.
#### How to set the offloading strategy and `--percent`?

We will release an automatic policy optimizer later, but for now you have to manually try a few strategies.
The idea of high-throughput generation is to offload parameters and attention cache as much as possible to the CPU and disk if necessary.
-You can see the reference strategies in our benchmark [here](https://github.com/FMInference/FlexGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/benchmark/flexgen/bench_suite.py#L39-L79).
+You can see the reference strategies in our benchmark [here](https://github.com/FMInference/FlexLLMGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/benchmark/flexllmgen/bench_suite.py#L39-L79).
To avoid out-of-memory errors, you can tune `--percent` to offload more tensors to the CPU and disk.
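As a rough illustration of that tuning direction, the sketch below shows how the split might be shifted; the percentage values are assumptions for a 16GB GPU, not reference settings from the benchmark suite.

```
# Illustrative sketch only: progressively move tensors off the GPU if you hit OOM.
# Keep 20% of the weights on the GPU, the rest on the CPU; cache and activations on the CPU.
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 20 80 0 100 0 100
# If that still runs out of memory, keep all weights on the CPU and spill the cache to disk.
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 0 0 0 100
```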
@@ -176,31 +176,31 @@ The corresponding effective batch sizes and lowest offloading devices are in par
| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
| DeepSpeed ZeRO-Inference | 9.28 (16 on CPU) | 0.60 (4 on CPU) | 0.01 (1 on disk) |
| Petals | 8.25 (2 on GPU) | 2.84 (2 on GPU) | 0.08 (2 on GPU) |
-| FlexGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
-| FlexGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |
+| FlexLLMGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
+| FlexLLMGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |

- Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
- Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to **a large value** that maximizes the generation throughput for each system.
- Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation).

-How to [reproduce](benchmark/flexgen).
+How to [reproduce](benchmark/flexllmgen).
### Latency-Throughput Trade-Off

The figure below shows the latency and throughput trade-off of three offloading-based systems on OPT-175B (left) and OPT-30B (right).
-FlexGen achieves a new Pareto-optimal frontier with significantly higher maximum throughput for both models.
+FlexLLMGen achieves a new Pareto-optimal frontier with significantly higher maximum throughput for both models.
Other systems cannot further increase throughput due to out-of-memory errors.

-FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and attention key/value (KV) cache. FlexGen further compresses both weights and KV cache to 4 bits with negligible accuracy loss.
+FlexLLMGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and attention key/value (KV) cache. FlexLLMGen further compresses both weights and KV cache to 4 bits with negligible accuracy loss.

-One key idea of FlexGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods,
+One key idea of FlexLLMGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods,
but the I/O efficiency of offloading can be greatly boosted for throughput-oriented scenarios (see the figure above).
-FlexGen utilizes a block schedule to reuse weights and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.
+FlexLLMGen utilizes a block schedule to reuse weights and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.
benchmark/flexllmgen/README.md
+2 −2
@@ -1,9 +1,9 @@
-# Benchmark FlexGen
+# Benchmark FlexLLMGen

NOTE: This benchmark uses dummy weights by default for faster experiments.
It is expected that you will see randomly generated garbled characters; the throughput and latency numbers should still be correct.

## Mount SSD
-The following commands use `~/flexgen_offload_dir` as the offloading folder by default.
+The following commands use `~/flexllmgen_offload_dir` as the offloading folder by default.
To get the best performance, it is recommended to mount this folder on a fast SSD.
If you use AWS or GCP instances with local SSDs, you can use [mount_nvme_aws.sh](../../scripts/mount_nvme_aws.sh) or [mount_nvme_gcp.sh](../../scripts/mount_nvme_gcp.sh) to mount the local SSDs.
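To make the folder choice concrete, here is a hedged sketch of pointing a run at the mounted SSD. The `--offload-dir` flag, the module path, and the percentage split are assumptions about the renamed package, not text from this diff; adjust them to your setup.

```
# Illustrative sketch only: after mounting the SSD, direct disk offloading to that folder.
# Half of the weights stay in CPU memory, the remainder is written to the offload folder;
# cache and activations stay on the CPU.
python3 -m flexllmgen.flex_opt --model facebook/opt-30b \
  --percent 0 50 0 100 0 100 \
  --offload-dir ~/flexllmgen_offload_dir
```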