This repository was archived by the owner on Dec 1, 2024. It is now read-only.

Commit 36c70c0: rename for compliance
1 parent: d5fdcdc

56 files changed: +160 -160 lines changed.

Some content is hidden: large commits have some content hidden by default, so not every changed file appears below.

.gitignore (+6 -6)

@@ -16,12 +16,12 @@ dist
 # cache
 *__pycache__
 *.egg-info
-flexgen/apps/data
-flexgen/apps/runs
-flexgen/apps/benchmark_output
-flexgen/apps/data_wrangle/data
-flexgen/apps/data_wrangle/outputs
-flexgen/apps/data_wrangle/core
+flexllmgen/apps/data
+flexllmgen/apps/runs
+flexllmgen/apps/benchmark_output
+flexllmgen/apps/data_wrangle/data
+flexllmgen/apps/data_wrangle/outputs
+flexllmgen/apps/data_wrangle/core

 # pickle
 *.pkl

LICENSE (+1 -1)

@@ -1,4 +1,4 @@
-Copyright 2023 - The FlexGen team. All rights reserved.
+Copyright 2023 - The FlexLLMGen team. All rights reserved.

 Apache License
 Version 2.0, January 2004

README.md (+43 -43)

@@ -1,6 +1,6 @@
-# FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU [[paper](https://arxiv.org/abs/2303.06865)]
+# FlexLLMGen: High-throughput Generative Inference of Large Language Models with a Single GPU [[paper](https://arxiv.org/abs/2303.06865)]

-FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.
+FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexLLMGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.

 ## Motivation

@@ -18,15 +18,15 @@ Throughput is a measure of tokens processed per second over the job's entire run
 Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which
 makes it easier to take advantage of low-cost commodity GPUs.

-The goal of FlexGen is to create a high-throughput system to enable new and exciting applications of
+The goal of FlexLLMGen is to create a high-throughput system to enable new and exciting applications of
 foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU
 instead of expensive systems.

-Check out the [examples](#examples) of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.
+Check out the [examples](#examples) of what you can run on a single commodity GPU with FlexLLMGen, including benchmarking and data wrangling.

-**Limitation**. As an offloading-based system running on weak GPUs, FlexGen also has its limitations.
-FlexGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases.
-FlexGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches), on single GPUs.
+**Limitation**. As an offloading-based system running on weak GPUs, FlexLLMGen also has its limitations.
+FlexLLMGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases.
+FlexLLMGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches), on single GPUs.

 ----------

@@ -45,8 +45,8 @@ This project was made possible thanks to a collaboration with
 - [Installation](#installation)
 - [Usage and Examples](#usage-and-examples)
 - [Get Started with a Single GPU](#get-started-with-a-single-gpu)
-- [Run HELM Benchmark with FlexGen](#run-helm-benchmark-with-flexgen)
-- [Run Data Wrangling Tasks with FlexGen](#run-data-wrangling-tasks-with-flexgen)
+- [Run HELM Benchmark with FlexLLMGen](#run-helm-benchmark-with-flexllmgen)
+- [Run Data Wrangling Tasks with FlexLLMGen](#run-data-wrangling-tasks-with-flexllmgen)
 - [Scaling to Distributed GPUs](#scaling-to-distributed-gpus)
 - [API Example](#api-example)
 - [Frequently Asked Questions](#frequently-asked-questions)

@@ -60,13 +60,13 @@ Requirements:

 ### Method 1: With pip
 ```
-pip install flexgen
+pip install flexllmgen
 ```

 ### Method 2: From source
 ```
-git clone https://github.com/FMInference/FlexGen.git
-cd FlexGen
+git clone https://github.com/FMInference/FlexLLMGen.git
+cd FlexLLMGen
 pip install -e .
 ```

@@ -76,53 +76,53 @@ pip install -e .

 #### OPT-1.3B
 To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU so no offloading is required.
-FlexGen will automatically download weights from Hugging Face.
+FlexLLMGen will automatically download weights from Hugging Face.
 ```
-python3 -m flexgen.flex_opt --model facebook/opt-1.3b
+python3 -m flexllmgen.flex_opt --model facebook/opt-1.3b
 ```

 You should see some text generated by OPT-1.3B and the benchmark results.

 #### OPT-30B
 To run large models like OPT-30B, you will need to use CPU offloading. You can try commands below.
 The `--percent` argument specifies the offloading strategy for parameters, attention cache and hidden states separately.
-The exact meaning of this argument can be found [here](https://github.com/FMInference/FlexGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/flexgen/flex_opt.py#L1271-L1279).
+The exact meaning of this argument can be found [here](https://github.com/FMInference/FlexLLMGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/flexllmgen/flex_opt.py#L1271-L1279).
 ```
-python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
+python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
 ```

 #### OPT-175B
 To run OPT-175B, you need to download the weights from [metaseq](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT) and convert the weights into Alpa [format](https://alpa.ai/tutorials/opt_serving.html#convert-opt-175b-weights-into-alpa-formats).
 You can then try to offloading all weights to disk by
 ```
-python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER
+python3 -m flexllmgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER
 ```

-### Run HELM Benchmark with FlexGen
-FlexGen can be integrated into [HELM](https://crfm.stanford.edu/helm), a language model benchmark framework, as its execution backend.
+### Run HELM Benchmark with FlexLLMGen
+FlexLLMGen can be integrated into [HELM](https://crfm.stanford.edu/helm), a language model benchmark framework, as its execution backend.
 You can use the commands below to run a Massive Multitask Language Understanding (MMLU) [scenario](https://crfm.stanford.edu/helm/latest/?group=mmlu) with a single T4 (16GB) GPU and 200GB of DRAM.
 ```
 pip install crfm-helm
-python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
+python3 -m flexllmgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
 ```
-Note that only a subset of HELM scenarios is tested. See more tested scenarios [here](flexgen/apps/helm_passed_30b.sh).
+Note that only a subset of HELM scenarios is tested. See more tested scenarios [here](flexllmgen/apps/helm_passed_30b.sh).

-### Run Data Wrangling Tasks with FlexGen
-You can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexgen/apps/data_wrangle).
+### Run Data Wrangling Tasks with FlexLLMGen
+You can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexllmgen/apps/data_wrangle).

 ### Scaling to Distributed GPUs
-If you have multiple machines with GPUs, FlexGen can combine offloading with pipeline parallelism to allow scaling.
-For example, if you have 2 GPUs but the aggregated GPU memory is less than the model size, you still need offloading. FlexGen allow you to do pipeline parallelism with these 2 GPUs to accelerate the generation.
+If you have multiple machines with GPUs, FlexLLMGen can combine offloading with pipeline parallelism to allow scaling.
+For example, if you have 2 GPUs but the aggregated GPU memory is less than the model size, you still need offloading. FlexLLMGen allow you to do pipeline parallelism with these 2 GPUs to accelerate the generation.
 But to have scaled performance, you should have GPUs on distributed machines.
-See examples [here](https://github.com/FMInference/FlexGen/tree/main/benchmark/flexgen#distributed-gpus).
+See examples [here](https://github.com/FMInference/FlexLLMGen/tree/main/benchmark/flexllmgen#distributed-gpus).

 ### API Example
-We demonstrate the usage of FlexGen API in [completion.py](flexgen/apps/completion.py).
+We demonstrate the usage of FlexLLMGen API in [completion.py](flexllmgen/apps/completion.py).
 This example shows how to run generation for two sentences.
-To get the best throughput out of FlexGen, you typically need to batch more sentences.
+To get the best throughput out of FlexLLMGen, you typically need to batch more sentences.

 #### Generation API
-FlexGen has a generation API following the style of Hugging Face's transformers.
+FlexLLMGen has a generation API following the style of Hugging Face's transformers.
 ```python
 output_ids = model.generate(
     input_ids,

@@ -138,25 +138,25 @@ If you do not have enough GPU/CPU memory, see the [Handle Out-Of-Memory](#handle

 ```
 # Complete with OPT-6.7B. You need at least 15GB of GPU memory.
-python3 -m flexgen.apps.completion --model facebook/opt-6.7b
+python3 -m flexllmgen.apps.completion --model facebook/opt-6.7b
 ```

 ```
 # Complete with OPT-30B. You need about 90GB of CPU memory.
-python3 -m flexgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
+python3 -m flexllmgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
 ```

 ```
 # Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
-python3 -m flexgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
+python3 -m flexllmgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
 ```

 ### Frequently Asked Questions

 #### How to set the offloading strategy and `--percent`?
 We will release an automatic policy optimizer later, but now you have to manually try a few strategies.
 The idea of high-throughput generation is to offload parameters and attention cache as much as possible to the CPU and disk if necessary.
-You can see the reference strategies in our benchmark [here](https://github.com/FMInference/FlexGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/benchmark/flexgen/bench_suite.py#L39-L79).
+You can see the reference strategies in our benchmark [here](https://github.com/FMInference/FlexLLMGen/blob/9d092d848f106cd9eaf305c12ef3590f7bcb0277/benchmark/flexllmgen/bench_suite.py#L39-L79).
 To avoid out-of-memory, you can tune the `--percent` to offload more tensors to the CPU and disk.


@@ -176,31 +176,31 @@ The corresponding effective batch sizes and lowest offloading devices are in par
 | Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
 | DeepSpeed ZeRO-Inference | 9.28 (16 on CPU) | 0.60 (4 on CPU) | 0.01 (1 on disk) |
 | Petals | 8.25 (2 on GPU) | 2.84 (2 on GPU) | 0.08 (2 on GPU) |
-| FlexGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
-| FlexGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |
+| FlexLLMGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
+| FlexLLMGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |

 - Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
 - Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to **a large value** that maximizes the generation throughput for each system.
 - Metric: generation throughput (token/s) = number of the generated tokens / (time for processing prompts + time for generation).

-How to [reproduce](benchmark/flexgen).
+How to [reproduce](benchmark/flexllmgen).

 ### Latency-Throughput Trade-Off
 The figure below shows the latency and throughput trade-off of three offloading-based systems on OPT-175B (left) and OPT-30B (right).
-FlexGen achieves a new Pareto-optimal frontier with significantly higher maximum throughput for both models.
+FlexLLMGen achieves a new Pareto-optimal frontier with significantly higher maximum throughput for both models.
 Other systems cannot further increase throughput due to out-of-memory.
-"FlexGen(c)" is FlexGen with compression.
+"FlexLLMGen(c)" is FlexLLMGen with compression.

-<img src="https://github.com/FMInference/FlexGen/blob/main/docs/throughput_vs_latency.jpg" alt="image" width="500"></img>
+<img src="https://github.com/FMInference/FlexLLMGen/blob/main/docs/throughput_vs_latency.jpg" alt="image" width="500"></img>

 ## How It Works
-FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and attention key/value (KV) cache. FlexGen further compresses both weights and KV cache to 4 bits with negligible accuracy loss.
+FlexLLMGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and attention key/value (KV) cache. FlexLLMGen further compresses both weights and KV cache to 4 bits with negligible accuracy loss.

-One key idea of FlexGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods,
+One key idea of FlexLLMGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods,
 but the I/O efficiency of offloading can be greatly boosted for throughput-oriented scenarios (see the figure above).
-FlexGen utilizes a block schedule to reuse weight and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.
+FlexLLMGen utilizes a block schedule to reuse weight and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.

-<img src="https://github.com/FMInference/FlexGen/raw/main/docs/block_schedule.jpg" alt="image" width="500"></img>
+<img src="https://github.com/FMInference/FlexLLMGen/raw/main/docs/block_schedule.jpg" alt="image" width="500"></img>

 More technical details see our [paper](https://arxiv.org/abs/2303.06865).
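A note on the `--percent` flag that recurs in the renamed commands above: the README hunk only links to its definition in `flexllmgen/flex_opt.py`. As a hedged paraphrase of that linked help text (not part of this commit's diff), the six numbers give the GPU and CPU percentages for weights, attention cache, and activations, in that order, with any remainder spilling to disk:

```bash
# Hedged reading of the OPT-30B command from the README hunk above, assuming the
# linked flex_opt.py help text: six numbers = weight GPU%, weight CPU%,
# cache GPU%, cache CPU%, activation GPU%, activation CPU%; the rest goes to disk.
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
#   weights:         0% on GPU, 100% on CPU
#   attention cache: 100% on GPU, 0% on CPU
#   activations:     100% on GPU, 0% on CPU
```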
benchmark/batch_size_table.md (+4 -4)

@@ -16,8 +16,8 @@ The batch size is tuned for each system to achieve its maximum throughput with t
 | ------ | -------- | ------- | -------- |
 | Hugging Face Accelerate | 2 (gpu) | 8 (cpu) | 2 (disk) |
 | DeepSpeed ZeRO-Inference | 16 (cpu) | 4 (cpu) | 1 (disk) |
-| FlexGen | 2 (gpu) | 144 (cpu) | 256 (disk) |
-| FlexGen with Compression | 72 (gpu) | 512 (cpu) | 144 (cpu) |
+| FlexLLMGen | 2 (gpu) | 144 (cpu) | 256 (disk) |
+| FlexLLMGen with Compression | 72 (gpu) | 512 (cpu) | 144 (cpu) |

 ### Generation Throughput (token/s)
 We attach the generation throughput here for reference.

@@ -26,8 +26,8 @@ We attach the generation throughput here for reference.
 | ------ | -------- | ------- | -------- |
 | Hugging Face Accelerate | 25.12 | 0.62 | 0.01 |
 | DeepSpeed ZeRO-Inference | 9.28 | 0.60 | 0.01 |
-| FlexGen | 25.26 | 7.32 | 0.69 |
-| FlexGen with Compression | **29.12** | **8.38** | **1.12** |
+| FlexLLMGen | 25.26 | 7.32 | 0.69 |
+| FlexLLMGen with Compression | **29.12** | **8.38** | **1.12** |

 ### About Petals
 We also include [Petals](https://arxiv.org/abs/2209.01188) as an additional baseline.

benchmark/flexgen/README.md → benchmark/flexllmgen/README.md (+2 -2)

@@ -1,9 +1,9 @@
-# Benchmark FlexGen
+# Benchmark FlexLLMGen
 NOTE: This benchmark uses dummy weights by default for faster experiments.
 It is expected if you see randomly generated garbled characters, but the throughput and latency numbers should be correct.

 ## Mount SSD
-The following commands use `~/flexgen_offload_dir` as the offloading folder by default.
+The following commands use `~/flexllmgen_offload_dir` as the offloading folder by default.
 To get the best performance, it is recommonded to mount this folder on a fast SSD.
 If you use AWS or GCP instances with local SSDs, you can use [mount_nvme_aws.sh](../../scripts/mount_nvme_aws.sh) or [mount_nvme_gcp.sh](../../scripts/mount_nvme_gcp.sh) to mount the local SSDs.
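For completeness, a minimal sketch of pointing a disk-offloaded run at the renamed default offload folder mentioned in the hunk above; the command shape comes from the README diff, while the `mkdir -p` step and the folder location are illustrative assumptions rather than part of the commit:

```bash
# Hypothetical usage sketch: create the renamed default offload folder (ideally on a
# fast, mounted SSD as the benchmark README advises) and pass it to a disk-offloaded run.
mkdir -p ~/flexllmgen_offload_dir
python3 -m flexllmgen.flex_opt --model facebook/opt-175b \
    --percent 0 0 100 0 100 0 --offload-dir ~/flexllmgen_offload_dir
```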
benchmark/flexgen/bench_175b_1x4.sh → benchmark/flexllmgen/bench_175b_1x4.sh (+1 -1)

@@ -6,7 +6,7 @@ N_GPUS=4
 N_CORES_PER_GPU=12

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_175b_4x1.sh → benchmark/flexllmgen/bench_175b_4x1.sh (+1 -1)

@@ -17,7 +17,7 @@ ALL_IPADDR=($MY_IPADDR ${OTHERS_IPADDR[@]})
 all_hosts=$(echo ${ALL_IPADDR[@]:0:$N_NODES} | sed 's/ /,/g')

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_30b_1x4.sh → benchmark/flexllmgen/bench_30b_1x4.sh (+1 -1)

@@ -6,7 +6,7 @@ N_GPUS=4
 N_CORES_PER_GPU=12

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_30b_4x1.sh → benchmark/flexllmgen/bench_30b_4x1.sh (+1 -1)

@@ -17,7 +17,7 @@ ALL_IPADDR=($MY_IPADDR ${OTHERS_IPADDR[@]})
 all_hosts=$(echo ${ALL_IPADDR[@]:0:$N_NODES} | sed 's/ /,/g')

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_6.7b_1x4.sh → benchmark/flexllmgen/bench_6.7b_1x4.sh (+1 -1)

@@ -6,7 +6,7 @@ N_GPUS=4
 N_CORES_PER_GPU=6

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_6.7b_4x1.sh → benchmark/flexllmgen/bench_6.7b_4x1.sh (+1 -1)

@@ -17,7 +17,7 @@ ALL_IPADDR=($MY_IPADDR ${OTHERS_IPADDR[@]})
 all_hosts=$(echo ${ALL_IPADDR[@]:0:$N_NODES} | sed 's/ /,/g')

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_dist_multi_node.sh → benchmark/flexllmgen/bench_dist_multi_node.sh (+1 -1)

@@ -17,7 +17,7 @@ ALL_IPADDR=($MY_IPADDR ${OTHERS_IPADDR[@]})
 all_hosts=$(echo ${ALL_IPADDR[@]:0:$N_NODES} | sed 's/ /,/g')

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_dist_single_node.sh → benchmark/flexllmgen/bench_dist_single_node.sh (+1 -1)

@@ -6,7 +6,7 @@ N_GPUS=4
 N_CORES_PER_GPU=4

 PYTHON_EXEC=$CONDA_PREFIX/bin/python
-PYTHON_SCRIPT=flexgen.dist_flex_opt
+PYTHON_SCRIPT=flexllmgen.dist_flex_opt

 pgrep -fl python | awk '!/dist_flex_opt\.py/{print $1}' | xargs sudo kill
benchmark/flexgen/bench_suite.py → benchmark/flexllmgen/bench_suite.py (+2 -2)

@@ -1,7 +1,7 @@
 import argparse
 from dataclasses import dataclass

-from flexgen.utils import run_cmd
+from flexllmgen.utils import run_cmd


 @dataclass

@@ -188,7 +188,7 @@ class Case:
 cases = suites[suite]
 for case in cases:
     config, name, use_page_maga = case.command, case.name, case.use_page_maga
-    cmd = f"python -m flexgen.flex_opt {config}"
+    cmd = f"python -m flexllmgen.flex_opt {config}"
     if log_file:
         cmd += f" --log-file {args.log_file}"
     if use_page_maga:
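For reference, the renamed loop above ultimately shells out to commands of the following shape; the concrete `--model`, `--percent`, and log-file values are illustrative assumptions, not taken from this diff:

```bash
# Illustrative expansion of cmd = f"python -m flexllmgen.flex_opt {config}" from the
# hunk above, with an example config and the optional --log-file suffix appended.
python -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0 \
    --log-file bench_30b.log
```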
benchmark/hf_ds/bench_hf.py (+1 -1)

@@ -2,7 +2,7 @@
 from dataclasses import dataclass
 import time

-from flexgen.utils import run_cmd
+from flexllmgen.utils import run_cmd


 def run_huggingface(model, prompt_len, gen_len, cut_gen_len, batch_size,
