This repository was archived by the owner on Dec 1, 2024. It is now read-only.

Commit 698214b

Move apps into flexgen package (#70)

1 parent: 9b600ba

File tree

6 files changed: +14 −7 lines changed

README.md

+13-6
````diff
@@ -142,31 +142,38 @@ For example, if you have 2 GPUs but the aggregated GPU memory is less than the m
 See examples [here](https://github.com/FMInference/FlexGen/tree/main/benchmark/flexgen#distributed-gpus).
 
 ## API Example
-We demonstrate the usage of FlexGen API in [apps/completion.py](apps/completion.py).
+We demonstrate the usage of FlexGen API in [completion.py](flexgen/apps/completion.py).
 This example shows how to run generation for two sentences.
 To get the best throughput out of FlexGen, you typically need to batch more sentences.
 
 ### Generation API
 FlexGen has a generation API following the style of Hugging Face's transformers.
-https://github.com/FMInference/FlexGen/blob/cf90920349109205378e5253fd5e8da4fa2740c1/apps/completion.py#L53-L58
+```python
+output_ids = model.generate(
+    input_ids,
+    do_sample=True,
+    temperature=0.7,
+    max_new_tokens=32,
+    stop=stop)
+```
 
 ### Example Commands
 You can use the example commands below.
 If you do not have enough GPU/CPU memory, see the [Handle Out-of-memory](#handle-out-of-memory) section.
 
 ```
 # Complete with OPT-6.7B. You need at least 15GB of GPU memory.
-python3 completion.py --model facebook/opt-6.7b
+python3 -m flexgen.apps.completion --model facebook/opt-6.7b
 ```
 
 ```
 # Complete with OPT-30B. You need about 90GB of CPU memory.
-python3 completion.py --model facebook/opt-30b --percent 0 100 100 0 100 0
+python3 -m flexgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
 ```
 
 ```
 # Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
-python3 completion.py --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
+python3 -m flexgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
 ```
 
 ### Handle Out-of-memory
````
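The `do_sample=True, temperature=0.7` arguments in the generate call above select stochastic decoding. As a rough, self-contained illustration of what temperature sampling does (this is not FlexGen code, which additionally handles batching and offloading internally):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Draw one token id from softmax(logits / temperature).

    Illustrative only: lower temperature concentrates probability
    mass on the largest logit; higher temperature flattens it.
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    acc = 0.0
    for token_id, w in enumerate(weights):
        acc += w
        if r < acc:
            return token_id
    return len(weights) - 1

# A very low temperature almost always picks the argmax token.
print(sample_with_temperature([0.0, 10.0, 0.0], temperature=0.01))  # 1
```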
```diff
@@ -175,7 +182,7 @@ They save more memory but run slower.
 
 - Do not pin weights by adding `--pin-weight 0`. This can reduce the weight memory usage on CPU by around 20% or more.
 - Enable weight compression by adding `--compress-weight`. This can reduce the weight memory usage by around 70%.
-- Offload weights to disk by using `--percent 0 0 100 0 100 0`. This requires very little CPU and GPU memory.
+- Offload all weights to disk by using `--percent 0 0 100 0 100 0`. This requires very little CPU and GPU memory.
 
 ## Roadmap
 We plan to work on the following features. Community contributions are welcome.
```
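The six numbers after `--percent` in the commands above describe how tensors are split across devices. A minimal sketch of how such a flag can be declared with argparse; the option name and values match the README, but the parser, the metavar names, and the reading of the six values as (weights, KV cache, activations) × (GPU, CPU) percentages are our assumptions, not FlexGen's actual CLI code:

```python
import argparse

# Hypothetical parser for a six-number placement flag like
# `--percent 0 100 100 0 100 0`; metavar names reflect our assumed
# grouping, with the remainder of each pair implicitly on disk.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--percent", nargs=6, type=int,
    default=[100, 0, 100, 0, 100, 0],
    metavar=("W_GPU", "W_CPU", "C_GPU", "C_CPU", "A_GPU", "A_CPU"))

args = parser.parse_args("--percent 0 100 100 0 100 0".split())
w_gpu, w_cpu, cache_gpu, cache_cpu, act_gpu, act_cpu = args.percent
print(args.percent)  # [0, 100, 100, 0, 100, 0]
```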
File renamed without changes.

flexgen/apps/__init__.py

Whitespace-only changes.
File renamed without changes.
File renamed without changes.
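Moving `apps/` under `flexgen/` (with the new, empty `flexgen/apps/__init__.py`) is what enables the `python3 -m flexgen.apps.completion` invocation in the README diff above: `-m` resolves a dotted module name through packages on `sys.path` rather than running a script by file path. A self-contained sketch using a throwaway package (the `pkg.apps.completion` names here are illustrative, not the real FlexGen sources):

```python
import os
import subprocess
import sys
import tempfile

# Build a throwaway package mirroring the flexgen/apps layout.
with tempfile.TemporaryDirectory() as root:
    apps_dir = os.path.join(root, "pkg", "apps")
    os.makedirs(apps_dir)
    # Empty __init__.py files mark the directories as importable packages.
    open(os.path.join(root, "pkg", "__init__.py"), "w").close()
    open(os.path.join(apps_dir, "__init__.py"), "w").close()
    with open(os.path.join(apps_dir, "completion.py"), "w") as f:
        f.write('print("hello from pkg.apps.completion")\n')
    # `-m` looks up pkg.apps.completion on sys.path (which includes the
    # working directory), unlike `python3 completion.py`, which needs a path.
    out = subprocess.run([sys.executable, "-m", "pkg.apps.completion"],
                         cwd=root, capture_output=True, text=True)

print(out.stdout.strip())  # hello from pkg.apps.completion
```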

pyproject.toml

+1-1
```diff
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "flexgen"
-version = "0.1.4"
+version = "0.1.5"
 description = "Running large language models like OPT-175B/GPT-3 on a single GPU. Focusing on high-throughput large-batch generation."
 readme = "README.md"
 requires-python = ">=3.7"
```

0 commit comments
