@@ -142,31 +142,38 @@ For example, if you have 2 GPUs but the aggregated GPU memory is less than the m
See examples [here](https://github.com/FMInference/FlexGen/tree/main/benchmark/flexgen#distributed-gpus).

## API Example
- We demonstrate the usage of FlexGen API in [apps/completion.py](apps/completion.py).
+ We demonstrate the usage of FlexGen API in [completion.py](flexgen/apps/completion.py).
This example shows how to run generation for two sentences.
To get the best throughput out of FlexGen, you typically need to batch more sentences.
### Generation API
FlexGen has a generation API following the style of Hugging Face's transformers.
- https://github.com/FMInference/FlexGen/blob/cf90920349109205378e5253fd5e8da4fa2740c1/apps/completion.py#L53-L58
+ ```python
+ output_ids = model.generate(
+     input_ids,
+     do_sample=True,
+     temperature=0.7,
+     max_new_tokens=32,
+     stop=stop)
+ ```
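The `do_sample=True` / `temperature=0.7` arguments above follow the Hugging Face sampling convention: logits are divided by the temperature before the softmax, so lower values sharpen the distribution toward greedy decoding. A minimal pure-Python sketch of that sampling step (illustrative only, not FlexGen's implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Draw one token id from raw logits after temperature scaling."""
    rng = rng or random.Random(0)
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    # Inverse-CDF sampling over the softmax distribution.
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(logits) - 1

# A very low temperature makes the distribution near-deterministic.
print(sample_with_temperature([0.1, 5.0, 0.2], temperature=0.01))  # → 1
```

At `temperature=1.0` this is plain softmax sampling; as the temperature approaches zero it converges to argmax.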
### Example Commands
You can use the example commands below.
If you do not have enough GPU/CPU memory, see the [ Handle Out-of-memory] ( #handle-out-of-memory ) section.
```
# Complete with OPT-6.7B. You need at least 15GB of GPU memory.
- python3 completion.py --model facebook/opt-6.7b
+ python3 -m flexgen.apps.completion --model facebook/opt-6.7b
```
```
# Complete with OPT-30B. You need about 90GB of CPU memory.
- python3 completion.py --model facebook/opt-30b --percent 0 100 100 0 100 0
+ python3 -m flexgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
```
```
# Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
- python3 completion.py --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
+ python3 -m flexgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
```
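The `--percent` flag used in the commands above takes six integers: the GPU and CPU percentages for the weights, the KV cache, and the activations, with the remainder of each pair offloaded to disk. A small sketch of that convention (the `parse_percent` helper below is hypothetical, not part of FlexGen):

```python
def parse_percent(p):
    """Decode a six-integer --percent policy into per-tensor placements.

    Pairs are (GPU%, CPU%) for weights, KV cache, and activations;
    whatever is left of each 100% goes to disk.
    """
    assert len(p) == 6, "--percent expects six integers"
    policy = {}
    for name, gpu, cpu in zip(["weight", "cache", "activation"],
                              p[0::2], p[1::2]):
        disk = 100 - gpu - cpu
        assert disk >= 0, "GPU% + CPU% must not exceed 100"
        policy[name] = {"gpu": gpu, "cpu": cpu, "disk": disk}
    return policy

# --percent 0 100 100 0 100 0: weights fully on CPU,
# cache and activations fully on GPU.
print(parse_percent([0, 100, 100, 0, 100, 0]))
```

Under this reading, `--percent 0 0 100 0 100 0` from the Handle Out-of-memory section puts all weights on disk while keeping the cache and activations on GPU.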
### Handle Out-of-memory
@@ -175,7 +182,7 @@ They save more memory but run slower.
- Do not pin weights by adding `--pin-weight 0`. This can reduce the weight memory usage on CPU by around 20% or more.
- Enable weight compression by adding `--compress-weight`. This can reduce the weight memory usage by around 70%.
- - Offload weights to disk by using `--percent 0 0 100 0 100 0`. This requires very little CPU and GPU memory.
+ - Offload all weights to disk by using `--percent 0 0 100 0 100 0`. This requires very little CPU and GPU memory.
## Roadmap
We plan to work on the following features. Community contributions are welcome.