GPTQModel v1.8.1
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: quantized weights can now be packed to int32, int16, or int8 dtypes. The Triton and Torch kernels support the full range of the new `QuantizeConfig.pack_dtype` (see the usage sketch after this list).
⚡ Over 50% speedup for visual language (vl) model quantization (Qwen 2.5-VL + Ovis).
⚡ New `auto_gc: bool` control in `quantize()`, which can reduce quantization time for small models with no risk of OOM.
⚡ New `GPTQModel.push_to_hub()` API for easy quantized model upload to HF repos.
⚡ New `buffered_fwd: bool` control in `model.quantize()`.
🐛 Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes
🐛 Fixed Python 3.10 compatibility
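
A minimal combined sketch of the new v1.8.1 controls follows. The `pack_dtype`, `auto_gc`, and `buffered_fwd` names come from this release's notes; the model id, calibration text, save path, and the `push_to_hub()` argument shapes are illustrative assumptions, so check the GPTQModel docs for exact signatures.

```python
# Sketch only: pack_dtype / auto_gc / buffered_fwd follow this release's
# notes; the model id, calibration data, save path, and push_to_hub()
# arguments are placeholders/assumptions, not confirmed signatures.
import torch
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int16,  # new: pack quantized weights to int32/int16/int8
)

calibration = ["Example calibration sentence for GPTQ."]  # placeholder data

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(
    calibration,
    auto_gc=False,      # new: skip aggressive gc to cut quant time on small models
    buffered_fwd=True,  # new: experimental buffered-forward control
)
model.save("Qwen2.5-0.5B-gptq-4bit")

# New upload helper added in this release; argument names are assumptions.
GPTQModel.push_to_hub("your-org/Qwen2.5-0.5B-gptq-4bit", "Qwen2.5-0.5B-gptq-4bit")
```
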
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and group_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add `pack_dtype` to dynamic config and fix validate by @Qubitium in #1178
- Refactor 4 by @Qubitium in #1180
- Refactor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refactor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting `max_memory` arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental `buffered_fwd` quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add `GPTQModel.push_to_hub()` support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
- fix merge error by @CSY-ModelCloud in #1234
- fix var name by @CSY-ModelCloud in #1235
- fix visual llm slow forward by @LRL-ModelCloud in #1232
Full Changelog: v1.7.4...v1.8.1