Release/2.1: Port a few commits about WOQ and SmoothQuant from cpu-device (#2275)

* WOQ: blockwise quantization of activation (#2136) (a qconfig sketch follows this list)
  * WOQ: Add blockwise quantization of activation
  * Update docstring for WOQ qconfig
  * Modify API names: M -> BATCH, K -> IC
  * Improve docstring for get_weight_only_quant_qconfig_mapping
  * Fix concat-linear bug and improve lowp_mode docstring
  * Use PER_IC_BLOCK by default
* WOQ: fix bf16 correctness bug when lowp_mode=NONE (#2166)
  * WOQ: fix bf16 correctness bug when lowp_mode=NONE
  * Improve UT
  * Fix clang-format issue
* WOQ: blockwise quantization of weight (#2238)
  * Add block-wise quantization of weight
  * Use self-defined quantize/dequantize functions instead of those from PyTorch
  * Remove the old fallback and use the new one; patch N and K when necessary
  * Fix bug for concat linear
  * Fix bugs about uncompressing int4 zero points and concat linear
  * Fix bug about fallback path
  * Fix clang-format issue
  * Update LLM example script and readme
  * Fix DeepSpeed UT
  * block_k can be less than group_size; update group_size docstring
* WOQ bug fix: fuse linear-gelu instead of linear-new_gelu (#2265)
  * WOQ bug fix: fuse linear-gelu instead of linear-new_gelu
  * Use _convert_woq instead of PyTorch's prepare/convert for WOQ in optimize_transformers
  * Fix int8 concat-linear accuracy issue by adding scales/zero points to IpexWoqLinear.from_float
  * Fix flake8 issue
  * Move _convert_woq to quantization.convert
  * Revert changes to quantization.convert to fix DeepSpeed UT failure
* SmoothQuant: make share_weight_observers configurable for layers like QKV (#2106) (see the SmoothQuant sketch below)
  * SmoothQuant: make share_weight_observers configurable for layers like QKV
  * Add share_weight_observers to qconf_summary
* SmoothQuant: do not insert mul if the user cancels quantization of a linear via qconf (#2254)
  * SmoothQuant: do not insert mul if the user cancels quantization of a linear via qconf
  * Add UT
  * Fix flake8 issue
* Bug fix: SmoothQuant custom sub-observers are shared by mistake (#2219)
  * Bug fix: SmoothQuant custom sub-observers are shared by mistake
  * Fix flake8 issue
  * Update docstring for the change
* Bug fix: fail to detect that the SmoothQuant observer has run (#2269)
* Add an int8 linear op for Bitsandbytes (#2266) (see the op sketch below)
  * Add an int8 linear op for Bitsandbytes
  * Rename the op to matmul_i8i8i32 since it is not specific to bnb
  * Add meaningful error messages for TORCH_CHECK
  * Fix clang-format issue
  * Run flake8
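
For context on the WOQ items above (#2136, #2238): block-wise quantization is selected through the qconfig mapping built by get_weight_only_quant_qconfig_mapping. Below is a minimal sketch assuming the IPEX 2.1 Python API; the enum homes (WoqLowpMode, WoqActQuantMode) and the torch.quint4x2 weight dtype are assumptions about that release, not taken from this commit.

```python
import torch
import intel_extension_for_pytorch as ipex

# Sketch of a weight-only-quantization (WOQ) qconfig mapping. Treat the exact
# enum locations and the weight dtype as assumptions about IPEX 2.1.
qconfig_mapping = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.quint4x2,                   # assumed INT4 weight storage
    lowp_mode=ipex.quantization.WoqLowpMode.INT8,  # low-precision compute mode
    # Block-wise activation quantization along the input-channel (IC) axis,
    # the default established by #2136; the old M/K names became BATCH/IC.
    act_quant_mode=ipex.quantization.WoqActQuantMode.PER_IC_BLOCK,
    group_size=128,  # weight quantization group size; block_k may be smaller (#2238)
)
```

The resulting mapping is consumed by IPEX's WOQ conversion path; per #2265, optimize_transformers routes through _convert_woq rather than PyTorch's generic prepare/convert.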
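For the share_weight_observers change (#2106), here is a hedged sketch of how the knob could be set. Only the option name comes from the commit message; its placement as a keyword argument of get_smooth_quant_qconfig_mapping is an assumption.

```python
import intel_extension_for_pytorch as ipex

# Hypothetical usage: give each weight of a fused (e.g. QKV) linear its own
# observer instead of one shared observer. Kwarg placement is assumed; the
# option name comes from #2106.
qconfig_mapping = ipex.quantization.get_smooth_quant_qconfig_mapping(
    alpha=0.5,                     # SmoothQuant migration strength
    share_weight_observers=False,  # do not share one observer across Q/K/V weights
)
```

Per the commit, the setting is also recorded in the qconf_summary file, so a saved calibration recipe remembers whether weight observers were shared.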
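The op added in #2266 performs an int8 x int8 matmul with int32 accumulation. A minimal sketch follows; torch_ipex is IPEX's usual custom-op namespace, and the operand shapes and layout here are illustrative assumptions rather than the documented signature.

```python
import torch
import intel_extension_for_pytorch  # registers the torch_ipex op namespace

# Illustrative operands; real callers (e.g. Bitsandbytes) pass their own
# quantized tensors. The layout of the second operand is an assumption.
a = torch.randint(-128, 128, (4, 64), dtype=torch.int8)   # activation
b = torch.randint(-128, 128, (64, 32), dtype=torch.int8)  # weight
c = torch.ops.torch_ipex.matmul_i8i8i32(a, b)  # int32 output, per the op name
print(c.shape, c.dtype)
```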