Is there a way of inferring on models with int8 MoQ? #2412
EdouardVilain-Git asked this question in Q&A · Unanswered
Hi all, thanks for the awesome work!
I am working with DeepSpeed to apply MoQ to BERT-like transformer architectures (XLM-RoBERTa, to be precise). I have been able to train the model on 2 GPUs with the following configuration:
{ "train_batch_size": 32, "steps_per_print": 50, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 0 }, "fp16": { "enabled": true }, "compression_training": { "weight_quantization": { "shared_parameters": { "enabled": true, "quantizer_kernel": false, "schedule_offset": 0, "quantize_groups": 64, "quantize_verbose": true, "quantization_type": "symmetric", "quantize_weight_in_forward": false, "rounding": "nearest", "fp16_mixed_quantize": { "enabled": false, "quantize_change_ratio": 0.1 } }, "different_groups": { "wq1": { "params": { "start_bits": 16, "target_bits": 8, "quantization_period": 350 }, "modules": ["attention.self", "intermediate", "word_embeddings", "output.dense", "pooler.dense", "category_embeddings"] } } }, "activation_quantization": { "shared_parameters": { "enabled": true, "quantization_type": "symmetric", "range_calibration": "dynamic", "schedule_offset": 0 }, "different_groups": { "aq1": { "params": { "bits": 8 }, "modules": ["attention.self", "intermediate", "output.dense"] } } } } }
I am now trying to run inference with:

```python
engine = deepspeed.init_inference(
    deepspeed_trainer.model,
    dtype=torch.int8,
    quantization_setting=(False, 64),
    replace_with_kernel_inject=True,
)
```
Several issues have already been opened about the error this produces (#2301), and it seems that int8 inference is not yet supported by DeepSpeed because the int8 inference kernels have not been released.
Nevertheless, I would like to find a way to use this quantized model for inference. I have considered running on CPU, but that would require bypassing the DeepSpeed InferenceEngine. Is there a way to load the quantized model into a plain PyTorch model to enable int8 inference on CPU?
I guess a simple way of doing so would be to run post-training dynamic quantization on the already quantized weights using PyTorch's quantization module, but that is far from elegant. Hoping to find something a bit better!
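For completeness, this is the fallback I have in mind; a minimal sketch assuming the MoQ-trained weights can simply be reloaded into the fp32 Hugging Face model (the checkpoint path and model class are placeholders):

```python
import torch
from transformers import XLMRobertaForSequenceClassification

# Placeholder: reload the MoQ-trained weights into a plain fp32 model.
model = XLMRobertaForSequenceClassification.from_pretrained("path/to/moq_checkpoint")
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time (CPU only).
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Dummy CPU inference to illustrate usage.
dummy_input_ids = torch.randint(0, 1000, (1, 32))
with torch.no_grad():
    outputs = quantized_model(input_ids=dummy_input_ids)
```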