Quantization is a popular model compression method that produces smaller model sizes and faster inference.
Based on PaddleSlim's Auto Compression Toolkit (ACT), FastDeploy provides users with a one-click model auto-compression tool. This tool offers a variety of compression strategies; the current main strategies are post-training quantization and quantization-aware distillation training. FastDeploy also supports deploying the compressed models, helping users achieve inference acceleration.
## Multiple inference engines and hardware support for quantized model deployment in FastDeploy
Currently, multiple inference engines in FastDeploy support deploying quantized models on different hardware:
| Hardware / Inference Engine | ONNX Runtime | Paddle Inference | TensorRT | Paddle-TensorRT |
| :-------------------------- | :----------: | :--------------: | :------: | :-------------: |
| CPU                         | Support      | Support          |          |                 |
| GPU                         |              |                  | Support  | Support         |
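For reference, below is a minimal sketch of selecting one of these backends via FastDeploy's Python `RuntimeOption` when loading a quantized Paddle model. The file paths are placeholders, and the method names follow the FastDeploy 1.0-era Python API, which may differ between versions.

```python
import fastdeploy as fd

option = fd.RuntimeOption()
# Placeholder paths to a quantized Paddle inference model
option.set_model_path("quant_model/model.pdmodel", "quant_model/model.pdiparams")

use_gpu = True
if use_gpu:
    # GPU deployment of the quantized model: TensorRT or Paddle-TensorRT
    option.use_gpu(0)
    option.use_trt_backend()
    # option.enable_paddle_to_trt()  # uncomment to run TensorRT through Paddle Inference
else:
    # CPU deployment of the quantized model: ONNX Runtime or Paddle Inference
    option.use_cpu()
    option.use_ort_backend()
    # option.use_paddle_infer_backend()  # alternative: Paddle Inference

runtime = fd.Runtime(option)
```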
## Model Quantization
### Quantization Method
Based on PaddleSlim, the quantization methods currently provided by FastDeploy's one-click model auto-compression are quantization-aware distillation training and post-training quantization. Quantization-aware distillation training obtains a quantized model through model training, while post-training quantization quantizes a model without any training. FastDeploy can deploy the quantized models produced by either method.
The comparison of the two methods is shown in the following table:
| Method | Time Cost | Quantized Model Accuracy | Quantized Model Size | Inference Speed |
| :----- | :-------- | :----------------------- | :------------------- | :-------------- |
| Post-Training Quantization | Less than quantization-aware training | Lower than quantization-aware training | Same | Same |
| Quantization-Aware Distillation Training | Normal | Lower than the FP32 model | Same | Same |
### Use the FastDeploy one-click model auto-compression tool to quantize models
Based on PaddleSlim's Auto Compression Toolkit (ACT), FastDeploy provides a one-click model auto-compression tool; please refer to the one-click model auto-compression documentation for usage.
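For illustration, here is a minimal sketch of the PaddleSlim ACT API that the one-click tool wraps. The paths, the strategy key, and the dataloader are placeholders (the exact configuration fields vary by model and PaddleSlim version); in practice the one-click tool drives this for you from a YAML config.

```python
from paddleslim.auto_compression import AutoCompression

# Placeholder: a paddle.io.DataLoader over a small amount of unlabeled data,
# preprocessed the same way as at inference time.
train_loader = ...

ac = AutoCompression(
    model_dir="./model_float32",        # exported Paddle inference model (placeholder path)
    model_filename="model.pdmodel",
    params_filename="model.pdiparams",
    save_dir="./model_int8",            # output directory for the quantized model
    config={"QuantPost": {}},           # placeholder strategy: post-training quantization
    train_dataloader=train_loader,
)
ac.compress()
```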
FastDeploy currently supports automated compression, and the Runtime Benchmark and End-to-End Benchmark of the models that have completed deployment testing are shown below.
NOTE:
- Runtime latency is the model's inference latency on each Runtime, including the CPU->GPU data copy, GPU inference, and GPU->CPU data copy time. It does not include the model's pre- and post-processing time.
- End-to-end latency is the model's latency in an actual inference scenario, including its pre- and post-processing.
- The measured latencies are averaged over 1000 inferences, in milliseconds.
- INT8 + FP16 means the Runtime's FP16 inference option is enabled while running the INT8 quantized model.
- INT8 + FP16 + PM means pinned memory is used in addition to running the INT8 quantized model with FP16 enabled, which speeds up GPU->CPU data copies (see the sketch after these notes).
- The maximum speedup ratio is obtained by dividing the FP32 latency by the fastest INT8 inference latency.
- The strategy is quantization-aware distillation training, which uses a small amount of unlabeled data to train the quantized model and verifies accuracy on the full validation set; the reported INT8 accuracy is not the highest achievable INT8 accuracy.
- The CPU is an Intel(R) Xeon(R) Gold 6271C with the CPU thread count fixed to 1 in all tests; the GPU is a Tesla T4 with TensorRT 8.4.15.
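As an illustration of the INT8 + FP16 + PM setting above, the following sketch enables FP16 inference and pinned memory in FastDeploy's Python `RuntimeOption` when running an INT8 model on the TensorRT backend. The model path is a placeholder and the method names follow the FastDeploy 1.0-era API, which may differ between versions.

```python
import fastdeploy as fd

option = fd.RuntimeOption()
# Placeholder paths to an INT8 quantized Paddle model
option.set_model_path("quant_model/model.pdmodel", "quant_model/model.pdiparams")
option.use_gpu(0)
option.use_trt_backend()       # run the INT8 quantized model on TensorRT
option.enable_trt_fp16()       # "INT8 + FP16": also allow FP16 kernels
option.enable_pinned_memory()  # "PM": pinned host memory, faster GPU->CPU copies

runtime = fd.Runtime(option)
```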