Quantization is a popular model compression method that produces smaller model sizes and faster inference.
Based on PaddleSlim's Auto Compression Toolkit (ACT), FastDeploy provides users with a one-click model auto-compression tool. This tool offers a variety of compression strategies; the current main strategies are post-training quantization and quantization-aware distillation training. FastDeploy also supports deploying the compressed models, helping users achieve inference acceleration.
## Multiple inference engines and hardware support for quantized model deployment in FastDeploy
Currently, multiple inference engines in FastDeploy support deploying quantized models on different hardware:
| Hardware / Inference Engine | ONNX Runtime | Paddle Inference | TensorRT | Paddle-TensorRT |
| :-------------------------- | :----------: | :--------------: | :------: | :-------------: |
| CPU                         | Support      | Support          |          |                 |
| GPU                         |              |                  | Support  | Support         |
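For reference, below is a minimal sketch of selecting one of these backends via FastDeploy's Python `RuntimeOption` when loading a quantized Paddle model. The file paths are placeholders, and the method names follow the FastDeploy 1.0-era Python API, which may differ between versions.

```python
import fastdeploy as fd

option = fd.RuntimeOption()
# Placeholder paths to a quantized Paddle inference model
option.set_model_path("quant_model/model.pdmodel", "quant_model/model.pdiparams")

use_gpu = True
if use_gpu:
    # GPU deployment of the quantized model: TensorRT or Paddle-TensorRT
    option.use_gpu(0)
    option.use_trt_backend()
    # option.enable_paddle_to_trt()  # uncomment to run TensorRT through Paddle Inference
else:
    # CPU deployment of the quantized model: ONNX Runtime or Paddle Inference
    option.use_cpu()
    option.use_ort_backend()
    # option.use_paddle_infer_backend()  # alternative: Paddle Inference

runtime = fd.Runtime(option)
```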
## Model Quantization
### Quantization Method
Based on PaddleSlim, the quantization methods currently provided by FastDeploy's one-click model auto-compression are quantization-aware distillation training and post-training quantization. Quantization-aware distillation training obtains a quantized model through model training, while post-training quantization quantizes a model without any training. FastDeploy can deploy the quantized models produced by either method.
The comparison of the two methods is shown in the following table:
| Method | Time Cost | Quantized Model Accuracy | Quantized Model Size | Inference Speed |
| :----- | :-------- | :----------------------- | :------------------- | :-------------- |
| Post-Training Quantization | Less than quantization-aware training | Lower than quantization-aware training | Same | Same |
| Quantization-Aware Distillation Training | Normal | Lower than the FP32 model | Same | Same |
### Use the FastDeploy one-click model auto-compression tool to quantize models
Based on PaddleSlim's Auto Compression Toolkit (ACT), FastDeploy provides a one-click model auto-compression tool; please refer to the one-click model auto-compression documentation for usage.
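For illustration, here is a minimal sketch of the PaddleSlim ACT API that the one-click tool wraps. The paths, the strategy key, and the dataloader are placeholders (the exact configuration fields vary by model and PaddleSlim version); in practice the one-click tool drives this for you from a YAML config.

```python
from paddleslim.auto_compression import AutoCompression

# Placeholder: a paddle.io.DataLoader over a small amount of unlabeled data,
# preprocessed the same way as at inference time.
train_loader = ...

ac = AutoCompression(
    model_dir="./model_float32",        # exported Paddle inference model (placeholder path)
    model_filename="model.pdmodel",
    params_filename="model.pdiparams",
    save_dir="./model_int8",            # output directory for the quantized model
    config={"QuantPost": {}},           # placeholder strategy: post-training quantization
    train_dataloader=train_loader,
)
ac.compress()
```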
FastDeploy currently supports automated compression, and the Runtime Benchmark and End-to-End Benchmark of the models that have completed deployment testing are shown below.
NOTE:
- Runtime latency is the model's inference latency on each Runtime, including the CPU->GPU data copy, GPU inference, and GPU->CPU data copy time. It does not include the model's pre- and post-processing time.
- End-to-end latency is the model's latency in an actual inference scenario, including its pre- and post-processing.
- The measured latencies are averaged over 1000 inferences, in milliseconds.
- INT8 + FP16 means the Runtime's FP16 inference option is enabled while running the INT8 quantized model.
- INT8 + FP16 + PM means pinned memory is used in addition to running the INT8 quantized model with FP16 enabled, which speeds up GPU->CPU data copies (see the sketch after these notes).
- The maximum speedup ratio is obtained by dividing the FP32 latency by the fastest INT8 inference latency.
- The strategy is quantization-aware distillation training, which uses a small amount of unlabeled data to train the quantized model and verifies accuracy on the full validation set; the reported INT8 accuracy is not the highest achievable INT8 accuracy.
- The CPU is an Intel(R) Xeon(R) Gold 6271C with the CPU thread count fixed to 1 in all tests; the GPU is a Tesla T4 with TensorRT 8.4.15.
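As an illustration of the INT8 + FP16 + PM setting above, the following sketch enables FP16 inference and pinned memory in FastDeploy's Python `RuntimeOption` when running an INT8 model on the TensorRT backend. The model path is a placeholder and the method names follow the FastDeploy 1.0-era API, which may differ between versions.

```python
import fastdeploy as fd

option = fd.RuntimeOption()
# Placeholder paths to an INT8 quantized Paddle model
option.set_model_path("quant_model/model.pdmodel", "quant_model/model.pdiparams")
option.use_gpu(0)
option.use_trt_backend()       # run the INT8 quantized model on TensorRT
option.enable_trt_fp16()       # "INT8 + FP16": also allow FP16 kernels
option.enable_pinned_memory()  # "PM": pinned host memory, faster GPU->CPU copies

runtime = fd.Runtime(option)
```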