Note
Dynamic quantization currently only supports the onnxruntime backend.
Like other quantization approaches, dynamic quantization converts floating-point values to integers by multiplying them by a scale factor and rounding the result. What makes it dynamic is that the scale factor for activations is determined on the fly, based on the data range observed at runtime, rather than being fixed ahead of time with a calibration dataset. As the PyTorch documentation notes, this "ensures that the scale factor is 'tuned' so that as much signal as possible about each observed dataset is preserved." Because it needs few tuning parameters, dynamic quantization is a good match for NLP models.
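For intuition, here is a minimal, illustrative sketch (not the onnxruntime implementation) of how a per-tensor scale factor and zero point could be derived from the range actually observed in an activation tensor at runtime:

```python
import numpy as np

def dynamic_quantize_uint8(activations: np.ndarray):
    """Illustrative only: asymmetric uint8 quantization whose scale factor
    is computed from the range observed in this tensor at runtime."""
    lo = min(float(activations.min()), 0.0)   # include 0 so it stays exactly representable
    hi = max(float(activations.max()), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(activations / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

# Each batch of activations gets its own scale/zero point, tuned to its own range.
batch = np.random.randn(4, 16).astype(np.float32)
q, scale, zero_point = dynamic_quantize_uint8(batch)
```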
The onnxruntime bert_base example shows how dynamic quantization is enabled through a yaml configuration such as the following:
```yaml
model:                                      # mandatory. used to specify model specific information.
  name: bert
  framework: onnxrt_integerops              # mandatory. possible values are tensorflow, mxnet, pytorch, pytorch_ipex, onnxrt_integerops and onnxrt_qlinearops.

quantization:
  approach: post_training_dynamic_quant     # optional. default value is post_training_static_quant.
                                            # possible values are post_training_static_quant,
                                            # post_training_dynamic_quant and
                                            # quant_aware_training.
  calibration:
    sampling_size: 8, 16, 32

tuning:
  accuracy_criterion:
    relative: 0.01                          # optional. default value is relative, other value is absolute. this example allows relative accuracy loss: 1%.
  exit_policy:
    timeout: 0                              # optional. tuning timeout (seconds). default value is 0 which means early stop. combine with max_trials field to decide when to exit.
  random_seed: 9527                         # optional. random seed for deterministic tuning.
```
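A configuration like this is consumed by the quantization tool's Python API. The sketch below assumes Intel Neural Compressor's legacy `neural_compressor.experimental` `Quantization` interface; the file names `bert_dynamic.yaml` and `bert_base.onnx` are placeholders, and the exact entry points may differ between releases:

```python
from neural_compressor.experimental import Quantization, common

# Assumed file names: the yaml above saved as "bert_dynamic.yaml" and an
# FP32 ONNX model exported as "bert_base.onnx".
quantizer = Quantization("bert_dynamic.yaml")
quantizer.model = common.Model("bert_base.onnx")

# For accuracy-driven tuning, an eval_dataloader or eval_func is also
# attached to the quantizer (omitted here for brevity).
q_model = quantizer.fit()                 # older releases invoke the object directly: q_model = quantizer()
q_model.save("bert_base_int8.onnx")       # write the dynamically quantized model
```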