Knowledge Distillation (Hinton et al., 2015) is a technique in which a small model (the student) is trained to mimic a pre-trained large model (the teacher) through knowledge transfer. The goal is to improve the accuracy of the student network.
NNCF for PyTorch supports Knowledge Distillation out of the box in combination with all supported compression algorithms (quantization, sparsity, filter pruning): the student is the model being compressed, and the teacher is the original, non-compressed model.
Knowledge is transferred from the teacher model to the student by minimizing a loss function calculated from the predictions of both models. Two types of distillation loss are currently available; one of them must be explicitly specified in the config.
MSE distillation loss:

$$KD_{loss}(z_s, z_t) = \| z_s - z_t \|_2^2$$

Cross-Entropy distillation loss:

$$KD_{loss}(z_s, z_t) = -\sum_i p_i^{t} \log p_i^{s}, \qquad p^{s} = \mathrm{softmax}(z_s), \quad p^{t} = \mathrm{softmax}(z_t)$$

where $z_s$ and $z_t$ are the outputs (logits) of the student and teacher models, respectively.
The Knowledge Distillation loss is combined with the regular training loss, so the overall loss is computed as:

$$L_{total} = L_{reg}(z_s, y) + KD_{loss}(z_s, z_t)$$
Note: the Cross-Entropy distillation loss was proposed in Hinton et al., 2015 with a temperature parameter; NNCF does not use it, which is equivalent to assuming T=1.
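For illustration, below is a minimal PyTorch sketch of the two distillation losses and the combined objective described above. The function names, the mean reduction in the MSE variant, and the unweighted sum are assumptions of this sketch; NNCF computes these losses internally.

```python
import torch
import torch.nn.functional as F


def mse_distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Squared difference between student and teacher outputs
    # (mean reduction is an assumption of this sketch).
    return F.mse_loss(student_logits, teacher_logits.detach())


def softmax_distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Cross-entropy between the teacher's and the student's soft predictions (T=1, as noted above).
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()


def overall_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Regular task loss plus the distillation term, matching the formula above.
    task_loss = F.cross_entropy(student_logits, targets)
    return task_loss + softmax_distillation_loss(student_logits, teacher_logits)
```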
To enable Knowledge Distillation together with a compression algorithm (e.g. `filter_pruning`), specify the `knowledge_distillation` algorithm and its type in the config:
```json
{
    ...
    "compression": [
        {
            "algorithm": "filter_pruning",
            ...
        },
        {
            "algorithm": "knowledge_distillation",
            "type": "softmax" // or "type": "mse"
        }
    ]
}
```
See this config file for an example, and the NNCF config schema for a reference on the available configuration parameters of the algorithm.
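For context, here is a rough sketch of how such a combined config might be wired into a PyTorch training loop with the NNCF API. The model, dummy data, and hyperparameters are placeholders, and the assumption that the distillation term is folded into `compression_ctrl.loss()` is part of this sketch; refer to the NNCF usage documentation for the authoritative workflow.

```python
import torch
import torchvision
from nncf import NNCFConfig
from nncf.torch import create_compressed_model

# The same structure as the JSON config above, built in code for brevity.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": [
        {"algorithm": "filter_pruning"},
        {"algorithm": "knowledge_distillation", "type": "softmax"},
    ],
})

model = torchvision.models.resnet18()        # placeholder; use a pre-trained model in practice
criterion = torch.nn.CrossEntropyLoss()

# NNCF keeps a copy of the original model as the teacher;
# the returned compressed model plays the role of the student.
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
optimizer = torch.optim.SGD(compressed_model.parameters(), lr=1e-3)

# Dummy data for illustration only.
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))) for _ in range(2)]

for inputs, targets in train_loader:
    optimizer.zero_grad()
    outputs = compressed_model(inputs)
    # Task loss on the student's predictions plus the compression losses,
    # which in this sketch are assumed to include the distillation term.
    loss = criterion(outputs, targets) + compression_ctrl.loss()
    loss.backward()
    optimizer.step()
    compression_ctrl.scheduler.step()
```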
- The algorithm is supported for PyTorch only.
- Training the same configuration with Knowledge Distillation requires more time and GPU memory than training it without Knowledge Distillation. On average, the memory and time overhead is below 20% each, for all GPU execution modes.
- Model outputs that should not be differentiated must have `requires_grad=False`.
- The model should output predictions, not calculate losses (see the sketch after this list).
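To illustrate the last two requirements, here is a hypothetical model sketch that follows this contract; the class and output names are illustrative and not part of the NNCF API.

```python
import torch


class Classifier(torch.nn.Module):
    """Returns predictions only; task and distillation losses are computed outside the model."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 128),
            torch.nn.ReLU(),
        )
        self.head = torch.nn.Linear(128, num_classes)

    def forward(self, x):
        features = self.backbone(x)
        logits = self.head(features)  # differentiable predictions used by the loss functions
        # Auxiliary output that should not be differentiated:
        # detach it so that its requires_grad is False.
        confidences = torch.softmax(logits, dim=-1).max(dim=-1).values.detach()
        return logits, confidences
```

The loss (for example `torch.nn.functional.cross_entropy(logits, targets)`) is then applied outside the model, so the distillation loss can be attached to the same outputs.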