Knowledge distillation in deep learning and its applications
Abdolmaged Alkhulaifi, Fahad Alsahli, Irfan Ahmad.
Abstract
Deep learning-based models are relatively large, and it is hard to deploy such models on resource-limited devices such as mobile phones and embedded devices. One possible solution is knowledge distillation, whereby a smaller model (the student model) is trained by utilizing the information from a larger model (the teacher model). In this paper, we present an outlook of knowledge distillation techniques applied to deep learning models. To compare the performances of different techniques, we propose a new metric called the distillation metric, which compares knowledge distillation solutions based on models' sizes and accuracy scores. Based on the survey, some interesting conclusions are drawn and presented in this paper, including the current challenges and possible research directions.
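The abstract introduces the distillation metric only at a high level. As a hedged illustration, here is one plausible way such a metric could combine model size and accuracy; the convex-combination form, the weight `alpha`, and the function name are assumptions for this sketch, not necessarily the paper's exact definition. Lower scores favor smaller students that retain more of the teacher's accuracy.

```python
def distillation_metric(student_size, teacher_size,
                        student_acc, teacher_acc, alpha=0.5):
    """Hypothetical distillation score: lower is better.

    Combines the relative model size (student/teacher, < 1 when the
    student is smaller) with the relative accuracy loss
    (teacher/student, > 1 when the student is less accurate).
    alpha trades compression against accuracy retention.
    """
    size_ratio = student_size / teacher_size
    acc_ratio = teacher_acc / student_acc
    return alpha * size_ratio + (1 - alpha) * acc_ratio

# A student at 30% of the teacher's size keeping 98% of its accuracy
# scores better (lower) than a same-size "student" with no compression.
print(distillation_metric(30e6, 100e6, 0.686, 0.700))   # ~0.66
print(distillation_metric(100e6, 100e6, 0.700, 0.700))  # 1.0
```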
Keywords: Deep learning; Knowledge distillation; Model compression; Student model; Teacher model; Transferring knowledge
Year: 2021 PMID: 33954248 PMCID: PMC8053015 DOI: 10.7717/peerj-cs.474
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. A generic illustration of knowledge distillation.
Figure 2. Illustration of knowledge distillation using (A) a pre-trained teacher model (offline) and (B) a teacher model trained simultaneously with the student (online).
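To make the offline/online distinction in Figure 2 concrete, here is a minimal PyTorch sketch, assuming toy linear models and a random batch in place of real networks and data: offline distillation freezes a pre-trained teacher, while online (mutual) distillation updates both models, each mimicking the other's current predictions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher = torch.nn.Linear(16, 4)   # toy stand-ins for real networks
student = torch.nn.Linear(16, 4)
x = torch.randn(32, 16)            # a dummy input batch

# (A) Offline: the teacher is pre-trained and frozen; only the student
# receives gradients, so the teacher's outputs carry no autograd graph.
teacher.eval()
with torch.no_grad():
    t_probs = F.softmax(teacher(x), dim=1)
offline_loss = F.kl_div(F.log_softmax(student(x), dim=1),
                        t_probs, reduction="batchmean")

# (B) Online: teacher and student are trained simultaneously, each one
# mimicking the other's current (detached) predictions.
t_logits, s_logits = teacher(x), student(x)
online_loss = (
    F.kl_div(F.log_softmax(s_logits, dim=1),
             F.softmax(t_logits, dim=1).detach(), reduction="batchmean") +
    F.kl_div(F.log_softmax(t_logits, dim=1),
             F.softmax(s_logits, dim=1).detach(), reduction="batchmean")
)
print(offline_loss.item(), online_loss.item())
```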
Figure 3. A tree diagram illustrating the different categories of knowledge distillation methods and the branches within each category.
Table 1. Summary of knowledge distillation approaches that utilize the soft labels of the teacher to train the student model (a minimal soft-label loss sketch follows the table).
In the case of several students, the results of the student with the largest size reduction are reported. In the case of several datasets, the dataset associated with the lowest accuracy reduction is recorded. Baseline models have the same size as the corresponding student models but were trained without the teacher models.
| Reference | Targeted architecture | Utilized data | Reduction in accuracy compared to teacher | Improvement in accuracy compared to baseline | Reduction in size |
|---|---|---|---|---|---|
| Offline distillation | | | | | |
| | CNN | Aurora | 0.782% | 2.238% | – |
| | Decision tree | MNIST | 12.796% | 1–5% | – |
| | DenseNet | CIFAR-100 | 2.369% (increase) | – | – |
| | Wide ResNet | CIFAR-100 | 0.1813% | – | 52.87% |
| | LSTM | SWB (Switchboard subset of the HUB5 dataset) | 2.655% | – | 55.07% |
| | Seq2Seq | WSJ (Wall Street Journal dataset) | 8.264% | 8.97% | 89.88% |
| | CNN | MNIST | 10.526% (increase) | 16.359% | – |
| | CNN | MNIST | 0.57% | – | 40% |
| | ResNet | HMDB51 | 0.6193% | – | 58.31% |
| Online distillation | | | | | |
| | ResNet | CIFAR-100 | – | 6.64% | – |
| | Micro CNN | Synthetic Aperture Radar Images dataset | 0.607% | – | 99.44% |
| | MobileNetV2 | ImageNet | 9.644% | 6.246% | 70.66% |
| | ResNet | CIFAR-100 | – | 5.39% | – |
| | ResNet | CIFAR-100 | 1.59% | 6.29% | 34.29% |
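The approaches summarized above train the student on the teacher's softened output distribution. Below is a minimal sketch of the widely used temperature-scaled formulation in the style of Hinton, Vinyals & Dean (2015), with made-up logits and labels; the particular temperature and mixing weight are illustrative assumptions, not values taken from any row of the table.

```python
import torch
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, labels,
                       temperature=4.0, alpha=0.9):
    """Temperature-scaled soft-label distillation loss.

    Both output distributions are softened by T; the KL term is scaled
    by T^2 to keep its gradient magnitude comparable to the ordinary
    cross-entropy on the hard labels, and alpha mixes the two terms.
    """
    T = temperature
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy usage for a 10-class problem.
s = torch.randn(8, 10, requires_grad=True)   # student logits
t = torch.randn(8, 10)                       # frozen teacher logits
y = torch.randint(0, 10, (8,))               # hard labels
soft_label_kd_loss(s, t, y).backward()
```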
Table 2. Summary of knowledge distillation approaches that distill knowledge from parts of the teacher model other than, or in addition to, its soft labels to train the student models (a minimal feature-distillation sketch follows the table).
In the case of several students, the results of the student with the largest size reduction are reported. In the case of several datasets, the dataset associated with the lowest accuracy reduction is recorded. Baseline models have the same size as the corresponding student models but were trained without the teacher models.
| Reference | Targeted architecture | Utilized data | Reduction in accuracy compared to teacher | Improvement in accuracy compared to baseline | Reduction in size |
|---|---|---|---|---|---|
| Offline distillation | | | | | |
| | CNN | MNIST | 4.8% | 5.699% (decrease) | 50% |
| | ResNet | CIFAR-10 | 0.3043% (increase) | – | – |
| | ResNet | CIFAR-100 | 2.889% | 7.813% | 96.20% |
| | U-Net | Janelia | – | – | 78.99% |
| | MobileNetV2 | PASCAL | 4.868% (mIoU) | – | 92.13% |
| | WRN | ImageNet to MIT scene | 6.191% (increase) | 14.123% | 70.66% |
| | CNN | UIUC-Sports | 7.431% | 16.89% | 95.86% |
| | ResNet | CIFAR-10 | 0.831% | 2.637% | 73.59% |
| Online distillation | | | | | |
| | WRN | CIFAR-10 | 1.006% | 1.37% | 66% |
| | ResNet18 | CIFAR-100 | 13.72% | – | – |
| | CNN | CIFAR-100 | 5.869% | – | – |
| | ResNet | CIFAR-10 | 1.019% | 1.095% | 96.36% |
| | WRN | CIFAR-100 | 1.557% | 6.768% | 53.333% |
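The approaches in this table distill knowledge such as intermediate feature maps rather than (or alongside) soft labels. Below is a minimal sketch of one representative member of this family, a FitNets-style "hint" loss, assuming made-up activation tensors and a hypothetical 1x1 projection to bridge the width mismatch between student and teacher layers; it is an illustration of the general technique, not any specific method from the table.

```python
import torch
import torch.nn.functional as F

# Made-up intermediate activations: the teacher layer is twice as wide.
t_feat = torch.randn(8, 64, 14, 14)                       # teacher feature map
s_feat = torch.randn(8, 32, 14, 14, requires_grad=True)   # student feature map

# A learned 1x1 convolution (trained jointly with the student) projects
# the student's features into the teacher's channel dimension, then the
# projected features are regressed onto the teacher's with an MSE loss.
regressor = torch.nn.Conv2d(32, 64, kernel_size=1)
hint_loss = F.mse_loss(regressor(s_feat), t_feat)
hint_loss.backward()
```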
Figure 4. Illustration of different types of knowledge distillation depending on the number of teachers and students.
(A) Knowledge distillation from one teacher to one student. (B) Knowledge distillation from one teacher to multiple students. (C) Knowledge distillation from multiple teachers to one student.
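For panel (C), here is a minimal sketch of one plausible multi-teacher scheme, assuming the teachers' softened predictions are simply averaged into a single target for the student; real methods may weight teachers or combine them differently, and the toy models and batch are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teachers = [torch.nn.Linear(16, 4) for _ in range(3)]  # toy teacher ensemble
student = torch.nn.Linear(16, 4)
x = torch.randn(32, 16)

# Average the teachers' softened predictions into one target distribution.
with torch.no_grad():
    target = torch.stack([F.softmax(t(x), dim=1) for t in teachers]).mean(dim=0)

# Train the student to match the averaged ensemble distribution.
loss = F.kl_div(F.log_softmax(student(x), dim=1), target,
                reduction="batchmean")
loss.backward()
```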
Figure 5. Use cases for knowledge distillation to deploy deep learning models on small devices with limited resources.