| Literature DB >> 33267071 |
Liang Gao, Xu Lan, Haibo Mi, Dawei Feng, Kele Xu, Yuxing Peng.
Abstract
Recently, deep learning has achieved state-of-the-art performance in more aspects than traditional shallow architecture-based machine-learning methods. However, in order to achieve higher accuracy, it is usually necessary to extend the network depth or ensemble the results of different neural networks. Increasing network depth or ensembling different networks increases the demand for memory resources and computing resources. This leads to difficulties in deploying depth-learning models in resource-constrained scenarios such as drones, mobile phones, and autonomous driving. Improving network performance without expanding the network scale has become a hot topic for research. In this paper, we propose a cross-architecture online-distillation approach to solve this problem by transmitting supplementary information on different networks. We use the ensemble method to aggregate networks of different structures, thus forming better teachers than traditional distillation methods. In addition, discontinuous distillation with progressively enhanced constraints is used to replace fixed distillation in order to reduce loss of information diversity in the distillation process. Our training method improves the distillation effect and achieves strong network-performance improvement. We used some popular models to validate the results. On the CIFAR100 dataset, AlexNet's accuracy was improved by 5.94%, VGG by 2.88%, ResNet by 5.07%, and DenseNet by 1.28%. Extensive experiments were conducted to demonstrate the effectiveness of the proposed method. On the CIFAR10, CIFAR100, and ImageNet datasets, we observed significant improvements over traditional knowledge distillation.Entities:
Keywords: deep learning; distributed architecture; knowledge distillation; supplementary information
Year: 2019 PMID: 33267071 PMCID: PMC7514841 DOI: 10.3390/e21040357
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1 Overview of our approach. Multiple students work together; each student is a separate model, and the teacher is the information aggregated from the multiple student networks. First, the students train their models by minimizing the cross-entropy loss to learn from the ground-truth labels Y. Then, the server aggregates the student information to generate the teacher information. Finally, the teacher feeds back to the students by minimizing the Kullback–Leibler divergence between the fixed aggregated soft labels and the student soft labels during the distillation phase.
Figure 2 Outputs of AlexNet after 50 rounds of training on the MNIST dataset. (a) One sample from the MNIST handwritten-digit database, (b) standard probability (temperature = 1), and (c) soft-label probability (temperature = 3) for the input. Soft labels express the class-similarity relations more comprehensively.
Figure 3 Our method’s training process.
Top-1 accuracy (%) on the CIFAR10 and CIFAR100 datasets. I represents independent training and MD represents our method. m1 and m2 are the abbreviations of Model1 and Model2.
CIFAR10:

| Model1 | Model2 | I (m1) | I (m2) | MD (m1) | MD (m2) |
|---|---|---|---|---|---|
| VGG | ResNet | 93.36 | 93.84 | | |
| DenseNet | ResNet | 95.35 | 93.84 | | |
| AlexNet | VGG | 77.58 | 93.36 | | |
| ResNet | ResNet | 93.84 | 93.84 | | |
| DenseNet | DenseNet | 95.35 | 95.35 | | |
| VGG | VGG | 93.36 | 93.36 | | |

CIFAR100:

| Model1 | Model2 | I (m1) | I (m2) | MD (m1) | MD (m2) |
|---|---|---|---|---|---|
| VGG | ResNet | 72.57 | 74.02 | | |
| DenseNet | ResNet | 77.57 | 74.02 | | |
| AlexNet | VGG | 43.85 | 72.57 | | |
| ResNet | ResNet | 74.02 | 74.02 | | |
| DenseNet | DenseNet | 77.57 | 77.57 | | |
| VGG | VGG | 72.57 | 72.57 | | |
Test accuracy (%) on ImageNet.
| Model1 | Model2 | I (m1) | I (m2) | MD (m1) | MD (m2) |
|---|---|---|---|---|---|
| ResNet18 | ResNet18 | 69.76 | 69.76 | | |
| ResNet18 | ResNet50 | 69.76 | 76.15 | | |
| ResNet18 | SqueezeNet | 69.76 | 58.1 | | |
Comparison with state-of-the-art online-distillation methods. Red/Blue: Best and second-best results.
| Network | AlexNet | AlexNet | ResNet-110 | ResNet-110 |
|---|---|---|---|---|
| Datasets | CIFAR10 | CIFAR100 | CIFAR10 | CIFAR100 |
| Baseline | 77.58 | 43.85 | 93.84 | 74.02 |
| DML | 78.6 | | 94.81 | 76.92 |
| EC-DNN | | 46.08 | | |
| MD | | | | |
Figure 4 Classification accuracy compared to state-of-the-art online distillation. Baseline is the node-training-alone method; EC-DNN, Ensemble-Compression method; DML, Deep Mutual Learning method; MD, our method.
Figure 5 Effect of student number on CIFAR10.
With the number of independent-training epochs and the number of distillation epochs fixed, the influence of the temperature parameter on accuracy (%) is studied. The same-structure pair (AlexNet + AlexNet) and the different-structure pair (AlexNet + VGG19) are compared in the experiment.
| Temperature | 0.5 | 1 | 2 | 3 | 4 | 5 | 6 | 9 |
|---|---|---|---|---|---|---|---|---|
| AlexNet | NaN | 79.37 | 79.3 | | 79.3 | 79.43 | 79.35 | 79.42 |
| AlexNet | NaN | 79.56 | 79.21 | 79.59 | 79.46 | | 79.28 | 79.48 |
| AlexNet | NaN | NaN | 79.97 | 80.23 | 80.22 | | 79.79 | 80.17 |
| VGG19 | NaN | NaN | 93.72 | | 93.41 | 93.46 | 93.67 | 93.68 |
With the temperature parameter and the number of independent-training epochs fixed, the number of distillation epochs is varied; the table reports the accuracy (%) of cooperative training of two nodes (AlexNet + AlexNet) using our method (a sketch of this alternating schedule follows the table).
| Distillation Epochs | 1 | 5 | 10 | 20 | 30 | 40 |
|---|---|---|---|---|---|---|
| AlexNet | 77.72 | 78.29 | 78.9 | 78.84 | | 79.41 |
| AlexNet | 77.96 | 78.54 | 78.96 | 79.04 | | 79.31 |
Comparison of results with and without gradually enhanced teacher constraints. gMD, gradually increasing distillation weights; fMD, fixed distillation weights (a sketch of the weight ramp follows the table).
| Dataset | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR100 | CIFAR100 | CIFAR100 |
|---|---|---|---|---|---|---|
| Model | Baseline | fMD | gMD | Baseline | fMD | gMD |
| AlexNet | 77.58 | 79.49 | 79.87 | 43.85 | 49.96 | 50.13 |
| VGG19 | 93.36 | 93.93 | 94.33 | 72.57 | 75.32 | 75.45 |