Mariana-Iuliana Georgescu, Georgian-Emilian Duţǎ, Radu Tudor Ionescu.
Abstract
We study a series of recognition tasks in two realistic scenarios requiring the analysis of faces under strong occlusion. On the one hand, we aim to recognize facial expressions of people wearing virtual reality headsets. On the other hand, we aim to estimate the age and identify the gender of people wearing surgical masks. For all these tasks, the common ground is that half of the face is occluded. In this challenging setting, we show that convolutional neural networks trained on fully visible faces exhibit very low performance levels. While fine-tuning the deep learning models on occluded faces is extremely useful, we show that additional performance gains can be obtained by distilling knowledge from models trained on fully visible faces. To this end, we study two knowledge distillation methods, one based on teacher-student training and one based on triplet loss. Our main contribution consists in a novel approach for knowledge distillation based on triplet loss, which generalizes across models and tasks. Furthermore, we consider combining distilled models learned through conventional teacher-student training or through our novel teacher-student training based on triplet loss. We provide empirical evidence showing that, in most cases, both individual and combined knowledge distillation methods bring statistically significant performance improvements. We conduct experiments with three different neural models (VGG-f, VGG-face and ResNet-50) on various tasks (facial expression recognition, gender recognition, age estimation), showing consistent improvements regardless of the model or task.
Year: 2021 PMID: 34955610 PMCID: PMC8693600 DOI: 10.1007/s00138-021-01270-x
Source DB: PubMed Journal: Mach Vis Appl ISSN: 0932-8092 Impact factor: 2.012
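The half-face occlusion setting described in the abstract can be illustrated with a minimal sketch. It assumes occlusion is simulated by overwriting half of the image rows with a constant value; the record does not state how the occluded inputs were actually produced, and the function name `occlude_half` and its parameters are hypothetical.

```python
def occlude_half(image, visible="lower", fill=0):
    """Return a copy of `image` (a list of pixel rows) with one half blanked.

    visible="lower" keeps the lower half (VR-headset scenario);
    visible="upper" keeps the upper half (surgical-mask scenario).
    Blanking with a constant `fill` value is an assumption of this sketch.
    """
    height = len(image)
    cut = height // 2
    if visible == "lower":
        hidden = range(0, cut)        # hide the top rows
    elif visible == "upper":
        hidden = range(cut, height)   # hide the bottom rows
    else:
        raise ValueError("visible must be 'lower' or 'upper'")
    return [[fill] * len(row) if r in hidden else list(row)
            for r, row in enumerate(image)]


# Toy 4x2 "image" used only for illustration.
face = [[1, 2], [3, 4], [5, 6], [7, 8]]
vr_view = occlude_half(face, visible="lower")
mask_view = occlude_half(face, visible="upper")
```

Here `vr_view` keeps only the two bottom rows of the toy image and `mask_view` keeps only the two top rows.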
Fig. 1 The standard teacher–student training pipeline for facial expression recognition on severely occluded faces. The teacher CNN takes as input non-occluded (fully visible) faces, having access to privileged information. The student CNN takes as input only occluded (lower-half-visible) faces, but learns useful information from the teacher CNN model. The loss functions shown in the figure are the terms of the loss defined in Equation (1). Best viewed in color
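As a rough illustration of a two-term teacher–student loss of the kind this caption refers to, the sketch below follows the widely used formulation that mixes a hard-label cross-entropy with a cross-entropy against temperature-softened teacher outputs. Equation (1) is not reproduced in this record, so the exact form, the mixing weight `lam` and the `temperature` value are assumptions, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def distillation_loss(student_logits, teacher_logits, label,
                      lam=0.5, temperature=2.0):
    """Weighted sum of a hard-label term and a soft teacher term.

    The student sees the occluded face and the teacher the fully
    visible one; `lam` and `temperature` are illustrative defaults.
    """
    one_hot = [1.0 if i == label else 0.0
               for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1.0 - lam) * hard + lam * soft
```

With `lam=0` the loss reduces to ordinary supervised training of the student, which is a convenient sanity check when tuning the mixing weight.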
Fig. 2 The teacher–student training based on triplet loss for facial expression recognition on severely occluded faces. During training, we modify the weights of the student network such that the anchor-to-positive distance becomes smaller than the anchor-to-negative distance. Best viewed in color
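The triplet objective behind this figure can be sketched in its generic hinge form. The sketch assumes Euclidean distance and a margin of 1.0, and it assumes the anchor is a student embedding while the positive and negative are reference embeddings of the same and of a different class; none of these details are given in this record, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive than
    to the negative by at least `margin` (an illustrative default).
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)


# Anchor already much closer to the positive: nothing left to optimize.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Anchor closer to the negative: the loss pushes the embeddings apart.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In the first call the distances are 1 and 5, so the loss is zero; in the second they are swapped and the loss equals 5 - 1 + 1 = 5.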
Accuracy rates of various models on AffectNet [51] and FER+ [5], for fully visible, lower-half-visible and upper-half-visible faces
| Model | Train faces | Test faces | AffectNet accuracy (%) | FER+ accuracy (%) | FER+ weighted accuracy (%) |
|---|---|---|---|---|---|
| VGG-13 [ | | | – | 84.99 | – |
| DACL [ | | | 65.20 | – | – |
| CNNs+BOVW+LC [ | | | 59.58 | 87.76 | – |
| VGG-12 [ | | | 58.50 | – | – |
| Bag of visual words [ | | | 48.30 | 80.65 | – |
| MT-VGG [ | | | 54.00 | – | – |
| AlexNet [ | | | 58.00 | – | – |
| Res-50IBN [ | | | 63.11 | 89.51 | – |
| MBCC-CNN [ | | | – | 88.10 | – |
| ESR-9 [ | | | 59.30 | 87.15 | – |
| SCN [ | | | 60.23 | 89.35 | – |
| PSR [ | | | 60.68 | 89.75 | – |
| Teacher VGG-f | | | 57.37 | 85.05 | 59.71 |
| Teacher VGG-face | | | 59.03 | 84.79 | 66.15 |
| Teacher ResNet-50 | | | 56.07 | 85.91 | 65.67 |
| Teacher VGG-f | | | 41.58 | 70.00 | 43.24 |
| Teacher VGG-face | | | 37.70 | 68.89 | 39.69 |
| Teacher ResNet-50 | | | 40.50 | 70.89 | 44.56 |
| Teacher VGG-f | | | 26.85 | 40.07 | 32.82 |
| Teacher VGG-face | | | 31.23 | 48.29 | 37.36 |
| Teacher ResNet-50 | | | 24.12 | 44.21 | 30.01 |
| VGG-f [ | | | 47.58 | 78.23 | 50.52 |
| VGG-face [ | | | 49.23 | 82.28 | 58.69 |
| ResNet-50 | | | 45.90 | 81.79 | 60.57 |
| VGG-f [ | | | 42.45 | 66.18 | 44.66 |
| VGG-face [ | | | 43.18 | 70.19 | 48.83 |
| ResNet-50 | | | 43.37 | 72.26 | 54.62 |
| VGG-f (standard T-S) | | | | | |
| VGG-face (standard T-S) | | | 49.75 | 82.37 | 59.46 |
| ResNet-50 (standard T-S) | | | | | |
| VGG-f (triplet loss T-S) | | | 48.13 | | |
| VGG-face (triplet loss T-S) | | | 49.71 | 82.57 | 59.12 |
| ResNet-50 (triplet loss T-S) | | | 46.17 | 81.28 | 60.93 |
| VGG-f (triplet loss + standard T-S) | | | | | |
| VGG-face (triplet loss + standard T-S) | | | | | |
| ResNet-50 (triplet loss + standard T-S) | | | | | |
The VGG-f, VGG-face and ResNet-50 models based on our teacher–student (T–S) training strategies are compared with state-of-the-art methods [5, 12, 18, 25, 35, 42, 51, 64–66, 70, 71] tested on fully visible faces and with methods [16, 29] designed for the VR setting (tested on occluded faces). The test results of our student networks that are significantly better than the stronger baseline [16], according to a paired McNemar’s test [9] at a significance level of 0.05, are marked in the table
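A paired McNemar's test of this kind compares two classifiers on the same test samples using only the samples on which they disagree. The sketch below uses the chi-square approximation with continuity correction, which is an assumption, since the record does not say which variant of the test was applied; the function name `mcnemar_test` is hypothetical.

```python
import math


def mcnemar_test(y_true, pred_baseline, pred_model):
    """Return (statistic, p_value) for a paired McNemar's test.

    b counts samples only the baseline gets right, c counts samples
    only the model gets right. The statistic is compared against a
    chi-square distribution with one degree of freedom.
    """
    b = sum(1 for t, p, q in zip(y_true, pred_baseline, pred_model)
            if p == t and q != t)
    c = sum(1 for t, p, q in zip(y_true, pred_baseline, pred_model)
            if p != t and q == t)
    if b + c == 0:
        return 0.0, 1.0  # the two classifiers never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Toy example: the model fixes 30 baseline errors and breaks nothing.
labels = [1] * 40
baseline = [0] * 30 + [1] * 10
model = [1] * 40
stat, p = mcnemar_test(labels, baseline, model)
significant = p < 0.05
```

Because only discordant pairs enter the statistic, two models with similar overall accuracy can still differ significantly if their errors fall on different samples.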
Fig. 3 Fully visible images on top row, lower-half-visible faces on second row, Grad-CAM [61] explanation masks on third row and lower-half-visible faces with superimposed Grad-CAM masks on bottom row. The predicted labels provided by the distilled VGG-face (left-hand side) or VGG-f (right-hand side) models are also provided at the bottom. The first two examples from each side are selected from AffectNet [51] and FER+ [5], respectively. The third example from each side is a person wearing an actual VR headset. Best viewed in color
Accuracy rates for gender prediction on UTKFace [83], for fully visible, lower-half-visible and upper-half-visible faces
| Method | Train faces | Test faces | Accuracy (%) |
|---|---|---|---|
| ResNet-50+PyNADA [ | | | 90.80 |
| Teacher VGG-f | | | 92.78 |
| Teacher VGG-face | | | 92.20 |
| Teacher ResNet-50 | | | 90.88 |
| Teacher VGG-f | | | 78.13 |
| Teacher VGG-face | | | 73.05 |
| Teacher ResNet-50 | | | 72.69 |
| Teacher VGG-f | | | 85.69 |
| Teacher VGG-face | | | 88.18 |
| Teacher ResNet-50 | | | 83.20 |
| VGG-f | | | 88.70 |
| VGG-face | | | 90.62 |
| ResNet-50 | | | 86.47 |
| VGG-f | | | 88.92 |
| VGG-face | | | 88.26 |
| ResNet-50 | | | 88.75 |
| VGG-f (standard T-S) | | | 89.13 |
| VGG-face (standard T-S) | | | 88.45 |
| ResNet-50 (standard T-S) | | | |
| VGG-f (triplet loss T-S) | | | |
| VGG-face (triplet loss T-S) | | | 88.31 |
| ResNet-50 (triplet loss T-S) | | | 89.19 |
| VGG-f (triplet loss + standard T-S) | | | |
| VGG-face (triplet loss + standard T-S) | | | |
| ResNet-50 (triplet loss + standard T-S) | | | |
A state-of-the-art model [19] is included as reference. The results of distilled models that are significantly better than the student trained on upper-half-visible faces, according to a paired McNemar’s test [9] at a significance level of 0.05, are marked in the table
Fig. 4 Fully visible images on top row, upper-half-visible faces on second row, Grad-CAM [61] explanation masks on third row and upper-half-visible faces with superimposed Grad-CAM masks on bottom row. The predicted gender provided by the distilled ResNet-50 model is shown at the bottom. The first four examples are selected from the UTKFace [83] data set. The last two examples are people wearing surgical masks. Best viewed in color
Mean absolute error (MAE) values for age estimation on UTKFace [83], for fully visible, lower-half-visible and upper-half-visible faces
| Method | Train faces | Test faces | MAE |
|---|---|---|---|
| ResNet-50+PyNADA [ | | | 5.79 |
| Teacher VGG-f | | | 5.63 |
| Teacher VGG-face | | | 5.11 |
| Teacher ResNet-50 | | | 5.27 |
| Teacher VGG-f | | | 11.16 |
| Teacher VGG-face | | | 13.08 |
| Teacher ResNet-50 | | | 14.23 |
| Teacher VGG-f | | | 9.60 |
| Teacher VGG-face | | | 10.30 |
| Teacher ResNet-50 | | | 11.92 |
| VGG-f | | | 6.80 |
| VGG-face | | | 6.15 |
| ResNet-50 | | | 6.66 |
| VGG-f | | | 6.36 |
| VGG-face | | | 5.53 |
| ResNet-50 | | | 6.44 |
| VGG-f (standard T-S) | | | 6.34 |
| VGG-face (standard T-S) | | | 5.40 |
| ResNet-50 (standard T-S) | | | 6.35 |
| VGG-f (triplet loss T-S) | | | 6.34 |
| VGG-face (triplet loss T-S) | | | 5.42 |
| ResNet-50 (triplet loss T-S) | | | 6.34 |
| VGG-f (triplet loss + standard T-S) | | | 6.22 |
| VGG-face (triplet loss + standard T-S) | | | 5.40 |
| ResNet-50 (triplet loss + standard T-S) | | | 6.33 |
A state-of-the-art model [19] is included as reference
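For readers unfamiliar with the metric in the table above, the mean absolute error is the average absolute difference between predicted and ground-truth ages, so lower is better and the unit is years. A minimal sketch follows; the function name and the toy ages are hypothetical and are not taken from UTKFace.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between ground-truth and predicted ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Toy ages for illustration only: errors of 5, 0 and 6 years.
mae = mean_absolute_error([20, 30, 40], [25, 30, 34])
```

For the toy values the errors sum to 11 years over three samples, giving an MAE of about 3.67 years.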
Fig. 5 Fully visible images on top row, upper-half-visible faces on second row, Grad-CAM [61] explanation masks on third row and upper-half-visible faces with superimposed Grad-CAM masks on bottom row. The estimated age provided by the distilled ResNet-50 model is the first number shown at the bottom. The second number is the ground-truth age. The first four examples are selected from the UTKFace [83] data set, while the last two examples are people wearing masks. Best viewed in color
The accuracy rates of SVMs trained on embeddings extracted from students based on standard teacher–student (TS) or triplet loss (TL) strategies
| Network | Method | FER+ (%) | AffectNet (%) | UTKFace (%) |
|---|---|---|---|---|
| VGG-f | TS | 80.17 | 48.75 | 89.13 |
| | TL | 80.05 | 48.13 | 89.55 |
| | TS+SVM | 80.39 | 48.52 | 89.04 |
| | TL+SVM | 79.06 | 47.01 | 89.70 |
| | TS+TL+SVM | 81.09 | 48.70 | 89.82 |
| VGG-face | TS | 82.37 | 49.75 | 88.45 |
| | TL | 82.57 | 49.71 | 88.31 |
| | TS+SVM | 82.34 | 48.89 | 90.35 |
| | TL+SVM | 82.37 | 49.90 | 90.27 |
| | TS+TL+SVM | 82.75 | 50.09 | 90.35 |
These models are compared with SVMs trained on concatenated embeddings as well as the students providing the embeddings. Results are reported for two tasks: facial expression recognition (on FER+ and AffectNet) and gender prediction (on UTKFace)