Sangwon Kim, Jaeyeal Nam, Byoung Chul Ko.
Abstract
In recent image classification approaches, the vision transformer (ViT) has outperformed convolutional neural networks. A ViT achieves high classification accuracy on natural images because it preserves global image features well. However, a ViT remains limited in facial expression recognition (FER), which requires detecting subtle changes in expression, because it can lose the local features of the image. Therefore, in this paper, we propose Squeeze ViT, a method that lowers computational complexity by reducing the number of feature dimensions while improving FER performance by combining global and local features. To measure the FER performance of Squeeze ViT, experiments were conducted on lab-controlled FER datasets and a wild FER dataset. In comparative experiments with previous state-of-the-art approaches, the proposed method achieved excellent performance on both types of datasets.
Keywords: facial expression recognition; landmark token; squeeze module; vision transformer; visual token
Year: 2022 PMID: 35632135 PMCID: PMC9147983 DOI: 10.3390/s22103729
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Illustration of the proposed Squeeze ViT. (a) An input image is fed to the CNN to yield global and local features, which are transformed into visual tokens and landmark tokens, respectively. (b) The squeeze module adjusts the feature dimensions while maintaining robust discriminative elements. (c) The concatenated tokens are fed to the Squeeze ViT, which stacks multiple encoders and squeeze modules to reduce the computational complexity.
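
The token flow in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the squeeze step behaves like a linear projection to a smaller feature dimension (the caption does not specify its form), and the token counts, dimensions, random weights, and the helper names `make_tokens` and `squeeze` are all hypothetical.

```python
import random


def make_tokens(count, dim, rng):
    """Stand-in for CNN-derived tokens: `count` vectors of length `dim`."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


def squeeze(tokens, weight):
    """Project every token from len(weight) dims down to len(weight[0]) dims.

    Assumption: the squeeze module acts like a linear map on the feature axis.
    """
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


rng = random.Random(0)

# Illustrative sizes only: 16 visual tokens and 8 landmark tokens, 64 dims each.
visual_tokens = make_tokens(count=16, dim=64, rng=rng)    # global features
landmark_tokens = make_tokens(count=8, dim=64, rng=rng)   # local features

# Concatenate along the token axis so the encoder sees both kinds of token.
tokens = visual_tokens + landmark_tokens

# Hypothetical squeeze weights (64 -> 32 dims); a trained model would learn these.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(32)] for _ in range(64)]
squeezed = squeeze(tokens, weight)

print(len(tokens), len(tokens[0]))      # token count and width before squeezing
print(len(squeezed), len(squeezed[0]))  # token count is kept, width is halved
```

The number of tokens is unchanged by the projection; only the per-token feature width shrinks, which is the quantity the abstract ties to the reduction in computational complexity.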
Ablation study of tokens for each configuration on the RAF-DB dataset.
| Visual Token | Landmark Token | Accuracy (%) |
|---|---|---|
| ✔ | ✘ | 88.0 |
| ✔ | ✔ | 88.9 |
Comparison results on the numbers of parameters and computations between pure ViT and the proposed Squeeze ViT method on the CK+ dataset.
| Methods | Params (M) | FLOPs (G) | Accuracy (%) |
|---|---|---|---|
| ViT [ | 86.86 | 33.03 | 98.75 |
| Squeeze ViT | 11.96 | 1.84 | 99.54 |
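
To make the comparison in the table above concrete, the short script below derives reduction factors from the reported numbers. The values are copied from the table; the variable names are illustrative only.

```python
# Values copied from the table above (CK+ dataset).
vit = {"params_m": 86.86, "flops_g": 33.03, "acc": 98.75}
squeeze_vit = {"params_m": 11.96, "flops_g": 1.84, "acc": 99.54}

# How many times smaller Squeeze ViT is than the pure ViT.
param_ratio = vit["params_m"] / squeeze_vit["params_m"]
flop_ratio = vit["flops_g"] / squeeze_vit["flops_g"]

# The same comparison expressed as a percentage reduction.
param_cut = 100.0 * (1.0 - squeeze_vit["params_m"] / vit["params_m"])
flop_cut = 100.0 * (1.0 - squeeze_vit["flops_g"] / vit["flops_g"])

# Accuracy difference in percentage points.
acc_gain = squeeze_vit["acc"] - vit["acc"]

print(f"parameters: {param_ratio:.2f}x fewer ({param_cut:.1f}% reduction)")
print(f"FLOPs:      {flop_ratio:.2f}x fewer ({flop_cut:.1f}% reduction)")
print(f"accuracy:   {acc_gain:+.2f} percentage points")
```

By these figures, Squeeze ViT uses roughly 7.3 times fewer parameters and about 18 times fewer FLOPs than the pure ViT, while its reported CK+ accuracy is 0.79 percentage points higher.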
Performance comparison with state-of-the-art methods applied to the CK+, MMI, and RAF-DB datasets. Bold marks the best accuracy.
| Methods | CK+ (%) | MMI (%) | RAF-DB (%) |
|---|---|---|---|
| WRF [ | 92.6 | 76.7 | - |
| IPA2LT [ | 91.67 | 65.61 | 86.77 |
| FMPN [ | 98.06 | 82.74 | - |
| ALSG [ | 93.08 | 70.49 | 85.53 |
| FDRL [ | … | 85.23 | … |
| LNLAttenNet [ | 98.18 | 68.75 | 86.15 |
| ViT-SE * [ | 99.49 | - | 87.22 |
| DMUE [ | - | - | 88.76 |
| RUL [ | - | - | 88.98 |
| Squeeze ViT | **99.54** | … | 88.90 |
* ViT-SE refers to the test results from [23] under the same conditions as those used in our study.
Figure 2. Confusion matrices obtained by the proposed Squeeze ViT approach on three datasets. CK+ and MMI: results accumulated across all folds of the 10-fold cross-validation. RAF-DB: results on the RAF-DB validation set.
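
Accumulating confusion matrices across cross-validation folds, as described for CK+ and MMI, can be sketched as below. This is an assumed, generic procedure rather than the authors' evaluation code; the labels are toy data with three classes and two folds (the paper uses 10 folds), and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix


def accumulate(matrices):
    """Element-wise sum of per-fold confusion matrices."""
    size = len(matrices[0])
    total = [[0] * size for _ in range(size)]
    for matrix in matrices:
        for i in range(size):
            for j in range(size):
                total[i][j] += matrix[i][j]
    return total


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    samples = sum(sum(row) for row in matrix)
    return correct / samples


# Toy (true labels, predicted labels) pairs for two folds.
fold_results = [
    ([0, 1, 2, 2], [0, 1, 2, 1]),
    ([0, 0, 1, 2], [0, 0, 1, 2]),
]

per_fold = [confusion_matrix(t, p, num_classes=3) for t, p in fold_results]
accumulated = accumulate(per_fold)

print(accumulated)            # [[3, 0, 0], [0, 2, 0], [0, 1, 2]]
print(accuracy(accumulated))  # 0.875
```

Summing the matrices before computing accuracy weights every test sample equally, which differs slightly from averaging per-fold accuracies when fold sizes are unequal.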