Xue Li1,2,3, Chunhua Zhu1,2,3, Fei Zhou1,2,3.
Abstract
Facial expression recognition (FER) in the wild is challenging because of uncontrolled factors such as occlusion, illumination, and pose variation. Current methods perform well under controlled conditions, but two issues remain for in-the-wild FER: (i) long-range dependencies among expression features in the facial information space are insufficiently described, and (ii) subtle inter-class distinctions among multiple expressions are not finely refined. To address these issues, this paper presents an end-to-end FER model, the attention-modulated contextual spatial information network (ACSI-Net), which embeds coordinate attention (CA) modules into a contextual convolutional residual network (CoResNet). First, CoResNet is built by arranging contextual convolution (CoConv) blocks of different levels to integrate facial expression features with long-range dependencies, producing a holistic representation of the spatial information in a facial expression. Then, CA modules are inserted at different stages of CoResNet; at each stage, the subtle expression information acquired by the CoConv blocks is modulated by the corresponding CA module across channels and spatial locations before flowing into the next layer. Finally, to highlight expression-related facial regions, a CA module at the end of the network produces attention masks that are multiplied by the input feature maps to focus on salient regions. Unlike other models, ACSI-Net can explore intrinsic dependencies between features and yield a discriminative representation for facial expression classification. Extensive experiments on the AffectNet and RAF_DB datasets demonstrate its effectiveness and competitiveness compared with other FER methods.
Keywords: deep learning; facial expression recognition; features extraction; neural network; spatial information
Year: 2022 PMID: 35885106 PMCID: PMC9324190 DOI: 10.3390/e24070882
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. The overall structure of the proposed attention-modulated contextual spatial information network (ACSI-Net).
Figure 2. A CoConv block integrating kernels with different dilation ratios in the convolution layer.
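The idea behind a CoConv block, kernels with different dilation ratios applied to the same input, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single-channel input, the 3x3 kernel of ones, and the dilation ratios 1, 2, and 3 are assumptions chosen for illustration, and the function names are hypothetical.

```python
def dilated_conv2d(image, kernel, dilation):
    """Zero-padded, size-preserving 2D cross-correlation with a dilated kernel."""
    height, width = len(image), len(image[0])
    k = len(kernel)                      # square kernel with odd side length
    half = (k // 2) * dilation           # padding that keeps the output size
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + i * dilation - half
                    xx = x + j * dilation - half
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def coconv_block(image, kernels, dilations):
    """Apply one kernel per level and stack the results as output channels."""
    return [dilated_conv2d(image, k, d) for k, d in zip(kernels, dilations)]


def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# 7x7 impulse image: a single 1 at the centre.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
box = [[1.0] * 3 for _ in range(3)]      # assumed 3x3 kernel of ones
features = coconv_block(image, [box, box, box], dilations=[1, 2, 3])
```

With these assumed settings, the three levels cover 3x3, 5x5, and 7x7 neighbourhoods using the same nine weights each, which is how larger dilation ratios bring in wider context without adding parameters.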
Parameters of the CoConv blocks.
| Stage | Input Size | Level | CoConv |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
Figure 3. The structure of the coordinate attention module. The two pooling branches perform 1D horizontal global pooling and 1D vertical global pooling, respectively.
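The two directional pooling operations named in the caption can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the module from the paper: it handles one channel, uses average pooling, and derives the attention gates directly from the pooled values through a sigmoid, whereas a real CA module would insert learned convolutions between pooling and gating. The function names are hypothetical.

```python
import math


def coordinate_pooling(feature_map):
    """Return (row_means, col_means) for one H x W channel.

    row_means[h] averages row h across the width (one value per height index);
    col_means[w] averages column w across the height (one value per width index).
    """
    height, width = len(feature_map), len(feature_map[0])
    row_means = [sum(row) / width for row in feature_map]
    col_means = [sum(feature_map[h][w] for h in range(height)) / height
                 for w in range(width)]
    return row_means, col_means


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_attention(feature_map):
    """Simplified gating: no learned layers, gates come straight from the pools."""
    row_means, col_means = coordinate_pooling(feature_map)
    gate_h = [sigmoid(v) for v in row_means]   # one gate per row
    gate_w = [sigmoid(v) for v in col_means]   # one gate per column
    return [[feature_map[h][w] * gate_h[h] * gate_w[w]
             for w in range(len(gate_w))]
            for h in range(len(gate_h))]


channel = [[1.0, 3.0], [5.0, 7.0], [0.0, 0.0]]
row_means, col_means = coordinate_pooling(channel)
attended = coordinate_attention(channel)
```

Because each output position is scaled by a row gate and a column gate, the mask keeps positional information along both axes, unlike a single global pooling step that collapses the whole map to one value per channel.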
Details of experimental datasets, including categories of expressions, number of training and testing samples.
| Expression | AffectNet-7 Train | AffectNet-7 Test | RAF_DB Train | RAF_DB Test |
|---|---|---|---|---|
| Anger | 24,882 | 500 | 705 | 162 |
| Disgust | 3803 | 500 | 717 | 160 |
| Fear | 6378 | 500 | 281 | 74 |
| Happy | 134,415 | 500 | 4772 | 1185 |
| Sad | 25,459 | 500 | 1982 | 478 |
| Surprise | 14,090 | 500 | 1290 | 329 |
| Normal | 74,874 | 500 | 2524 | 680 |
| Total | 283,901 | 3500 | 12,271 | 3068 |
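The training counts in the table above are strongly imbalanced across classes. The short sketch below, which is not part of the paper's method, recomputes the totals and measures that imbalance; the helper names are hypothetical, and the counts are copied from the table.

```python
# Training-set counts copied from the table above.
affectnet7_train = {
    "Anger": 24882, "Disgust": 3803, "Fear": 6378, "Happy": 134415,
    "Sad": 25459, "Surprise": 14090, "Normal": 74874,
}
raf_db_train = {
    "Anger": 705, "Disgust": 717, "Fear": 281, "Happy": 4772,
    "Sad": 1982, "Surprise": 1290, "Normal": 2524,
}


def class_shares(counts):
    """Fraction of the training set that belongs to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())


for name, counts in [("AffectNet-7", affectnet7_train), ("RAF_DB", raf_db_train)]:
    shares = class_shares(counts)
    largest = max(shares, key=shares.get)
    print(f"{name}: total={sum(counts.values())}, "
          f"largest class={largest} ({shares[largest]:.1%}), "
          f"imbalance ratio={imbalance_ratio(counts):.1f}")
```

In both training sets "Happy" is the largest class, while the test split of AffectNet-7 is balanced at 500 images per class.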
The performance of CoResNet and ResNet, including the number of model parameters, FLOPs, test time of an image, and recognition accuracy (%).
| Model | Params | GFLOPs | Time/s | Accuracy (%), RAF-DB | Accuracy (%), AffectNet-7 |
|---|---|---|---|---|---|
| ResNet | 11.69 | 1.82 | 1.32 | 85.88 | 63.82 |
| CoResNet | 11.69 | 1.82 | 1.32 | 86.86 | 65.83 |
Figure 4. The distribution of deeply learned features under (a) “ResNet” and (b) “CoResNet” for samples from the AffectNet-7 (Row 1) and RAF_DB (Row 2) datasets. CoResNet learns more discriminative features, and the features it extracts tend to form several clusters in the feature space.
The performance of the CA module at different network locations, including the number of model parameters, FLOPs, test time of an image, and recognition accuracy (%).
| Model | Params | GFLOPs | Time/s | Accuracy (%), RAF-DB | Accuracy (%), AffectNet-7 |
|---|---|---|---|---|---|
| CoResNet | 11.69 | 1.82 | 1.32 | 86.29 | 64.38 |
| CoResNet_CA-a | 11.72 | 1.82 | 1.33 | 86.45 | 65.16 |
| CoResNet_CA-b | 11.75 | 1.82 | 1.35 | 86.52 | 65.60 |
| | 11.78 | 1.82 | 1.38 | | |
Figure 5. Attention visualization for different facial expressions on examples from the RAF_DB dataset under ACSI-Net. (I) and (II) denote the original facial images and the attention masks applied to them, respectively; (a–g) denote anger, disgust, fear, happiness, sadness, surprise, and neutral, respectively.
The recognition accuracy (%) of different models on RAF_DB and AffectNet-7.
| Method | Year | RAF_DB | AffectNet-7 |
|---|---|---|---|
| gACNN | 2018 | 85.07 | - |
| CPG | 2019 | - | 63.57 |
| Separate Loss | 2019 | 86.38 | - |
| MA-Net | 2021 | 86.34 | 64.54 |
| OAENet | 2021 | 86.50 | - |
| DACL | 2021 | | |
| HSNet | 2022 | 86.67 | - |