Chongwen Wang, Zicheng Wang.
Abstract
Facial action unit (AU) detection is an important task in affective computing and has attracted extensive attention in computer vision and artificial intelligence. Previous AU detection studies usually encode complex regional feature representations with manually defined facial landmarks and model the relationships among AUs via graph neural networks. Although some progress has been achieved, it is still cumbersome for existing methods to capture the exclusive and concurrent relationships among different combinations of facial AUs. To circumvent this issue, we propose a new progressive multi-scale vision transformer (PMVT) that captures the complex relationships among different AUs across a wide range of expressions in a data-driven fashion. PMVT is based on a multi-scale self-attention mechanism that can flexibly attend to a sequence of image patches to encode the critical cues for AUs. Compared with previous AU detection methods, the benefits of PMVT are 2-fold: (i) PMVT does not rely on manually defined facial landmarks to extract regional representations, and (ii) PMVT is capable of encoding facial regions with adaptive receptive fields, thus facilitating flexible representation of different AUs. Experimental results show that PMVT improves AU detection accuracy on the popular BP4D and DISFA datasets. Compared with other state-of-the-art AU detection methods, PMVT obtains consistent improvements. Visualization results show that PMVT automatically perceives the discriminative facial regions for robust AU detection.
Keywords: affective computing; cross-attention; facial action unit recognition; multi-scale transformer; self-attention
Year: 2022 PMID: 35095460 PMCID: PMC8790567 DOI: 10.3389/fnbot.2021.824592
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 2.650
Figure 1. Attention maps of some faces. Our proposed PMVT is capable of capturing the AU-specific facial regions for different identities with diverse facial expressions.
Figure 2. The main idea of the proposed progressive multi-scale vision transformer (PMVT). Given the encoded convolutional feature map X, PMVT uses L and S branch transformer encoders that each receive tokens with a different resolution as input. The two branches are fused adaptively via a cross-attention mechanism.
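To make the two-branch input concrete, the following is a minimal sketch of how one feature map could be split into token sequences at two resolutions. It is an illustration under stated assumptions, not the authors' implementation: the helper name `patchify`, the single-channel 8 x 8 toy map, the patch sizes 4 and 2, and the mapping of the coarser tokens to the L branch are all hypothetical choices.

```python
def patchify(feature_map, patch):
    """Split an H x W grid (a list of rows) into non-overlapping
    patch x patch blocks, each flattened row by row into one token."""
    height, width = len(feature_map), len(feature_map[0])
    if height % patch or width % patch:
        raise ValueError("patch size must divide both height and width")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                feature_map[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ]
            tokens.append(token)
    return tokens


# Toy single-channel "feature map" X (assumed 8 x 8 for illustration only).
X = [[float(row * 8 + col) for col in range(8)] for row in range(8)]

# Assumption: the L branch sees coarse tokens, the S branch fine tokens.
l_tokens = patchify(X, 4)  # fewer, larger tokens
s_tokens = patchify(X, 2)  # more, smaller tokens
```

In a real model each flattened patch would then be linearly projected to the branch's embedding size and a CLS token would be prepended before the transformer encoder; those steps are omitted here.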
Figure 3. The main idea of the cross-attention in PMVT. In this study, we show that PMVT utilizes the classification (CLS) token at the L branch as an agent to exchange semantic AU information among the patch tokens from the S branch. PMVT can also use the CLS token at S to absorb information among the tokens from the L branch.
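The CLS-as-agent idea in this caption can be sketched as single-head scaled dot-product attention in which the CLS token of one branch is the only query and the patch tokens of the other branch serve as keys and values. This is a simplified sketch, not the paper's exact layer: the function names are hypothetical, the query/key/value projections are taken to be the identity, both branches are assumed to share one embedding size, and the residual connection is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def cls_cross_attention(cls_token, patch_tokens):
    """Let one branch's CLS token attend to the other branch's patch tokens.

    Assumptions: identity Q/K/V projections, one head, equal embedding size.
    Returns the updated CLS token and the attention weights.
    """
    dim = len(cls_token)
    # Scaled dot product between the CLS query and every patch key.
    scores = [
        sum(q * k for q, k in zip(cls_token, patch)) / math.sqrt(dim)
        for patch in patch_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the patch values, one coordinate at a time.
    attended = [
        sum(weight * patch[i] for weight, patch in zip(weights, patch_tokens))
        for i in range(dim)
    ]
    # Residual connection (assumed): keep the original CLS content.
    updated = [c + a for c, a in zip(cls_token, attended)]
    return updated, weights


# Toy example: the L-branch CLS token queries three S-branch patch tokens.
cls_l = [1.0, 0.0]
s_patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
updated_cls, attention = cls_cross_attention(cls_l, s_patches)
```

Because only the CLS token acts as a query, the cost of this exchange grows linearly with the number of patch tokens instead of quadratically. The reverse direction described in the caption would call the same function with the S-branch CLS token and the L-branch patch tokens.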
Action unit (AU) detection performance of our proposed progressive multi-scale vision transformer (PMVT) and state-of-the-art methods on the BP4D dataset.
| Method | AU1 | AU2 | AU4 | AU6 | AU7 | AU10 | AU12 | AU14 | AU15 | AU17 | AU23 | AU24 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSVM (Fan et al., | 23.2 | 22.8 | 23.1 | 27.2 | 47.1 | 77.2 | 63.7 | 64.3 | 18.4 | 33.0 | 19.4 | 20.7 | 35.3 |
| DRML (Zhao et al., | 36.4 | 41.8 | 43.0 | 55.0 | 67.0 | 66.3 | 65.8 | 54.1 | 33.2 | 48.0 | 31.7 | 30.0 | 48.3 |
| EAC-Net (Li et al., | 39.0 | 35.2 | 48.6 | 76.1 | 72.9 | 81.9 | 86.2 | 58.8 | 37.5 | 59.1 | 35.9 | 35.8 | 55.9 |
| ROI (Li et al., | 36.2 | 31.6 | 43.4 | 77.1 | 73.7 | 85.0 | 87.0 | 62.6 | 45.7 | 58.0 | 38.3 | 37.4 | 56.4 |
| JAA-Net (Shao et al., | 47.2 | 44.0 | 54.9 | 77.5 | 74.6 | 84.0 | 86.9 | 61.9 | 43.6 | 60.3 | 42.7 | 41.9 | 60.0 |
| DSIN (Corneanu et al., | 51.7 | 40.4 | 56.0 | 76.1 | 73.5 | 79.9 | 85.4 | 62.7 | 37.3 | 62.9 | 38.8 | 41.6 | 58.9 |
| TCAE (Li et al., | 43.1 | 32.2 | 44.4 | 75.1 | 70.5 | 80.8 | 85.5 | 61.8 | 34.7 | 58.5 | 37.2 | 48.7 | 56.1 |
| TAE (Li et al., | 47.0 | 45.9 | 50.9 | 74.7 | 72.0 | 82.4 | 85.6 | 62.3 | 48.1 | 62.3 | 45.9 | 46.3 | 60.3 |
| SRERL (Li et al., | 46.9 | 45.3 | 55.6 | 77.1 | 78.4 | 83.5 |  | 63.9 | 52.2 |  | 47.1 | 53.3 | 62.9 |
| ARL (Shao et al., | 45.8 | 39.8 | 55.1 | 75.7 | 77.2 | 82.3 | 86.6 | 58.8 | 47.6 | 62.1 | 47.4 | 55.4 | 61.1 |
| FAUT (Jacob and Stenger, | 51.7 | 49.3 |  | 77.8 |  | 82.9 | 86.3 |  | 51.9 | 63.0 | 43.7 |  |  |
| SEV-Net (Yang et al., | 58.2 |  | 58.3 | 81.9 | 73.9 |  | 87.5 | 61.6 | 52.6 | 62.2 | 44.6 | 47.6 | 63.9 |
| PMVT |  | 43.0 | 59.3 |  | 73.6 | 82.6 | 86.1 | 57.6 |  | 60.2 |  | 50.6 | 62.9 |
The highest values are shown in bold.
Action unit detection performance of our proposed PMVT and state-of-the-art methods on the DISFA dataset.
| Method | AU1 | AU2 | AU4 | AU6 | AU9 | AU12 | AU25 | AU26 | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| DRML (Zhao et al., | 17.3 | 17.7 | 37.4 | 29.0 | 10.7 | 37.7 | 38.5 | 20.1 | 26.7 |
| EAC-Net (Li et al., | 41.5 | 26.4 | 66.4 | 50.7 |  |  | 88.9 | 15.6 | 48.5 |
| JAA-Net (Shao et al., | 43.7 | 46.2 | 56.0 | 41.4 | 44.7 | 69.6 | 88.3 | 58.4 | 56.0 |
| OFS-CNN (Han et al., | 43.7 | 40.0 | 67.2 | 59.0 | 49.7 | 75.8 | 72.4 | 54.8 | 51.4 |
| DSIN (Corneanu et al., | 42.4 | 39.0 |  | 28.6 | 46.8 | 70.8 | 90.4 | 42.2 | 53.6 |
| TCAE (Li et al., | 15.1 | 15.2 | 50.5 | 48.7 | 23.3 | 72.1 | 82.1 | 52.9 | 45.0 |
| TAE (Li et al., | 21.4 | 19.6 | 64.5 | 46.8 | 44.0 | 73.2 | 85.1 | 55.3 | 51.5 |
| SRERL (Li et al., | 45.7 | 47.8 | 59.6 | 47.1 | 45.6 | 73.5 | 84.3 | 43.6 | 55.9 |
| FAUT (Jacob and Stenger, | 46.1 | 48.6 | 72.8 |  | 50.0 | 72.1 | 90.8 | 55.4 |  |
| ARL (Shao et al., | 43.9 | 42.1 | 63.6 | 41.8 | 40.0 | 76.2 | 95.2 |  | 58.7 |
| SEV-Net (Yang et al., |  | 53.1 | 61.5 | 53.6 | 38.2 | 71.6 | 95.7 | 41.5 | 58.8 |
| PMVT | 50.0 |  | 63.2 | 55.6 | 40.0 | 72.2 |  | 56.3 | 60.9 |
The highest values are shown in bold.
Ablation studies on the BP4D and DISFA datasets.
| Setting | BP4D | DISFA |
|---|---|---|
| CL=1 | 60.7 | 56.3 |
| CL=2 | 62.9 | 60.9 |
| CL=3 | 59.5 | 55.8 |
| MS=1 | 62.9 | 60.9 |
| MS=2 | 59.8 | 58.1 |
| MS=3 | 55.0 | 51.1 |
Figure 4. Attention maps of some representative faces. Each row shows one subject with different facial expressions. The proposed PMVT focuses on the most salient parts of the face for AU detection. Deep red denotes high activation; best viewed in color and zoomed in.