Chongwen Wang, Zicheng Wang.
Abstract
Because collecting facial action unit (AU) data is cumbersome and expensive, AU datasets are generally much smaller than those in other computer vision fields, and AU detection models trained on so few AU images tend to overfit. Despite recent progress in AU detection, deployment of these models has been impeded by their limited generalization to unseen subjects and facial poses. In this paper, we propose to learn a discriminative facial AU representation in a self-supervised manner. Considering that facial AUs show temporal consistency and evolution across consecutive facial frames, we develop a self-supervised pseudo signal based on temporally predictive coding (TPC) to capture these temporal characteristics. To further learn per-frame discriminativeness between sibling facial frames, we incorporate frame-wise temporal contrastive learning into the self-supervised paradigm. The proposed TPC can be trained without AU annotations, which allows us to use a large number of unlabeled facial videos to learn AU representations that are robust to nuisances such as facial identity and pose. Unlike previous AU detection work, our method requires neither manual selection of key facial regions nor explicit modeling of AU relations. Experimental results show that TPC improves AU detection precision on several popular AU benchmark datasets compared with other self-supervised AU detection methods.
Keywords: contrastive learning; facial action unit recognition; representation learning; self-supervised learning; temporal predictive coding
Year: 2022 PMID: 35370591 PMCID: PMC8965886 DOI: 10.3389/fnbot.2022.851847
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 2.650
Figure 1. Main idea of the proposed self-supervised temporally predictive coding (TPC) for facial AU representation learning. Given a facial sequence with T faces, we use the first T1 faces as input and the remaining faces as targets for temporal prediction. In addition, we randomly sample triplets from each facial sequence to capture temporal consistency and frame-wise discriminativeness in a self-supervised manner. ψ takes the context representation c as input and recursively estimates the features of future frames. Best viewed in color and zoomed in.
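The data flow described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame features are plain lists of floats, `aggregate` is a stand-in that averages features (the paper uses a convolutional GRU, Figure 2), and `psi` is a hypothetical placeholder for the learned predictor ψ.

```python
def split_sequence(frames, t1):
    """Return (context_frames, target_frames): the first t1 frames and the rest."""
    if not 0 < t1 < len(frames):
        raise ValueError("t1 must satisfy 0 < t1 < len(frames)")
    return frames[:t1], frames[t1:]


def aggregate(features):
    """Stand-in context aggregator (assumption): element-wise mean of the features."""
    return [sum(column) / len(features) for column in zip(*features)]


def psi(context):
    """Hypothetical predictor for psi; an identity map here, a learned network in practice."""
    return list(context)


def predict_future(context_features, steps):
    """Recursively estimate `steps` future feature vectors."""
    history = [list(f) for f in context_features]
    predictions = []
    for _ in range(steps):
        c = aggregate(history)   # context representation c
        z_hat = psi(c)           # estimated feature of the next frame
        predictions.append(z_hat)
        history.append(z_hat)    # the estimate becomes input for the next step
    return predictions


# Toy sequence of T = 5 two-dimensional "features", with T1 = 3.
features = [[float(i), 0.0] for i in range(5)]
context, targets = split_sequence(features, 3)
predictions = predict_future(context, len(targets))
print(len(context), len(targets), len(predictions))
```

In training, each predicted feature would be compared against the encoded feature of the corresponding target frame; that comparison is omitted here because the caption does not specify it.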
Figure 2. Illustration of the convolutional gated recurrent unit (GRU).
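For reference, a commonly used formulation of the convolutional GRU update is given below. Whether the authors use exactly this variant is not stated here, so treat it as an assumption. In this notation, $x_t$ is the input feature map at step $t$, $h_t$ is the hidden state, $*$ denotes convolution, $\odot$ element-wise multiplication, $\sigma$ the sigmoid function, and the $W$ and $U$ terms are learned convolution kernels (biases omitted).

```latex
\begin{aligned}
z_t &= \sigma\bigl(W_z * x_t + U_z * h_{t-1}\bigr) \\
r_t &= \sigma\bigl(W_r * x_t + U_r * h_{t-1}\bigr) \\
\tilde{h}_t &= \tanh\bigl(W_h * x_t + U_h * (r_t \odot h_{t-1})\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The update gate $z_t$ controls how much of the previous state is overwritten, and the reset gate $r_t$ controls how much of it feeds the candidate state $\tilde{h}_t$. Replacing the matrix products of a standard GRU with convolutions keeps the spatial layout of the feature maps.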
Action unit (AU) detection accuracy of the proposed temporally predictive coding (TPC) and state-of-the-art approaches on BP4D dataset.
| Method | AU1 | AU2 | AU4 | AU6 | AU7 | AU10 | AU12 | AU14 | AU15 | AU17 | AU23 | AU24 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DRML (Zhao et al.) | 36.4 | 41.8 | 43.0 | 55.0 | 67.0 | 66.3 | 65.8 | 54.1 | 33.2 | 48.0 | 31.7 | 30.0 | 48.3 |
| EAC-Net (Li et al.) | 39.0 | 35.2 | 48.6 | 76.1 | 72.9 | 81.9 | 86.2 | 58.8 | 37.5 | 59.1 | 35.9 | 35.8 | 55.9 |
| DSIN (Corneanu et al.) | 51.7 | 40.4 | 56.0 | 76.1 | 73.5 | 79.9 | 85.4 | 62.7 | 37.3 | 62.9 | 38.8 | 41.6 | 58.9 |
| LP-Net (Niu et al.) | 43.4 | 38.0 | 54.2 | 77.1 | 76.7 | 83.8 | 87.2 | 63.3 | 45.3 | 60.5 | 48.1 | 54.2 | 61.0 |
| UGN (Song et al.) | 54.2 | 46.4 | 56.8 | 76.2 | 76.7 | 82.4 | 86.1 | 64.7 | 51.2 | 63.1 |  | 53.6 | 63.3 |
| SRERL (Li et al.) | 46.9 | 45.3 | 55.6 | 77.1 | 78.4 | 83.5 | 87.6 | 63.9 | 52.2 |  | 47.1 | 53.3 | 62.9 |
| FAUT (Jacob and Stenger) | 51.7 | 49.3 |  | 77.8 |  | 82.9 | 86.3 |  | 51.9 | 63.0 | 43.7 |  |  |
| SEV-Net (Yang et al.) |  |  | 58.3 |  | 73.9 |  | 87.5 | 61.6 |  | 62.2 | 44.6 | 47.6 | 63.9 |
| MAL (Li and Shan) | 47.9 | 49.5 | 52.1 | 77.6 | 77.8 | 82.8 |  | 66.4 | 49.7 | 59.7 | 45.2 | 48.5 | 62.2 |
| TCAE (Li et al.) | 43.1 | 32.2 | 44.4 |  | 70.5 | 80.8 | 85.5 | 61.8 | 34.7 | 58.5 | 37.2 |  | 56.1 |
| TAE (Li et al.) |  |  | 50.9 | 74.7 |  |  | 85.6 | 62.3 | 48.1 |  | 45.9 | 46.3 | 60.3 |
| TRL (Lu et al.) | 42.3 | 24.3 | 44.1 | 71.8 | 67.8 | 77.6 | 83.3 | 61.2 | 31.6 | 51.6 | 29.8 | 38.6 | 52.0 |
| TPC | 43.2 | 44.6 |  | 72.6 | 71.9 | 84.9 |  |  |  | 61.5 |  | 43.7 |  |
The best results among the supervised and the self-supervised methods are shown in bold.
Action unit detection accuracy of the proposed TPC and state-of-the-art approaches on the DISFA dataset.
| Method | AU1 | AU2 | AU4 | AU6 | AU9 | AU12 | AU25 | AU26 | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| DRML (Zhao et al.) | 17.3 | 17.7 | 37.4 | 29.0 | 10.7 | 37.7 | 38.5 | 20.1 | 26.7 |
| EAC-Net (Li et al.) | 41.5 | 26.4 | 66.4 | 50.7 |  |  | 88.9 | 15.6 | 48.5 |
| OFS-CNN (Han et al.) | 43.7 | 40.0 | 67.2 | 59.0 | 49.7 | 75.8 | 72.4 | 54.8 | 51.4 |
| DSIN (Corneanu et al.) | 42.4 | 39.0 |  | 28.6 | 46.8 | 70.8 | 90.4 | 42.2 | 53.6 |
| SRERL (Li et al.) | 45.7 | 47.8 | 59.6 | 47.1 | 45.6 | 73.5 | 84.3 | 43.6 | 55.9 |
| LP-Net (Niu et al.) | 29.9 | 24.7 | 72.7 | 46.8 | 49.6 | 72.9 | 93.8 | 65.0 | 56.9 |
| FAUT (Jacob and Stenger) | 46.1 | 48.6 | 72.8 |  | 50.0 | 72.1 | 90.8 | 55.4 |  |
| SEV-Net (Yang et al.) |  | 53.1 | 61.5 | 53.6 | 38.2 | 71.6 | 95.7 | 41.5 | 58.8 |
| UGN (Song et al.) | 43.3 | 48.1 | 63.4 | 49.5 | 48.2 | 72.9 | 90.8 | 59.0 | 60.0 |
| MAL (Li and Shan) |  |  |  | 47.4 |  |  |  | 52.6 |  |
| TCAE (Li et al.) | 15.1 | 15.2 | 50.5 | 48.7 | 23.3 | 72.1 | 82.1 | 52.9 | 45.0 |
| TAE (Li et al.) | 21.4 | 19.6 |  | 46.8 |  | 73.2 |  |  |  |
| TRL (Lu et al.) | 18.7 | 27.4 | 35.1 | 33.6 | 20.7 | 67.5 | 68.0 | 43.8 | 39.4 |
| TPC |  |  | 59.6 |  | 42.7 |  | 82.1 | 51.6 | 52.3 |
The best results among the supervised and the self-supervised methods are shown in bold.
Ablation studies on the BP4D and DISFA datasets.
| Setting | BP4D | DISFA |
|---|---|---|
|  | 58.7 | 49.8 |
|  | 57.9 | 50.8 |
| λ = 10.0 | 55.2 | 47.1 |
| λ = 1.0 | 59.3 | 48.6 |
| λ = 0.1 | 61.1 | 52.3 |
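The λ rows read most naturally as a weight that balances two loss terms. The sketch below assumes, without confirmation from the table itself, that λ scales a contrastive term added to a predictive-coding term. The function `combined_loss` and both argument names are hypothetical, and the raw loss values are chosen only for illustration.

```python
def combined_loss(pred_loss: float, contrast_loss: float, lam: float = 0.1) -> float:
    """Assumed objective: predictive term plus lam times the contrastive term."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return pred_loss + lam * contrast_loss


def contrastive_share(pred_loss: float, contrast_loss: float, lam: float) -> float:
    """Fraction of the combined loss contributed by the weighted contrastive term."""
    total = combined_loss(pred_loss, contrast_loss, lam)
    return lam * contrast_loss / total


# The three weights listed in the ablation table, with equal raw losses.
for lam in (10.0, 1.0, 0.1):
    share = contrastive_share(1.0, 1.0, lam)
    print(f"lambda={lam}: contrastive share = {share:.2f}")
```

Under this assumption, a large λ lets the contrastive term dominate the gradient, while λ = 0.1 keeps it as a light regularizer on the predictive objective.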