Xiaodong Liu, Huating Xu, Miao Wang.
Abstract
Video emotion recognition has attracted increasing attention. Most existing approaches rely on spatial features extracted from video frames, ignoring the contextual information in videos and the relationships among contexts, which limits their performance. In this study, we propose a sparse spatial-temporal emotion graph convolutional network-based video emotion recognition method (SE-GCN). For the spatial graph, the emotional relationship between any two emotion proposal regions is first calculated, and the sparse spatial graph is constructed according to these relationships. For the temporal graph, the emotional information contained in each emotion proposal region is first analyzed, and the sparse temporal graph is constructed from the emotion proposal regions with rich emotional cues. Then, emotional-relationship reasoning features are obtained by the spatial-temporal GCN. Finally, the features of the emotion proposal regions and the spatial-temporal relationship features are fused to recognize the video emotion. Extensive experiments are conducted on four challenging benchmark datasets, namely MHED, HEIV, VideoEmotion-8, and Ekman-6; the results demonstrate that the proposed method achieves state-of-the-art performance.
Year: 2022 PMID: 36211003 PMCID: PMC9534632 DOI: 10.1155/2022/3518879
Source DB: PubMed Journal: Comput Intell Neurosci
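To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of sparse emotion-graph reasoning: pairwise relationship scores between proposal regions are thresholded into a sparse adjacency, two graph-convolution layers propagate features over it, and the original region features are fused with the reasoning features for classification. The class name, the threshold `tau`, the layer sizes, and the fusion-by-averaging step are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of SE-GCN-style sparse graph reasoning (PyTorch).
# All names, dimensions, and the threshold tau are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseEmotionGCN(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=8, tau=0.5):
        super().__init__()
        self.tau = tau                                 # sparsity threshold (assumed)
        self.rel = nn.Linear(feat_dim * 2, 1)          # pairwise emotional-relationship score
        self.gc1 = nn.Linear(feat_dim, hidden_dim)     # first graph-convolution weights
        self.gc2 = nn.Linear(hidden_dim, hidden_dim)   # second graph-convolution weights
        self.cls = nn.Linear(feat_dim + hidden_dim, num_classes)

    def sparse_adjacency(self, x):
        # x: (N, D) features of N emotion proposal regions.
        n = x.size(0)
        pairs = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                           x.unsqueeze(0).expand(n, n, -1)], dim=-1)
        a = torch.sigmoid(self.rel(pairs)).squeeze(-1)  # (N, N) relationship scores
        a = a * (a > self.tau).float()                  # keep only strong edges -> sparse graph
        a = a + torch.eye(n, device=x.device)           # self-loops
        d = a.sum(-1, keepdim=True).clamp(min=1e-6)
        return a / d                                    # row-normalized adjacency

    def forward(self, x):
        # x: (N, D) proposal-region features: regions within one frame for the
        # spatial graph, or emotion-rich regions across frames for the temporal graph.
        a = self.sparse_adjacency(x)
        h = F.relu(self.gc1(a @ x))                     # GCN layer 1: A X W1
        h = F.relu(self.gc2(a @ h))                     # GCN layer 2
        # Fuse original region features with relation-reasoning features.
        fused = torch.cat([x.mean(0), h.mean(0)], dim=-1)
        return self.cls(fused)                          # video-level emotion logits

regions = torch.randn(8, 512)                           # e.g. 8 proposal regions
logits = SparseEmotionGCN()(regions)
```

Read with per-frame regions, the same module plays the role of the spatial graph; fed the emotion-rich regions gathered across frames, it plays the role of the temporal graph.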
Figure 1. The motivation of the emotional relationship graph.
Figure 2. The overall framework of the SE-GCN.
Exploration of spatial-temporal relationships.
| Method | MHED (%) | HEIV (%) |
|---|---|---|
| Base model | 55.60 | 46.17 |
| Sparse spatial graph | 65.65 | 54.07 |
| Sparse temporal graph | 64.95 | 53.33 |
| Spatiotemporal graph | 67.29 | 55.06 |
Exploration of the emotion features of proposal regions.
| Method | MHED (%) | HEIV (%) |
|---|---|---|
| Spatiotemporal graph | 67.29 | 55.06 |
| Spatiotemporal graph + emotion features of proposal regions | 67.99 | 55.55 |
Exploration of the number of frames (Top-1 accuracy, %).
| Dataset | 5 frames | 10 frames | 15 frames | 20 frames |
|---|---|---|---|---|
| MHED | 65.65 | 66.36 | 66.82 | 67.29 |
| HEIV | 53.33 | 54.07 | 54.56 | 55.06 |
Exploration of the number of proposal regions N (Top-1 accuracy, %).
| Dataset | N = 2 | N = 4 | N = 6 | N = 8 | N = 10 |
|---|---|---|---|---|---|
| MHED | 49.31 | 50.56 | 50.88 | 67.29 | 50.82 |
| HEIV | 53.83 | 54.07 | 54.57 | 55.06 | 54.81 |
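As a hedged companion to the two sweeps above, the helper below shows one way to uniformly sample T frames and keep the K most emotion-rich proposal regions per frame. The per-region emotion-confidence score and all names here are assumptions; the paper does not specify this code.

```python
# Hypothetical helpers mirroring the frame/region sweeps above (PyTorch).
import torch

def sample_frames(num_video_frames: int, t: int = 20) -> torch.Tensor:
    # Evenly spaced frame indices; t = 20 frames gave the best accuracy above.
    return torch.linspace(0, num_video_frames - 1, steps=t).long()

def top_k_regions(region_feats: torch.Tensor, emotion_scores: torch.Tensor,
                  k: int = 8) -> torch.Tensor:
    # region_feats: (R, D) proposal features for one frame.
    # emotion_scores: (R,) assumed per-region emotion confidence.
    # Keep the k most emotion-rich regions (k = 8 was best in the table above).
    idx = emotion_scores.topk(min(k, emotion_scores.numel())).indices
    return region_feats[idx]

frames = sample_frames(300, t=20)                              # e.g. a 300-frame clip
kept = top_k_regions(torch.randn(32, 512), torch.rand(32), k=8)
```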
Top-1 accuracy (%) compared with related works on the MHED and HEIV datasets.
| Method | MHED | HEIV |
|---|---|---|
| Vielzeuf et al. | 53.73 | 45.93 |
| Chen et al. | 55.60 | 46.17 |
| Attention clusters | 59.81 | 49.63 |
| HAMF | 63.08 | 52.34 |
| Our method | 65.89 | 53.09 |
Top-1 accuracy (%) compared with state-of-the-art methods on Ekman-6 and VideoEmotion-8.
| Method | Ekman-6 | VideoEmotion-8 |
|---|---|---|
| Emotion in context | 51.8 | 50.6 |
| Xu et al. | 50.4 | 46.7 |
| Kernelized feature | 54.4 | 49.7 |
| Concept selection | 54.40 | 50.82 |
| Ours | 56.23 | 52.5 |