| Literature DB >> 32872511 |
Changzeng Fu¹,², Chaoran Liu¹, Carlos Toshinori Ishi¹,³, Hiroshi Ishiguro¹,².
Abstract
Emotion recognition has been gaining attention in recent years due to its applications in artificial agents. To achieve good performance on this task, much research has been conducted on multi-modality emotion recognition models that leverage the different strengths of each modality. However, a research question remains: what exactly is the most appropriate way to fuse the information from different modalities? In this paper, we proposed audio sample augmentation and an emotion-oriented encoder-decoder to improve the performance of emotion recognition, and we discussed an inter-modality, decision-level fusion method based on a graph attention network (GAT). Compared to the baseline, our model improved the weighted average F1-score from 64.18% to 68.31% and the weighted average accuracy from 65.25% to 69.88%.
Entities:
Keywords: emotion recognition; graph attention network; multi-modality
Mesh:
Year: 2020 PMID: 32872511 PMCID: PMC7506856 DOI: 10.3390/s20174894
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
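The record's key technical point is decision-level fusion of the audio, visual, and text branches with a graph attention network. The sketch below only illustrates that idea and is not the authors' exact formulation: each modality's emotion embedding is treated as a node of a fully connected three-node graph, and a single-head graph attention layer (standard GAT scoring) re-weights the modalities before classification. All names and dimensions here (`ModalityGATLayer`, `d_in`, the 64-d embeddings, the six emotion classes taken from the result tables below) are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGATLayer(nn.Module):
    """Single-head graph attention over a fully connected modality graph.

    Illustrative only: node features are per-modality emotion embeddings, and
    attention coefficients follow the standard GAT formulation
    (Velickovic et al., 2018), not necessarily the paper's exact variant.
    """
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # shared node projection
        self.a = nn.Linear(2 * d_out, 1, bias=False)  # attention scoring vector

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, n_modalities, d_in) -> (batch, n_modalities, d_out)
        h = self.W(nodes)
        b, n, d = h.shape
        # Score every ordered node pair (i, j) on the fully connected graph.
        hi = h.unsqueeze(2).expand(b, n, n, d)        # node i repeated along j
        hj = h.unsqueeze(1).expand(b, n, n, d)        # node j repeated along i
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(e, dim=-1)              # attention over neighbours j
        return F.elu(torch.einsum("bij,bjd->bid", alpha, h))

# Hypothetical usage: fuse 64-d audio / visual / text emotion embeddings.
audio, visual, text = (torch.randn(4, 64) for _ in range(3))
fused = ModalityGATLayer(64, 64)(torch.stack([audio, visual, text], dim=1))
logits = nn.Linear(64, 6)(fused.mean(dim=1))          # 6 emotion classes, as in the tables below
```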
Notation details.
| Notation | Description | Value |
|---|---|---|
| | Mel scale | 384 |
| | length of Mel spectrogram | 256 |
| | number of audio features | 256 |
| | number of facial features | 64 |
| | number of hidden emotion features | 64 |
| | number of EED cells | 128 |
| | attention heads for GAT | 8 |
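The values above fix the tensor shapes used throughout the model. Below is a minimal shape check, where the constant names (`N_MEL`, `T_MEL`, etc.) are my own labels for the table entries rather than the paper's symbols:

```python
import torch

# Labels paraphrase the notation table; the original symbols are not reproduced here.
N_MEL = 384      # Mel scale (number of Mel bins)
T_MEL = 256      # length of the Mel spectrogram (frames)
D_AUDIO = 256    # number of audio features
D_FACE = 64      # number of facial features
D_EMO = 64       # number of hidden emotion features
N_EED = 128      # number of EED cells
N_HEADS = 8      # attention heads for the GAT

mel = torch.randn(1, N_MEL, T_MEL)   # one Mel-spectrogram input to the audio branch
assert D_AUDIO % N_HEADS == 0        # multi-head attention needs an even split per head
```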
Figure 1. Architecture of the proposed multi-model for emotion recognition: For the visual model, we used a 3D convolutional neural network (CNN) to capture affective spatio-temporal features and analyzed the emotions with an emotion encoder-decoder (EED). The audio model is composed of a bidirectional long short-term memory (BLSTM) layer, a multi-head attention layer, and an EED; it takes the Mel spectrogram as input to predict the emotion. The text model employs a SeMemNN to analyze emotions from semantics. Note that the multi-model predicts the emotion at time t by considering multi-modality data from preceding time steps up to t.
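The caption above outlines the audio branch: a BLSTM over the Mel spectrogram, a multi-head attention layer, and an EED. Below is a minimal PyTorch sketch of just the BLSTM-plus-attention front end using the notation-table dimensions; the final linear layer is a placeholder standing in for the EED, which the paper defines separately, so this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """BLSTM + multi-head self-attention front end (illustrative sketch only)."""
    def __init__(self, n_mel=384, d_audio=256, n_heads=8, n_classes=6):
        super().__init__()
        # Bidirectional LSTM over spectrogram frames; 2 * hidden size = d_audio.
        self.blstm = nn.LSTM(input_size=n_mel, hidden_size=d_audio // 2,
                             batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=d_audio, num_heads=n_heads,
                                          batch_first=True)
        # Placeholder for the paper's emotion encoder-decoder (EED).
        self.head = nn.Linear(d_audio, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mel, time) -> treat each time frame as a token.
        x = mel.transpose(1, 2)           # (batch, time, n_mel)
        h, _ = self.blstm(x)              # (batch, time, d_audio)
        h, _ = self.attn(h, h, h)         # self-attention over frames
        return self.head(h.mean(dim=1))   # pooled frame features -> class logits

logits = AudioBranch()(torch.randn(2, 384, 256))   # -> shape (2, 6)
```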
Comparison results of audio model. The bold numbers in the table are the highest accuracies.
| Methods | Happy | Sad | Neutral | Angry | Excited | Frustrated | Acc.(w) | F1(w) |
|---|---|---|---|---|---|---|---|---|
| AM | | 47.95 | 60.00 | 58.25 | | | | |
| AM | 40.38 | 57.75 | 47.16 | 59.17 | | 44.07 | 52.11 | 51.21 |
| AM | 35.90 | 65.87 | | | 56.94 | 46.77 | 54.04 | 53.34 |
| AM | 42.11 | 65.41 | 37.30 | 58.60 | 52.53 | 38.60 | 47.87 | 46.91 |
Figure 2. Confusion matrices for the proposed uni-models: Columns are ground truth, and rows are predictions.
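The orientation stated in the caption (columns are ground truth, rows are predictions) is the transpose of scikit-learn's default, where rows correspond to the true labels. A two-line sketch with toy, integer-encoded labels:

```python
from sklearn.metrics import confusion_matrix

y_true, y_pred = [0, 1, 2, 2], [0, 2, 2, 1]   # toy integer-encoded labels
cm = confusion_matrix(y_true, y_pred).T        # transpose so rows = predictions, columns = ground truth
```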
Comparison results of visual model. The bold numbers in the table are the highest accuracies.
| Methods | Happy | Sad | Neutral | Angry | Excited | Frustrated | Acc.(w) | F1(w) |
|---|---|---|---|---|---|---|---|---|
| VM | 37.11 | | | | 49.04 | | | |
| VM | 47.38 | 56.42 | 42.64 | 46.59 | | | 50.81 | 48.18 |
Comparison results of text models. The bold numbers in the table are the highest accuracies.
| Methods | Happy | Sad | Neutral | Angry | Excited | Frustrated | Acc.(w) | F1(w) |
|---|---|---|---|---|---|---|---|---|
| TM | | | | | | | | |
| TM | 34.43 | 44.29 | 30.76 | 31.90 | 55.13 | 38.79 | 39.62 | 36.13 |
Comparison results of multi-models. The bold numbers in the table are the highest accuracies.
| Methods | Happy | Sad | Neutral | Angry | Excited | Frustrated | Acc.(w) | F1(w) |
|---|---|---|---|---|---|---|---|---|
| DialogueRNN | 25.69 | 75.10 | 58.59 | 64.71 | | 61.15 | 63.40 | 62.75 |
| DialogueGCN | 40.62 | | 61.92 | 67.53 | 65.46 | 64.18 | 65.25 | 64.18 |
| MulM | 66.99 | 80.73 | | 65.88 | 74.34 | 57.63 | 67.26 | 66.74 |
| MulM | 75.19 | 64.07 | 68.27 | | | | | |
Figure 3. Confusion matrices for the proposed multi-model with and without the graph attention network (GAT): Columns are ground truth, and rows are predictions.