Jian Zhao, Wenhua Dong, Lijuan Shi, Wenqian Qiang, Zhejun Kuang, Dawei Xu, Tianbo An.
Abstract
With the wide application of social media, public opinion analysis in social networks can no longer be served by text alone, because public opinion information now spans multiple modalities, such as voice, text, and facial expressions. Multi-modal emotion analysis has therefore become the current focus of public opinion analysis, and emotion recognition from speech is a key factor limiting it. This paper first explores emotion feature retrieval for speech and then analyzes how to process sample-imbalanced data. By comparing different feature fusion methods for text and speech, a multi-modal feature fusion method for sample-imbalanced data is proposed to realize multi-modal emotion recognition. Experiments on two publicly available datasets (IEMOCAP and MELD) show that processing multi-modal data with this method yields good fine-grained emotion recognition results, laying a foundation for subsequent social public opinion analysis.
Keywords: emotion features; feature generation; fine-grained; model fusion; multi-modality
Year: 2022 PMID: 35898032 PMCID: PMC9331324 DOI: 10.3390/s22155528
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1 Feature Layer Fusion Scheme for Fewer Samples.
Figure 2 Decision Level Fusion Approach for Few Shot.
Figure 3 Multi-modal fine-grained emotion classification model based on sample disequilibrium.
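The two fusion schemes in Figures 1 and 2 can be sketched as follows. This is a minimal illustration, not the paper's actual models: the feature vectors, the 0.5 weighting, and the function names are hypothetical stand-ins. Feature-level fusion joins the modality features before a single classifier; decision-level fusion classifies each modality separately and combines the per-class probabilities afterwards.

```python
def feature_level_fusion(text_vec, audio_vec):
    """Feature-level fusion (cf. Figure 1): concatenate the text and
    audio feature vectors into one input for a single classifier."""
    return text_vec + audio_vec  # list concatenation

def decision_level_fusion(text_probs, audio_probs, w_text=0.5):
    """Decision-level fusion (cf. Figure 2): combine the per-class
    probabilities of two unimodal classifiers by weighted averaging."""
    return [w_text * t + (1 - w_text) * a
            for t, a in zip(text_probs, audio_probs)]

# Toy example with three emotion classes
fused_features = feature_level_fusion([0.1, 0.2], [0.3, 0.4])
fused_probs = decision_level_fusion([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
print(fused_features)  # the 2-dim vectors become one 4-dim vector
print(fused_probs)     # still sums to 1, dominated class stays first
```

With equal weights the decision-level rule is a plain average; in practice the weights would be tuned per modality on validation data.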
Data volume of the IEMOCAP and MELD datasets.
| Emotion Category | IEMOCAP Total Data Volume | MELD Total Data Volume |
|---|---|---|
| ang | 1103 | 1607 |
| hap/joy | 595 | 2308 |
| sad/sadness | 1041 | 1002 |
| fear | 1084 | 358 |
| surprise | 1849 | 1636 |
| neutral | 1708 | 6436 |
| disgust | — | 361 |
Figure 4 The representation of sample data volume after over-sampling in the MELD dataset (where 0–6 in the label represent the seven emotional labels: ‘anger’, ‘disgust’, ‘fear’, ‘joy’, ‘neutral’, ‘sadness’, ‘surprise’).
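The balancing step behind Figure 4 can be sketched as random over-sampling: minority classes are duplicated until every class reaches the size of the largest one (here, MELD's dominant ‘neutral’ class). This is a generic sketch, not the paper's exact procedure, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(v) for v in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy imbalance mirroring MELD's skew toward 'neutral'
xs, ys = random_oversample(['a', 'b', 'c', 'd'],
                           ['neutral', 'neutral', 'neutral', 'fear'])
print(Counter(ys))  # both classes now hold 3 samples
```

More elaborate schemes (e.g. SMOTE-style interpolation in feature space) follow the same contract: only the way the extra samples are generated changes.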
MELD dataset analysis result table.
| Methods | Model | Anger | Disgust | Fear | Joy | Neutral | Sadness | Surprise | acc | w_avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Text-CNN | text | 34.49 | 8.22 | 3.74 | 49.39 | 74.88 | 21.05 | 45.45 | — | 55.02 |
| HiGRU-sf | text | 21.16 | 0 | 2 | 47.01 | 91.56 | 0.48 | 40.93 | — | 58.58 |
| cMKL | text + audio | 39.50 | 16.10 | 3.75 | 51.39 | 72.73 | 23.95 | 46.25 | — | 55.51 |
| bcLSTM | text | 42.06 | 21.69 | 7.75 | 54.31 | 71.63 | 26.92 | 48.15 | — | 56.44 |
| | audio | 25.85 | 6.06 | 2.90 | 15.74 | 61.86 | 14.71 | 19.34 | — | 39.08 |
| | text + audio | 43.39 | 23.66 | 9.38 | 54.48 | 76.67 | 24.34 | 51.04 | — | 59.25 |
| DialogueRNN | text | 40.59 | 2.04 | 8.93 | 50.27 | 75.75 | 24.19 | 49.38 | — | 57.03 |
| | audio | 35.18 | 5.13 | 5.56 | 13.17 | 65.57 | 14.01 | 20.47 | — | 41.75 |
| | text + audio | 43.65 | 7.89 | 11.68 | 54.40 | 77.44 | 34.59 | 52.51 | — | 60.25 |
| LR | audio | 21.47 | 0.00 | 0.00 | 20.71 | 27.54 | 27.79 | 19.67 | 21.05 | 17.11 |
| | text | 84.71 | 93.71 | 88.96 | 74.28 | 60.21 | 90.50 | 67.69 | 80.13 | 79.76 |
| | text + audio | 79.14 | 93.14 | 88.01 | 73.96 | 62.41 | 90.20 | 71.74 | 79.94 | 79.39 |
| | text + audio (back) | 87.76 | 93.75 | 89.24 | 77.65 | 70.71 | 92.98 | 63.37 | 82.32 | 82.60 |
| MLP | audio | 23.64 | 25.69 | 25.78 | 21.32 | 22.22 | 27.08 | 23.52 | 24.07 | 24.04 |
| | text | 89.68 | 96.03 | 92.81 | 79.95 | 84.91 | 93.97 | 68.14 | 84.4 | 85.69 |
| | text + audio | 95.10 | 95.95 | 97.94 | 86.02 | 82.40 | 94.23 | 81.80 | 90.23 | 90.24 |
| | text + audio (back) | 93.08 | 95.70 | 93.22 | 86.64 | 93.33 | 96.88 | 70.23 | 87.51 | 88.94 |
| MNB | audio | 20.03 | 0 | 0 | 1 | 0 | 0 | 17.2 | 17.86 | 22.73 |
| | text | 81.40 | 90.69 | 82.08 | 64.91 | 60.77 | 91.19 | 63.68 | 75.62 | 75.85 |
| | text + audio | 84.00 | 93.10 | 82.33 | 61.22 | 63.97 | 93.46 | 63.31 | 75.34 | 76.56 |
| | text + audio (back) | 83.32 | 92.89 | 83.49 | 65.56 | 76.16 | 93.82 | 65.12 | 77.47 | 79.02 |
| SVC | audio | 20.80 | 46.43 | 28.39 | 21.30 | 0.00 | 29.16 | 21.10 | 21.71 | 23.79 |
| | text | 90.24 | 95.32 | 88.96 | 75.48 | 67.71 | 93.17 | 70.24 | 82.59 | 82.71 |
| | text + audio | 89.87 | 94.31 | 89.37 | 77.18 | 68.79 | 92.65 | 71.43 | 83.02 | 83.07 |
| Our model | audio | 84.91 | 97.95 | 97.91 | 82.12 | 87.96 | 90.69 | 86.42 | 88.88 | 89.08 |
| | text | 87.74 | 94.74 | 90.93 | 80.01 | 77.46 | 93.44 | 71.46 | 84.08 | 84.50 |
| | text + audio | 93.80 | 99.44 | 99.32 | 86.14 | 96.77 | 97.88 | 89.83 | 93.73 | 94.13 |
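The `w_avg` column of the results tables is most naturally read as a support-weighted mean of the per-class scores, so that classes with more test samples count for more (which is why `acc` and `w_avg` can differ on imbalanced data). A minimal sketch of that relation, with hypothetical scores and support counts:

```python
def weighted_average(per_class_scores, supports):
    """Support-weighted mean of per-class scores: each class's score
    is weighted by its number of test samples."""
    total = sum(supports)
    return sum(s * n for s, n in zip(per_class_scores, supports)) / total

# Hypothetical: two classes scoring 90 and 50, with 300 and 100 samples
print(weighted_average([90.0, 50.0], [300, 100]))  # 80.0, not the plain mean 70.0
```

With equal supports the weighted average reduces to the unweighted mean, which is why the two columns track each other closely on the over-sampled (balanced) splits.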
IEMOCAP dataset analysis result table.
| Method | Model | ang | hap | exc | sad | fru | neu | acc | w_avg |
|---|---|---|---|---|---|---|---|---|---|
| HiGRU | text | 64.12 | 39.86 | 62.21 | 78.37 | 60.37 | 62.50 | — | 62.52 |
| HiGRU-sf | text | 71.18 | 51.75 | 62.88 | 70.20 | 61.68 | 64.84 | — | 64.06 |
| LR | audio | 40.07 | 24.36 | 16.67 | 42.85 | 15.38 | 0 | 32.44 | 24.47 |
| | text | 72.71 | 69.36 | 69.89 | 61.78 | 49.03 | 45.37 | 64.01 | 62.94 |
| | text + audio | 73.18 | 73.66 | 73.35 | 68.77 | 51.87 | 51.65 | 67.82 | 66.71 |
| MLP | audio | 40.22 | 25.91 | 28.86 | 42.20 | 17.41 | 46.66 | 34.10 | 33.07 |
| | text | 80.32 | 88.61 | 77.19 | 60.53 | 55.37 | 50.60 | 71.43 | 71.37 |
| | text + audio | 77.66 | 89.64 | 84.92 | 82.12 | 58.41 | 54.33 | 77.94 | 76.48 |
| MNB | audio | 69.76 | 23.24 | 0 | 0 | 0 | 0 | 24.78 | 17.28 |
| | text | 73.58 | 64.17 | 65.35 | 57.35 | 49.64 | 46.24 | 61.47 | 60.59 |
| | text + audio | 70.97 | 60.27 | 69.16 | 67.54 | 50.64 | 55.87 | 63.05 | 62.79 |
| SVC | audio | 39.66 | 24.73 | 100 | 39.37 | 11.36 | 0 | 32.67 | 36.57 |
| | text | 75.29 | 75.33 | 69.02 | 68.60 | 50.00 | 46.69 | 67.62 | 66.08 |
| | text + audio | 73.22 | 83.18 | 77.02 | 68.89 | 50.85 | 49.61 | 70.51 | 69.07 |
| XGB | audio | 59.81 | 62.93 | 55.34 | 67.40 | 29.60 | 39.72 | 58.96 | 54.13 |
| | text | 70.16 | 42.78 | 71.30 | 59.61 | 47.66 | 45.80 | 51.46 | 55.91 |
| | text + audio | 55.20 | 58.89 | 73.67 | 63.31 | 47.39 | 66.40 | 60.16 | 60.68 |
| Menment | text + audio + video | 67.1 | 24.4 | 65.2 | 60.4 | 68.4 | 56.8 | — | 59.9 |
| | text + audio + video | 70.0 | 25.5 | 58.8 | 58.6 | 67.4 | 56.5 | — | 59.8 |
| | text + audio + video | 69.1 | 23.2 | 63.1 | 58.0 | 65.5 | 56.6 | — | 58.8 |
| | text + audio + video | 72.3 | 24.0 | 64.3 | 65.6 | 67.9 | 55.5 | — | 60.1 |
| CMU | text + audio + video | 67.6 | 25.7 | 69.9 | 66.5 | 71.7 | 53.9 | — | 61.9 |
| ICON | text + audio + video | 68.2 | 23.6 | 72.2 | 70.6 | 71.9 | 59.9 | — | 64.0 |
| Our model | audio | 57.97 | 71.67 | 59.89 | 66.30 | 38.55 | 41.76 | 61.62 | 57.88 |
| | text | 77.46 | 84.78 | 77.35 | 55.81 | 56.45 | 53.72 | 69.89 | 69.77 |
| | text + audio | 72.76 | 92.38 | 80.72 | 73.73 | 53.21 | 58.13 | 75.87 | 73.93 |
Figure 5 Results Heat Map of MELD Data Set Fusion Model.
Figure 6 Fine-grained Emotion Accuracy Rate of Different Methods of MELD Data Set (Text).
Figure 7 Fine-grained Emotion Accuracy Rate of Different Methods of MELD Data Set (Speech).
Figure 8 Fine-grained Emotion Accuracy Rate of Different Methods of MELD Data Set (Multimodalities).
Figure 9 Result Heat Map of Fusion Model of IEMOCAP Data Set (our model).
Figure 10 Speech Feature Heat Map of IEMOCAP Data Set (our model).
Figure 11 Fine-grained Emotion Accuracy Rate Analysis of Different IEMOCAP Methods (Text).
Figure 12 Fine-grained Emotion Accuracy Rate Analysis of Different IEMOCAP Methods (Audio).
Figure 13 Fine-grained Emotion Accuracy Rate Analysis of Different IEMOCAP Methods (Multimodalities).