Bagus Tris Atmaja, Akira Sasou.
Abstract
Understanding sentiment and emotion in speech is a challenging task in human multimodal language analysis. However, in certain cases, such as telephone calls, only audio data can be obtained. In this study, we independently evaluated sentiment analysis and emotion recognition from speech using recent self-supervised learning models, specifically universal speech representations with speaker-aware pre-training models. Three different sizes of universal models were evaluated on three sentiment tasks and an emotion task. The evaluation revealed that the best results were obtained for two-class sentiment analysis, based on both weighted and unweighted accuracy scores (81% and 73%). This binary classification with unimodal acoustic analysis also performed competitively with previous methods that used multimodal fusion. The models failed to make accurate predictions in the emotion recognition task and in the sentiment analysis tasks with higher numbers of classes. The unbalanced nature of the datasets may also have contributed to the performance degradation observed in the six-class emotion, three-class sentiment, and seven-class sentiment tasks.
Keywords: affective computing; sentiment analysis; sentiment analysis and emotion recognition; speech emotion recognition; universal speech representation
Year: 2022 PMID: 36080828 PMCID: PMC9460459 DOI: 10.3390/s22176369
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
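The models under evaluation are three sizes of UniSpeech-SAT (Base, Base+, Large) used as upstream speech encoders. As a rough sketch of how such universal speech representations are obtained, the snippet below loads a checkpoint from the HuggingFace Hub and mean-pools the frame-level features into an utterance vector; the checkpoint name, the pooling, and the placeholder waveform are illustrative assumptions, not the authors' exact pipeline.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatModel

# Assumed Hub checkpoint; Base+ and Large variants follow the same pattern
# (e.g., "microsoft/unispeech-sat-base-plus", "microsoft/unispeech-sat-large").
NAME = "microsoft/unispeech-sat-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(NAME)
model = UniSpeechSatModel.from_pretrained(NAME).eval()

# Placeholder input: one second of 16 kHz mono audio (all zeros).
waveform = torch.zeros(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (batch, frames, hidden_dim)

# Mean pooling over time yields one vector per utterance for a classifier head.
utterance_vec = frames.mean(dim=1)
```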
Distribution of samples for each sentiment and emotion label for different classes (c).
| Label | Sentiment (c = 2) | Sentiment (c = 3) | Sentiment (c = 7) | Emotion | Count |
|---|---|---|---|---|---|
| −3 | - | - | 821 | happiness | 14,567 |
| −2 | - | - | 2253 | sadness | 3782 |
| −1 | 6683 | 6683 | 3609 | anger | 2730 |
| 0 | - | 5100 | 5100 | surprise | 437 |
| 1 | 16,576 | 11,476 | 7576 | disgust | 1291 |
| 2 | - | - | 3225 | fear | 452 |
| 3 | - | - | 675 | - | - |
| Total | 23,259 | 23,259 | 23,259 | - | 23,259 |
Figure 1. Flow diagram of the data processing method from the dataset to each classification task.
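The counts above also reveal how the tasks are derived from CMU-MOSEI's [−3, 3] sentiment score: for the two-class task, 16,576 = 11,476 + 5,100, so the neutral utterances appear to be folded into the positive class. A minimal sketch of that mapping (function names are illustrative):

```python
def sentiment_7class(score: float) -> int:
    # CMU-MOSEI annotates sentiment on a continuous [-3, 3] scale;
    # rounding yields the seven discrete classes -3 ... 3.
    return int(round(score))

def sentiment_3class(score: float) -> int:
    # Collapse to negative (-1), neutral (0), positive (1).
    s = sentiment_7class(score)
    return -1 if s < 0 else (1 if s > 0 else 0)

def sentiment_2class(score: float) -> int:
    # Binary split; per the counts above, neutral (0) is merged
    # into the positive class (16,576 = 11,476 + 5,100).
    return -1 if sentiment_7class(score) < 0 else 1
```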
Hyperparameters used in the experiments.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate (LR) | 0.0002 |
| LR scheduler | Linear |
| Batch size | 2 |
| #Total steps | 10,000 |
| #Eval/save steps | 250 |
| #Warm up steps | 1000 |
| #Workers (CPU) | 10 |
| #GPU | 4 |
| #Transformer dim | 128 |
| #Head | 2 |
| #Encoding layers | 2 |
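These settings map directly onto standard PyTorch and transformers utilities. The sketch below is an assumed equivalent of that configuration; the linear classifier head and the skeleton loop are placeholders, not the authors' code:

```python
import torch
from transformers import get_linear_schedule_with_warmup

head = torch.nn.Linear(768, 2)  # placeholder downstream classifier

# Values mirror the hyperparameter table above.
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,     # #Warm up steps
    num_training_steps=10_000,  # #Total steps
)

for step in range(10_000):
    # ... forward pass and loss.backward() go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if (step + 1) % 250 == 0:
        pass  # evaluate and checkpoint (#Eval/save steps)
```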
Weighted and unweighted accuracies (WA and UA) for sentiment analysis and emotion recognition tasks using the MOSEI dataset; bold values indicate the highest scores.
| Task | WA | UA |
|---|---|---|
| UniSpeech-SAT Base | ||
| 2-class sentiment | 78.68 | 69.46 |
| 3-class sentiment | 61.33 | 53.75 |
| 6-class emotion | 64.33 | 22.01 |
| 7-class sentiment | 40.63 | 29.12 |
| UniSpeech-SAT Base+ | ||
| 2-class sentiment | 79.36 | 68.85 |
| 3-class sentiment | 63.06 | 55.14 |
| 6-class emotion | 64.52 | 22.15 |
| 7-class sentiment | 42.64 | 31.12 |
| UniSpeech-SAT Large | | |
| 2-class sentiment | **81.4** | **73** |
| 3-class sentiment | **65.3** | |
| 6-class emotion | **64.9** | |
| 7-class sentiment | **44.8** | |
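WA and UA here follow the usual conventions in speech emotion recognition: WA is plain accuracy, which majority classes dominate, while UA is the unweighted mean of per-class recalls, which is why UA drops to roughly 22% on the heavily imbalanced six-class emotion task. Assuming those definitions, both metrics are one-liners in scikit-learn:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def wa_ua(y_true, y_pred):
    # WA: overall accuracy, implicitly weighted by class frequency.
    wa = 100.0 * accuracy_score(y_true, y_pred)
    # UA: macro-averaged recall, so every class counts equally.
    ua = 100.0 * balanced_accuracy_score(y_true, y_pred)
    return wa, ua
```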
Figure 2. Normalized confusion matrix for two-class sentiment analysis.
Figure 3. Normalized confusion matrix for three-class sentiment analysis.
Figure 4. Normalized confusion matrix for six-class emotion recognition.
Figure 5. Normalized confusion matrix for seven-class sentiment analysis [−3, 3].
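Normalized confusion matrices like those in Figures 2-5 can be reproduced with scikit-learn; the sketch below assumes row normalization (each row of true labels sums to 1) and uses the three-class label set with placeholder predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = [-1, 0, 1]            # three-class sentiment task
y_true = [-1, -1, 0, 0, 1, 1]  # placeholder ground truth
y_pred = [-1, 0, 0, 1, 1, 1]   # placeholder predictions

# normalize="true" divides each row by the number of true samples per class.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.show()
```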
Comparison of accuracies (WA%) with the results of previous studies using the CMU-MOSEI dataset; A: audio, T: text, V: video; scores in italics denote UA.
| Method | Modality | 2-class | 3-class | 6-class | 7-class |
|---|---|---|---|---|---|
| RAVEN [ | A + T + V | 79.1 | - | - | 50.0 |
| MCTN [ | A + T + V | 79.8 | - | - | 49.6 |
| M.Rout [ | A + T + V | 81.7 | - | - | 51.6 |
| MulT [ | A + T + V | 82.5 | - | - | 51.8 |
| M | A + T + V | 82.5 | - | - | 51.9 |
| DCCA [ | A + T | 69.4 | - | - | - |
| MAT [ | A + T | 82.0 | - | - | - |
| MFCC | A | 71.7 | 54.0 | 62.9 | 34.5 |
| UniSpeech-SAT | A | 81.4 | 65.3 | 64.9 | 44.8 |