Fangfang Zhu-Zhou, Roberto Gil-Pita, Joaquín García-Gómez, Manuel Rosa-Zurera.
Abstract
Every human being experiences emotions daily, e.g., joy, sadness, fear, anger. These may be revealed through speech: the words we say are often accompanied by our emotional state. Several acoustic emotional databases are freely available for the Emotional Speech Recognition (ESR) task. Unfortunately, many of them were generated under non-real-world conditions, i.e., actors played the emotions, and the recordings were made under fictitious circumstances in which noise is non-existent. Another weakness in the design of emotion recognition systems is the scarcity of patterns in the available databases, which causes generalization problems and leads to overfitting. This paper examines how different elements of the recording environment affect system performance, using a simple logistic regression algorithm. Specifically, we conducted experiments simulating different scenarios with different levels of Gaussian white noise, real-world noise, and reverberation. The results show a performance deterioration in all scenarios, with the error probability increasing from 25.57% to 79.13% in the worst case. Additionally, a virtual enlargement method and a robust multi-scenario speech-based emotion recognition system are proposed. Our system's average error probability of 34.57% is comparable to the 31.55% of the best-case scenario. The findings support the prediction that simulated emotional speech databases do not offer sufficient closeness to real scenarios.
Keywords: affective computing; emotion recognition; speech emotions
Year: 2022 PMID: 35336515 PMCID: PMC8953251 DOI: 10.3390/s22062343
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Diagram of the creation of the virtually enlarged training set for one scenario using = 1 dB and = 40 dB and stepping factor = 8 dB.
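One way to picture the virtual enlargement in Figure 1 is as a loop that adds noise to each clean recording over a grid of signal-to-noise ratios (SNRs). The sketch below is an illustration only, not the authors' implementation. It assumes that the unnamed quantities in the caption are a minimum SNR of 1 dB, a maximum SNR of 40 dB, and an 8 dB step, and it covers only white Gaussian noise. The helper names `add_white_noise` and `enlarge` are hypothetical.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Return the signal plus white Gaussian noise scaled to the requested SNR (dB)."""
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)                  # average signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))   # noise power for this SNR
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

def enlarge(signals, snr_min=1, snr_max=40, step=8, seed=0):
    """Create one noisy copy of every signal per SNR level (assumed grid)."""
    rng = np.random.default_rng(seed)
    levels = list(range(snr_min, snr_max + 1, step))
    noisy = [(snr, add_white_noise(s, snr, rng)) for s in signals for snr in levels]
    return levels, noisy

# Toy "recording": one second of a 220 Hz tone sampled at 16 kHz.
clean = [np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)]
levels, noisy = enlarge(clean)
print(levels)
```

Under these assumed values the grid has five SNR levels, so 528 recordings would yield 2640 noisy patterns. The sketch does not attempt to reproduce every training-set size reported in the paper.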
Emotion distribution of the edited Berlin Database used.
| Emotion | Number of Audio Files |
|---|---|
| Anger | 126 |
| Boredom | 80 |
| Disgust | 46 |
| Anxiety/fear | 68 |
| Happiness | 69 |
| Sadness | 62 |
| Neutral | 77 |
| Total Number of Audio Files | 528 |
Figure 2. Diagram of the creation of the test sets for one scenario.
Extracted features for the experiments.
| Statistic | Feature | Index | Total Number of Features |
|---|---|---|---|
| Mean | MFCCs (13 coef.) | 1–13 | 40 |
| | | 14–26 | |
| | | 27–39 | |
| | Pitch | 40 | |
| Standard Deviation | MFCCs (13 coef.) | 41–53 | 40 |
| | | 54–66 | |
| | | 67–79 | |
| | Pitch | 80 | |
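The index layout in the table above can be read as follows: 40 frame-level features are summarized per utterance by their mean (indices 1–40) and their standard deviation (indices 41–80). A minimal sketch of that summarization step is shown below. It assumes the frame-level features have already been extracted into an array with 40 columns, uses the population standard deviation (an assumption, since the estimator is not stated here), and the function name `utterance_features` is hypothetical.

```python
import numpy as np

def utterance_features(frames):
    """Collapse frame-level features (n_frames x 40) into one 80-dim vector.

    Positions 1-40 hold the per-feature means and positions 41-80 hold the
    per-feature standard deviations, mirroring the table's index layout.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 2 or frames.shape[1] != 40:
        raise ValueError("expected an array of shape (n_frames, 40)")
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Random stand-in for real frame-level features (200 frames, 40 features).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 40))
vec = utterance_features(frames)
print(vec.shape)
```

In this layout, index 40 is the mean of the last column (pitch in the table) and index 80 is its standard deviation.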
First stage experiment results. Error probability (%) cross results across the different noise databases.
| Test Set (rows) / Classifier (columns) | Base 40 dB | WGNS 20 dB | RWNS 20 dB | Reverb. | RWGNS | RRWNS |
|---|---|---|---|---|---|---|
| Base 40 dB | | 72.80 | 32.80 | 77.84 | 78.75 | 68.79 |
| WGNS 20 dB | 63.71 | | 42.92 | 46.17 | 53.90 | 51.93 |
| RWNS 20 dB | 55.49 | 71.06 | | 74.36 | 79.96 | 69.13 |
| Reverb. | 64.62 | 66.86 | 56.70 | | 55.04 | 38.11 |
| RWGNS 20 dB | 78.90 | 72.27 | 56.55 | 62.73 | | 44.28 |
| RRWNS 20 dB | 79.13 | 68.45 | 59.77 | 51.97 | 50.38 | |
Figure 3. Second stage experiment results: test set error probabilities for three different models (White Gaussian Noise Scenario).
Figure 4. Second stage experiment results: test set error probabilities for three different models (Real-World Noise Scenario).
Figure 5. Second stage experiment results: test set error probabilities for three different models (Reverberated White Gaussian Noise Scenario).
Figure 6. Second stage experiment results: test set error probabilities for three different models (Reverberated Real-World Noise Scenario).
Second stage experiment results. Average error probability (%).
| Training Set | | Average Error Probability (%) | | | |
|---|---|---|---|---|---|
| 1 | 21,120 | 30.24 | 36.81 | 34.19 | 37.85 |
| 2 | 10,560 | 30.27 | 37.12 | 34.28 | 37.62 |
| 3 | 7920 | 30.54 | 37.32 | 34.20 | 38.40 |
| 4 | 5280 | 30.58 | 37.58 | 34.33 | 37.96 |
| 5 | 4224 | 30.88 | 38.53 | 34.15 | 39.06 |
| 6 | 3168 | 31.16 | 37.80 | 34.82 | 38.56 |
| 7 | 2640 | 31.75 | 38.57 | 34.17 | 39.30 |
| 8 | 2640 | 31.76 | 38.96 | 35.00 | 38.81 |
| 9 | 2112 | 32.47 | 39.52 | 34.84 | 39.66 |
| 10 | 2112 | 32.45 | 39.68 | 35.00 | 39.18 |
| 11 | 1584 | 33.11 | 38.56 | 35.24 | 39.08 |
| 12 | 1584 | 33.05 | 39.05 | 35.87 | 39.13 |
| 13 | 1584 | 33.87 | 39.80 | 36.56 | 39.22 |
| 14 | 1056 | 34.31 | 41.04 | 37.27 | 40.06 |
| 15 | 1056 | 35.08 | 41.47 | 37.62 | 39.86 |
| 16 | 1056 | 35.56 | 40.99 | 38.47 | 39.89 |
| 17 | 1056 | 36.37 | 41.83 | 39.72 | 40.43 |
| 18 | 1056 | 37.29 | 41.22 | 38.68 | 39.49 |
| 19 | 1056 | 38.72 | 41.03 | 39.85 | 40.08 |
| 20 | 1056 | 38.32 | 40.94 | 39.81 | 39.34 |
P-values for some model results from Table 4.
| Training Set | | | | |
|---|---|---|---|---|
| 2 | 0.461 | 0.187 | 0.401 | 0.751 |
| 3 | 0.199 | 0.070 | 0.492 | 0.053 |
| 4 | 0.170 | 0.013 | 0.347 | 0.375 |
| 5 | 0.038 | 0.000 | 0.548 | 0.000 |
| 6 | 0.005 | 0.002 | 0.034 | 0.018 |
Third stage experiment results. Error probability (%) cross results across the different noise databases (Emo-DB).
| Test Set (rows) / Classifier (columns) | Best Case Scenario | WGNS | RWNS | RWGNS | RRWNS | All Scenarios |
|---|---|---|---|---|---|---|
| Base 40 dB | 25.57 | 30.49 | 29.13 | 56.89 | 56.59 | 30.34 |
| WGNS 20 dB | 30.42 | 28.03 | 39.09 | 47.65 | 45.38 | 30.15 |
| RWNS 20 dB | 40.08 | 48.26 | 34.81 | 62.27 | 59.17 | 36.63 |
| Reverb. | 27.50 | 63.18 | 59.66 | 37.27 | 37.16 | 40.64 |
| RWGNS 20 dB | 30.49 | 64.28 | 63.67 | 29.89 | 38.14 | 33.71 |
| RRWNS 20 dB | 35.23 | 72.42 | 61.17 | 40.76 | 34.51 | 35.95 |
| Average | 31.55 | 51.11 | 47.92 | 45.79 | 45.16 | 34.57 |
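The abstract quotes average error probabilities of 31.55% for the best-case scenario and 34.57% for the proposed multi-scenario system. As a quick arithmetic check (added here for illustration, not taken from the paper), these figures can be reproduced as column means over the six test sets of the Emo-DB table above. The WGNS column is included as a control against its listed average of 51.11.

```python
from statistics import mean

# Per-test-set error probabilities (%) copied from the Emo-DB table,
# in row order: Base, WGNS, RWNS, Reverb., RWGNS, RRWNS.
best_case = [25.57, 30.42, 40.08, 27.50, 30.49, 35.23]
wgns = [30.49, 28.03, 48.26, 63.18, 64.28, 72.42]
all_scenarios = [30.34, 30.15, 36.63, 40.64, 33.71, 35.95]

# Column means rounded to two decimals, as in the table's "Average" row.
averages = {
    "best_case": round(mean(best_case), 2),
    "wgns": round(mean(wgns), 2),
    "all_scenarios": round(mean(all_scenarios), 2),
}
print(averages)
# {'best_case': 31.55, 'wgns': 51.11, 'all_scenarios': 34.57}
```

The gap between the multi-scenario classifier and the best case is therefore about 3 percentage points on average, while each single-scenario classifier averages above 45%.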
Third stage experiment results. Error probability (%) cross results across the different scenarios (CREMA-D).
| Test Set (rows) / Classifier (columns) | Best Case Scenario | WGNS | RWNS | RWGNS | RRWNS | All Scenarios |
|---|---|---|---|---|---|---|
| Base 40 dB | 49.67 | 50.71 | 52.11 | 74.12 | 72.36 | 51.97 |
| WGNS 20 dB | 49.36 | 50.49 | 54.19 | 74.32 | 71.88 | 52.33 |
| RWNS 20 dB | 54.11 | 57.58 | 53.54 | 74.45 | 72.59 | 53.82 |
| Reverb. | 49.97 | 66.09 | 61.30 | 51.68 | 53.27 | 54.94 |
| RWGNS 20 dB | 49.56 | 67.66 | 65.71 | 50.36 | 52.48 | 53.87 |
| RRWNS 20 dB | 54.27 | 71.42 | 64.93 | 55.90 | 53.55 | 56.07 |
| Average | | 60.66 | 58.63 | 63.47 | 62.69 | |
Figure 7. Normalized confusion matrix with the original database as test set and the proposed model “All scenarios = 3 dB” as training set on the Emo-DB corpus.
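For readers unfamiliar with the plot type in Figure 7, the sketch below shows one common way to build a normalized confusion matrix and the matching error probability from true and predicted labels. It assumes row normalization (each row sums to 1 over the predicted classes), which the caption does not specify. The labels are toy values, not data from the paper, and the function name `normalized_confusion` is hypothetical.

```python
import numpy as np

def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of
    class-i samples that were predicted as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for classes with no samples stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
cm = normalized_confusion(y_true, y_pred, n_classes=3)

# Error probability (%) is the share of misclassified samples.
error_probability = 100.0 * np.mean(np.array(y_true) != np.array(y_pred))
print(cm)
print(round(error_probability, 2))
```

With row normalization, the diagonal gives the per-emotion recognition rate, so off-diagonal mass in a row shows which emotions that class is confused with.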