Zhen-Tao Liu, Bao-Han Wu, Dan-Yun Li, Peng Xiao, Jun-Wei Mao.
Abstract
Speech emotion recognition often suffers from data imbalance and redundant features across application scenarios, and researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small-sample environment is proposed. A data imbalance processing method based on the selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, a feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which excludes redundant features that possess poor emotional representation. Experiments on three databases (CASIA, Emo-DB, and SAVEE) show that our method obtains average recognition accuracies of 90.28% (CASIA), 75.00% (SAVEE), and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition, which is superior to some state-of-the-art works.
Keywords: SISMOTE; data imbalance processing; feature selection; speech emotion recognition
Year: 2020 PMID: 32316473 PMCID: PMC7219047 DOI: 10.3390/s20082297
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
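The variance-analysis step named in the abstract can be illustrated with a one-way ANOVA F-score per feature. This is a minimal sketch under the assumption that "variance analysis" refers to an ANOVA-style test; the record does not give the authors' exact procedure, and the GBDT stage is not shown. The function names and the toy data are hypothetical.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across emotion classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return 0.0 if ss_between == 0 else float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(rows, labels):
    """Return feature indices sorted from highest to lowest F-score, plus the scores."""
    n_features = len(rows[0])
    scores = [anova_f_score([r[j] for r in rows], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (hypothetical): feature 0 separates the classes, feature 1 barely does.
rows = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.2], [2.9, 4.1]]
labels = ["anger", "anger", "anger", "sadness", "sadness", "sadness"]
order, scores = rank_features(rows, labels)
```

Features with a low F-score vary little between emotion classes relative to their within-class spread, so they are natural candidates for exclusion before a second, model-based ranking.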
Figure 1. Flowchart of the proposed speech emotion recognition.
Figure 2. Diagram of the traditional SMOTE algorithm.
Figure 3. Diagram of the SISMOTE algorithm.
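For orientation, the traditional SMOTE interpolation referenced in Figure 2 can be sketched as follows. This is an illustrative version of plain SMOTE only; the selection rule that distinguishes SISMOTE is not described in this record and is not implemented here. The function name, the parameter `k`, and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy minority class (hypothetical): the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region already occupied by that class.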
Comparative results in the initial samples and the samples after using SISMOTE on Emo-DB.
| Category | Precision (None) | Recall (None) | F1 (None) | Precision (SISMOTE) | Recall (SISMOTE) | F1 (SISMOTE) | Count |
|---|---|---|---|---|---|---|---|
| Anger | 0.76 | 0.95 | 0.84 | 0.76 | 0.95 | 0.84 | 37 |
| Boredom | 0.70 | 1.00 | 0.82 | 0.77 | 1.00 | 0.87 | 23 |
| Disgust | 0.92 | 0.75 | 0.83 | 1.00 | 0.75 | 0.86 | 16 |
| Anxiety | 0.64 | 0.69 | 0.67 | 0.75 | 0.69 | 0.72 | 13 |
| Happiness | 0.87 | 0.52 | 0.65 | 0.94 | 0.64 | 0.76 | 25 |
| Sadness | 0.86 | 0.83 | 0.84 | 0.84 | 0.91 | 0.87 | 23 |
| Neutral | 0.83 | 0.62 | 0.71 | 0.89 | 0.71 | 0.79 | 24 |
| Avg/Total | 0.80 | 0.78 | 0.78 | 0.84 | 0.83 | 0.82 | 161 |
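The F1 column in the table above is the harmonic mean of precision and recall. The short sketch below recomputes it for two rows of the "None" columns; because the table values are rounded to two decimals, results for other rows may differ in the last digit. The function name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the Emo-DB table, "None" columns.
anger_f1 = f1_score(0.76, 0.95)      # rounds to 0.84
happiness_f1 = f1_score(0.87, 0.52)  # rounds to 0.65
```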
Comparative results in the initial samples and the samples after using SISMOTE on SAVEE.
| Category | Precision (None) | Recall (None) | F1 (None) | Precision (SISMOTE) | Recall (SISMOTE) | F1 (SISMOTE) | Count |
|---|---|---|---|---|---|---|---|
| Anger | 0.50 | 0.65 | 0.56 | 0.57 | 0.71 | 0.63 | 17 |
| Disgust | 1.00 | 0.31 | 0.48 | 1.00 | 0.38 | 0.55 | 16 |
| Fear | 0.82 | 0.41 | 0.55 | 0.81 | 0.59 | 0.68 | 22 |
| Happiness | 0.50 | 0.50 | 0.50 | 0.71 | 0.56 | 0.63 | 18 |
| Neutral | 0.74 | 0.98 | 0.84 | 0.82 | 0.98 | 0.89 | 41 |
| Sadness | 0.69 | 0.75 | 0.72 | 0.60 | 0.75 | 0.67 | 12 |
| Surprise | 0.62 | 0.72 | 0.67 | 0.65 | 0.83 | 0.73 | 16 |
| Avg/Total | 0.70 | 0.67 | 0.65 | 0.76 | 0.73 | 0.72 | 144 |
Comparative recognition accuracy using different imbalance processing methods on Emo-DB and SAVEE.
| Database | None | Subsampling | Random Oversampling | SMOTE | ADASYN | Borderline-SMOTE | SISMOTE |
|---|---|---|---|---|---|---|---|
| Emo-DB | 78.26% | 74.53% | 81.75% | 81.99% | 82.09% | 82.15% | 82.61% |
| SAVEE | 66.67% | 57.64% | 70.14% | 72.22% | 72.50% | 72.53% | 72.92% |
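Two of the simpler baselines in the table above, subsampling and random oversampling, can be sketched as below. This is a generic illustration, not the experimental code behind the table; the function names, the seed, and the toy labels are assumptions.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels


def random_subsample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([y] * target)
    return out_samples, out_labels


# Toy imbalanced set (hypothetical): 6 neutral samples, 2 sadness samples.
samples = list(range(8))
labels = ["neutral"] * 6 + ["sadness"] * 2
over_counts = Counter(random_oversample(samples, labels)[1])
sub_counts = Counter(random_subsample(samples, labels)[1])
```

Subsampling discards data, which is one plausible reason it scores below the unprocessed baseline on both small databases in the table, whereas oversampling keeps all original samples.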
Recognition results using proposed method on CASIA.
| Category | Precision | Recall | F1 | Count |
|---|---|---|---|---|
| Anger | 0.94 | 0.92 | 0.93 | 296 |
| Fear | 0.90 | 0.84 | 0.87 | 300 |
| Happy | 0.86 | 0.90 | 0.88 | 304 |
| Neutral | 0.96 | 0.95 | 0.96 | 288 |
| Sadness | 0.84 | 0.91 | 0.87 | 288 |
| Surprise | 0.93 | 0.91 | 0.92 | 324 |
| Avg/Total | 0.90 | 0.90 | 0.90 | 1800 |
Average recognition accuracy (%) of speaker-dependent emotion recognition on CASIA.
| Classifier | None | Pearson | RF | Our Method |
|---|---|---|---|---|
| NB | 44.22 | 44.39 | 50.89 | 50.39 |
| KNN | 62.00 | 59.83 | 76.83 | 74.72 |
| DT | 59.83 | 55.00 | 62.83 | 62.11 |
| LR | 81.17 | 78.00 | 82.44 | 84.39 |
| SVM | 88.39 | 84.78 | 88.61 | 90.28 |
Unweighted average recall (%) of speaker-dependent emotion recognition on CASIA.
| Classifier | None | Pearson | RF | Our Method |
|---|---|---|---|---|
| NB | 44.39 | 44.76 | 51.23 | 52.26 |
| KNN | 62.08 | 59.98 | 78.11 | 78.90 |
| DT | 59.13 | 53.93 | 61.15 | 63.27 |
| LR | 81.16 | 78.04 | 82.29 | 82.78 |
| SVM | 88.42 | 84.81 | 88.69 | 89.38 |
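The tables report both recognition accuracy and unweighted average recall (UAR). The sketch below shows how the two differ on imbalanced data: accuracy counts every sample equally, while UAR averages per-class recall so that each class counts equally. The toy labels are hypothetical and are not taken from any of the databases.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, ignoring how many samples each class has."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy example (hypothetical): 8 neutral samples, 2 sadness samples,
# with one sadness sample misclassified as neutral.
y_true = ["neutral"] * 8 + ["sadness"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "sadness"]
acc = accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
```

Here accuracy is 9/10 while UAR is (1.0 + 0.5) / 2 = 0.75, which is why UAR is the more sensitive measure when minority classes are recognized poorly.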
Recognition results using proposed method on Emo-DB.
| Category | Precision | Recall | F1 | Count |
|---|---|---|---|---|
| Anger | 0.87 | 0.87 | 0.87 | 30 |
| Boredom | 0.83 | 0.83 | 0.83 | 18 |
| Disgust | 0.78 | 0.90 | 0.84 | 20 |
| Anxiety | 1.00 | 0.92 | 0.96 | 12 |
| Happiness | 0.78 | 0.78 | 0.78 | 22 |
| Sadness | 1.00 | 0.81 | 0.90 | 16 |
| Neutral | 0.86 | 0.90 | 0.88 | 20 |
| Avg/Total | 0.80 | 0.86 | 0.86 | 134 |
Average recognition accuracy (%) of speaker-dependent speech emotion recognition on Emo-DB.
| Classifier | None | Pearson | RF | Our Method |
|---|---|---|---|---|
| NB | 74.63 | 74.86 | 73.13 | 79.34 |
| KNN | 64.18 | 67.16 | 69.40 | 72.39 |
| DT | 56.72 | 50.00 | 52.99 | 59.70 |
| LR | 77.61 | 80.06 | 81.34 | 76.12 |
| SVM | 82.09 | 81.34 | 83.58 | 85.82 |
Unweighted average recall (%) of speaker-dependent speech emotion recognition on Emo-DB.
| Classifier | None | Pearson | RF | Our Method |
|---|---|---|---|---|
| NB | 73.12 | 73.93 | 74.46 | 80.06 |
| KNN | 67.08 | 69.88 | 69.19 | 76.07 |
| DT | 57.90 | 43.77 | 56.27 | 57.76 |
| LR | 77.46 | 79.78 | 81.33 | 80.52 |
| SVM | 81.53 | 81.53 | 83.27 | 85.04 |
Recognition results using proposed method on SAVEE.
| Category | Precision | Recall | F1 | Count |
|---|---|---|---|---|
| Anger | 0.56 | 0.71 | 0.63 | 14 |
| Disgust | 0.73 | 0.57 | 0.64 | 14 |
| Fear | 0.88 | 0.74 | 0.80 | 19 |
| Happiness | 0.69 | 0.56 | 0.62 | 16 |
| Neutral | 0.86 | 0.94 | 0.90 | 33 |
| Sadness | 0.67 | 0.75 | 0.71 | 8 |
| Surprise | 0.71 | 0.75 | 0.73 | 16 |
| Avg/Total | 0.76 | 0.75 | 0.75 | 120 |
Average recognition accuracy (%) of speaker-dependent speech emotion recognition on SAVEE.
| Classifier | None | Pearson | RF | Our Method |
|---|---|---|---|---|
| NB | 55.83 | 52.50 | 60.00 | 60.83 |
| KNN | 50.00 | 50.00 | 54.17 | 59.17 |
| DT | 43.33 | 49.17 | 45.83 | 46.67 |
| LR | 64.17 | 63.33 | 65.00 | 66.67 |
| SVM | 69.17 | 67.50 | 74.17 | 75.00 |
Unweighted average recall (%) of speaker-dependent speech emotion recognition on SAVEE.
| Classifier | None | Pearson | RF | Our Method |
|---|---|---|---|---|
| NB | 48.78 | 48.78 | 57.39 | 61.11 |
| KNN | 54.64 | 54.64 | 57.29 | 58.32 |
| DT | 42.47 | 45.21 | 49.40 | 40.07 |
| LR | 60.10 | 60.10 | 62.31 | 65.14 |
| SVM | 63.87 | 63.87 | 67.43 | 74.85 |
Figure 4. Diagram of recognition accuracy for different sample sizes.
Comparison between our method and some related works.
| Database | Reference | Average Recognition Accuracy (%) |
|---|---|---|
| CASIA | [ ] | 85.08 |
| CASIA | [ ] | 89.6 |
| CASIA | Our method | 90.28 |
| SAVEE | [ ] | 61.25 |
| SAVEE | Our method | 75.00 |
| Emo-DB | [ ] | 80.5 |
| Emo-DB | Our method | 85.82 |