Cem Doğdu, Thomas Kessler, Dana Schneider, Maha Shadaydeh, Stefan R Schweinberger.
Abstract
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research, and human-computer interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine-learning (ML) algorithms for discrete emotion classification. However, there is no consensus on which low-level descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of ML algorithms across several feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech (EMO-DB): Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR), with 10-fold cross-validation, using four openSMILE feature sets (IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG performed better (reaching accuracies of 87.85%, 84.00% and 83.74%, respectively) than RF, DT, MLR and KNN (with minimum accuracies of 73.46%, 53.08%, 70.65% and 58.69%, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention or HCI.
Keywords: emotional speech database; feature set; machine learning; speech; vocal emotion recognition
Year: 2022 PMID: 36236658 PMCID: PMC9571288 DOI: 10.3390/s22197561
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Number of instances for each emotion class of the EMO-DB.
| Emotions | Number of Instances |
|---|---|
| Anger | 127 |
| Fear | 69 |
| Disgust | 46 |
| Sadness | 62 |
| Happiness | 71 |
| Boredom | 81 |
| Neutral | 79 |
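
The class counts in the table above are unbalanced, which is useful context when reading the accuracy figures. The following is a minimal sketch, not part of the original study, that totals the instances and derives two simple reference points: the accuracy of always predicting the most frequent class and the accuracy of uniform guessing. The variable names are illustrative assumptions.

```python
# Class counts copied from the EMO-DB table above.
counts = {
    "Anger": 127,
    "Fear": 69,
    "Disgust": 46,
    "Sadness": 62,
    "Happiness": 71,
    "Boredom": 81,
    "Neutral": 79,
}

total = sum(counts.values())

# Baseline 1: always predict the most frequent class.
majority_class = max(counts, key=counts.get)
majority_baseline = counts[majority_class] / total

# Baseline 2: guess uniformly among the seven classes.
uniform_chance = 1 / len(counts)

print(f"total instances:   {total}")
print(f"majority class:    {majority_class}")
print(f"majority baseline: {majority_baseline:.2%}")
print(f"uniform chance:    {uniform_chance:.2%}")
```

Any classifier accuracy reported for this database can be compared against these two reference values.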
Low-Level-Descriptors and Functionals of Each Feature Set. (*) indicates features of emobase and IS-09 distinct from each other. (**) indicates the features of eGeMAPS in addition to GeMAPS features.
| Feature Sets | Low-Level-Descriptors | Functionals |
|---|---|---|
| emobase and IS-09 (Common Features) | Mean, Standard Deviation, | |
| emobase | * Intensity, Loudness, | * 3 Inter-Quartile Ranges, |
| IS-09 | * (RMS) Energy | - |
| GeMAPS | Mean, Coefficient of Variation; | |
| eGeMAPS | ** MFCCs 1–4, Spectral Flux and Formant 2–3 Bandwidth | * Equivalent Sound Level. |
Figure 1. Computations for the prediction performance evaluations.
Accuracy percentages of each classifier on each feature set. Classifier and feature set names and abbreviations are written in bold.
| % | MLP | SMO | DT | RF | KNN | LOG | MLR |
|---|---|---|---|---|---|---|---|
| | 84.00 | 87.85 | 54.95 | 75.70 | 63.93 | 83.74 | 84.70 |
| | 80.37 | 83.74 | 53.08 | 73.46 | 58.69 | 79.44 | 78.51 |
| | 76.63 | 78.32 | 56.10 | 75.00 | 66.00 | 79.63 | 70.65 |
| | 79.25 | 79.63 | 55.14 | 74.77 | 68.22 | 79.81 | 71.96 |
Figure 2. Classification performance measures among feature sets. Precision, Recall, AUPRC and AUC values are averages weighted by the number of instances of each class in the database. Data bars represent values between 0 and 1. The length of each data bar is determined by the number in its cell.
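
The weighted averaging mentioned in the caption above can be sketched as follows. This is an illustrative example under stated assumptions: the function name `weighted_average` is hypothetical, the per-class recall values are made up for demonstration, and only the class counts come from the EMO-DB table in this record.

```python
def weighted_average(per_class_scores, class_counts):
    """Average per-class scores, weighting each class by its instance count."""
    total = sum(class_counts[c] for c in per_class_scores)
    weighted_sum = sum(per_class_scores[c] * class_counts[c] for c in per_class_scores)
    return weighted_sum / total


# Instance counts per emotion class, as listed for EMO-DB.
class_counts = {
    "Anger": 127, "Fear": 69, "Disgust": 46, "Sadness": 62,
    "Happiness": 71, "Boredom": 81, "Neutral": 79,
}

# HYPOTHETICAL per-class recall values, for illustration only (not study results).
hypothetical_recall = {
    "Anger": 0.90, "Fear": 0.80, "Disgust": 0.70, "Sadness": 0.95,
    "Happiness": 0.60, "Boredom": 0.85, "Neutral": 0.85,
}

weighted = weighted_average(hypothetical_recall, class_counts)
macro = sum(hypothetical_recall.values()) / len(hypothetical_recall)

print(f"weighted average: {weighted:.4f}")
print(f"unweighted (macro) average: {macro:.4f}")
```

Because frequent classes such as Anger carry more weight, the weighted average can differ from the unweighted mean over classes.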
Cross-Validated Paired T-Test Comparison (two-tailed) with the Test Base SMO Classifier. *: p < 0.05. **: p ≤ 0.001. Note that positive t-values indicate better performance of the SMO classifier. Classifier and feature set names and abbreviations are written in bold.
| Feature Set | MLP | DT | RF | KNN | LOG | MLR |
|---|---|---|---|---|---|---|
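
To illustrate the sign convention stated in the caption (positive t-values favor SMO), here is a minimal sketch of a textbook paired t-statistic over fold-wise accuracies. It is an assumption that this plain form matches the study's procedure; the original analysis may have used a corrected variant. The function name `paired_t_statistic` and the fold accuracies are hypothetical.

```python
import math
import statistics


def paired_t_statistic(base_scores, other_scores):
    """Paired t-statistic for matched scores (e.g., accuracy per CV fold).

    Differences are computed as base minus other, so a positive value
    means the base classifier scored higher on average.
    """
    if len(base_scores) != len(other_scores):
        raise ValueError("score lists must have the same length")
    diffs = [b - o for b, o in zip(base_scores, other_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n))


# HYPOTHETICAL accuracies for 10 folds, for illustration only.
smo_folds = [0.88, 0.86, 0.90, 0.87, 0.85, 0.89, 0.88, 0.86, 0.91, 0.87]
other_folds = [0.75, 0.78, 0.74, 0.77, 0.73, 0.76, 0.79, 0.72, 0.78, 0.75]

t_value = paired_t_statistic(smo_folds, other_folds)
print(f"t = {t_value:.3f} with {len(smo_folds) - 1} degrees of freedom")
```

A two-tailed p-value would then be read from a t distribution with n - 1 degrees of freedom.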
Figure 3. F-measures for each emotion. Color coding indicates performance, with dark green indicating best and dark red indicating poorest performance, and with yellow indicating intermediate classification performance, as shown in the color bar.
Figure 4. Confusion Matrices of the Predictions With (a) emobase/MLP, (b) IS-09/SMO, (c) GeMAPS/MLR, (d) eGeMAPS/LOG, (e) IS-09/RF, (f) eGeMAPS/KNN. The x-axis represents the ground truth labels and the y-axis represents predicted labels. Note: Figures give percentages determining the color map but also provide absolute numbers in parentheses to transparently indicate different base frequencies of the predicted emotions. Note also that percentages and numbers are omitted for empty cells to enhance readability.
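
Per-class precision, recall and F-measure can be derived from a confusion matrix laid out as in the caption above (rows are predicted labels, columns are ground truth). The following is a minimal sketch under that layout; the function name `per_class_metrics` and the three-class matrix are hypothetical and are not taken from the study's figures.

```python
def per_class_metrics(matrix, labels):
    """Compute (precision, recall, F-measure) for each class.

    matrix[i][j] is the number of items predicted as labels[i]
    whose true label is labels[j] (rows: predicted, columns: truth).
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_as_k = sum(matrix[k])           # row sum
        truly_k = sum(row[k] for row in matrix)   # column sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / truly_k if truly_k else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


# HYPOTHETICAL 3-class confusion matrix, for illustration only.
labels = ["Anger", "Fear", "Sadness"]
matrix = [
    [8, 1, 0],  # predicted Anger
    [2, 7, 1],  # predicted Fear
    [0, 2, 9],  # predicted Sadness
]

results = per_class_metrics(matrix, labels)
for label, (p, r, f) in results.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Note that many libraries use the transposed layout (rows as ground truth), so the row and column sums would swap roles there.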