Giovanni Costantini, Emilia Parada-Cabaleiro, Daniele Casali, Valerio Cesarini.
Abstract
Machine Learning (ML) algorithms within a human-computer framework are the leading force in speech emotion recognition (SER). However, few studies explore cross-corpora aspects of SER; this work explores the feasibility and characteristics of cross-linguistic, cross-gender SER. Three ML classifiers (SVM, Naïve Bayes and MLP) are applied to acoustic features, obtained through a procedure based on Kononenko's discretization and correlation-based feature selection. The system encompasses five emotions (disgust, fear, happiness, anger and sadness), using the Emofilm database, composed of short clips of English movies and the respective Italian and Spanish dubbed versions, for a total of 1115 annotated utterances. The results show MLP to be the most effective classifier, with accuracies higher than 90% for single-language approaches, while the cross-language classifier still yields accuracies higher than 80%. The results also show cross-gender tasks to be more difficult than those involving two languages, suggesting greater differences between emotions expressed by male versus female subjects than between different languages. Four feature domains, namely RASTA, F0, MFCC and spectral energy, are algorithmically assessed as the most effective, refining existing literature and approaches based on standard sets. To our knowledge, this is one of the first studies encompassing cross-gender and cross-linguistic assessments on SER.
Keywords: English; SER; SVM; artificial intelligence; cross-gender; cross-linguistic; emotion recognition; machine learning; speech
Year: 2022 PMID: 35408076 PMCID: PMC9003467 DOI: 10.3390/s22072461
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
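
The abstract names correlation-based feature selection (CFS) without stating its scoring rule. As a rough illustration only, the sketch below computes the usual CFS merit of a feature subset, k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. Pearson correlation is used as the correlation measure, which is an assumption made for brevity; the function names `pearson` and `cfs_merit` and the toy vectors are hypothetical and do not come from the paper, and Kononenko's discretization step is not shown.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant sequence has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(features, target):
    """CFS merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy data (assumed, not from the paper): a numeric class code and three features.
target = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # perfectly correlated with the class
duplicate = [0, 0, 1, 1]     # fully redundant with `informative`
noise = [1, 0, 0, 1]         # uncorrelated with the class

merit_single = cfs_merit([informative], target)
merit_with_noise = cfs_merit([informative, noise], target)
merit_with_duplicate = cfs_merit([informative, duplicate], target)
print(merit_single, merit_with_noise, merit_with_duplicate)
```

On this toy data, adding the uncorrelated feature lowers the merit, while adding a fully redundant copy leaves it unchanged, which is the behaviour that lets a CFS-style search discard both irrelevant and redundant features.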
Review of the most representative works in SER and the classifiers they employed.
| Study | Year | Database | Emotions | Features | Classifier | Reported Results |
|---|---|---|---|---|---|---|
| Alonso et al. | 2015 | EMO-DB | Happy, Angry, Sad, Bored | Spectral, Prosody, Pitch | SVM | 94.9% (EMO-DB) |
| Shukla et al. | 2016 | SUSAS | Neutral, Angry, Sad, Lombard, others | MFCC | HMM | 93.9% |
| Wen et al. | 2017 | EMO-DB, SAVEE | Neutral, Happy, Angry, Sad, Fear, Disgust, Surprise | Spectral, Prosody, Hu Moments | DBN, SVM | 82.3% (EMO-DB) |
| Sun et al. | 2019 | EMO-DB, CASIA | Neutral, Happy, Angry, Sad, Bored, Fear, Disgust | Spectral, Prosody, MFCC, Voice Quality | SVM | 86.7% (EMO-DB) |
| Kerkeni et al. | 2019 | EMO-DB, Spanish | Neutral, Happy, Angry, Sad, Bored, Fear, Disgust | MFCC, Spectral | SVM, RNN | 83% (EMO-DB) |
| Aftab et al. | 2021 | EMO-DB, IEMOCAP | Neutral, Happy, Angry, Sad, Bored, Fear, Disgust | - | CNN | 94.2% (EMO-DB) |
| Zehra et al. | 2021 | EMO-DB, SAVEE, EMOVO | Neutral, Happy, Angry, Sad, Fear, Disgust, Surprise | MFCC, Spectral, Prosody | SVM | Many (single and cross-corpora) |
| Gat et al. | 2022 | IEMOCAP | Neutral, Happy, Sad, Angry | - | Gradient-based Adversary Learning | 81% |
Number of clips for each emotion, language and gender. For example, “It M” means “Italian males”, and there are 37 clips of Italian males labelled with the “Disgust” emotion. The last column and the last row show the totals over all emotions and all tasks, respectively.
| Task | Dis | Hap | Fea | Ang | Sad | Total |
|---|---|---|---|---|---|---|
| It M | 37 | 23 | 33 | 41 | 31 | 165 |
| It F | 35 | 27 | 37 | 36 | 43 | 178 |
| Sp M | 33 | 33 | 29 | 43 | 44 | 182 |
| Sp F | 31 | 17 | 47 | 39 | 43 | 177 |
| En M | 41 | 30 | 40 | 35 | 44 | 190 |
| En F | 44 | 38 | 54 | 38 | 49 | 223 |
| Total | 221 | 168 | 240 | 232 | 254 | 1115 |
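
The totals in the table above can be cross-checked with a few lines of arithmetic. The sketch below copies the per-task counts from the table and recomputes the row, column, gender and grand totals; the variable names are illustrative only.

```python
# Clip counts copied from the table above; column order is Dis, Hap, Fea, Ang, Sad.
EMOTIONS = ["Dis", "Hap", "Fea", "Ang", "Sad"]
COUNTS = {
    "It M": [37, 23, 33, 41, 31],
    "It F": [35, 27, 37, 36, 43],
    "Sp M": [33, 33, 29, 43, 44],
    "Sp F": [31, 17, 47, 39, 43],
    "En M": [41, 30, 40, 35, 44],
    "En F": [44, 38, 54, 38, 49],
}

# Total per task (last column) and per emotion (last row).
row_totals = {task: sum(clips) for task, clips in COUNTS.items()}
col_totals = {
    emo: sum(clips[i] for clips in COUNTS.values())
    for i, emo in enumerate(EMOTIONS)
}
grand_total = sum(row_totals.values())

# Total per gender, read from the second token of the task label ("M" or "F").
gender_totals = {}
for task, total in row_totals.items():
    gender = task.split()[1]
    gender_totals[gender] = gender_totals.get(gender, 0) + total

print(row_totals)
print(col_totals)
print(gender_totals, grand_total)
```

The grand total comes to 1115 clips, which matches the number of annotated utterances given in the abstract, split into 537 male and 578 female clips.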
Figure 1. Number of speakers that uttered a given number of clips: (a) female; (b) male. For example, 17 females uttered one clip, 20 females uttered two clips and one female uttered 25 clips (last point on the x-axis).
Figure 2. Flowchart for the SER Machine Learning framework.
Classification performance in terms of weighted accuracy (WA). Emotion labels are abbreviated as follows: “DIS” = Disgust; “HAP” = Happy; “FEAR” = Fear; “ANG” = Angry; “SAD” = Sad. “Best emotion” and “Worst emotion” refer to the most and least accurately classified classes.
| Classification: Language(s) | Classification: Gender(s) | No. of Features | WA (%): SVM | WA (%): NB | WA (%): MLP | Best Emotion | Worst Emotion |
|---|---|---|---|---|---|---|---|
| It | M | 160 | 94.2 | 89.7 | 96.0 | sad | dis |
| It | F | 177 | 93.7 | 89.5 | 94.2 | sad | hap |
| It | M + F | 176 | 80.4 | 77.0 | 83.3 | ang | fea |
| Sp | M | 158 | 97.2 | 95.5 | 97.7 | sad | dis |
| Sp | F | 163 | 91.8 | 91.2 | 91.8 | ang | hap |
| Sp | M + F | 167 | 82.5 | 82.5 | 85.0 | sad | dis |
| En | M | 166 | 97.2 | 95.5 | 97.2 | sad | fea |
| En | F | 173 | 94.6 | 94.6 | 95.8 | sad | hap |
| En | M + F | 149 | 81.9 | 78.4 | 82.5 | ang | hap |
| It + Sp | M | 196 | 89.8 | 84.3 | 91.0 | ang | dis |
| It + Sp | F | 202 | 85.2 | 84.0 | 88.4 | ang | fea |
| Sp + En | M | 199 | 89.0 | 85.1 | 89.9 | ang | dis |
| Sp + En | F | 176 | 84.4 | 85.1 | 89.9 | ang | hap |
| It + En | M | 215 | 85.8 | 80.6 | 85.8 | sad | dis |
| It + En | F | 185 | 79.7 | 81.4 | 82.5 | ang | hap |
| All | M | 215 | 85.3 | 77.7 | 85.3 | sad | fea |
| All | F | 195 | 78.4 | 76.4 | 80.3 | sad | hap |
| All | M + F | 204 | 67.3 | 60.6 | 67.3 | ang | fea |
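
The abstract's claim that cross-gender tasks are harder than two-language tasks can be checked against the table above. The sketch below is one possible way to do so, not the authors' analysis: it takes the last of the three WA columns (the highest or tied-highest value in every row) and averages it per group of tasks. The grouping key and the names `ROWS`, `group_key` and `group_means` are assumptions made for illustration.

```python
from statistics import mean

# (languages, genders, last WA column in %) copied from the table above.
ROWS = [
    ("It", "M", 96.0), ("It", "F", 94.2), ("It", "M + F", 83.3),
    ("Sp", "M", 97.7), ("Sp", "F", 91.8), ("Sp", "M + F", 85.0),
    ("En", "M", 97.2), ("En", "F", 95.8), ("En", "M + F", 82.5),
    ("It + Sp", "M", 91.0), ("It + Sp", "F", 88.4),
    ("Sp + En", "M", 89.9), ("Sp + En", "F", 89.9),
    ("It + En", "M", 85.8), ("It + En", "F", 82.5),
    ("All", "M", 85.3), ("All", "F", 80.3), ("All", "M + F", 67.3),
]


def group_key(languages, genders):
    """Return (number of languages, whether both genders are pooled)."""
    n_languages = 3 if languages == "All" else languages.count("+") + 1
    return n_languages, genders == "M + F"


grouped = {}
for languages, genders, wa in ROWS:
    grouped.setdefault(group_key(languages, genders), []).append(wa)

group_means = {key: mean(values) for key, values in grouped.items()}
for (n_languages, pooled), value in sorted(group_means.items()):
    label = "both genders" if pooled else "single gender"
    print(f"{n_languages} language(s), {label}: {value:.2f}")
```

With these figures, single-language, single-gender tasks average about 95%, two-language single-gender tasks about 88%, and single-language tasks pooling both genders about 84%, so pooling genders costs more accuracy than pooling two languages.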
Figure 3. Confusion matrices for the SVM classifier for the It M, It F, All M and All F comparisons. Emotion labels are abbreviated as follows: “DIS” = Disgust; “HAP” = Happy; “FEAR” = Fear; “ANG” = Angry; “SAD” = Sad.
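
As a sketch of how accuracy figures and best/worst emotions can be read off a confusion matrix, the example below assumes that weighted accuracy is the share of correctly classified utterances over all utterances, which is a common convention in SER but is not defined in this record. The matrix values are hypothetical and are NOT taken from Figure 3; the function names are illustrative.

```python
LABELS = ["DIS", "HAP", "FEAR", "ANG", "SAD"]


def weighted_accuracy(cm):
    """Correctly classified utterances divided by all utterances (assumed definition)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm, labels):
    """Fraction of each true class that was predicted correctly."""
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


# Hypothetical confusion matrix (made-up values, not from the paper).
# Rows are true classes, columns are predicted classes, both in LABELS order.
cm = [
    [8, 1, 1, 0, 0],
    [1, 7, 0, 2, 0],
    [1, 0, 8, 0, 1],
    [0, 1, 0, 9, 0],
    [0, 0, 0, 0, 10],
]

wa = weighted_accuracy(cm)
recall = per_class_recall(cm, LABELS)
best_emotion = max(recall, key=recall.get)
worst_emotion = min(recall, key=recall.get)
print(f"WA = {wa:.1%}, best = {best_emotion}, worst = {worst_emotion}")
```

For this made-up matrix, 42 of 50 utterances lie on the diagonal, so the accuracy is 84%, with SAD as the best class and HAP as the worst.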
Feature list for the It M classification task. Please note that numbers in square brackets are not references, but refer to the number of the window for the specific filtering.
| It M—Feature List |
|---|
| audSpec_Rfilt_sma[7]_leftctime |
| audSpec_Rfilt_sma[8]_quartile1 |
| audSpec_Rfilt_sma[10]_quartile1 |
| audSpec_Rfilt_sma[10]_lpc4 |
| audSpec_Rfilt_sma[11]_leftctime |
| audSpec_Rfilt_sma[12]_lpc4 |
| audSpec_Rfilt_sma[15]_lpc3 |
| audSpec_Rfilt_sma[16]_maxPos |
| audSpec_Rfilt_sma[20]_risetime |
| audSpec_Rfilt_sma[21]_minPos |
| audSpec_Rfilt_sma[22]_percentile1.0 |
| audSpec_Rfilt_sma[25]_percentile1.0 |
| pcm_Mag_fband250-650_sma_lpgain |
| pcm_Mag_fband250-650_sma_lpc0 |
| pcm_Mag_fband250-650_sma_lpc2 |
| pcm_Mag_fband1000-22000_sma_iqr2-3 |
| pcm_Mag_fband1000-22000_sma_lpc3 |
| pcm_Mag_spectralRollOff50.0_sma_quartile2 |
| pcm_Mag_spectralRollOff75.0_sma_quartile1 |
| pcm_Mag_spectralRollOff75.0_sma_risetime |
| pcm_Mag_spectralRollOff90.0_sma_risetime |
| pcm_Mag_spectralFlux_sma_lpc0 |
| pcm_Mag_spectralCentroid_sma_quartile1 |
| pcm_Mag_spectralCentroid_sma_lpc1 |
| pcm_Mag_spectralEntropy_sma_lpc0 |
| pcm_Mag_spectralVariance_sma_quartile3 |
| pcm_Mag_spectralVariance_sma_iqr1-3 |
| pcm_Mag_spectralKurtosis_sma_quartile1 |
| pcm_Mag_harmonicity_sma_quartile2 |
| mfcc_sma[1]_quartile1 |
| mfcc_sma[1]_quartile3 |
| mfcc_sma[1]_pctlrange0-1 |
| mfcc_sma[1]_skewness |
| mfcc_sma[1]_leftctime |
| mfcc_sma[2]_percentile1.0 |
| mfcc_sma[2]_lpc0 |
| mfcc_sma[3]_quartile1 |
| mfcc_sma[4]_skewness |
| mfcc_sma[6]_maxPos |
| mfcc_sma[6]_percentile1.0 |
| mfcc_sma[6]_upleveltime75 |
| mfcc_sma[6]_lpc1 |
| mfcc_sma[8]_lpgain |
| mfcc_sma[10]_maxPos |
| mfcc_sma[10]_quartile2 |
| mfcc_sma[10]_stddev |
| mfcc_sma[11]_quartile3 |
| mfcc_sma[11]_percentile1.0 |
| mfcc_sma[12]_quartile3 |
| mfcc_sma[13]_percentile99.0 |
| mfcc_sma[13]_upleveltime75 |
| mfcc_sma[14]_percentile1.0 |
| mfcc_sma[14]_skewness |
| mfcc_sma[14]_upleveltime50 |
| audSpec_Rfilt_sma_de[2]_leftctime |
| audSpec_Rfilt_sma_de[8]_iqr1-3 |
| audSpec_Rfilt_sma_de[13]_lpc0 |
| audSpec_Rfilt_sma_de[21]_quartile2 |
| audSpec_Rfilt_sma_de[23]_quartile2 |
| audspec_lengthL1norm_sma_iqr2-3 |
| pcm_zcr_sma_skewness |
| audspec_lengthL1norm_sma_de_range |
| audspec_lengthL1norm_sma_de_stddev |
| audspec_lengthL1norm_sma_de_lpc4 |
| audspecRasta_lengthL1norm_sma_de_iqr2-3 |
| pcm_Mag_fband250-650_sma_de_iqr1-3 |
| pcm_Mag_fband1000-22000_sma_de_iqr1-3 |
| pcm_Mag_spectralRollOff25.0_sma_de_minPos |
| pcm_Mag_spectralRollOff25.0_sma_de_percentile1.0 |
| pcm_Mag_spectralRollOff50.0_sma_de_leftctime |
| pcm_Mag_spectralFlux_sma_de_iqr1-3 |
| pcm_Mag_spectralFlux_sma_de_lpgain |
| pcm_Mag_spectralCentroid_sma_de_quartile2 |
| pcm_Mag_spectralCentroid_sma_de_percentile1.0 |
| pcm_Mag_spectralSkewness_sma_de_iqr2-3 |
| pcm_Mag_spectralSlope_sma_de_lpc2 |
| pcm_Mag_harmonicity_sma_de_upleveltime50 |
| mfcc_sma_de[2]_iqr1-3 |
| mfcc_sma_de[2]_percentile1.0 |
| mfcc_sma_de[3]_lpgain |
| mfcc_sma_de[3]_lpc1 |
| mfcc_sma_de[4]_minPos |
| mfcc_sma_de[4]_lpc3 |
| mfcc_sma_de[5]_percentile1.0 |
| mfcc_sma_de[5]_lpgain |
| mfcc_sma_de[6]_iqr1-2 |
| mfcc_sma_de[6]_lpc0 |
| mfcc_sma_de[7]_quartile2 |
| mfcc_sma_de[11]_percentile99.0 |
| mfcc_sma_de[13]_skewness |
| mfcc_sma_de[14]_leftctime |
| F0final_sma_rqmean |
| F0final_sma_quartile1 |
| F0final_sma_quartile2 |
| F0final_sma_quartile3 |
| F0final_sma_skewness |
| F0final_sma_upleveltime25 |
| jitterLocal_sma_linregc1 |
| jitterLocal_sma_iqr1-2 |
| jitterLocal_sma_iqr1-3 |
| shimmerLocal_sma_iqr2-3 |
| shimmerLocal_sma_iqr1-3 |
| shimmerLocal_sma_lpc0 |
| F0final_sma_de_qregc1 |
| F0final_sma_de_risetime |
| jitterLocal_sma_de_posamean |
| jitterLocal_sma_de_iqr1-2 |
| audspec_lengthL1norm_sma_qregc3 |
| audSpec_Rfilt_sma[0]_flatness |
| audSpec_Rfilt_sma[5]_minRangeRel |
| audSpec_Rfilt_sma[6]_peakMeanAbs |
| audSpec_Rfilt_sma[6]_peakMeanRel |
| audSpec_Rfilt_sma[11]_minRangeRel |
| pcm_Mag_fband250-650_sma_linregc1 |
| pcm_Mag_fband250-650_sma_qregc1 |
| pcm_Mag_fband1000-22000_sma_peakRangeAbs |
| pcm_Mag_fband1000-22000_sma_qregc3 |
| pcm_Mag_spectralRollOff25.0_sma_qregc2 |
| pcm_Mag_spectralRollOff90.0_sma_flatness |
| pcm_Mag_spectralFlux_sma_stddevFallingSlope |
| pcm_Mag_spectralEntropy_sma_qregc3 |
| pcm_Mag_spectralVariance_sma_meanFallingSlope |
| pcm_Mag_spectralSkewness_sma_peakMeanMeanDist |
| pcm_Mag_spectralSlope_sma_peakRangeRel |
| pcm_Mag_harmonicity_sma_rqmean |
| pcm_Mag_harmonicity_sma_peakRangeRel |
| pcm_Mag_harmonicity_sma_peakMeanAbs |
| mfcc_sma[1]_peakDistStddev |
| mfcc_sma[1]_peakMeanAbs |
| mfcc_sma[1]_meanFallingSlope |
| mfcc_sma[1]_qregc3 |
| mfcc_sma[2]_linregerrQ |
| mfcc_sma[4]_rqmean |
| mfcc_sma[5]_meanFallingSlope |
| mfcc_sma[8]_peakMeanRel |
| mfcc_sma[9]_peakMeanAbs |
| mfcc_sma[9]_peakMeanMeanDist |
| mfcc_sma[10]_peakMeanRel |
| mfcc_sma[11]_peakMeanAbs |
| mfcc_sma[12]_peakDistStddev |
| mfcc_sma[12]_peakMeanRel |
| mfcc_sma[13]_stddevRisingSlope |
| mfcc_sma[13]_qregc2 |
| mfcc_sma[14]_stddevFallingSlope |
| audspec_lengthL1norm_sma_de_posamean |
| audspec_lengthL1norm_sma_de_peakMeanMeanDist |
| audspec_lengthL1norm_sma_de_meanFallingSlope |
| audSpec_Rfilt_sma_de[18]_minRangeRel |
| audSpec_Rfilt_sma_de[24]_peakMeanRel |
| audSpec_Rfilt_sma_de[25]_peakRangeRel |
| pcm_Mag_fband1000-22000_sma_de_peakMeanAbs |
| pcm_Mag_spectralRollOff75.0_sma_de_meanPeakDist |
| pcm_Mag_spectralSkewness_sma_de_minRangeRel |
| pcm_Mag_spectralSlope_sma_de_peakMeanAbs |
| pcm_Mag_harmonicity_sma_de_peakRangeAbs |
| mfcc_sma_de[2]_meanRisingSlope |
| mfcc_sma_de[7]_meanPeakDist |
| mfcc_sma_de[7]_meanRisingSlope |
| mfcc_sma_de[9]_peakDistStddev |
| mfcc_sma_de[14]_peakRangeAbs |
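
The feature names in the list above follow a regular pattern: a low-level descriptor, the `_sma` suffix, an optional `_de` (delta) marker, an optional index in square brackets, and a functional. The sketch below parses that pattern and tallies features by domain. The regular expression was written against the names in this list only, and the prefix-to-domain mapping (for example, treating `audSpec_Rfilt` and `audspecRasta` as RASTA, and `pcm_Mag_fband` as spectral energy) is an assumption made for illustration, not the authors' grouping.

```python
import re
from collections import Counter

# <descriptor>_sma[_de][[index]]_<functional>, as seen in the list above.
NAME_RE = re.compile(
    r"^(?P<lld>.+?)_sma(?P<delta>_de)?(?:\[(?P<index>\d+)\])?_(?P<functional>.+)$"
)


def parse_feature(name):
    """Split a feature name into descriptor, delta flag, index and functional."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised feature name: {name}")
    index = match["index"]
    return {
        "lld": match["lld"],
        "delta": match["delta"] is not None,
        "index": int(index) if index is not None else None,
        "functional": match["functional"],
    }


def domain(lld):
    """Map a descriptor to a coarse domain (assumed mapping, see lead-in)."""
    key = lld.lower()
    if key.startswith(("audspec_rfilt", "audspecrasta")):
        return "RASTA"
    if key.startswith("mfcc"):
        return "MFCC"
    if key.startswith("f0"):
        return "F0"
    if key.startswith("pcm_mag_fband"):
        return "spectral energy"
    return "other"


# A small subset of the list above, used as sample input.
sample = [
    "audSpec_Rfilt_sma[7]_leftctime",
    "audSpec_Rfilt_sma_de[13]_lpc0",
    "audspecRasta_lengthL1norm_sma_de_iqr2-3",
    "pcm_Mag_fband250-650_sma_lpgain",
    "pcm_Mag_spectralFlux_sma_lpc0",
    "mfcc_sma[1]_quartile1",
    "mfcc_sma_de[2]_iqr1-3",
    "F0final_sma_rqmean",
    "F0final_sma_de_qregc1",
    "jitterLocal_sma_linregc1",
]

domain_counts = Counter(domain(parse_feature(name)["lld"]) for name in sample)
print(domain_counts)
```

Running the same tally over the full list would give a rough picture of how the selected features are spread across domains, subject to the assumed mapping.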