Misbah Farooq, Fawad Hussain, Naveed Khan Baloch, Fawad Riasat Raja, Heejung Yu, Yousaf Bin Zikria.
Abstract
Speech emotion recognition (SER) plays a significant role in human-machine interaction. Recognizing emotion from speech and classifying it precisely is a challenging task because a machine cannot understand the context of an utterance. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not sufficient to accurately depict the emotional state of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from benchmark speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Database of Emotional Speech (Emo-DB), Surrey Audio-Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). In speaker-dependent experiments, our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS. Moreover, our method yields the best speaker-independent SER results compared with existing handcrafted-feature-based SER approaches.
Keywords: correlation-based feature selection; deep convolutional neural network; speech emotion recognition
Year: 2020 PMID: 33113907 PMCID: PMC7660211 DOI: 10.3390/s20216008
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The framework of our proposed methodology.
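The abstract and Figure 1 describe a three-stage pipeline: deep features extracted with a pretrained CNN, correlation-based feature selection, and classical classifiers. Below is a minimal Python sketch of such a pipeline; the log-mel spectrogram front end, the VGG16 backbone, and the simple feature-class correlation threshold (a rough stand-in for full correlation-based feature selection) are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a DCNN-feature + correlation-based-selection + SVM pipeline.
# Assumptions (not stated in this record): log-mel spectrogram inputs,
# a torchvision VGG16 backbone, and a feature-class correlation threshold.
import numpy as np
import librosa
import torch
import torchvision.models as models
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def log_mel_image(path, sr=16000, n_mels=128):
    """Load an utterance and turn it into a 3-channel log-mel 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)
    logmel = (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)
    img = torch.tensor(np.stack([logmel] * 3), dtype=torch.float32)  # (3, n_mels, T)
    return torch.nn.functional.interpolate(
        img.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False)

# Pretrained convolutional backbone used only as a fixed feature extractor.
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

@torch.no_grad()
def dcnn_features(paths):
    """Pool the last convolutional feature map into one vector per utterance."""
    feats = []
    for p in paths:
        fmap = backbone(log_mel_image(p))            # (1, 512, 7, 7)
        feats.append(fmap.mean(dim=(2, 3)).squeeze(0).numpy())
    return np.vstack(feats)                          # (n_utterances, 512)

def select_correlated_features(X, y, threshold=0.2):
    """Keep features whose absolute correlation with the integer-encoded
    emotion label exceeds a threshold (crude stand-in for full CFS)."""
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    corr = np.nan_to_num(np.abs(corr))
    return np.where(corr > threshold)[0]

# Hypothetical usage:
# X_train = dcnn_features(train_paths); y_train = integer_encoded_labels
# keep = select_correlated_features(X_train, y_train)
# clf = SVC(kernel="rbf", C=10).fit(StandardScaler().fit_transform(X_train[:, keep]), y_train)
```

The pooled feature vectors selected this way can then be fed to any of the classifiers compared in the tables below (MLP, SVM, random forest, k-NN).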
Weighted average recall and standard deviation of speaker-dependent SER experiments without feature selection.
| Dataset | MLP | SVM | RF | KNN |
|---|---|---|---|---|
| Emo-DB | | | | |
| SAVEE | | | | |
| IEMOCAP | | | | |
| RAVDESS | | | | |
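These tables report weighted average recall with a standard deviation, which implies repeated train/test splits. A small sketch of computing such figures with scikit-learn is shown below; the ten stratified folds and the SVM classifier are assumptions for illustration, since the exact protocol is not spelled out in this record.

```python
# Sketch: weighted average recall (± std) over stratified k-fold runs.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score
from sklearn.svm import SVC

def weighted_recall_cv(X, y, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        # 'weighted' recall averages per-class recall weighted by class support.
        scores.append(recall_score(y[test_idx], pred, average="weighted"))
    return 100 * np.mean(scores), 100 * np.std(scores)

# mean, std = weighted_recall_cv(features, labels)
# print(f"{mean:.2f} ± {std:.2f}")
```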
Weighted average recall and standard deviation of speaker-dependent SER experiments with feature selection.
| Dataset | MLP | SVM | RF | KNN |
|---|---|---|---|---|
| Emo-DB | | | | |
| SAVEE | | | | |
| IEMOCAP | | | | |
| RAVDESS | | | | |
Figure 2. Confusion matrix of the Emo-DB dataset for speaker-dependent SER.
Figure 3. Confusion matrix of the SAVEE dataset for speaker-dependent SER.
Figure 4. Confusion matrix of the RAVDESS dataset for speaker-dependent SER.
Figure 5. Confusion matrix of the IEMOCAP dataset for speaker-dependent SER.
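Figures 2–5 show per-emotion confusion matrices for the speaker-dependent experiments. A minimal way to produce such a plot from held-out labels and predictions with scikit-learn is sketched below; the emotion names are hypothetical placeholders.

```python
# Sketch: per-emotion confusion matrix plot from held-out predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion(y_true, y_pred, emotions):
    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,
        display_labels=emotions,
        normalize="true",            # row-normalize so each cell is per-class recall
        values_format=".2f",
    )
    plt.title("Confusion matrix (speaker-dependent SER)")
    plt.tight_layout()
    plt.show()

# Hypothetical usage:
# plot_confusion(y_test, clf.predict(X_test), ["anger", "happiness", "sadness", "neutral"])
```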
Weighted average recall and standard deviation of speaker-independent SER experiments without feature selection.
| Dataset | MLP | SVM | RF | KNN |
|---|---|---|---|---|
| Emo-DB | | | | |
| SAVEE | | | | |
| IEMOCAP | | | | |
| RAVDESS | | | | |
Weighted average recall and standard deviation of speaker-independent SER experiments with feature selection.
| Dataset | MLP | SVM | RF | KNN |
|---|---|---|---|---|
| Emo-DB | 90.50 ± 2.60 | 85.00 ± 2.95 | 80.15 ± 2.68 | 78.90 ± 2.92 |
| SAVEE | 66.90 ± 5.18 | 65.40 ± 5.21 | 57.20 ± 6.74 | 56.10 ± 6.62 |
| IEMOCAP | 72.20 ± 3.14 | 76.60 ± 3.36 | 71.30 ± 4.31 | 69.28 ± 4.86 |
| RAVDESS | 73.50 ± 3.48 | 69.21 ± 4.69 | 65.28 ± 4.24 | 61.53 ± 4.73 |
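Speaker-independent evaluation requires that utterances from the same speaker never appear in both the training and test sets; leave-one-speaker-out cross-validation is one common way to enforce this, although this record does not state the exact protocol used. A sketch with scikit-learn's LeaveOneGroupOut, where the speaker-ID grouping and the MLP classifier are illustrative assumptions:

```python
# Sketch: speaker-independent evaluation via leave-one-speaker-out CV.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import recall_score

def speaker_independent_recall(X, y, speaker_ids):
    logo = LeaveOneGroupOut()
    scores = []
    for train_idx, test_idx in logo.split(X, y, groups=speaker_ids):
        # Each fold holds out every utterance of exactly one speaker.
        clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores.append(recall_score(y[test_idx], pred, average="weighted"))
    return 100 * np.mean(scores), 100 * np.std(scores)
```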
Figure 6. Confusion matrix of the Emo-DB dataset for speaker-independent SER.
Figure 7. Confusion matrix of the SAVEE dataset for speaker-independent SER.
Figure 8. Confusion matrix of the RAVDESS dataset for speaker-independent SER.
Figure 9. Confusion matrix of the IEMOCAP dataset for speaker-independent SER.
Comparison of speaker-dependent experiments with state-of-the-art approaches.
| Dataset | Reference | Features | Accuracy (%) |
|---|---|---|---|
| Emo-DB | [ ] | openSMILE features | 84.62 |
| | [ ] | MFCCs, spectral centroids and MFCC derivatives | 92.45 |
| | [ ] | Amplitude spectrogram and phase information | 91.78 |
| | [ ] | 3-D ACRNN | 82.82 |
| | [ ] | ADRNN | 90.78 |
| | Proposed | DCNN features + correlation-based feature selection | 95.10 |
| SAVEE | [ ] | openSMILE features | 72.39 |
| | Proposed | DCNN features + correlation-based feature selection | 82.10 |
| IEMOCAP | [ ] | Convolution-LSTM | 68 |
| | [ ] | Attention-BLSTM | 64 |
| | [ ] | CNN + LSTM | 64.50 |
| | [ ] | 3-D ACRNN | 64.74 |
| | [ ] | ADRNN | 74.96 |
| | Proposed | DCNN features + correlation-based feature selection | 83.80 |
| RAVDESS | [ ] | MFCCs, spectral centroids and MFCC derivatives | 75.69 |
| | [ ] | Spectrogram + GResNet | 64.48 |
| | Proposed | DCNN features + correlation-based feature selection | 81.30 |
Comparison of speaker-independent experiments with state-of-the-art approaches.
| Dataset | Reference | Features | Accuracy (%) |
|---|---|---|---|
| Emo-DB | [ ] | LLDs Stats | 82.40 |
| | [ ] | Emobase feature set | 76.90 |
| | [ ] | OpenSmile features + ADAN | 83.74 |
| | [ ] | ResNet model + Deep BiLSTM | 85.57 |
| | [ ] | Complementary features + KELM | 84.49 |
| | [ ] | ADRNN | 85.39 |
| | [ ] | DCNN + DTPM | 87.31 |
| | Proposed | DCNN features + correlation-based feature selection | 90.50 |
| SAVEE | [ ] | LLDs Stats | 51.50 |
| | [ ] | eGeMAPs feature set | 42.40 |
| | Proposed | DCNN features + correlation-based feature selection | 66.90 |
| IEMOCAP | [ ] | OpenSmile features + ADAN | 65.01 |
| | [ ] | IS10 + DBN | 60.9 |
| | [ ] | SP + CNN | 64.80 |
| | [ ] | ResNet model + Deep BiLSTM | 72.2 |
| | [ ] | Complementary features + KELM | 57.10 |
| | [ ] | ADRNN | 69.32 |
| | Proposed | DCNN features + correlation-based feature selection | 76.60 |
| RAVDESS | Proposed | DCNN features + correlation-based feature selection | 73.50 |