| Literature DB >> 35401731 |
Abeer Ali Alnuaim1, Mohammed Zakariah2, Aseel Alhadlaq1, Chitra Shashidhar3, Wesam Atef Hatamleh4, Hussam Tarazi5, Prashant Kumar Shukla6, Rajnish Ratna7.
Abstract
Emotions play an essential role in human relationships, and many real-time applications rely on interpreting the speaker's emotion from speech. Speech emotion recognition (SER) modules aid human-computer interaction (HCI) applications, but they are challenging to implement because of the lack of balanced training data and of clarity about which features suffice for categorization. This research examines the impact of the classification approach, the most appropriate combination of features, and data augmentation on speech emotion detection accuracy. Selecting the right combination of handcrafted features and classifier is integral to reducing computational complexity. The proposed classification model, a one-dimensional convolutional neural network (1D CNN), outperforms traditional machine learning approaches. Unlike most earlier studies, which examined emotions primarily through a single-language lens, our analysis covers data sets in several languages. With the most discriminative features and data augmentation, our technique achieves 97.09%, 96.44%, and 83.33% accuracy on the BAVED, ANAD, and SAVEE data sets, respectively.
Year: 2022 PMID: 35401731 PMCID: PMC8989588 DOI: 10.1155/2022/7463091
Source DB: PubMed Journal: Comput Intell Neurosci
Data set distribution.
| Gender | Low | Neutral | High |
|---|---|---|---|
| Male | 342 | 379 | 388 |
| Female | 243 | 293 | 290 |
| Total | 585 | 672 | 678 |

Total number of samples: 1935.
Figure 1. Distribution of target classes in the data set.
Figure 2. Waveforms for the three classes of emotions.
Figure 3. Spectrograms for the three classes of emotions.
Figure 4. Preprocessing and feature extraction.
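As a minimal, illustrative sketch of the feature-extraction step, the snippet below computes two of the simpler handcrafted features the paper uses, zero-crossing rate (ZCR) and RMS energy, per frame in pure Python. A real pipeline would use an audio library such as librosa, which also provides the chroma, mel spectrogram, MFCC, contrast, and tonnetz features; the frame length, hop size, and toy signal here are assumptions for demonstration only.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    return [
        signal[i : i + frame_len]
        for i in range(0, len(signal) - frame_len + 1, hop)
    ]

if __name__ == "__main__":
    # Toy low-frequency sine in place of a real speech waveform.
    sig = [math.sin(2 * math.pi * 5 * t / 100) for t in range(400)]
    frames = frame_signal(sig, frame_len=100, hop=50)
    feats = [(zero_crossing_rate(f), rms_energy(f)) for f in frames]
    print(len(frames), feats[0])
```

Per-frame ZCR and energy values like these are typically stacked with the spectral features into one vector per utterance before classification.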
Figure 5. 1D CNN architecture.
Figure 6. A detailed description of the 1D CNN.
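To make the core operation of the 1D CNN concrete, here is a hedged, pure-Python sketch of a single one-dimensional convolution layer sliding a kernel along a feature vector, followed by ReLU activation. The kernel weights and input values are illustrative assumptions, not the trained model; the actual architecture (Figures 5 and 6) stacks several such layers before dense classification layers.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in CNN layers)."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(x) - k + 1)
    ]

def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]

if __name__ == "__main__":
    features = [0.1, 0.5, -0.3, 0.8, 0.2, -0.6]  # e.g. a slice of MFCCs
    kernel = [1.0, -1.0, 0.5]                    # illustrative weights
    print(relu(conv1d(features, kernel)))
```

Note the output is shorter than the input by `len(kernel) - 1`; in a framework such as Keras this corresponds to a `Conv1D` layer with `padding="valid"`.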
Performance of different combinations of features and models without augmentation.
| Data set | Features | Augmentation | Model | Accuracy (%) |
|---|---|---|---|---|
| BAVED | Chroma, melspectrogram, MFCC | No | 1D convolution | 81.395 |
| | | | KNeighborsClassifier | 79.07 |
| | | | RandomForestClassifier | 78.04 |
| | | | SVC (RBF kernel) | 76.74 |
| | | | SVC | 72.61 |
| | | | DecisionTreeClassifier | 68.22 |
| | | | AdaBoostClassifier | 66.15 |
| | | | QuadraticDiscriminantAnalysis | 55.30 |
| | | | GaussianNB | 51.68 |
| | Chroma, melspectrogram, MFCC, contrast, tonnetz, ZCR, RMSE, energy, flux, centroid, rolloff | No | 1D convolution | 82.07 |
| | | | KNeighborsClassifier | 79.59 |
| | | | RandomForestClassifier | 79.07 |
| | | | SVC (RBF kernel) | 75.71 |
| | | | SVC | 73.64 |
| | | | DecisionTreeClassifier | 63.31 |
| | | | AdaBoostClassifier | 67.96 |
| | | | QuadraticDiscriminantAnalysis | 59.69 |
| | | | GaussianNB | 51.16 |
Performance of different combinations of features and models with augmentation.
| Data set | Features | Augmentation | Model | Accuracy (%) |
|---|---|---|---|---|
| BAVED | Chroma, melspectrogram, MFCC | Yes | 1D convolution (CNN) | 96.38 |
| | | | RandomForestClassifier | 89.02 |
| | | | KNeighborsClassifier | 79.78 |
| | | | DecisionTreeClassifier | 75.78 |
| | | | SVC (RBF kernel) | 74.87 |
| | | | SVC | 72.74 |
| | | | AdaBoostClassifier | 65.76 |
| | | | QuadraticDiscriminantAnalysis | 56.52 |
| | | | GaussianNB | 50.19 |
| | Chroma, melspectrogram, MFCC, contrast, tonnetz, ZCR, RMSE | Yes | 1D convolution (CNN) | 97.09 |
| | | | RandomForestClassifier | 92.25 |
| | | | KNeighborsClassifier | 82.88 |
| | | | SVC (RBF kernel) | 79.46 |
| | | | DecisionTreeClassifier | 77.97 |
| | | | SVC | 73.64 |
| | | | AdaBoostClassifier | 68.60 |
| | | | QuadraticDiscriminantAnalysis | 58.91 |
| | | | GaussianNB | 52.00 |
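The augmented rows above show large gains for the 1D CNN. As a minimal sketch of the kind of waveform-level augmentation typically applied in SER work, the snippet below implements additive Gaussian noise and a circular time shift in pure Python; the noise factor and shift amounts are illustrative assumptions, and pitch shifting or time stretching (often done with librosa) are omitted for brevity. The source does not specify exactly which transforms the authors used.

```python
import random

def add_noise(signal, noise_factor=0.005, rng=random):
    """Inject low-amplitude Gaussian noise into the waveform."""
    return [s + noise_factor * rng.gauss(0.0, 1.0) for s in signal]

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)

if __name__ == "__main__":
    sig = [0.0, 0.2, 0.4, 0.6]
    print(time_shift(sig, 1))   # [0.6, 0.0, 0.2, 0.4]
    print(add_noise(sig)[:2])   # slightly perturbed samples
```

Each augmented copy keeps its original emotion label, which both enlarges and helps balance the training set.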
Figure 7. Accuracy and loss graphs of the 1D CNN.
Figure 8. Confusion matrix of the 1D CNN for the BAVED database.
Recognition accuracy on individual emotion classes of BAVED.
| Data set | Low (%) | Neutral (%) | High (%) |
|---|---|---|---|
| BAVED | 97.45 | 95.87 | 97.76 |
Recognition accuracy on individual emotion classes of ANAD.
| Data set | Angry (%) | Happy (%) | Surprised (%) |
|---|---|---|---|
| ANAD | 98 | 91 | 96 |
Figure 9. Confusion matrix of ANAD using the 1D CNN.
Figure 10. Confusion matrix of SAVEE using the 1D CNN.
Recognition accuracy on individual emotion classes of SAVEE.
| Data set | Angry (%) | Disgust (%) | Fear (%) | Happy (%) | Neutral (%) | Sad (%) | Surprise (%) |
|---|---|---|---|---|---|---|---|
| SAVEE | 78 | 75 | 79 | 83 | 99 | 71 | 87 |
Recognition accuracy of 1D CNN on ANAD and SAVEE.
| Data set | Accuracy (%) | F1 score (%) | Recall (%) | Precision (%) |
|---|---|---|---|---|
| ANAD | 96.44 | 95.46 | 95.17 | 95.82 |
| SAVEE | 83.33 | 83.579 | 83.579 | 83.579 |
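As a hedged sketch of how accuracy, macro-averaged precision, recall, and F1 scores like those above are typically derived from a confusion matrix (Figures 8-10), the snippet below computes all four in pure Python. The toy matrix is illustrative, not the paper's data, and the macro averaging shown here is an assumption about the authors' reporting convention.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = count of class-i samples predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        pred_c = sum(cm[r][c] for r in range(n))   # column sum
        true_c = sum(cm[c])                        # row sum
        p = tp / pred_c if pred_c else 0.0
        r = tp / true_c if true_c else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,   # macro average
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

if __name__ == "__main__":
    toy = [[8, 1, 1],
           [0, 9, 1],
           [1, 0, 9]]
    print(metrics_from_confusion(toy))
```

With macro averaging, precision, recall, and F1 can come out nearly identical when per-class scores are balanced, which is consistent with the SAVEE row above.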
Summary of accuracies (%) obtained by various authors using the BAVED database.
| Method | Model | Accuracy (%) |
|---|---|---|
| [ ] | wav2vec2.0 | 89 |
| Ours | 1D CNN | 97.09 |
Summary of accuracies (%) obtained by various authors using the ANAD database.
| Method | Model | Accuracy (%) |
|---|---|---|
| [ ] | Linear SVM | 96.02 |
| Ours | 1D CNN | 96.44 |
Summary of accuracies (%) obtained by various authors using the SAVEE database.
| Method | Model | Accuracy (%) |
|---|---|---|
| [ ] | VACNN + BOVW | 75 |
| [ ] | DCNN + CFS + SVM | 82.10 |
| Ours | 1D CNN | 83.33 |