| Literature DB >> 35336548 |
Two-Way Feature Extraction for Speech Emotion Recognition Using Deep Learning
Apeksha Aggarwal, Akshat Srivastava, Ajay Agarwal, Nidhi Chahal, Dilbag Singh, Abeer Ali Alnuaim, Aseel Alhadlaq, Heung-No Lee.
Abstract
Recognizing human emotions by machines is a complex task. Deep learning models attempt to automate this process by enabling machines to learn from data, yet identifying human emotions from speech with good performance remains challenging. With the advent of deep learning algorithms, this problem has recently been addressed; however, most past research relied on a single feature extraction method for training. In this research, we explore two different methods of extracting features for effective speech emotion recognition. Initially, two-way feature extraction is proposed, utilizing super convergence to extract two sets of potential features from the speech data. In the first approach, principal component analysis (PCA) is applied to obtain a numeric feature set, on which a deep neural network (DNN) with dense and dropout layers is then trained. In the second approach, mel-spectrogram images are extracted from the audio files, and these 2D images are given as input to a pre-trained VGG-16 model. Extensive experiments and an in-depth comparative analysis of both feature extraction methods, with multiple algorithms and over two datasets, are performed in this work. On the RAVDESS dataset, the image-based VGG-16 approach provided significantly better accuracy than the DNN trained on numeric features.
Keywords: machine learning; neural network; speech emotion recognition
Year: 2022 PMID: 35336548 PMCID: PMC8949356 DOI: 10.3390/s22062378
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
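The abstract's first pipeline (PCA-reduced numeric features classified by a DNN with dense and dropout layers) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the input feature dimensionality, number of PCA components, layer widths, dropout rates, and training settings are all assumptions, and the feature vectors (e.g., per-utterance MFCC statistics) are placeholders.

```python
# Sketch of approach I: PCA feature reduction + DNN with dense/dropout layers.
# All dimensions and hyperparameters below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

def build_dnn(input_dim: int, num_classes: int) -> Sequential:
    """DNN with dense and dropout layers, as outlined in the abstract."""
    model = Sequential([
        Input(shape=(input_dim,)),
        Dense(256, activation="relu"),
        Dropout(0.3),
        Dense(128, activation="relu"),
        Dropout(0.3),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Placeholder data: rows are per-utterance numeric speech features.
X = np.random.rand(200, 180).astype("float32")
y = np.random.randint(0, 8, size=200)   # 8 emotion classes (as in RAVDESS)

pca = PCA(n_components=40)              # first feature set via PCA
X_reduced = pca.fit_transform(X)

model = build_dnn(X_reduced.shape[1], num_classes=8)
model.fit(X_reduced, y, epochs=5, batch_size=32, verbose=0)
```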
Figure 1. Schematic representation of two-way feature extraction for speech data.
Figure 2. Architecture of the DNN model.
Figure 3. Accuracy and loss analyses of the proposed two-way feature extraction-based DNN on the TESS dataset.
Figure 4. Confusion matrices for feature extraction and modeling using the DNN over the two datasets. Diagonal elements in dark blue show accurately predicted classes.
Figure 5. Confusion matrices for feature extraction and modeling using VGG-16 over the two datasets. Diagonal elements in dark blue show accurately predicted classes.
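The second pipeline (mel-spectrogram images classified by a pre-trained VGG-16, as in Figure 5) could look roughly like the sketch below, assuming librosa for spectrogram extraction and Keras for the model; the file paths, image size, frozen backbone, and classification head are illustrative choices, not the paper's exact setup.

```python
# Sketch of approach II: mel-spectrogram images fed to pre-trained VGG-16.
# Paths, image size, and the fine-tuning head are illustrative assumptions.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten

def audio_to_melspectrogram_image(wav_path: str, out_path: str) -> None:
    """Render a mel-spectrogram of one utterance as a 2D image file."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    plt.figure(figsize=(2.24, 2.24), dpi=100)   # ~224x224 px for VGG-16
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()

# Pre-trained VGG-16 backbone with a new classification head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                           # freeze ImageNet features
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)
out = Dense(8, activation="softmax")(x)          # 8 emotion classes
model = Model(base.input, out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```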
Comparative study with prior approaches on the RAVDESS dataset.
| SERIAL NO. | APPROACH | MODEL USED | DATASET USED | ACCURACY |
|---|---|---|---|---|
| 01 | Dissanayake | CNN-LSTM (encoder) | RAVDESS | 56.71% |
| 02 | Li et al. | Multimodal Fine-Grained Learning | RAVDESS | 74.7% |
| 03 | Xu et al. | Attention Networks | RAVDESS | 77.4% |
| 04 | Proposed Approach II | 2D Feature Extraction + VGG-16 | RAVDESS | 81.94% |
Figure 6. Confusion matrices for feature extraction and modeling using ResNet-18 over the two datasets. Diagonal elements in dark blue show accurately predicted classes.
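For readers who want to produce confusion-matrix plots like those in Figures 4-6, a minimal sketch follows, assuming scikit-learn and matplotlib; the labels and predictions are placeholders, and the emotion names are the standard eight RAVDESS classes.

```python
# Sketch of a confusion-matrix plot like Figures 4-6; y_true/y_pred are
# placeholders standing in for test labels and model predictions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

emotions = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]  # RAVDESS classes
y_true = np.random.randint(0, 8, size=100)   # placeholder true labels
y_pred = np.random.randint(0, 8, size=100)   # placeholder predictions

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, display_labels=emotions,
    cmap="Blues", xticks_rotation=45)
plt.tight_layout()
plt.show()
```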
Comparative analysis of model accuracy on the RAVDESS and TESS datasets.
| SERIAL NO. | MODEL | RAVDESS ACCURACY | TESS ACCURACY |
|---|---|---|---|
| 01 | Decision Tree | 37.85% | 3.21% |
| 02 | Random Forest | 46.88% | 7.68% |
| 03 | MLPClassifier | 33.68% | 15.54% |
| 04 | ResNet18 | 79.16% | 96.26% |
| 05 | Proposed-I | 73.95% | 99.99% |
| 06 | Proposed-II | 81.94% | 97.15% |