Ammar Amjad, Lal Khan, Hsien-Tsung Chang.
Abstract
Speech emotion recognition (SER) systems have become an important means of recognizing a person in several applications, including e-commerce, everyday interactions, law enforcement, and forensics. The efficiency of an SER system depends on the length of the audio samples used for training and testing. Although the models evaluated in this study obtained relatively high accuracy, SER efficiency is not yet optimal because the limited database leads to overfitting and skewed samples. The proposed approach therefore presents a data augmentation method that shifts the pitch, uses multiple window sizes, stretches the time, and adds white noise to the original audio. In addition, a deep model is evaluated to generate a new paradigm for SER. In the proposed system, data augmentation increased the limited amount of data in the Pakistani racial speaker speech dataset. A seven-layer framework was employed to provide the best accuracy compared with other multilayer approaches; seven-layer methods are also used in existing works to achieve a very high level of accuracy. The suggested system achieved 97.32% accuracy with a 0.032% loss at the 75%:25% splitting ratio, with more than 500 augmented samples added. The results show that deep neural networks with data augmentation can enhance SER performance on the Pakistani racial speech dataset.
Keywords: Data augmentation; Deep neural network; Multiple window size; Speaker recognition
Year: 2022 PMID: 36091976 PMCID: PMC9454772 DOI: 10.7717/peerj-cs.1053
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
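The augmentation steps named in the abstract (white noise, time stretching, pitch shifting) can be illustrated with a minimal standard-library sketch. The function names, the noise level, and the rates below are assumptions made for illustration, not values from the paper. Note also that plain linear resampling changes duration and pitch together; it stands in here for the separate time-stretch and pitch-shift operations, which in practice would need something like a phase vocoder.

```python
import math
import random


def add_white_noise(signal, noise_std=0.005, seed=0):
    """Return a copy of `signal` with zero-mean Gaussian noise added.

    `noise_std` is a hypothetical setting; the source gives no noise level.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def resample_linear(signal, rate):
    """Resample by linear interpolation.

    rate > 1 shortens the signal (faster, higher pitch on playback);
    rate < 1 lengthens it (slower, lower pitch on playback).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_out = max(1, int(round(len(signal) / rate)))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = int(pos)
        if lo >= len(signal) - 1:
            # Past the last interpolable point: hold the final sample.
            out.append(signal[-1])
            continue
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[lo + 1] * frac)
    return out


# Demo on a synthetic 440 Hz tone sampled at 8 kHz (both values are assumptions).
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1000)]
noisy = add_white_noise(tone)
faster = resample_linear(tone, 1.25)
slower = resample_linear(tone, 0.8)
augmented = [noisy, faster, slower]
```

Each original clip yields several variants this way, which is how a small corpus can be enlarged before training.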
Detailed description of datasets.
| Reference | Approach | Database | Classes | Accuracy or error rate |
|---|---|---|---|---|
| | HMM and GMM | S-PTH database | 4 | 13.8% and 24.6% error rate |
| | DNNs | The First Accents of the British Isles Speech | 14 | 3.91% and 10.5% error rate |
| | Support Vector Machine, Random Forest and Gaussian Mixture Model | Recorded Pakistan ethnic speaker | 6 | 92.55% |
| | SB-CNN | UrbanSound8K | 10 | 94% |
| | Deep Belief Network | FAS Database | 6 | 90.2% |
| | Fuzzy Vector Quantization | TIMIT | 100 | 98.8% |
| | HMM One layer Deep Neural Network | VOCE | 4 | 90% |
| | CNN | Spontaneous Urdu dataset | – | 87.5% |
| | DNNs | Indonesian speech | 4 | 98.96% |
| | GMM | TIMIT | 8 | 98.44% |
| | Support Vector Machine | speaker ethnicity | 4 | 56.96% |
| | DNN | OC16 | 2 | 86.10% |
Figure 1. Structure of a deep neural network.
Figure 2. Structure of the proposed approach.
Duration of audio speech data in hours.
| Racial group | Number of male and female speakers | Duration per sample | Number of samples | Nature of samples |
|---|---|---|---|---|
| Punjabi | 4 males and 4 females | 42 s | 500 samples | Speaker and text independent |
| Urdu | 4 males and 4 females | 42 s | 500 samples | Speaker and text independent |
| Sindhi | 32 males and 38 females | 30 s | 80 samples | Speaker and text independent |
| Saraiki | 42 males and 28 females | 30 s | 80 samples | Speaker and text independent |
| Pashto | 35 males and 35 females | 30 s | 80 samples | Speaker and text independent |
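The table lists durations in seconds per sample, while its caption refers to hours. As a rough worked conversion, assuming every sample has exactly the listed duration (the source does not say whether these are exact or nominal lengths), the total can be computed as:

```python
# (group, duration per sample in seconds, number of samples), copied from the table.
ROWS = [
    ("Punjabi", 42, 500),
    ("Urdu", 42, 500),
    ("Sindhi", 30, 80),
    ("Saraiki", 30, 80),
    ("Pashto", 30, 80),
]


def group_hours(rows):
    """Map each group to its total audio duration in hours."""
    return {name: seconds * count / 3600 for name, seconds, count in rows}


def total_hours(rows):
    """Total audio duration in hours across all groups."""
    return sum(seconds * count for _, seconds, count in rows) / 3600


total_samples = sum(count for _, _, count in ROWS)
hours_by_group = group_hours(ROWS)
hours_total = total_hours(ROWS)

for name, hours in hours_by_group.items():
    print(f"{name}: {hours:.2f} h")
print(f"Total: {hours_total:.2f} h over {total_samples} samples")
```

Under that assumption the corpus holds 1,240 samples totalling 49,200 s, or about 13.67 hours, most of it in the Punjabi and Urdu groups.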
Figure 3. Block diagram of the computation steps of MFCC.
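For readers without access to the figure, the usual MFCC pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, logarithm, DCT) can be sketched in plain Python. This is a textbook-style sketch, not the authors' implementation: the pre-emphasis coefficient, frame lengths, filter count, and number of coefficients are all assumed values, and a naive DFT is used in place of an FFT for self-containment.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return [signal[0]] + [signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))]


def frame_signal(signal, frame_len, hop):
    """Cut the signal into overlapping frames, dropping any incomplete tail."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame):
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]


def power_spectrum(frame):
    """Naive DFT power spectrum over bins 0..n/2 (slow, but dependency-free)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    banks = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):
            row[k] = (k - left) / (center - left)
        for k in range(center, right):
            row[k] = (right - k) / (right - center)
        banks.append(row)
    return banks


def dct2(values, n_coeffs):
    """Unnormalised DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]


def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    banks = mel_filterbank(n_filters, frame_len, sample_rate)
    feats = []
    for frame in frame_signal(pre_emphasis(signal), frame_len, hop):
        spec = power_spectrum(hamming(frame))
        energies = [sum(w * p for w, p in zip(row, spec)) for row in banks]
        log_e = [math.log(e + 1e-10) for e in energies]  # epsilon avoids log(0)
        feats.append(dct2(log_e, n_coeffs))
    return feats


# Demo: the same tone analysed with two window sizes (sizes are assumptions).
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(1024)]
features = {size: mfcc(tone, sr, frame_len=size, hop=size // 2) for size in (256, 512)}
```

Running the extractor with more than one `frame_len` is one plausible reading of the "multiple window sizes" mentioned in the abstract: shorter windows give finer time resolution, longer ones finer frequency resolution.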
Figure 4. Structure of the proposed approach.
Figure 5. Proposed model performance on the training dataset.
Figure 6. Proposed model performance on the testing dataset.
Comparison of classification accuracy and loss at each train:test dividing ratio.
| Dividing ratio (train:test) | Classification accuracy (%) | Total loss |
|---|---|---|
| 90:10 | 93.55 | 0.105 |
| 80:20 | 95.767 | 0.093 |
| 75:25 | 97.32 | 0.032 |
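To make the dividing ratios concrete, the sketch below splits a pool of sample indices at 90:10, 80:20, and 75:25. The pool size of 1,240 is the sum of the sample counts in the dataset table before augmentation; whether the authors split before or after augmentation, and how they shuffled, is not stated, so the seed and the procedure here are assumptions.

```python
import random


def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle indices reproducibly and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_fraction))
    return indices[:n_train], indices[n_train:]


N_SAMPLES = 1240  # 500 + 500 + 80 + 80 + 80, from the dataset table

split_sizes = {}
for train_fraction in (0.90, 0.80, 0.75):
    train, test = split_indices(N_SAMPLES, train_fraction)
    split_sizes[train_fraction] = (len(train), len(test))
    print(f"{train_fraction:.2f}: {len(train)} train, {len(test)} test")
```

A fixed seed keeps the three splits comparable, since each ratio then draws from the same shuffled order.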
Accuracy and loss by number of augmented samples at the 75:25 ratio.
| Augmented samples | Accuracy (%) | Loss |
|---|---|---|
| 100 | 96.57 | 1.33 |
| 200 | 96.21 | 0.05 |
| 300 | 96.83 | 2.77 |
| 400 | 96.45 | 0.035 |
| 500 | 97.32 | 0.031 |
Accuracy and loss by number of augmented samples at the 80:20 ratio.
| Augmented samples | Accuracy (%) | Loss |
|---|---|---|
| 100 | 95.12 | 6.33 |
| 200 | 95.99 | 0.04 |
| 300 | 96.13 | 0.19 |
| 400 | 96.29 | 0.66 |
| 500 | 97.09 | 2.77 |
Accuracy and loss by number of augmented samples at the 90:10 ratio.
| Augmented samples | Accuracy (%) | Loss |
|---|---|---|
| 100 | 95.21 | 0.13 |
| 200 | 96.90 | 0.28 |
| 300 | 96.34 | 3.22 |
| 400 | 96.99 | 6.23 |
| 500 | 97.01 | 5.232 |
Comparison of outcomes with different ML and DL algorithms.
| Dataset | Classifier | Accuracy (%) |
|---|---|---|
| Pakistani racial speaker classification | KNN | 81.99 |
| | Random Forest | 71.56 |
| | Multilayer Perceptron (MLP) | 91.45 |
| | Decision Tree | 67.45 |
| | Three layers Deep Neural Network | 92.56 |
| | Five layers Deep Neural Network | 94.78 |
| | Seven Layer DNN-DA (Proposed) | 97.732 |
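As a rough illustration of what a seven-layer feedforward classifier looks like, the sketch below runs a forward pass through seven dense layers in plain Python. Everything about the architecture is assumed: the record does not give layer widths, activations, or the input dimension, so the 13-value input (one MFCC vector), the hidden sizes, ReLU, and softmax are hypothetical choices. Only the five output classes follow from the five groups in the dataset table.

```python
import math
import random

# Hypothetical widths: 13 inputs, six hidden layers, 5 output classes.
# Eight sizes give seven weight layers.
LAYER_SIZES = [13, 128, 128, 64, 64, 32, 32, 5]


def init_layers(sizes, seed=0):
    """Create (weights, biases) per layer with He-style random initialisation."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def dense(x, weights, biases):
    """Affine map: one dot product plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """ReLU on every layer except the last, which feeds a softmax."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


layers = init_layers(LAYER_SIZES)
probs = forward([0.1] * 13, layers)
predicted_class = probs.index(max(probs))
```

The weights here are random, so the output is only a well-formed probability vector over the five classes; training on the augmented features would be needed before the prediction means anything.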