| Literature DB >> 32932723 |
Tursunov Anvarjon, Soonil Kwon.
Abstract
Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine, and making the machine smart enough to recognize emotions efficiently is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates remain low and depend strongly on the languages, the emotions, and the databases. In this paper, we propose a new lightweight, effective SER model with low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) to learn deep frequency features with plain rectangular filters and a modified pooling strategy that have more discriminative power for SER. The proposed CNN model was trained on frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmarks, the interactive emotional dyadic motion capture (IEMOCAP) and the Berlin emotional speech database (EMO-DB) datasets, and obtained recognition accuracies of 77.01% and 92.02%, respectively. The experimental results demonstrate that the proposed CNN-based SER system achieves better recognition performance than state-of-the-art SER systems.
Keywords: artificial intelligence; deep frequency features extraction; deep learning; speech emotion recognition; speech spectrograms
Year: 2020 PMID: 32932723 PMCID: PMC7570673 DOI: 10.3390/s20185212
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Typical flow of a speech emotion recognition system, showing the flow from raw data to classification results.
Figure 2. Spectrogram samples of different emotional speech signals.
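Spectrogram inputs like those in Figure 2 can be produced from raw audio in several ways; the following is a minimal sketch using librosa, in which the sampling rate, FFT parameters, and the 64 × 64 output size are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch: turning a speech waveform into a fixed-size spectrogram image
# for a CNN input. Parameter values (sr, n_fft, hop_length, 64x64 target
# size) are illustrative assumptions, not the paper's settings.
import numpy as np
import librosa

def speech_to_spectrogram(path, sr=16000, n_fft=512, hop_length=256, size=64):
    # Load the utterance at a fixed sampling rate.
    y, sr = librosa.load(path, sr=sr)
    # Magnitude spectrogram via the short-time Fourier transform.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    # Log scaling makes the frequency structure easier for a CNN to learn.
    log_spec = librosa.amplitude_to_db(spec, ref=np.max)
    # Keep the lowest `size` frequency bins and crop/zero-pad the time
    # axis so every utterance yields the same size x size frame.
    log_spec = log_spec[:size, :]
    if log_spec.shape[1] < size:
        log_spec = np.pad(log_spec, ((0, 0), (0, size - log_spec.shape[1])))
    else:
        log_spec = log_spec[:, :size]
    # Normalize to [0, 1] and replicate to 3 channels (3 x 64 x 64).
    norm = (log_spec - log_spec.min()) / (log_spec.max() - log_spec.min() + 1e-8)
    return np.stack([norm, norm, norm], axis=0)
```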
Figure 3. An overview of the proposed speech emotion recognition (SER) model: we propose new plain rectangular filters with a modified pooling strategy to learn and capture the deep frequency features from the speech spectrogram.
The overall specification of the proposed model, listing its convolutional layers, input and output tensor sizes, and the number of parameters per layer (a rough code sketch of this layout is given after the table).
| Model Layers | Input Tensor | Output Tensor | Parameters |
|---|---|---|---|
| Layer 1, 2, and Layer 3. | 3 × 64 × 64 | 10 × 64 × 64 | L1 = 280, L2 = 210, and L3 = 310 |
| Pooling_1 | 10 × 64 × 64 | 10 × 62 × 64 | 0 |
| Layer 4, and Layer 5. | 10 × 62 × 64 | 20 × 62 × 64 | L4 = 620, and L5 = 1220 |
| Pooling_2 | 20 × 62 × 64 | 20 × 31 × 64 | 0 |
| Layer 6, and Layer 7. | 20 × 31 × 64 | 40 × 31 × 64 | L6 = 2440, and L7 = 4840 |
| Pooling_3 | 40 × 31 × 64 | 40 × 15 × 64 | 0 |
| Layer 8. | 40 × 15 × 64 | 80 × 15 × 64 | L8 = 7260 |
| Dense_1 | 35,840 | 80 | 2,867,280 |
| Dense_2 | 80 | 30 | 2430 |
| Softmax | 30 | Number of classes (probabilities) | Depends on the number of classes |
| Total parameters | | | 2,974,367 |
| Trainable parameters | | | 2,973,887 |
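The table above lists channel widths and parameter counts but not exact kernel shapes, so the following PyTorch sketch is only an approximation of the idea: stacked convolutions with plain rectangular kernels, pooling that shrinks one spatial axis at a time, and two dense layers before the softmax. The channel widths (10, 20, 40, 80) and dense sizes (80, 30) follow the table; the (9, 1) kernels and (2, 1) pools are assumptions.

```python
# Sketch of a rectangular-filter CNN for speech spectrograms (PyTorch).
# Channel widths and dense-layer sizes follow the table above; the
# (9, 1) rectangular kernels and (2, 1) pools are assumptions.
import torch
import torch.nn as nn

class RectFilterSER(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 10 channels, rectangular frequency-axis kernels.
            nn.Conv2d(3, 10, kernel_size=(9, 1), padding=(4, 0)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),          # pool one axis only
            # Block 2: 10 -> 20 channels.
            nn.Conv2d(10, 20, kernel_size=(9, 1), padding=(4, 0)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            # Block 3: 20 -> 40 channels.
            nn.Conv2d(20, 40, kernel_size=(9, 1), padding=(4, 0)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            # Block 4: 40 -> 80 channels.
            nn.Conv2d(40, 80, kernel_size=(9, 1), padding=(4, 0)), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(80), nn.ReLU(),   # Dense_1 (80 units, as in the table)
            nn.Linear(80, 30), nn.ReLU(),   # Dense_2 (30 units)
            nn.Linear(30, num_classes),     # logits; softmax applied in the loss
        )

    def forward(self, x):                   # x: (batch, 3, 64, 64) spectrograms
        return self.classifier(self.features(x))

# Example: four IEMOCAP classes (anger, sadness, happiness, neutral).
model = RectFilterSER(num_classes=4)
logits = model(torch.randn(8, 3, 64, 64))
print(logits.shape)                         # torch.Size([8, 4])
```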
Detailed description of the interactive emotional dyadic motion capture (IEMOCAP) dataset: emotions, numbers of utterances, and contribution rates in percent.
| Emotions | Samples | Contribution (%) |
|---|---|---|
| Anger | 1017 | 19.94 |
| Sadness | 1120 | 19.60 |
| Happiness | 1636 | 29.58 |
| Neutral | 1650 | 30.88 |
Detailed description of the Berlin emotional speech database (EMO-DB) dataset: emotions, numbers of utterances, and contribution rates in percent.
| Emotions | Samples | Contribution (%) |
|---|---|---|
| Anger | 127 | 23.74 |
| Sadness | 62 | 11.59 |
| Happiness | 71 | 13.27 |
| Neutral | 79 | 14.77 |
| Disgust | 46 | 8.60 |
| Fear | 69 | 12.90 |
| Boredom | 81 | 15.14 |
Training evaluation of the proposed model using the IEMOCAP and the EMO-DB speech datasets.
| Input Feature | Dataset | Weighted Acc % | Un-Weighted Acc % | Accuracy % |
|---|---|---|---|---|
| Speech-Spectrogram | IEMOCAP | 83 | 82 | 83 |
Figure 4. Training performance of the proposed system: training and validation accuracies, as well as losses, for the benchmark (a) IEMOCAP and (b) EMO-DB datasets.
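Curves like those in Figure 4 can be reproduced from per-epoch metrics collected during training; the sketch below assumes the accuracy and loss values have already been gathered into Python lists (whatever framework is used) and only shows the plotting step.

```python
# Sketch: plotting training/validation accuracy and loss curves like
# Figure 4. The per-epoch values are assumed to have been collected
# during training; only the plotting is shown here.
import matplotlib.pyplot as plt

def plot_history(train_acc, val_acc, train_loss, val_loss, title):
    epochs = range(1, len(train_acc) + 1)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(epochs, train_acc, label="training accuracy")
    ax1.plot(epochs, val_acc, label="validation accuracy")
    ax1.set_xlabel("epoch"); ax1.set_ylabel("accuracy"); ax1.legend()
    ax2.plot(epochs, train_loss, label="training loss")
    ax2.plot(epochs, val_loss, label="validation loss")
    ax2.set_xlabel("epoch"); ax2.set_ylabel("loss"); ax2.legend()
    fig.suptitle(title)
    plt.tight_layout()
    plt.show()
```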
Prediction performance of the suggested SER system for the IEMOCAP dataset using speech spectrograms (a sketch of how these metrics can be computed is given after the table).
| Emotions/Classes | Recall | Precision | F1-Score |
|---|---|---|---|
| Anger | 0.85 | 0.68 | 0.76 |
| Sad | 0.74 | 0.72 | 0.75 |
| Happy | 0.69 | 0.85 | 0.73 |
| Neutral | 0.81 | 0.78 | 0.80 |
| Weighted Acc | 0.77 | 0.76 | 0.77 |
| Un-weighted Acc | 0.76 | 0.76 | 0.76 |
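The per-class recall, precision, and F1 values reported above (and in the EMO-DB table below), together with the weighted and unweighted accuracies, can be computed from test-set predictions with scikit-learn. In this sketch, weighted accuracy is taken as overall accuracy and unweighted accuracy as the mean per-class recall, the usual SER conventions; whether these match the paper's exact definitions is an assumption, and the labels and predictions shown are toy values.

```python
# Sketch: per-class metrics plus weighted / unweighted accuracy with
# scikit-learn. y_true / y_pred below are toy values; in practice they
# come from the trained model's test-set output.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report)

labels = ["anger", "sad", "happy", "neutral"]             # IEMOCAP classes
y_true = ["anger", "sad", "happy", "neutral", "sad"]      # toy ground truth
y_pred = ["anger", "sad", "neutral", "neutral", "sad"]    # toy predictions

# Per-class recall, precision, and F1-score.
print(classification_report(y_true, y_pred, labels=labels))
# Weighted accuracy = overall accuracy; unweighted = mean per-class recall.
print("Weighted Acc   :", accuracy_score(y_true, y_pred))
print("Un-weighted Acc:", balanced_accuracy_score(y_true, y_pred))
```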
Prediction performance of the suggested SER system for the EMO-DB dataset using speech spectrograms.
| Emotions/Classes | Recall | Precision | F1-Score |
|---|---|---|---|
| Anger | 0.96 | 0.99 | 0.98 |
| Sadness | 0.92 | 0.86 | 0.89 |
| Happiness | 0.80 | 0.92 | 0.85 |
| Neutral | 0.88 | 0.82 | 0.86 |
| Boredom | 0.99 | 0.89 | 0.94 |
| Fear | 0.93 | 0.99 | 0.96 |
| Disgust | 0.99 | 0.98 | 0.97 |
| Weighted Acc | 0.93 | 0.93 | 0.93 |
| Un-weighted Acc | 0.92 | 0.93 | 0.92 |
Figure 5. Confusion matrices for the IEMOCAP and the EMO-DB datasets. (a) IEMOCAP confusion matrix; (b) EMO-DB confusion matrix. The x-axes show the predicted labels, and the y-axes show the actual labels.
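Confusion matrices like those in Figure 5 can be built from the same test-set predictions used for the tables above; this is a minimal scikit-learn sketch with toy labels and predictions standing in for the model's output.

```python
# Sketch: building a normalized confusion matrix like Figure 5 with
# scikit-learn. y_true / y_pred are toy values standing in for the
# trained model's test-set output.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["anger", "sad", "happy", "neutral"]              # IEMOCAP classes
y_true = ["anger", "sad", "happy", "neutral", "happy"]     # toy ground truth
y_pred = ["anger", "sad", "neutral", "neutral", "happy"]   # toy predictions

# Row-normalized counts, so each row sums to 1 (per-class recall on the diagonal).
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.xlabel("Predicted label"); plt.ylabel("Actual label")
plt.show()
```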
Comparison of the proposed model with the baseline SER methods in terms of time complexity and accuracy on the two speech datasets.
| Models | IEMOCAP Time | Accuracy | EMO-DB Time | Accuracy |
|---|---|---|---|---|
| ACRNN [ | 13,487 s | 64.74% | 6811 s | 82.82% |
| ADRNN [ | 13,887 s | 69.32% | 7187 s | 84.99% |
| CB-SER [ | 10,452 s | 72.25% | 5396 s | 85.57% |
Comparative analysis of the suggested SER system and the baseline SER systems using the IEMOCAP and the EMO-DB speech corpora.
| Authors/Reference/Year | Dataset | Un-Weighted Accuracy |
|---|---|---|
| Zhao et al. [ | IEMOCAP | 52.14% |
| Fayek et al. [ | // | 64.78% |
| Guo et al. [ | // | 57.10% |
| Zhang et al. [ | // | 40.02% |
| Han et al. [ | // | 51.24% |
| Meng et al. [ | // | 69.32% |
| Zhao et al. [ | // | 66.50% |
| Luo et al. [ | // | 63.98% |
| Jiang, S, et al. [ | // | 61.60% |
| Chen et al. [ | // | 64.74% |
| Issa et al. [ | // | 64.03% |
| Mustaqeem et al. [ | // | 72.25% |
| Guo et al. [ | EMO-DB | 84.49% |
| Meng et al. [ | // | 88.99% |
| Chen et al. [ | // | 82.82% |
| Badshah et al. [ | // | 80.79% |
| Jiang et al. [ | // | 84.53% |
| Issa et al. [ | // | 86.10% |
| Mustaqeem et al. [ | // | 85.57% |