Abstract
Speech is the most significant mode of communication among human beings and, captured through a microphone sensor, a potential method for human-computer interaction (HCI). Quantifiable emotion recognition from speech signals using these sensors is an emerging area of HCI research, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the goal is to determine a speaker's emotional state from their speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture that uses the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals, which are enhanced in prior steps to improve performance. Local hidden patterns are learned in convolutional layers that use special strides, rather than pooling layers, to down-sample the feature maps, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used to classify the emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, where it improves accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. These results demonstrate the effectiveness and significance of the proposed SER technique and indicate its applicability in real-world applications.
Keywords: artificial intelligence; emotion recognition; neural networks; noise removal; signals enhancement; spectrogram
Year: 2019 PMID: 31905692 PMCID: PMC6982825 DOI: 10.3390/s20010183
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. A block diagram of pre-processing to enhance speech signals with an adaptive threshold value.
Figure 2. Visual representations of speech signals as 2D spectrograms for various emotions.
Figure 3. Overall architecture of the proposed deep stride convolutional neural network for speech emotion recognition.
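The abstract describes convolutional layers that down-sample feature maps with strides instead of pooling, followed by a SoftMax classifier. The following is a minimal NumPy sketch of that idea only, not the authors' network: the 128x128 input size, the 3x3 kernel, the stride of 2, the random (untrained) weights, and the four example logits are all assumptions chosen for illustration.

```python
import numpy as np

def conv2d_strided(x, kernel, stride=2):
    """Valid (unpadded) 2D cross-correlation of one channel, applied with a stride."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + kh, c:c + kw] * kernel)
    return out

def softmax(z):
    """Numerically stable softmax over a 1D vector of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
spectrogram = rng.random((128, 128))     # hypothetical spectrogram size
kernel = rng.standard_normal((3, 3))     # hypothetical, untrained 3x3 filter

# A stride of 2 roughly halves each spatial dimension without a pooling layer.
feature_map = conv2d_strided(spectrogram, kernel, stride=2)
print(feature_map.shape)  # (63, 63)

# Hypothetical logits for four emotion classes, turned into probabilities.
probs = softmax(np.array([2.0, 0.5, -1.0, 0.1]))
```

With a stride of 2 the layer both filters and down-samples in a single step, which is the role a separate pooling layer would otherwise play.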
Comparison of the proposed model, using raw spectrograms and clean spectrograms.
| Model | Input | Dataset | Weighted Acc% | Unweighted Acc% | F1 Score% |
|---|---|---|---|---|---|
| Model | Raw spec | IEMOCAP | 76 | 72 | 77 |
| Model | Clean spec | IEMOCAP | 84 | 82 | 84 |
| Model | Raw spec | RAVDESS | 68 | 61 | 70 |
| Model | Clean spec | RAVDESS | 80 | 79 | 81 |
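The table above reports both weighted and unweighted accuracy without defining them here. Under the convention common in the SER literature (an assumption, since this excerpt does not state it), weighted accuracy is the overall fraction of correct predictions and unweighted accuracy is the mean of the per-class recalls. The sketch below uses a hypothetical three-class count matrix, not data from the paper, to show how the two can differ on imbalanced classes.

```python
import numpy as np

# Hypothetical confusion counts (rows: true class, columns: predicted class).
counts = np.array([
    [80, 15,  5],   # 100 samples of class 0
    [10, 30, 10],   #  50 samples of class 1
    [ 5,  5, 40],   #  50 samples of class 2
])

# Weighted accuracy: every sample counts equally.
weighted_acc = np.trace(counts) / counts.sum()

# Unweighted accuracy: every class counts equally (mean per-class recall).
per_class_recall = np.diag(counts) / counts.sum(axis=1)
unweighted_acc = per_class_recall.mean()

print(f"weighted:   {weighted_acc:.4f}")    # 0.7500
print(f"unweighted: {unweighted_acc:.4f}")  # 0.7333
```

The larger class pulls the weighted figure up, while the unweighted figure is held down by the weakest class.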
Training performance of the proposed DSCNN model on raw and clean spectrograms using IEMOCAP.
| Emotion | Raw: Precision | Raw: Recall | Raw: F1 Score | Clean: Precision | Clean: Recall | Clean: F1 Score |
|---|---|---|---|---|---|---|
| Anger | 0.96 | 0.87 | 0.91 | 0.87 | 0.96 | 0.91 |
| Happy | 0.58 | 0.85 | 0.69 | 0.97 | 0.68 | 0.80 |
| Neutral | 1.00 | 0.76 | 0.86 | 0.77 | 0.91 | 0.83 |
| Sad | 0.69 | 0.92 | 0.79 | 0.82 | 0.92 | 0.84 |
| Weighted Avg | 0.80 | 0.77 | 0.76 | 0.86 | 0.85 | 0.85 |
| Unweighted Avg | 0.76 | 0.73 | 0.72 | 0.86 | 0.86 | 0.82 |
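The per-emotion F1 scores in the raw-spectrogram columns above are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short check below recomputes them from the tabulated precision and recall values; the variable names are illustrative only.

```python
# (precision, recall) pairs from the raw-spectrogram columns for IEMOCAP.
raw_iemocap = {
    "Anger":   (0.96, 0.87),
    "Happy":   (0.58, 0.85),
    "Neutral": (1.00, 0.76),
    "Sad":     (0.69, 0.92),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_by_emotion = {
    emotion: round(f1_score(p, r), 2) for emotion, (p, r) in raw_iemocap.items()
}
print(f1_by_emotion)
# {'Anger': 0.91, 'Happy': 0.69, 'Neutral': 0.86, 'Sad': 0.79}
```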
Training performance of the proposed DSCNN model on raw and clean spectrograms using RAVDESS.
| Emotion | Raw: Precision | Raw: Recall | Raw: F1 Score | Clean: Precision | Clean: Recall | Clean: F1 Score |
|---|---|---|---|---|---|---|
| Anger | 0.40 | 1.00 | 0.57 | 0.79 | 0.91 | 0.84 |
| Happy | 0.92 | 0.29 | 0.44 | 0.79 | 0.90 | 0.84 |
| Neutral | 0.91 | 0.42 | 0.57 | 0.71 | 1.00 | 0.83 |
| Sad | 0.98 | 0.98 | 0.98 | 0.90 | 0.96 | 0.93 |
| Calm | 0.82 | 0.75 | 0.78 | 0.71 | 0.94 | 0.81 |
| Fearful | 0.00 | 0.00 | 0.00 | 1.00 | 0.50 | 0.67 |
| Surprised | 0.90 | 0.46 | 0.61 | 0.89 | 0.87 | 0.88 |
| Disgust | 0.92 | 0.86 | 0.89 | 1.00 | 0.38 | 0.55 |
| Weighted Avg | 0.79 | 0.70 | 0.68 | 0.85 | 0.81 | 0.80 |
| Unweighted Avg | 0.73 | 0.59 | 0.61 | 0.85 | 0.81 | 0.79 |
Confusion matrix for emotion prediction on IEMOCAP, with an average recall of 81.75%. Each row shows how a ground-truth emotion is distributed across the predicted emotions.
| Emotion Class | Anger | Happy | Neutral | Sad |
|---|---|---|---|---|
| Anger | 0.90 | 0.06 | 0.01 | 0.03 |
| Happy | 0.09 | 0.74 | 0.08 | 0.09 |
| Neutral | 0.00 | 0.07 | 0.90 | 0.02 |
| Sad | 0.00 | 0.02 | 0.25 | 0.73 |
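The average recall quoted in the caption can be reproduced from the matrix itself: because each row is normalized over the ground-truth class, the diagonal holds the per-class recall, and its mean is the average recall. A small check, with the values copied from the table above:

```python
import numpy as np

labels = ["Anger", "Happy", "Neutral", "Sad"]

# Row-normalized confusion matrix (rows: ground truth, columns: prediction).
iemocap_cm = np.array([
    [0.90, 0.06, 0.01, 0.03],
    [0.09, 0.74, 0.08, 0.09],
    [0.00, 0.07, 0.90, 0.02],
    [0.00, 0.02, 0.25, 0.73],
])

per_class_recall = np.diag(iemocap_cm)
average_recall = per_class_recall.mean()
print(f"{100 * average_recall:.2f}%")  # 81.75%
```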
Confusion matrix for emotion prediction on RAVDESS, with an average recall of 79.5%. Each row shows how a ground-truth emotion is distributed across the predicted emotions.
| Emotion Class | Anger | Calm | Disgust | Fear | Happy | Neutral | Sad | Surprised |
|---|---|---|---|---|---|---|---|---|
| Anger | 0.82 | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 0.00 | 0.03 |
| Calm | 0.00 | 0.85 | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 0.00 |
| Disgust | 0.18 | 0.16 | 0.52 | 0.00 | 0.00 | 0.05 | 0.00 | 0.09 |
| Fear | 0.21 | 0.14 | 0.00 | 0.43 | 0.07 | 0.11 | 0.00 | 0.04 |
| Happy | 0.04 | 0.00 | 0.00 | 0.00 | 0.87 | 0.02 | 0.02 | 0.04 |
| Neutral | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.95 | 0.00 | 0.03 |
| Sad | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.04 | 0.94 | 0.00 |
| Surprised | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.98 |
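One way to read a matrix of this size is to look, for each ground-truth emotion, at the wrong class it is most often assigned to. The sketch below masks the diagonal and takes the largest remaining entry in each row; the values are copied from the table above, and ties are broken by column order, which is an arbitrary choice of this sketch.

```python
import numpy as np

labels = ["Anger", "Calm", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprised"]

# Row-normalized confusion matrix (rows: ground truth, columns: prediction).
ravdess_cm = np.array([
    [0.82, 0.00, 0.00, 0.00, 0.15, 0.00, 0.00, 0.03],
    [0.00, 0.85, 0.00, 0.00, 0.00, 0.15, 0.00, 0.00],
    [0.18, 0.16, 0.52, 0.00, 0.00, 0.05, 0.00, 0.09],
    [0.21, 0.14, 0.00, 0.43, 0.07, 0.11, 0.00, 0.04],
    [0.04, 0.00, 0.00, 0.00, 0.87, 0.02, 0.02, 0.04],
    [0.00, 0.03, 0.00, 0.00, 0.00, 0.95, 0.00, 0.03],
    [0.00, 0.00, 0.00, 0.00, 0.02, 0.04, 0.94, 0.00],
    [0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.98],
])

# Zero the diagonal on a copy so only misclassifications remain.
off_diagonal = ravdess_cm.copy()
np.fill_diagonal(off_diagonal, 0.0)

top_confusion = {
    labels[i]: (labels[j], float(off_diagonal[i, j]))
    for i, j in enumerate(off_diagonal.argmax(axis=1))
}
for truth, (predicted, rate) in top_confusion.items():
    print(f"{truth:>9} is most often mistaken for {predicted} ({rate:.2f})")
```

On these values, Fear and Disgust, the two classes with the lowest recall, are both most often mistaken for Anger.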
Training and testing accuracy of the proposed DSCNN model.
| Dataset | Training Accuracy | Testing Accuracy |
|---|---|---|
| IEMOCAP | 84% | 81.75% |
| RAVDESS | 81% | 79.50% |
Confusion matrix for cross-corpus emotion prediction between IEMOCAP and RAVDESS.
| Emotion Class | Anger | Happy | Neutral | Sad |
|---|---|---|---|---|
| Anger | 0.77 | 0.08 | 0.05 | 0.09 |
| Happy | 0.11 | 0.44 | 0.01 | 0.43 |
| Neutral | 0.01 | 0.10 | 0.56 | 0.33 |
| Sad | 0.00 | 0.51 | 0.00 | 0.49 |
Comparison of the proposed method with baseline methods using the IEMOCAP dataset.
| Method | Input | Weighted Accuracy | Unweighted Accuracy | Accuracy |
|---|---|---|---|---|
| Fayek et al. [ | Spectrograms | 64.78% | 60.89% | - |
| Luo et al. [ | Spectrograms | 60.35% | 63.98% | - |
| Tripathi et al. [ | Spectrograms | 71.3% | 61.6% | - |
| Yenigalla et al. [ | Spectrograms | 73.9% | 68.5% | - |
| Chen et al. [ | Spectrograms | - | - | 64.74% |
| Proposed model | Raw_Spectrograms | 76% | 72% | |
| Proposed model | Clean_Spectrograms | 84% | 82% | |
Comparison of the proposed method with baseline methods using the RAVDESS dataset.
| Method | Input | Weighted Accuracy | Unweighted Accuracy | Accuracy |
|---|---|---|---|---|
| Zeng et al. [ | Spectrograms | - | - | 64.48% |
| Jalal et al. [ | log-spectrogram | - | 69.4% | 68.10% |
| Bhavan et al. [ | spectral features | - | - | 75.69% |
| Proposed model | Raw_Spectrograms | 68% | 61% | 70.00% |
| Proposed model | Clean_Spectrograms | 80% | 79% | 79.5% |
Computational comparison of the proposed DSCNN model with other baseline CNN models.
| Model | Training Time | Model Size | Accuracy |
|---|---|---|---|
| AlexNet (transfer learning) [ | 38 min | 201 MB | 70.54% |
| VGG16 (transfer learning) [ | 55 min | 420 MB | 73.00% |
| ResNet50 (transfer learning) [ | 30 min | 75 MB | 75.50% |
| Proposed DSCNN model | | | |