Youngja Nam, Chankyu Lee.
Abstract
Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN–CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN–CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN–CNN has an overall accuracy of 59.3–76.6%, whereas the CNN has an overall accuracy of 39.4–58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.
Keywords: cascaded DnCNN–CNN; residual learning; speech emotion recognition
Year: 2021; PMID: 34199027; PMCID: PMC8271804; DOI: 10.3390/s21134399
Source DB: PubMed; Journal: Sensors (Basel); ISSN: 1424-8220; Impact factor: 3.576
Figure 1. Block diagram of the proposed DnCNN–CNN architecture.
Figure 2. Sample spectrograms of an utterance in the Korean emotional speech database: (a) Original spectrogram; (b) Denoised spectrogram.
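
The two-stage cascade described in the abstract can be illustrated with a toy sketch. It assumes the usual DnCNN reading of residual learning, in which the network estimates the noise component and the denoised spectrogram is the noisy input minus that estimate. The names `denoise_with_residual`, `cascade`, `predict_residual`, and `classify` are hypothetical and stand in for the trained networks; nested lists stand in for spectrograms.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # toy stand-in for a 2-D spectrogram


def denoise_with_residual(
    noisy: Spectrogram,
    predict_residual: Callable[[Spectrogram], Spectrogram],
) -> Spectrogram:
    """Subtract the predicted noise residual from the noisy input."""
    residual = predict_residual(noisy)
    return [
        [y - r for y, r in zip(noisy_row, res_row)]
        for noisy_row, res_row in zip(noisy, residual)
    ]


def cascade(
    noisy: Spectrogram,
    predict_residual: Callable[[Spectrogram], Spectrogram],
    classify: Callable[[Spectrogram], str],
) -> str:
    """Stage 1 denoises; stage 2 classifies the denoised spectrogram."""
    return classify(denoise_with_residual(noisy, predict_residual))


# Toy data: a "clean" spectrogram plus additive noise.
clean = [[1.0, 2.0], [3.0, 4.0]]
noise = [[0.5, -0.25], [0.0, 1.0]]
noisy = [[c + n for c, n in zip(cr, nr)] for cr, nr in zip(clean, noise)]


def oracle_residual(_spectrogram: Spectrogram) -> Spectrogram:
    """Assumption: a perfect noise estimate, used only for illustration."""
    return noise


recovered = denoise_with_residual(noisy, oracle_residual)
label = cascade(noisy, oracle_residual, lambda s: "Neutral")  # dummy classifier
```

With a perfect residual estimate the clean spectrogram is recovered exactly; a trained DnCNN would only approximate the noise, so the classifier in the second stage would see a partially denoised input.
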
Structure of the proposed cascaded DnCNN–CNN.
| Layer | Output size | Kernel size | Stride |
|---|---|---|---|
| DnCNN | | | |
| Convolutional Layer 1 | 256 × 256 × 64 | 3 × 3 | 1 × 1 |
| Batch Normalization | | | |
| Convolutional Layer 2 | 256 × 256 × 32 | 3 × 3 | 1 × 1 |
| Batch Normalization | | | |
| Convolutional Layer 3 | 256 × 256 × 16 | 3 × 3 | 1 × 1 |
| Batch Normalization | | | |
| Convolutional Layer 4 | 256 × 256 × 3 | | |
| Batch Normalization | | | |
| CNN | | | |
| Convolutional Layer 1 | 256 × 256 × 32 | 3 × 3 | 1 × 1 |
| Max Pooling Layer 1 | 128 × 128 × 32 | 2 × 2 | 1 × 1 |
| Convolutional Layer 2 | 128 × 128 × 64 | 3 × 3 | 1 × 1 |
| Max Pooling Layer 2 | 64 × 64 × 64 | 2 × 2 | 1 × 1 |
| Convolutional Layer 3 | 64 × 64 × 128 | 3 × 3 | 1 × 1 |
| Max Pooling Layer 3 | 32 × 32 × 128 | 2 × 2 | 1 × 1 |
| Flattened Layer | 131,072 | - | - |
| Dropout Layer | 70% | - | - |
| Fully Connected Layer 1 | 256 | - | - |
| Fully Connected Layer 2 | 5 | - | - |
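
The listed output sizes of the classification stage can be checked with a short shape trace. This is a sketch under two assumptions that the table does not state outright: the 3 × 3 convolutions use "same" padding (the spatial size is unchanged after each one), and each 2 × 2 pooling layer halves the height and width, as the listed sizes imply, even though the stride column reads 1 × 1. The helper names `conv_same` and `max_pool_2x2` are hypothetical.

```python
def conv_same(shape, filters):
    """3x3 convolution, stride 1, 'same' padding (assumed): H and W unchanged."""
    height, width, _channels = shape
    return (height, width, filters)


def max_pool_2x2(shape):
    """2x2 pooling that halves H and W, as the listed output sizes imply."""
    height, width, channels = shape
    return (height // 2, width // 2, channels)


shape = (256, 256, 3)  # output of the last DnCNN convolution in the table
trace = []
for filters in (32, 64, 128):
    shape = conv_same(shape, filters)
    trace.append(shape)
    shape = max_pool_2x2(shape)
    trace.append(shape)

# Number of features entering the dropout and fully connected layers.
flattened = shape[0] * shape[1] * shape[2]

for step in trace:
    print(" × ".join(str(n) for n in step))
print("flattened:", flattened)
```

Under these assumptions the trace reproduces every listed size, ending at 32 × 32 × 128 and a flattened length of 131,072.
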
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the Korean CADKES mixed with the PCAFETER noise.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 48.1% | 0.7% | 47.9% | 1.7% | 1.5% | |
| Happiness | 10.4% | 42.0% | 43.0% | 1.8% | 2.8% | |
| Sadness | 2.2% | 0.7% | 95.3% | 0.5% | 1.3% | |
| Anger | 9.2% | 6.6% | 37.7% | 40.0% | 6.6% | |
| Fearful | 3.2% | 1.7% | 56.0% | 3.0% | 36.1% | |
| Overall accuracy: 52.3% | ||||||
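
Each confusion matrix is row-normalised, so the diagonal holds the per-class recall. The sketch below reads the matrix above (baseline CNN, CADKES mixed with PCAFETER noise) and derives three quantities: the row sums as a sanity check, the unweighted mean of the diagonal, and the most frequent wrong prediction for each true emotion. The unweighted mean equals overall accuracy only when all classes are the same size; class counts are not given here, so that is an assumption, not a reported fact.

```python
labels = ["Neutral", "Happiness", "Sadness", "Anger", "Fearful"]

# Rows: true emotion; columns: predicted emotion (percent), copied from the table.
matrix = [
    [48.1, 0.7, 47.9, 1.7, 1.5],
    [10.4, 42.0, 43.0, 1.8, 2.8],
    [2.2, 0.7, 95.3, 0.5, 1.3],
    [9.2, 6.6, 37.7, 40.0, 6.6],
    [3.2, 1.7, 56.0, 3.0, 36.1],
]

# Each row should sum to roughly 100% (small deviations come from rounding).
row_sums = [sum(row) for row in matrix]

# Per-class recall is the diagonal entry of each row.
per_class_recall = {label: matrix[i][i] for i, label in enumerate(labels)}
mean_recall = sum(per_class_recall.values()) / len(labels)

# For each true emotion, find the wrong class it is most often predicted as.
top_confusion = {}
for i, label in enumerate(labels):
    off_diagonal = [(value, labels[j]) for j, value in enumerate(matrix[i]) if j != i]
    top_confusion[label] = max(off_diagonal)[1]

print("mean per-class recall:", round(mean_recall, 1))
print("top confusions:", top_confusion)
```

For this matrix the unweighted mean agrees with the reported overall accuracy of 52.3%, and four of the five emotions are most often mistaken for Sadness, which is the dominant error pattern in the noisy Korean conditions.
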
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the Korean CADKES mixed with the PCAFETER noise.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 60.5% | 0.8% | 35.3% | 2.8% | 0.6% | |
| Happiness | 14.0% | 47.2% | 34.4% | 3.5% | 0.9% | |
| Sadness | 3.7% | 0.8% | 92.7% | 1.9% | 0.9% | |
| Anger | 11.8% | 7.0% | 29.4% | 49.2% | 2.7% | |
| Fearful | 6.2% | 1.7% | 55.2% | 5.5% | 31.4% | |
| Overall accuracy: 56.1% | ||||||
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the Korean CADKES mixed with the PSTATION noise.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 44.0% | 0.1% | 52.5% | 1.9% | 1.5% | |
| Happiness | 10.5% | 37.0% | 45.3% | 4.3% | 2.9% | |
| Sadness | 1.3% | 0.3% | 96.3% | 0.2% | 1.9% | |
| Anger | 8.5% | 5.6% | 37.0% | 45.5% | 3.4% | |
| Fearful | 1.9% | 0.7% | 57.0% | 3.5% | 37.0% | |
| Overall accuracy: 51.9% | ||||||
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the Korean CADKES mixed with the PSTATION noise.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 47.9% | 0.5% | 46.1% | 1.5% | 3.9% | |
| Happiness | 17.0% | 33.9% | 36.3% | 5.3% | 7.5% | |
| Sadness | 2.5% | 0.3% | 92.6% | 1.1% | 3.4% | |
| Anger | 11.6% | 3.4% | 30.3% | 44.6% | 10.1% | |
| Fearful | 3.6% | 0.7% | 42.4% | 3.0% | 50.2% | |
| Overall accuracy: 53.9% | ||||||
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the Korean CADKES mixed with the TMETRO noise.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 58.0% | 0.7% | 38.1% | 0.5% | 2.6% | |
| Happiness | 10.8% | 48.6% | 35.7% | 1.8% | 3.1% | |
| Sadness | 1.8% | 0.5% | 93.8% | 0.7% | 3.1% | |
| Anger | 10.5% | 8.2% | 27.7% | 47.9% | 5.7% | |
| Fearful | 2.3% | 2.2% | 42.0% | 1.8% | 51.7% | |
| Overall accuracy: 60.1% | ||||||
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the Korean CADKES mixed with the TMETRO noise.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 79.2% | 1.4% | 17.2% | 0.3% | 1.9% | |
| Happiness | 20.3% | 52.8% | 20.0% | 1.4% | 5.6% | |
| Sadness | 4.2% | 0.2% | 90.6% | 0.7% | 4.3% | |
| Anger | 17.0% | 10.7% | 13.4% | 50.4% | 8.6% | |
| Fearful | 5.7% | 2.0% | 35.5% | 1.0% | 55.8% | |
| Overall accuracy: 64.8% | ||||||
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the original Korean CADKES.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 95.4% | 1.4% | 2.2% | 0.3% | 0.7% | |
| Happiness | 2.1% | 94.2% | 0.8% | 1.9% | 1.0% | |
| Sadness | 3.6% | 1.6% | 90.7% | 0.4% | 3.8% | |
| Anger | 0.9% | 3.6% | 0.6% | 93.9% | 1.1% | |
| Fearful | 2.1% | 1.8% | 3.2% | 0.8% | 92.1% | |
| Overall accuracy: 93.6% | ||||||
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the original Korean CADKES.
| Predicted Emotion | ||||||
|---|---|---|---|---|---|---|
| True emotion | Neutral | Happiness | Sadness | Anger | Fearful | |
| Neutral | 98.1% | 0.9% | 1.0% | 0.0% | 0.0% | |
| Happiness | 1.0% | 94.7% | 0.6% | 3.7% | 0.0% | |
| Sadness | 1.2% | 0.6% | 97.1% | 0.0% | 1.2% | |
| Anger | 0.5% | 1.1% | 1.1% | 96.2% | 1.1% | |
| Fearful | 1.9% | 1.4% | 1.9% | 1.7% | 93.0% | |
| Overall accuracy: 95.8% | ||||||
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the German EMO-DB mixed with the PCAFETER noise.
| Predicted Emotion | ||||||||
|---|---|---|---|---|---|---|---|---|
| True emotion | Neutral | Anger | Boredom | Disgust | Anxiety | Happiness | Sadness | |
| Neutral | 30.4% | 0.0% | 64.6% | 0.0% | 0.0% | 0.0% | 5.1% | |
| Anger | 3.1% | 26.4% | 24.8% | 7.0% | 28.7% | 7.8% | 2.3% | |
| Boredom | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% | 0.0% | |
| Disgust | 2.2% | 0.0% | 45.7% | 37.0% | 6.5% | 0.0% | 8.7% | |
| Anxiety | 10.4% | 0.0% | 19.4% | 3.0% | 52.2% | 0.0% | 14.9% | |
| Happiness | 11.3% | 2.8% | 50.7% | 4.2% | 7.0% | 23.9% | 0.0% | |
| Sadness | 1.6% | 0.0% | 25.8% | 0.0% | 0.0% | 0.0% | 72.6% | |
| Overall accuracy: 48.2% | ||||||||
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the German EMO-DB mixed with the PCAFETER noise.
| Predicted Emotion | ||||||||
|---|---|---|---|---|---|---|---|---|
| True emotion | Neutral | Anger | Boredom | Disgust | Anxiety | Happiness | Sadness | |
| Neutral | 83.5% | 0.0% | 16.5% | 0.0% | 0.0% | 0.0% | 0.0% | |
| Anger | 4.7% | 76.0% | 0.8% | 1.6% | 3.9% | 13.2% | 0.0% | |
| Boredom | 7.4% | 0.0% | 92.6% | 0.0% | 0.0% | 0.0% | 0.0% | |
| Disgust | 6.5% | 0.0% | 15.2% | 65.2% | 8.7% | 2.2% | 2.2% | |
| Anxiety | 17.9% | 1.5% | 4.5% | 1.5% | 67.2% | 1.5% | 6.0% | |
| Happiness | 12.7% | 5.6% | 11.3% | 1.4% | 1.4% | 67.6% | 0.0% | |
| Sadness | 1.6% | 0.0% | 21.0% | 0.0% | 0.0% | 0.0% | 77.4% | |
| Overall accuracy: 76.6% | ||||||||
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the German EMO-DB mixed with the PSTATION noise.
| Predicted Emotion | ||||||||
|---|---|---|---|---|---|---|---|---|
| True emotion | Neutral | Anger | Boredom | Disgust | Anxiety | Happiness | Sadness | |
| Neutral | 24.1% | 0.0% | 70.9% | 0.0% | 0.0% | 0.0% | 5.1% | |
| Anger | 14.0% | 10.1% | 31.8% | 7.0% | 30.2% | 3.9% | 3.1% | |
| Boredom | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% | 0.0% | |
| Disgust | 2.2% | 0.0% | 45.7% | 23.9% | 8.7% | 0.0% | 19.6% | |
| Anxiety | 13.4% | 0.0% | 20.9% | 1.5% | 38.8% | 0.0% | 25.4% | |
| Happiness | 9.9% | 1.4% | 62.0% | 4.2% | 7.0% | 15.5% | 0.0% | |
| Sadness | 0.0% | 0.0% | 29.0% | 0.0% | 0.0% | 0.0% | 71.0% | |
| Overall accuracy: 39.4% | ||||||||
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the German EMO-DB mixed with the PSTATION noise.
| Predicted Emotion | ||||||||
|---|---|---|---|---|---|---|---|---|
| True emotion | Neutral | Anger | Boredom | Disgust | Anxiety | Happiness | Sadness | |
| Neutral | 45.6% | 0.0% | 53.2% | 0.0% | 0.0% | 0.0% | 1.3% | |
| Anger | 14.0% | 46.5% | 10.9% | 7.8% | 8.5% | 7.0% | 5.4% | |
| Boredom | 0.0% | 0.0% | 98.8% | 0.0% | 0.0% | 0.0% | 1.2% | |
| Disgust | 10.9% | 0.0% | 26.1% | 52.2% | 2.2% | 2.2% | 6.5% | |
| Anxiety | 11.9% | 0.0% | 13.4% | 0.0% | 50.7% | 0.0% | 23.9% | |
| Happiness | 22.5% | 2.8% | 29.6% | 4.2% | 0.0% | 39.4% | 1.4% | |
| Sadness | 0.0% | 0.0% | 11.3% | 0.0% | 0.0% | 0.0% | 88.7% | |
| Overall accuracy: 59.3% | ||||||||
Confusion matrix of the emotion classification results of the baseline CNN that was trained on the German EMO-DB mixed with the TMETRO noise.
| Predicted Emotion | ||||||||
|---|---|---|---|---|---|---|---|---|
| True emotion | Neutral | Anger | Boredom | Disgust | Anxiety | Happiness | Sadness | |
| Neutral | 57.0% | 0.0% | 39.2% | 0.0% | 0.0% | 0.0% | 3.8% | |
| Anger | 8.5% | 38.8% | 14.0% | 6.2% | 22.5% | 8.5% | 1.6% | |
| Boredom | 0.0% | 0.0% | 98.8% | 0.0% | 0.0% | 0.0% | 1.2% | |
| Disgust | 0.0% | 0.0% | 28.3% | 52.2% | 13.0% | 0.0% | 6.5% | |
| Anxiety | 9.0% | 0.0% | 9.0% | 1.5% | 61.2% | 0.0% | 19.4% | |
| Happiness | 16.9% | 2.8% | 38.0% | 4.2% | 9.9% | 28.2% | 0.0% | |
| Sadness | 1.6% | 0.0% | 11.3% | 0.0% | 0.0% | 0.0% | 87.1% | |
| Overall accuracy: 58.1% | ||||||||
Confusion matrix of the emotion classification results of the proposed cascaded DnCNN–CNN that was trained on the German EMO-DB mixed with the TMETRO noise.
| Predicted Emotion | ||||||||
|---|---|---|---|---|---|---|---|---|
| True emotion | Neutral | Anger | Boredom | Disgust | Anxiety | Happiness | Sadness | |
| Neutral | 68.4% | 0.0% | 30.4% | 0.0% | 0.0% | 0.0% | 1.3% | |
| Anger | 7.0% | 68.2% | 3.9% | 2.3% | 6.2% | 11.6% | 0.8% | |
| Boredom | 6.2% | 0.0% | 93.8% | 0.0% | 0.0% | 0.0% | 0.0% | |
| Disgust | 15.2% | 2.2% | 13.0% | 50.0% | 8.7% | 4.3% | 6.5% | |
| Anxiety | 23.9% | 1.5% | 6.0% | 0.0% | 59.7% | 0.0% | 9.0% | |
| Happiness | 28.2% | 4.2% | 11.3% | 1.4% | 4.2% | 50.7% | 0.0% | |
| Sadness | 1.6% | 0.0% | 9.7% | 0.0% | 0.0% | 0.0% | 88.7% | |
| Overall accuracy: 69.5% | ||||||||
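
To compare the two models across conditions, the overall accuracies reported under the confusion matrices above can be collected and differenced. The sketch only restates those reported figures and computes the gain in percentage points; the dictionary layout and the label "none (original)" for the noise-free Korean condition are illustrative choices.

```python
# (corpus, noise) -> (baseline CNN accuracy, cascaded DnCNN-CNN accuracy), in percent.
reported = {
    ("Korean CADKES", "PCAFETER"): (52.3, 56.1),
    ("Korean CADKES", "PSTATION"): (51.9, 53.9),
    ("Korean CADKES", "TMETRO"): (60.1, 64.8),
    ("Korean CADKES", "none (original)"): (93.6, 95.8),
    ("German EMO-DB", "PCAFETER"): (48.2, 76.6),
    ("German EMO-DB", "PSTATION"): (39.4, 59.3),
    ("German EMO-DB", "TMETRO"): (58.1, 69.5),
}

# Gain of the cascaded model over the baseline, in percentage points.
gains = {
    condition: round(cascaded - baseline, 1)
    for condition, (baseline, cascaded) in reported.items()
}

for (corpus, noise), gain in gains.items():
    print(f"{corpus:14s} {noise:16s} +{gain:.1f} points")
```

The cascaded model is ahead in every listed condition. The gains are 2.0 to 4.7 points on the Korean data and 11.4 to 28.4 points on the German data.
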