Minji Seo, Myungho Kim.
Abstract
Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. SER methods lose accuracy when they are trained and tested on different datasets. Cross-corpus SER research addresses this by identifying speech emotion across different corpora and languages, and recent work has focused on improving generalization. To improve cross-corpus SER performance, we pretrained our visual attention convolutional neural network (VACNN), a 2D CNN base model with channel- and spatial-wise visual attention modules, on the log-mel spectrograms of the source dataset. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps the VACNN learn both global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73, 15.12, and 2.34 percentage points over existing state-of-the-art cross-corpus SER approaches.
Keywords: bag of visual words; convolutional neural network; cross-corpus; log-mel spectrograms; speech emotion recognition; visual attention
Year: 2020 PMID: 32998382 PMCID: PMC7583996 DOI: 10.3390/s20195559
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
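The abstract describes the BOVW feature as a frequency histogram of visual words. A minimal sketch of that idea follows, assuming the local descriptors and the codebook of visual words are already available. The function name `bovw_histogram`, the toy two-dimensional descriptors, and the three-word codebook are hypothetical and are not taken from the paper, which does not specify the descriptor type or vocabulary size here.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Return a normalized frequency histogram of visual words.

    descriptors: (n, d) local feature vectors from one spectrogram image.
    codebook:    (k, d) visual words, e.g. cluster centers (hypothetical here).
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Each descriptor votes for its nearest visual word.
    words = np.argmin(dists, axis=1)
    counts = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    # Normalize so images with different descriptor counts are comparable.
    total = counts.sum()
    return counts / total if total > 0 else counts

# Toy data (assumed for illustration only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]])
hist = bovw_histogram(descriptors, codebook)
print(hist)
```

The resulting vector has one entry per visual word, so it can be concatenated with other features regardless of how many local descriptors an image produced.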
Figure 1. Overall architecture of our pretraining and fine-tuning models for SER.
Figure 2. Architecture of the proposed pretrain VACNN model.
Figure 3. Architecture of the proposed fine-tuned model.
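The abstract states that the VACNN combines a 2D CNN with channel- and spatial-wise visual attention modules. The sketch below illustrates the general idea of gating a feature map per channel and per spatial position. It is a generic illustration, not the authors' VA module: the pooling choice, the bottleneck weights `w1` and `w2`, and the absence of a learned convolution in the spatial branch are all simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Rescale each channel of feat (C, H, W) by a weight in (0, 1)."""
    pooled = feat.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ pooled)    # ReLU bottleneck (assumed)
    weights = sigmoid(w2 @ hidden)           # one gate per channel
    return feat * weights[:, None, None]

def spatial_attention(feat):
    """Rescale each spatial position of feat (C, H, W) by a weight in (0, 1)."""
    # A learned convolution would normally produce this map; here the
    # channel-wise mean stands in for it (assumption).
    mask = sigmoid(feat.mean(axis=0))        # (H, W)
    return feat * mask[None, :, :]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))      # toy feature map
w1 = rng.standard_normal((2, 8)) * 0.1       # hypothetical weights
w2 = rng.standard_normal((8, 2)) * 0.1
out = spatial_attention(channel_attention(feat, w1, w2))
print(out.shape)
```

Both gates only rescale activations, so the output keeps the input shape and can be dropped between convolutional layers without changing the rest of the network.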
Overview of the selected datasets.
| Dataset | Language | Utterances | Emotions | Emotion Labels |
|---|---|---|---|---|
| EmoDB | German | 565 | 7 | Anger, Boredom, Neutral, Disgust, Fear, Happiness, Sadness |
| RAVDESS | English | 1440 | 8 | Neutral, Calm, Happiness, Sadness, Anger, Fear, Surprise, Disgust |
| TESS | English | 2800 | 7 | Anger, Disgust, Fear, Happiness, Pleasant Surprise, Sadness, Neutral |
| SAVEE | English | 480 | 7 | Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral |
Per-class precision, recall, and F1-score of the model pretrained on the TESS and RAVDESS datasets.
| Emo Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Anger | 0.94 | 0.88 | 0.91 |
| Disgust | 0.89 | 0.92 | 0.90 |
| Fear | 0.83 | 0.94 | 0.88 |
| Happiness | 0.90 | 0.75 | 0.82 |
| Neutral | 0.92 | 0.89 | 0.91 |
| Sadness | 0.83 | 0.90 | 0.87 |
| Surprise | 0.93 | 0.94 | 0.93 |
| Average | 0.89 | 0.89 | 0.89 |
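The F1-score column in the pretraining table above is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. As a small worked check, the sketch below recomputes F1 from the tabulated precision and recall values. Because the table entries are rounded to two decimals, the recomputed values are expected to agree only to within about 0.01. The helper name `f1_score` is chosen for illustration.

```python
# (precision, recall, reported F1) copied from the pretraining table.
rows = {
    "Anger":     (0.94, 0.88, 0.91),
    "Disgust":   (0.89, 0.92, 0.90),
    "Fear":      (0.83, 0.94, 0.88),
    "Happiness": (0.90, 0.75, 0.82),
    "Neutral":   (0.92, 0.89, 0.91),
    "Sadness":   (0.83, 0.90, 0.87),
    "Surprise":  (0.93, 0.94, 0.93),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Largest gap between recomputed and reported F1 across all classes.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())

for name, (p, r, f1) in rows.items():
    print(f"{name:10s} recomputed={f1_score(p, r):.3f} reported={f1:.2f}")
```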
Comparison of overall accuracy using the proposed VACNN + BOVW, the fine-tuned VACNN + BOVW, and the fine-tuned VACNN with CBAM instead of the VA module + BOVW.
| Dataset | Pure VACNN + BOVW | Fine-Tuned VACNN + BOVW | Fine-Tuned VACNN (CBAM) + BOVW |
|---|---|---|---|
| RAVDESS | 0.72 | 0.83 | 0.77 |
| EmoDB | 0.77 | 0.87 | 0.81 |
| SAVEE | 0.69 | 0.75 | 0.72 |
Cross-corpus emotion recognition result using proposed fine-tuning method on RAVDESS.
| Emo Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Anger | 0.77 | 0.91 | 0.83 |
| Calm | 0.95 | 0.89 | 0.92 |
| Disgust | 0.89 | 0.85 | 0.87 |
| Fear | 0.91 | 0.84 | 0.87 |
| Happiness | 0.81 | 0.74 | 0.77 |
| Neutral | 0.74 | 0.88 | 0.80 |
| Sadness | 0.74 | 0.76 | 0.75 |
| Surprise | 0.86 | 0.86 | 0.85 |
| Average | 0.83 | 0.83 | 0.83 |
Cross-corpus emotion recognition result using proposed fine-tuning method on EmoDB.
| Emo Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Anger | 0.94 | 0.91 | 0.93 |
| Boredom | 0.87 | 0.87 | 0.88 |
| Disgust | 0.95 | 0.95 | 0.97 |
| Fear | 0.74 | 0.80 | 0.77 |
| Happiness | 0.86 | 0.73 | 0.79 |
| Neutral | 0.80 | 0.85 | 0.81 |
| Sadness | 0.86 | 0.96 | 0.91 |
| Average | 0.86 | 0.87 | 0.87 |
Cross-corpus emotion recognition result using proposed fine-tuning method on SAVEE.
| Emo Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Anger | 0.74 | 0.58 | 0.65 |
| Disgust | 0.81 | 0.65 | 0.72 |
| Fear | 0.59 | 0.76 | 0.67 |
| Happiness | 0.72 | 0.59 | 0.65 |
| Neutral | 0.88 | 0.93 | 0.90 |
| Sadness | 0.73 | 0.83 | 0.78 |
| Surprise | 0.59 | 0.65 | 0.67 |
| Average | 0.72 | 0.71 | 0.72 |
Confusion matrix for emotion prediction on RAVDESS using RAVDESS and TESS.
| Emo Class | Anger | Calm | Disgust | Fear | Happiness | Neutral | Sadness | Surprise |
|---|---|---|---|---|---|---|---|---|
| Anger | 0.91 | 0.00 | 0.03 | 0.00 | 0.03 | 0.00 | 0.00 | 0.03 |
| Calm | 0.00 | 0.89 | 0.00 | 0.00 | 0.05 | 0.02 | 0.05 | 0.00 |
| Disgust | 0.10 | 0.02 | 0.85 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
| Fear | 0.00 | 0.00 | 0.03 | 0.84 | 0.03 | 0.00 | 0.08 | 0.02 |
| Happiness | 0.07 | 0.00 | 0.02 | 0.04 | 0.74 | 0.02 | 0.04 | 0.07 |
| Neutral | 0.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.88 | 0.00 | 0.00 |
| Sadness | 0.05 | 0.03 | 0.03 | 0.03 | 0.03 | 0.08 | 0.76 | 0.00 |
| Surprise | 0.00 | 0.00 | 0.00 | 0.06 | 0.02 | 0.00 | 0.06 | 0.86 |
Confusion matrix for emotion prediction on EmoDB using RAVDESS and TESS.
| Emo Class | Anger | Boredom | Disgust | Fear | Happiness | Neutral | Sadness |
|---|---|---|---|---|---|---|---|
| Anger | 0.91 | 0.00 | 0.00 | 0.04 | 0.05 | 0.00 | 0.00 |
| Boredom | 0.00 | 0.87 | 0.00 | 0.00 | 0.00 | 0.10 | 0.03 |
| Disgust | 0.00 | 0.00 | 0.95 | 0.05 | 0.00 | 0.00 | 0.00 |
| Fear | 0.04 | 0.00 | 0.00 | 0.80 | 0.00 | 0.04 | 0.12 |
| Happiness | 0.00 | 0.00 | 0.04 | 0.15 | 0.73 | 0.08 | 0.00 |
| Neutral | 0.03 | 0.12 | 0.00 | 0.00 | 0.00 | 0.85 | 0.00 |
| Sadness | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.96 |
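Each row of the EmoDB confusion matrix above sums to 1, so the matrix appears to be row-normalized, and its diagonal then gives the per-class recall. The sketch below checks this reading against the recall column of the EmoDB precision/recall table. The row-normalization interpretation is an inference from the numbers, not a statement from the source.

```python
import numpy as np

labels = ["Anger", "Boredom", "Disgust", "Fear", "Happiness", "Neutral", "Sadness"]

# Confusion matrix for EmoDB, copied from the table above.
cm = np.array([
    [0.91, 0.00, 0.00, 0.04, 0.05, 0.00, 0.00],
    [0.00, 0.87, 0.00, 0.00, 0.00, 0.10, 0.03],
    [0.00, 0.00, 0.95, 0.05, 0.00, 0.00, 0.00],
    [0.04, 0.00, 0.00, 0.80, 0.00, 0.04, 0.12],
    [0.00, 0.00, 0.04, 0.15, 0.73, 0.08, 0.00],
    [0.03, 0.12, 0.00, 0.00, 0.00, 0.85, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.04, 0.96],
])

row_sums = cm.sum(axis=1)        # each should be 1 if rows are normalized
recall = np.diag(cm)             # per-class recall under that assumption
macro_recall = recall.mean()     # unweighted average over classes

# Recall column reported in the EmoDB precision/recall table.
reported_recall = np.array([0.91, 0.87, 0.95, 0.80, 0.73, 0.85, 0.96])

for name, value in zip(labels, recall):
    print(f"{name:10s} recall={value:.2f}")
print(f"macro recall={macro_recall:.2f}")
```

The off-diagonal entries show where the remaining mass goes. For example, 15% of Happiness utterances are predicted as Fear, the largest single confusion in this matrix.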
Confusion matrix for emotion prediction on SAVEE using RAVDESS and TESS.
| Emo Class | Anger | Disgust | Fear | Happiness | Neutral | Sadness | Surprise |
|---|---|---|---|---|---|---|---|
| Anger | 0.58 | 0.08 | 0.08 | 0.13 | 0.00 | 0.00 | 0.13 |
| Disgust | 0.00 | 0.65 | 0.07 | 0.00 | 0.12 | 0.08 | 0.08 |
| Fear | 0.00 | 0.05 | 0.76 | 0.05 | 0.05 | 0.05 | 0.05 |
| Happiness | 0.22 | 0.00 | 0.05 | 0.59 | 0.00 | 0.00 | 0.14 |
| Neutral | 0.00 | 0.00 | 0.00 | 0.00 | 0.93 | 0.07 | 0.00 |
| Sadness | 0.00 | 0.04 | 0.00 | 0.00 | 0.13 | 0.83 | 0.00 |
| Surprise | 0.00 | 0.30 | 0.05 | 0.00 | 0.00 | 0.00 | 0.65 |
Comparison of our method and related works on RAVDESS, EmoDB, SAVEE for cross-corpus SER.
| Dataset | Method | Features | Overall Accuracy (%) |
|---|---|---|---|
| RAVDESS | Mustaqeem and Kwon [ | Raw spectrogram | 56.50 |
| RAVDESS | Milner et al. [ | Log-mel spectrogram | 75.60 |
| RAVDESS | Parry et al. [ | MFCC | 65.67 |
| RAVDESS | Inception v3 [ | Log-mel spectrogram | 69.10 |
| RAVDESS | MobileNet [ | Log-mel spectrogram | 71.53 |
| RAVDESS | ResNet [ | Log-mel spectrogram | 73.26 |
| RAVDESS | VACNN | Log-mel spectrogram | 74.31 |
| RAVDESS | VACNN + BOVW | Log-mel spectrogram | 83.33 |
| EmoDB | Zong et al. [ | Low-level acoustic features | 61.41 |
| EmoDB | Latif et al. [ | Low-level acoustic features + latent code | 65.30 |
| EmoDB | Mao et al. [ | Raw spectrogram | 71.80 |
| EmoDB | Parry et al. [ | MFCC | 69.72 |
| EmoDB | Inception v3 [ | Log-mel spectrogram | 73.83 |
| EmoDB | MobileNet [ | Log-mel spectrogram | 76.64 |
| EmoDB | ResNet [ | Log-mel spectrogram | 76.64 |
| EmoDB | VACNN | Log-mel spectrogram | 79.44 |
| EmoDB | VACNN + BOVW | Log-mel spectrogram | 86.92 |
| SAVEE | Latif et al. [ | Low-level acoustic features + latent code | 53.20 |
| SAVEE | Goel and Beigi [ | Low-level acoustic features | 55.55 |
| SAVEE | Mao et al. [ | Raw spectrogram | 57.20 |
| SAVEE | Parry et al. [ | MFCC | 72.66 |
| SAVEE | Inception v3 [ | Log-mel spectrogram | 51.04 |
| SAVEE | MobileNet [ | Log-mel spectrogram | 56.25 |
| SAVEE | ResNet [ | Log-mel spectrogram | 59.38 |
| SAVEE | VACNN | Log-mel spectrogram | 64.58 |
| SAVEE | VACNN + BOVW | Log-mel spectrogram | 75.00 |
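The improvements quoted in the abstract (7.73, 15.12, and 2.34) can be reproduced from the comparison table as differences between VACNN + BOVW and the best author-cited prior method on each dataset. The sketch below does that arithmetic. Choosing the author-cited rows as the reference, rather than the Inception v3, MobileNet, and ResNet baselines, is an assumption inferred from the numbers: on EmoDB the MobileNet and ResNet rows (76.64) exceed Mao et al. (71.80), yet the quoted gain of 15.12 matches the gap to Mao et al.

```python
# Overall accuracy (%) of the proposed method, from the comparison table.
proposed = {"RAVDESS": 83.33, "EmoDB": 86.92, "SAVEE": 75.00}

# Best author-cited prior method per dataset, from the same table
# (assumed to be the reference behind the quoted improvements).
best_prior = {
    "RAVDESS": ("Milner et al.", 75.60),
    "EmoDB": ("Mao et al.", 71.80),
    "SAVEE": ("Parry et al.", 72.66),
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {
    name: round(proposed[name] - best_prior[name][1], 2)
    for name in proposed
}

for name, gain in gains.items():
    method, acc = best_prior[name]
    print(f"{name}: {proposed[name]:.2f} - {acc:.2f} ({method}) = {gain:.2f}")
```

These gains are absolute differences in accuracy (percentage points), not relative improvements.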