Wei Jiang, Zheng Wang, Jesse S Jin, Xianfeng Han, Chunguang Li.
Abstract
Automatic speech emotion recognition is a challenging task because of the gap between acoustic features and human emotions, and its performance relies strongly on the discriminative acoustic features extracted for a given recognition task. In this work, we propose a novel deep neural architecture that extracts informative feature representations from heterogeneous acoustic feature groups, which may contain redundant and unrelated information that degrades emotion recognition performance. After the informative features are obtained, a fusion network is trained to jointly learn a discriminative acoustic feature representation, and a Support Vector Machine (SVM) is used as the final classifier for the recognition task. Experimental results on the IEMOCAP dataset demonstrate that the proposed architecture improves recognition performance over existing state-of-the-art approaches, achieving an accuracy of 64%.
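The abstract describes a three-stage pipeline: informative feature extraction from heterogeneous acoustic feature groups, a fusion network that jointly learns a discriminative representation, and an SVM as the final classifier. The sketch below illustrates that pipeline shape only; the layer sizes, the module names (FusionNetwork, train_svm_on_embeddings), and the assumption that each feature group has already been reduced to a fixed-length vector are illustrative guesses, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline (not the authors' code).
# Assumes each heterogeneous feature group has already been reduced to a
# fixed-length "informative" vector per utterance (e.g., by the unification module).
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

class FusionNetwork(nn.Module):
    """Joins per-group representations and learns a shared discriminative embedding."""
    def __init__(self, group_dims, hidden_dim=256, embed_dim=128, num_classes=4):
        super().__init__()
        # One small branch per feature group (dimensions are illustrative).
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU()) for d in group_dims
        )
        self.fuse = nn.Sequential(
            nn.Linear(hidden_dim * len(group_dims), embed_dim), nn.ReLU()
        )
        self.classifier = nn.Linear(embed_dim, num_classes)  # used only during training

    def forward(self, groups):
        # groups: list of tensors, one per heterogeneous feature group.
        h = torch.cat([b(x) for b, x in zip(self.branches, groups)], dim=1)
        z = self.fuse(h)
        return self.classifier(z), z  # logits for training, embedding for the SVM

# After training the fusion network with a cross-entropy objective, the learned
# embeddings are passed to an SVM, which acts as the final emotion classifier.
def train_svm_on_embeddings(embeddings: np.ndarray, labels: np.ndarray) -> SVC:
    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(embeddings, labels)
    return svm
```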
Keywords: deep neural architecture; fusion network; heterogeneous feature unification; human–computer interaction (HCI); speech emotion recognition
Year: 2019 PMID: 31216650 PMCID: PMC6630663 DOI: 10.3390/s19122730
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The proposed speech emotion recognition architecture.
Figure 2. The architecture of an autoencoder.
Figure 3. The architecture of the Shared-Hidden-Layer Autoencoder (SHLA) model.
Figure 4. The architecture of each branch network in the heterogeneous unification module: the pre-training part (left) and the fine-tuning part (right).
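Figures 2 and 3 refer to an autoencoder and a Shared-Hidden-Layer Autoencoder (SHLA), and Figure 4 to per-group branch networks with pre-training and fine-tuning stages. The sketch below shows one plausible reading of the SHLA idea: each heterogeneous feature group gets its own encoder and decoder around a common shared hidden space, pre-trained with a reconstruction loss. The dimensions, single-layer encoders, and loss are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical SHLA sketch (assumed structure, not the authors' implementation):
# group-specific encoders/decoders around one shared hidden representation.
import torch.nn as nn

class SHLA(nn.Module):
    def __init__(self, group_dims, shared_dim=128):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, shared_dim) for d in group_dims)
        self.decoders = nn.ModuleList(nn.Linear(shared_dim, d) for d in group_dims)
        self.act = nn.ReLU()

    def forward(self, groups):
        # Encode every feature group into the same shared hidden space,
        # then reconstruct each group from its shared-space code.
        hidden = [self.act(enc(x)) for enc, x in zip(self.encoders, groups)]
        recon = [dec(h) for dec, h in zip(self.decoders, hidden)]
        return hidden, recon

def reconstruction_loss(groups, recon):
    # Sum of per-group MSE reconstruction errors (a typical pre-training objective).
    return sum(nn.functional.mse_loss(r, x) for r, x in zip(recon, groups))
```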
Distribution and duration of the different emotion categories in the Interactive Emotional Dyadic Motion Capture database (IEMOCAP).
| Emotion | Angry | Happy | Neutral | Sad | Total |
|---|---|---|---|---|---|
| Utterances | 1103 | 1636 | 1708 | 1084 | 5531 |
| Duration (min) | 83.0 | 126.0 | 111.1 | 99.3 | 419.4 |
Comparison of the classification results of different classifiers.
| Emotion | KNN | LR | RF | SVM |
|---|---|---|---|---|
| Angry | 0.56 | 0.64 | 0.66 | 0.65 |
| Happy | 0.71 | 0.73 | 0.77 | 0.79 |
| Neutral | 0.38 | 0.39 | 0.46 | 0.45 |
| Sad | 0.58 | 0.62 | 0.64 | 0.69 |
| Total | 0.55 | 0.59 | 0.63 | 0.64 |
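The table above compares K-nearest neighbours (KNN), logistic regression (LR), random forest (RF), and SVM back-ends on the learned representation. A minimal scikit-learn sketch of such a comparison is shown below; the feature matrix X and labels y stand in for the learned embeddings and emotion labels, and the hyperparameters are library defaults rather than the paper's settings.

```python
# Hypothetical comparison of back-end classifiers on the fused representations.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def compare_classifiers(X: np.ndarray, y: np.ndarray) -> dict:
    models = {
        "KNN": KNeighborsClassifier(),
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(),
        "SVM": SVC(),
    }
    # Mean 5-fold cross-validated accuracy per classifier.
    return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```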
Per-class emotion accuracy comparison of the different features.
| Emotion | IS10 | MFCCs | eGeMAPS | SoundNet | VGGish |
|---|---|---|---|---|---|
| Angry | 0.39 | 0.33 | 0.43 | 0.47 | 0.49 |
| Happy | 0.53 | 0.51 | 0.57 | 0.59 | 0.63 |
| Neutral | 0.21 | 0.21 | 0.24 | 0.29 | 0.30 |
| Sad | 0.42 | 0.37 | 0.43 | 0.48 | 0.51 |
| Total | 0.38 | 0.35 | 0.41 | 0.45 | 0.48 |
Per-class emotion accuracy comparison of different approaches.
| Approaches | Angry | Happy | Neutral | Sad | Total |
|---|---|---|---|---|---|
| Lakomkin et al. | 0.59 | 0.72 | 0.37 | 0.59 | 0.58 |
| Gu et al. | - | - | - | - | 0.62 |
| Proposed (ours) | 0.65 | 0.79 | 0.45 | 0.69 | 0.64 |
Per-class emotion accuracy comparison of ablation studies.
| Methods | Angry | Happy | Neutral | Sad | Total |
|---|---|---|---|---|---|
| | 0.53 | 0.64 | 0.33 | 0.56 | 0.51 |
| | 0.61 | 0.74 | 0.41 | 0.62 | 0.59 |
| | 0.63 | 0.79 | 0.45 | 0.66 | 0.63 |
| Proposed (full model) | 0.65 | 0.79 | 0.45 | 0.69 | 0.64 |