Eesung Kim, Hyungchan Song, Jong Won Shin.
Abstract
In this paper, we propose a novel emotion recognition method based on the underlying emotional characteristics extracted by a conditional adversarial autoencoder (CAAE), in which both acoustic and lexical features are used as inputs. The acoustic features are generated by computing statistical functionals of low-level descriptors and by a deep neural network (DNN). These acoustic features are concatenated with three types of lexical features extracted from the text: a sparse representation, a distributed representation, and affective lexicon-based dimensions. Two-dimensional latent representations similar to vectors in the valence-arousal space are obtained by the CAAE and can be mapped directly into the emotional classes without the need for a sophisticated classifier. In contrast to a previous attempt that applied a CAAE to acoustic features only, the proposed approach enhances emotion recognition performance because the combined acoustic and lexical features provide sufficient discriminative power. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus show that our method outperforms the previously reported best results on the same corpus, achieving 76.72% unweighted average recall.
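The abstract notes that the 2-D latent vectors can be mapped to emotion classes "without the need for a sophisticated classifier" — the record below calls this "a simple classifier checking the signs". A minimal sketch of such a quadrant classifier follows; the label set and the quadrant-to-label assignment follow the circumplex model of Figure 1 but are illustrative assumptions, not taken from the paper:

```python
def quadrant_class(valence: float, arousal: float) -> str:
    """Map a 2-D latent vector to an emotion class by checking signs only.

    The quadrant-to-label assignment sketches the circumplex model of
    Figure 1; the exact labels and sign conventions are assumed here.
    """
    if arousal >= 0:
        return "happy" if valence >= 0 else "angry"
    return "neutral" if valence >= 0 else "sad"

print(quadrant_class(0.8, 0.6))  # a high-valence, high-arousal point -> "happy"
```

Because the CAAE is trained so that the latent space resembles the valence-arousal plane, class boundaries reduce to the coordinate axes, which is why no trained classifier is strictly required at test time.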
Keywords: conditional adversarial autoencoder; emotion recognition; latent representation
Year: 2020 PMID: 32375342 PMCID: PMC7248815 DOI: 10.3390/s20092614
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1Valence-arousal space of the circumplex model (a) and the distribution of the learned latent vectors for the training set of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset (b).
Figure 2Overview of the proposed multimodal emotion recognition framework integrating the acoustic and lexical features.
Figure 3Architecture of the conditional adversarial autoencoder.
Figure 4The distribution of the learned latent vectors for (a) the training set with the acoustic feature only, (b) the test set with the acoustic feature only, (c) the training set with both acoustic and lexical features, and (d) the test set with both acoustic and lexical features.
Accuracies for different types of acoustic features. WAR, weighted average recall; UAR, unweighted average recall; LLD, low-level descriptor; BN, bottleneck.
| Feature Set | WAR | UAR |
|---|---|---|
| IS10 [ ] | 57.2 | 59.3 |
| IS13 [ ] | 57.3 | 58.6 |
| eGeMAPS [ ] | 54.7 | 55.3 |
| LLD + MMFCC [ ] | 59.3 | 60.2 |
| BOW [ ] | 55.4 | - |
| BN [ ] | 59.7 | 61.4 |
| CNN-LSTM-DNN [ ] | - | 60.23 |
| CTC-LSTM [ ] | 64.0 | 65.7 |
| LLD + BN | 63.82 | 66.19 |
Accuracies for different types of lexical features.
| Feature Set | WAR | UAR |
|---|---|---|
| eVector + BOW [ ] | 58.5 | - |
| mLRF [ ] | 63.8 | 64.0 |
|  | 63.52 | 64.55 |
|  | 63.91 | 64.84 |
|  | 64.05 | 64.48 |
Accuracies for the multimodal emotion recognition methods.
| Feature Set | Classifier | WAR | UAR |
|---|---|---|---|
|  | SVM | 69.5 | 70.1 |
|  | SVM | 69.2 | - |
|  | SVM | 67.2 | 67.3 |
| Hierarchical Attention Fusion [ ] | DNN | 72.7 | 72.7 |
|  | DNN | 72.34 | 74.31 |
|  | DNN | 72.92 | 75.44 |
|  | DNN | 74.37 | 76.91 |
|  | linear | 74.08 | 76.72 |
Confusion matrix for the proposed model corresponding to Figure 4 with a DNN classifier (WAR 74.37%, UAR 76.91%).
|  |  |  |  |
|---|---|---|---|
| 85.35 | 6.82 | 5.65 | 2.17 |
| 6.81 | 79.16 | 11.53 | 2.48 |
| 7.22 | 17.13 | 63.63 | 12.02 |
| 2.01 | 3.07 | 15.40 | 79.51 |
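The caption's UAR can be verified directly from the matrix: each row is normalized to percentages over the true class, so the unweighted average recall is simply the mean of the diagonal entries. A quick check (WAR cannot be recomputed here, since it also needs the per-class sample counts):

```python
# Row-normalized confusion matrix (percent), diagonal = per-class recall.
conf = [
    [85.35,  6.82,  5.65,  2.17],
    [ 6.81, 79.16, 11.53,  2.48],
    [ 7.22, 17.13, 63.63, 12.02],
    [ 2.01,  3.07, 15.40, 79.51],
]

# UAR = mean of the diagonal (per-class recalls), independent of class sizes.
uar = sum(row[i] for i, row in enumerate(conf)) / len(conf)
print(f"UAR = {uar:.2f}%")  # UAR = 76.91%, matching the caption
```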
Confusion matrix for the proposed model corresponding to Figure 4 with a simple classifier checking the signs (WAR 74.08%, UAR 76.72%).
|  |  |  |  |
|---|---|---|---|
| 84.94 | 6.55 | 6.08 | 2.40 |
| 6.86 | 77.82 | 12.48 | 2.85 |
| 8.02 | 16.22 | 63.78 | 11.98 |
| 2.06 | 4.27 | 13.39 | 80.29 |
Confusion matrix for the model in Table 1 with acoustic features only (WAR 63.82%, UAR 66.19%).
|  |  |  |  |
|---|---|---|---|
| 71.84 | 15.30 | 10.63 | 2.22 |
| 10.71 | 65.93 | 19.54 | 3.82 |
| 5.55 | 24.66 | 57.58 | 12.20 |
| 1.88 | 8.32 | 20.36 | 69.45 |