Huawei Tao, Yang Wang, Zhihao Zhuang, Hongliang Fu, Xinying Guo, Shuguang Zou.
Abstract
This paper studies cross-corpus speech emotion recognition (SER), in which the training and testing speech signals come from different speech corpora. The mismatch between the feature distributions of the training and testing sets degrades the performance of many classical algorithms. To address this issue, we propose a transfer learning and multi-loss dynamic adjustment (TLMLDA) algorithm. The algorithm first builds a deep network model based on a deep auto-encoder and fully connected layers to improve the representation ability of the features. Global domain and subdomain adaptive algorithms are then jointly adopted to implement feature transfer. Finally, dynamic weighting factors are constructed to adjust the contribution of the different loss functions and prevent optimization offset during model training, which improves the generalization ability of the whole system. Simulation experiments on the Berlin, eNTERFACE, and CASIA speech corpora show that the proposed algorithm achieves excellent recognition results and is competitive with most state-of-the-art algorithms.
Year: 2022 PMID: 36177309 PMCID: PMC9514941 DOI: 10.1155/2022/5019384
Source DB: PubMed Journal: Comput Intell Neurosci
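The abstract's two adaptation ideas, aligning source and target feature distributions and re-weighting several losses during training, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the paper's loss definitions, so a linear-kernel maximum mean discrepancy (MMD) stands in for the global domain adaptation loss, and `dynamic_weights` is a hypothetical inverse-magnitude rule, not the TLMLDA weighting formula.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd(source, target):
    """Squared distance between the source and target feature means.

    This is MMD with a linear kernel, an assumed stand-in for a
    global domain adaptation loss (not taken from the paper).
    """
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


def dynamic_weights(losses, eps=1e-8):
    """Hypothetical rule: weight each loss by the inverse of its current
    magnitude, then normalize so that the weights sum to 1.

    A loss that is currently large gets a smaller weight, so no single
    term dominates the combined objective.
    """
    inverse = [1.0 / (abs(value) + eps) for value in losses]
    total = sum(inverse)
    return [w / total for w in inverse]


def combined_loss(cls_loss, global_loss, subdomain_loss):
    """Weighted sum of classification, global, and subdomain losses."""
    losses = [cls_loss, global_loss, subdomain_loss]
    weights = dynamic_weights(losses)
    return sum(w * value for w, value in zip(weights, losses))


if __name__ == "__main__":
    # Toy 2-D features standing in for source and target corpora.
    source = [[0.0, 0.0], [2.0, 2.0]]
    target = [[1.0, 3.0], [3.0, 5.0]]
    print(linear_mmd(source, target))        # 10.0
    print(dynamic_weights([1.0, 2.0, 4.0]))  # largest loss gets the smallest weight
```

In a real training loop the weights would normally be recomputed at each step and treated as constants when gradients are taken; the sketch only shows the bookkeeping.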
A brief summary of related work.
| References | Year | Methods | Features | Corpus |
|---|---|---|---|---|
| Zong et al. [ | 2016 | Least squares regression | INTERSPEECH 2009 | Berlin, AFEW 4.0, eNTERFACE |
| Liu et al. [ | 2018 | Feature selection + SVM | INTERSPEECH 2009 | Berlin, AFEW 4.0, eNTERFACE |
| Luo et al. [ | 2019 | NMF + MMD | Segmental features | Berlin, CASIA, eNTERFACE, Estonian |
| Song [ | 2019 | TLSL | INTERSPEECH 2010 | Berlin, FAU-AIBO, eNTERFACE |
| Zhang et al. [ | 2020 | TSDSL | INTERSPEECH 2010 | Berlin, BAUM-1a, eNTERFACE |
| Zhang et al. [ | 2021 | JDAR | INTERSPEECH 2010 | Berlin, CASIA, eNTERFACE |
| Zehra et al. [ | 2021 | Ensemble learning | Spectral and prosodic | SAVEE, URDU, EMO-DB, EMOVO |
| Latif et al. [ | 2018 | DBNs | eGeMAPS feature set | FAU-AIBO, SAVEE, IEMOCAP, EMO-DB, EMOVO |
| Zhang et al. [ | 2019 | Deep metric learning | Log Mel-frequency filter-bank energy | IEMOCAP, MSP-IMPROV |
| Ahn et al. [ | 2021 | Few-shot learning | INTERSPEECH 2010 | IEMOCAP, CREMA-D, MSP-IMPROV, Berlin, Korean multimodal emotion dataset |
| Chang et al. [ | 2021 | Adversarial learning | INTERSPEECH 2010 | IEMOCAP, MSP-IMPROV, MSP-PODCAST |
| Sneha et al. [ | 2022 | VAE with KL annealing | eGeMAPS feature set | IEMOCAP, SAVEE, Berlin, CaFE, URDU, AESD |
Figure 1: The TLMLDA model proposed in this paper. The upper flowchart shows the training phase, and the lower flowchart shows the testing phase.
Emotional labels and sample sizes selected for the six cross-corpus SER schemes.
| Schemes | Corpus | Emotional labels | Size |
|---|---|---|---|
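| E⟶B, B⟶E | eNTERFACE | Anger, sad, fear, happy, disgust | 1072 |
| E⟶B, B⟶E | Berlin | Anger, sad, fear, happy, disgust | 375 |
| E⟶C, C⟶E | eNTERFACE | Anger, sad, fear, happy, surprise | 1072 |
| E⟶C, C⟶E | CASIA | Anger, sad, fear, happy, surprise | 1000 |
| B⟶C, C⟶B | Berlin | Anger, sad, fear, happy, neutral | 408 |
| B⟶C, C⟶B | CASIA | Anger, sad, fear, happy, neutral | 1000 |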
Experimental results of the ablation experiments.
| Algorithm | E⟶B | B⟶E | E⟶C | C⟶E | B⟶C | C⟶B | Average |
|---|---|---|---|---|---|---|---|
| TLMLDA_w | 51.95 | 31.15 | 31.10 | 30.40 | 32.70 | 53.53 | 38.51 |
| TLMLDA_ | 46.62 | 34.33 | 31.60 | 30.67 | 32.70 | 53.13 | 38.18 |
| TLMLDA_L | 36.76 | 21.12 | 28.70 | 28.01 | 20.05 | 42.71 | 29.56 |
| TLMLDA_M | 54.08 | 38.28 | 34.90 | 29.23 | 32.70 | 54.33 | 40.58 |
| TLMLDA | | | | | | | |
Bold values mark the highest recognition rate in each task. TLMLDA obtains the best performance among the ablation models, which supports the design of the full model.
Figure 2: The t-SNE visualization of feature distributions (left: Only_cls, middle: TLMLDA_M, right: TLMLDA). (a) E⟶B, (b) B⟶E, (c) E⟶C, (d) C⟶E, (e) B⟶C, and (f) C⟶B.
Experimental results in comparison with other algorithms.
| Algorithm | E⟶B | B⟶E | E⟶C | C⟶E | B⟶C | C⟶B | Average |
|---|---|---|---|---|---|---|---|
| PCA + SVM | 50.85 | 33.48 | 28.40 | 27.61 | 33.13 | 43.38 | 36.14 |
| DoSL [ | 50.55 | 33.03 | 35.20 | | 39.23 | 53.20 | 40.84 |
| TSDSL [ | 47.41 | 35.44 | 32.50 | 33.25 | 37.40 | 56.74 | 40.46 |
| JDAR [ | 48.74 | 38.14 | 30.30 | 28.43 | 38.60 | 49.58 | 38.97 |
| DBN + BP [ | 29.86 | 32.21 | 24.20 | 31.02 | 35.80 | 49.59 | 33.78 |
| TLMLDA | | | | 32.74 | | | |
Bold values mark the highest recognition rate in each task, which supports the rationality of the TLMLDA model: TLMLDA obtains the best performance compared with the other models.