Yuan Zong, Hailun Lian, Jiacheng Zhang, Ercui Feng, Cheng Lu, Hongli Chang, Chuangao Tang.
Abstract
In this paper, we investigate a challenging but interesting task in speech emotion recognition (SER) research, i.e., cross-corpus SER. Unlike conventional SER, the training (source) and testing (target) samples in cross-corpus SER come from different speech corpora, which results in a feature distribution mismatch between them. Hence, the performance of most existing SER methods may decrease sharply. To cope with this problem, we propose a simple yet effective deep transfer learning method called progressive distribution adapted neural networks (PDAN). PDAN employs convolutional neural networks (CNN) as the backbone and speech spectra as the inputs to achieve an end-to-end learning framework. More importantly, its basic idea for solving cross-corpus SER is very straightforward, i.e., enhancing the backbone's corpus-invariant feature learning ability by incorporating a progressive distribution adapted regularization term into the original loss function to guide the network training. To evaluate the proposed PDAN, extensive cross-corpus SER experiments are conducted on the EmoDB, eNTERFACE, and CASIA speech emotion corpora. Experimental results show that the proposed PDAN outperforms most well-performing deep and subspace transfer learning methods in dealing with cross-corpus SER tasks.
Keywords: cross-corpus speech emotion recognition; deep learning; deep transfer learning; domain adaptation; speech emotion recognition
Year: 2022 PMID: 36187564 PMCID: PMC9520908 DOI: 10.3389/fnbot.2022.987146
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 3.493
Figure 1. Overview of the progressive distribution adapted neural networks (PDAN). PDAN uses speech spectra as the inputs and directly builds the relationship between emotion labels and speech signals. It consists of several convolutional layers and three fully connected (FC) layers and is trained under the guidance of a combination of four loss functions: an emotion discriminative loss, a marginal distribution adapted loss, a rough emotion class-aware conditional distribution adapted loss, and a fine emotion class-aware conditional distribution adapted loss.
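The combined objective described in Figure 1 can be sketched in a few lines of NumPy. The snippet below uses a linear-kernel maximum mean discrepancy (MMD) as the distribution-distance term and unit weights for the three adaptation losses; both choices, along with all function names, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def mmd_linear(x_src, x_tgt):
    """Squared MMD with a linear kernel: the squared distance between
    the mean feature vectors of the source and target domains."""
    delta = x_src.mean(axis=0) - x_tgt.mean(axis=0)
    return float(delta @ delta)

def pdan_style_loss(ce_loss, feat_src, feat_tgt, rough_pairs, fine_pairs,
                    lam_marg=1.0, lam_rough=1.0, lam_fine=1.0):
    """Combine the emotion-discriminative (cross-entropy) loss with
    marginal, rough class-aware, and fine class-aware MMD terms.
    rough_pairs / fine_pairs: lists of (source_feats, target_feats)
    grouped by rough / fine emotion class (target side would typically
    rely on pseudo-labels during training)."""
    l_marg = mmd_linear(feat_src, feat_tgt)
    l_rough = sum(mmd_linear(s, t) for s, t in rough_pairs)
    l_fine = sum(mmd_linear(s, t) for s, t in fine_pairs)
    return ce_loss + lam_marg * l_marg + lam_rough * l_rough + lam_fine * l_fine
```

When source and target features are drawn from the same distribution, all three adaptation terms shrink toward zero and the objective reduces to the plain discriminative loss.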
Figure 2. The 2D arousal-valence emotion wheel proposed by Yang et al. (2022). It consists of two dimensions: the horizontal axis denotes the degree of valence, while the vertical axis corresponds to the degree of arousal. Each typical discrete emotion can be mapped to a point on the emotion wheel according to its valence and arousal values.
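The mapping in Figure 2 suggests one natural way to derive "rough" emotion classes from fine ones: group emotions by their arousal-valence quadrant. The sketch below illustrates this grouping; the specific (valence, arousal) coordinates are illustrative assumptions, not values taken from Yang et al. (2022).

```python
# Illustrative (valence, arousal) coordinates; the exact values are
# assumptions, not taken from the cited emotion wheel.
EMOTION_VA = {
    "happy":    ( 0.8,  0.6),
    "surprise": ( 0.4,  0.8),
    "angry":    (-0.6,  0.7),
    "fear":     (-0.7,  0.5),
    "disgust":  (-0.6,  0.2),
    "sad":      (-0.7, -0.4),
}

def rough_class(emotion):
    """Map a fine emotion label to a rough arousal-valence quadrant."""
    v, a = EMOTION_VA[emotion]
    if v >= 0:
        return "positive-high" if a >= 0 else "positive-low"
    return "negative-high" if a >= 0 else "negative-low"
```

For example, `rough_class("angry")` and `rough_class("fear")` fall in the same negative-valence, high-arousal quadrant, so a rough class-aware adaptation term would align their features jointly before a fine class-aware term separates them.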
Algorithm 1. The detailed procedure for updating the optimization problem of PDAN in Equation (5) (step-by-step listing, steps 1–9, not reproduced here).
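The "progressive" part of PDAN implies that the distribution adaptation terms are phased in gradually as training proceeds, marginal alignment first, then rough class-aware alignment, then fine class-aware alignment. The schedule below is a hypothetical sketch of such phasing; the epoch thresholds and ramp lengths are invented for illustration and are not the paper's Algorithm 1.

```python
def progressive_weights(epoch, warmup=10, rough_start=20, fine_start=40):
    """Hypothetical schedule that phases in the three adaptation terms:
    marginal first, then rough class-aware, then fine class-aware.
    Each weight ramps linearly from 0 to 1 over `warmup` epochs."""
    lam_marg = min(1.0, epoch / warmup)  # ramps in from epoch 0
    lam_rough = 0.0 if epoch < rough_start else min(1.0, (epoch - rough_start) / warmup)
    lam_fine = 0.0 if epoch < fine_start else min(1.0, (epoch - fine_start) / warmup)
    return lam_marg, lam_rough, lam_fine
```

Early in training only the discriminative loss and marginal alignment are active; class-aware terms switch on once the network's target-side pseudo-labels are presumably more reliable.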
The sample statistics of EmoDB (B), eNTERFACE (E), and CASIA (C) corpora used in the designed six cross-corpus SER tasks.
| Corpus (emotion: # samples) | Total samples | Tasks |
|---|---|---|
| B (Angry: 127, Sad: 62, Fear: 69, Happy: 71, Disgust: 46) | 375 | B→E, E→B |
| E (Angry: 211, Sad: 211, Fear: 211, Happy: 208, Disgust: 211) | 1,052 | B→E, E→B |
| B (Angry: 127, Sad: 62, Fear: 69, Happy: 71, Neutral: 79) | 408 | B→C, C→B |
| C (Angry: 200, Sad: 200, Fear: 200, Happy: 200, Neutral: 200) | 1,000 | B→C, C→B |
| E (Angry: 211, Sad: 211, Fear: 211, Happy: 208, Surprise: 211) | 1,052 | E→C, C→E |
| C (Angry: 200, Sad: 200, Fear: 200, Happy: 200, Surprise: 200) | 1,000 | E→C, C→E |
The experimental results of all the transfer learning methods for six cross-corpus SER tasks, in which the best results are highlighted in bold.
| Category | Method | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Average |
|---|---|---|---|---|---|---|---|---|
| Subspace Learning | SVM | 28.93 | 23.58 | 29.60 | 35.01 | 26.10 | 25.14 | 28.06 |
| | TCA | 30.52 | 44.03 | 33.40 | 45.07 | 31.10 | 32.32 | 36.07 |
| | GFK | 32.11 | 42.48 | 33.10 | 48.08 | 32.80 | 28.13 | 36.17 |
| | SA | 33.50 | 43.89 | 35.80 | 49.03 | 32.60 | 28.17 | 36.33 |
| | DoSL | 36.12 | 38.95 | 34.40 | 45.75 | 30.40 | 31.59 | 36.20 |
| | JDAR | 36.33 | 39.97 | 31.10 | 46.29 | 32.40 | 31.50 | 36.27 |
| Subspace Learning | SVM | 34.50 | 28.13 | 35.30 | 35.29 | 24.30 | 26.81 | 30.73 |
| | TCA | 32.60 | 44.53 | 40.50 | 51.47 | 33.20 | 29.77 | 38.68 |
| | GFK | 36.01 | 40.11 | 40.00 | 45.93 | 33.00 | 29.09 | 37.35 |
| | SA | 35.65 | 43.92 | 37.50 | 47.06 | 32.10 | 30.61 | 37.80 |
| | DoSL | 36.82 | 43.33 | 36.80 | 48.45 | | 33.91 | 39.15 |
| | JDAR | | 47.80 | 42.70 | 48.97 | | | 41.76 |
| Deep Learning | AlexNet | 29.49 | 31.03 | 32.90 | 42.23 | 27.59 | 26.30 | 31.59 |
| | DAN | 36.13 | 40.41 | 39.00 | 49.85 | 29.00 | 31.47 | 37.64 |
| | DANN | 33.38 | 43.68 | 39.20 | 53.71 | 29.80 | 29.25 | 38.05 |
| | Deep-CORAL | 35.03 | 43.38 | 38.30 | 48.28 | 31.00 | 30.89 | 37.81 |
| | DSAN | 36.19 | 46.90 | 40.30 | 50.69 | 29.70 | 32.61 | 39.41 |
Experimental results of PDAN with different total loss functions for six cross-corpus SER tasks, in which the best results are highlighted in bold.
| Loss combination | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Average |
|---|---|---|---|---|---|---|---|
| | 34.36 | 43.39 | 37.50 | 48.89 | 30.00 | 30.12 | 37.38 |
| | 35.16 | 48.96 | 41.40 | 54.96 | 32.70 | 32.98 | 41.03 |