Shiqing Zhang, Ruixin Liu, Xin Tao, Xiaoming Zhao.
Abstract
Automatic speech emotion recognition (SER) is a challenging component of human-computer interaction (HCI). The existing literature mainly evaluates SER performance by training and testing on a single corpus in a single language. In many practical applications, however, the training and testing corpora differ considerably. Because speech emotion corpora and languages are so diverse, most previous SER methods perform poorly in real-world cross-corpus or cross-language scenarios. Inspired by the powerful feature learning ability of recently emerged deep learning techniques, various advanced deep learning models have increasingly been adopted for cross-corpus SER. This paper aims to provide an up-to-date and comprehensive survey of cross-corpus SER, with a focus on deep learning techniques based on supervised, unsupervised, and semi-supervised learning. It also highlights the challenges and opportunities in cross-corpus SER tasks and points out future trends.
Keywords: cross-corpus; deep learning; feature learning; speech emotion recognition; survey
Year: 2021 PMID: 34912204 PMCID: PMC8666588 DOI: 10.3389/fnbot.2021.784514
Source DB: PubMed Journal: Front Neurorobot ISSN: 1662-5218 Impact factor: 2.650
A brief summary of speech emotion databases.

| Corpus | Language | Year | Emotions | Samples | Speakers | Type | Modality |
|---|---|---|---|---|---|---|---|
| DES | Danish | 1997 | Neutral, surprise, anger, | 5,200 | 4 | Acted | Audio |
| SUSAS | English | 1997 | Four states of speech under stress: | 16,000 | 32 | Natural | Audio |
| SmartKom | German | 2002 | Neutral, joy, anger, helplessness, | 3,823 | 70 | Natural | Audio |
| FAU-AIBO | German | 2004 | Anger, bored, emphatic, helpless, | 4,525 | 51 | Natural | Audio |
| EMO-DB | German | 2005 | Anger, boredom, disgust, fear, | 535 | 10 | Acted | Audio |
| eNTERFACE05 | English | 2006 | Anger, disgust, fear, happiness, | 1,277 | 42 | Elicited | Audiovisual |
| MASC | Mandarin | 2006 | Neutral, anger, pride, panic, sadness | 25,636 | 68 | Acted | Audio |
| SAL | English | 2007 | Anger, sadness, happiness, fear, neutral | 1,692 | 4 | Natural | Audiovisual |
| ABC | German | 2007 | Aggressive, cheer, intoxicated, | 431 | 8 | Elicited | Audiovisual |
| CASIA | Mandarin | 2008 | Surprise, happiness, | 9,600 | 4 | Acted | Audio |
| VAM | German | 2008 | Dimensional emotions | 946 | 47 | Natural | Audiovisual |
| IEMOCAP | English | 2008 | Happiness, anger, sadness, | 1,150 | 10 | Elicited | Audiovisual |
| AVIC | German | 2009 | Breathing, consent, garbage, | 996 | 21 | Natural | Audiovisual |
| Polish | Polish | 2009 | Anger, sadness, happiness, | 2,351 | 13 | Acted | Audiovisual |
| IITKGPSEHSC | Hindi | 2011 | Happy, sad, angry, sarcastic, | 1,200 | 10 | Acted | Audio |
| EMOVO | Italian | 2014 | Disgust, fear, anger, | 588 | 6 | Acted | Audiovisual |
| SAVEE | English | 2014 | Anger, sadness, fear, disgust, neutral, joy, surprise | 480 | 4 | Acted | Audiovisual |
| AFEW | English | 2015 | Anger, disgust, fear, joy, neutral, sadness, | 1,645 | 330 | Natural | Audiovisual |
| BAUM-1 | Turkish | 2016 | Happiness, anger, sadness, disgust, fear, surprise, boredom | 1,222 | 31 | Natural | Audiovisual |
| MSP-IMPROV | English | 2017 | Happiness, anger, sadness, neutral | 8,438 | 12 | Acted | Audiovisual |
| CHEAVD | Mandarin | 2017 | Anger, anxious, disgust, happiness, neutral, sadness, surprise, worried | 2,852 | 238 | Natural | Audiovisual |
| NNIME | Mandarin | 2017 | Discrete emotions | 102 | 44 | Acted | Multimodal |
| URDU | Urdu | 2018 | Angry, sad, neutral, happy | 400 | 38 | Natural | Audiovisual |
| RAVDESS | English | 2018 | Calm, happy, sad, angry, | 7,356 | 24(12f) | Acted | Audiovisual |
| MSP-PODCAST | English | 2019 | Discrete emotions | 2,317 | 197 | Natural | Audio |

A brief summary of the traditional cross-corpus SER literature.

| Reference | Learning type | Features | Method | Corpora |
|---|---|---|---|---|
| Schuller et al. | Supervised | 93 LLDs | Speaker-corpus normalization | DES, EMO-DB, SUSAS, AVIC, SmartKom, eNTERFACE05 |
| Feraru et al. | Supervised | 1,941 LLDs | Rule-based model inversion | EMO-DB, DES, eNTERFACE05 |
| Song et al. | Supervised | INTERSPEECH-2010 | TNMF | FAU-AIBO, eNTERFACE05, EMO-DB |
| Mao et al. | Supervised | INTERSPEECH-2009 | EDFLM | ABC, EMO-DB, FAU-AIBO |
| Kaya and Karpov | Supervised | ComParE | Cascaded normalization | EMO-DB, DES, eNTERFACE05 |
| Luo and Han | Supervised | INTERSPEECH-2010 | NMFTSL | CASIA, SAVEE, EMO-DB, IEMOCAP, eNTERFACE05 |
| Zhang et al. | Unsupervised | 6,552 LLDs | Corpus normalization | ABC, AVIC, DES, VAM, SAL, |
| Liu et al. | Unsupervised | INTERSPEECH-2009 | DoSL | EMO-DB, eNTERFACE05 |
| Liu et al. | Unsupervised | INTERSPEECH-2009 | TRaSL | EMO-DB, eNTERFACE05, IEMOCAP |
| Song et al. | Semi-supervised | INTERSPEECH-2010 | TSDA | EMO-DB, eNTERFACE05 |
| Luo and Han | Semi-supervised | ComParE | SATNMF | CASIA, EMO-DB, |
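
Several of the traditional approaches listed above rely on some form of corpus normalization. The following is a minimal sketch of that general idea, assuming per-corpus z-normalization of each feature dimension; it is not the exact procedure of any surveyed paper. The nearest-centroid classifier is only a stand-in for a real SER model, all function names are hypothetical, and the two "corpora" are toy data in which the target corpus is a shifted and rescaled copy of the source.

```python
from statistics import mean, pstdev


def corpus_normalize(features):
    """Z-normalize each feature dimension with the corpus's own mean and std."""
    columns = list(zip(*features))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in features]


def fit_centroids(features, labels):
    """Compute one mean vector (centroid) per emotion label."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, lab in zip(features, labels) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((x - c) ** 2 for x, c in zip(row, centroids[label])),
    )


def accuracy(centroids, features, labels):
    """Fraction of rows whose predicted label matches the reference label."""
    hits = sum(predict(centroids, row) == lab for row, lab in zip(features, labels))
    return hits / len(labels)


# Hypothetical toy data: two features per utterance, two emotion classes.
source_x = [[1.0, 10.0], [1.2, 12.0], [3.0, 30.0], [3.2, 32.0]]
source_y = ["neutral", "neutral", "anger", "anger"]
# The toy target corpus has the same structure but a different offset and
# scale, mimicking a mismatch in recording conditions.
target_x = [[2.0 * a + 100.0, 2.0 * b + 100.0] for a, b in source_x]
target_y = list(source_y)

# Train on the source corpus only, then test on the target corpus.
raw_model = fit_centroids(source_x, source_y)
norm_model = fit_centroids(corpus_normalize(source_x), source_y)

raw_acc = accuracy(raw_model, target_x, target_y)
norm_acc = accuracy(norm_model, corpus_normalize(target_x), target_y)
print(f"without normalization: {raw_acc:.2f}, with: {norm_acc:.2f}")
```

On this toy data the model trained on raw source features assigns every target utterance to the same class, while normalizing each corpus with its own statistics removes the offset and scale mismatch. Real corpora differ in far more than a linear shift, which is why the surveyed methods go beyond this simple step.
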

A brief summary of the existing deep cross-corpus SER literature.

| Reference | Learning type | Features | Method | Corpora |
|---|---|---|---|---|
| Marczewski et al. | Supervised | 54,000-dimensional data points | CNN, LSTM | AFEW, EMO-DB, EMOVO, eNTERFACE05, IEMOCAP |
| Latif et al. | Supervised | eGeMAPS | DBNs | FAU-AIBO, IEMOCAP, EMO-DB, SAVEE, EMOVO |
| Parry et al. | Supervised | Mel filterbank | CNN, LSTM, | IEMOCAP, EMOVO, EMO-DB, RAVDESS, SAVEE |
| Rehman et al. | Supervised | 13 MFCCs | LSTMs, a ramification layer | IEMOCAP, RAVDESS, EMO-DB |
| Deng et al. | Unsupervised | INTERSPEECH-2009 | A-DAE | FAU-AIBO, ABC, SUSAS |
| Deng et al. | Unsupervised | INTERSPEECH-2009 | U-AE | ABC, EMO-DB, SUSAS |
| Abdelwahab and Busso | Unsupervised | INTERSPEECH-2013 | DANN | IEMOCAP, |
| Neumann and Vu | Unsupervised | 128 Mel frequency bands | Unsupervised autoencoder and ACNN | IEMOCAP, |
| Ocquaye et al. | Unsupervised | Spectrogram | Three attentive asymmetric CNNs | SAVEE, IEMOCAP, EMO-DB, FAU-AIBO, EMOVO |
| Chang and Scherer | Semi-supervised | Spectrogram | DCGAN | AMI, IEMOCAP |
| Deng et al. | Semi-supervised | INTERSPEECH-2009 | Unsupervised | FAU-AIBO, ABC, |
| Gideon et al. | Semi-supervised | 40-dimensional Mel filter banks | ADDoG | IEMOCAP, |
| Latif et al. | Semi-supervised | Spectrogram | AAE | IEMOCAP, |
| Parthasarathy and Busso | Semi-supervised | INTERSPEECH-2013 | Ladder network | MSP-PODCAST, IEMOCAP, MSP-IMPROV |