Panikos Heracleous, Akio Yoneyama.
Abstract
Emotion recognition plays an important role in human-computer interaction. Many previous and ongoing studies have focused on speech emotion recognition using a variety of classifiers and feature extraction methods. The majority of such studies, however, address speech emotion recognition solely from the perspective of a single language. In contrast, the current study extends monolingual speech emotion recognition to the case of emotions expressed in several languages and recognized by a single, complete system. To address this issue, a method that provides an effective and powerful solution to bilingual speech emotion recognition is proposed and evaluated. The proposed method is based on a two-pass classification scheme consisting of spoken language identification followed by speech emotion recognition: in the first pass, the spoken language is identified; in the second pass, emotion recognition is conducted using the emotion models of the identified language. Based on deep learning and the i-vector paradigm, bilingual emotion recognition experiments were conducted using the state-of-the-art English IEMOCAP (four emotions) and German FAU Aibo (five emotions) corpora. Two classifiers operating on i-vector features were compared, namely fully connected deep neural networks (DNN) and convolutional neural networks (CNN). With DNN, unweighted average recalls (UARs) of 64.0% and 61.14% were obtained on the IEMOCAP and FAU Aibo corpora, respectively; with CNN, the corresponding UARs were 62.0% and 59.8%. These results are very promising and superior to those reported in similar studies on multilingual or even monolingual speech emotion recognition. Furthermore, a baseline approach to bilingual speech emotion recognition was implemented and evaluated, in which six common emotions were considered and bilingual emotion models were trained on data from both languages. In this case, UARs of 51.2% and 51.5% for six emotions were obtained using DNN and CNN, respectively. The baseline results were reasonable and demonstrate the effectiveness of i-vectors and deep learning for bilingual speech emotion recognition, but the proposed two-pass method based on language identification performed significantly better. Finally, the study was extended to multilingual speech emotion recognition using corpora collected under similar conditions: the English IEMOCAP, the German Emo-DB, and a Japanese corpus were used to recognize four emotions with the proposed two-pass method. The results were very promising, and the differences in UAR relative to the monolingual classifiers were not statistically significant.
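The two-pass scheme described above can be summarized in a few lines. The sketch below is illustrative only: `extract_ivector` is a hypothetical placeholder for the i-vector front-end, and the pre-trained per-language classifiers are assumed to expose a scikit-learn-style `predict` interface; the paper's actual implementation is not published in this record.

```python
def extract_ivector(utterance):
    """Placeholder for the i-vector front-end (not specified in this record)."""
    raise NotImplementedError

def two_pass_emotion_recognition(utterance, language_id_model, emotion_models):
    """Pass 1: identify the spoken language from the i-vector.
    Pass 2: classify emotion with that language's emotion models."""
    ivec = extract_ivector(utterance)
    language = language_id_model.predict([ivec])[0]    # e.g. "english" or "german"
    return emotion_models[language].predict([ivec])[0]  # language-specific classifier
```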
Year: 2019 PMID: 31415592 PMCID: PMC6695118 DOI: 10.1371/journal.pone.0220386
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Recall, precision, and F1-score in the binary case.

| | Predicted (+) | Predicted (-) |
|---|---|---|
| Actual (+) | True Positives (TP) | False Negatives (FN) |
| Actual (-) | False Positives (FP) | True Negatives (TN) |
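The three metrics follow directly from these counts; a minimal sketch (the example counts are arbitrary, for illustration only):

```python
def binary_metrics(tp, fn, fp, tn):
    """Recall, precision, and F1 from the binary confusion counts above."""
    recall = tp / (tp + fn)                             # coverage of actual positives
    precision = tp / (tp + fp)                          # reliability of positive calls
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return recall, precision, f1

print(binary_metrics(tp=88, fn=12, fp=26, tn=74))       # arbitrary illustrative counts
```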
Emotions considered in bilingual emotion recognition with a common model set.

| IEMOCAP (monolingual) | FAU Aibo (monolingual) | Bilingual emotion |
|---|---|---|
| Happy | Joyful | Happy |
| Angry | Angry | Angry |
| Sad | - | Sad |
| Neutral | Neutral | Neutral |
| - | Emphatic | Emphatic |
| - | Rest | Rest |
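The mapping in this table can be expressed as two label dictionaries for the common-model baseline (a sketch; the string labels are exactly as printed above):

```python
# Map each corpus's monolingual labels onto the common bilingual emotion set above.
IEMOCAP_TO_COMMON = {"Happy": "Happy", "Angry": "Angry",
                     "Sad": "Sad", "Neutral": "Neutral"}
FAU_AIBO_TO_COMMON = {"Joyful": "Happy", "Angry": "Angry",
                      "Neutral": "Neutral", "Emphatic": "Emphatic", "Rest": "Rest"}
```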
Fig 1. Computation of shifted delta cepstral (SDC) coefficients.
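For reference, shifted delta cepstra stack k delta blocks computed at offsets of P frames, each delta spanning ±d frames around the shifted position. The NumPy sketch below uses the common 7-1-3-7 (N-d-P-k) configuration as an assumed default; this record does not restate the paper's exact settings.

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted delta cepstra with the N-d-P-k parameterization.
    cepstra: (T, N) array of frame-level cepstral coefficients.
    Returns a (T, N*k) array; frames near the edges use zero padding."""
    T, N = cepstra.shape
    padded = np.pad(cepstra, ((d, d + (k - 1) * P), (0, 0)))  # zero-pad both ends
    out = np.zeros((T, N * k))
    for i in range(k):
        # Delta block i for frame t: c(t + i*P + d) - c(t + i*P - d)
        delta = padded[2 * d + i * P : 2 * d + i * P + T] - padded[i * P : i * P + T]
        out[:, i * N : (i + 1) * N] = delta
    return out
```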
Fig 2. Architecture of the proposed convolutional neural network-based classifier.
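Fig 2 itself is not reproduced in this record. As a rough illustration of a CNN classifier over i-vectors, the PyTorch sketch below treats each i-vector as a one-channel 1-D signal; the i-vector dimensionality (400), filter counts, and kernel sizes are assumptions for illustration, not the paper's reported architecture.

```python
import torch.nn as nn

class IVectorCNN(nn.Module):
    """Illustrative CNN over i-vectors (layer sizes are assumed, not from the paper)."""
    def __init__(self, ivector_dim=400, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * (ivector_dim // 4), 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, num_classes),             # one logit per emotion class
        )

    def forward(self, x):       # x: (batch, ivector_dim)
        x = x.unsqueeze(1)      # add channel axis -> (batch, 1, ivector_dim)
        return self.classifier(self.features(x))
```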
Spoken language identification rates [%] using English and German emotional speech data.

| Features used in i-vector extraction | DNN: English | DNN: German | CNN: English | CNN: German |
|---|---|---|---|---|
| MFCC | 100.0 | 99.0 | 100.0 | 99.0 |
| MFCC+SDC | 100.0 | 100.0 | 100.0 | 100.0 |
Recalls [%] for speech emotion recognition using IEMOCAP and DNN.

| Features used in i-vector extraction | Neutral | Happy | Angry | Sad | UAR |
|---|---|---|---|---|---|
| MFCC | 52.0 | 42.0 | 70.0 | 62.0 | 56.5 |
| MFCC+SDC | 48.0 | 44.0 | 88.0 | 76.0 | 64.0 |
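UAR is simply the unweighted mean of the per-class recalls, so each row can be checked directly:

```python
recalls = [48.0, 44.0, 88.0, 76.0]   # Neutral, Happy, Angry, Sad (MFCC+SDC row)
print(sum(recalls) / len(recalls))   # 64.0, matching the UAR column
```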
Recalls [%] for speech emotion recognition using IEMOCAP and CNN.

| Features used in i-vector extraction | Neutral | Happy | Angry | Sad | UAR |
|---|---|---|---|---|---|
| MFCC | 46.0 | 40.0 | 66.0 | 70.0 | 55.5 |
| MFCC+SDC | 48.0 | 36.0 | 88.0 | 76.0 | 62.0 |
Precision [%] of speech emotion recognition using IEMOCAP and DNN.

| Features used in i-vector extraction | Neutral | Happy | Angry | Sad | Average |
|---|---|---|---|---|---|
| MFCC | 45.61 | 51.22 | 66.04 | 63.27 | 56.54 |
| MFCC+SDC | 51.06 | 57.89 | 77.19 | 65.52 | 62.92 |
Precision [%] of speech emotion recognition using IEMOCAP and CNN.

| Features used in i-vector extraction | Neutral | Happy | Angry | Sad | Average |
|---|---|---|---|---|---|
| MFCC | 45.10 | 47.62 | 68.75 | 59.32 | 55.20 |
| MFCC+SDC | 48.98 | 54.55 | 73.33 | 65.52 | 60.60 |
F1-scores [%] for speech emotion recognition using IEMOCAP and DNN.

| Features used in i-vector extraction | Neutral | Happy | Angry | Sad | Average |
|---|---|---|---|---|---|
| MFCC | 48.60 | 46.15 | 67.96 | 62.63 | 56.34 |
| MFCC+SDC | 49.48 | 50.00 | 82.24 | 70.37 | 63.02 |
F1-scores [%] for speech emotion recognition using IEMOCAP and CNN.

| Features used in i-vector extraction | Neutral | Happy | Angry | Sad | Average |
|---|---|---|---|---|---|
| MFCC | 45.54 | 43.48 | 67.35 | 64.22 | 55.15 |
| MFCC+SDC | 48.48 | 43.37 | 80.00 | 70.37 | 60.56 |
Confusion matrix [%] using IEMOCAP and DNN with MFCC+SDC features (rows: actual class; columns: predicted class; diagonal entries equal the recalls reported above).

| | Neutral | Happy | Angry | Sad |
|---|---|---|---|---|
| Neutral | 48.0 | 20.0 | 12.0 | 20.0 |
| Happy | 30.0 | 44.0 | 10.0 | 16.0 |
| Angry | 4.0 | 4.0 | 88.0 | 4.0 |
| Sad | 12.0 | 8.0 | 4.0 | 76.0 |
Confusion matrix [%] using IEMOCAP and CNN with MFCC+SDC features (rows: actual class; columns: predicted class; diagonal entries equal the recalls reported above).

| | Neutral | Happy | Angry | Sad |
|---|---|---|---|---|
| Neutral | 48.0 | 18.0 | 14.0 | 20.0 |
| Happy | 34.0 | 36.0 | 14.0 | 16.0 |
| Angry | 4.0 | 4.0 | 88.0 | 4.0 |
| Sad | 12.0 | 8.0 | 4.0 | 76.0 |
Recalls [%] for speech emotion recognition using FAU Aibo and DNN.

| Features used in i-vector extraction | Angry | Emphatic | Joyful | Neutral | Rest | UAR |
|---|---|---|---|---|---|---|
| MFCC | 40.47 | 42.47 | 48.16 | 28.76 | 35.12 | 38.99 |
| MFCC+SDC | 63.55 | 63.88 | 68.90 | 60.20 | 49.16 | 61.14 |
Recalls [%] for speech emotion recognition using FAU Aibo and CNN.

| Features used in i-vector extraction | Angry | Emphatic | Joyful | Neutral | Rest | UAR |
|---|---|---|---|---|---|---|
| MFCC | 46.49 | 35.12 | 53.51 | 33.78 | 28.7 | 39.53 |
| MFCC+SDC | 55.52 | 62.88 | 71.24 | 68.23 | 41.14 | 59.80 |
Precision [%] of speech emotion recognition using FAU Aibo and DNN.

| Features used in i-vector extraction | Angry | Emphatic | Joyful | Neutral | Rest | Average |
|---|---|---|---|---|---|---|
| MFCC | 41.02 | 33.60 | 55.38 | 37.55 | 31.53 | 39.82 |
| MFCC+SDC | 64.85 | 61.41 | 69.36 | 62.50 | 48.04 | 61.23 |
Precision [%] of speech emotion recognition using FAU Aibo and CNN.

| Features used in i-vector extraction | Angry | Emphatic | Joyful | Neutral | Rest | Average |
|---|---|---|---|---|---|---|
| MFCC | 41.12 | 35.35 | 52.81 | 35.07 | 31.97 | 39.26 |
| MFCC+SDC | 67.76 | 58.93 | 69.38 | 58.79 | 44.40 | 59.85 |
F1-scores [%] for speech emotion recognition using FAU Aibo and DNN.

| Features used in i-vector extraction | Angry | Emphatic | Joyful | Neutral | Rest | Average |
|---|---|---|---|---|---|---|
| MFCC | 40.74 | 37.52 | 51.52 | 32.58 | 33.23 | 39.12 |
| MFCC+SDC | 64.19 | 62.62 | 69.13 | 61.33 | 48.60 | 61.17 |
F1-scores [%] for speech emotion recognition using FAU Aibo and CNN.

| Features used in i-vector extraction | Angry | Emphatic | Joyful | Neutral | Rest | Average |
|---|---|---|---|---|---|---|
| MFCC | 43.64 | 35.23 | 53.16 | 34.41 | 30.28 | 39.34 |
| MFCC+SDC | 61.03 | 60.84 | 70.34 | 63.16 | 42.71 | 59.61 |
Confusion matrix [%] using FAU Aibo and DNN with MFCC+SDC features (rows: actual class; columns: predicted class; diagonal entries equal the recalls reported above).

| | Angry | Emphatic | Joyful | Neutral | Rest |
|---|---|---|---|---|---|
| Angry | 63.55 | 14.38 | 6.69 | 5.02 | 10.37 |
| Emphatic | 15.05 | 63.88 | 0.33 | 14.38 | 6.35 |
| Joyful | 3.68 | 2.34 | 68.90 | 4.35 | 20.74 |
| Neutral | 3.34 | 14.38 | 6.35 | 60.20 | 15.72 |
| Rest | 12.37 | 9.03 | 17.06 | 12.37 | 49.16 |
Confusion matrix [%] using FAU Aibo and CNN with MFCC+SDC features (rows: actual class; columns: predicted class; diagonal entries equal the recalls reported above).

| | Angry | Emphatic | Joyful | Neutral | Rest |
|---|---|---|---|---|---|
| Angry | 55.52 | 18.06 | 7.02 | 6.35 | 13.04 |
| Emphatic | 11.37 | 62.88 | 0.33 | 18.39 | 7.02 |
| Joyful | 2.68 | 2.68 | 71.24 | 5.69 | 17.73 |
| Neutral | 1.0 | 13.04 | 4.01 | 68.23 | 13.71 |
| Rest | 11.37 | 10.03 | 20.07 | 17.39 | 41.14 |
Recalls [%] for speech emotion recognition using a common model set and DNN.

| Features used in i-vector extraction | Angry | Emphatic | Happy | Neutral | Rest | Sad |
|---|---|---|---|---|---|---|
| MFCC | 37.0 | 61.0 | 33.0 | 26.0 | 37.0 | 96.0 |
| MFCC+SDC | 51.0 | 54.0 | 26.0 | 29.0 | 51.0 | 96.0 |
Recalls [%] for speech emotion recognition using a common model set and CNN.

| Features used in i-vector extraction | Angry | Emphatic | Happy | Neutral | Rest | Sad |
|---|---|---|---|---|---|---|
| MFCC | 39.0 | 60.0 | 42.0 | 25.0 | 29.0 | 92.0 |
| MFCC+SDC | 54.0 | 62.0 | 24.0 | 37.0 | 40.0 | 92.0 |
Precision [%] of speech emotion recognition using a common model set and DNN.

| Features used in i-vector extraction | Angry | Emphatic | Happy | Neutral | Rest | Sad |
|---|---|---|---|---|---|---|
| MFCC | 52.11 | 42.07 | 47.83 | 33.77 | 33.33 | 75.59 |
| MFCC+SDC | 53.13 | 59.34 | 36.62 | 44.62 | 34.46 | 74.42 |
Precision [%] of speech emotion recognition using a common model set and CNN.

| Features used in i-vector extraction | Angry | Emphatic | Happy | Neutral | Rest | Sad |
|---|---|---|---|---|---|---|
| MFCC | 52.7 | 43.17 | 44.21 | 30.86 | 31.87 | 76.67 |
| MFCC+SDC | 47.37 | 55.36 | 41.38 | 42.05 | 37.04 | 76.67 |
F1-scores [%] for speech emotion recognition using a common model set and DNN.

| Features used in i-vector extraction | Angry | Emphatic | Happy | Neutral | Rest | Sad |
|---|---|---|---|---|---|---|
| MFCC | 43.27 | 49.80 | 39.05 | 29.38 | 35.07 | 84.58 |
| MFCC+SDC | 52.04 | 56.54 | 30.41 | 35.15 | 41.13 | 83.84 |
F1-scores [%] for speech emotion recognition using a common model set and CNN.

| Features used in i-vector extraction | Angry | Emphatic | Happy | Neutral | Rest | Sad |
|---|---|---|---|---|---|---|
| MFCC | 44.83 | 50.21 | 43.08 | 27.62 | 30.37 | 83.64 |
| MFCC+SDC | 50.47 | 58.49 | 30.38 | 39.36 | 38.46 | 83.64 |
Training and test instances for the IEMOCAP corpus.

| Instances | Neutral | Happy | Angry | Sad | Total |
|---|---|---|---|---|---|
| Training | 1139 | 397 | 735 | 723 | 2994 |
| Test | 569 | 198 | 368 | 361 | 1496 |
| Total | 1708 | 595 | 1103 | 1084 | 4490 |
Confusion matrix [%] of the spoken language identification in the first pass (rows: actual language; columns: predicted language).

| | Japanese | English | German |
|---|---|---|---|
| Japanese | 96.48 | 3.52 | 0.0 |
| English | 0.41 | 97.43 | 2.16 |
| German | 0.88 | 11.51 | 87.61 |
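A first-pass error generally propagates to the second pass. Under the pessimistic assumption that a misidentified language always yields a wrong emotion label, the expected accuracy is the product of the two stages; a back-of-envelope sketch (the 60% second-pass UAR is an assumed value for illustration, not a reported result):

```python
# Pessimistic back-of-envelope: assume a pass-1 language error always
# costs the emotion label, so accuracy multiplies across the two passes.
lang_acc = {"japanese": 0.9648, "english": 0.9743, "german": 0.8761}  # from the matrix above
emotion_uar = 0.60   # assumed second-pass UAR, for illustration only
print(lang_acc["german"] * emotion_uar)   # ~0.526 under this worst-case assumption
```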
Fig 3. UARs for multilingual and monolingual emotion recognition for three languages.