Nivedita Patel, Shireen Patel, Sapan H Mankad.
Abstract
Emotion recognition from speech has many applications, and the field has consequently been researched extensively over the past few years. However, many of the existing solutions are not yet ready for real-time applications. In this work, we propose a compact representation of audio using conventional autoencoders for dimensionality reduction, and test the approach on two publicly available benchmark datasets. Such compact and simple classification systems, where the computing cost is low and memory is managed efficiently, may be more useful for real-time applications. The system is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). Three classifiers, namely support vector machines (SVM), a decision tree classifier, and convolutional neural networks (CNN), have been implemented to judge the impact of the approach. The results obtained by attempting classification with AlexNet and ResNet50 are also reported. The observations indicate that introducing autoencoders can improve the classification accuracy of the emotion in the input audio files. It can be concluded that, in emotion recognition from speech, the choice and application of dimensionality reduction of audio features affects the results achieved; therefore, by working on this aspect of the general speech emotion recognition model, it may be possible to make substantial improvements in the future.
Keywords: Audio; Autoencoder; Emotion; RAVDESS; TESS
Year: 2021 PMID: 33686349 PMCID: PMC7927770 DOI: 10.1007/s12652-021-02979-3
Source DB: PubMed Journal: J Ambient Intell Humaniz Comput
Fig. 1 Block diagram of a general speech emotion recognition system
Summary of different methodologies used for SER
| No. | Dataset | Methodology | Results (accuracy) | Author |
|---|---|---|---|---|
| 1 | IVR customer care domain, database from WoZ data collection (a) | SVM | 79%, 75% | Polzehl et al. |
| 2 | IEMOCAP corpus (b) | RNN | 63.5% | Mirsamadi et al. |
| 3 | EMO-DB, VAM, and TUM AVIC | SVM | 51.6% | Deng et al. |
| 4 | Berlin EmoDB and IEMOCAP | CNN, LSTM | 95.33%, 95.89% on Berlin EmoDB; 89.16%, 52.14% on IEMOCAP | Zhao et al. |
| 5 | EMO-DB | SVM | 74.4% | Deb and Dandapat |
| 6 | EMO-DB and IEMOCAP | Bidirectional LSTM and CNN | 82.35% | Pandey et al. |
| 7 | UMSSED (c) and RAVDESS (d) | Four models for binary classification | 64.29% | Zhang et al. |
| 8 | RAVDESS | CNN | 66.41% | Jannat et al. |
| 9 | RAVDESS | SVM, NN | 78.75%, 89.16% | Tomba et al. |
| 10 | RAVDESS | SVM | 75.69% | Bhavan et al. |
| 11 | GeWEC | Universum AE | 59.3% | Deng et al. |
| 12 | GeWEC | SSAE | 51.6% | Deng et al. |
| 13 | SAVEE | SVM, DSM, AE | 69.84%, 68.25%, 73.01% | Aouani and Ben Ayed |

(a) http://dicit.fbk.eu/index.php?location=woz
(b) https://sail.usc.edu/iemocap/
(c) https://web.eecs.umich.edu/~emilykmp/umssed.html
(d) https://zenodo.org/record/1188976
Fig. 2 Block diagram for MFCC
Fig. 3 Architecture of a general autoencoder
Fig. 4 Flowchart of decision tree algorithm (Pantazi et al. 2020)
Fig. 5 The proposed system model
RAVDESS-wave only audio files description
| Gender | Count | Trials per actor | # of audio samples |
|---|---|---|---|
| Female | 12 | 60 | 1440 (female and male combined) |
| Male | 12 | 60 | |
Filename identifiers (RAVDESS)
| Identifier | Code |
|---|---|
| Modality | 01 |
| Vocal channel | 01 |
| Emotion | 01; 06 |
| Intensity | 01 |
| Statement | 01; 02 |
| Repetition | 01 |
| Actor | 01 to 24 (male: odd-numbered actors; female: even-numbered actors) |
Fig. 6 Filename convention for a sample audio file from RAVDESS corpus
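The filename convention summarized above can be decoded mechanically. The sketch below is illustrative only: the code-to-label mappings are taken from the public RAVDESS documentation rather than from this record, the function name `parse_ravdess_filename` is hypothetical, and the zero-based `class_index` is an assumption chosen to match the class ordering (0 = Neutral to 7 = Surprised) used in the results tables.

```python
from pathlib import PurePath
from typing import Dict, Union

# Code tables as given in the public RAVDESS documentation (assumed here;
# this record only preserves a few of the codes).
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}
REPETITION = {"01": "1st repetition", "02": "2nd repetition"}


def parse_ravdess_filename(filename: str) -> Dict[str, Union[str, int]]:
    """Decode the seven hyphen-separated fields of a RAVDESS filename."""
    parts = PurePath(filename).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {filename!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        # Assumption: zero-based label matching the class order in the tables.
        "class_index": int(emotion) - 1,
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": REPETITION[repetition],
        "actor": actor,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }


if __name__ == "__main__":
    for key, value in parse_ravdess_filename("03-01-06-01-02-01-12.wav").items():
        print(f"{key}: {value}")
```

An unknown code raises `KeyError`, which is usually preferable to silently mislabeling a training sample.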
TESS dataset description
| Actor/subject | Words per emotion | # of emotions | # of audio files |
|---|---|---|---|
| Female 1 (age 26) | 200 | 7 | 2800 (both actors combined) |
| Female 2 (age 64) | 200 | 7 | |
Fig. 7 After applying autoencoder model to RAVDESS dataset
Fig. 8 After applying autoencoder model to TESS dataset
Fig. 9 Keras visualization of the 1D CNN model applied to RAVDESS
Fig. 10 Conv1D model Keras visualization for the TESS dataset
Comparison between performance of models (in terms of % accuracy) implemented on RAVDESS dataset
| | SVM | Decision tree | CNN |
|---|---|---|---|
| Model accuracy on original data | 30.17 | 77 | 75 |
| Model accuracy after applying autoencoder | 40.16 | 76 | 80 |
| Relative change in accuracy (%) | 33.11 | −1.29 | 6.66 |
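The last row of the table above is consistent with a relative change, (after − before) / before × 100, computed from the two accuracy rows. The following sketch reproduces that arithmetic; the function name `relative_change_percent` is hypothetical, and the formula is inferred from the reported numbers rather than stated in this record.

```python
def relative_change_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Accuracies (%) on RAVDESS: (original data, after applying the autoencoder).
ravdess_accuracy = {
    "SVM": (30.17, 40.16),
    "Decision tree": (77.0, 76.0),
    "CNN": (75.0, 80.0),
}

if __name__ == "__main__":
    for model, (before, after) in ravdess_accuracy.items():
        change = relative_change_percent(before, after)
        print(f"{model}: {change:+.2f}%")
```

Rounded to two decimals this gives 33.11, −1.30, and 6.67. The table reports −1.29 and 6.66, which suggests the published values were truncated rather than rounded.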
Classification results of RAVDESS dataset on original data
| Class | Decision tree: Precision (%) | Decision tree: Recall (%) | Decision tree: F1 score | CNN: Precision (%) | CNN: Recall (%) | CNN: F1 score |
|---|---|---|---|---|---|---|
| 0 (Neutral) | 79 | 84 | 0.82 | 61 | 59 | 0.6 |
| 1 (Calm) | 84 | 81 | 0.83 | 77 | 84 | 0.8 |
| 2 (Happy) | 72 | 77 | 0.75 | 59 | 79 | 0.67 |
| 3 (Sad) | 64 | 70 | 0.67 | 68 | 68 | 0.68 |
| 4 (Angry) | 76 | 76 | 0.76 | 93 | 71 | 0.81 |
| 5 (Fearful) | 78 | 83 | 0.8 | 73 | 74 | 0.74 |
| 6 (Disgust) | 78 | 75 | 0.76 | 83 | 70 | 0.76 |
| 7 (Surprised) | 86 | 72 | 0.78 | 79 | 76 | 0.78 |
| Macro average | 77 | 77 | 0.77 | 74 | 73 | 0.73 |
| Weighted average | 77 | 77 | 0.77 | 75 | 74 | 0.74 |
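For readers checking the table above, the per-class F1 score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the classes. The sketch below recomputes both from the decision tree columns; the helper names are hypothetical. The weighted average cannot be recomputed here because the per-class sample counts (supports) are not given in this record.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values: Sequence[float]) -> float:
    """Unweighted mean over classes."""
    return sum(values) / len(values)


# Decision tree on original RAVDESS data, classes 0 (Neutral) to 7 (Surprised).
dt_precision = [79, 84, 72, 64, 76, 78, 78, 86]  # percent
dt_recall = [84, 81, 77, 70, 76, 83, 75, 72]     # percent

if __name__ == "__main__":
    print(f"macro precision: {macro_average(dt_precision):.2f}%")
    print(f"macro recall:    {macro_average(dt_recall):.2f}%")
    for label, (p, r) in enumerate(zip(dt_precision, dt_recall)):
        print(f"class {label}: F1 = {f1_score(p / 100, r / 100):.2f}")
```

Because the published precision and recall are already rounded to whole percentages, a recomputed F1 may differ from the tabulated value in the last digit.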
Classification results of RAVDESS dataset on encoded data
| Class | Decision tree: Precision (%) | Decision tree: Recall (%) | Decision tree: F1 score | CNN: Precision (%) | CNN: Recall (%) | CNN: F1 score |
|---|---|---|---|---|---|---|
| 0 (Neutral) | 78 | 78 | 0.78 | 74 | 75 | 0.74 |
| 1 (Calm) | 83 | 75 | 0.79 | 84 | 95 | 0.89 |
| 2 (Happy) | 83 | 80 | 0.82 | 83 | 71 | 0.77 |
| 3 (Sad) | 67 | 72 | 0.7 | 84 | 71 | 0.77 |
| 4 (Angry) | 77 | 73 | 0.75 | 76 | 87 | 0.81 |
| 5 (Fearful) | 72 | 83 | 0.77 | 71 | 81 | 0.76 |
| 6 (Disgust) | 77 | 70 | 0.73 | 79 | 77 | 0.78 |
| 7 (Surprised) | 72 | 76 | 0.74 | 88 | 77 | 0.82 |
| Macro average | 76 | 76 | 0.76 | 80 | 79 | 0.79 |
| Weighted average | 76 | 76 | 0.76 | 80 | 80 | 0.8 |
Comparison between performance of models implemented on TESS dataset
| | SVM | Decision tree classifier | CNN |
|---|---|---|---|
| Model accuracy on original data | 86.14% | 90% | 94% |
| Model accuracy after applying autoencoder | 91.99% | 90% | 96% |
| Relative change in accuracy (%) | 6.79 | 0.0 | 2.12 |
Classification results of TESS dataset on original data
| Class | Decision tree: Precision (%) | Decision tree: Recall (%) | Decision tree: F1 score | CNN: Precision (%) | CNN: Recall (%) | CNN: F1 score |
|---|---|---|---|---|---|---|
| 0 (Angry) | 92 | 91 | 0.91 | 100 | 100 | 1 |
| 1 (Disgust) | 94 | 91 | 0.93 | 100 | 98 | 0.99 |
| 2 (Fear) | 93 | 90 | 0.91 | 88 | 97 | 0.92 |
| 3 (Happy) | 98 | 91 | 0.94 | 96 | 100 | 0.98 |
| 4 (Neutral) | 86 | 93 | 0.89 | 100 | 96 | 0.98 |
| 5 (Surprise) | 83 | 84 | 0.84 | 96 | 79 | 0.87 |
| 6 (Sad) | 85 | 89 | 0.87 | 78 | 87 | 0.82 |
| Macro average | 90 | 90 | 0.9 | 94 | 94 | 0.94 |
| Weighted average | 90 | 90 | 0.9 | 94 | 94 | 0.94 |
Classification results of TESS dataset on encoded data
| Class | Decision tree: Precision (%) | Decision tree: Recall (%) | Decision tree: F1 score | CNN: Precision (%) | CNN: Recall (%) | CNN: F1 score |
|---|---|---|---|---|---|---|
| 0 (Angry) | 93 | 97 | 0.95 | 98 | 99 | 0.99 |
| 1 (Disgust) | 94 | 97 | 0.95 | 98 | 98 | 0.98 |
| 2 (Fear) | 89 | 87 | 0.88 | 95 | 96 | 0.96 |
| 3 (Happy) | 90 | 86 | 0.88 | 99 | 96 | 0.99 |
| 4 (Neutral) | 86 | 90 | 0.88 | 95 | 97 | 0.96 |
| 5 (Surprise) | 87 | 84 | 0.86 | 97 | 91 | 0.94 |
| 6 (Sad) | 89 | 87 | 0.88 | 91 | 95 | 0.93 |
| Macro average | 90 | 90 | 0.9 | 96 | 96 | 0.96 |
| Weighted average | 90 | 90 | 0.9 | 96 | 96 | 0.96 |
Fig. 11 Performance of (a) SVM, (b) decision tree, and (c) CNN classifier on both datasets