P Sreevidya, S Veni, O V Ramana Murthy.
Abstract
The objective of this work is to develop an automated emotion recognition system targeted specifically at elderly people. The system is multi-modal, integrating information from the audio and video modalities. The database selected for the experiments is ElderReact, which contains 1323 video clips, each 3 to 8 s long, of people above the age of 50. All six available emotions are considered: Disgust, Anger, Fear, Happiness, Sadness and Surprise. Several modeling techniques were attempted. First, features are extracted and neural network models trained with backpropagation are built on them. Further, for the raw video model, transfer learning from pretrained networks is attempted. Convolutional neural network (CNN) and long short-term memory (LSTM) based models are used so that continuity in time between the frames is maintained while capturing the emotions. For the audio model, cross-model transfer learning is applied. The two models are combined by fusion of intermediate layers, with the layers selected through a grid-based search algorithm. The accuracy and F1-score show that the proposed approach outperforms the state-of-the-art results. Classification results show a relative improvement over the baseline ranging from a minimum of 6.5% for happiness to a maximum of 46% for sadness.
Keywords: CNN; Cross-model transfer learning; Emotion classification; Fusion
Year: 2022 PMID: 35069919 PMCID: PMC8763433 DOI: 10.1007/s11760-021-02079-x
Source DB: PubMed Journal: Signal Image Video Process ISSN: 1863-1703 Impact factor: 1.583
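
The abstract describes combining the audio and video models by fusing intermediate layers, with the layers chosen through a grid-based search. The sketch below illustrates that idea only in outline and is not the authors' implementation. It assumes fusion by concatenation, which the abstract does not specify. The layer names, the toy activations and the placeholder score function are all hypothetical. In a real pipeline the score would be a validation metric of a classifier trained on the fused features.

```python
from itertools import product
from statistics import mean


def fuse(audio_features, video_features):
    """Fuse two intermediate-layer outputs by concatenation (assumed fusion rule)."""
    return list(audio_features) + list(video_features)


def grid_search_fusion(audio_layers, video_layers, score_fn):
    """Score every (audio layer, video layer) pair and return the best one."""
    best_pair, best_score = None, float("-inf")
    for audio_name, video_name in product(audio_layers, video_layers):
        fused = fuse(audio_layers[audio_name], video_layers[video_name])
        score = score_fn(fused)
        if score > best_score:
            best_pair, best_score = (audio_name, video_name), score
    return best_pair, best_score


# Hypothetical layer names and toy activations, for illustration only.
audio_layers = {"audio_dense_1": [0.2, 0.4], "audio_dense_2": [0.9]}
video_layers = {"video_dense_1": [0.1, 0.3, 0.5], "video_dense_2": [0.7, 0.8]}

# Placeholder score: the mean activation stands in for a validation metric.
best_pair, best_score = grid_search_fusion(audio_layers, video_layers, mean)
print(best_pair, best_score)
```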
Fig. 1 Structure of the proposed model
Fig. 2 Sample elder images in the dataset
Fig. 3 Distribution of the presence of six emotions
Fig. 4 Spectrogram obtained for different emotions
Fig. 5 Diagrammatic representation of video model
Accuracy and F1-score of the extracted-feature-based sound model and visual model
| Emotion | Sound model accuracy | Sound model F1-score | Visual model accuracy | Visual model F1-score |
|---|---|---|---|---|
| Anger | 55.4 | 57.14 | 57.4 | 61.9 |
| Disgust | 57.0 | 51.0 | 63.0 | 67.0 |
| Fear | 65.3 | 67.7 | 58.0 | 65.1 |
| Happy | 65.9 | 70.0 | 59.0 | 67.0 |
| Sad | 57.9 | 57.7 | 53.9 | 58.1 |
| Surprise | 55.4 | 61.1 | 61.3 | 66.0 |
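
The tables report accuracy and F1-score per emotion. As a reminder of how the two metrics relate, the sketch below computes both from confusion counts. It assumes each emotion is scored as a binary present/absent task, which is an assumption here rather than something the record states. The counts are hypothetical and are not taken from the paper.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all clips classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical confusion counts for one emotion (not from the paper).
tp, fp, fn, tn = 30, 10, 20, 40

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```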
Accuracy and F1-score of the combined visual and audio model using extracted features
| Emotion | Accuracy | F1-score |
|---|---|---|
| Anger | 61.4 | 61.7 |
| Disgust | 65.0 | 61.5 |
| Fear | 65.4 | 73.0 |
| Happy | 66.9 | 76.8 |
| Sad | 56.0 | 65.0 |
| Surprise | 65.3 | 70.1 |
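
To read the combined-model table against the preceding single-modality table, the short sketch below computes, for each emotion, the F1-score difference between the combined model and the better of the two single-modality models. The numbers are copied from the two tables. The sound/visual labels assume the first pair of columns in the earlier table is the sound model and the second pair the visual model, following the caption order. Because only the maximum of the two is used, the result does not depend on that assumption.

```python
# F1-scores copied from the single-modality table (column assignment assumed
# from the caption order: sound first, visual second).
sound_f1 = {"Anger": 57.14, "Disgust": 51.0, "Fear": 67.7,
            "Happy": 70.0, "Sad": 57.7, "Surprise": 61.1}
visual_f1 = {"Anger": 61.9, "Disgust": 67.0, "Fear": 65.1,
             "Happy": 67.0, "Sad": 58.1, "Surprise": 66.0}

# F1-scores copied from the combined visual and audio model table.
fused_f1 = {"Anger": 61.7, "Disgust": 61.5, "Fear": 73.0,
            "Happy": 76.8, "Sad": 65.0, "Surprise": 70.1}

# Difference in F1 points between fusion and the best single modality.
gain = {
    emotion: round(fused_f1[emotion] - max(sound_f1[emotion], visual_f1[emotion]), 1)
    for emotion in fused_f1
}

for emotion, delta in gain.items():
    print(f"{emotion}: {delta:+.1f}")
```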
State-of-the-art results on ElderReact dataset for emotion classification
| Model | Anger | Disgust | Fear | Happiness | Sadness | Surprise |
|---|---|---|---|---|---|---|
| Random | 30 | 26 | 14 | 27 | 41 | |
| Naive Bayes | 34 | 27 | 25 | 33 | 45 | |
| SVM | 41 | 35 | 16 | 34 | 54 | |
| XGBoost | 43 | 36 | 17 | 35 | 14 | |
Bold values indicate the best value
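
The abstract quotes relative improvements over baseline results. The record does not spell out the formula, so the sketch below assumes the usual definition: the difference between the proposed and baseline scores divided by the baseline score. The input values are hypothetical and are not taken from the table above.

```python
def relative_improvement(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / baseline * 100.0


# Hypothetical scores (not from the paper): a baseline of 40 and a proposed
# model score of 50 on the same metric.
improvement = relative_improvement(40.0, 50.0)
print(f"{improvement:.1f}%")
```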
Accuracy and F1-score of raw video model
| Emotion | Accuracy | F1-score |
|---|---|---|
| Anger | 55.9 | 56.1 |
| Disgust | 57.0 | 61.2 |
| Fear | 62.4 | 66.1 |
| Happy | 73.7 | 77.1 |
| Sad | 56.1 | 64.1 |
| Surprise | 56.0 | 64.0 |
Fig. 6 Comparison of audio, video and fusion models
Accuracy and F1-score of spectrogram model
| Emotion | Accuracy | F1-score |
|---|---|---|
| Anger | 50.7 | 62.1 |
| Disgust | 57.1 | 51.2 |
| Fear | 60.4 | 62.5 |
| Happy | 58.6 | 61.4 |
| Sad | 60.5 | 60.5 |
| Surprise | 62.0 | 61.0 |
Accuracy and F1-score of the combined model using the CNN model and the LSTM model
| Emotion | CNN accuracy | CNN F1-score | LSTM accuracy | LSTM F1-score |
|---|---|---|---|---|
| Anger | 55.1 | 62.3 | 55.6 | 65.3 |
| Disgust | 51.2 | 67.1 | 60.5 | 66.7 |
| Fear | 57.0 | 64.1 | 60.5 | 70.0 |
| Happy | 54.9 | 69.1 | 66.5 | 76.0 |
| Sad | 54.1 | 67.1 | 57.8 | 67.0 |
| Surprise | 56.1 | 62.3 | 59.5 | 69.5 |
Comparison of performance of spectrogram model on ElderReact and EmoReact dataset
| Dataset | Anger | Disgust | Happiness | Surprise |
|---|---|---|---|---|
| ElderReact | 62.1 | 51.2 | 61 | |
| EmoReact | 17 | 10 | 64 | |
Bold values indicate the best value
Fig. 7 Comparison of F-score between two audio models developed for classification of emotions
Comparison of performance of the feature-based model on the ElderReact and RAVDESS datasets
| Dataset | Anger | Disgust | Happiness | Surprise |
|---|---|---|---|---|
| ElderReact | 57.14 | 51.0 | 61.1 | |
| RAVDESS | 17 | 29.0 | 44.9 | |
Bold values indicate the best value
Fig. 8 Comparison of F1-score of video models for classification of emotions