Nikita Jain, Vedika Gupta, Shubham Shubham, Agam Madan, Ankit Chaudhary, K C Santosh.
Abstract
Emotion is an instinctive or intuitive feeling, as distinguished from reasoning or knowledge. It is a natural state of mind deriving from one's circumstances, mood, or relationships with others, and it varies over time; it is therefore important to understand and analyze emotions appropriately. Existing works have mostly focused on recognizing basic emotions from human faces, whereas emotion recognition from cartoon images has not been extensively covered. In this paper, we therefore present an integrated Deep Neural Network (DNN) approach for recognizing emotions from cartoon images. Since existing works do not provide large amounts of such data, we collected a dataset of about 8,000 images of two cartoon characters, 'Tom' and 'Jerry', labeled with four emotions: happy, sad, angry, and surprise. The proposed integrated DNN approach, trained on this dataset of animations of both characters, correctly identifies the character, segments the face mask, and recognizes the corresponding emotion with an accuracy score of 0.96. The approach uses Mask R-CNN for character detection and state-of-the-art deep learning models, namely ResNet-50, MobileNetV2, InceptionV3, and VGG16, for emotion classification. In our study, VGG16 outperforms the other classifiers, with an accuracy of 96% and an F1 score of 0.85. The proposed integrated DNN also outperforms the state-of-the-art approaches.
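The abstract describes a three-stage pipeline: detect the character, segment its face, then classify the emotion. The structural sketch below illustrates only that stage composition; the stub functions (`detect_character`, `segment_face`, `classify_emotion`) are hypothetical stand-ins for the paper's Mask R-CNN and VGG16 models, not the authors' implementation.

```python
# Structural sketch of the integrated pipeline (detect -> segment -> classify).
# All three model stages are hypothetical stubs; only the composition of the
# stages described in the abstract is shown.

EMOTIONS = ["happy", "sad", "angry", "surprise"]
CHARACTERS = ["Tom", "Jerry"]

def detect_character(frame):
    """Stand-in for Mask R-CNN detection: returns (character label, bounding box)."""
    return "Tom", (10, 10, 120, 120)  # hypothetical fixed output

def segment_face(frame, bbox):
    """Stand-in for face-mask segmentation: crops the detected region."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in frame[y0:y1]]

def classify_emotion(face_crop):
    """Stand-in for the VGG16 emotion classifier."""
    return "happy"  # hypothetical fixed output

def recognize(frame):
    character, bbox = detect_character(frame)
    face = segment_face(frame, bbox)
    return character, classify_emotion(face)

frame = [[0] * 200 for _ in range(200)]  # dummy 200x200 "image"
print(recognize(frame))  # -> ('Tom', 'happy')
```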
Keywords: Animation; Cartoon; Character Detection; Convolutional Neural Network; Emotion; Face Segmentation; Mask R-CNN; VGG16
Year: 2021 PMID: 33903785 PMCID: PMC8059693 DOI: 10.1007/s00521-021-06003-9
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.102
Fig. 1 Workflow of the proposed approach
Fig. 2 Screenshot of the VGG Image Annotator
Fig. 3 Distribution of emotion labels in the dataset after annotation
Fig. 4 Architecture of Mask R-CNN
Fig. 5 RoIAlign layer
Fig. 6 Use of the spatial transformer and bilinear sampling kernel
Fig. 7 Layers involved in VGG16
Summary of the CNN architectures used in this paper
| Model name | # of Layers | Total parameters | Trainable parameters |
|---|---|---|---|
| VGG16 | 16 | 19,175,207 | 13,863,719 |
| InceptionV3 | 48 | 7,317,608 | 5,688,975 |
| ResNet50 | 50 | 20,942,054 | 16,848,092 |
| MobileNetV2 | 53 | 5,440,729 | 2,804,612 |
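As a sanity check on where counts like those in the table come from: a 3×3 convolution has (3·3·in_channels + 1)·out_channels parameters (the +1 is the bias). Summing this over VGG16's thirteen convolutional layers reproduces the standard 14,714,688 parameters of its convolutional base; the totals reported in the table additionally include the paper's custom classification head, whose exact shape is not given here.

```python
# Parameter count of a k x k convolution: (k*k*in_channels + 1) * out_channels.
def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out

# (in_channels, out_channels) for VGG16's conv layers, block by block.
vgg16_convs = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

total = sum(conv_params(3, c_in, c_out) for c_in, c_out in vgg16_convs)
print(total)  # -> 14714688 (VGG16 convolutional base)
```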
Fig. 8 Mask formation steps
Fig. 9 Masks generated by Mask R-CNN
Fig. 10 Visualizing the created data batch
Fig. 11 Visualization of emotion recognition using the proposed approach
Confusion matrix for all the models
Classification report for all the models with Mask R-CNN
| Character | Emotion | Precision (VGG16) | Recall (VGG16) | F1-score (VGG16) | Accuracy (VGG16) | Precision (InceptionV3) | Recall (InceptionV3) | F1-score (InceptionV3) | Accuracy (InceptionV3) |
|---|---|---|---|---|---|---|---|---|---|
| Tom | Angry | 0.82 | 0.86 | 0.84 | 0.96 | 0.72 | 0.75 | 0.73 | 0.94 |
| Tom | Happy | 0.82 | 0.94 | 0.87 | 0.97 | 0.76 | 0.75 | 0.76 | 0.94 |
| Tom | Sad | 0.87 | 0.66 | 0.75 | 0.95 | 0.72 | 0.68 | 0.70 | 0.93 |
| Tom | Surprise | 0.85 | 0.85 | 0.85 | 0.96 | 0.76 | 0.78 | 0.77 | 0.94 |
| Jerry | Angry | 0.81 | 0.91 | 0.86 | 0.96 | 0.81 | 0.80 | 0.81 | 0.95 |
| Jerry | Happy | 0.90 | 0.87 | 0.88 | 0.97 | 0.71 | 0.69 | 0.70 | 0.91 |
| Jerry | Sad | 0.88 | 0.79 | 0.84 | 0.96 | 0.79 | 0.76 | 0.78 | 0.93 |
| Jerry | Surprise | 0.85 | 0.87 | 0.86 | 0.96 | 0.72 | 0.78 | 0.75 | 0.93 |
| Micro-average | | 0.85 | 0.84 | 0.84 | 0.96 | 0.75 | 0.75 | 0.75 | 0.93 |
| Weighted average | | 0.85 | 0.85 | 0.85 | 0.96 | 0.75 | 0.75 | 0.75 | 0.93 |
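The F1 scores in the classification report follow directly from the listed precision and recall via F1 = 2PR / (P + R). A quick check against two rows of the table (Tom/Angry and the micro-average for VGG16):

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Tom / Angry, VGG16: P = 0.82, R = 0.86 -> F1 = 0.84 as reported.
print(round(f1(0.82, 0.86), 2))  # -> 0.84
# Micro-average, VGG16: P = 0.85, R = 0.84 -> F1 = 0.84 as reported.
print(round(f1(0.85, 0.84), 2))  # -> 0.84
```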
Classification report for all the models without Mask R-CNN
| Character | Emotion | Precision (VGG16) | Recall (VGG16) | F1-score (VGG16) | Accuracy (VGG16) | Precision (InceptionV3) | Recall (InceptionV3) | F1-score (InceptionV3) | Accuracy (InceptionV3) |
|---|---|---|---|---|---|---|---|---|---|
| Tom | Angry | 0.45 | 0.48 | 0.46 | 0.64 | 0.37 | 0.35 | 0.36 | 0.60 |
| Tom | Happy | 0.36 | 0.37 | 0.36 | 0.65 | 0.35 | 0.36 | 0.35 | 0.59 |
| Tom | Sad | 0.38 | 0.38 | 0.38 | 0.65 | 0.30 | 0.32 | 0.30 | 0.59 |
| Tom | Surprise | 0.42 | 0.46 | 0.44 | 0.63 | 0.35 | 0.33 | 0.34 | 0.61 |
| Jerry | Angry | 0.43 | 0.48 | 0.45 | 0.64 | 0.36 | 0.35 | 0.35 | 0.61 |
| Jerry | Happy | 0.47 | 0.47 | 0.47 | 0.65 | 0.36 | 0.34 | 0.35 | 0.60 |
| Jerry | Sad | 0.45 | 0.38 | 0.41 | 0.63 | 0.37 | 0.37 | 0.37 | 0.58 |
| Jerry | Surprise | 0.39 | 0.45 | 0.42 | 0.63 | 0.32 | 0.31 | 0.31 | 0.61 |
| Micro-average | | 0.42 | 0.42 | 0.42 | 0.64 | 0.35 | 0.34 | 0.34 | 0.59 |
| Weighted average | | 0.42 | 0.43 | 0.42 | 0.64 | 0.35 | 0.34 | 0.34 | 0.60 |
Fig. 12 (a–d) Precision-recall curves and ROC curves for all four DNNs
Fig. 13 (a–d) Accuracy vs. number of epochs for all the models
Comparison between the proposed approach and the existing methods
| Comparison parameters | Proposed approach | Hill | Li et al. | Ma et al. | Aneja et al. | Aneja et al. |
|---|---|---|---|---|---|---|
| Number of emotions recognized | 4 (H, S, A, & Su) | 3 (H, S & A) | 4 (A, H, S, & N) | 6 (H, S, A, F, Su, & D) | 7 (S, A, Su, D, F, N, & J) | 7 (S, A, Su, D, F, N, & J) |
| Learning rate | 0.0003 | 0.001 | – | – | 0.01 | 0.0001 |
| Accuracy Score for Individual Emotion | 0.97 (H), 0.95 (S), 0.96 (A), 0.96 (Su) | – | 0.77 (A), 0.65 (H), 0.70 (S) | 0.78 (H), 0.49 (S), 0.62 (A), 0.19 (Su) | 0.89 (S), 0.85 (A), 0.95 (Su) | 0.79 (S), 0.90 (A), 0.94 (Su) |
| Overall Model Accuracy Score | 0.96 | 0.80 | – | – | 0.89 | – |
| F1-score | 0.85 | – | – | – | – | – |
*H → Happy, S → Sad, A → Angry, Su → Surprised, F → Fear, D → Disgusted, J → Joy, N→ Neutral
Fig. 14 Comparison of our approach to existing work in detecting a cartoon character