Dami Jeong1, Byung-Gyu Kim1, Suh-Yeon Dong1.
Abstract
Understanding a person's feelings is an important process in affective computing. People express their emotions in various ways; among them, facial expression is the most effective way to convey emotional status. We propose efficient deep joint spatiotemporal features for facial expression recognition based on deep appearance and geometric neural networks. We apply three-dimensional (3D) convolution to extract spatial and temporal features at the same time. For the geometric network, 23 dominant facial landmarks are selected, through an analysis of the energy distribution of all facial landmarks, to express the movement of the facial muscles. We combine these features with the designed joint fusion classifier so that they complement each other. In the experiments, the recognition accuracy is 99.21%, 87.88%, and 91.83% for the CK+, MMI, and FERA datasets, respectively. A comparative analysis shows that the proposed scheme improves the recognition accuracy by at least 4%.
Keywords: deep learning; deep spatiotemporal network; facial expression recognition (FER); geometric feature; joint fusion classifier; local binary pattern (LBP) feature
Year: 2020 PMID: 32235662 PMCID: PMC7180996 DOI: 10.3390/s20071936
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Datasets with different genders, races, and ages.
Analysis of the existing FER systems for business applications.
| Items | Description | Developers |
|---|---|---|
| Project Oxford [ | As part of the Machine Learning project, it provides APIs, including the Face API, for machine learning technology, which is being promoted in four major categories. | Oxford Univ. and Microsoft |
| Face Reader [ | The Project Analysis Module includes the Stimulus Presentation Tool. By designing a test, this tool automatically shows the test participants the stimuli while FaceReader accurately analyzes the participant’s face. | Noldus |
| Emotient [ | Emotient is spearheading the use of machine learning for facial expression analysis, a new category of natural user interface that is rapidly expanding as more companies seek to improve technology responsiveness and increase customer engagement. | crunchbase |
| Affectiva [ | The first multi-modal in-cabin sensing AI that identifies, from face and voice, complex and nuanced emotional and cognitive states of drivers and passengers. This helps improve road safety and the transportation experience. | Affectiva |
| EmoVu [ | EmoVu specializes in creating emotionally intelligent tools that can perceive the emotions of people by analyzing microexpressions using webcams. | Eyeris |
| Kairos [ | Kairos provides featured platforms such as face detection/recognition, age, gender detection, etc. | Kairos |
| Nviso [ | Nviso develops face detection, face verification, demographic detection, emotion classification, action unit detection, and pose detection. | Nviso |
| Sightcorp [ | Sightcorp provides technologies for face recognition, emotion detection, attention analysis, and mood detection. | Sightcorp |
| SkyBiometry [ | SkyBiometry is a state-of-the-art face recognition and face detection cloud biometrics API for developers and marketers. | SkyBiometry |
| Face++ [ | It ensures that the operator behind a transaction is a live human by using facial landmark localization, face tracking, etc. | Face++ |
| iMotions [ | iMotions helps you quantify engagement and emotional responses. The iMotions Platform is emotion recognition software that seamlessly integrates multiple sensors. | iMotions |
| CrowdEmotion [ | An emotion inspired artificial intelligence company that enables technology to see, hear, and feel the way humans do. | CrowdEmotion |
| FacioMetrics [ | FacioMetrics develops facial analysis software for mobile applications including facial image analysis—with all kinds of applications including augmented/virtual reality, animation, and audience reaction measurement. | Facebook Research |
| Findface [ | A face recognition technology developed by the Russian company NtechLab, which specializes in neural network tools. It compares photos to profile pictures on the social network Vkontakte and works out identities with 70% reliability. | Findface |
Figure 2. The overall structure of the proposed facial expression recognition scheme.
The number of input images per emotion and dataset.
| Dataset | Neu | Ang | Dis | Fea | Hap | Sad | Sur | Total |
|---|---|---|---|---|---|---|---|---|
| CK+ | 316 | 360 | 448 | 192 | 552 | 224 | 624 | 2716 |
| MMI | 366 | 288 | 492 | 258 | 204 | 366 | 366 | 2340 |
| FERA | 603 | 529 | - | 479 | 606 | 695 | - | 2915 |
Figure 3. Example of encoding an LBP feature.
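The encoding that Figure 3 illustrates can be sketched in a few lines. The version below is the basic 3x3 local binary pattern: each of the eight neighbours is compared with the centre pixel and the results are packed into one byte. The function names, the clockwise bit order starting at the top-left neighbour, and the `>=` threshold are assumptions made for this illustration; the figure may use a different ordering.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code for the centre of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Assumed convention: clockwise neighbour order starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit to 1.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_image(image):
    """Compute LBP codes for every interior pixel of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            patch = [image[r + dr][c - 1:c + 2] for dr in (-1, 0, 1)]
            row.append(lbp_code(patch))
        out.append(row)
    return out
```

For example, `lbp_code([[6, 5, 2], [7, 6, 1], [9, 8, 7]])` sets the bits for the neighbours 6, 7, 8, 9, and 7 that are not below the centre value 6. Border pixels are skipped here for simplicity.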
Figure 4. Feature extraction: (a) original image; (b) filtered image; (c) LBP feature without filtering; and (d) LBP feature with filtering.
Figure 5. Appearance feature-based spatiotemporal network.
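The abstract states that 3D convolution is used to extract spatial and temporal features at the same time. As a minimal illustration of that single operation, and not of the network in Figure 5, the sketch below slides one kernel over a stack of frames along time, height, and width. `conv3d_valid` is a hypothetical helper; the single channel, stride of 1, and absence of padding are assumptions, and as in most deep learning libraries the kernel is applied without flipping (cross-correlation).

```python
def conv3d_valid(volume, kernel):
    """Valid (no padding), stride-1 3D cross-correlation.

    volume: nested lists shaped [T][H][W] (frames, rows, columns).
    kernel: nested lists shaped [kt][kh][kw].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Accumulate over the temporal and both spatial kernel axes.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

Because the kernel spans several frames, each output value depends on how a local image region changes over time, which is what distinguishes it from a per-frame 2D convolution.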
Figure 6. Examples of 68-landmark detection: (Left) CK+; (Middle) MMI; and (Right) AFEW.
Figure 7. Movement distribution of landmarks.
Figure 8. Top 13 landmarks: (a) CK+ dataset; and (b) MMI dataset.
Figure 9. Twenty-three landmarks used as the input of geometric network: (a) CK+ dataset; and (b) MMI dataset.
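The abstract describes selecting dominant landmarks through an analysis of the energy distribution of all facial landmarks. One plausible reading of that step is sketched below: movement energy is taken as the sum of squared frame-to-frame displacements of each landmark, and the highest-energy landmarks are kept. This energy definition and the helper names `landmark_energy` and `top_k_landmarks` are assumptions for illustration; the exact measure is not given here, and the sketch does not reproduce how the final 23 landmarks are derived.

```python
def landmark_energy(frames):
    """Per-landmark movement energy over a sequence.

    frames: list of frames, each a list of (x, y) landmark coordinates.
    Energy is assumed to be the sum of squared frame-to-frame displacements.
    """
    n = len(frames[0])
    energy = [0.0] * n
    for prev, cur in zip(frames, frames[1:]):
        for i in range(n):
            dx = cur[i][0] - prev[i][0]
            dy = cur[i][1] - prev[i][1]
            energy[i] += dx * dx + dy * dy
    return energy


def top_k_landmarks(frames, k):
    """Return the indices of the k highest-energy landmarks, in index order."""
    energy = landmark_energy(frames)
    ranked = sorted(range(len(energy)), key=lambda i: energy[i], reverse=True)
    return sorted(ranked[:k])
```

With 68 detected landmarks per frame, `top_k_landmarks(frames, 13)` would return a ranking comparable in spirit to the top 13 landmarks shown in Figure 8.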
Figure 10. The input vector for the geometric feature-based network.
Figure 11. Geometric feature-based network.
Figure 12. The structure of joint fusion classifier.
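The abstract says the appearance and geometric features are combined by a joint fusion classifier so that they complement each other. A common, simple way to combine two classifiers is a weighted sum of their class probabilities, sketched below. This is only a generic late-fusion example under that assumption, not the structure in Figure 12: the function names and the weight `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(app_logits, geo_logits, alpha=0.5):
    """Weighted-sum fusion of two networks' outputs (assumed scheme).

    alpha weights the appearance branch; (1 - alpha) weights the geometric one.
    Returns the predicted class index and the fused probabilities.
    """
    p_app = softmax(app_logits)
    p_geo = softmax(geo_logits)
    fused = [alpha * a + (1.0 - alpha) * g for a, g in zip(p_app, p_geo)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused
```

Setting `alpha` to 1.0 or 0.0 reduces the result to the appearance-only or geometric-only prediction, which makes the weight easy to sweep when comparing accuracy across settings.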
Figure 13. Examples of neutral-labeled frames: (a) seven consecutive frames; (b) neutral frames for three; (c) neutral frames for five; and (d) neutral frames for seven.
The recognition accuracy according to three different input data structures (%).
| | 3 Frames | 5 Frames | 7 Frames |
|---|---|---|---|
| | 99.27 | 94.45 | 95.57 |
| | 87.23 | 83.76 | 81.88 |
| | 90.67 | 87.85 | 87.15 |
Figure 14. Comparison of accuracy according to the number of landmarks.
Comparison of the average accuracy according to .
| | CK+ | MMI | FERA |
|---|---|---|---|
| | 98.87 | 87.28 | 91.31 |
| | 99.20 | 86.64 | 91.78 |
| | 99.21 | 87.49 | 91.79 |
| | 99.09 | 87.51 | 90.77 |
| | 99.02 | 87.23 | 90.60 |
| | 99.07 | 86.79 | 89.13 |
Analysis of the state-of-the-art methods.
| Method | Database | Input Construction | Model |
|---|---|---|---|
| DTAGN [ | CK+ | DTAN, DTGN | Hybrid network |
| 3DIR [ | CK+, MMI, FERA | Multiple frames, | 3D CNN, LSTM, |
| DESTN [ | CK+ | Single image frame, | Hybrid network |
| nestedLSTM [ | CK+, MMI | Multiple frames | 3D CNN, LSTM |
| STCNN-CRF [ | CK+, MMI, FERA | Multiple frames | 2D CNN, CRF, |
| STFR [ | MMI | Multiple frames | LSTM |
The performance comparison of the recognition accuracy in the CK+ dataset (%).
| Methods | Accuracy (%) |
|---|---|
| DTAGN [ | 97.25 |
| STCNN-CRF [ | 93.04 |
| 3DIR (S/I) [ | 93.21 |
| DESTN [ | 98.50 |
| nestedLSTM [ | 99.80 |
| Proposed (App.) | |
| Proposed (Geo.) | |
| Proposed (Joint) | |
The performance comparison of the recognition accuracy in the MMI dataset (%).
| Methods | Accuracy (%) |
|---|---|
| STCNN-CRF [ | 68.51 |
| 3DIR [ | 77.50 |
| STFR [ | 78.61 |
| nestedLSTM [ | 84.53 |
| Proposed (App.) | |
| Proposed (Geo.) | |
| Proposed (Joint) | |
The performance comparison of the recognition accuracy in the FERA dataset (%).
| Methods | Accuracy (%) |
|---|---|
| STCNN-CRF [ | 66.66 |
| 3DIR (S/I) [ | 77.42 |
| Proposed (App.) | |
| Proposed (Geo.) | |
| Proposed (Joint) | |
Confusion matrix of joint fusion classifier in the CK+ dataset. NE, Neutral; AN, Angry; DI, Disgust; FE, Fear; HA, Happy; SA, Sad; SU, Surprise.
| Actual / Predicted | NE | AN | DI | FE | HA | SA | SU |
|---|---|---|---|---|---|---|---|
| NE | | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
| AN | 0.00 | | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 |
| DI | 0.00 | 0.01 | | 0.00 | 0.00 | 0.00 | 0.00 |
| FE | 0.00 | 0.00 | 0.00 | | 0.02 | 0.00 | 0.00 |
| HA | 0.00 | 0.00 | 0.00 | 0.00 | | 0.00 | 0.00 |
| SA | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | | 0.00 |
| SU | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | |
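The entries of these confusion matrices look like per-class fractions rather than raw counts, which suggests each row (actual class) was normalized by its number of samples. Under that assumption, a matrix of this kind can be computed from true and predicted labels as sketched below; `normalized_confusion` is a hypothetical helper, and the label strings are only examples.

```python
def normalized_confusion(actual, predicted, labels):
    """Row-normalized confusion matrix.

    Rows are actual classes and columns are predicted classes, both in the
    order given by `labels`. Each row is divided by its sample count, so the
    diagonal holds the per-class recall.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no samples yields an all-zero row instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

With this normalization, every non-empty row sums to 1, so an off-diagonal entry such as 0.03 reads as 3% of that actual class being assigned to the column's class.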
Confusion matrix of joint fusion classifier in the MMI dataset. NE, Neutral; AN, Angry; DI, Disgust; FE, Fear; HA, Happy; SA, Sad; SU, Surprise.
| Actual / Predicted | NE | AN | DI | FE | HA | SA | SU |
|---|---|---|---|---|---|---|---|
| NE | | 0.00 | 0.03 | 0.02 | 0.01 | 0.03 | 0.03 |
| AN | 0.16 | | 0.10 | 0.01 | 0.01 | 0.08 | 0.01 |
| DI | 0.09 | 0.09 | | 0.01 | 0.02 | 0.01 | 0.00 |
| FE | 0.07 | 0.00 | 0.10 | | 0.00 | 0.05 | 0.10 |
| HA | 0.15 | 0.00 | 0.00 | 0.00 | | 0.00 | 0.00 |
| SA | 0.10 | 0.06 | 0.00 | 0.06 | 0.02 | | 0.00 |
| SU | 0.05 | 0.00 | 0.00 | 0.02 | 0.02 | 0.00 | |
Confusion matrix of joint fusion classifier in the FERA dataset. RE, Relief; AN, Anger; FE, Fear; JO, Joy; SA, Sadness.
| Actual / Predicted | RE | AN | FE | JO | SA |
|---|---|---|---|---|---|
| RE | | 0.03 | 0.01 | 0.06 | 0.00 |
| AN | 0.15 | | 0.00 | 0.02 | 0.00 |
| FE | 0.04 | 0.01 | | 0.03 | 0.00 |
| JO | 0.05 | 0.02 | 0.01 | | 0.00 |
| SA | 0.01 | 0.00 | 0.01 | 0.00 | |