Abstract
Facial emotion recognition (FER) is an important topic in computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of FER research conducted over the past decades. First, conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks enabling "end-to-end" learning are then presented. This review also focuses on an up-to-date hybrid deep-learning approach combining a convolutional neural network (CNN) for the spatial features of an individual frame and long short-term memory (LSTM) for the temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, and a comparison with benchmark results, which are a standard for the quantitative comparison of FER studies, is described. This review can serve as a brief guidebook for newcomers to the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as for experienced researchers looking for productive directions for future work.
Keywords: conventional FER; convolutional neural networks; deep-learning-based FER; facial action coding system; facial action unit; facial emotion recognition; long short-term memory
Year: 2018 PMID: 29385749 PMCID: PMC5856145 DOI: 10.3390/s18020401
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Procedure used in conventional FER approaches: (a) from the input images, (b) the face region and facial landmarks are detected, (c) spatial and temporal features are extracted from the face components and landmarks, and (d) the facial expression is assigned to one of the facial categories using pre-trained pattern classifiers (face images are taken from the CK+ dataset [10]).
Figure 2. Procedure of CNN-based FER approaches: (a) The input images are convolved using filters in the convolution layers. (b) From the convolution results, feature maps are constructed, and max-pooling (subsampling) layers lower the spatial resolution of the given feature maps. (c) CNNs apply fully connected neural-network layers after the convolutional layers, and (d) a single facial expression is recognized based on the output of softmax (face images are taken from the CK+ dataset [10]).
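The four steps in the Figure 2 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the architecture of any reviewed system: the ReLU activation, the 8 × 8 input, the two 3 × 3 filters, the random untrained weights, and the seven-label list are all hypothetical choices made only so the example runs.

```python
import math
import random

def conv2d(image, kernel):
    """'Valid' 2D convolution (implemented as cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise ReLU (an assumed activation; the caption does not name one)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling, which lowers the spatial resolution."""
    rows = range(0, len(fmap) - size + 1, size)
    cols = range(0, len(fmap[0]) - size + 1, size)
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in cols]
            for i in rows]

def fully_connected(vec, weights, bias):
    """One dense layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(image, kernels, fc_weights, fc_bias):
    # Steps (a) and (b): convolve, then pool each feature map.
    maps = [max_pool(relu(conv2d(image, k))) for k in kernels]
    # Step (c): flatten and apply the fully connected layer.
    flat = [v for fmap in maps for row in fmap for v in row]
    # Step (d): softmax gives one probability per expression class.
    return softmax(fully_connected(flat, fc_weights, fc_bias))

# Toy run with random, untrained weights (hypothetical sizes and labels).
rng = random.Random(0)
EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprised", "neutral"]
image = [[rng.random() for _ in range(8)] for _ in range(8)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(2)]
n_features = 2 * 3 * 3  # two feature maps, each 3 x 3 after pooling
fc_weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in EMOTIONS]
fc_bias = [0.0] * len(EMOTIONS)

probs = forward(image, kernels, fc_weights, fc_bias)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the weights are random, the predicted label carries no meaning; the sketch only shows how the data shape changes from image to feature maps to a probability vector.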
Figure 3. Examples of various facial emotions and AUs: (a) basic emotions (sad, fearful, and angry) (face images are taken from the CE dataset [17]), (b) compound emotions (happily surprised, happily disgusted, and sadly fearful) (face images are taken from the CE dataset [17]), (c) spontaneous expressions (face images are taken from YouTube), and (d) AUs (upper and lower face) (face images are taken from the CK+ dataset [10]).
Prototypical AUs observed in each basic and compound emotion category, adapted from [18].
| Category | AUs | Category | AUs |
|---|---|---|---|
| Happy | 12, 25 | Sadly disgusted | 4, 10 |
| Sad | 4, 15 | Fearfully angry | 4, 20, 25 |
| Fearful | 1, 4, 20, 25 | Fearfully surprised | 1, 2, 5, 20, 25 |
| Angry | 4, 7, 24 | Fearfully disgusted | 1, 4, 10, 20, 25 |
| Surprised | 1, 2, 25, 26 | Angrily surprised | 4, 25, 26 |
| Disgusted | 9, 10, 17 | Disgustedly surprised | 1, 2, 5, 10 |
| Happily sad | 4, 6, 12, 25 | Happily fearful | 1, 2, 12, 25, 26 |
| Happily surprised | 1, 2, 12, 25 | Angrily disgusted | 4, 10, 17 |
| Happily disgusted | 10, 12, 25 | Awed | 1, 2, 5, 25 |
| Sadly fearful | 1, 4, 15, 25 | Appalled | 4, 9, 10 |
| Sadly angry | 4, 7, 15 | Hatred | 4, 7, 10 |
| Sadly surprised | 1, 4, 25, 26 | - | - |
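As a rough illustration of how the table above could be used, the sketch below stores a subset of the prototypical AU sets in a dictionary and returns every category whose AUs are all present in a set of detected AUs. The function name `match_categories` and the subset-matching rule are assumptions made for this example; the reviewed systems use trained classifiers rather than a lookup.

```python
# Prototypical AUs for the six basic emotions and three compound emotions,
# copied from the table above (a subset, to keep the example short).
PROTOTYPICAL_AUS = {
    "Happy": {12, 25},
    "Sad": {4, 15},
    "Fearful": {1, 4, 20, 25},
    "Angry": {4, 7, 24},
    "Surprised": {1, 2, 25, 26},
    "Disgusted": {9, 10, 17},
    "Happily surprised": {1, 2, 12, 25},
    "Happily disgusted": {10, 12, 25},
    "Sadly fearful": {1, 4, 15, 25},
}

def match_categories(observed_aus):
    """Return categories whose prototypical AUs are all observed.

    Results are ordered from the most specific category (most AUs) to the
    least specific, with ties broken alphabetically. This rule is a
    hypothetical simplification.
    """
    observed = set(observed_aus)
    hits = [name for name, aus in PROTOTYPICAL_AUS.items() if aus <= observed]
    return sorted(hits, key=lambda name: (-len(PROTOTYPICAL_AUS[name]), name))

print(match_categories([1, 2, 12, 25]))   # ['Happily surprised', 'Happy']
print(match_categories([1, 4, 15, 25]))   # ['Sadly fearful', 'Sad']
print(match_categories([6]))              # []
```

The first call shows why compound categories matter: the AUs of "Happy" are a subset of those of "Happily surprised", so both match and the more specific one is listed first.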
A summary of conventional FER approaches. (Detailed information on the databases is given in Section 4.)
| Reference | Emotions Analyzed | Visual Features | Decision Methods | Database |
|---|---|---|---|---|
| Compound emotion [ | Seven emotions and 22 compound emotions | Distribution between each pair of fiducials; appearance defined by Gabor filters | Nearest-mean classifier; kernel subclass discriminant analysis | CE [ |
| EmotioNet [ | 23 basic and compound emotions | Euclidean distances between normalized landmarks; angles between landmarks; Gabor filters centered at the landmark points | Kernel subclass discriminant analysis | CE [ |
| Real-time mobile [ | Seven emotions | Active shape model fitting landmarks; displacement between landmarks | SVM | CK+ [ |
| Ghimire and Lee [ | Seven emotions | Displacement between landmarks in continuous frames | Multi-class AdaBoost; SVM | CK+ [ |
| Global Feature [ | Six emotions | Local binary pattern (LBP) histogram of a face image | Principal component analysis (PCA) | Self-generated |
| Local region specific feature [ | Seven emotions | Appearance of LBP features from specific local regions; geometric normalized central moment features from specific local regions | SVM | CK+ [ |
| IntraFace [ | Seven emotions, 17 AUs detected | Histogram of gradients (HoG) | Linear SVM | CK+ [ |
| 3D facial expression [ | Six prototypical emotions | 3D curve shape and 3D patch shape by analyzing shapes of curves to the shapes of patches | Multiboosting and SVM | BU-3DFE [ |
| Stepwise approach [ | Six prototypical emotions | Stepwise linear discriminant analysis (SWLDA) used to select the localized features from the expression | Hidden conditional random fields (HCRFs) | CK+ [ |
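Two entries in the table above, Euclidean distances between normalized landmarks and a nearest-mean classifier, can be combined into a small sketch. It is only an illustration under assumptions: the four-point "mouth" landmarks, the two labels, and the normalization (centering plus scaling to unit RMS radius) are hypothetical and are not taken from any of the cited systems.

```python
import math
from itertools import combinations

def normalize(landmarks):
    """Center landmarks on their centroid and scale them to unit RMS radius."""
    n = len(landmarks)
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    centered = [(x - cx, y - cy) for x, y in landmarks]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n) or 1.0
    return [(x / scale, y / scale) for x, y in centered]

def distance_features(landmarks):
    """Euclidean distance between every pair of normalized landmarks."""
    points = normalize(landmarks)
    return [math.dist(p, q) for p, q in combinations(points, 2)]

class NearestMean:
    """Assign a sample to the class whose mean feature vector is closest."""

    def fit(self, samples, labels):
        sums, counts = {}, {}
        for feats, label in zip(samples, labels):
            acc = sums.setdefault(label, [0.0] * len(feats))
            for k, value in enumerate(feats):
                acc[k] += value
            counts[label] = counts.get(label, 0) + 1
        self.means = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict(self, feats):
        return min(self.means, key=lambda c: math.dist(self.means[c], feats))

# Hypothetical landmarks: left mouth corner, right mouth corner, upper lip, lower lip.
neutral_a = [(0, 0), (4, 0), (2, 0.5), (2, -0.5)]
neutral_b = [(0, 0), (4, 0), (2, 0.6), (2, -0.6)]
surprised_a = [(0.5, 0), (3.5, 0), (2, 2), (2, -2)]
surprised_b = [(0.6, 0), (3.4, 0), (2, 2.2), (2, -2.2)]

train = [neutral_a, neutral_b, surprised_a, surprised_b]
labels = ["neutral", "neutral", "surprised", "surprised"]
clf = NearestMean().fit([distance_features(s) for s in train], labels)

# An open mouth at a different position and twice the size.
query = [(10, 10), (16, 10), (13, 14), (13, 6)]
print(clf.predict(distance_features(query)))  # surprised
```

The query is a shifted and enlarged copy of `surprised_a`, so normalization maps both to the same feature vector; this is the reason landmark distances are normalized before classification.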
Figure 4. The basic structure of an LSTM, adapted from [50]. (a) One LSTM cell contains four interacting layers: the cell state, an input gate layer, a forget gate layer, and an output gate layer. (b) The repeating module of cells in an LSTM.
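The gates named in the Figure 4 caption can be written out as a single time step. The sketch below uses the commonly taught LSTM formulation (forget, input, and output gates with a tanh candidate), which is assumed here rather than taken from the figure; the parameter names (`Wf`, `Wi`, `Wg`, `Wo`), the sizes, and the random weights are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Compute W @ vec + b with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["Wf"], p["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["Wi"], p["bi"], z)]    # input gate
    g = [math.tanh(a) for a in affine(p["Wg"], p["bg"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(p["Wo"], p["bo"], z)]    # output gate
    # Cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Random, untrained parameters (for illustration only)."""
    rng = random.Random(seed)
    params = {}
    for gate in "figo":
        params["W" + gate] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b" + gate] = [0.0] * hidden_size
    return params

# Run the cell over five frames of three-dimensional per-frame features.
params = init_params(input_size=3, hidden_size=4)
rng = random.Random(1)
frames = [[rng.random() for _ in range(3)] for _ in range(5)]
h, c = [0.0] * 4, [0.0] * 4
for x in frames:
    h, c = lstm_step(x, h, c, params)
print([round(v, 3) for v in h])
```

Applying the same `lstm_step` with the same parameters at every frame is what panel (b) of the figure calls the repeating module.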
Summary of FER systems based on deep learning.
| Reference | Emotions Analyzed | Recognition Algorithm | Database |
|---|---|---|---|
| Hybrid CNN-RNN [ | Seven emotions | Hybrid RNN-CNN framework for propagating information over a sequence; temporal averaging used for aggregation | EmotiW [ |
| Kim et al. [ | Six emotions | Spatial image characteristics of the representative expression-state frames are learned using a CNN; temporal characteristics of the spatial feature representation from the first part are learned using an LSTM | MMI [ |
| Breuer and Kimmel [ | Eight emotions, 50 AU detection | CNN-based feature extraction and inference | CK+ [ |
| Joint Fine-Tuning [ | Seven emotions | Two different models: a CNN for temporal appearance features and a CNN for temporal geometry features from temporal facial landmark points | CK+ [ |
| DRML [ | 12 AUs for BP4D, eight AUs for DISFA | Feed-forward functions to induce important facial regions; learning of weights to capture structural information of the face | DISFA [ |
| Multi-level AU [ | 12 AU detection | Spatial representations are extracted by a CNN; LSTMs model temporal dependencies | BP4D [ |
| 3D Inception-ResNet [ | 23 basic and compound emotions | LSTM unit that jointly extracts the spatial and temporal relations within facial images; facial landmark points are also used as inputs to this network | CK+ [ |
| Candide-3 [ | Six emotions | Used in conjunction with a learned objective function for face model fitting; a recurrent network captures the temporal dependencies present in the image sequences during classification | CK+ [ |
| Multi-angle FER [ | Six emotions | Extraction of the texture patterns and the relevant key features of the facial points; an LSTM-CNN predicts the label for the facial expressions | CK+ [ |
Figure 5. Overview of the general hybrid deep-learning framework for FER. The outputs of the CNNs and LSTMs are further aggregated into a fusion network to produce a per-frame prediction, adapted from [53].
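The aggregation step in the Figure 5 caption can be illustrated with a deliberately small sketch. It assumes that per-frame CNN features and LSTM hidden states are already available, and it stands in for the fusion network with a single linear layer followed by softmax; that choice, the function names, and the toy numbers are assumptions for this example, not the design in [53]. Temporal averaging of the per-frame outputs is added as one simple way to obtain a clip-level label.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_frame(cnn_feat, lstm_state, weights, bias):
    """Concatenate both branches for one frame and map them to class probabilities."""
    joint = cnn_feat + lstm_state  # list concatenation
    logits = [sum(w * v for w, v in zip(row, joint)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def predict_sequence(cnn_feats, lstm_states, weights, bias):
    """Per-frame predictions for a whole sequence."""
    return [fuse_frame(c, h, weights, bias)
            for c, h in zip(cnn_feats, lstm_states)]

def temporal_average(frame_probs):
    """Average per-frame probabilities into one clip-level distribution."""
    n = len(frame_probs)
    return [sum(column) / n for column in zip(*frame_probs)]

# Toy inputs: two frames, one CNN feature and one LSTM state value per frame,
# two classes, and hand-picked (hypothetical) fusion weights.
cnn_feats = [[2.0], [0.0]]
lstm_states = [[0.0], [0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

per_frame = predict_sequence(cnn_feats, lstm_states, weights, bias)
clip = temporal_average(per_frame)
print([[round(p, 3) for p in frame] for frame in per_frame])
print([round(p, 3) for p in clip])
```

In this toy case the first frame favors class 0 and the second frame is undecided, so the averaged clip-level distribution still favors class 0.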
A summary of publicly available databases related to FER.
| Database | Data Configuration | Web Link |
|---|---|---|
| CK+ [ | 593 video sequences on both posed and non-posed (spontaneous) emotions; 123 subjects from 18 to 30 years in age; provides protocols and baseline results for facial feature tracking, action units, and emotion recognition; image resolutions of 640 × 480 and 640 × 490 | |
| CE [ | 5060 images corresponding to 22 categories of basic and compound emotions; 230 human subjects (130 females and 100 males, mean age 23); includes most ethnicities and races; image resolution of 3000 × 4000 | |
| DISFA [ | 130,000 stereo video frames at high resolution; 27 adult subjects (12 females and 15 males); 66 facial landmark points for each image; image resolution of 1024 × 768 | |
| BU-3DFE [ | 3D human faces and facial emotions; 100 subjects in the database, 56 females and 44 males, with about six emotions; 25 3D facial emotion models per subject; image resolution of 1040 × 1329 | |
| JAFFE [ | 213 images of seven facial emotions; ten different female Japanese models; images rated on six emotion adjectives by 60 Japanese subjects; image resolution of 256 × 256 | |
| B+ [ | 16,128 facial images; 28 distinct subjects under 576 viewing conditions; image resolution of 320 × 243 | |
| MMI [ | Over 2900 video sequences and high-resolution still images of 75 subjects; 238 video sequences on 28 subjects, male and female; image resolution of 720 × 576 | |
| BP4D-Spontaneous [ | 3D video database of 41 participants (23 women, 18 men) with spontaneous facial emotions; 11 Asians, six African-Americans, four Hispanics, and 20 Euro-Americans; image resolution of 1040 × 1329 | |
| KDEF [ | 4900 images of human facial expressions of emotion; 70 individuals, each showing seven different emotional expressions from five different angles; image resolution of 562 × 762 | |
Figure 6. Examples of nine representative databases related to FER. Databases (a) through (g) support 2D still images and 2D video sequences, and databases (h) and (i) support 3D video sequences.
Recognition performance with MMI dataset, adapted from [11].
| Type | Brief Description of Main Algorithms | Input | Accuracy (%) |
|---|---|---|---|
| Conventional (handcrafted-feature) FER approaches | Sparse representation classifier with LBP features [ | Still frame | 59.18 |
| | Sparse representation classifier with local phase quantization features [ | Still frame | 62.72 |
| | SVM with Gabor wavelet features [ | Still frame | 61.89 |
| | Sparse representation classifier with LBP from three orthogonal planes [ | Sequence | 61.19 |
| | Sparse representation classifier with local phase quantization feature from three orthogonal planes [ | Sequence | 64.11 |
| | Collaborative expression representation CER [ | Still frame | 70.12 |
| Deep-learning-based FER approaches | Deep learning of deformable facial action parts [ | Sequence | 63.40 |
| | Joint fine-tuning in deep neural networks [ | Sequence | 70.24 |
| | AU-aware deep networks [ | Still frame | 69.88 |
| | AU-inspired deep networks [ | Still frame | 75.85 |
| | Deeper CNN [ | Still frame | 77.90 |
| | CNN + LSTM with spatio-temporal feature representation [ | Sequence | 78.61 |
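To compare the two groups in the MMI table above at a glance, the accuracies can be averaged per group. The short sketch below only does arithmetic on the twelve values listed in the table; the variable names are illustrative.

```python
from statistics import mean

# Accuracies (%) on the MMI dataset, copied from the table above.
conventional = [59.18, 62.72, 61.89, 61.19, 64.11, 70.12]
deep_learning = [63.40, 70.24, 69.88, 75.85, 77.90, 78.61]

conventional_mean = mean(conventional)
deep_learning_mean = mean(deep_learning)

print(f"Conventional mean:  {conventional_mean:.2f}")   # 63.20
print(f"Deep-learning mean: {deep_learning_mean:.2f}")  # 72.65
print(f"Difference:         {deep_learning_mean - conventional_mean:.1f} points")
```

On these twelve results the deep-learning group averages roughly nine percentage points higher, although the strongest conventional method (70.12) still exceeds the weakest deep-learning entries.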