Najmeh Samadiani, Guangyan Huang, Borui Cai, Wei Luo, Chi-Hung Chi, Yong Xiang, Jing He.
Abstract
Facial Expression Recognition (FER) can be widely applied in various research areas, such as mental disease diagnosis and human social/physiological interaction detection. With emerging advanced technologies in hardware and sensors, FER systems have been developed to support real-world application scenarios instead of laboratory environments. Although laboratory-controlled FER systems achieve very high accuracy, around 97%, transferring the technology from the laboratory to real-world applications faces a great barrier of very low accuracy, approximately 50%. In this survey, we comprehensively discuss three significant challenges in unconstrained real-world environments, namely illumination variation, head pose, and subject-dependence, which may not be resolved by analysing images/videos alone in the FER system. We focus on sensors that may provide extra information and help FER systems detect emotion in both static images and video sequences. We introduce three categories of sensors that may help improve the accuracy and reliability of an expression recognition system by tackling the challenges mentioned above in pure image/video processing. The first group is detailed-face sensors, which detect small dynamic changes of a face component; eye-trackers, for example, may help differentiate background noise from facial features. The second is non-visual sensors, such as audio, depth, and EEG sensors, which provide extra information in addition to the visual dimension and improve recognition reliability, for example, under illumination variation and position shifts. The last is target-focused sensors, such as infrared thermal sensors, which can help FER systems filter out useless visual content and may resist illumination variation. We also discuss methods of fusing the different inputs obtained from multimodal sensors in an emotion recognition system. We comparatively review the most prominent multimodal emotional expression recognition approaches and point out their advantages and limitations. We briefly introduce the benchmark datasets related to FER systems for each category of sensors and extend our survey to the open challenges and issues. Meanwhile, we design a framework for an expression recognition system that uses multimodal sensor data (provided by the three categories of sensors) to provide complete information about emotions and assist pure face image/video analysis. We theoretically analyse the feasibility and achievability of our new expression recognition system, especially for use in wild environments, and point out future directions for designing an efficient emotional expression recognition system.
Keywords: emotional expression recognition; facial expression recognition (FER); multimodal sensor data; real-world conditions; spontaneous expression
Year: 2019 PMID: 31003522 PMCID: PMC6514576 DOI: 10.3390/s19081863
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The different stages of the facial expression recognition (FER) system.
Existing FER systems for resolving illumination variation.
| Classifier | Work | Method | Sensor Type | Dataset | Average Accuracy % |
|---|---|---|---|---|---|
| SVM | [ | HOG, dense SIFT features | RGB Video | AFEW 4.0 | 50.4 |
| | [ | STM, UMM features | RGB Video | AFEW 4.0 | 31.73 |
| Deep Network | [ | LBP features | RGB Image | SFEW (real-world) | 54.56 |
| | [ | MLDP-GDA features | Depth Video | 40 videos taken from 10 subjects (Lab-controlled) | 96.2 |
| | [ | Spatio-temporal convolutional neural network | RGB Video | MMI | 59.03 |
| KNN | [ | FFT-CLAHE, MBPC features | RGB Image | SFEW (real-world) | 96.5 |
| | [ | Logarithm-Laplace (LL) domain in DLBP | RGB Video | CK+ | 92.86 |
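The FFT-CLAHE entry above relies on contrast-limited adaptive histogram equalisation (CLAHE) to compensate for uneven lighting before feature extraction. Below is a minimal sketch of the CLAHE step using OpenCV; the cited method additionally applies frequency-domain (FFT) filtering, which is omitted here, and the input filename is hypothetical.

```python
# Minimal illumination-normalisation sketch: CLAHE on the luminance channel
# of a face image, a common preprocessing step before feature extraction.
import cv2
import numpy as np

def normalise_illumination(bgr_image: np.ndarray) -> np.ndarray:
    """Apply CLAHE to the L channel of a BGR face image."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)                      # equalise luminance only
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

if __name__ == "__main__":
    face = cv2.imread("face.jpg")              # hypothetical input image
    if face is not None:
        cv2.imwrite("face_clahe.jpg", normalise_illumination(face))
```

Equalising only the L channel preserves colour information while flattening local lighting differences, which is why CLAHE-style preprocessing appears in several illumination-robust FER pipelines.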
Existing FER systems for resolving the subject-dependence problem.
| Classifier | Work | Method | Sensor Type | Dataset | Average Accuracy % |
|---|---|---|---|---|---|
| Deep Network | [ | Deep spatial-temporal networks | RGB Video | CK+ | 98.5 |
| | [ | MLDP-GDA features | Depth Video | 40 videos taken from 10 subjects (Lab-controlled) | 89.16 |
| CD-MM Learning | [ | 3D-HOG, geometric warp, audio features | RGB Video | AFEW 4.0 (real-world) | 46.8 |
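The deep spatial-temporal networks in the table combine per-frame spatial features with a temporal model over the frame sequence. An illustrative PyTorch sketch of this generic CNN+LSTM pattern follows; it is not the cited architecture, and all layer sizes and input shapes are assumptions.

```python
# Illustrative spatio-temporal FER network: a small per-frame CNN followed
# by an LSTM over the frame sequence, classifying from the last time step.
import torch
import torch.nn as nn

class SpatioTemporalFER(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame spatial features
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),     # -> 32 * 4 * 4 = 512
        )
        self.lstm = nn.LSTM(512, 128, batch_first=True)  # temporal model
        self.head = nn.Linear(128, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 1, H, W) grayscale face crops
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])                   # last-step classification

model = SpatioTemporalFER()
logits = model(torch.randn(2, 16, 1, 64, 64))          # 2 clips of 16 frames
print(logits.shape)                                    # torch.Size([2, 7])
```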
Existing FER systems for resolving the head pose problem.
| Classifier | Work | Method | Sensor Type | Dataset | Average Accuracy % |
|---|---|---|---|---|---|
| SVM | [ | HOG-TOP and geometric features | RGB Video | AFEW 4.0 | 40.2 |
| CD-MM | [ | 3D-HOG, geometric warp, audio features | RGB Video | AFEW 4.0 (real-world) | 46.8 |
| Regression forest | [ | Multiple-label dataset augmentation, non-informative patch | RGB Image | LFW | 94.05 |
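Several entries in these tables pair hand-crafted HOG descriptors with an SVM classifier. A minimal sketch of that pairing using scikit-image and scikit-learn is shown below; the random arrays are placeholders standing in for aligned face crops and labels.

```python
# Minimal HOG + SVM sketch: extract HOG descriptors from face crops and
# train an RBF-kernel SVM over the seven basic emotion classes.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(gray_face: np.ndarray) -> np.ndarray:
    """HOG descriptor for a single grayscale face crop."""
    return hog(gray_face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

rng = np.random.default_rng(0)
faces = rng.random((20, 64, 64))                 # placeholder 64x64 "faces"
labels = rng.integers(0, 7, 20)                  # placeholder emotion labels
X = np.stack([hog_features(f) for f in faces])

clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
print(clf.predict(X[:3]))
```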
Existing FER systems using non-visual sensors.
| Work | Method | Sensor Type | Fusion Technology | Dataset | Average Accuracy % |
|---|---|---|---|---|---|
| [ | Statistical features | RGB Video | Rule-based methodology | 144 videos from two subjects | 72 |
| [ | HOG, dense SIFT features | Audio | Equal-weighted linear fusion technique | AFEW 4.0 | 50.4 |
| [ | MLDP-GDA | Depth Video | No fusion | 40 videos, each including ten frames | 96.25 |
| [ | LDPP, GDA, PCA | Depth Video | No fusion | 40 videos, each including ten frames | 89.16 |
| [ | ERP, fixation distribution patterns | Eye movements | | Radboud Faces Database (RaFD) | AUC > 0.6 |
| [ | HOG-TOP, geometric features | Audio | Multi-kernel SVM | CK+ | 95.7 |
| [ | Collaborative discriminative multi-metric learning | RGB Video | PCA | AFEW 4.0 | 46.8 |
| [ | HOG, AAM | RGB Image | Multichannel feature vector | A merged dataset of three public datasets | 89.46 |
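The equal-weighted linear fusion listed above operates at the decision level: per-class scores from the audio model and the video model are averaged before the final decision. A minimal sketch with placeholder scores:

```python
# Decision-level, equal-weighted linear fusion of audio and video scores:
# average the two per-class score vectors, then take the argmax.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "neutral", "sadness", "surprise"]

def fuse_equal_weight(video_scores: np.ndarray,
                      audio_scores: np.ndarray) -> int:
    """Average the two score vectors and return the winning class index."""
    fused = 0.5 * video_scores + 0.5 * audio_scores
    return int(np.argmax(fused))

video = np.array([0.10, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10])  # placeholder
audio = np.array([0.05, 0.05, 0.10, 0.30, 0.30, 0.10, 0.10])  # placeholder
print(EMOTIONS[fuse_equal_weight(video, audio)])               # "happiness"
```

Decision-level schemes like this degrade gracefully when one modality is noisy (e.g., audio in a quiet scene), which is one reason they are common in real-world FER systems.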
Existing FER systems using target-focused sensors.
| Work | Method | Sensor Type | Fusion Technology | Dataset | Average Accuracy % |
|---|---|---|---|---|---|
| [ | Head motion, AAM, thermal statistical parameters | RGB Image | Multiple genetic algorithms-based fusion method | USTC-NVIE | 63 |
| [ | Sequence features | RGB Image | No fusion | USTC-NVIE | 73 |
| [ | Geometric features | Infrared thermal Image | No fusion | USTC-NVIE | 85.51 |
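The thermal statistical parameters used with USTC-NVIE are typically simple statistics computed over facial regions of the thermal image. A sketch under that assumption follows; the region boxes and the synthetic temperature map are hypothetical, and real systems would locate regions via landmark detection.

```python
# Sketch of "thermal statistical parameters": per-region mean/std/max
# over facial regions of a thermal frame.
import numpy as np

def thermal_stats(thermal: np.ndarray, regions: dict) -> dict:
    """Compute (mean, std, max) per named region given (y0, y1, x0, x1) boxes."""
    feats = {}
    for name, (y0, y1, x0, x1) in regions.items():
        roi = thermal[y0:y1, x0:x1]
        feats[name] = (roi.mean(), roi.std(), roi.max())
    return feats

frame = np.random.default_rng(1).normal(34.0, 0.5, (120, 160))  # fake temp map
rois = {"forehead": (10, 40, 50, 110), "nose": (60, 90, 70, 90)}
for region, (m, s, mx) in thermal_stats(frame, rois).items():
    print(f"{region}: mean={m:.2f} std={s:.2f} max={mx:.2f}")
```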
The camera-based datasets (collected under real-world conditions or lab simulations of real-world conditions).
| Dataset | Main Feature | Capacity | Emotions | Environment | Link |
|---|---|---|---|---|---|
| CAS(ME)2 | Both spontaneous micro- and macro-expressions | 87 images of micro- and macro-expressions | Four | Lab controlled | |
| AFEW 4.0 | Videos from real-world scenarios | 1268 videos | Seven | Real-world | |
| SFEW | Images from real-world scenarios | 700 images | Seven | Real-world | |
| CK+ | The most common lab-controlled dataset | 593 videos | Seven | Lab controlled | |
| CMU Multi-PIE | Simulation of 19 different lighting conditions | 755,370 images | Six | Lab controlled under different illumination variations | |
| Florentine | Many participants for collecting videos | 2777 video clips | Seven | Lab controlled under different illumination variations | Not yet public |
| Autoencoder | The largest introduced real-world dataset | 6.5 million video clips | Seven | Real-world | Not yet public |
| MMI | Dual view in an image | 1520 videos of posed expressions | Six | Lab controlled, including poor illumination conditions | |
| AM-FED | Webcam videos from online viewers | 242 videos of spontaneous expressions | Smile | Real-world | |
| CAS-PEAL | Simulation of various backgrounds in the lab | Images of posed expressions | Five | Lab controlled under various illumination conditions | |
The non-visual, target-focused, and multimodal datasets.
| Dataset | Main Feature | Capacity | Number of Emotions | Environment | Link |
|---|---|---|---|---|---|
| USTC-NVIE | The largest thermal-infrared dataset | Videos of posed and spontaneous expressions | Six | Lab controlled | |
| SPOS | Subjects with different accessories | 231 images of spontaneous and posed expressions | Six | Lab controlled | |
| VAMGS | Visual-audio real-world dataset | 1867 images of spontaneous expressions | Six | Real-world | |
| MMSE | A multimodal dataset | 10 GB of data per subject | Ten | Lab controlled | |
Figure 2. Example datasets: (a) CAS(ME)2, (b) AFEW, and (c) SFEW.
Figure 3. A sample from the MMSE dataset, in which each row from top to bottom shows the 2D image of the individual, the shaded model, the textured model, the thermal image, the physiological signals, and the action units, respectively [116].
Figure 4. Accuracy of different FER systems.
Figure 5. The framework of the automatic FER system assisted by multimodal sensor data.
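The framework of Figure 5 combines the three sensor categories with the face image/video channel. Below is a schematic sketch of one plausible realisation, a weighted decision-level fusion over whichever modalities are available; the modality names, weights, and scores are placeholders, not the authors' implementation.

```python
# Schematic multimodal fusion: weighted sum of per-class score vectors over
# the modalities that are present, normalised by the total weight used.
import numpy as np

def fuse(scores_by_modality: dict, weights: dict) -> np.ndarray:
    """Weighted average of per-class scores over available modalities."""
    total_w = sum(weights[m] for m in scores_by_modality)
    fused = sum(weights[m] * s for m, s in scores_by_modality.items())
    return fused / total_w

weights = {"face_video": 0.5, "detailed_face": 0.2,
           "non_visual": 0.2, "target_focused": 0.1}     # assumed weights
scores = {                                               # placeholder scores
    "face_video":     np.array([0.1, 0.6, 0.3]),
    "non_visual":     np.array([0.2, 0.5, 0.3]),         # e.g., audio/EEG
    "target_focused": np.array([0.1, 0.4, 0.5]),         # e.g., thermal
}
print(np.argmax(fuse(scores, weights)))                  # fused class index
```

Normalising by the total weight of the modalities actually present lets the system keep working when a sensor (here the detailed-face channel) drops out, which matches the survey's motivation of using extra sensors to assist, rather than replace, the face image/video analysis.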