| Literature DB >> 32079319 |
Muhammad Naveed Riaz1, Yao Shen1, Muhammad Sohail2, Minyi Guo1.
Abstract
Facial expression recognition has been well studied for its great importance in the areas of human-computer interaction and social sciences. With the evolution of deep learning, there have been significant advances in this area that also surpass human-level accuracy. Although these methods have achieved good accuracy, they still suffer from two constraints (high computational power and memory), which are incredibly critical for small hardware-constrained devices. To alleviate this issue, we propose a new Convolutional Neural Network (CNN) architecture, eXnet (Expression Net), based on parallel feature extraction, which surpasses current methods in accuracy and contains a much smaller number of parameters (eXnet: 4.57 million, VGG19: 14.72 million), making it more efficient and lightweight for real-time systems. Several modern data augmentation techniques are applied for generalization of eXnet; these techniques improve the accuracy of the network by overcoming the problem of overfitting while keeping the same size. We provide an extensive evaluation of our network against key methods on the Facial Expression Recognition 2013 (FER-2013), Extended Cohn-Kanade (CK+), and Real-world Affective Faces Database (RAF-DB) benchmark datasets. We also perform an ablation evaluation to show the importance of different components of our architecture. To evaluate the efficiency of eXnet on embedded systems, we deploy it on a Raspberry Pi 4B. All these evaluations show the superiority of eXnet for emotion recognition in the wild in terms of accuracy, number of parameters, and size on disk.
Keywords: CK+; CNN; FER; RAF-DB; deep learning; embedded devices; emotion classification
Year: 2020 PMID: 32079319 PMCID: PMC7071079 DOI: 10.3390/s20041087
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1 Visualization of eXnet.
Figure 2 Structure of eXnet: the value in each CBR box (C = Convolutional layer, B = Batch Normalization, R = ReLU) shows the kernel size used in the convolutions; in the pool box, S = stride.
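The CBR abbreviation in the caption (Convolution, then Batch Normalization, then ReLU) is a standard building block and can be sketched in PyTorch. The channel counts and kernel size below are illustrative assumptions, not the paper's published configuration:

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Conv -> BatchNorm -> ReLU block, as abbreviated in Figure 2.

    Channel counts and kernel size are illustrative; the paper's exact
    configuration is not reproduced here.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Example: one 48 x 48 grayscale face image, the input size used in the experiments
x = torch.randn(1, 1, 48, 48)
y = CBR(1, 32)(x)
print(y.shape)  # torch.Size([1, 32, 48, 48])
```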
Settings of parameters used for training of eXnet on all three datasets.
| Dataset | Parameter | Value |
|---|---|---|
| FER-2013 | Size of images used | 48 × 48 |
| | Optimizer | Stochastic Gradient Descent (SGD) |
| | Number of epochs | 200–250 |
| | Batch size | 64 |
| | Learning rate | 0.01 |
| | Momentum | 0.9 |
| | Learning decay | 4e-5 |
| | Cyclical learning rate | Yes |
| CK+ | Size of images used | 48 × 48 |
| | Optimizer | SGD |
| | Number of epochs | 60–100 |
| | Batch size | 64 |
| | Learning rate | 0.01 |
| | Momentum | 0.9 |
| | Learning decay | 4e-5 |
| | Cyclical learning rate | No |
| RAF-DB | Size of images used | 48 × 48 |
| | Optimizer | SGD |
| | Number of epochs | 150–200 |
| | Batch size | 64 |
| | Learning rate | 0.01 |
| | Momentum | 0.9 |
| | Learning decay | 1e-4 |
| | Cyclical learning rate | Yes |
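The settings above map directly onto a standard PyTorch training setup. In the sketch below, the model is a hypothetical stand-in; the optimizer values (SGD, learning rate 0.01, momentum 0.9, learning decay 4e-5 interpreted as weight decay, batch size 64) come from the FER-2013 row, while the cyclical learning-rate bounds are assumed, since the table records only that a cyclical schedule was used:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for eXnet (7 emotion classes, 48x48 grayscale input)
model = nn.Sequential(nn.Flatten(), nn.Linear(48 * 48, 7))

# FER-2013 settings from the table: SGD, lr = 0.01, momentum = 0.9,
# learning decay 4e-5 (interpreted here as weight decay)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=4e-5)

# A cyclical learning rate was enabled for FER-2013 and RAF-DB; the
# base_lr / max_lr / step size below are assumptions, not reported values
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=0.01, step_size_up=2000)

# One illustrative training step on a dummy batch of 64 images
images = torch.randn(64, 1, 48, 48)
labels = torch.randint(0, 7, (64,))

loss = nn.CrossEntropyLoss()(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
print(loss.item())
```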
Ablation evaluation of eXnet.
| Model Variant | Accuracy (%) |
|---|---|
| eXnet after removal of initial two CBR blocks | 67 |
| eXnet after removal of last two CBR blocks | 69 |
| eXnet having convolutions with larger strides | 58 |
| eXnet having convolutions with smaller strides | 65 |
| eXnet having pooling layers | >71 |
| eXnet without dropout layer | 69 |
| eXnet with dropout between fully-connected layers | 71 |
| eXnet with kernel size 32 | 48 |
| eXnet with kernel size 16 | 57 |
| eXnet with kernel size 8 | 68 |
| eXnet after skipping first | 68 |
| eXnet after skipping second | 67 |
| eXnet without using both | 62 |
| eXnet without any fully connected layer | 60 |
| eXnet without fully connected layer 1 | 65 |
| eXnet without fully connected layer 2 | 68 |
| eXnet with Adaptive moment estimation (Adam) optimizer | 70 |
| eXnet with SGD optimizer | 71.67 |
| eXnet with fixed learning rate | 69 |
| eXnet with cyclical learning rate | 71.67 |
Comparison between sizes of eXnet with different kernel sizes in the convolutional layers.
| Model | Parameters |
|---|---|
| eXnet with 5 × 5 and 3 × 3 conv | 20 M |
| eXnet with only 3 × 3 conv | 12 M |
| eXnet with 1 × 1 and 3 × 3 conv | 4.57 M |
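The trend in this table follows from the standard parameter count of a k × k convolution, k²·C_in·C_out (plus C_out bias terms): a 1 × 1 bottleneck followed by a 3 × 3 convolution needs far fewer parameters than a single wide 5 × 5 or 3 × 3 layer. The channel widths in this sketch are assumed for illustration, not reconstructed from eXnet:

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a k x k convolution: k*k*c_in*c_out (+ bias)."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# Illustrative channel widths (assumed, not from the paper)
c_in, c_out = 128, 128

p5 = conv_params(5, c_in, c_out)  # single 5x5 conv
p3 = conv_params(3, c_in, c_out)  # single 3x3 conv
# 1x1 bottleneck down to c_in // 4 channels, then 3x3 back up
p_bottleneck = (conv_params(1, c_in, c_in // 4)
                + conv_params(3, c_in // 4, c_out))

print(p5, p3, p_bottleneck)  # 409728 147584 41120
```

With these (assumed) widths, the 1 × 1 + 3 × 3 pair uses roughly a tenth of the parameters of the 5 × 5 layer, mirroring the 20 M → 4.57 M reduction reported above.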
Comparison of eXnet with benchmark networks on the FER-2013 dataset.
| Model | Accuracy (%) | Params | Size (MB) |
|---|---|---|---|
| VGG [ | 71.29 | 14.72M | 70.94 |
| ResNet [ | 71.12 | 11.17M | 61.18 |
| DenseNet [ | 67.54 | 3.0M | 59.69 |
| DeXpression [ | 68 | 3.54M | 57.14 |
| Liu et al. [ | 61.74 | 84M | - |
| CNN + Support Vector Machine [ | 71 | 4.92M | - |
| Tang [ | 69.4 | 7.17M | - |
| Shao et al. [ | 71.14 | 7.12M | - |
| eXnet (ours) | 71.67 | 4.57M | 36.49 |
| eXnet | 71.92 | 4.57M | 36.49 |
| eXnet | 72.67 | 4.57M | 36.49 |
| eXnet | 73.54 | 4.57M | 36.49 |
Comparison of eXnet with benchmark networks on the CK+ dataset.
| Model | Accuracy (%), 10-Fold Cross-Validation |
|---|---|
| VGG [ | 94.6 |
| ResNet [ | 94 |
| DenseNet [ | 92 |
| DeXpression [ | 96 |
| Tautkute et al. [ | 92 |
| Lopes et al. [ | 92.73 |
| Jain et al. [ | 93.24 |
| Shao et al. [ | 92.86 (without 10-fold cross-validation) |
| Shao et al. [ | 95.29 (without 10-fold cross-validation) |
| eXnet (ours) | 95.63 |
| eXnet | 95.81 |
| eXnet | 96.17 |
| eXnet | 96.75 |
Comparison of eXnet with benchmark networks on the RAF-DB dataset.
| Model | Accuracy (%) |
|---|---|
| VGG [ | 82.39 |
| ResNet [ | 81.71 |
| DenseNet [ | 76.71 |
| DeXpression [ | 76.33 |
| Li et al. DLP+SVM [ | 74.20 |
| Fan et al. [ | 76.73 |
| eXnet (ours) | 84 |
| eXnet | 85.59 |
| eXnet | 85.63 |
| eXnet | 86.37 |
Figure 3 Performance of eXnet on the FER-2013 dataset.
Figure 4 Performance of eXnet on the CK+ dataset.
Figure 5 Performance of eXnet on the RAF-DB dataset.
Time comparison of the proposed model with benchmark networks on Raspberry Pi 4B.
| Model | Time for Emotion Classification |
|---|---|
| VGG [ | 3.92 s |
| ResNet [ | 3.86 s |
| DenseNet [ | 2.09 s |
| eXnet (ours) | 0.998 s |
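Per-image classification times like those above can be measured with a simple wall-clock loop around inference. The model below is a hypothetical stand-in; only the measurement procedure (warm-up pass, then averaging over repeated forward passes under `torch.no_grad()`) is what this sketch illustrates:

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in model; not eXnet itself
model = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(32, 7))
model.eval()

image = torch.randn(1, 1, 48, 48)  # one 48x48 grayscale face

with torch.no_grad():
    model(image)  # warm-up pass, excluded from timing
    start = time.perf_counter()
    for _ in range(10):
        model(image)
    per_image = (time.perf_counter() - start) / 10

print(f"{per_image:.4f} s per image")
```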