Dat Tien Nguyen, Tuyen Danh Pham, Min Beom Lee, Kang Ryoung Park.
Abstract
Face-based biometric recognition systems are widely employed in places such as airports, immigration offices, and companies, as well as in applications such as mobile phones. However, the security of this recognition method can be compromised by attackers (unauthorized persons) who bypass the recognition system using artificial facial images. In addition, most previous studies on face presentation attack detection have utilized only spatial information. To address this problem, we propose a visible-light camera sensor-based presentation attack detection method that uses both spatial and temporal information, combining deep features extracted by a stacked convolutional neural network (CNN)-recurrent neural network (RNN) with handcrafted features. Through experiments on two public datasets, we demonstrate that the temporal information is sufficient for detecting attacks using face images. In addition, it is established that the handcrafted image features efficiently enhance the detection performance of the deep features, and that the proposed method outperforms previous methods.
Keywords: face recognition; handcrafted features; spatial and temporal information; stacked convolutional neural network (CNN)-recurrent neural network (RNN); visible-light camera sensor-based presentation attack detection
Year: 2019 PMID: 30669531 PMCID: PMC6359417 DOI: 10.3390/s19020410
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
A summary of previous studies on face-PAD in comparison with our proposed method.
| Category | Detection Method | Strength | Weakness |
|---|---|---|---|
| Uses still images | Uses handcrafted image features [ | Detection system is simple and easy to implement; can achieve high processing speed | Detection performance is limited because the handcrafted features are designed by humans based on limited observations of the face-PAD problem |
| | Uses deep image features: CNN [ | Uses deep features extracted by a CNN to enhance detection performance | More complex and requires more power and processing time than methods that use only handcrafted image features |
| | Uses a combination of deep and handcrafted image features [ | Uses a very deep CNN to efficiently extract image features; uses an SVM for classification instead of a fully connected layer, which might reduce overfitting; achieves higher detection performance by combining deep and handcrafted image features | More complex and requires more power and processing time than methods that use only handcrafted image features |
| Uses sequence images | Uses a stacked CNN-RNN network to learn the temporal relation between image frames for face-PAD [ | Obtains higher detection performance than previous methods that use only a still image, by learning from more than one image | Complex structure requiring more power and processing time; the CNN is shallow, with only two convolution layers and one fully connected layer |
| | Uses a very deep stacked CNN-RNN to learn the temporal relation between image frames, combining deep and handcrafted image features to enhance detection performance (proposed method) | Uses a very deep CNN to efficiently extract image features as inputs to the RNN; obtains higher detection performance than previous methods by using the very deep CNN-RNN and handcrafted image features | Requires more power and processing time to process a sequence of images |
Figure 1. Working sequence of the proposed method for face-PAD.
Figure 2. Demonstration of our preprocessing step: (a) input face image from the NUAA dataset [16]; (b) face region detected on the input image using the ERT method; (c) face region aligned using the center points of the face and of the left and right eyes; (d) final extracted face region.
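As a concrete illustration of this pipeline, below is a minimal Python sketch of the preprocessing step, assuming dlib's ERT-based 68-point landmark model; the model file name, the crop margins, and the output size of 224 are assumptions here, not values taken from the paper.

```python
# Minimal sketch of the preprocessing in Figure 2, assuming dlib's ERT-based
# 68-point landmark model; crop margins and output size are assumptions.
import dlib
import cv2
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(image, output_size=224):
    """Detect a face, rotate it so the eye line is horizontal, and crop it."""
    rects = detector(image, 1)
    if not rects:
        return None                                # no face detected
    shape = predictor(image, rects[0])
    pts = np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float32)
    left_eye = pts[36:42].mean(axis=0)             # landmarks 36-41: left eye
    right_eye = pts[42:48].mean(axis=0)            # landmarks 42-47: right eye
    dx, dy = right_eye - left_eye
    angle = np.degrees(np.arctan2(dy, dx))
    center = (left_eye + right_eye) / 2.0
    M = cv2.getRotationMatrix2D((float(center[0]), float(center[1])), angle, 1.0)
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    r = rects[0]                                   # crop around the detected box
    x0, y0 = max(r.left(), 0), max(r.top(), 0)
    side = max(r.width(), r.height())
    face = rotated[y0:y0 + side, x0:x0 + side]
    return cv2.resize(face, (output_size, output_size))
```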
Figure 3. Demonstration of an RNN network: (a) a simple RNN cell; (b) structure of a standard LSTM cell.
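For reference, the standard LSTM cell of Figure 3b follows the textbook gate equations (this notation is generic, not taken from the paper):

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```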
Figure 4. General architecture of a stacked CNN-RNN network for temporal image feature extraction.
Detailed description of architecture of the stacked CNN-RNN network in our study.
| Repeat Times | Layer Type | Padding Size | Stride | Filter Size | Number of Filters (Neurons) | Size of Feature Maps | Number of Parameters |
|---|---|---|---|---|---|---|---|
| 1 | Input Layer | n/a | n/a | n/a | n/a | 5 | 0 |
| 2 | Convolution | 1 | 1 | 3 | 64 | 5 | 38,720 |
| | ReLU | n/a | n/a | n/a | n/a | 5 | 0 |
| 1 | Max Pooling | n/a | 2 | 2 | 1 | 5 | 0 |
| 2 | Convolution | 1 | 1 | 3 | 128 | 5 | 221,440 |
| | ReLU | n/a | n/a | n/a | n/a | 5 | 0 |
| 1 | Max Pooling | n/a | 2 | 2 | 1 | 5 | 0 |
| 4 | Convolution | 1 | 1 | 3 | 256 | 5 | 2,065,408 |
| | ReLU | n/a | n/a | n/a | n/a | 5 | 0 |
| 1 | Max Pooling | n/a | 2 | 2 | 1 | 5 | 0 |
| 4 | Convolution | 1 | 1 | 3 | 512 | 5 | 8,259,584 |
| | ReLU | n/a | n/a | n/a | n/a | 5 | 0 |
| 1 | Max Pooling | n/a | 2 | 2 | 1 | 5 | 0 |
| 4 | Convolution | 1 | 1 | 3 | 512 | 5 | 9,439,232 |
| | ReLU | n/a | n/a | n/a | n/a | 5 | 0 |
| 1 | Max Pooling | n/a | 2 | 2 | 1 | 5 | 0 |
| 1 | Global Average Pooling | n/a | n/a | n/a | 1 | 5 | 0 |
| 1 | Fully Connected Layer | n/a | n/a | n/a | 1024 | 5 | 525,312 |
| 1 | Batch Normalization | n/a | n/a | n/a | n/a | 5 | 4096 |
| 1 | ReLU | n/a | n/a | n/a | n/a | 5 | 0 |
| 1 | LSTM | n/a | n/a | n/a | n/a | 1024 | 8,392,704 |
| 1 | Dropout | n/a | n/a | n/a | n/a | 1024 | 0 |
| 1 | Fully Connected Layer | n/a | n/a | n/a | 2 | 2 | 2050 |
| Total number of parameters: 28,948,546 | |||||||
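Below is a minimal PyTorch sketch of this architecture: a VGG-style convolutional front end applied to each of the 5 input frames, followed by an LSTM over the per-frame descriptors. Layer widths follow the table; the dropout rate and input resolution are assumptions, and PyTorch's two-bias LSTM makes the parameter count differ very slightly from the table.

```python
# Sketch of the stacked CNN-RNN in the table above (layer widths from the
# table; dropout rate and input size are assumptions).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, repeats):
    """`repeats` 3x3 convolutions with ReLU, then a 2x2 max pooling."""
    layers = []
    for i in range(repeats):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))
    return layers

class StackedCnnRnn(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            *conv_block(3, 64, 2), *conv_block(64, 128, 2),
            *conv_block(128, 256, 4), *conv_block(256, 512, 4),
            *conv_block(512, 512, 4),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # global average pooling
            nn.Linear(512, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True))
        self.lstm = nn.LSTM(input_size=1024, hidden_size=1024, batch_first=True)
        self.dropout = nn.Dropout(0.5)                      # rate is an assumption
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):                                   # x: (batch, 5, 3, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))                   # per-frame CNN features
        out, _ = self.lstm(feats.view(b, t, -1))            # temporal modeling
        return self.classifier(self.dropout(out[:, -1]))    # last time step
```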
Figure 5. Handcrafted image feature extraction process using the MLBP method: (a) an input face image from the NUAA dataset [16]; (b) formation of the MLBP features of (a) (left: encoded LBP image; right: LBP features).
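A minimal sketch of this multi-level LBP (MLBP) extraction using scikit-image is given below: uniform LBP histograms computed at several (points, radius) settings and concatenated. The specific settings are assumptions, not the paper's configuration.

```python
# Sketch of multi-level LBP (MLBP) features: concatenated uniform-LBP
# histograms over multiple scales; the (points, radius) settings are assumed.
import numpy as np
from skimage.feature import local_binary_pattern

def mlbp_features(gray, settings=((8, 1), (8, 2), (16, 2))):
    feats = []
    for points, radius in settings:
        codes = local_binary_pattern(gray, points, radius, method="uniform")
        # the 'uniform' mapping yields points + 2 distinct code values
        hist, _ = np.histogram(codes, bins=points + 2,
                               range=(0, points + 2), density=True)
        feats.append(hist)
    return np.concatenate(feats)
```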
Figure 6. Feature-level fusion approach.
Figure 7. Score-level fusion approach.
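A minimal sketch of the two fusion schemes in Figures 6 and 7 using scikit-learn SVMs, assuming the deep (CNN-RNN) and handcrafted (MLBP) feature matrices have already been extracted; the RBF kernel and the fusion weight `w` are assumptions. Feature-level fusion (FLF) concatenates the two feature vectors and classifies once; score-level fusion (SLF) classifies each feature set separately and combines the two scores.

```python
# Sketch of feature-level fusion (FLF) and score-level fusion (SLF);
# kernel choice and fusion weight w are assumptions.
import numpy as np
from sklearn.svm import SVC

def train_flf(deep_tr, mlbp_tr, y_tr):
    """FLF: one SVM on the concatenated deep + handcrafted features."""
    return SVC(kernel="rbf", probability=True).fit(
        np.hstack([deep_tr, mlbp_tr]), y_tr)

def score_flf(clf, deep_te, mlbp_te):
    return clf.predict_proba(np.hstack([deep_te, mlbp_te]))[:, 1]

def train_slf(deep_tr, mlbp_tr, y_tr):
    """SLF: one SVM per feature set, scores fused afterwards."""
    clf_deep = SVC(kernel="rbf", probability=True).fit(deep_tr, y_tr)
    clf_mlbp = SVC(kernel="rbf", probability=True).fit(mlbp_tr, y_tr)
    return clf_deep, clf_mlbp

def score_slf(clf_deep, clf_mlbp, deep_te, mlbp_te, w=0.5):
    s_deep = clf_deep.predict_proba(deep_te)[:, 1]
    s_mlbp = clf_mlbp.predict_proba(mlbp_te)[:, 1]
    return w * s_deep + (1.0 - w) * s_mlbp      # weighted-sum rule (assumed)
```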
Parameters of the SGD method used for training the stacked CNN-RNN network in our experiments.
| Mini-Batch Size | Initial Learning Rate | Learning Rate Drop Period (Epochs) | Learning Rate Drop Factor | Number of Training Epochs | Momentum |
|---|---|---|---|---|---|
| 4 | 0.00001 | 2 | 0.1 | 9 | 0.9 |
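A minimal sketch of how these hyperparameters map onto a PyTorch training loop, reusing the `StackedCnnRnn` sketch above; the cross-entropy loss and the `train_loader` (yielding 5-frame sequences in mini-batches of 4) are assumptions.

```python
# Training-loop sketch using the SGD settings in the table above;
# `train_loader` (mini-batches of 4 five-frame sequences) is assumed to exist.
import torch

model = StackedCnnRnn()                            # network sketched earlier
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()            # loss choice is an assumption

for epoch in range(9):                             # 9 training epochs
    for frames, labels in train_loader:            # (4, 5, 3, H, W), (4,)
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # drop LR by 0.1 every 2 epochs
```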
Description of the CASIA dataset used in our study (unit: image sequences).
| CASIA Dataset | Training (20 people): Real Access | Training (20 people): Presentation Attack | Testing (30 people): Real Access | Testing (30 people): Presentation Attack | Total |
|---|---|---|---|---|---|
| Video | 60 | 180 | 90 | 270 | 600 |
| Image Sequence without Data Augmentation | 10,940 | 34,148 | 16,029 | 49,694 | 110,811 |
| Image Sequence with Data Augmentation | 65,640 | 68,296 | 16,029 | 49,694 | 199,659 |
Description of the Replay-mobile dataset used in our study (unit: image sequences).
| Replay-Mobile Dataset | Training (12 people): Real Access | Training (12 people): Presentation Attack | Validation (16 people): Real Access | Validation (16 people): Presentation Attack | Testing (12 people): Real Access | Testing (12 people): Presentation Attack | Total |
|---|---|---|---|---|---|---|---|
| Video | 120 | 192 | 160 | 256 | 110 | 192 | 1030 |
| Image Sequence without Data Augmentation | 35,087 | 56,875 | 47,003 | 75,911 | 32,169 | 56,612 | 303,657 |
| Image Sequence with Data Augmentation | 105,261 | 113,750 | 141,009 | 151,822 | 32,169 | 56,612 | 600,623 |
Figure 8. Convergence graphs (accuracy and loss) of the training procedure on the CASIA dataset.
Detection errors (APCER, BPCER, ACER, and HTER) of our proposed method on the CASIA dataset for three types of presentation attack instrument (PAI) (unit: %).
| Detection Method | Warp-photo APCER | Warp-photo BPCER | Warp-photo ACER | Cut-photo APCER | Cut-photo BPCER | Cut-photo ACER | Video APCER | Video BPCER | Video ACER | Overall APCER | Overall BPCER | Overall ACER | Overall HTER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Using CNN Features [ | 3.975 | 2.770 | 3.373 | 0.643 | 2.770 | 1.7065 | 1.810 | 2.770 | 2.290 | 3.975 | 2.770 | 3.373 | 2.536 |
| Using CNN-RNN Features | 1.531 | 1.385 | 1.458 | 0.331 | 1.385 | 0.858 | 0.831 | 1.385 | 1.108 | 1.531 | 1.385 | 1.458 | 0.954 |
| Using MLBP Features | 9.133 | 10.343 | 9.738 | 10.018 | 10.343 | 10.181 | 9.425 | 10.343 | 9.884 | 10.018 | 10.343 | 10.181 | 9.488 |
| FLF | 3.508 | 0.917 | 2.212 | 0.676 | 0.917 | 0.797 | 1.292 | 0.917 | 1.104 | 3.508 | 0.917 | 2.212 | 1.443 |
| SLF | 1.536 | 1.036 | 1.286 | 0.507 | 1.036 | 0.771 | 0.121 | 1.036 | 0.579 | 1.536 | 1.036 | 1.286 | 0.910 |
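For reference, a minimal sketch of how these error measures are computed (ISO/IEC 30107-3 definitions), given per-sequence scores where larger values indicate real access and a threshold fixed beforehand (e.g. on a development set); the label convention is an assumption.

```python
# Sketch of the reported error measures; labels use 1 = real access,
# 0 = presentation attack (an assumed convention), scores grow with "realness".
import numpy as np

def pad_metrics(scores, labels, threshold):
    scores, labels = np.asarray(scores), np.asarray(labels)
    accepted = scores >= threshold                 # classified as real access
    apcer = accepted[labels == 0].mean()           # attacks wrongly accepted
    bpcer = (~accepted)[labels == 1].mean()        # real users wrongly rejected
    acer = (apcer + bpcer) / 2.0                   # average classification error
    # HTER is the same half-total average, with the threshold fixed on a dev set
    return 100 * apcer, 100 * bpcer, 100 * acer
```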
Figure 9. DET curves of the face-PAD systems using various feature combination approaches with a testing subset of the CASIA dataset.
Description of subsets of the CASIA dataset used in our study (unit: image sequences).
| Dataset Name | Training (20 people): Real Access | Training (20 people): Presentation Attack | Testing (30 people): Real Access | Testing (30 people): Presentation Attack | Total |
|---|---|---|---|---|---|
| Low Quality Dataset | 3140 | 11,019 | 5298 | 16,166 | 35,623 |
| Normal Quality Dataset | 3223 | 11,275 | 4949 | 16,141 | 35,588 |
| High Quality Dataset | 4577 | 11,854 | 5782 | 17,387 | 39,600 |
| Warp-Photo Dataset | 10,940 | 12,871 | 16,029 | 19,271 | 59,111 |
| Cut-Photo Dataset | 10,940 | 9499 | 16,029 | 14,784 | 51,252 |
| Video Display Dataset | 10,940 | 11,778 | 16,029 | 15,639 | 54,386 |
Detection errors (ACERs) of various face-PAD methods using a subset of the CASIA dataset according to the quality and type of presentation attack samples (unit: %).
| Detection Method | Low Quality Dataset | Normal Quality Dataset | High Quality Dataset | Warp-Photo Dataset | Cut-Photo Dataset | Video Display Dataset |
|---|---|---|---|---|---|---|
| Baseline Method [ | 13.0 | 13.0 | 26.0 | 16.0 | 6.0 | 24.0 |
| IQA [ | 31.7 | 22.2 | 5.6 | 26.1 | 18.3 | 34.4 |
| LBP-TOP [ | 10.0 | 12.0 | 13.0 | 6.0 | 12.0 | 10.0 |
| LBP + Fisher Score + SVM [ | 7.2 | 8.8 | 14.4 | 12.0 | 10.0 | 14.7 |
| Patch-based Classification [ | 5.26 | 6.00 | 5.30 | 5.78 | 5.49 | 5.02 |
| LBP of Color Texture Image [ | 7.8 | 10.1 | 6.4 | 7.5 | 5.4 | 8.4 |
| CNN + MLBP [ | 1.834 | 3.950 | 2.210 | 2.054 | 0.545 | 4.835 |
| Proposed Method (FLF) | 2.096 | 3.354 | 1.484 | 1.886 | 0.425 | 1.611 |
| Proposed Method (SLF) | 1.417 | 0.040 | 1.085 | 2.005 | 0.428 | 1.423 |
Comparison of detection error (ACER) of our proposed method with various previous studies (unit: %).
| Baseline Method [ | LBP + Fisher Score + SVM [ | LBP of Color Texture Image [ | Dynamic Local Ternary Pattern [ | Patch-based Classification [ | CNN + MLBP [ | Proposed Method |
|---|---|---|---|---|---|---|
| 17.000 | 13.100 | 6.200 | 5.400 | 5.070 | 1.696 | 1.286 |
Figure 10. Convergence graphs (accuracy and loss) of the training procedure on the Replay-mobile dataset.
Detection errors (APCER, BPCER, ACER, and HTER) of our proposed method with the Replay-mobile dataset using two types of PAI (unit: %).
| Detection Method | EER | Matte-screen APCER | Matte-screen BPCER | Matte-screen ACER | Print APCER | Print BPCER | Print ACER | Overall APCER | Overall BPCER | Overall ACER | Overall HTER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Using CNN Features [ | 0.067 | 0.000 | 0.009 | 0.0045 | 0.000 | 0.009 | 0.0045 | 0.000 | 0.009 | 0.0045 | 0.0045 |
| Using CNN-RNN Features | 0.002 | 0.000 | 0.003 | 0.0015 | 0.000 | 0.003 | 0.0015 | 0.000 | 0.003 | 0.0015 | 0.0015 |
| Using MLBP Features | 4.659 | 8.820 | 1.937 | 5.379 | 2.451 | 1.937 | 2.194 | 8.820 | 1.937 | 5.379 | 5.684 |
| FLF | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| SLF | 0.000 | 0.000 | 0.003 | 0.0015 | 0.000 | 0.003 | 0.0015 | 0.000 | 0.003 | 0.0015 | 0.0015 |
Detection results (APCER, BPCER, ACER, and HTER) of cross-dataset testing (Trained with CASIA; Tested with Replay-mobile) (unit: %).
| Detection Method | Matte-screen APCER | Matte-screen BPCER | Matte-screen ACER | Print APCER | Print BPCER | Print ACER | Overall APCER | Overall BPCER | Overall ACER | Overall HTER |
|---|---|---|---|---|---|---|---|---|---|---|
| FLF | 4.304 | 22.714 | 13.509 | 0.039 | 22.714 | 11.377 | 4.304 | 22.714 | 13.509 | 12.459 |
| SLF | 12.838 | 34.341 | 23.589 | 0.822 | 34.341 | 17.581 | 12.838 | 34.341 | 23.589 | 20.632 |
Detection results (APCER, BPCER, ACER, and HTER) of cross-dataset testing (Trained with Replay-mobile; Tested with CASIA) (unit: %).
| Detection Method | Warp-photo APCER | Warp-photo BPCER | Warp-photo ACER | Cut-photo APCER | Cut-photo BPCER | Cut-photo ACER | Video APCER | Video BPCER | Video ACER | Overall APCER | Overall BPCER | Overall ACER | Overall HTER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FLF | 65.451 | 14.499 | 39.975 | 82.434 | 14.499 | 48.466 | 67.255 | 14.499 | 40.877 | 82.434 | 14.499 | 48.466 | 42.785 |
| SLF | 77.510 | 13.039 | 45.275 | 89.035 | 13.039 | 51.037 | 72.505 | 13.039 | 42.772 | 89.035 | 13.039 | 51.037 | 46.201 |
Comparison of the detection errors (HTER) of our proposed method with those of the previous study by Peng et al. in the cross-dataset setup (unit: %).
| Detection Method | Trained with | Tested with | HTER |
|---|---|---|---|
| Using LBP + GS-LBP [ | CASIA | Replay-mobile | 41.25 |
| | Replay-mobile | CASIA | 48.59 |
| Using LGBP [ | CASIA | Replay-mobile | 51.29 |
| | Replay-mobile | CASIA | 50.04 |
| Using CNN [ | CASIA | Replay-mobile | 21.496 |
| | Replay-mobile | CASIA | 34.530 |
| Our Proposed Method | CASIA | Replay-mobile | 12.459 |
| | Replay-mobile | CASIA | 42.785 |