Se Woon Cho, Na Rae Baek, Min Cheol Kim, Ja Hyung Koo, Jong Hyun Kim, Kang Ryoung Park.
Abstract
Conventional nighttime face detection studies mostly use near-infrared (NIR) or thermal cameras, which are robust to environmental illumination variation and low illumination. However, with an NIR camera it is difficult to adjust the intensity and angle of the additional NIR illuminator according to its distance from the object, and a thermal camera is too expensive to use as a surveillance camera. For these reasons, we propose a deep learning-based nighttime face detection method that uses a single visible-light camera. In a long-distance night image, it is difficult to detect faces directly from the entire image because of noise and image blur. Therefore, we propose a Two-Step Faster region-based convolutional neural network (R-CNN) operating on images preprocessed by histogram equalization (HE). As a two-step scheme, our method sequentially runs a body detector and a face detector, locating the face inside the limited body area. This two-step approach reduces the processing time of Faster R-CNN while maintaining its face detection accuracy. Using a self-constructed database, the Dongguk Nighttime Face Detection database (DNFD-DB1), and an open database from Fudan University, we show that the proposed method outperforms existing face detectors. In addition, the proposed Two-Step Faster R-CNN outperformed a single Faster R-CNN, and HE preprocessing yielded higher nighttime face detection accuracies than no preprocessing.
Keywords: deep learning; nighttime face detection; surveillance camera; visible-light camera
Year: 2018 PMID: 30205500 PMCID: PMC6164007 DOI: 10.3390/s18092995
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
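The HE preprocessing mentioned in the abstract can be sketched in NumPy. This is a minimal grayscale version for illustration only; the paper's exact variant (e.g., per-channel vs. luminance-channel equalization) is not specified here, and constant-intensity images are not handled:

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Classic histogram equalization for an 8-bit grayscale image:
    remap intensities so the cumulative distribution becomes ~uniform."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first nonzero CDF value
    # Build a lookup table mapping each intensity to its equalized value.
    lut = np.clip(
        np.round((cdf - cdf_min) / float(cdf[-1] - cdf_min) * 255), 0, 255
    ).astype(np.uint8)
    return lut[img]
```

A low-contrast night image (intensities bunched in a narrow band) is stretched toward the full 0–255 range, which is the visibility gain the paper exploits before detection.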
Comparison of previous studies and proposed method on face detection.
| Category | Method | Advantages | Disadvantages |
|---|---|---|---|
| Multiple camera-based method | Dual-band system of NIR and SWIR cameras | NIR and SWIR cameras are robust to illumination changes and low light intensity. The algorithm is not complicated because of the image fusion method. | Calibration between cameras is necessary. The intensity and angle of the IR illuminator need to be adjusted according to its distance from the object. |
| Single camera-based method (thermal camera) | Multi-slit method | A thermal camera is robust to illumination changes and low light intensity. Complicated computation is not required. Facial features in thermal images are used. | A thermal camera is expensive. It is difficult to detect faces in an environment where the background and human temperatures are similar. If the position and angle of the camera change, the parameters need to be updated. |
| Single camera-based method (NIR or SWIR camera) | Three adaboost cascades | NIR and SWIR cameras are robust to illumination changes and low light intensity. Three adaboost cascades are used to consider changes in the driver's facial pose. | The intensity and angle of the IR illuminator need to be adjusted according to its distance from the object. |
| Single camera-based method (visible-light camera) | Hybrid skin segmentation | The price of the camera is low. | Performance is low at night, when little color information is available and the noise level is high. Multiple faces cannot be detected. |
| Single camera-based method (visible-light camera) | Image enhancement for face detection | The contrast of the night image is enhanced to increase the visibility of faces. | The noise level increases with increased visibility. Processing time increases due to preprocessing. |
| Single camera-based method (visible-light camera) | Two-Step Faster R-CNN (proposed method) | Accuracy is improved through two-step detection. Deep learning-based features improve detection performance even with high noise or blur. | Training data and time to learn the CNN are required. |
NIR: near-infrared; SWIR: short-wave IR; LBP: local binary pattern; HOG: histogram of oriented gradients; IR: infrared; AMB-LTP: absolute multiblock local ternary pattern; FCN: fully convolutional network; PRO-NPD: promotion normalized pixel difference; GA: genetic algorithm; RASW: run-time adaptive sliding window; R-CNN: region-based convolutional neural network.
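The proposed two-step scheme in the last row can be sketched as follows; `detect_bodies` and `detect_faces` are hypothetical stand-ins for the trained step-1 and step-2 Faster R-CNN detectors:

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

def two_step_detect(image,
                    detect_bodies: Callable[..., List[Box]],
                    detect_faces: Callable[..., List[Box]]) -> List[Box]:
    """Run the face detector only inside each detected body region,
    then map face boxes back to full-image coordinates."""
    faces: List[Box] = []
    for bx1, by1, bx2, by2 in detect_bodies(image):
        # Step 2 searches only the cropped body area, shrinking the search space.
        crop = [row[bx1:bx2] for row in image[by1:by2]]
        for fx1, fy1, fx2, fy2 in detect_faces(crop):
            faces.append((bx1 + fx1, by1 + fy1, bx1 + fx2, by1 + fy2))
    return faces
```

Restricting the second detector to the body crop is what lets the method keep Faster R-CNN's accuracy while cutting its per-image workload.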
Figure 1. Flowchart of the proposed method.
Figure 2. Example images for histogram equalization processing from: (a) the Dongguk Nighttime Face Detection database (DNFD-DB1); and (b) the open database of Fudan University. In (a,b), the left and right images show the original and histogram equalization (HE)-processed images, respectively.
Figure 3. Process flow of the Faster R-CNN network. RPN: region proposal network; ROI: region of interest.
Architecture of the feature extractor in Figure 3.
| Layer Type | Number of Filters | Size of Feature Map (Height × Width × Channel) | Size of Kernel (Height × Width × Channel) | Stride (Height × Width) | Padding (Height × Width) |
|---|---|---|---|---|---|
| Input layer [image] | | 300 × 800 × 3 | | | |
| Conv1_1 (1st convolutional layer) | 64 | 300 × 800 × 64 | 3 × 3 × 3 | 1 × 1 | 1 × 1 |
| Relu1_1 | | 300 × 800 × 64 | | | |
| Conv1_2 (2nd convolutional layer) | 64 | 300 × 800 × 64 | 3 × 3 × 64 | 1 × 1 | 1 × 1 |
| Relu1_2 | | 300 × 800 × 64 | | | |
| Max pooling layer | 1 | 150 × 400 × 64 | 2 × 2 × 1 | 2 × 2 | 0 × 0 |
| Conv2_1 (3rd convolutional layer) | 128 | 150 × 400 × 128 | 3 × 3 × 64 | 1 × 1 | 1 × 1 |
| Relu2_1 | | 150 × 400 × 128 | | | |
| Conv2_2 (4th convolutional layer) | 128 | 150 × 400 × 128 | 3 × 3 × 128 | 1 × 1 | 1 × 1 |
| Relu2_2 | | 150 × 400 × 128 | | | |
| Max pooling layer | 1 | 75 × 200 × 128 | 2 × 2 × 1 | 2 × 2 | 0 × 0 |
| Conv3_1 (5th convolutional layer) | 256 | 75 × 200 × 256 | 3 × 3 × 128 | 1 × 1 | 1 × 1 |
| Relu3_1 | | 75 × 200 × 256 | | | |
| Conv3_2 (6th convolutional layer) | 256 | 75 × 200 × 256 | 3 × 3 × 256 | 1 × 1 | 1 × 1 |
| Relu3_2 | | 75 × 200 × 256 | | | |
| Conv3_3 (7th convolutional layer) | 256 | 75 × 200 × 256 | 3 × 3 × 256 | 1 × 1 | 1 × 1 |
| Relu3_3 | | 75 × 200 × 256 | | | |
| Max pooling layer | 1 | 38 × 100 × 256 | 2 × 2 × 1 | 2 × 2 | 0 × 0 |
| Conv4_1 (8th convolutional layer) | 512 | 38 × 100 × 512 | 3 × 3 × 256 | 1 × 1 | 1 × 1 |
| Relu4_1 | | 38 × 100 × 512 | | | |
| Conv4_2 (9th convolutional layer) | 512 | 38 × 100 × 512 | 3 × 3 × 512 | 1 × 1 | 1 × 1 |
| Relu4_2 | | 38 × 100 × 512 | | | |
| Conv4_3 (10th convolutional layer) | 512 | 38 × 100 × 512 | 3 × 3 × 512 | 1 × 1 | 1 × 1 |
| Relu4_3 | | 38 × 100 × 512 | | | |
| Max pooling layer | 1 | 19 × 50 × 512 | 2 × 2 × 1 | 2 × 2 | 0 × 0 |
| Conv5_1 (11th convolutional layer) | 512 | 19 × 50 × 512 | 3 × 3 × 512 | 1 × 1 | 1 × 1 |
| Relu5_1 | | 19 × 50 × 512 | | | |
| Conv5_2 (12th convolutional layer) | 512 | 19 × 50 × 512 | 3 × 3 × 512 | 1 × 1 | 1 × 1 |
| Relu5_2 | | 19 × 50 × 512 | | | |
| Conv5_3 (13th convolutional layer) | 512 | 19 × 50 × 512 | 3 × 3 × 512 | 1 × 1 | 1 × 1 |
| Relu5_3 | | 19 × 50 × 512 | | | |
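The spatial sizes in this table follow from two rules: every 3 × 3 convolution (stride 1, padding 1) preserves height and width, and every 2 × 2 stride-2 max pooling halves them with rounding up (e.g., 75 → 38). A small sketch of the size calculation:

```python
import math

def vgg_feature_size(height: int, width: int, num_pools: int = 4):
    """Spatial size at Conv5_3: the 3 x 3 convolutions keep the size, and
    each of the four 2 x 2 stride-2 pooling layers halves it, rounding up."""
    for _ in range(num_pools):
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
    return height, width
```

For the 300 × 800 input above this reproduces the 19 × 50 map fed to the RPN.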
Architecture of the region proposal network (RPN) in Figure 3.
| Layer Type | Number of Filters | Size of Feature Map (Height × Width × Channel) | Size of Kernel (Height × Width × Channel) | Stride (Height × Width) | Padding (Height × Width) |
|---|---|---|---|---|---|
| Input layer [Conv5_3] | | 19 × 50 × 512 | | | |
| Conv6 (14th convolutional layer) | 512 | 19 × 50 × 512 | 3 × 3 × 512 | 1 × 1 | 1 × 1 |
| Classification (convolutional layer) | 18 | 19 × 50 × 18 | 1 × 1 × 512 | 1 × 1 | 0 × 0 |
| Regression (convolutional layer) | 36 | 19 × 50 × 36 | 1 × 1 × 512 | 1 × 1 | 0 × 0 |
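The head sizes in this table follow from the nine anchors per feature-map position: 18 = 2 × 9 object/background scores and 36 = 4 × 9 box offsets, and over the 19 × 50 map the RPN scores 19 · 50 · 9 = 8550 candidate anchors. As a check:

```python
def rpn_head_channels(num_anchors: int = 9):
    """Output channels of the two 1 x 1 RPN heads: 2 object/background
    scores and 4 box-regression offsets per anchor."""
    return 2 * num_anchors, 4 * num_anchors

def rpn_total_anchors(feat_height: int, feat_width: int, num_anchors: int = 9):
    """Total anchors evaluated over the whole feature map."""
    return feat_height * feat_width * num_anchors
```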
Architecture of the classifier in Figure 3. (From the ROI pooling layer onward, output sizes are given per set of 300 proposals rather than for the entire input image; * denotes the four coordinates of each proposal (x_min, y_min, x_max, y_max); ** denotes the probabilities of face and background.)
| Layer Type | Size of Output |
|---|---|
| Input layer | |
| [Conv5_3] | 19 × 50 × 512 |
| [region proposals] | 300 × 4 * |
| ROI pooling layer | 7 × 7 × 512 × 300 |
| Fc6 (1st fully connected layer) | 4096 × 300 |
| Relu6 | 4096 × 300 |
| Dropout6 | 4096 × 300 |
| Fc7 (2nd fully connected layer) | 4096 × 300 |
| Relu7 | 4096 × 300 |
| Dropout7 | 4096 × 300 |
| Classification (fully connected layer) | 2 ** × 300 |
| Softmax | 2 × 300 |
| Regression (fully connected layer) | 4 × 300 |
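The ROI pooling layer above maps each proposal to a fixed 7 × 7 × 512 grid regardless of its size. A minimal NumPy sketch of max-based ROI pooling follows; the exact bin-partitioning scheme is an assumption here (Faster R-CNN implementations vary in how they quantize bin boundaries):

```python
import numpy as np

def roi_pool(feature_map: np.ndarray, roi, out_size: int = 7) -> np.ndarray:
    """Max-pool one ROI of an (H, W, C) feature map into an
    out_size x out_size x C grid, as the ROI pooling layer does per proposal."""
    x1, y1, x2, y2 = roi                     # ROI in feature-map coordinates
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape[:2]
    ys = np.linspace(0, h, out_size + 1, dtype=int)  # bin boundaries
    xs = np.linspace(0, w, out_size + 1, dtype=int)
    out = np.empty((out_size, out_size, feature_map.shape[2]), feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Guard against empty bins when the ROI is smaller than the grid.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out
```

The fixed 7 × 7 output is what allows proposals of arbitrary size to feed the fully connected layers Fc6/Fc7.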
Figure 4. Body detection of step-1 Faster R-CNN: (a) input image; and (b) image with body detection results.
Figure 5. Face detection of step-2 Faster R-CNN: (a) upper-body region; and (b) face detection result.
Figure 6. Nine different anchor boxes in Two-Step Faster R-CNN: (a) anchor boxes used in step-1 Faster R-CNN; and (b) anchor boxes used in step-2 Faster R-CNN. (Boxes with the same color in each image have the same scale.)
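The nine anchor boxes per step arise from combining scales and aspect ratios (3 × 3 = 9). A sketch of the standard construction, where each anchor keeps the area of its scale while varying width/height; the scale and ratio values below are illustrative assumptions, not the paper's exact settings:

```python
import itertools

def make_anchors(scales, ratios):
    """Generate (width, height) anchor shapes: one box per (scale, ratio)
    pair, with area ~scale**2 and aspect ratio w/h = ratio."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        anchors.append((w, h))
    return anchors
```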
Description of Dongguk Nighttime Face Detection database (DNFD-DB1).
| DNFD-DB1 | Subset 1 | Subset 2 |
|---|---|---|
| Number of people | 10 | 10 |
| Number of images | 848 | 1154 |
| Number of augmented images | 1696 | 2308 |
| Number of face annotations | 4286 | 5809 |
| Resolution (width × height) (pixels) | 1600 × 600 | |
| Width of face (min–max) (pixels) | 45–80 | |
| Height of face (min–max) (pixels) | 48–86 | |
| Environment of database | Images were obtained with a visible-light camera in a surveillance-camera environment: camera height approximately 2.3 m above the ground, distance to the person approximately 20–22 m, captured at night at approximately 10–20 lux (9–10 p.m.). | |
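The "augmented images" rows above show exactly double the original counts, which is consistent with a single mirroring-style augmentation; whether the paper used horizontal flipping specifically is an assumption here. A sketch of flipping an image together with its box annotations:

```python
def flip_horizontal(image):
    """Mirror an image (list of pixel rows) left-to-right."""
    return [row[::-1] for row in image]

def flip_box(box, image_width):
    """Mirror a face annotation (x_min, y_min, x_max, y_max) to match
    the flipped image; y coordinates are unchanged."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)
```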
Figure 7. Examples of images in DNFD-DB1 used for experiments: (a) images of DNFD-DB1 (the original image is on the left and the HE-processed image is on the right); and (b) upper-body images of DNFD-DB1.
Figure 8. Schematic of the four-step alternating training. (1)–(4) are the steps for learning Faster R-CNN. The feature extractors in Steps (1) and (2) are initialized with the weights of VGG Net-16, pretrained on the ImageNet dataset, using end-to-end learning. The feature extractors in Steps (3) and (4) use the weights of the feature extractor learned in Step (2), and only the RPN and classifier are fine-tuned (the red box indicates a network that is not trained).
Two-fold cross-validation results for body detection at equal error rate (EER) of recall and precision (unit: %).
| Models | DNFD-DB1 Subsets | Recall | Precision | Average Recall | Average Precision |
|---|---|---|---|---|---|
| RPN | 1st fold | 99.16 | 99.16 | 98.34 | 98.34 |
| | 2nd fold | 97.52 | 97.52 | | |
| Step-1 Faster R-CNN | 1st fold | 99.97 | 99.97 | 99.94 | 99.94 |
| | 2nd fold | 99.91 | 99.91 | | |
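In these tables recall equals precision at the reported operating point. Since recall = TP/(TP + FN) and precision = TP/(TP + FP), equality forces #FN = #FP, which is why those columns always coincide at the EER. A small sketch:

```python
def recall_precision(tp: int, fp: int, fn: int):
    """Recall = TP / (TP + FN); precision = TP / (TP + FP). They are equal
    exactly when FP == FN, which defines the EER operating point."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision
```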
Performance comparison between existing methods and the proposed method at the EER points of recall and precision (unit: %). Values are averages over 10 trials, with standard deviations in parentheses; #FP and #FN are the average numbers of false positives and false negatives over the 10 trials.
| Methods | Recall | Precision | #FP | #FN |
|---|---|---|---|---|
| MTCNN | 34.74 (0.0834) | 34.74 (0.0211) | 1579.4 | 1579.4 |
| NPDFace | 44.26 (0.0126) | 44.26 (0.0506) | 1345.4 | 1345.4 |
| Adaboost | 51.88 (0.0143) | 51.88 (0.0225) | 1029.8 | 1029.8 |
| Step-1 Faster R-CNN + Fine-tuned YOLOv2 | 66.36 (0.0363) | 66.36 (0.0182) | 862.5 | 862.5 |
| HR | 86.12 (0.0216) | 86.12 (0.0360) | 338.8 | 338.8 |
| Fine-tuned YOLOv2 | 90.49 (0.0087) | 90.49 (0.0166) | 251.2 | 251.2 |
| Fine-tuned HR | 95.66 (0.0154) | 95.66 (0.0448) | 137.9 | 137.9 |
| Two-Step Faster R-CNN (proposed method) | 99.75 (0.0024) | 99.75 (0.0020) | 6.9 | 6.9 |
Two-fold cross-validation results with and without preprocessing at EER points of recall and precision (unit: %).
| Input Image | DNFD-DB1 Subsets | Recall | Precision | Average Recall | Average Precision |
|---|---|---|---|---|---|
| Original nighttime image | 1st fold | 98.83 | 98.83 | 98.50 | 98.50 |
| | 2nd fold | 98.17 | 98.17 | | |
| HE-processed image | 1st fold | 99.89 | 99.89 | 99.76 | 99.76 |
| | 2nd fold | 99.63 | 99.63 | | |
Two-fold cross-validation results of Two-Step Faster R-CNN and single Faster R-CNN at the EER points of recall and precision (unit: %). #FP and #FN are the average numbers of false positives and false negatives over the two-fold cross-validation.
| Methods | DNFD-DB1 Subsets | Recall | Precision | Average Recall | Average Precision | #FP | #FN |
|---|---|---|---|---|---|---|---|
| Single Faster R-CNN | 1st fold | 79.93 | 79.93 | 79.04 | 79.04 | 2115.9 | 2115.9 |
| | 2nd fold | 78.15 | 78.15 | | | | |
| Two-Step Faster R-CNN | 1st fold | 99.89 | 99.89 | 99.76 | 99.76 | 24.2 | 24.2 |
| | 2nd fold | 99.63 | 99.63 | | | | |
Average performance of existing methods and the proposed method on the open database at the EER points of recall and precision (unit: %). Values are averages, with standard deviations in parentheses.
| Methods | Recall | Precision |
|---|---|---|
| Adaboost | 3.43 (0.0098) | 3.43 (0.0102) |
| NPDFace | 4.18 (0.0177) | 4.18 (0.0348) |
| MTCNN | 16.53 (0.0361) | 16.53 (0.0277) |
| HR | 61.31 (0.0798) | 61.31 (0.0430) |
| Fine-tuned YOLOv2 | 66.23 (0.0255) | 66.23 (0.0462) |
| Fine-tuned HR | 87.41 (0.0234) | 87.41 (0.0463) |
| Two-Step Faster R-CNN (proposed method) | 90.77 (0.0078) | 90.77 (0.0209) |
Figure 9. Examples of body detection results using: (a) step-1 Faster R-CNN; and (b) RPN.
Figure 10. Examples of face detection results using Two-Step Faster R-CNN: (a) test results using the HE-processed image; and (b) test results using the original nighttime image.
Figure 11. Examples of face detection results using Two-Step Faster R-CNN and single Faster R-CNN: (a) test result using Two-Step Faster R-CNN; and (b) test results using single Faster R-CNN.
Figure 12. Nighttime face detection performance of existing methods and the proposed method using DNFD-DB1: (a) true positive rate (TPR) curves according to total FPs; and (b) receiver operating characteristic (ROC) curve of recall and precision.
Figure 13. T-test with the accuracies (EER of: (a) recall; and (b) precision) by our method and the second-best method (fine-tuned HR).
Figure 14. Example images of DNFD-DB1 nighttime face detection using Two-Step Faster R-CNN: (a) correct detection cases; and (b) error cases. (The red box is the face detection box, the blue box is the upper-body detection box, and the green box is the ground truth.)
Figure 15. Examples of images in the open database used for experiments: (a) images of the open database (the image on the left is the original image, and the image on the right is the HE-processed image); and (b) upper-body images of the open database.
Figure 16. Nighttime face detection performance of existing methods and the proposed method using the open database: (a) TPR curves according to total FPs; and (b) ROC curve of recall and precision.
Figure 17. T-test with the accuracies (EER of: (a) recall; and (b) precision) by our method and the second-best method (fine-tuned HR).
Figure 18. Example images of nighttime face detection by Two-Step Faster R-CNN using the open database: (a) correct detection cases; and (b) error cases. (The red box is the face detection box, the blue box is the upper-body detection box, and the green box is the ground truth.)
Comparison of computational performance (average processing time per image) between our method and previous methods (unit: ms).
| Methods | Processing Time |
|---|---|
| MTCNN | 122 |
| NPDFace | 47 |
| Adaboost | 70 |
| HR | 1182 |
| Fine-tuned YOLOv2 | 23 |
| Fine-tuned HR | 1182 |
| Step-1 Faster R-CNN + Fine-tuned YOLOv2 | 98.4 |
| Two-Step Faster R-CNN (proposed method) | 315 |