Jong Hyun Kim, Hyung Gil Hong, Kang Ryoung Park.
Abstract
Because intelligent surveillance systems have recently undergone rapid growth, research on accurately detecting humans in videos captured at a long distance is growing in importance. The existing research using visible light cameras has mainly focused on methods of human detection for daytime hours when there is outside light, but human detection during nighttime hours when there is no outside light is difficult. Thus, methods that employ additional near-infrared (NIR) illuminators and NIR cameras or thermal cameras have been used. However, in the case of NIR illuminators, there are limitations in terms of the illumination angle and distance. There are also difficulties because the illuminator power must be adaptively adjusted depending on whether the object is close or far away. In the case of thermal cameras, their cost is still high, which makes it difficult to install and use them in a variety of places. Because of this, research has been conducted on nighttime human detection using visible light cameras, but it has focused on objects at a short distance in an indoor environment or on video-based methods that capture and process multiple images, which increases the processing time. To resolve these problems, this paper presents a method that uses a single image captured at night by a visible light camera to detect humans in a variety of environments based on a convolutional neural network. Experimental results using a self-constructed Dongguk nighttime human detection database (DNHD-DB1) and two open databases (the Korea Advanced Institute of Science and Technology (KAIST) and Computer Vision Center (CVC) databases) show that the proposed method achieves high-accuracy human detection in a variety of environments and performs well compared to existing methods.
Keywords: convolutional neural network; intelligent surveillance system; nighttime human detection; visible light image
Year: 2017 PMID: 28481301 PMCID: PMC5469670 DOI: 10.3390/s17051065
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Comparison of proposed method and previous study results.
| Category | Type of Camera | Method | Advantages | Disadvantages |
|---|---|---|---|---|
| Multiple camera-based method | Visible light and FIR cameras [ | Spatial-temporal filtering, seeded region growing, and min-max score fusion | Uses data from two cameras to improve human detection accuracy | Correspondence points must be manually set between the two cameras for calibration; requires a sequence of image frames; difficult to use in most normal surveillance environments because of the high cost of FIR cameras; images from two cameras must be processed, which takes a long time, and the processing speed is lowered if many objects are detected |
| Single camera-based methods | IR camera | GMM [ | Uses one camera, which eliminates the need for calibration, and has a faster processing time than multiple camera-based methods | Can only be used in a fixed-camera environment [ ; if an NIR camera is used, an additional NIR illuminator is needed, whose illumination angle and distance are limited and whose power must be adaptively adjusted for near and far objects [ ; FIR cameras are expensive, and their image resolution is much lower than that of visible light cameras, so few features can be captured in a human area during human detection at a long distance [ |
| | Visible light camera | Local change in contrast over time [ | Uses a low-cost visible light camera | Can only be used in a fixed-camera environment; has trouble detecting humans who are standing still; must use continuous video frames, which requires a fast capture speed, and processes multiple images, which increases the processing time |
| | | Histogram processing or intensity mapping-based image enhancement [ | Uses a low-cost visible light camera | Has only produced experimental results for raising visibility through image enhancement, with no results for human detection in nighttime images |
| | | Denoising and image enhancement [ | Effectively removes noise that occurs during image enhancement | Denoising requires many operations, so the processing time is long compared to histogram methods; has only produced experimental results for raising visibility through image enhancement, with no results for human detection in nighttime images |
| | | CNN (proposed method) | Processes single images independently, so even stationary objects can be detected; can be used with moving or fixed cameras | Adequate data and time are required to train the CNN |
Figure 1. Flowchart of the proposed method.
Proposed CNN architecture used in our research.
| Layer Type | Number of Filters | Size of Feature Map | Size of Kernel | Stride | Padding |
|---|---|---|---|---|---|
| Image input layer | | 183 (height) × 119 (width) × 3 (channels) | | | |
| 1st convolutional layer | 96 | 87 × 55 × 96 | 11 × 11 × 3 | 2 × 2 | 0 × 0 |
| ReLU layer | | 87 × 55 × 96 | | | |
| Cross-channel normalization layer | | 87 × 55 × 96 | | | |
| Max pooling layer | 1 | 43 × 27 × 96 | 3 × 3 | 2 × 2 | 0 × 0 |
| 2nd convolutional layer | 128 | 43 × 27 × 128 | 5 × 5 × 96 | 1 × 1 | 2 × 2 |
| ReLU layer | | 43 × 27 × 128 | | | |
| Cross-channel normalization layer | | 43 × 27 × 128 | | | |
| Max pooling layer | 1 | 21 × 13 × 128 | 3 × 3 | 2 × 2 | 0 × 0 |
| 3rd convolutional layer | 256 | 21 × 13 × 256 | 3 × 3 × 128 | 1 × 1 | 1 × 1 |
| ReLU layer | | 21 × 13 × 256 | | | |
| 4th convolutional layer | 256 | 21 × 13 × 256 | 3 × 3 × 256 | 1 × 1 | 1 × 1 |
| ReLU layer | | 21 × 13 × 256 | | | |
| 5th convolutional layer | 128 | 21 × 13 × 128 | 3 × 3 × 256 | 1 × 1 | 1 × 1 |
| ReLU layer | | 21 × 13 × 128 | | | |
| Max pooling layer | 1 | 10 × 6 × 128 | 3 × 3 | 2 × 2 | 0 × 0 |
| 1st fully connected layer | | 4096 | | | |
| ReLU layer | | 4096 | | | |
| 2nd fully connected layer | | 1024 | | | |
| ReLU layer | | 1024 | | | |
| Dropout layer | | 1024 | | | |
| 3rd fully connected layer | | 2 | | | |
| Softmax layer | | 2 | | | |
| Classification layer (output layer) | | 2 | | | |
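The feature-map sizes in the table can be checked with the standard output-size formula for convolution and pooling layers, ⌊(N − K + 2P)/S⌋ + 1, applied to height and width independently. A minimal sketch (plain Python; kernel, stride, and padding values copied from the table) that reproduces the spatial dimensions:

```python
# Spatial output size of a convolution or pooling layer:
# floor((n - k + 2*p) / s) + 1.
def out_size(n, k, s, p):
    return (n - k + 2 * p) // s + 1

def stack_sizes(h, w, layers):
    sizes = []
    for k, s, p in layers:
        h, w = out_size(h, k, s, p), out_size(w, k, s, p)
        sizes.append((h, w))
    return sizes

# (kernel, stride, padding) of each size-changing layer in the table:
layers = [
    (11, 2, 0),  # 1st convolutional layer
    (3, 2, 0),   # max pooling
    (5, 1, 2),   # 2nd convolutional layer
    (3, 2, 0),   # max pooling
    (3, 1, 1),   # 3rd convolutional layer
    (3, 1, 1),   # 4th convolutional layer
    (3, 1, 1),   # 5th convolutional layer
    (3, 2, 0),   # max pooling
]

sizes = stack_sizes(183, 119, layers)
print(sizes)
# [(87, 55), (43, 27), (43, 27), (21, 13), (21, 13), (21, 13), (21, 13), (10, 6)]
```

The result matches the table column by column, ending with the 10 × 6 × 128 map that feeds the first fully connected layer.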
Figure 2. Proposed CNN architecture.
Figure 3. Examples from the three databases used in the experiments: (a) collected images of DNHD-DB1; (b) human and background images of DNHD-DB1; (c) collected image of the KAIST database; (d) human and background images of the KAIST database; (e) collected image of the CVC-14 database; and (f) human and background images of the CVC-14 database. In (a,c,e), human areas are shown as red dashed boxes. In (b,d,f), the left three images are human images, whereas the right three are background images.
Databases used in this research.
| | | DNHD-DB1 | CVC-14 Database | KAIST Database |
|---|---|---|---|---|
| Number of images | Human | 19,760 | 36,920 | 37,336 |
| | Background | 19,760 | 36,920 | 37,336 |
| Number of channels | | Color (3 channels) | Gray (1 channel) | Color (3 channels) |
| Width of human (background) image (min–max, pixels) | | 15–219 | 64 | 21–106 |
| Height of human (background) image (min–max, pixels) | | 45–313 | 128 | 27–293 |
| Environment of database collection | | Image acquisition using a static camera in a surveillance environment; camera height from the ground was about 6–10 m; collected at various places (at 8–10 p.m.) | Image acquisition using a camera mounted on the roof of a car while driving | Image acquisition using a camera mounted on the roof of a car while driving |
Figure 4. Examples of loss and accuracy curves with training data of four-fold cross validation when using the combined database images: (a) 1st fold; (b) 2nd fold; (c) 3rd fold; and (d) 4th fold.
Figure 5. Examples of 96 filters obtained from the 1st convolution layer through training: (a) CVC-14 database; (b) DNHD-DB1; (c) KAIST database; and (d) combined database. In (a–d), the left four images show the 96 filters obtained by training with the original images, whereas the right four images represent those obtained by training with the HE images. In the left and right four images, the left-upper, right-upper, left-lower, and right-lower images show the 96 filters obtained by training using 1-, 2-, 3-, and 4-fold cross validation, respectively.
Confusion matrix of recognition accuracies with original images by four-fold cross validation: (a–d) CVC-14 database; (e–h) DNHD-DB1; and (i–l) KAIST database (unit: %).
| | Actual Class | Predicted: Human | Predicted: Background |
|---|---|---|---|
| (a) | Human | 96.59 | 3.41 |
| | Background | 0.27 | 99.73 |
| (b) | Human | 97.11 | 2.89 |
| | Background | 0.22 | 99.78 |
| (c) | Human | 96.27 | 3.73 |
| | Background | 0.38 | 99.62 |
| (d) | Human | 97.27 | 2.73 |
| | Background | 0.21 | 99.79 |
| (e) | Human | 96.36 | 3.64 |
| | Background | 2.61 | 97.39 |
| (f) | Human | 95.99 | 4.01 |
| | Background | 7.01 | 92.99 |
| (g) | Human | 96.82 | 3.18 |
| | Background | 4.54 | 95.46 |
| (h) | Human | 96.68 | 3.32 |
| | Background | 2.90 | 97.10 |
| (i) | Human | 92.63 | 7.37 |
| | Background | 0.14 | 99.86 |
| (j) | Human | 83.02 | 16.98 |
| | Background | 0.32 | 99.68 |
| (k) | Human | 86.16 | 13.84 |
| | Background | 0.50 | 99.50 |
| (l) | Human | 95.02 | 4.98 |
| | Background | 0.25 | 99.75 |
Confusion matrix of recognition accuracies with HE-processed images by four-fold cross validation: (a–d) CVC-14 database; (e–h) DNHD-DB1; and (i–l) KAIST database (unit: %).
| | Actual Class | Predicted: Human | Predicted: Background |
|---|---|---|---|
| (a) | Human | 96.08 | 3.92 |
| | Background | 0.26 | 99.74 |
| (b) | Human | 98.88 | 1.12 |
| | Background | 0.31 | 99.69 |
| (c) | Human | 95.99 | 4.01 |
| | Background | 0.46 | 99.54 |
| (d) | Human | 96.42 | 3.58 |
| | Background | 0.31 | 99.69 |
| (e) | Human | 92.55 | 7.45 |
| | Background | 3.73 | 96.27 |
| (f) | Human | 97.69 | 2.31 |
| | Background | 4.80 | 95.20 |
| (g) | Human | 97.17 | 2.83 |
| | Background | 6.34 | 93.66 |
| (h) | Human | 97.15 | 2.85 |
| | Background | 3.16 | 96.84 |
| (i) | Human | 93.03 | 6.97 |
| | Background | 0.14 | 99.86 |
| (j) | Human | 96.19 | 3.81 |
| | Background | 0.11 | 99.89 |
| (k) | Human | 97.98 | 2.02 |
| | Background | 0.13 | 99.87 |
| (l) | Human | 96.50 | 3.50 |
| | Background | 0.14 | 99.86 |
Figure 6. ROC curves of human and background detection: (a) with original images; and (b) with HE-processed images. In each part, the top plot shows the ROC curves over partial ranges of FPR and TPR, and the bottom plot shows them over the full ranges.
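The ROC curves above plot the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold on the CNN's human-class score is varied. A minimal sketch of how such operating points are computed, using made-up scores and labels (the actual classifier scores are not given in this record):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the decision threshold sweeps over the scores.
    labels: 1 = human, 0 = background."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Made-up human-class scores and ground-truth labels (illustrative only):
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels))
```

Lowering the threshold moves the operating point toward (1, 1); a curve hugging the top-left corner indicates better separation of human and background patches.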
Average testing accuracies of human and background detection with original images (unit: %).
| Database | PPV | TPR | ACC | F_Score |
|---|---|---|---|---|
| CVC-14 database | 99.72 | 96.81 | 98.27 | 98.24 |
| DNHD-DB1 | 95.73 | 96.46 | 96.09 | 96.09 |
| KAIST database | 99.64 | 89.21 | 94.60 | 94.07 |
| Average | 98.36 | 94.16 | 96.32 | 96.13 |
Average testing accuracies of human and background detection with images processed by HE (unit: %).
| Database | PPV | TPR | ACC | F_Score |
|---|---|---|---|---|
| CVC-14 database | 99.65 | 96.84 | 98.25 | 98.23 |
| DNHD-DB1 | 95.48 | 96.14 | 95.81 | 95.79 |
| KAIST database | 99.85 | 95.93 | 97.95 | 97.84 |
| Average | 98.33 | 96.30 | 97.34 | 97.29 |
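PPV, TPR, ACC, and F_Score in the tables above are the standard confusion-matrix metrics. A minimal sketch of their definitions; because each test set here is balanced (equal numbers of human and background images), the percentage entries of a confusion matrix can be used directly as counts, shown for fold (a) of the CVC-14 original-image matrix:

```python
def metrics(tp, fn, fp, tn):
    """PPV, TPR, ACC, and F-score from confusion-matrix entries."""
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    tpr = tp / (tp + fn)                   # true positive rate (recall)
    acc = (tp + tn) / (tp + fn + fp + tn)  # accuracy
    f = 2 * ppv * tpr / (ppv + tpr)        # harmonic mean of PPV and TPR
    return ppv, tpr, acc, f

# Fold (a) of the CVC-14 original-image confusion matrix (unit: %):
ppv, tpr, acc, f = metrics(tp=96.59, fn=3.41, fp=0.27, tn=99.73)
print(f"PPV={ppv:.2%} TPR={tpr:.2%} ACC={acc:.2%} F={f:.2%}")
# PPV=99.72% TPR=96.59% ACC=98.16% F=98.13%
```

Note that the table rows report averages over the four folds, so a single fold's metrics need not match them exactly.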
Figure 7. Examples of FN and FP errors in cases using original images of (a) the CVC-14 database; (b) DNHD-DB1; and (c) the KAIST database. In (a–c), the left two images show the FN cases, whereas the right two images represent the FP cases.
Figure 8. Examples of FN and FP errors in cases using HE-processed images of (a) the CVC-14 database; (b) DNHD-DB1; and (c) the KAIST database. In (a–c), the left two images show the FN cases, whereas the right two images represent the FP cases.
Average testing accuracies of human and background detection with the combined database (unit: %).
| Input Image Type | PPV | TPR | ACC | F_Score |
|---|---|---|---|---|
| Original | 99.31 | 93.44 | 96.44 | 96.26 |
| HE | 99.11 | 97.65 | 98.41 | 98.38 |
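The HE condition in the tables refers to histogram equalization applied to the input images before classification. The exact preprocessing implementation used in the paper is not reproduced here; the following is a textbook sketch of global histogram equalization on 8-bit grayscale values, using the standard cumulative-histogram mapping:

```python
def equalize(pixels, levels=256):
    """Global histogram equalization for a flat list of 8-bit gray values."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, total = [], 0          # cumulative distribution function
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    # Classic mapping: stretch the CDF over the full intensity range.
    lut = [round((c - cdf_min) / (n - cdf_min) * (levels - 1)) if n > cdf_min else 0
           for c in cdf]
    return [lut[p] for p in pixels]

# A dark, low-contrast patch is stretched across the full 0-255 range:
dark = [50, 50, 51, 51, 52, 52, 53, 53]
print(equalize(dark))
# [0, 0, 85, 85, 170, 170, 255, 255]
```

In practice, library routines such as OpenCV's `equalizeHist` implement the same mapping on whole images; the contrast stretch it provides is consistent with the accuracy gain reported for HE inputs above.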
Figure 9. ROC curves of human and background detection by the proposed and previous methods [16,42]. The top plot shows the ROC curves over partial ranges of FPR and TPR, and the bottom plot shows them over the full ranges.
Comparison of average testing accuracies of the previous and proposed methods on the combined database (unit: %).
| Methods | PPV | TPR | ACC | F_Score |
|---|---|---|---|---|
| HOG-SVM [ | 96.56 | 98.40 | 97.48 | 97.47 |
| Proposed method | 99.11 | 97.65 | 98.41 | 98.38 |