Dat Tien Nguyen, Ki Wan Kim, Hyung Gil Hong, Ja Hyung Koo, Min Cheol Kim, Kang Ryoung Park.
Abstract
Extracting powerful image features plays an important role in computer vision systems. Many methods have been proposed to extract image features for various computer vision applications, such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), local binary patterns (LBP), the histogram of oriented gradients (HOG), and weighted HOG. Recently, convolutional neural networks (CNNs) have been used for image feature extraction and classification in a variety of computer vision applications. In this research, we propose a new gender recognition method for recognizing males and females in observation scenes of surveillance systems, based on feature extraction from visible-light and thermal camera videos through CNN. Experimental results confirm the superiority of the proposed method over state-of-the-art gender recognition methods that use human body images.
Keywords: convolutional neural network; gender recognition; human body images; visible-light and thermal camera videos
Year: 2017 PMID: 28335510 PMCID: PMC5375923 DOI: 10.3390/s17030637
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Summary of previous studies on body-image-based gender recognition.
| Categories | Methods | Strength | Weakness |
|---|---|---|---|
| Using a pre-designed (hand-designed) feature extractor for extracting image features | Using gait or 3D shape information [ | High recognition accuracy can be obtained. | Requires a series of human body images. Requires the cooperation of users in the image acquisition step. Uses a predesigned feature extraction method (features). Requires an expensive capturing device (scanner) to obtain 3D information of the human body [ |
| | Using the HOG or BIFs feature extraction method on only single visible-light images [ | Easy to implement. | Limits recognition accuracy because of the use of predesigned and weak feature extraction methods (HOG and BIFs features). |
| | Using the HOG feature in combined visible-light and thermal images [ | Easy to implement. Enhances recognition accuracy by utilizing both visible-light and thermal images of the human body. | |
| | Using a weighted HOG feature in combined visible-light and thermal images [ | Compensates for the effects of background regions on recognition accuracy by applying weight values to HOG features. Enhances recognition accuracy by utilizing both visible-light and thermal images of the human body. | Limits recognition accuracy because of the use of a predesigned and weak feature extraction method (weighted HOG feature). |
| Using a learning-based feature extractor for extracting image features ( | Learns the feature extractor using CNN for extracting image features. | Extracts image features more suitable for recognition using a pre-trained feature extractor model based on CNN. Higher recognition accuracy can be obtained compared to predesigned feature extractor methods such as HOG, BIFs, or weighted HOG. | Needs training time to train the feature extractor (CNN model). |
Figure 1. Overview of our proposed method for gender recognition using CNN for image feature extraction.
Figure 2. Design architecture of our CNN for gender recognition using visible-light or thermal images.
Detailed structure description of our proposed CNN method for the gender recognition problem.
| Layer Name | Number of Filters | Filter Size | Stride Size | Padding Size | Window Channel Size | Dropout Probability | Output Size |
|---|---|---|---|---|---|---|---|
| Input Layer | n/a | n/a | n/a | n/a | n/a | n/a | 183 × 119 × 1 |
| Convolutional Layer 1 | 96 | 11 × 11 × 1 | 2 × 2 | 0 | n/a | n/a | 87 × 55 × 96 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 87 × 55 × 96 |
| Cross-Channel Normalization Layer | n/a | n/a | n/a | n/a | 5 | n/a | 87 × 55 × 96 |
| MAX Pooling Layer 1 | 1 | 3 × 3 | 2 × 2 | 0 | n/a | n/a | 43 × 27 × 96 |
| Convolutional Layer 2 | 128 | 5 × 5 × 96 | 1 × 1 | 2 × 2 | n/a | n/a | 43 × 27 × 128 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 43 × 27 × 128 |
| Cross-Channel Normalization Layer | n/a | n/a | n/a | n/a | 5 | n/a | 43 × 27 × 128 |
| MAX Pooling Layer 2 | 1 | 3 × 3 | 2 × 2 | 0 | n/a | n/a | 21 × 13 × 128 |
| Convolutional Layer 3 | 256 | 3 × 3 × 128 | 1 × 1 | 1 × 1 | n/a | n/a | 21 × 13 × 256 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 21 × 13 × 256 |
| Convolutional Layer 4 | 256 | 3 × 3 × 256 | 1 × 1 | 1 × 1 | n/a | n/a | 21 × 13 × 256 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 21 × 13 × 256 |
| Convolutional Layer 5 | 128 | 3 × 3 × 256 | 1 × 1 | 1 × 1 | n/a | n/a | 21 × 13 × 128 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 21 × 13 × 128 |
| MAX Pooling Layer 5 | 1 | 3 × 3 | 2 × 2 | 0 | n/a | n/a | 10 × 6 × 128 |
| Fully Connected Layer 1 | n/a | n/a | n/a | n/a | n/a | n/a | 4096 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 4096 |
| Fully Connected Layer 2 | n/a | n/a | n/a | n/a | n/a | n/a | 1024 |
| Rectified Linear Unit | n/a | n/a | n/a | n/a | n/a | n/a | 1024 |
| Dropout Layer | n/a | n/a | n/a | n/a | n/a | 50% | 1024 |
| Fully Connected Layer 3 | n/a | n/a | n/a | n/a | n/a | n/a | 2 |
| Softmax Layer | n/a | n/a | n/a | n/a | n/a | n/a | 2 |
| Classification Layer | n/a | n/a | n/a | n/a | n/a | n/a | 2 |
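As a sanity check on the layer table above, the spatial output sizes follow the standard convolution/pooling size formula, out = floor((in - filter + 2·pad) / stride) + 1. The short script below (not from the paper) recomputes the sizes in the table:

```python
# Sketch: recompute the spatial output sizes of the CNN table using the
# standard convolution/pooling formula out = (in - filter + 2*pad)//stride + 1.

def out_size(in_hw, filt, stride, pad):
    """Spatial (height, width) after a conv/pool layer with square filters."""
    return tuple((d - filt + 2 * pad) // stride + 1 for d in in_hw)

hw = (183, 119)              # input image size from the table
hw = out_size(hw, 11, 2, 0)  # Convolutional Layer 1: 11x11, stride 2, pad 0
assert hw == (87, 55)
hw = out_size(hw, 3, 2, 0)   # MAX Pooling Layer 1: 3x3 window, stride 2
assert hw == (43, 27)
hw = out_size(hw, 5, 1, 2)   # Convolutional Layer 2: 5x5, stride 1, pad 2
assert hw == (43, 27)
hw = out_size(hw, 3, 2, 0)   # MAX Pooling Layer 2
assert hw == (21, 13)
hw = out_size(hw, 3, 1, 1)   # Convolutional Layers 3-5: 3x3, pad 1 (size-preserving)
assert hw == (21, 13)
hw = out_size(hw, 3, 2, 0)   # MAX Pooling Layer 5
assert hw == (10, 6)
print(hw)  # final feature map is 10 x 6 (x 128 channels) before the FC layers
```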
Figure 3. Feature-level fusion combination method for gender recognition using visible-light and thermal images of the human body.
Figure 4. Score-level fusion combination method for gender recognition using visible-light and thermal images of the human body.
Description of the self-collected database used in our experiments (10 visible-light images/person and 10 corresponding thermal images/person).
| Database | Males | Females | Total |
|---|---|---|---|
| Number of persons | 254 | 158 | 412 (persons) |
| Number of images | 5080 | 3160 | 8240 (images) |
Figure 5. Sample images in our self-collected database: (a) thermal-visible image pairs of male persons; and (b) thermal-visible image pairs of female persons.
Description of the training and testing sub-databases and the corresponding augmented databases for our experiments.
| Database | | | Males | Females | Total |
|---|---|---|---|---|---|
| Augmented database | Learning database | Number of persons | 204 (persons) | 127 (persons) | 331 (persons) |
| | | Number of images | 73,440 (204 × 20 × 18 images) | 76,200 (127 × 20 × 30 images) | 149,640 (images) |
| | Testing database | Number of persons | 50 (persons) | 31 (persons) | 81 (persons) |
| | | Number of images | 18,000 (50 × 20 × 18 images) | 18,600 (31 × 20 × 30 images) | 36,600 (images) |
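The image counts in the augmented-database table follow directly from the per-class factors shown in parentheses: 20 original images per person (10 visible-light plus 10 thermal) times 18 augmented variants per male image and 30 per female image, which roughly balances the two classes. A quick arithmetic check (not the authors' code):

```python
# Sketch: reproduce the augmented image counts from the table above.

def augmented_count(persons, images_per_person, variants):
    # persons * 20 original images each * augmentation factor per image
    return persons * images_per_person * variants

males = augmented_count(204, 20, 18)    # learning set, males
females = augmented_count(127, 20, 30)  # learning set, females
print(males, females, males + females)  # 73440 76200 149640
```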
Recognition accuracy (EER) of a recognition system that uses only visible-light or thermal images for the recognition problem using CNN (unit: %).
| Accuracies of Recognition Systems Using Single Image Types | | | | | |
|---|---|---|---|---|---|
| Using only Visible-Light Images | | | Using only Thermal Images | | |
| EER | FAR | GAR | EER | FAR | GAR |
| 17.216 | 10.000 | 71.589 | 16.610 | 10.000 | 72.099 |
| | 15.000 | 80.299 | | 15.000 | 81.350 |
| | 20.000 | 85.327 | | 20.000 | 87.285 |
| | 25.000 | 88.532 | | 25.000 | 91.457 |
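EER, FAR, and GAR in these tables are the standard biometric error measures: sweeping a decision threshold over the classifier scores, FAR is the fraction of wrong-class (impostor) scores accepted, GAR the fraction of correct-class (genuine) scores accepted, and the EER is the error rate at the threshold where FAR equals the rejection rate 1 − GAR. A minimal sketch on toy, hypothetical scores (not the paper's data):

```python
# Sketch: compute FAR/GAR at a threshold, and the EER by threshold sweep.

def far_gar(genuine, impostor, threshold):
    far = sum(s >= threshold for s in impostor) / len(impostor)
    gar = sum(s >= threshold for s in genuine) / len(genuine)
    return far, gar

def eer(genuine, impostor, steps=1000):
    # Find the threshold where FAR is closest to FRR (= 1 - GAR) and
    # return the average of the two error rates there.
    best = min(
        (abs(far - (1 - gar)), (far + (1 - gar)) / 2)
        for far, gar in (far_gar(genuine, impostor, t / steps)
                         for t in range(steps + 1))
    )
    return best[1]

genuine = [0.9, 0.8, 0.75, 0.6, 0.55]   # toy same-class scores
impostor = [0.5, 0.4, 0.35, 0.2, 0.1]   # toy other-class scores
print(eer(genuine, impostor))  # 0.0 -> the toy data is perfectly separable
```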
Recognition accuracy (EER) of a recognition system that uses only visible-light or thermal images for the recognition problem using SVM and CNN features without PCA (unit: %).
| SVM Kernel | Accuracies of Recognition Systems Using Single Image Types | | | | | |
|---|---|---|---|---|---|---|
| | Using only Visible-Light Images | | | Using only Thermal Images | | |
| | EER | FAR | GAR | EER | FAR | GAR |
| Linear | 17.379 | 10.000 | 72.541 | 16.560 | 10.000 | 73.228 |
| | | 15.000 | 80.085 | | 15.000 | 81.392 |
| | | 20.000 | 85.239 | | 20.000 | 87.361 |
| | | 25.000 | 88.618 | | 25.000 | 91.471 |
| RBF | 17.379 | 10.000 | 72.565 | 16.510 | 10.000 | 73.082 |
| | | 15.000 | 80.058 | | 15.000 | 81.471 |
| | | 20.000 | 85.055 | | 20.000 | 87.475 |
| | | 25.000 | 88.604 | | 25.000 | 91.577 |
Recognition accuracy (EER) of a recognition system that uses only visible-light or thermal images for the recognition problem using SVM and CNN features with PCA (unit: %).
| SVM Kernel | Accuracies of Recognition Systems Using Single Image Types | | | | | |
|---|---|---|---|---|---|---|
| | Using only Visible-Light Images | | | Using only Thermal Images | | |
| | EER | FAR | GAR | EER | FAR | GAR |
| Linear | 17.064 | 10.000 | 73.261 | 16.114 | 10.000 | 73.744 |
| | | 15.000 | 80.906 | | 15.000 | 82.518 |
| | | 20.000 | 85.607 | | 20.000 | 88.079 |
| | | 25.000 | 88.583 | | 25.000 | 91.758 |
| RBF | 17.489 | 10.000 | 68.958 | 17.596 | 10.000 | 69.433 |
| | | 15.000 | 79.090 | | 15.000 | 78.042 |
| | | 20.000 | 84.627 | | 20.000 | 85.149 |
| | | 25.000 | 87.055 | | 25.000 | 89.830 |
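The with-PCA experiments above reduce the dimensionality of the CNN feature vectors before feeding them to the SVM. A minimal PCA sketch (hypothetical 1024-D features and component count, not the authors' code):

```python
# Sketch: PCA-reduce row-vector features via SVD of the centered data.
import numpy as np

def pca_reduce(features, n_components):
    """Project (n_samples, n_dims) features onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 1024))  # 200 hypothetical 1024-D CNN features
reduced = pca_reduce(feats, 100)
print(reduced.shape)  # (200, 100)
```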
Recognition accuracy (EER) of recognition systems using a combination of visible-light and thermal images for the recognition problem without PCA (unit: %).
| First SVM Layer Kernel | Accuracy of Recognition Systems Using Combined Visible-Light and Thermal Images | | | | | | |
|---|---|---|---|---|---|---|---|
| | Feature-Level Fusion Approach | | | Score-Level Fusion Approach | | | |
| | EER | FAR | GAR | Second SVM Layer Kernel | EER | FAR | GAR |
| Linear | 11.766 | 5.000 | 73.356 | Linear | 11.919 | 5.000 | 73.684 |
| | | 10.000 | 85.297 | | | 10.000 | 85.465 |
| | | 15.000 | 91.280 | | | 15.000 | 91.214 |
| | | 20.000 | 94.488 | | | 20.000 | 94.388 |
| | | | | RBF | 11.956 | 5.000 | 73.271 |
| | | | | | | 10.000 | 85.339 |
| | | | | | | 15.000 | 91.169 |
| | | | | | | 20.000 | 94.618 |
| RBF | 11.684 | 5.000 | 73.394 | Linear | 11.850 | 5.000 | 74.008 |
| | | 10.000 | 85.206 | | | 10.000 | 85.689 |
| | | 15.000 | 91.545 | | | 15.000 | 91.402 |
| | | 20.000 | 94.692 | | | 20.000 | 94.627 |
| | | | | RBF | 11.956 | 5.000 | 73.201 |
| | | | | | | 10.000 | 85.292 |
| | | | | | | 15.000 | 91.288 |
| | | | | | | 20.000 | 94.670 |
Recognition accuracy (EER) of recognition systems using a combination of visible-light and thermal images for the recognition problem with PCA (unit: %).
| First SVM Layer Kernel | Accuracy of Recognition Systems Using Combined Visible-Light and Thermal Images | | | | | | |
|---|---|---|---|---|---|---|---|
| | Feature-Level Fusion Approach | | | Score-Level Fusion Approach | | | |
| | EER | FAR | GAR | Second SVM Layer Kernel | EER | FAR | GAR |
| Linear | | 5.000 | 76.014 | Linear | 11.849 | 5.000 | 74.966 |
| | | 10.000 | 85.834 | | | 10.000 | 86.570 |
| | | 15.000 | 92.087 | | | 15.000 | 91.274 |
| | | 20.000 | 94.787 | | | 20.000 | 94.270 |
| | | | | RBF | 11.863 | 5.000 | 73.472 |
| | | | | | | 10.000 | 85.875 |
| | | | | | | 15.000 | 91.286 |
| | | | | | | 20.000 | 94.490 |
| RBF | 12.808 | 5.000 | 71.375 | Linear | | 5.000 | 73.008 |
| | | 10.000 | 85.933 | | | 10.000 | 81.986 |
| | | 15.000 | 89.346 | | | 15.000 | 91.483 |
| | | 20.000 | 92.534 | | | 20.000 | 94.297 |
| | | | | RBF | 11.753 | 5.000 | 72.040 |
| | | | | | | 10.000 | 85.781 |
| | | | | | | 15.000 | 91.530 |
| | | | | | | 20.000 | 94.368 |
Figure 6. Average ROC curves of recognition systems using single image types with the CNN-based method in Figure 2.
Summary of the recognition accuracy of our proposed recognition system in comparison with previous studies (unit: %).
| Method | Using Single Visible-Light Images | Using Single Thermal Images | Feature-Level Fusion | Score-Level Fusion |
|---|---|---|---|---|
| HOG+SVM [ | 17.817 | 20.463 | 16.632 | 16.277 |
| EWHOG+SVM [ | 15.113 | 19.198 | 14.767 | 14.135 |
| wHOG+SVM [ | 15.219 | 18.257 | 14.819 | 13.060 |
| Proposed method | 17.064 | 16.114 | | |
Figure 7. Average ROC curves of the recognition systems using single image types with the CNN-based method and SVM-based method.
Figure 8. Average ROC curves of recognition systems using our proposed method.
Figure 9. Examples of correct recognition results using our proposed method: (a–c) examples of male images correctly recognized as males, and (d–f) examples of female images correctly recognized as females.
Figure 10. Examples of errors using our proposed method: (a–c) examples of male images incorrectly recognized as females, and (d–f) examples of female images incorrectly recognized as males.