Zhiwei Cao1,2, Yong Qin1,3, Yongling Li1,2, Zhengyu Xie2, Jianyuan Guo2, Limin Jia1,3.
Abstract
COVID-19 spreads rapidly among people, so more and more people are wearing masks in rail transit stations. However, current face detection algorithms cannot distinguish a face wearing a mask from a face without one. This paper proposes a face detection algorithm based on a single shot detector (SSD) and active learning for rail transit surveillance, which effectively detects both bare faces and masked faces. First, we propose a real-time face detection algorithm based on the single shot detector, which improves accuracy by optimizing the backbone network, the feature pyramid network, the spatial attention module, and the loss function. Subsequently, we propose a semi-supervised active learning method that selects valuable samples from rail transit video surveillance to retrain the face detection algorithm, improving the algorithm's generalization in rail transit scenes and reducing the time spent labeling samples. Extensive experimental results demonstrate that the proposed method significantly outperforms state-of-the-art algorithms on the rail transit dataset. The proposed algorithm has a wide range of applications in rail transit stations, including passenger flow statistics, epidemiological analysis, and reminding passengers who do not wear masks. At the same time, our algorithm does not collect or store passengers' face information, which effectively protects passenger privacy.
Keywords: Active learning; Face detection; Mask detection; Rail transit passengers; Single shot detector
Year: 2022 PMID: 36060225 PMCID: PMC9425808 DOI: 10.1007/s11042-022-13491-x
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577
Fig. 1 The cameras installed at the rail transit station
Fig. 2 SSD model framework
Fig. 3 The pool-based active learning
Fig. 4 Architecture of SSD-Mask
Fig. 5 Aspect ratio distribution of faces in the mask dataset
Fig. 6 Activation functions: Mish and ReLU
Fig. 7 Low-level feature pyramid network
Fig. 8 Spatial attention module
Fig. 9 Annotation of samples under various methods: (a) standard active learning; (b) semi-supervised learning; (c) semi-supervised active learning
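The Mish activation compared against ReLU in Fig. 6 can be written as mish(x) = x · tanh(softplus(x)). A minimal sketch of the two functions follows; the numerically stable softplus is an implementation detail assumed here, not taken from the paper.

```python
import math

def softplus(x):
    # Numerically stable log(1 + e^x)
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    # Mish: x * tanh(softplus(x)) -- smooth and non-monotonic
    return x * math.tanh(softplus(x))

def relu(x):
    return max(0.0, x)

# Unlike ReLU, Mish lets small negative values through,
# which keeps gradients alive for slightly negative activations.
print(relu(-0.5), round(mish(-0.5), 4))  # 0.0 -0.2207
```

The non-zero response below zero is the usual motivation for swapping ReLU for Mish in a detection backbone.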
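A CBAM-style spatial attention module (the general form of the module in Fig. 8) pools the feature map across channels with mean and max, turns the pooled maps into a per-pixel weight in (0, 1), and rescales every channel by it. In the sketch below, the fixed 0.5/0.5 mixing weights stand in for the learned convolution; they are an illustrative assumption, not the paper's exact layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(feat):
    """feat: list of C channel maps, each H x W (nested lists).
    Pools across channels with mean and max at every spatial location,
    mixes the two pooled values (placeholder for the learned 7x7 conv),
    squashes with sigmoid, and reweights every channel by the result."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    attn = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            vals = [feat[c][i][j] for c in range(C)]
            avg_pool = sum(vals) / C
            max_pool = max(vals)
            attn[i][j] = sigmoid(0.5 * avg_pool + 0.5 * max_pool)
    return [[[feat[c][i][j] * attn[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

Because the attention weight is shared across channels, the module emphasizes *where* informative activations are, which suits localizing small faces in station footage.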
Semi-supervised Active Learning Algorithm
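The semi-supervised active learning (SSAL) idea can be sketched as a confidence-driven split of the unlabelled pool: uncertain images go to a human annotator, confident images are pseudo-labelled by the model itself, and both feed retraining. The thresholds and the min-confidence criterion below are illustrative assumptions, not the paper's exact selection rule.

```python
def split_pool(detections, low_thr=0.3, high_thr=0.9):
    """detections: {image_id: [confidence scores of predicted boxes]}.
    Images whose least confident detection falls below low_thr are sent
    for human annotation; images whose detections all exceed high_thr
    are pseudo-labelled automatically; the rest stay in the pool."""
    to_annotate, pseudo_labelled, remaining = [], [], []
    for img_id, scores in detections.items():
        if not scores or min(scores) < low_thr:
            to_annotate.append(img_id)      # valuable, uncertain sample
        elif min(scores) > high_thr:
            pseudo_labelled.append(img_id)  # model's own labels reused
        else:
            remaining.append(img_id)
    return to_annotate, pseudo_labelled, remaining
```

Routing only the uncertain images to annotators is what reduces labeling time, while pseudo-labelling the confident ones is the semi-supervised half of the scheme.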
Fig. 10 Mask dataset: (a) the mask dataset consisting of images of various scenes; (b) the mask dataset consisting only of rail transit images
Training hyper-parameters of SSD-Mask

| Parameter | Value |
|---|---|
| Input image size | 512 × 512 |
| Input channels | 3 |
| Iterations | 120,000 |
| Batch size | 16 |
| Momentum | 0.9 |
| Learning rate | 0.001, 0.0001 |
| Learning rate steps | 70,000, 100,000 |
| Weight decay | 0.0005 |
| Negative/positive ratio | 3 |
Results on the proposed testing set
| | Faster R-CNN [35] | YOLOv3 [34] | SSD [23] | YOLOv4 [1] | AIZOO [4] | Baidu [42] | SSD-Mask |
|---|---|---|---|---|---|---|---|
| mAP (%) | 79.48 | 85.5 | 84.91 | 83.53 | 63.71 | | |
| FPS | 0.5 | 21 | | 17.5 | 46 | 1.5 | |
Fig. 11 Several detection results of various algorithms on the proposed testing set: (a) input; (b) Faster R-CNN [35]; (c) YOLOv3 [34]; (d) SSD [23]; (e) YOLOv4 [1]; (f) AIZOO [4]; (g) Baidu [42]; (h) SSD-Mask
Results on the public mask dataset

| Method | Backbone | Input size | Face AP (%) | Mask AP (%) |
|---|---|---|---|---|
| Faster R-CNN [35] | ResNet-50 | ~600 × 1000 | 86.40 | 93.20 |
| YOLOv3 [34] | Darknet-53 | 416 × 416 | 90.45 | 93.50 |
| SSD [23] | VGGNet | 512 × 512 | 90.62 | 90.32 |
| YOLOv4 [1] | CSPDarkNet53 | 416 × 416 | 88.36 | |
| AIZOO [4] | ConvNet | 360 × 360 | 88.81 | 90.04 |
| Baidu [42] | VGGNet | 128 × 128 | 54.35 | 76.10 |
| SSD-Mask | VGGNet | 512 × 512 | 91.35 | |
Results on the testing dataset of rail transit

| Method | Backbone | Input size | FPS | mAP (%) | Face AP (%) | Mask AP (%) |
|---|---|---|---|---|---|---|
| Faster R-CNN [35] | ResNet-50 | ~600 × 1000 | 0.5 | 54.74 | 61.38 | 48.10 |
| YOLOv3 [34] | Darknet-53 | 416 × 416 | 21 | 71.85 | 81.20 | 62.50 |
| SSD [23] | VGGNet | 512 × 512 | | 69.42 | 80.47 | 58.38 |
| YOLOv4 [1] | CSPDarkNet53 | 416 × 416 | 17.5 | | 80.57 | 72.48 |
| AIZOO [4] | ConvNet | 360 × 360 | 46 | 54.80 | 67.41 | 42.19 |
| Baidu [42] | VGGNet | 128 × 128 | 1.5 | 73.66 | 73.75 | 73.58 |
| SSD-Mask | VGGNet | 512 × 512 | 42 | 76.55 | 82.48 | 70.62 |
| SSD-Mask + SSAL | VGGNet | 512 × 512 | 42 | 84.85 | | |
Fig. 12 Several detection results of various algorithms in rail transit scenarios: (a) input; (b) Faster R-CNN [35]; (c) YOLOv3 [34]; (d) SSD [23]; (e) YOLOv4 [1]; (f) AIZOO [4]; (g) Baidu [42]; (h) SSD-Mask; (i) SSD-Mask + SSAL
Fig. 13 The classification loss of SSD and SSD-Mask
Ablation study of SSD-Mask
| Anchor | Mish | FPN | SAM | Focal loss | mAP (%) |
|---|---|---|---|---|---|
| | | | | | 84.91 |
| √ | | | | | 85.50 |
| | √ | | | | 85.54 |
| | | √ | | | 85.55 |
| | | | √ | | 85.60 |
| | | | | √ | 85.52 |
| √ | √ | √ | √ | √ | |
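The focal-loss column in the ablation corresponds to replacing the standard cross-entropy classification loss with focal loss (Lin et al.), which down-weights easy, well-classified anchors so training concentrates on hard face and mask examples. A minimal per-sample sketch, using the commonly cited defaults alpha = 0.25 and gamma = 2 (the paper's exact settings are an assumption here):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.
    p: predicted probability of the positive class; y: label in {0, 1}.
    The (1 - pt)^gamma factor shrinks the loss of easy examples."""
    pt = p if y == 1 else 1.0 - p
    at = alpha if y == 1 else 1.0 - alpha
    return -at * (1.0 - pt) ** gamma * math.log(pt)

# An easy, correct positive contributes far less than a hard one:
easy = focal_loss(0.95, 1)
hard = focal_loss(0.3, 1)
```

With gamma = 0 the expression reduces to alpha-weighted cross-entropy, which is why focal loss is a drop-in change to the SSD classification head.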
Fig. 14 The mAP on the testing dataset of rail transit