Yassine Himeur, Somaya Al-Maadeed, Noor Almaadeed, Khalid Abualsaud, Amr Mohamed, Tamer Khattab, Omar Elharrouss.
Abstract
Since the start of the COVID-19 pandemic, social distancing (SD) has played an essential role in controlling and slowing the spread of the virus in smart cities. To help ensure respect of SD in public areas, visual SD monitoring (VSDM) provides promising opportunities by (i) measuring and analyzing the physical distance between pedestrians in real time, (ii) detecting SD violations among crowds, and (iii) tracking and reporting individuals violating SD norms. To the authors' best knowledge, this paper proposes the first comprehensive survey of VSDM frameworks and identifies their challenges and future perspectives. Specifically, we review existing contributions by presenting the background of VSDM, describing evaluation metrics, and discussing SD datasets. VSDM techniques are then carefully reviewed after dividing them into two main categories: hand-crafted feature-based and deep-learning-based methods. Significant focus is placed on convolutional neural network (CNN)-based methodologies, as most frameworks have used one-stage, two-stage, or multi-stage CNN models. A comparative study is also conducted to identify their pros and cons. Thereafter, a critical analysis highlights the issues and impediments that hold back the expansion of VSDM systems. Finally, future directions attracting significant research and development are derived.
Keywords: Bird’s eye view; Convolutional neural networks; Euclidean distance; Pedestrian detection; Transfer learning; Visual social distancing monitoring
Year: 2022 PMID: 35880102 PMCID: PMC9301907 DOI: 10.1016/j.scs.2022.104064
Source DB: PubMed Journal: Sustain Cities Soc ISSN: 2210-6707 Impact factor: 10.696
Fig. 1 Smart DL-based VSDM system for smart cities: the most important steps are explained, including (i) data collection, (ii) data storage, (iii) pedestrian detection, (iv) distance measurement, and (v) violation detection.
Fig. 2 The risk of COVID-19 transmission between monitored people: the closer or denser a crowd is, the riskier it is considered. The least risky level is shown on the top left and the most risky on the top right.
Fig. 3 Flowchart explaining the main steps of a CNN-based VSDM system.
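The generic flow of such a CNN-based VSDM system can be sketched as follows. This is a minimal sketch, not any surveyed framework's implementation: the bounding boxes are assumed to come from a pedestrian detector, and the `pixels_per_metre` calibration constant and 2 m threshold are illustrative assumptions.

```python
import math

def centroid(box):
    """Bottom-centre of an (x1, y1, x2, y2) bounding box, a common
    reference point for ground-plane distance estimation."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)

def find_violations(boxes, pixels_per_metre, threshold_m=2.0):
    """Return index pairs of detected pedestrians whose estimated
    physical distance falls below threshold_m.

    pixels_per_metre is a hypothetical calibration constant that
    converts image distance to physical distance."""
    points = [centroid(b) for b in boxes]
    violations = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d_px = math.dist(points[i], points[j])
            if d_px / pixels_per_metre < threshold_m:
                violations.append((i, j))
    return violations

# Three detections; the first two are 30 px (0.6 m) apart -> one violation.
print(find_violations([(0, 0, 20, 100), (30, 0, 50, 100), (400, 0, 420, 100)], 50))
# -> [(0, 1)]
```

In a full pipeline, the flagged pairs would then be highlighted in the video frame and logged for reporting.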
Fig. 4 Taxonomy of existing VSDM techniques proposed in the last two years, with reference to the type of CNN model (complex or lightweight), transfer learning approaches, pedestrian detectors, data recording technique, and overall methodology.
Fig. 5 Difference between conventional ML and TL techniques for multiple tasks: (a) conventional ML and (b) TL.
Fig. 6 The computational cost of different pedestrian detectors in terms of the fps score (Rahim et al., 2022).
Summary of the VSDM frameworks based on CNN and their characteristics.
| Work | ML model | Description | Dataset | Best VSDM performance | Advantage/limitation |
|---|---|---|---|---|---|
| | RCNN, Faster-RCNN, SSD, YOLOv3 | VSD analysis alert system based on object detection and CNN models | MS-COCO and PASCAL-VOC | mAP = 75% (SSD) | Performance needs further improvement; validation on real-world scenarios is missing. |
| | CNN, MobileNet-V2 | Real-time SD detection | Private data | Acc = 92.8% | Validated on a small image dataset; moderate performance and privacy issues. |
| | YOLOv4 | Viewpoint-independent pedestrian detection and VSDM | VOC, MS-COCO, ImageNet ILSVRC | Acc = 99.8% | Based on frame-by-frame pedestrian detection; privacy concerns were not addressed. |
| | YOLOv3 | Social distance monitoring using drone surveillance | Validation on frontal- and side-view images | Acc = 95% | Validated on a small dataset; privacy preservation is not addressed. |
| | SSD300 | VSDM using object detection | VOC2007 | mAP = 88.4% | The training set is small and the performance needs further improvement. |
| | YOLOv3 | Distance between persons calculated using BEV coordinates | OTC, Mall dataset, TSD | Acc = 92.80%, PR = 95.36%, 95.94% | Missed detections with the Mall-D and TSD datasets; the method is validated only on datasets with simple scenes. |
| | MobileNetv2 | Euclidean distance between pedestrians calculated using a symmetric distance matrix and a 3D projected image of each frame | Private video data | Acc = 94.1% | Focuses on an indoor manufacturing setup and is tested on a small dataset. |
| | Faster-RCNN, SSD, YOLOv3 | Transformation of real-time video to BEV | OTC | mAP = 0.868, meanIoU = 0.907 | Privacy preservation is not discussed; moderate performance. |
| | YOLOv3, RetinaNet, Mask RCNN | Quantification of pedestrian density and distance | MS-COCO, existing video sequences | SDAR = 97.6% | Difficult to quantify the efficiency of the system; pedestrian overlapping can significantly bias the results. |
| | Spatio-temporal analysis | VSDM using online spatio-temporal trajectories and Euclidean distance | Market1501, MOT16, SCU-VSD | Acc = 61.4%, PR = 79.1%, ARP-USD = 75.90% | Privacy concerns were not addressed. |
| | Faster-RCNN ResNet-50, Faster-RCNN Inception-v2, MobileNet SSDv3 | SD monitoring in real-world scenarios | OTC | Acc = 97.7% | (i) Implementation on embedded systems, (ii) validation on real-world scenarios, and (iii) high detection accuracy. |
| | YOLOv3 | Indoor VSDM using RGB-D and CCTV cameras | Private video data | Acc = 88% | Cannot differentiate between strangers and individuals from the same family/household. |
| | SPP-SSD-MobileNetV2 | Real-time VSDM in public gatherings | OTC | Acc = 99.1%, PR = 99.2% | Validated on a small dataset (one video); efficiency on other large-scale datasets remains to be shown. |
| | YOLOv4 | A predefined SD threshold and a violation index are used to detect SD violations | Mall-D, PETS2009, OTC, VIRAT | Acc = 96% | Privacy preservation was not considered; cannot distinguish individuals from the same family from strangers. |
| | CNN | Real-time VSDM | INRIA | Acc = 98.50% | Pedestrian overlapping can bias detection performance (single camera); validated on a small image dataset. |
| | YOLOv2 | Real-time VSDM and body temperature detection from thermal videos | Private thermal image dataset | Acc = 95.6% | (i) Real-time validation in real-world scenarios, (ii) appropriate for distributed video surveillance systems. |
| | YOLOv3 | DL-based BEV SD analysis | OxTown | mAP = 93.6% | Extremely sensitive to the spatial position of the camera. |
| | Tiny-YOLOv4 and DeepSORT | Crowd counting and SD monitoring from a top-view camera perspective | YouTube video | mAP = 92.94% | Processes video streams in real time; however, further work is needed to improve detection accuracy. |
| | ResNet-34 | Uses CV and radio IoT sub-systems to track people and retrieve the IDs of their devices | PETS2006 and real-world video data | Acc = 95.2%, F1 = 97.5% | Raises some privacy issues. |
| | DFCN | A cost-effective VSD approach that perceives people's 3D locations and body orientation from images | KITTI | Acc = 84.7%, RE = 85.3% | (i) Works with single RGB images, (ii) privacy-safe, (iii) does not require homography calibration, (iv) generalizes well across different datasets, (v) works with fixed or moving cameras. |
| | YOLOv4 | Validation on video data recorded using a fixed, motionless time-of-flight (ToF) camera | ExDARK | mAP = 97.84%, MAE = 1.01 cm | Applicable in real-world scenarios thanks to high precision and a low error rate; used only with fixed cameras. |
| | Faster-RCNN | Using SIoV to detect SD violations in real time | Stanford Vehicles Dataset | mAP = 0.76 | Validated on open-source simulation platforms; further work is required to improve detection accuracy. |
| | YOLOv3 | Contact tracing and simulation of droplet spread among the healthy population | PennFudanPed, MS-COCO, VOC | PR = 69.41% | Does not report SD violations; average precision is quite low. |
| | YOLOv3 | Transforming video frames into a top-down view for distance measurement | Private data | N/A | Validated on a small dataset; performance was not reported. |
| | YOLOv4 | People tracking using NvDCF to extract pedestrian trajectories | SVD, NS-3, SUMO | N/A | No insight is provided into the accuracy of SD detection. |
| | PeleeNet | Real-time UAV-based VSDM using a lightweight CNN | Merge-Head, UAV-Head | AP = 92.22% | Instability caused by intense wind; performance needs further improvement. |
| | Faster-RCNN, YOLOv2 | Secure and privacy-preserving VSDM using blockchain | COCO | AUC = 73% | Although a secure and privacy-preserving VSDM framework is presented, detection accuracy needs further improvement. |
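Several of the frameworks above measure inter-pedestrian distance after projecting detections to a bird's eye view (BEV). The projection reduces to applying a 3x3 homography in homogeneous coordinates; a minimal NumPy sketch, where the homography H is assumed to come from a one-off camera calibration (the pure-scaling H below is an illustrative stand-in):

```python
import numpy as np

def to_bev(points_px, H):
    """Project image points (N, 2) to ground-plane coordinates via a
    3x3 homography H, using homogeneous coordinates."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]  # normalize by the third coordinate

def pairwise_distances(points):
    """Euclidean distance matrix between ground-plane points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# Illustrative calibration: a pure scaling homography where
# 100 px on the image plane corresponds to 1 m on the ground.
H = np.diag([0.01, 0.01, 1.0])
feet_px = np.array([[100.0, 500.0], [250.0, 500.0]])  # feet positions in pixels
ground = to_bev(feet_px, H)
dist_m = pairwise_distances(ground)[0, 1]  # 1.5 m apart
```

In practice H would be estimated from at least four image points with known ground coordinates (e.g. via OpenCV's `cv2.getPerspectiveTransform`), and the resulting metric distances compared against the SD threshold.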
Fig. 7 Example of the real-time YOLO-based VSDM system of Rezaei and Azarmi (2020), which is built on the DarkNet architecture and validated on the OTC dataset.
Fig. 8 Difference between conventional ML and TL techniques for multiple tasks: (a) conventional ML and (b) TL.
Fig. 9 The TL-based pedestrian detection framework proposed in Ahmed, Ahmad, Rodrigues, et al. (2021), built using YOLOv3 and real-world overhead video frames.
Summary of the TL-based VSDM frameworks and their characteristics.
| Work | ML model | Method description | Dataset | Best VSDM performance | Advantage/limitation |
|---|---|---|---|---|---|
| | TL-based YOLOv3 | Tracking the detected people using BBs | Validation on a frontal-view dataset | mAP = 0.846 | No statistical analysis of the results is provided; the validity of the distance measurements is not discussed. |
| | TL-based YOLOv3 | Using a pre-trained CNN and approximation of physical distances to detect SD violations | MS-COCO, private video dataset | Acc = 95%, PR = 86%, RE = 83% | Validated on one video with simple scenes; lacks privacy protection mechanisms. |
| | TL-based Faster-RCNN | Conducting VSDM from the top-view perspective | Private video data | Acc = 96%, RE = 92%, F1 = 94% | Validated on a small dataset; privacy concerns are not addressed. |
| | TL-based improved SSD | Real-time VSDM based on overhead position | MS-COCO, private overhead video dataset | Acc = 95.3% | Validated on a small dataset. |
| | TL-based YOLOv4 | Indoor VSDM | MS-COCO + private video data | 93.7% | Validated on a small private video dataset. |
| | TL-based AlexNet | VSDM based on crowd behavior analysis in drone images | UCF-ARG | Acc = 99.58% | Privacy concerns are not addressed. |
| | DA-based Faster-RCNN | VSDM and crowd counting using DA and a pre-calibration strategy | ViPeD, CrowdVisorPisa | mAP = 83.6% | Performance needs further improvement; privacy concerns have not been addressed. |
Fig. 10 Example of an indoor video scene with four detected pedestrians recorded at the CUMTB-Campus to evaluate the VSDM system developed in Niu et al. (2021).
Evaluation of pedestrian localization and height errors of an indoor video scene recorded at the CUMTB-Campus (Niu et al., 2021).
| Pedestrian number | Distance (m) | Localization error (m) | Height error (m) | Adjacent pedestrians | Truth value (m) | Measured value (m) | Absolute error (m) | SD accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | 10.203 | 0.32 | 0.054 | – | – | – | – | – |
| 2 | 14.271 | 0.242 | 0.007 | 1–2 | 4.832 | 4.77 | 0.062 | 0.987 |
| 3 | 24.421 | 0.133 | 0.229 | 2–3 | 10.266 | 10.199 | 0.067 | 0.993 |
| 4 | 36.563 | 0.299 | 0.071 | 3–4 | 13.122 | 13.209 | 0.087 | 0.993 |
| Average | – | 0.248 | 0.09 | – | – | – | 0.072 | 0.991 |
Fig. 11 Example of an outdoor video scene with eight detected pedestrians recorded at the CUMTB-Campus to evaluate the VSDM system developed in Niu et al. (2021).
Evaluation of pedestrian localization and height errors in an outdoor scene recorded at CUMTB-Campus (Niu et al., 2021).
| Pedestrian number | Distance (m) | Localization error (m) | Height error (m) | Adjacent pedestrians | Truth value (m) | Measured value (m) | Absolute error (m) | SD accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | 18.257 | 0.185 | 0.058 | – | – | – | – | – |
| 2 | 18.602 | 0.116 | 0.043 | 1–2 | 0.682 | 0.589 | 0.093 | 0.863 |
| 3 | 25.197 | 0.476 | 0.141 | 2–3 | 8.059 | 8.176 | 0.117 | 0.985 |
| 4 | 11.238 | 0.093 | 0.017 | 3–4 | 14.709 | 14.548 | 0.161 | 0.989 |
| 5 | 51.072 | 0.517 | 0.139 | 4–5 | 41.132 | 40.925 | 0.207 | 0.994 |
| 6 | 42.151 | 0.341 | 0.114 | 5–6 | 9.171 | 9.367 | 0.196 | 0.978 |
| 7 | 41.984 | 0.392 | 0.128 | 6–7 | 0.676 | 0.558 | 0.118 | 0.825 |
| 8 | 36.879 | 0.271 | 0.122 | 7–8 | 6.205 | 6.112 | 0.093 | 0.985 |
| Average | – | 0.298 | 0.095 | – | – | – | 0.140 | 0.945 |
Fig. 12 Example of an outdoor video scene recorded with a drone-based camera with eight detected pedestrians used to evaluate the VSDM system developed in Shao et al. (2021).
Evaluation of pedestrian localization errors and SD accuracy in an outdoor scene from the Merge-Head dataset (Shao et al., 2021).
| Inter-pedestrian distance | Truth value (m) | Measured value (m) | Absolute error (m) | SD accuracy |
|---|---|---|---|---|
| 3–4 | 1.980 | 1.905 | 0.075 | 0.962 |
| 4–5 | 1.980 | 1.855 | 0.125 | 0.936 |
| 3–6 | 1.980 | 1.877 | 0.103 | 0.947 |
| 5–6 | 1.980 | 1.860 | 0.120 | 0.939 |
| 2–7 | 1.980 | 1.897 | 0.083 | 0.958 |
| 4–7 | 1.980 | 1.859 | 0.121 | 0.938 |
| 2–8 | 1.980 | 1.855 | 0.125 | 0.936 |
| 4–8 | 1.980 | 1.853 | 0.127 | 0.935 |
| Average | 1.980 | 1.870 | 0.109 | 0.945 |
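The SD accuracy values in the evaluation tables above are consistent with a relative-error metric, 1 − |measured − truth| / truth. A minimal sketch under that assumption (the formula is inferred from the tabulated values, not quoted from the evaluated papers):

```python
def sd_accuracy(truth_m, measured_m):
    """Social-distance accuracy as one minus the relative absolute
    error of the measured inter-pedestrian distance.

    Assumption: this formula is reverse-engineered from the tables;
    the surveyed papers may define the metric differently."""
    return 1.0 - abs(measured_m - truth_m) / truth_m

# Reproduces the table entries, e.g. truth 1.980 m, measured 1.905 m:
print(round(sd_accuracy(1.980, 1.905), 3))  # -> 0.962
print(round(sd_accuracy(4.832, 4.77), 3))   # -> 0.987 (indoor pair 1-2)
```

Averaging this quantity over all adjacent pairs yields the per-scene figures reported above (0.991 indoor, 0.945 outdoor and drone).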
Fig. 13 The computational complexity analysis results in terms of frame and processing rates: (a) without the smoothing/tracking stage and (b) with the smoothing/tracking stage (Al-Sa’d et al., 2022).
Fig. 14 Example of tilted-image calibration used for physical distance measurement and VSDM in Shao et al. (2021).
Fig. 15 mAP vs. GPU computation cost for (a) CNN-based object detectors and (b) CNN-based feature extractors. Each object detector (or feature extractor) can correspond to several points on the graph because of varying input strides, sizes, etc. (Huang et al., 2017).