
Face mask detection and social distance monitoring system for COVID-19 pandemic.

Iram Javed¹, Muhammad Atif Butt², Samina Khalid¹, Tehmina Shehryar³, Rashid Amin⁴, Adeel Muzaffar Syed⁵, Marium Sadiq¹.

Abstract

Coronavirus triggers several respiratory infections with symptoms such as sneezing, coughing, and pneumonia, and is transmitted from human to human through airborne droplets. According to the guidelines of the World Health Organization, the spread of COVID-19 can be mitigated by avoiding close public interactions and following standard operating procedures (SOPs), including wearing a face mask and maintaining social distancing in schools, shopping malls, and crowded areas. However, enforcing the adoption of these SOPs on a larger scale is still a challenging task. With the emergence of deep learning-based visual object detection networks, numerous methods have been proposed to perform face mask detection in public spots. However, these methods require a huge amount of data to ensure robustness in real-time applications. Also, to the best of our knowledge, there is no standard outdoor surveillance-based dataset available to ensure the efficacy of face mask detection and social distancing methods in public spots. To this end, we present a large-scale dataset comprising 10,000 outdoor images categorized into two classes, i.e., masked and non-masked faces, to accelerate the development of automated face mask detection and social distance measurement in public spots. Alongside, we also present an end-to-end pipeline to perform real-time face mask detection and social distance measurement in an outdoor environment. Initially, existing state-of-the-art single and multi-stage object detection networks are fine-tuned on the proposed dataset to evaluate their performance in terms of accuracy and inference time. Based on its better performance, the YOLO-v3 architecture is further optimized by tuning its feature extraction and region proposal generation layers to improve the performance in real-time applications. Our results indicate that the presented pipeline performed better than the baseline version, showing an improvement of 5.3% in terms of accuracy.

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


Keywords: Coronavirus; Face mask detection; Single and multi-stage detectors; Social distance measurement

Year: 2022 PMID: 36196269 PMCID: PMC9522539 DOI: 10.1007/s11042-022-13913-w

Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577

Introduction

Coronavirus broke out at the end of 2019, and it is still wreaking havoc on the livelihoods and businesses of millions of people around the world [13]. As the world has started recovering from the pandemic, people intend to return to the same state of normality as before the pandemic. However, there is an upsurge of uneasiness among people about getting back to their normal routine, because the virus spreads through droplets of saliva from an infected person, which can affect people within a range of approximately 6 feet. The main symptoms of this infection are fever, headache, cough, respiratory difficulties, and loss of taste and smell, and it can lead to the death of the infected person [41]. The incidence rate of COVID-19 is higher than that of other acute respiratory diseases such as severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS). To contain this deadly virus, the World Health Organization (WHO) [35] issued guidelines and SOPs such as wearing a face mask and maintaining social distance in public spots. In this regard, several research studies also reported that maintaining distance during physical interaction between people can prevent the spread of most respiratory diseases [21]. Tangana et al. [1] presented a mathematical model to demonstrate the impact of physical distance during interaction on the probability of virus transmission among people. Another study [15] demonstrated that wearing a face mask is highly effective in mitigating the reproduction of coronavirus. However, manual monitoring and enforcement of the aforementioned SOPs in public places such as schools, universities, shopping malls, and parks is quite a challenging task.
In step with the rapid advancement in Artificial Intelligence (AI), and Deep Learning in particular, the computer vision community has contributed various state-of-the-art methods for intelligent surveillance [65], object detection [6] and recognition [5, 7], and scene understanding [46]. These methods can be employed to develop an intelligent monitoring system for face mask detection and social distance measurement in public places. However, there are two main challenges in this direction. Firstly, to the best of our knowledge, there is no standard South Asian benchmark available to evaluate facial mask detection and social distance measurement methods. Secondly, there is no pipeline available for the development of an end-to-end real-time intelligent monitoring system for facial mask detection and social distance measurement. It is important to mention that several research studies have employed standard single- and multi-stage object detectors such as Faster-RCNN, SSD, and Retina-Net to perform face mask detection [17]. However, these methods do not consider social distance measurement, which makes them insufficient for deployment in actual public places. To address the aforementioned shortcomings of existing state-of-the-art methods, we make the following contributions in this paper. A local dataset containing 10,000 images with two classes (i.e., masked face and unmasked face) has been collected from public places. It is worth noting that the samples in these classes are unique in orientation and dress codes, which are not covered in the existing datasets. Existing state-of-the-art single and multi-stage object detectors are fine-tuned on the proposed dataset. Based on the analysis, an improved YOLO-v3 based object detection architecture is presented to enhance the robustness of real-time surveillance systems. Alongside, a machine-vision based distance measurement method is proposed to ensure social distancing in public places.
Lastly, an extensive comparative study has been carried out between state-of-the-art face mask detection methods and the proposed method to demonstrate the effectiveness of our method in terms of detection and recognition accuracy and inference time. The rest of the paper is organized as follows. In Section 2, we briefly discuss existing state-of-the-art facial mask detection and social distance measurement methods, along with the available datasets. In Section 3, we present a detailed overview of our proposed end-to-end pipeline for face mask detection and social distance measurement. The experimental results are presented in Section 4. Finally, the paper is concluded in Section 5.

Related work

Real-time object detection and recognition methods can play an important role in developing intelligent monitoring methods for face mask detection and social distancing measurement to prevent coronavirus transmission. In this section, we analyze the existing state-of-the-art methods employed in developing an intelligent monitoring system for face mask detection and social distancing measurement which includes: (i) single- and multi-stage detection methods—for face masked and non-masked face detection, (ii) Available Datasets—to develop generalized face detection systems and, (iii) social distance measurement methods.

Facial mask detection

In the majority of existing research works, researchers focused on face construction and identity recognition while wearing face masks. However, the aim of this study is to identify the human face in both states, wearing a mask or not wearing a mask, in order to assist in reducing COVID-19 transmission and spread. Recent studies have demonstrated that wearing face masks reduces the rate of COVID-19 spread, as masks can block airborne germs effectively [38]. However, monitoring people in public places is still a challenging task. In this regard, Zhang et al. [62] proposed a single-shot refinement face detector, namely RefineFace, to detect people not wearing a face mask. In another research work, Jagadeeswari et al. [19] proposed an SSD-based face mask detection method for outdoor environments. Khandelwal et al. [22] presented a deep learning approach for classifying human faces with and without masks. Onyema et al. [40] proposed a method for facial expression recognition based on a convolutional neural network. Hussain et al. [16] proposed a deep learning based IoT system to detect face masks using a transfer learning approach. The aforementioned approaches achieved good accuracy on their respective test data. However, real-time face mask detection is still a critical challenge for system developers. In this regard, Snyder et al. [56] introduced a deep learning based approach for mask detection to prevent COVID-19 transmission. Kodali et al. [23] presented a custom CNN-based model to detect faces wearing a mask in public spots. Similarly, Sagayam et al. [54] proposed a deep neural network based method for binary class (i.e., masked and non-masked) face state recognition. Degadwala et al. [9] proposed a YOLO-v4 based face detection method, which has been trained and tested on the WIDER-FACE and MAFA datasets. Likewise, Taneja et al. [58] presented a facial mask detection system with the MobileNetV2 lightweight CNN and achieved 99.98% accuracy.
On the other hand, Sethi et al. [55] aimed to detect masks using ResNet-50. Their model gave 11.07% higher precision and 6.44% higher recall than the RetinaFaceMask detector model. In another research work, Loey et al. [30] presented a multi-stage detection method for detecting faces with or without masks. Alongside, an ensemble method was combined with a deep learning model to detect face masks using real-world and synthetic data to improve the generalizability of machine learning models. These research works are summarized along with their strengths and limitations in Table 1. To this end, we conclude that the above-discussed face mask detection systems encounter several constraints at the development and deployment level, such as diverse types of face masks, face orientation, and illumination conditions [52]. Further challenges include balancing the accuracy of the object detection model against real-time requirements and placing the detector on systems with limited computing capacity. In the circumstances of the epidemic, facial mask detection in images, videos, and closed-circuit television (CCTV) footage has still not been explored enough to control the transmission chain of the virus [37].

Table 1

An Overview of Existing Machine Learning Methods Used for Face Mask Detection and Recognition Tasks

Author	Methods	Dataset	Accuracy	Limitation
Roy et al. [52]	MOXA included YOLO-v3, Tiny YOLO-v3, SSD and Faster R-CNN	Kaggle’s medical masks dataset — 3000 images	YOLO-v3: 63.99% mAP, Tiny YOLOv3: 56.27% mAP, SSD: 46.52% mAP, and F-RCNN: 60.5% mAP	Unmanned approach MOXA requires improvement including more innovative object detectors
Nagrath et al.[37]	Single shot multibox object detection model and MobileNetV2	Kaggle’s medical masks and PyImage search dataset contains 1,376 images	The SSDMNV2 model attained 92.64% accuracy	SSDMNV2 was trained on artificially produced images, still not tested in real situations as well as with real-time CCTV
Hussain et al.[16]	Transfer learning with CNN, VGG-16, MobileNetV2, ResNet-50, Inceptionv3	MAFA dataset, Masked Face-Net, and Bing dataset	VGG-16 achieved 99.81% accuracy and MobileNetV2 attained 99.6% accuracy	Online accessible datasets are noisy and artificially constructed, which is not suitable for real-time systems
Snyder et al.[56]	ResNet-50 with FPN and Multi-Task CNN	MCelebFaces Attributes, Microsoft Common Objects in Context, WIDER FACE dataset and Custom Mask Community Dataset	87.7% detection accuracy	Incorrectly identify faces with mask and without mask
Kodali et al.[23]	CNN model	Kaggle dataset with 853 images	96% detection accuracy	Incorrectly identify faces with mask and without mask
Sagayam et al.[54]	OpenCV and MobileNet-V2 used to detect face mask	Kaggle’s medical masks and PyImage search	99% accuracy achieved by MobileNet-V2	Trained on a limited dataset, so it does not perform well in real-time situations
Degadwala et al.[9]	YOLO-v4	MAFA and WIDER-FACE dataset	99.98% accuracy obtained by YOLO-v4	Needs more computational power and requires a 30 FPS camera frame rate
Taneja et al.[58]	MobileNet-V2 lightweight CNN model used to detect face mask	Medical Masks Dataset and the Face Mask Dataset	99.98% accuracy	Performance of MobileNet-V2 is not accurate as compared to Faster R-CNN and Inception-V2
Chadav et al.[8]	Multi-stage CNN model	Kaggle with 853 images	98% accuracy	Dual-stage CNN model do not detect side views of the face
Bhuiyan et al.[3]	YOLO-v3	Google colab datasets having 650 images	96% accuracy	Limited dataset used and not tested in real-time conditions
Ejaz et al.[11]	Principal Component Analysis (PCA) to recognize faces with and without masks	ORL face dataset containing 500 images is used for masked faces	72% accuracy for masked faces and 95% for faces without masks	PCA gave poor results on masked faces; only frontal face images are used in the dataset
Qin et al.[44]	Classification of facial images with SRCNet to automatically identify faces wearing masks	Medical Masks dataset having 3835 images	98.70% accuracy with the image super-resolution classification network (SRCNet)	Limited number of images used
Jiang et al.[20]	Proposed SSD to classify face with FPN	7959 images collected from internet	Without mask: 89.6% precision, with mask: 91.9% precision	Do not differentiate between mask and unmask face properly
Rahman et al.[45]	Facial mask detection in smart city through CCTV	1539 images collected from different sources	98.7% accuracy	System is confused by a hand-covered face
Punn et al.[43]	YOLO-v3 used to monitor real-time social distance	800 images taken from OID dataset	YOLO-v3 with Deep SORT acquired better results as compared to FPS	Privacy issues; does not record violations
Yang et al.[61]	Faster R-CNN and YOLOv4 detect real-time social distance and critical density	12300 images taken from MS-COCO dataset	Accuracy and performance are good for monitoring social distance	Does not record data; crowd analysis is still a challenge
Militante et al.[36]	Single shot detector used to detect face mask and physical distance with alarm system	20000 images collected from web	Accuracy rate of 97%	Does not detect face mask and distance at the same time
Yadav et al.[60]	Face mask and social distance detection with SSD, generating an alert signal	Custom dataset of 3165 images	Accuracy of 85% and 95%	N/A


Available datasets

In the context of COVID-19, face datasets play an essential role in training deep models for masked and non-masked face detection. Recently, several datasets have been proposed to accelerate research in this direction. In this regard, Ge et al. [12] proposed the MAFA dataset, which contains 30,811 images collected from the Internet. These images cover distinct types of masks and several degrees of occlusion and orientations. Furthermore, Laxel [33] introduced the Face Mask Dataset (FMA), which holds 853 images with three classes collected from Kaggle. An extended version of the Kaggle dataset, proposed by Wobot [18] and also denoted as FMA, contains 6,024 images with 20 classes. Rahmani et al. [45] proposed the Medical Mask Dataset (MMD), which consists of 9,067 images with three classes and is used to detect medical masks only. On the other hand, Wang et al. [59] proposed three large-scale datasets of masked faces for detection and recognition: the Masked-Face Detection Dataset (MFDD), the Real-world Masked-Face Recognition Dataset (RMFRD), and the Simulated Masked-Face Recognition Dataset (SMFRD). MFDD contains 24,771 masked face images collected from the Internet. RMFRD has 2,203 masked face images and 90,000 images without masks. SMFRD includes 50,000 images. Detectors trained on these datasets achieve 95% accuracy with the multi-granularity model. We do not collect any images from the existing datasets. Instead, we build a challenging dataset to perform experiments on existing object detectors.

Social distance measurement

Social distancing is a significant safety measure to control the spread of COVID-19. Computer vision applications have shown good applicability in detection [47] and emotion-enabled cognition tasks in real-time environments [4]. In this regard, dimensionality reduction with Matrix Factorization (MF) has been reported as a valuable framework in the fight against COVID-19 [34]. Additionally, feature selection and prognosis classification have been used to develop machine learning based intelligent systems for COVID-19 disease [53]. Spectral clustering [2] and a gene selection technique [51] have been presented to map data to a low-dimensional space by merging node centrality and community detection. The growing COVID-19 outbreak has also seriously affected global education systems. During school closures, computer science technologies have been useful and convenient for teaching as well as learning [10]. Prem et al. [42] used the susceptible-exposed-infected-removed (SEIR) model to study the effects of social distancing on the spread of the virus. Levchev et al. [26] studied a database configuration for multiple sensor technologies such as cameras, LiDAR, inertial gyroscopes, wireless sensors, and additional sensors used as data acquisition stages. Liang et al. [27] utilized various sensors to obtain image information and geographic location information at the same time, and built an indoor 3D map using geographic coordinates. Niu et al. [39] addressed the social distancing problem in a 3D view by using monocular cameras for pedestrian 3D localization. Furthermore, Magoo et al. [31] set up a bird's-eye view framework with the YOLO-v3 model to monitor social distance in public areas. Although the research community has contributed several social distance measurement methods, deployment of such systems in real-world environments is still a challenging task.

The method

To address the above-mentioned issues, we propose a novel pipeline for developing an end-to-end face mask detection method to monitor public spots in order to mitigate the spread of COVID-19, as shown in Fig. 1. Firstly, we present the large-scale MUST Face Dataset (MFD), containing 10,000 images along with binary class bounding box annotations, i.e., face wearing a mask and face not wearing a mask. Alongside, we analyzed the existing state-of-the-art single-stage and multi-stage object detectors on our proposed dataset. Specifically, we fine-tuned the existing YOLO-v3 [49], SSD [63], RetinaNet-50 [28], Fast-RCNN [50], Faster R-CNN (FPN) [32], Faster-RCNN (ResNet-50) [25] and Faster-RCNN (ResNet-101) [29] on our proposed dataset through transfer learning. Based on its better performance, we further improved the YOLO-v3 architecture to make it more robust in outdoor environments. On top of our face detector, we employed our proposed social distance measurement method, which takes input from the face detector and computes the distance between two people to mitigate the spread of COVID-19 in public spots.

Fig. 1

The Proposed Pipeline For Developing Face Mask Detection And Social Distance Measurement in Public Places

MUST face dataset

To this end, we collect and release the MUST Face Dataset (MFD), a large-scale dataset to accelerate the development of generalized methods for end-to-end face mask detection in public places. Our MFD contains 10,000 images along with binary class (i.e., masked face, non-masked face) bounding box annotations. The proposed dataset is generated from video sequences captured by surveillance cameras installed outside the departmental buildings. The installed cameras are mounted at heights of 12 to 15 feet from the ground. After video sequence collection, the crowded frames are manually extracted while ensuring quality control parameters such as the positioning of the people and the clarity of the images. It is important to mention that we comply with the regulatory bodies and collected the data from permitted areas. To protect privacy, we do not disclose or release personal identities, geo-location, or information on the incoming and outgoing patterns of the people. After completing frame extraction, considering the use-case of our proposed method, we defined two classes for annotation, i.e., masked face and non-masked face. For this purpose, we employed the LabelImg annotation tool to label the human faces according to the defined classes. One of the reasons for manual annotation instead of automated labeling is to maintain the accuracy of the ground truth coordinates, which plays an important role in training a robust face detection model. All the annotations are cross-validated by a team of experts to ensure the quality of the ground truth. Some samples of our dataset are shown in Fig. 2.

Fig. 2

Sample Images From Our MUST Face Dataset

Suitable face detection method selection

Recently, deep object detection methods have demonstrated good applicability in various real-time object detection and recognition tasks [24]. To select a suitable deep learning object detector, we first fine-tuned the existing state-of-the-art single-stage and multi-stage detection methods, including YOLO-v3 [49], SSD [63], RetinaNet-50 [28], Fast-RCNN [50], Faster R-CNN (FPN) [32], Faster-RCNN (ResNet-50) [25] and Faster-RCNN (ResNet-101) [29], on our proposed MFD through transfer learning. The results show that the existing YOLO-v3 outperformed the other employed detection methods in terms of inference time and accuracy. Based on its better performance, we further improved the YOLO-v3 architecture to make it more robust in outdoor environments.

Proposed facial mask detection architecture

In the proposed framework, we employ the YOLO-v3 architecture to perform facial mask detection in real time. YOLO-v3, proposed by Joseph Redmon and Ali Farhadi in 2018 [48], is one of the most prominent deep learning object detectors and has demonstrated consistent performance in object detection and recognition tasks. One of the main issues in earlier detection networks was the vanishing gradient problem, which commonly occurs as the number of network layers increases. Therefore, the multi-scale YOLO-v3 uses residual connections, which join the input of a previous layer to the output of a later layer, similar to the ResNet architecture. As a result, YOLO-v3 achieves good performance even on low-resolution images owing to its multi-scale feature extraction. To this end, we employed the existing YOLO-v3 architecture and used k-means clustering to obtain 9 anchor boxes, which are then divided across three detection scales to obtain more bounding boxes per image than the baseline version. The input layer takes an RGB image with a size of 416x416 pixels. As a backbone network, we employed DarkNet-53 to achieve the highest measured floating-point operations per second. Internally, the model is a fully convolutional network that does not contain max-pooling layers. As depicted in Fig. 1, the network contains convolution blocks, residual blocks, and scale output layers. In a convolution block, strided convolutions are used instead of max pooling to reduce the size of the input; each convolution is followed by batch normalization and ReLU activation. A residual block, on the other hand, consists of two convolution blocks with different kernel sizes and is referred to as a mega-block. In the existing YOLO-v3 architecture, the convolution blocks are repeated 1x, 2x, 4x, and 8x. However, considering the use-case of our application, we reduced the repetitions of the convolution blocks to 1x, 2x, and 4x in order to improve the learning performance and inference time.
At the end of the architecture, an average pooling layer followed by a fully connected layer and softmax activation is employed to down-sample the feature map and obtain the binary class output probability, respectively. To improve the learning process, we applied transfer learning, which reuses the knowledge stored in a trained neural network for a new task by learning only new weights. The ultimate aim of employing this technique is to speed up the learning process.
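The anchor selection step described above can be illustrated with a minimal sketch. The text does not state the distance metric or the data used for clustering, so this sketch assumes the 1 - IoU distance on (width, height) pairs that is commonly paired with YOLO-style detectors, and it runs on synthetic box sizes. The function names kmeans_anchors and split_by_scale are hypothetical and are not taken from the paper.

```python
import random

def iou_wh(box, anchor):
    """IoU of two boxes given only as (width, height), aligned at one corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with k-means; assumed distance is 1 - IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for old, members in zip(anchors, clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(old)  # keep the anchor of an empty cluster
        if new_anchors == anchors:
            break
        anchors = new_anchors
    # Sort by area so the anchors can be assigned to scales in order.
    return sorted(anchors, key=lambda a: a[0] * a[1])

def split_by_scale(anchors):
    """Split nine area-sorted anchors into small, medium and large groups."""
    return [anchors[0:3], anchors[3:6], anchors[6:9]]

# Synthetic face-box sizes in pixels for a 416x416 input (illustrative only).
rng = random.Random(1)
boxes = [(rng.uniform(8, 120), rng.uniform(8, 120)) for _ in range(500)]
anchors = kmeans_anchors(boxes)
small, medium, large = split_by_scale(anchors)
```

In a real setup the boxes list would hold the annotated face sizes from the training set, rescaled to the network input resolution.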

Social distance measurement methods

With the recent advancement in the field of AI, computer vision based methods have demonstrated good applicability in several applications such as scene understanding, object recognition, and speed and distance estimation [14]. Some studies used proportional-integral-derivative (PID) control [57] owing to its simplicity, although its performance is not optimal; it is suitable for distance measurement and consumes less power and memory. Zhang et al. [64] proposed a distance estimation method to localize an object in the camera coordinate frame. Their method contains three steps: the first step is camera calibration, the second step builds a model relating the camera coordinate frame to its projection frame for distance measurement, and the third step performs absolute distance estimation. In our method, the distance is computed with respect to the pivot point of the bounding box, known as the centroid, which is calculated using (1). In (1), C denotes the centroid, x_min and x_max denote the minimum and maximum horizontal coordinates (width) of the bounding box, and y_min and y_max denote the minimum and maximum vertical coordinates (height) of the bounding box. After the centroid of each bounding box is calculated, a unique ID is assigned to each centroid. In the next step, the distance between every pair of detected centroids is computed using the Euclidean distance formula, as shown in (2), and compared with the ground truth value. To validate the correctness, the Root Mean Square Error (RMSE), given in (3), is used to estimate the error between the actual value and the value predicted by the model.
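The three quantities referred to as (1), (2) and (3) can be sketched in their standard forms, which is an assumption since the equations themselves are not reproduced here: the centroid C = ((x_min + x_max) / 2, (y_min + y_max) / 2), the Euclidean distance d = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2) between two centroids, and RMSE = sqrt((1/n) * sum((actual_i - predicted_i)^2)). The function names and the sample boxes below are hypothetical, and the distances are in pixels; converting them to feet would need a camera calibration step that is not shown.

```python
import math
from itertools import combinations

def centroid(box):
    """Centre of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def pairwise_distances(boxes):
    """Euclidean distance between centroids for every pair of detection IDs."""
    # The list index of each box serves as its unique ID.
    centres = {i: centroid(b) for i, b in enumerate(boxes)}
    return {
        (i, j): math.dist(centres[i], centres[j])
        for i, j in combinations(centres, 2)
    }

def rmse(actual, predicted):
    """Root mean square error between measured and estimated distances."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Three illustrative detections in pixel coordinates.
boxes = [(0, 0, 10, 10), (30, 40, 40, 50), (0, 40, 10, 50)]
distances = pairwise_distances(boxes)
```

Keying the result by ID pairs makes it straightforward to flag which two detections fall below a chosen distance threshold.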

Proposed algorithm for real-time face mask detection

Here we present a novel algorithm, depicted in Algorithm 1, for developing and deploying an end-to-end face mask detection and social distance monitoring system in public spots. In the first step, visual frames are read from the real-time camera stream and passed to our face mask detection method for inference. Our proposed method analyzes each frame: if no face is detected, the network returns null; if faces are detected, each face is classified and the distance between faces is computed using our proposed method. The precautionary measures regarding face masks and social distance are discussed in Section 2. The following scenarios are handled. If a person wears a mask and the distance is greater than 6 feet, no action is performed. When a person is not wearing a mask and the social distance is greater than 6 feet, a high alert is raised. On the other hand, when a person wears a mask and the social distance is less than 6 feet, an alarm is again generated, i.e., a masked person who does not maintain social distance triggers a warning. Real-time surveillance Pro.
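The scenarios above can be written as a small decision function. This is only a sketch of the rules as stated in the text, not a reproduction of Algorithm 1. Two points are assumptions: the text does not say what happens when a person is unmasked and also closer than 6 feet, so this sketch raises a combined alert, and a distance of exactly 6 feet is treated as compliant. The names Action, decide and process_frame are hypothetical.

```python
from enum import Enum

SAFE_DISTANCE_FEET = 6.0  # threshold stated in the text

class Action(Enum):
    NONE = "no action"
    DISTANCE_WARNING = "masked but too close"
    MASK_ALERT = "unmasked at safe distance"
    MASK_AND_DISTANCE_ALERT = "unmasked and too close"  # assumed case

def decide(wearing_mask, distance_feet):
    """Map one detection to an action following the scenarios in the text."""
    # Assumption: exactly 6 feet counts as maintaining distance.
    distanced = distance_feet >= SAFE_DISTANCE_FEET
    if wearing_mask and distanced:
        return Action.NONE
    if wearing_mask:
        return Action.DISTANCE_WARNING
    if distanced:
        return Action.MASK_ALERT
    # Not covered explicitly in the text; assumed to be the strictest alert.
    return Action.MASK_AND_DISTANCE_ALERT

def process_frame(detections):
    """detections: list of (wearing_mask, nearest_distance_feet) per face.

    Returns None when no face is detected, mirroring the null result
    described in the text, otherwise one action per face.
    """
    if not detections:
        return None
    return [decide(mask, dist) for mask, dist in detections]
```

For example, process_frame([(True, 8.0), (False, 3.0)]) returns one action per detected face, while an empty detection list yields None.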

Experiments and results

In this section, we evaluate the effectiveness of the proposed mask/non-mask face detection method and present a comparison with current cutting-edge techniques. The experiments are conducted on a computer running a 64-bit version of Windows 10 with an RTX 2080 Ti graphics card (11 GB DDR5 GPU memory), a Core i9-9900K CPU, and 32 GB of RAM.

Training setup

The training process of the proposed pipeline is divided into three fundamental steps: data pre-processing, model training, and model evaluation. Firstly, the whole dataset is randomly split into training, validation, and test sets with an 80:10:10 ratio, and the input size is normalized to a 416x416 pixel resolution. In the next step, the PyTorch library is used for the implementation of the proposed pipeline. Moreover, the experiments are categorized into three phases: (i) evaluation of the existing state-of-the-art object detection networks on the proposed dataset, (ii) evaluation of the improved YOLO-v3 network on the proposed dataset, and (iii) evaluation of the proposed distance measurement method.
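The random 80:10:10 split described above can be sketched as follows. The file names, the seed, and the function name split_dataset are hypothetical; the text only specifies that the split is random and gives the ratio.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items and split them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Placeholder identifiers standing in for the 10,000 dataset images.
image_ids = [f"img_{i:05d}.jpg" for i in range(10000)]
train, val, test = split_dataset(image_ids)
```

With 10,000 images this yields 8,000 training, 1,000 validation and 1,000 test images, with no image shared between the sets.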

Evaluation of existing state-of-the-art object detection networks on proposed dataset

To evaluate the existing state-of-the-art deep object detection models, YOLO-v3, SSD, RetinaNet-50, RetinaNet-101, Fast-RCNN, Faster R-CNN (FPN), Faster-RCNN (ResNet-50), and Faster-RCNN (ResNet-101) are fine-tuned on the proposed face mask detection dataset. The PyTorch 1.4.0 library and CUDA 11.0 are used to configure the training runs. The learning rate, batch size, and number of epochs are set to 0.0001, 32, and 100, respectively, and the stochastic gradient descent optimizer is used to update the model weights. The performance metrics of the employed models are shown in Table 2.

Table 2

Evaluation of existing state-of-the-art object detection networks on proposed dataset

Method	Mean Accuracy	mAP	mAP @ 0.95	Inf.Time (ms)
YOLO-V3	64.1%	59.6%	53.1%	28
SSD	61.8%	56.2%	48.6%	34
RETINA-NET 50	55.2%	51.9%	44.7%	37
RETINA-NET 101	51.0%	46.3%	41.8%	39
FAST-RCNN	41.7%	39.4%	37.1%	132
FASTER-RCNN (FPN)	47.3%	44.0%	41.5%	119
FASTER-RCNN (ResNet-50)	59.0%	57.4%	55.6%	108
FASTER-RCNN (ResNet-101)	62.7%	61.3%	59.0%	98

It can be seen from Table 2 that the single-stage detectors demonstrated better applicability in terms of low inference time owing to their architectures with fewer parameters, whereas the multi-stage object detectors are computationally expensive and have significantly higher inference time. It is also important to mention that YOLO-v3 with 53 layers demonstrated better mean accuracy than SSD, RetinaNet-50, RetinaNet-101, Fast-RCNN, Faster R-CNN (FPN), Faster-RCNN (ResNet-50), and Faster-RCNN (ResNet-101). For instance, YOLO-v3 achieved 64.1% mean accuracy, 59.6% mAP, and 53.1% mAP @ 0.95 with 28 ms inference time. Similarly, SSD achieved 61.8% mean accuracy, 56.2% mAP, and 48.6% mAP @ 0.95 with 34 ms inference time. RetinaNet-50 demonstrated 55.2% mean accuracy, 51.9% mAP, and 44.7% mAP @ 0.95 with an inference time of 37 ms on the test set, whereas RetinaNet-101 achieved 51.0% mean accuracy, 46.3% mAP, and 41.8% mAP @ 0.95 with 39 ms inference time, which is slightly higher than that of RetinaNet-50. We next analyze the multi-stage object detectors. Fast R-CNN demonstrated 41.7% mean accuracy, 39.4% mAP, and 37.1% mAP @ 0.95 with 132 ms inference time on our test set, which is significantly higher than that of the employed single-shot detectors. In another experiment, Faster R-CNN based on FPN achieved a mean accuracy of 47.3%, 44.0% mAP, and 41.5% mAP @ 0.95 with an inference time of 119 ms. The Faster R-CNN with a ResNet-50 feature extraction network achieved a mean accuracy of 59.0%, 57.4% mAP, and 55.6% mAP @ 0.95 with an inference time of 108 ms. With ResNet-101 as the backbone feature extraction network, Faster-RCNN shows a mean accuracy of 62.7%, 61.3% mAP, and 59.0% mAP @ 0.95 with an inference time of 98 ms. Consequently, it can be assumed that YOLO-v3 with DarkNet-53 can achieve better accuracy after further architectural fine-tuning.

Evaluation of improved YOLO-V3 architecture on proposed dataset

Based on the above analysis, the architecture of YOLO-v3 is further improved by trimming the less-contributing convolutional layers and residual connections. The improved feature extractor, DarkNet, has been evaluated on the proposed dataset. To train the network faster, we employed transfer learning to learn the high-level features from the proposed dataset. In the training setup, we used the SGD optimization algorithm with momentum to train and evaluate the improved network on our proposed dataset for mask/non-mask face detection. Well-known performance metrics (mean accuracy, mAP, mAP @ 0.95, and inference time) are used to evaluate our improved mask/non-mask face detection on our dataset. Mean accuracy is the number of correct predictions divided by the total number of data samples. mAP denotes mean average precision, and mAP @ 0.95 is the mean average precision at an intersection-over-union (IoU) threshold of 0.95. Inference time is the total time taken from receiving an input to producing an output.

Table 3 shows that our improved YOLO-v3-based detection network outperformed the baseline YOLO-v3 in mask/non-mask face detection on our proposed dataset. One of the main reasons for the increase in accuracy is the trimming of the less-contributing residual connections, which also accelerated our model compared with the baseline model. Sample results are shown in Fig. 3 to illustrate the effectiveness of our proposed masked/non-masked face detection method.
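The mean accuracy and IoU definitions given above can be illustrated with a minimal Python sketch. The `(x_min, y_min, x_max, y_max)` box layout, the label strings, and the function names are assumptions made for illustration only; the paper does not specify its evaluation code.

```python
from typing import Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned bounding boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.95) -> bool:
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


def mean_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Number of correct predictions divided by the total number of samples."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted, actual))
    return correct / len(actual)
```

With a threshold of 0.95, a predicted box has to overlap the ground-truth box almost exactly to count as correct, which is why mAP @ 0.95 is the strictest of the reported metrics.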

Table 3

Evaluation of improved YOLO-v3 on proposed dataset

Method	Mean Accuracy	mAP	mAP @ 0.95	Inf. Time
Existing YOLO-V3	64.1%	59.6%	53.1%	28 ms
Proposed YOLO-V3	69.4%	64.7%	62.0%	25 ms
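
The differences between the two rows of Table 3 can be worked out directly. The short Python sketch below uses only the values reported in the table; the dictionary keys and function names are illustrative, and the frames-per-second figure assumes that images are processed one at a time.

```python
# Values copied from Table 3 (percentages and milliseconds).
baseline = {"mean_accuracy": 64.1, "mAP": 59.6, "mAP@0.95": 53.1, "inference_ms": 28.0}
proposed = {"mean_accuracy": 69.4, "mAP": 64.7, "mAP@0.95": 62.0, "inference_ms": 25.0}


def deltas(base: dict, new: dict) -> dict:
    """Per-metric difference (new minus base), rounded to one decimal place."""
    return {key: round(new[key] - base[key], 1) for key in base}


def frames_per_second(inference_ms: float) -> float:
    """Throughput implied by a per-image inference time, one image at a time."""
    return round(1000.0 / inference_ms, 1)


improvement = deltas(baseline, proposed)
# improvement == {'mean_accuracy': 5.3, 'mAP': 5.1, 'mAP@0.95': 8.9, 'inference_ms': -3.0}
```

The 5.3-point gain in mean accuracy is the improvement quoted in the abstract, and the 3 ms reduction in inference time corresponds to roughly 40 frames per second instead of about 35.7.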

Fig. 3

Qualitative examples of our masked/non-masked face detection method on our face mask dataset

Evaluation of proposed distance measurement method

After evaluating our proposed mask/non-mask face detection method, we next evaluated our proposed machine-vision-based distance measurement method for ensuring social distancing in public places. Following standard practice, we used the root mean square error (RMSE) to measure the correctness of our method against the ground truth. Some of the quantitative results are shown in Table 4.
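
As a point of reference, the textbook RMSE definition can be sketched in Python as follows, applied to the three ground-truth and predicted distances listed in Table 4. This is an illustrative sketch, not the authors' evaluation code. For a single pair the definition reduces to the absolute difference, so the sketch does not attempt to reproduce the per-row values in Table 4, whose exact computation is not spelled out in the text.

```python
import math
from typing import Sequence


def rmse(truth: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(truth, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Distances in feet, copied from Table 4.
ground_truth_ft = [2.44, 2.99, 3.16]
predicted_ft = [2.37, 2.95, 3.10]

# RMSE over all three samples taken together.
overall_rmse = rmse(ground_truth_ft, predicted_ft)
```

Over the three samples together, this definition gives an RMSE of about 0.058 ft.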

Table 4

Results of proposed distance measurement methods

Sr. No.	Ground Truth (ft)	Predictions (ft)	RMSE
Distance 1	2.44	2.37	0.035
Distance 2	2.99	2.95	0.020
Distance 3	3.16	3.10	0.030

The vision-based system detects each person's face and returns the bounding-box information. It then locates the central point (centroid) of each face bounding box and measures the distance between two centroids using the standard Euclidean distance equation. The error is computed using RMSE, which measures the difference between the ground-truth value and the value predicted by the model. For instance, in the Distance 1 sample, the actual distance (ground truth) between two persons is 2.44 feet, whereas our proposed vision-based distance measurement method predicts 2.37 feet, a small error of 0.035 RMSE. In the next sample, Distance 2, the actual distance is 2.99 feet, whereas our model inferred 2.95 feet with an RMSE of 0.020. Similarly, in the Distance 3 sample, the ground-truth value is 3.16 feet, whereas the proposed method predicts 3.10 feet with an error of 0.030, which indicates effective performance on our test set.
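
The centroid-based measurement described above can be sketched in Python as follows. The box layout `(x_min, y_min, x_max, y_max)` and the `feet_per_pixel` calibration factor are assumptions introduced for illustration; the text does not describe how pixel distances are converted to feet.

```python
import math
from typing import Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def centroid(box: Box) -> Tuple[float, float]:
    """Central point of a face bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def centroid_distance(a: Box, b: Box, feet_per_pixel: float = 1.0) -> float:
    """Euclidean distance between two box centroids.

    feet_per_pixel is a hypothetical calibration factor; with the default
    of 1.0 the result is simply the distance in pixels.
    """
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    return math.hypot(ax - bx, ay - by) * feet_per_pixel
```

For example, boxes `(0, 0, 2, 2)` and `(3, 4, 5, 6)` have centroids `(1, 1)` and `(4, 5)`, which are 5 pixels apart; with an assumed scale of 0.5 feet per pixel, that corresponds to 2.5 feet.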

Conclusion

In this paper, a novel end-to-end pipeline for masked/non-masked face detection is proposed to improve the effectiveness of real-time surveillance systems in public places. Alongside it, a new dataset containing 10,000 images in two classes (masked face, non-masked face) is constructed to support the development of generalized masked/non-masked face detection and social distance measurement in outdoor public places. While fine-tuning existing state-of-the-art single-stage and multi-stage detection methods, we observed that YOLO-v3 outperformed the other networks in terms of accuracy and inference time. Based on this analysis, we further improved the baseline YOLO-v3 by eliminating the less-contributing residual connections in the network. The results indicate that our customized YOLO-v3 performed better than the baseline version, showing an improvement of 5.3% in terms of accuracy. In the future, we aim to extend our work to an image segmentation-based system that can provide more accurate information and greater clarity for detecting face masks.
