Peishu Wu, Han Li, Nianyin Zeng, Fengping Li.
Abstract
Coronavirus disease 2019 (COVID-19) is a worldwide epidemic, and efficient prevention and control of the disease has become a focus of the global scientific community. In this paper, a novel face mask detection framework, FMD-Yolo, is proposed to monitor whether people in public wear masks correctly, which is an effective way to block virus transmission. In particular, the feature extractor employs Im-Res2Net-101, which combines the Res2Net module with a deep residual network; its hierarchical convolutional structure, deformable convolution, and non-local mechanisms enable thorough information extraction from the input. Afterwards, an enhanced path aggregation network, En-PAN, is applied for feature fusion, where high-level semantic information and low-level details are sufficiently merged so that model robustness and generalization ability are enhanced. Moreover, a localization loss is designed and adopted in the training phase, and the Matrix NMS method is used at inference to improve detection efficiency and accuracy. Benchmark evaluation is performed on two public databases, with results compared against eight other state-of-the-art detection algorithms. At the IoU = 0.5 level, the proposed FMD-Yolo achieves the best precision, with AP50 of 92.0% and 88.4% on the two datasets, and its AP75 at IoU = 0.75 improves on the second-best method by 5.5% and 3.9%, respectively, which demonstrates the superiority of FMD-Yolo in face mask detection in both theoretical and practical terms.
Keywords: COVID-19; Face mask detection; Feature extraction and fusion; Improved YoloV3 algorithm
Year: 2021 PMID: 34848910 PMCID: PMC8612756 DOI: 10.1016/j.imavis.2021.104341
Source DB: PubMed Journal: Image Vis Comput ISSN: 0262-8856 Impact factor: 2.818
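The abstract states that Matrix NMS is used at inference. Matrix NMS (introduced in SOLOv2) replaces hard suppression with score decay: each box's score is reduced according to its overlap with higher-scored boxes, compensated by how much those boxes are themselves overlapped. The linear-decay variant below is a minimal self-contained sketch, not the paper's implementation:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def matrix_nms(boxes, scores):
    """Linear Matrix NMS: decay each score by its overlap with every
    higher-scored box, compensated by that box's own max overlap with
    boxes ranked above it. Returns the decayed scores."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    out = list(scores)
    for rank, j in enumerate(order):
        decay = 1.0
        for pos, i in enumerate(order[:rank]):
            ov = iou(boxes[i], boxes[j])
            # compensation: max IoU of box i with boxes scored above it
            comp = max((iou(boxes[m], boxes[i]) for m in order[:pos]),
                       default=0.0)
            decay = min(decay, (1.0 - ov) / (1.0 - comp))
        out[j] = scores[j] * decay
    return out
```

A heavily overlapped lower-scored box keeps only a fraction of its score, while distant boxes are untouched; a final score threshold then filters detections.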
Fig. 1. General flowchart of the face mask detection algorithm (FMDA) framework.
Fig. 2. Comparison of the conventional bottleneck structure with the Res2Net module.
Fig. 3. Four types of feature fusion methods.
Fig. 4. The architecture of the YoloV3 network.
Fig. 5. The anchor cluster algorithm flow used in this paper.
Fig. 6. Overall structure of the proposed FMD-Yolo.
Fig. 7. The structure of the Im-Res2Net-101 backbone.
Fig. 8. The implementation of the En-PAN structure.
Datasets information.
| Numbers | MD-2 | MD-3 |
|---|---|---|
| Training set | 6362 | 683 |
| Validation set | 1590 | 170 |
| Training-set boxes in category a1 / b1 | 9937 | 567 |
| Training-set boxes in category a2 / b2 | 3212 | 2388 |
| Training-set boxes in category b3 | – | 107 |
| Mean boxes per image | 2.067 | 4.490 |
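The "mean boxes per image" row can be sanity-checked from the category box counts, assuming the mean is taken over training images only:

```python
# Box totals from the category rows of the table; image counts from the
# training-set row. Assumes the mean covers the training split only.
md2_mean = (9937 + 3212) / 6362        # MD-2: categories a1 + a2
md3_mean = (567 + 2388 + 107) / 683    # MD-3: categories b1 + b2 + b3
print(round(md2_mean, 3))  # 2.067, matching the table
print(round(md3_mean, 3))  # 4.483, near the reported 4.490
```

MD-2 matches exactly; the small MD-3 gap (4.483 vs. 4.490) suggests the reported figure may also count validation boxes or use slightly different totals.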
Training parameter settings.
| Parameters | MD-2 | MD-3 |
|---|---|---|
| Max Iterations | 120,000 | 150,000 |
| Base Learning Rate | 0.000625 | 0.000625 |
| PiecewiseDecay Iters | 110,000 | 130,000 |
| Warmup Steps | 4000 | 4000 |
| Optimizer | SGD with Momentum (factor = 0.9) | SGD with Momentum (factor = 0.9) |
| Regularizer | L2 (factor = 0.0005) | L2 (factor = 0.0005) |
| Train Batch Size | 6 | 6 |
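The warmup and piecewise-decay settings above can be sketched as a step-dependent learning-rate function. The linear warmup shape and the decay factor of 0.1 are assumptions (the table gives only the base rate, warmup length, and decay iteration); `decay_step=110_000` matches the MD-2 column:

```python
def learning_rate(step, base_lr=0.000625, warmup_steps=4000,
                  decay_step=110_000, decay_factor=0.1):
    """Linear warmup to base_lr, then a single piecewise decay.
    Warmup shape and decay_factor=0.1 are assumed, not from the paper."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps   # ramp up linearly
    if step < decay_step:
        return base_lr                          # constant plateau
    return base_lr * decay_factor               # decayed tail
```

For MD-3 the same function applies with `decay_step=130_000`.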
Performance comparison of FMD-Yolo and eight other detection algorithms on the MD-2 dataset.
| Methods | Evaluation metrics | | | | |
|---|---|---|---|---|---|
| Faster RCNN baseline | 0.565 | 0.869 | 0.668 | 0.623 | 0.895 |
| Faster RCNN with FPN | 0.597 | 0.886 | 0.713 | 0.655 | 0.900 |
| Yolo V3 | 0.574 | 0.888 | 0.659 | 0.670 | 0.929 |
| Yolo V4 | 0.599 | 0.899 | 0.718 | 0.661 | 0.920 |
| RetinaNet | 0.616 | 0.887 | 0.726 | 0.664 | 0.909 |
| FCOS | 0.605 | 0.894 | 0.710 | 0.674 | 0.932 |
| EfficientDet | 0.613 | 0.870 | 0.726 | 0.665 | 0.910 |
| HRNet | 0.618 | 0.902 | 0.745 | 0.671 | 0.916 |
| FMD-Yolo (ours) | | | | | |
Performance comparison of FMD-Yolo and eight other detection algorithms on the MD-3 dataset.
| Methods | Evaluation metrics | | | | |
|---|---|---|---|---|---|
| Faster RCNN baseline | 0.521 | 0.821 | 0.604 | 0.601 | 0.889 |
| Faster RCNN with FPN | 0.536 | 0.846 | 0.551 | 0.615 | 0.907 |
| Yolo V3 | 0.507 | 0.824 | 0.585 | 0.609 | 0.927 |
| Yolo V4 | 0.525 | 0.843 | 0.591 | 0.587 | 0.897 |
| RetinaNet | 0.489 | 0.792 | 0.536 | 0.579 | 0.875 |
| FCOS | 0.521 | 0.796 | 0.558 | 0.628 | 0.916 |
| EfficientDet | 0.454 | 0.699 | 0.518 | 0.569 | 0.841 |
| HRNet | 0.512 | 0.802 | 0.556 | 0.571 | 0.845 |
| FMD-Yolo (ours) | | | | | |
Fig. 9. Overall P-R curves for each model on the MD-2 and MD-3 datasets (IoU = 0.5).
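The AP50 figures in the tables are areas under P-R curves like those in Fig. 9, computed at an IoU threshold of 0.5. A minimal all-point-interpolation sketch follows; the exact interpolation scheme used in the paper's evaluation is not stated here, so this is an assumption of the common COCO-style envelope rule:

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the upper envelope of the
    P-R curve. `recalls` must be sorted in ascending order, with
    `precisions[i]` the precision at `recalls[i]`."""
    ap, prev_r = 0.0, 0.0
    for i, r in enumerate(recalls):
        p = max(precisions[i:])   # best precision at recall >= r
        ap += (r - prev_r) * p    # rectangle under the envelope
        prev_r = r
    return ap
```

For example, a two-point curve with precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0 yields AP = 0.5 · 1.0 + 0.5 · 0.5 = 0.75.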
Performance comparison on each category of the MD-2 dataset.
| Methods | | | | |
|---|---|---|---|---|
| Faster RCNN baseline | 0.776 | 0.811 | 0.962 | 0.979 |
| Faster RCNN with FPN | 0.811 | 0.824 | 0.961 | 0.975 |
| Yolo V3 | 0.804 | 0.863 | 0.971 | 0.994 |
| Yolo V4 | 0.829 | 0.855 | 0.969 | 0.985 |
| RetinaNet | 0.796 | 0.825 | 0.978 | 0.993 |
| FCOS | 0.819 | 0.870 | 0.969 | 0.994 |
| EfficientDet | 0.762 | 0.826 | 0.993 | |
| HRNet | 0.834 | 0.852 | 0.971 | 0.980 |
| FMD-Yolo (ours) | 0.974 | | | |
Performance comparison on each category of the MD-3 dataset.
| Methods | Category (b1 / b2 / b3) | |
|---|---|---|
| Faster RCNN baseline | 0.817 / 0.917 / 0.730 | 0.860 / 0.930 / 0.875 |
| Faster RCNN with FPN | 0.838 / 0.913 / 0.787 | 0.860 / 0.922 / |
| Yolo V3 | 0.864 / 0.928 / 0.681 | 0.900 / 0.943 / |
| Yolo V4 | 0.840 / 0.903 / 0.785 | 0.887 / 0.928 / 0.875 |
| RetinaNet | 0.774 / 0.870 / 0.732 | 0.800 / 0.886 / |
| FCOS | 0.842 / 0.917 / 0.627 | 0.933 / 0.940 / 0.875 |
| EfficientDet | 0.696 / 0.870 / 0.530 | 0.793 / 0.917 / 0.812 |
| HRNet | 0.825 / 0.915 / 0.666 | 0.860 / 0.925 / 0.750 |
| FMD-Yolo (ours) | | |
Fig. 10. P-R curves of FMD-Yolo on each category (IoU = 0.5).