Ricardo Mar-Cupido, Vicente García, Gilberto Rivera, J. Salvador Sánchez.
Abstract
The use of face masks in public places has emerged as one of the most effective non-pharmaceutical measures to reduce the spread of COVID-19. This has led to the development of several detection systems for identifying people who do not wear a face mask. However, not all face masks or coverings are equally effective at preventing virus transmission, so it appears important for such systems to also distinguish between the different types of face masks. This paper implements four pre-trained deep transfer learning models (NasNetMobile, MobileNetv2, ResNet101v2, and ResNet152v2) to classify images according to the type of face mask (KN95, N95, surgical, and cloth) worn by people. Experimental results indicate that the deep residual networks (ResNet101v2 and ResNet152v2) provide the best performance, with the highest accuracy and the lowest loss.
Keywords: COVID-19; Deep learning; Face mask; Recognition; Transfer learning
Year: 2022 PMID: 35765303 PMCID: PMC9222491 DOI: 10.1016/j.asoc.2022.109207
Source DB: PubMed Journal: Appl Soft Comput ISSN: 1568-4946 Impact factor: 8.263
Fig. 1. Types of face masks commonly worn during the COVID-19 pandemic (from left to right: KN95 mask, N95 mask, cloth mask, surgical mask, and without mask).
Fig. 2. Sample images from the databases for faces with and without a mask.
Fig. 3. Framework of the proposed methodology.
Fig. 4. Region of interest (ROI).
Confusion matrix for multi-class problems (schematic: rows are actual classes, columns are predicted classes, with per-class and overall totals).
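The layout above is the standard multi-class confusion matrix. A minimal sketch (plain Python, with hypothetical label lists for the paper's five classes, not the paper's actual data) of how such a matrix is tallied:

```python
def confusion_matrix(y_true, y_pred, classes):
    """Return a dict-of-dicts: counts[actual][predicted]."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(y_true, y_pred):
        counts[a][p] += 1
    return counts

# Hypothetical labels for the five classes considered in the paper
classes = ["KN95", "N95", "Cloth", "Surgical", "Without"]
y_true = ["KN95", "N95", "Cloth", "Cloth", "Surgical", "Without"]
y_pred = ["KN95", "N95", "Cloth", "Surgical", "Surgical", "Without"]

cm = confusion_matrix(y_true, y_pred, classes)
# Diagonal entries are correct predictions; off-diagonal entries are errors
print(cm["Cloth"]["Surgical"])  # one cloth mask mistaken for a surgical mask
```

Row sums give the total number of actual samples per class, and column sums the total predicted per class, matching the "Total" row and column of the schematic.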
Accuracy and loss for the recognition models.
| Model | Accuracy | Loss |
|---|---|---|
| NasNetMobile | 0.9324 | 0.2160 |
| MobileNetv2 | 0.9324 | 0.2131 |
| ResNet101v2 | 0.9800 | 0.0734 |
| ResNet152v2 | 0.9737 | 0.0808 |
Confusion matrix of the pre-trained deep transfer learning models (true class labels on the rows and predicted class labels on the columns).
Recall, precision, specificity and F1-score for the ResNet101v2 model.
| Class | Recall | Precision | Specificity | F1-score |
|---|---|---|---|---|
| KN95 | 0.9793 | 0.9895 | 0.9974 | 0.9844 |
| N95 | 0.9828 | 0.9794 | 0.9948 | 0.9811 |
| Cloth | 0.9655 | 0.9825 | 0.9957 | 0.9739 |
| Surgical | 0.9897 | 0.9696 | 0.9922 | 0.9795 |
| Without | 0.9828 | 0.9794 | 0.9948 | 0.9811 |
| Macro-avg | 0.9802 | 0.9801 | 0.9950 | 0.9802 |
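The per-class metrics in the table follow from confusion-matrix counts in the standard way (true positives on the diagonal, false negatives along the row, false positives along the column). A minimal NumPy sketch, using an illustrative 3×3 matrix rather than the paper's actual counts:

```python
import numpy as np

def per_class_metrics(cm):
    """Recall, precision, specificity and F1 per class from a square
    confusion matrix (rows = actual class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # this class, predicted as something else
    fp = cm.sum(axis=0) - tp   # predicted as this class, actually another
    tn = cm.sum() - tp - fn - fp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, specificity, f1

# Illustrative 3-class example (not the paper's data)
cm = [[50, 2, 0],
      [3, 45, 2],
      [1, 1, 48]]
rec, prec, spec, f1 = per_class_metrics(cm)
print("macro-averaged recall:", rec.mean())
```

The macro-averaged values in the last row of the table are the unweighted means of the per-class columns, i.e. `rec.mean()`, `prec.mean()`, and so on.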
Fig. 5. Examples of images misclassified by the ResNet101v2 model (labels read ‘true class → predicted class’).
Comparison of our proposal against related works.
| Reference | Techniques | Classification problem | # Classes | Results |
|---|---|---|---|---|
| This paper | ResNet101v2 | Multi-class | 5 | Accuracy = 98.00% |
| | InceptionV3 | Two-class | 2 | Accuracy = 100% |
| | SRCNet | Multi-class | 3 | Accuracy = 98.70% |
| | ResNet50 and YOLOv2 | Two-class | 2 | Average Precision = 81% |
| | YOLOv3 and Faster R-CNN (ResNet101-FPN) | Two-class | 2 | Average Precision = 62% |
| | ResNet50 and SVM | Two-class | 2 | Accuracy = 100% |
| | Faster R-CNN and InceptionV2 | Two-class | 2 | F1-score = 94.19% |
| | VGG16 | Multi-class | 7 | Accuracy = 99.29% |
| | SE-YOLOv3 | Multi-class | 3 | Average Precision = 73.7% |
| | CSPDarkNet53, PANet and YOLOv4 | Multi-class | 3 | Average Precision = 98.30%, F1-score = 96.7% |
| | ResNet50 | Two-class | 2 | Accuracy = 98.20% |
| | YOLOv3 | Two-class | 2 | Average confidence = 97.00% |
| | MobileNetv2 | Two-class | 2 | Accuracy = 96.85% |
| | MobileNetv2 and Single Shot Detector | Two-class | 2 | Accuracy = 91.70% |
| | MobileNetv2 | Two-class | 2 | Accuracy = 98.00% |
| | MobileNetv2 | Two-class | 2 | Accuracy = 98.00% |
| | InceptionV3 | Two-class | 2 | Accuracy = 98.00% |
| | Single Shot Multibox Detector and MobileNetv2 | Two-class | 2 | Accuracy = 92.64%, F1-score = 93.00% |
| | YOLOv4 | Two-class | 2 | Average Precision = 88%, F1-score = 99.54% |
| | MobileNetv2 | Two-class | 2 | Accuracy = 81.74% |
| | Multigraph CN-VGG16 | Two-class | 2 | Accuracy = 97.9% |
| | MobileNetv2, DenseNet121, NASNet | Two-class | 2 | F1-score = 99.40% |
| | MobileNetv2-SVM | Two-class | 2 | Accuracy = 97.11% |
| | VGG-16 | Multi-class | 3 | Accuracy = 99.81% |