O K Sikha1, Bandla Bharath1. 1. Department of Computer Science & Engineering Amrita School of Engineering, Coimbatore Amrita Vishwa Vidyapeetham, Coimbatore, India.
Abstract
With the recent COVID-19 pandemic, wearing masks has become a necessity in daily life. People are encouraged to wear masks to protect themselves from infection with COVID-19. The presence of masks raises serious concerns about the accuracy of existing facial recognition systems, since most facial features are obscured by the mask. To address these challenges, a new method for masked face recognition is proposed that combines a cropping-based approach (upper half of the face) with an improved VGG-16 architecture. The finest features from the un-occluded facial region (forehead and eyes) are extracted using a transfer-learned VGG-16 model. The optimal cropping ratio is investigated to give an enhanced feature representation for recognition. To avoid the overhead of bias, the obtained feature vector is mapped into a lower-dimensional feature representation using a Random Fourier Feature extraction module. Comprehensive experiments on the Georgia Tech Face Dataset, Head Pose Image Dataset, and Face Dataset by Robotics Lab show that the proposed approach outperforms other state-of-the-art approaches for masked face recognition.
COVID-19 is currently wreaking havoc on the world. Coronavirus disease (COVID-19) is a viral infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). With its deadly spread to 222 countries and territories worldwide, COVID-19 has caused a global crisis. A total of 352,234,810 confirmed cases of COVID-19, which originated in Wuhan, China, have been reported, with a death toll of 5,615,082 as of January 2022. People become infected through the respiratory droplets released when an infected person in close proximity coughs, sneezes, or talks. Furthermore, the virus can be spread by touching a virus-infected surface or object and then touching the mouth, nose, or eyes. Wearing a mask and social distancing are the most effective ways to combat this pandemic, as shown in Fig. 1. Implementing these safety measures, particularly masking the lower half of the face, has a significant impact on the facial-recognition-based security systems that have already been deployed by a number of corporations and government agencies.
Fig. 1
Chance of Covid 19 transmission with/without social distancing and face mask
In comparison to traditional methods such as PIN, secret password, and finger impression (Multimodal fingerprint spoof detection using white light 2016; Schroff et al. 2015), face recognition has gained a lot of attention as a remarkable biometric authentication procedure across all automatic personal authentication systems (Singh et al. 2021; Hong et al. 2021; Nefian and Hayes 2000). Many organizations and government bodies rely on this technique to protect their assets and to secure public places such as airports, bus stands, and railway stations. With the rapid advancement and expansion of machine learning and deep learning techniques, challenges in the face recognition domain have also been well addressed. Machine learning-based face detection and verification models necessitate manual feature extraction, whereas deep learning models do not. Deep learning architectures, specifically Convolutional Neural Networks (CNN), can learn valuable features from training images automatically. CNNs perform a series of convolution operations and nonlinear activations on the image, making them more suitable for working with image data than generic Artificial Neural Network (ANN) architectures. Several studies on the application of CNN-based models for face recognition in various poses and disguises have been published in the literature, with promising accuracies on various open benchmark datasets. The traditional Eigenface algorithm's recognition accuracy on the LFW dataset is approximately 60% (Lfw recognition results), whereas the most advanced deep learning face recognition models such as Facenet (Priya and Banu 2014) reported a recognition accuracy of 99% and beyond. The majority of modern machine learning/deep learning face recognition models with high recognition accuracy require labeled un-occluded face datasets (Deng et al. 2019; Li et al. 2021) for recognition.
Models trained on normal un-occluded face images learn key facial points for recognition such as face edges, lips, and eyes, but they may fail with masked images since the mask covers most of the key facial points as illustrated in Fig. 2. According to a study conducted by NIST (Nist finds facial recognition has trouble with face masks), the failure rate of state-of-the-art face recognition models ranged between 20% and 50% for masked input images.
Fig. 2
Facial Key points on original (Singh and Mary 2016) and masked face images
Recently, several studies have addressed the problem of Masked Face Recognition (MFR). They can be categorized primarily into three classes: occlusion removal-based models, deep learning-based models, and face reconstruction/face completion-based models. This paper proposes an occlusion removal-based deep learning architecture for masked face recognition which involves two tasks: (1) face mask detection and (2) masked face recognition. A Haar cascade model is used to detect the mask, and the masked region is removed for further processing. A hybrid VGG16-Random Fourier deep learning architecture is then used to extract enhanced facial features for recognition. The major contributions of the presented work are as follows:
(1) The unavailability of masked face images for training is the primary challenge for a masked face recognition model. The proposed model uses existing databases for training, thereby negating the need to create a masked face database.
(2) The colour of the mask causes uncertainty in the accuracy of a model trained on masked images. The proposed model considers only the visible part of the face and discards the part covered by the mask, and is hence invariant to mask colour. Optimal cropping for MFR is also explored.
(3) A hybrid VGG16-Random Fourier deep learning model is proposed to extract enhanced features from the upper half of the face for recognition.
The paper is arranged as follows: related works are detailed in Sect. 2, and Sect. 3 describes the proposed model in detail. The experimental setup is detailed in Sect. 4, and Sect. 5 describes the results. Finally, the paper concludes with Sect. 6.
Related works
Occlusion is a major challenge for most existing Face Recognition (FR) systems. It is generally caused by objects or ornaments worn by a person that cover a part of the face, such as glasses, hats, masks, and bandages. Masks differ from glasses and hats by causing a huge loss in the visual region of the face (Kumar et al. 2019; Simonyan and Zisserman 1409); hence most state-of-the-art face recognition models fail when masked faces are given as input. There are three major problems with masked face recognition compared to other occluded face recognition. To begin with, there is a scarcity of large face datasets with masks. Second, the mouth and nasal features are largely lost, and the effective characteristics are substantially diminished (refer to Fig. 2). Finally, a mask-wearing face is difficult to detect. In this new normal, where the mask has become an inevitable aspect, researchers have started to explore techniques to perform face recognition with a mask. Masked face recognition models reported in the literature can be classified into three major categories: occlusion removal approaches, restoration approaches, and deep learning-based approaches.
Occlusion removal
Occlusion removal-based approaches first detect the occluded part of the face and discard it completely during recognition. Priya and Banu (Neha and Nithin 2018) divided the face into smaller regions and used an SVM classifier to remove the occlusion. The global masked projection method is used by Alyuz et al. (Deng et al. 2019) to remove the occluded region. Partial Gappy PCA is then applied to reconstruct the face using eigenvectors. To detect occluded regions and eliminate them during the recognition phase, Andrés et al. (Andrés et al. 2014) calculated the difference between occluded and non-occluded images of the same person. Most of the occlusion removal-based MFR models reported in the literature spend considerable time on the detection and removal of occlusion. The proposed model reduces the overhead of mask detection by optimally cropping the face images.
Image restoration
Restoration-based models reconstruct the entire face for recognition (Dolhansky and Ferrer 2018; Wu 2021). A 3D face restoration model was proposed by Bagchi et al. (Din et al. 2020). The authors first detected the occluded face part by thresholding the depth map; Principal Component Analysis (PCA) was then used to reconstruct the face. Cen et al. (Cen and Wang 2019) introduced a robust occlusion FR classification system based on depth dictionary representation, which used a convolutional neural network as a feature extractor and then linearly encoded the extracted depth data using the dictionary. For occlusion FR, Du et al. (Du and Hu 2019) presented Nuclear Norm-based Adapted Occlusion Dictionary Learning (NNAODL), which used a dictionary of occluded pictures to build a unique reconstruction model. Image reconstruction algorithms have improved the occlusion FR process, yet their major drawback is computational complexity, since they try to reconstruct the original face image.
Deep learning
Deep learning has seen a lot of success in FR in recent years. Deep features have outperformed classical features and are widely used in occlusion FR. The Dynamic Feature Matching (DFM) technique, which combines FCN with SRC to recognize partial faces of any size, was proposed in He et al. (2016). This work explored the effectiveness of the deep features after the last pooling layer to represent the database images. An end-to-end BoostGAN model was proposed by Duan et al. (Duan and Zhang 2020), in which non-occluded face images were synthesized from the input occluded image for refined face recognition. An open-source tool called MaskTheFace was introduced by Anwar et al. (Anwar and Raychowdhury 2008), which first generates a masked face image dataset from existing unmasked face images. The generated masked faces were then used for training a deep face recognition model. Walid et al. (Gourier et al. 2004) used a pre-trained CNN model to extract features from the upper half of the face (after mask removal). The quantized bag-of-features representation was further extended for facial recognition. Table 1 summarizes various state-of-the-art Masked Face Recognition models.
Table 1
Summary of various state-of-the-art masked face recognition models
| Paper | Model used | Dataset | Methodology | Other requirements | Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| (Din et al. 2020) | GANs | CelebA | Map and editing modules | VGG-19 | - |
| (Alzu'bi et al. 2666) | Pre-trained CNNs | RMFRD | Comparative study on CNNs | VGGFace, facenet, openface, deepface | 68.17 |
| (Kumar et al. 2021) | MTArcFace | LFW, CFP, Agedb | Combination of arcface loss and mask-usage classification loss | ArcFace | 99.78 |
| (Lucena et al. 2017) | VGG-16 and facenet | Custom dataset | Learning cosine distance | - | 100 |
| (Hariri 2105) | VGG-16, AlexNet, ResNet-50 | RMFRD, SMFRD | Deep features of facial areas | VGG-16, alexnet, resnet-50 | 91.3 |
| (Geng et al. 2254) | Facemasknet-21 | Custom dataset | Deep metric learning | Facemasknet | 88.92 |
| (Wang et al. 2003) | Resnet | MFDD, RMFRD | Attention-driven model | - | 95 |
| (Szegedy et al. 2016) | Attention-based | MFDD, RMFRD | Face-eye-based multi-granularity | - | 99 |
| (Anwar and Raychowdhury 2008) | Masktheface | VGGFace2-mini-SM, LFW-SM | MaskTheFace with FaceNet | Facenet | 97.25 |
| (He et al. 2018) | Wearmask3D | MFR2, MFW-mini | Normalized softmax loss | Resnet-50 | 95.8 |
| (3–37 (2019). 2019) | GANs | MFSR, CASIA-WebFace, VGGFace2 | IAMGAN with DCR | - | 86.5 |
| (Lane 2020) | GANs | Celeb-A, LFW, AR | De-occlusion distillation | - | 95.44 |
| (Li et al. 2020) | CBAM | Webface, AR, Yela B, LFW | Face cropping | CBAM | 92.61 |
| (Deng et al. 2021) | MFCosface | VGGFace2 m, LFW m, CASIAFaceV5 m, MFR2, RMFD | Learning large margin cosine loss | Facenet | 98.5 |
| (Du et al. 2021) | Siamese networks | Oulu-CASIA NIR-VIS, BUAA-VisNir | Heterogeneous semi-Siamese training | ResNet-50 | 98.6 |
Inspired by the high performance and accuracy of deep CNN-based models that are robust to facial expression, facial occlusion, and illumination, in this paper we propose a discard-occlusion-based hybrid deep masked face recognition model.
Proposed system
The proposed model comprises four key modules: (1) Face alignment and pre-processing (2) Mask removal (3) Feature Extraction (4) Face recognition. Fig. 3 demonstrates the architecture of the proposed model.
Fig. 3
Architecture of the proposed model
Face alignment and preprocessing
Face detection (Karthika and Parameswaran 2016; Kumar et al. 2019) and alignment correction are vital steps in face recognition. The eye Haar cascade configurations available in the OpenCV library are used to detect the eye points, using which the frontal face obtained from the camera is oriented such that the face is perpendicular to the normal of the image. The face alignment algorithm is described in Algorithm 1, and the corresponding steps are shown in Fig. 4.
Fig. 4
Face alignment and preprocessing
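The alignment step can be illustrated with a minimal sketch. It assumes the eye detector returns one `(x, y, w, h)` box per eye and that alignment means rotating the image until the line joining the two eye centres is horizontal; the paper's Algorithm 1 is not reproduced in the text, so the helper names `box_centre`, `alignment_angle`, and `rotate_point` are hypothetical and the procedure is an assumption, not the authors' implementation.

```python
import math

def box_centre(box):
    """Centre (x, y) of an (x, y, w, h) bounding box, e.g. one eye detection."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def alignment_angle(eye_a, eye_b):
    """Tilt of the eye line in degrees, measured from the image x-axis."""
    (ax, ay), (bx, by) = sorted([eye_a, eye_b])  # leftmost eye first
    return math.degrees(math.atan2(by - ay, bx - ax))

def rotate_point(point, centre, angle_deg):
    """Rotate `point` about `centre` by -angle_deg, undoing the measured tilt."""
    theta = math.radians(-angle_deg)
    px, py = point[0] - centre[0], point[1] - centre[1]
    return (centre[0] + px * math.cos(theta) - py * math.sin(theta),
            centre[1] + px * math.sin(theta) + py * math.cos(theta))

# Illustrative (made-up) eye boxes for a slightly tilted face.
left = box_centre((80, 100, 40, 40))
right = box_centre((180, 120, 40, 40))
angle = alignment_angle(left, right)
centre = ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
aligned_left = rotate_point(left, centre, angle)
aligned_right = rotate_point(right, centre, angle)
```

After the rotation both eye centres share the same row. In an OpenCV pipeline the same angle and centre would typically be handed to `cv2.getRotationMatrix2D` and `cv2.warpAffine` to rotate the whole image; the sign convention should be checked against the detector's output.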
Cropping masked region
Masks are usually worn below the eyes and cover a part of the nose region. Therefore, to remove the mask, the face image below the eye region is cropped. Although Haar cascades are capable of localizing the eyes correctly, the windows drawn on each eye are not equal and differ in position. Therefore, the lower of the two predicted eye bounding boxes is selected, and the region below that bounding box is cropped to remove the mask from the image. The lower of the two bounding boxes is identified by calculating the distance between the bottom line of the bounding box and the bottom center of the image as in Eq. 1, where P1 and P2 are the end points of the bottom line of a bounding box and P3 is the bottom center point of the image. Fig. 5 shows the mask removal from the input image. One of the major concerns of the cropping-based approach for MFR is where to crop the facial image. Sample facial images at different cropping proportions are shown in Fig. 6. Different cropping proportions are calculated using three key facial points as shown in Fig. 7. The Euclidean distance between the eye key points A(x1, y1) and B(x2, y2) in Fig. 7 is calculated as follows:
$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
Fig. 5
Mask removal from the input image
Fig. 6
Cropped facial images at different cropping proportions
Fig. 7
Calculating cropping proportion (L)
The perpendicular bisector of the line AB is calculated (Point C in Fig. 7). Finally, face images are cropped with C as the reference point.
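The geometry of the cropping step can be sketched as follows. Two things here are assumptions rather than statements from the paper: the distance in Eq. 1 is taken to be the standard perpendicular point-to-line distance, and the cropping proportion L is interpreted as a multiple of the inter-eye distance d(A, B) measured downward from the midpoint C. The function names are hypothetical.

```python
import math

def point_to_line_distance(p1, p2, p3):
    """Perpendicular distance from p3 to the line through p1 and p2."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    numerator = abs((x2 - x1) * (y1 - y3) - (x1 - x3) * (y2 - y1))
    return numerator / math.hypot(x2 - x1, y2 - y1)

def lower_eye_box(boxes, image_width, image_height):
    """Pick the (x, y, w, h) box whose bottom edge is nearest the image's bottom centre."""
    bottom_centre = (image_width / 2.0, float(image_height))

    def distance(box):
        x, y, w, h = box
        return point_to_line_distance((x, y + h), (x + w, y + h), bottom_centre)

    return min(boxes, key=distance)

def crop_row(eye_a, eye_b, proportion):
    """Row at which to cut the image (assumed rule: C.y + proportion * d(A, B))."""
    d_ab = math.hypot(eye_b[0] - eye_a[0], eye_b[1] - eye_a[1])
    c_y = (eye_a[1] + eye_b[1]) / 2.0  # y-coordinate of the midpoint C of AB
    return round(c_y + proportion * d_ab)

# Made-up detections on a 500x500 image.
boxes = [(100, 100, 40, 30), (200, 110, 42, 28)]
chosen = lower_eye_box(boxes, 500, 500)
cut = crop_row((100, 100), (200, 100), 0.85)
```

Everything from row `cut` downward would then be discarded, for example with `image[:cut, :]` on a NumPy image array.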
Feature extraction using improved VGG-16 model
VGG16 is a renowned classification model proposed by K. Simonyan and A. Zisserman (Shaheed et al. 2022), which achieved 92.7% top-5 test accuracy on ImageNet classification with more than 14 million images and 1000 classes. The input convolutional layer accepts an RGB image of size 224×224. The image is then passed through a stack of 3×3 convolutional layers. The initial layers of VGG-16 capture low-level features and the deeper layers capture high-level features. The original VGG-16 architecture is modified by adding a Random Fourier feature layer after the final fully connected layer, as shown in Fig. 8, to extract enhanced features from the partial facial image. VGG-16 was selected as the backbone network based on a comparative analysis of various deep convolutional models.
Fig. 8
Proposed VGG16-Random Fourier hybrid deep learning model for masked face recognition. Feature representation and classification using transfer learned VGG16 model
Random Fourier features layer
In general, the lower half of the face contributes more to facial recognition since the features extracted from the bottom of the eyes to the chin will be unique for a person. An enhanced feature representation is therefore crucial in masked face recognition, where the lower part of the face is completely occluded by the mask. The Random Fourier feature is a well-known, simple, and effective method for scaling up kernel functions. The underlying principle of the method is a consequence of Bochner's theorem (Bochner 1932), which states that any bounded, continuous, and shift-invariant kernel is the Fourier transform of a bounded non-negative measure. The method therefore applies a randomized transformation to the input features and then trains a linear model on top of the transformed features. Depending on the loss function used in the linear model, the incorporation of a random Fourier layer is analogous to kernel SVMs (for hinge loss) and kernel logistic regression (for logistic loss). The first set of random features consists of random Fourier bases cos(ω·x + b), where ω ∈ R^d and b ∈ R are random variables. These mappings project data points onto a randomly chosen line and then pass the resulting scalar through a sinusoidal function, as shown in Fig. 9. Each component of the feature map z(x) projects x onto a random direction ω drawn from the Fourier transform p(ω) of k(∆), and wraps this line onto the unit circle in R^2. After transforming two points x and y in this way, their inner product is an unbiased estimator of k(x, y) (Fig. 10). The mapping z(x) = cos(ω·x + b) additionally rotates this circle by a random amount b and projects the points onto the interval [0, 1] (Fig. 11). In the proposed model, a Random Fourier Feature layer of size 4096×1 with scale 10 is appended to the final fully connected layer of VGG-16, which is then connected to a multi-layer perceptron with a softmax layer for classification (Fig. 12).
The hybrid VGG-16-Random Fourier model is expected to extract enhanced features from the upper half of the face image, which can be further used for facial recognition/authentication.
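The kernel-approximation property behind the Random Fourier layer can be checked numerically with a small sketch. It assumes a Gaussian kernel k(x, y) = exp(-||x - y||² / (2·scale²)), for which ω is drawn from a normal distribution with standard deviation 1/scale and b uniformly from [0, 2π]; the paper does not state which kernel or initializer its layer uses, so this choice, the toy dimensions, and the names `make_rff` and `gaussian_kernel` are assumptions.

```python
import math
import random

def make_rff(input_dim, output_dim, scale, seed=0):
    """Build z(x) such that dot(z(x), z(y)) approximates a Gaussian kernel."""
    rng = random.Random(seed)
    # One random direction omega and one random phase b per output feature.
    omegas = [[rng.gauss(0.0, 1.0 / scale) for _ in range(input_dim)]
              for _ in range(output_dim)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(output_dim)]
    norm = math.sqrt(2.0 / output_dim)

    def z(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(omegas, phases)]

    return z

def gaussian_kernel(x, y, scale):
    """Exact kernel value that the random features are meant to approximate."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * scale ** 2))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [0.5, -1.0, 2.0]
y = [1.0, 0.0, 1.5]
z = make_rff(input_dim=3, output_dim=2000, scale=2.0)
approx = dot(z(x), z(y))
exact = gaussian_kernel(x, y, scale=2.0)
```

With 2000 random features the two values agree to within a few hundredths; the error shrinks roughly as one over the square root of the number of features. In the proposed model the input would be the flattened VGG-16 feature vector and the output dimension 4096.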
Fig. 9
Visualization of data projection on to random Fourier bases
Fig. 10
(a) Average images of normal face image (b) Average images of masked faces (c) Difference between average normal and average masked
Fig. 11
Eigen images of normal and masked face images
Fig. 12
Data distribution of three datasets x-axis shows the number of samples and Y axis shows the number of images per sample
Experimental setup
This section introduces the datasets used in the experiment, the preprocessing and exploratory data analysis performed on the datasets, and concludes with feature representation and classification. Three publicly available face datasets are used for evaluating the proposed MFR model: the Face Dataset by Robotics Lab, the Head Pose Image Dataset, and the Georgia Tech Face Dataset.
Datasets used for the experiment and exploratory data analysis
This section describes the datasets used for evaluating the proposed model and exploratory data analysis carried out on the datasets.
Face dataset by robotics Lab
This dataset (Chu et al. 2007) includes 6660 images of 90 different subjects. Each subject comprises 74 images, with 37 images captured every 5 degrees in the pan rotation from the right profile (defined as +90°) to the left profile (defined as -90°). The additional 37 images are created (synthesized) by flipping the original 37 photos horizontally using commercial image processing software.
Head pose image dataset
The head pose database is a collection of 2790 monocular face images of 15 people with pan and tilt angles ranging from -90 to +90 degrees (Golwalkar and Mehendale 2022). Two series of 93 images (93 different poses) are available for each person. The goal of having two series for each person is to be able to train and test algorithms on both known and unknown faces. In order to focus on face operations, the background is purposefully neutral and uncluttered.
Georgia tech face dataset
The Georgia Tech face dataset (Maharani et al. 2020) is a well-known face detection and recognition dataset reported in the literature. It consists of images of 50 people taken in two or three sessions at Georgia Institute of Technology's Center for Signal and Image Processing between 06/01/99 and 11/15/99. Each person in the dataset is represented by 15 color JPEG images with cluttered backgrounds taken at 640×480 pixel resolution. The average face size in these images is 150×150 pixels. The images depict frontal and/or tilted faces with various facial expressions, lighting conditions, and scale.
These datasets were selected because the Random Fourier layer maps the generated feature vector onto a unit circle; hence, to generate similar features for individual classes, it is better to have images that span from right profile to left profile. Table 2 summarizes the details of the three datasets. Fig. 13 shows the original and synthesized masked images from the three publicly available face datasets. Synthesized masked images are used to train the model; since the proposed hybrid deep learning model uses only the upper half of the face for recognition, the presence or absence of a mask need not be considered.
Table 2
Details of different datasets used in the experiment
| Dataset | Total number of persons | Pose, illumination and facial expression variations | Total number of facial images |
| --- | --- | --- | --- |
| Face dataset by robotics lab | 90 | Every 5 degrees from right profile (defined as +90°) to left profile (defined as -90°) in the pan rotation | 6660 |
| Head pose image dataset | 15 | Variations of pan and tilt angles from -90 to +90 degrees | 2790 |
| Georgia tech face dataset | 50 | Frontal and/or tilted faces with different facial expressions | 750 |
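The image totals quoted for the three datasets follow from the per-subject counts given in the dataset descriptions. The short check below makes that arithmetic explicit; the helper `pan_poses` is a hypothetical name, and the only inputs are numbers already stated in the text.

```python
def pan_poses(start_deg=-90, stop_deg=90, step_deg=5):
    """Number of pan angles from start to stop inclusive, one every step_deg degrees."""
    return len(range(start_deg, stop_deg + 1, step_deg))

# Robotics Lab: 90 subjects, 37 captured poses plus 37 horizontally flipped copies.
robotics_lab_total = 90 * (2 * pan_poses())

# Head Pose: 15 people, two series of 93 poses each.
head_pose_total = 15 * 2 * 93

# Georgia Tech: 50 people, 15 images each.
georgia_tech_total = 50 * 15
```

The three totals evaluate to 6660, 2790, and 750 images, matching the dataset sizes reported above.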
Fig. 13
Sample images used in the experiment. (a, b) Original and synthesized masked face images from Face Dataset by Robotics Lab. (c, d) Original and synthesized masked face images from Head Pose Dataset. (e, f) Original and synthesized masked face images from Georgia Tech Face Dataset
Fig. 10 shows the average images generated for normal face images and masked face images of the three datasets described earlier. As shown in Fig. 10b, the average-mask images have more obstructions around the masked areas. In Fig. 10c, which depicts the difference between average-normal and average-mask, it is clear that the upper half of the face averages is consistent across subjects and datasets, while the masked regions exhibit high variability. The image locations with the highest variability are highlighted in blue in the difference image (masked region), prompting the proposed model to use the upper half of the face image for feature extraction. Fig. 11 shows the Eigen images for four random human samples from the dataset, which are essentially the eigenvectors (components) of the PCA of the facial images. For each class, we visualize the principal components that account for 70% of the variability. For masked samples, it is clear that the eigenfaces capture facial key points in the upper half of the face images. Based on these findings, it is possible to conclude that the upper half of the image encompasses features that can be further investigated for masked face recognition. The class-wise data distribution of the three datasets under consideration is shown in Fig. 12. All three datasets are balanced across samples: the Georgia Tech dataset contains 50 human samples with 15 images per sample, the head pose dataset contains 15 human samples with 186 images per sample, and the robotics lab dataset contains 90 human samples with 74 images per sample.
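Choosing the principal components that account for 70% of the variability amounts to a cumulative sum over the sorted PCA eigenvalues. The sketch below shows that selection rule only; the function name `components_for_variance` and the example eigenvalues are illustrative and are not taken from the paper's data.

```python
def components_for_variance(eigenvalues, target=0.70):
    """Smallest number of leading components whose variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)

# Made-up eigenvalues: the first two components explain 80% of the variance.
example_eigenvalues = [5.0, 3.0, 1.0, 1.0]
kept = components_for_variance(example_eigenvalues)
```

The eigenvectors belonging to the first `kept` eigenvalues are the Eigen images that would be visualized for a class.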
Preprocessing
The input images are resized to 500×500. OpenCV eye Haar cascade configurations are then used to detect the eye locations, which are then used for alignment correction. The masked part of the facial image is then removed based on the eye location.
Feature representation and classification
The proposed model is transfer learned on the VGG-16 architecture using ImageNet weights. Stevo Bozinovski and Ante Fulgosi (Bozinovski 1976; Bozinovski and Fulgosi 1976) introduced transfer learning in 1976, and it has been widely used over the years to improve the accuracy of neural networks (Bansal et al. 2021; Liu et al. 2019). Transfer learning is the process of initializing the weights of a neural network for a new task with the weights of a network previously trained on a large-scale dataset. The idea is to reuse the knowledge learned by the large-scale network to extract high-level features common to both tasks, reducing the need for a massive database to achieve near state-of-the-art predictions. This paper uses a VGG-16 model transfer learned on ImageNet weights for the experiment, as depicted in Fig. 8. ImageNet (Karthika and Parameswaran 2016) is a large-scale image database with over 14 million labeled images; the subset used in the ImageNet Large-Scale Visual Recognition Challenge covers 1000 classes. The challenge was launched in 2010 as a global competition to develop state-of-the-art (SOA) classification models. Using various SOA neural networks, significant progress has been made in classifying ImageNet over the years. ImageNet weights are widely used for computer vision applications because they are highly reusable for different computer vision tasks and are easily accessible on the web. The layers of the VGG-16 model were kept non-trainable during the experiment. The flattened FC layer of size 1×61440 was then mapped into the Random Fourier features layer to compute the enhanced lower-dimensional feature vector of size 1×4096. The feature vector is then passed on to a softmax layer for classification.
Results and analysis
This section describes experiments for validating the proposed masked face recognition model. The experiments carried out in this paper are two-fold: (1) Cropping-based approach for MFR using state-of-the-art deep convolutional models (2) Cropping-based approach integrated with Random Fourier layer for MFR.
Cropping-based approach for MFR
The first part of this section discusses the selection of a deep convolutional neural network as the base model for MFR. Secondly, the effect of cropped face images on MFR is evaluated. To select the best deep CNN model for masked face recognition, the performance of state-of-the-art deep learning architectures is evaluated in terms of recognition accuracy. VGG-16 (Shaheed et al. 2022), Inception V3 (Soyel and Demirel 2010), and ResNet50 (Hariri 2022) are chosen as the base models. Each of these models is transfer learned for masked face recognition. Table 3 summarizes the recognition accuracy obtained for two cases, with mask and without mask, on three benchmark datasets (the necessary findings are indicated by bold entries). For case 1, the entire face image with mask is considered for recognition; for case 2, the upper half of the face after cropping out the masked region is considered. From the table, it is clear that all three deep convolutional models produce better accuracy on the cropped face images without masks across the three datasets under study. Inception V3 gave the best accuracy on the cropped images (case 2) for all three datasets: on the face dataset by the robotics lab, the accuracy is 83.607% for case 1 and 87.113% for case 2, a difference of +3.5%; on the Head pose image dataset, 84.807% for case 1 and 89.32% for case 2, a difference of +4.5%; and on the Georgia tech face dataset, 84.205% for case 1 and 88.541% for case 2, a difference of +4.3%. The cropping-based technique not only reduces the amount of computing resources required for occlusion detection, but also overcomes the problems of occlusion. However, the most crucial factor impacting the method's performance is where to crop. The recognition performance of the three deep convolutional neural networks on MFR at different cropping proportions is tested, and the results are tabulated in Table 4.
The optimal cropping proportion for VGG-16 is 0.85, with a recognition accuracy of 85.002%, an increase of 4.379% compared to the No-Crop case. For Inception V3 and ResNet50 the optimal cropping proportions are 0.8 and 0.95, with recognition accuracies of 87.113% and 86.066%, respectively. Compared to No-Crop, the recognition accuracy increased by 3.506% and 1.816% for Inception V3 and ResNet50. From Table 3 and Table 4 it can be concluded that the cropping-based approach significantly improves the recognition accuracy. To further interpret the model's performance on MFR, class activation maps (CAM) are generated as in Fig. 14. The CAM heat maps are overlaid on the face image for better visualization. The heat map of the original face image concentrates on the eye and the lower face region (Fig. 14b), whereas the heat map of synthesized masked images focuses more on the eye and cheek regions. The cropping-based approach for MFR focuses on the area around the eyes, which changes as the cropping proportion changes (Fig. 14d, e), so determining the best cropping proportion is critical in MFR. The visualization of CAM (Fig. 14d, e) shows that the cropping-based approach for MFR can precisely find the forehead-eye area using prior information about the mask location. More importantly, as shown by the proposed approach's CAM maps, the most discriminative areas are not all areas above the mask, but the regions around the two eyes.
Table 3
Comparison of three deep CNN models on MFR with/without mask on three benchmark datasets
| Model | Face dataset by robotics lab: with mask (Acc %) | Face dataset by robotics lab: without mask (Acc %) | Head pose image dataset: with mask (Acc %) | Head pose image dataset: without mask (Acc %) | Georgia tech face dataset: with mask (Acc %) | Georgia tech face dataset: without mask (Acc %) |
| --- | --- | --- | --- | --- | --- | --- |
| VGG-16 | 80.623 | 85.002 | 82.930 | 86.141 | 81.032 | 86.079 |
| Inception V3 | 83.607 | 87.113 | 84.807 | 89.32 | 84.205 | 88.541 |
| ResNet50 | 84.250 | 86.066 | 85.619 | 88.995 | 84.876 | 87.380 |
Table 4
The MFR performance at different cropping proportions without Random Fourier module on Face dataset by Robotics Lab

| Cropping proportion | VGG-16 (Acc %) | Inception V3 (Acc %) | ResNet50 (Acc %) |
|---|---|---|---|
| 0.4 | 78.338 | 76.910 | 80.408 |
| 0.5 | 78.342 | 77.285 | 80.291 |
| 0.55 | 78.94 | 78.991 | 81.730 |
| 0.6 | 79.480 | 79.480 | 82.776 |
| 0.65 | 81.002 | 82.632 | 82.924 |
| 0.7 | 83.619 | 83.746 | 83.252 |
| 0.75 | 84.281 | 85.955 | 83.925 |
| **0.8** | 84.893 | **87.113** | 84.42 |
| **0.85** | **85.002** | 87.008 | 85.268 |
| 0.9 | 84.622 | 86.491 | 85.702 |
| **0.95** | 82.781 | 85.087 | **86.066** |
| 1 | 81.058 | 84.550 | 85.82 |
| 1.2 | 80.926 | 83.9 | 85.117 |
| No-Crop | 80.623 | 83.607 | 84.250 |

Maximum accuracy obtained and the corresponding cropping proportion for each model is shown in bold
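The optimal proportions and the gains over No-Crop quoted in the text can be recomputed directly from Table 4. The sketch below simply transcribes the table values; the names `PROPORTIONS`, `ACCURACY`, `NO_CROP` and `best_crop` are illustrative and not part of the original work.

```python
# Accuracy (%) from Table 4, Face dataset by Robotics Lab, without Random Fourier module.
PROPORTIONS = [0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.2]
ACCURACY = {
    "VGG-16": [78.338, 78.342, 78.94, 79.480, 81.002, 83.619, 84.281,
               84.893, 85.002, 84.622, 82.781, 81.058, 80.926],
    "Inception V3": [76.910, 77.285, 78.991, 79.480, 82.632, 83.746, 85.955,
                     87.113, 87.008, 86.491, 85.087, 84.550, 83.9],
    "ResNet50": [80.408, 80.291, 81.730, 82.776, 82.924, 83.252, 83.925,
                 84.42, 85.268, 85.702, 86.066, 85.82, 85.117],
}
NO_CROP = {"VGG-16": 80.623, "Inception V3": 83.607, "ResNet50": 84.250}


def best_crop(model):
    """Return (optimal proportion, accuracy there, gain over No-Crop) for a model."""
    accuracies = ACCURACY[model]
    best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    best_accuracy = accuracies[best_index]
    gain = round(best_accuracy - NO_CROP[model], 3)
    return PROPORTIONS[best_index], best_accuracy, gain


if __name__ == "__main__":
    for name in ACCURACY:
        print(name, best_crop(name))
```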
Fig. 14
Visualization of feature maps of two different persons: (a) original face image; (b) feature map obtained from the original face image; (c) feature map from synthesized masked images; (d) feature map from cropped eye regions; (e) feature map from the optimal cropping proportion, which locates the non-occluded face region precisely and generates unique features for MFR
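For reference, if the heat maps follow the original class activation mapping formulation for networks that end in global average pooling (an assumption, since this section does not state which CAM variant is used), the map for class c is a weighted sum of the final convolutional feature maps:

```latex
M_c(x, y) = \sum_{k} w_k^{c} \, f_k(x, y)
```

Here f_k(x, y) is the activation of channel k of the last convolutional layer at spatial location (x, y), and w_k^c is the weight connecting the globally averaged channel k to class c. The map is upsampled to the input size before being overlaid on the face image.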
Integration of Random Fourier layer with cropping-based approach
To evaluate the effectiveness of the Random Fourier layer on feature enhancement for MFR, the cropping-based approach is integrated with a Random Fourier module. The recognition accuracy of the different models at the optimal cropping proportion with pose correction, with and without the Random Fourier layer, is listed in Table 5. The table shows that all three models gave better accuracy when integrated with the Random Fourier layer across all three benchmark datasets. On the Face dataset by Robotics Lab, accuracy rose from 85.002% to 97.460% for VGG-16, from 87.113% to 93.836% for Inception V3, and from 86.066% to 95.992% for ResNet50. On the Head Pose Image Dataset, it rose from 86.141% to 97.634% for VGG-16, from 89.32% to 95.209% for Inception V3, and from 88.995% to 96.47% for ResNet50. On the Georgia Tech Face Dataset, it rose from 86.079% to 97.552% for VGG-16, from 88.541% to 94.936% for Inception V3, and from 87.380% to 96.002% for ResNet50. The performance at different cropping proportions with the Random Fourier module on the Face dataset by Robotics Lab is also tested and tabulated in Table 6. The optimal cropping proportions for VGG-16, Inception V3 and ResNet50 remain the same at 0.85, 0.8 and 0.95, respectively. Compared to No-Crop, the recognition accuracy increased by 11.509% for VGG-16, 8.822% for Inception V3 and 11.20% for ResNet50. Based on these observations, we fixed VGG-16 with the Random Fourier layer as the base model for the proposed masked face recognition model because of its superior performance compared to the other models.
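As a rough illustration of what a Random Fourier feature mapping does, the sketch below implements the standard random Fourier features construction for an RBF kernel, z(x) = sqrt(2/D) · cos(Wx + b). It is not the authors' layer: the choice of RBF kernel, the parameter `gamma`, the toy dimensions and the class name `RandomFourierFeatures` are all assumptions made for the example. In the proposed model the input would be the VGG-16 feature vector and the output dimension would be smaller than the input dimension.

```python
import math
import random


class RandomFourierFeatures:
    """Random Fourier features approximating k(x, y) = exp(-gamma * ||x - y||^2).

    ASSUMPTION: RBF kernel and fixed (non-trained) random projection; the
    paper's layer may differ in kernel, scaling and training procedure.
    """

    def __init__(self, input_dim, output_dim, gamma=1.0, seed=0):
        rng = random.Random(seed)
        scale = math.sqrt(2.0 * gamma)  # std of the Gaussian frequency samples
        self.weights = [[rng.gauss(0.0, scale) for _ in range(input_dim)]
                        for _ in range(output_dim)]
        self.offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(output_dim)]
        self.norm = math.sqrt(2.0 / output_dim)

    def transform(self, x):
        """Map a feature vector x to its random Fourier representation."""
        return [
            self.norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(self.weights, self.offsets)
        ]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

The inner product of two transformed vectors approximates the kernel value of the original vectors, which is why a classifier applied to the transformed features behaves like a kernel method at much lower cost.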
Table 5
Comparison of state-of-the-art deep CNN as base models for masked face recognition (With Pose Correction) + (Without Mask)

| Model | Face dataset by Robotics Lab, without Random Fourier module (Acc %) | Face dataset by Robotics Lab, with Random Fourier module (Acc %) | Head Pose Image Dataset, without Random Fourier module (Acc %) | Head Pose Image Dataset, with Random Fourier module (Acc %) | Georgia Tech Face Dataset, without Random Fourier module (Acc %) | Georgia Tech Face Dataset, with Random Fourier module (Acc %) |
|---|---|---|---|---|---|---|
| VGG-16 | 85.002 | **97.460** | 86.141 | **97.634** | 86.079 | **97.552** |
| Inception V3 | 87.113 | 93.836 | 89.32 | 95.209 | 88.541 | 94.936 |
| ResNet50 | 86.066 | 95.992 | 88.995 | 96.47 | 87.380 | 96.002 |

Maximum accuracy obtained for each dataset is shown in bold
Table 6
The MFR performance at different cropping proportions with Random Fourier on Face dataset by Robotics Lab

| Cropping proportion | VGG-16 (Acc %) | Inception V3 (Acc %) | ResNet50 (Acc %) |
|---|---|---|---|
| 0.4 | 87.405 | 83.991 | 89.80 |
| 0.5 | 88.11 | 84.806 | 89.582 |
| 0.55 | 89.626 | 85.52 | 90.881 |
| 0.6 | 90.04 | 87.947 | 91.403 |
| 0.65 | 91.597 | 88.60 | 91.892 |
| 0.7 | 93.942 | 90.136 | 92.620 |
| 0.75 | 94.813 | 92.052 | 93.73 |
| **0.8** | 95.729 | **93.836** | 94.381 |
| **0.85** | **97.460** | 92.993 | 94.94 |
| 0.9 | 96.103 | 91.591 | 94.694 |
| **0.95** | 92.2 | 90.104 | **95.992** |
| 1 | 89.649 | 88.728 | 89.922 |
| 1.2 | 86.80 | 86.297 | 87.8 |
| No Crop | 85.951 | 85.014 | 84.79 |

Maximum accuracy obtained and the corresponding cropping proportion for each model is shown in bold
A line chart of accuracy at different cropping proportions, without and with the Random Fourier module, for the three models on the Face dataset by Robotics Lab is shown in Fig. 15. The figure shows that for all three models the accuracy first increases as the cropping proportion increases and then decreases. This may be because, as the cropping proportion increases, the masked part of the image also increases, which in turn introduces inadequate features. Tables 7 and 8 compare the effect of pose correction on the basic VGG-16 model and the proposed hybrid model, respectively, for MFR on the three datasets. It is evident from the tables that both architectures show improved accuracy with pose correction (Table 8: 85.951% including the mask and 97.460% excluding the mask on the Face dataset by Robotics Lab; 87.008% and 97.634% on the Head Pose Image Dataset; 86.705% and 97.552% on the Georgia Tech Face Dataset). To further evaluate the performance of the proposed MFR model, it is compared against state-of-the-art face recognition models on occluded face recognition at the optimum cropping proportion. ArcFace (Deng et al. 2019), Facenet [41] and the Python face-recognition 1.3.0 module (Geitgey 2019) are considered for the comparison. The comparison results are tabulated in Table 9 and visualized in Fig. 16.
Table 9 shows that our model outperforms Facenet and ArcFace, the benchmark models for common face recognition, by 11.13% (97.460 - 86.329) and 4.85% (97.460 - 92.610) on the Face dataset by Robotics Lab, by 11.58% (97.634 - 86.056) and 4.344% (97.634 - 93.290) on the Head Pose Image Dataset, and by 11.359% (97.552 - 86.193) and 4.82% (97.552 - 92.73) on the Georgia Tech Face Dataset. It can be concluded from these results that models developed for common face recognition are inadequate for masked face recognition, since most facial landmarks are obscured by masks, whereas the inputs and feature receptive fields of common face recognition models require the entire face.
Fig. 15
Cropping proportion comparison in terms of accuracy on Face dataset by Robotics Lab. (a) Without Random Fourier module (b) With Random Fourier module
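Table 7
Performance comparison of VGG16 without Random Fourier Layer on Masked Face Recognition

| Dataset | Metric | Including masked face region, no pose correction | Including masked face region, with pose correction | Excluding masked face region, no pose correction | Excluding masked face region, with pose correction |
|---|---|---|---|---|---|
| Face dataset by Robotics Lab | Accuracy (%) | 79.407 | 80.623 | 82.519 | **85.002** |
| Face dataset by Robotics Lab | True positive (%) | 70.529 | 71.27 | 74.779 | **78.635** |
| Face dataset by Robotics Lab | False positive (%) | 24.63 | 20.129 | 15.309 | **9.01** |
| Head Pose Image Dataset | Accuracy (%) | 79.894 | 82.930 | 83.612 | **86.141** |
| Head Pose Image Dataset | True positive (%) | 71.4006 | 72.594 | 73.087 | **79.820** |
| Head Pose Image Dataset | False positive (%) | 23.918 | 19.006 | 14.215 | **9.929** |
| Georgia Tech Face Dataset | Accuracy (%) | 79.627 | 81.032 | 82.981 | **86.079** |
| Georgia Tech Face Dataset | True positive (%) | 71.295 | 72.184 | 74.266 | **79.015** |
| Georgia Tech Face Dataset | False positive (%) | 24.390 | 19.860 | 14.904 | **9.427** |

The best results are shown in bold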
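Table 8
Performance comparison of VGG16-Random Fourier hybrid model on masked face recognition

| Dataset | Metric | Including masked face region, no pose correction | Including masked face region, with pose correction | Excluding masked face region, no pose correction | Excluding masked face region, with pose correction |
|---|---|---|---|---|---|
| Face dataset by Robotics Lab | Accuracy (%) | 85.722 | 85.951 | 96.381 | **97.460** |
| Face dataset by Robotics Lab | True positive (%) | 78.860 | 79.667 | 91.163 | **93.001** |
| Face dataset by Robotics Lab | False positive (%) | 14.374 | 15.396 | 5.468 | **2.293** |
| Head Pose Image Dataset | Accuracy (%) | 85.521 | 87.008 | 95.986 | **97.634** |
| Head Pose Image Dataset | True positive (%) | 79.8 | 81.749 | 91.371 | **94.601** |
| Head Pose Image Dataset | False positive (%) | 13.029 | 13.962 | 3.358 | **1.720** |
| Georgia Tech Face Dataset | Accuracy (%) | 84.92 | 86.705 | 95.131 | **97.552** |
| Georgia Tech Face Dataset | True positive (%) | 79.016 | 82.140 | 91.54 | **93.872** |
| Georgia Tech Face Dataset | False positive (%) | 13.812 | 14.288 | 2.830 | **2.065** |

The best results are shown in bold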
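Table 9
Performance comparison of state-of-the-art models with the proposed VGG16-Random Fourier hybrid model on three benchmark datasets

| Dataset | Model | Accuracy (%) | True positive (%) | False positive (%) |
|---|---|---|---|---|
| Face dataset by Robotics Lab | Facenet | 86.329 | 74.192 | 13.6051 |
| Face dataset by Robotics Lab | Python face-recognition 1.3.0 module | 89.827 | 80.251 | 8.306 |
| Face dataset by Robotics Lab | ArcFace | 92.610 | 78.042 | 12.886 |
| Face dataset by Robotics Lab | Proposed | **97.460** | **93.001** | **2.293** |
| Head Pose Image Dataset | Facenet | 86.056 | 76.721 | 12.572 |
| Head Pose Image Dataset | Python face-recognition 1.3.0 module | 90.418 | 79.994 | 8.76 |
| Head Pose Image Dataset | ArcFace | 93.290 | 80.672 | 11.415 |
| Head Pose Image Dataset | Proposed | **97.634** | **94.601** | **1.720** |
| Georgia Tech Face Dataset | Facenet | 86.193 | 75.405 | 12.942 |
| Georgia Tech Face Dataset | Python face-recognition 1.3.0 module | 90.824 | 79.218 | 9.570 |
| Georgia Tech Face Dataset | ArcFace | 92.73 | 79.045 | 11.386 |
| Georgia Tech Face Dataset | Proposed | **97.552** | **93.872** | **2.065** |

The best results are shown in bold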
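The accuracy margins of the proposed model over each baseline follow from the Table 9 values by simple subtraction. The sketch below transcribes the accuracy column and recomputes them; the names `DATASETS`, `ACCURACY` and `margins_over` are illustrative only.

```python
# Accuracy (%) from Table 9, listed in the order given by DATASETS.
DATASETS = (
    "Face dataset by Robotics Lab",
    "Head Pose Image Dataset",
    "Georgia Tech Face Dataset",
)
ACCURACY = {
    "Facenet": (86.329, 86.056, 86.193),
    "Python face-recognition 1.3.0": (89.827, 90.418, 90.824),
    "ArcFace": (92.610, 93.290, 92.73),
    "Proposed": (97.460, 97.634, 97.552),
}


def margins_over(baseline):
    """Accuracy margin (percentage points) of the proposed model per dataset."""
    return [
        round(proposed - other, 3)
        for proposed, other in zip(ACCURACY["Proposed"], ACCURACY[baseline])
    ]


if __name__ == "__main__":
    for name in ("Facenet", "Python face-recognition 1.3.0", "ArcFace"):
        for dataset, margin in zip(DATASETS, margins_over(name)):
            print(f"{name:32s} {dataset:30s} +{margin:.3f}")
```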
Fig. 16
Performance comparison of MFR with different approaches on three benchmark datasets
It is also evident that the proposed model gives accurate and generalized performance across the three datasets in terms of recognition accuracy. Fig. 17 displays the top-7 images retrieved by the VGG16-Random Fourier hybrid model; the model was able to retrieve a similar identity from the dataset for each of the query images. Table 10 compares the rank scores (rank 1, rank 5, rank 10) and mean Average Precision (mAP) obtained for the proposed model across the three datasets. The table shows that the proposed model achieves its best rank-1 score of 81.73% and mAP of 75.92% on the Head Pose Image Dataset. The proposed model also produced comparable results on the Face dataset by Robotics Lab (rank-1 score of 78.67%, mAP of 73.87%) and on the Georgia Tech Face Dataset (rank-1 score of 79.95%, mAP of 74.67%). To further evaluate the behavior of the proposed model, the training-validation graph is plotted in Fig. 18, which includes train accuracy, validation accuracy, train loss and validation loss. We ran the model for 30 epochs, and it converged after the 25th epoch, as shown in Fig. 18. The best model, with the highest validation accuracy (97.55%), was saved and tested.
Fig. 17
Results of the proposed VGG16-Random Fourier MFR model on identification for masked faces
Table 10
Results of VGG16-Random Fourier hybrid model on Masked Face Recognition

| Dataset | Accuracy (%) | mAP (%) | Rank 1 (%) | Rank 5 (%) | Rank 10 (%) |
|---|---|---|---|---|---|
| Face dataset by Robotics Lab | 97.460 | 73.87 | 78.67 | 84.78 | 89.56 |
| Head Pose Image Dataset | 97.634 | 75.92 | 81.73 | 88.89 | 92.57 |
| Georgia Tech Face Dataset | 97.552 | 74.67 | 79.95 | 86.67 | 90.83 |
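The rank-k scores and mAP in Table 10 are retrieval metrics. The sketch below shows one common way to compute them from ranked gallery lists; the paper does not spell out its exact protocol, so the definitions used here (a query counts as a rank-k hit if its identity appears among the first k results, and average precision is normalized by the number of matching gallery images in the ranked list) are assumptions, as are the function names.

```python
def average_precision(query_id, ranked_ids):
    """Average precision of one query over its ranked gallery identities."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / hits if hits else 0.0


def evaluate_retrieval(queries, ks=(1, 5, 10)):
    """Compute rank-k scores and mAP.

    queries: list of (query_id, ranked_gallery_ids) pairs, where the gallery
             identities are sorted from most to least similar to the query.
    """
    total = len(queries)
    scores = {
        f"rank{k}": sum(query_id in ranked[:k] for query_id, ranked in queries) / total
        for k in ks
    }
    scores["mAP"] = sum(average_precision(q, ranked) for q, ranked in queries) / total
    return scores
```

For example, with two queries whose ranked lists are `["a", "b", "a"]` for identity `"a"` and `["a", "b", "c"]` for identity `"b"`, the rank-1 score is 0.5, the rank-5 score is 1.0, and the mAP is (5/6 + 1/2) / 2 = 2/3.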
Fig. 18
Train-Validation curve of the proposed model on Georgia Tech Face dataset (with pose correction)
Conclusion
The COVID-19 pandemic requires people to wear masks; however, the presence of masks raises serious concerns about the accuracy of existing facial recognition systems, since the mask obscures most of the facial features. This paper proposes a cropping-based deep learning architecture to address the issue of masked face recognition. A hybrid VGG16-Random Fourier deep learning model is introduced to extract enhanced features from the upper half of the face, excluding the mask, for recognition. VGG-16 was selected as the backbone network experimentally, after comparing its recognition accuracy with that of other benchmark CNN models in the literature. The proposed method consists of four major modules. The first module is face alignment and preprocessing: the Eye-Haar cascade configurations available in the OpenCV library are used to detect eye points, and the frontal face obtained from the camera is oriented such that the face is perpendicular to the normal of the image with respect to the detected eye points. The second module is mask removal, which involves selecting the lower bounding box of the two predicted eyes and cropping the region below the bounding box to remove the mask from the image. The optimal cropping ratio is investigated to give better recognition accuracy, and the results show that the optimal cropping proportion for VGG-16 is around 0.85L. The third module is enhanced feature extraction, which is accomplished through the proposed hybrid VGG16-Random Fourier deep learning architecture. The fourth module combines the ImageNet transfer learned VGG16 architecture with a Random Fourier layer to provide a better feature representation in a lower-dimensional plane for masked face recognition. The effectiveness of the proposed model with and without mask is evaluated and verified on three benchmark datasets: the Georgia Tech Face Dataset, the Head Pose Image Dataset and the Face Dataset by Robotics Lab.
Experimental results show that the proposed approach can increase the recognition accuracy by an average of 10.99% on MFR. Overall, the paper's findings can be summarized as follows:
- Models developed for common face recognition are imprecise for masked face recognition because the mask obscures more than half of the face.
- Cropping-based approaches can significantly improve masked face recognition accuracy using transfer learned CNN models if enhanced feature extraction from the upper half is used.
- Cropping proportion impacts the recognition accuracy of the cropping-based approach for MFR. For VGG-16, the optimal cropping proportion was around 0.85L. Integration of the optimal cropping and the Random Fourier module achieves the best recognition accuracy for MFR.
- In comparison to the state-of-the-art approaches, the proposed VGG16-Random Fourier model delivers superior masked face recognition performance.
This work can be further extended to different application scenarios, such as matching unmasked faces against masked faces and recognizing faces within group or crowd images that contain multiple faces.