COVID-19 is spreading around the world like wildfire. Chest X-rays are used as one of the primary tools for diagnosing COVID-19. However, about two-thirds of the world population does not have access to sufficient radiological services. In this work, we propose a deep learning-driven automated system, COVIDXception-Net, for diagnosing COVID-19 from chest X-rays. A primary challenge in any data-driven COVID-19 detection is the scarcity of COVID-19 data, which severely degrades a deep learning model's performance. To address this issue, we incorporate a weighted loss function that ensures the COVID-19 cases are given more importance during the training process. We also propose using Bayesian optimization to find the best architecture for detecting COVID-19. Extensive experimentation on four publicly available COVID-19 datasets shows that our proposed model achieves an accuracy of 0.94, precision 0.95, recall 0.94, specificity 0.997, F1-score 0.94, and Matthews correlation coefficient 0.992, outperforming three widely used architectures: VGG16, MobileNetV2, and InceptionV3. It also surpasses the performance of several state-of-the-art COVID-19 detection methods. We also performed two ablation studies, which show that our model's accuracy degrades from 0.994 to 0.950 when random search is used instead of Bayesian optimization, and to 0.983 when a regular loss function is employed instead of the weighted loss.
On December 31, 2019, the Chinese authorities informed the World Health Organization (WHO) about several cases of pneumonia caused by an unknown virus. This virus was later identified as a new strain of the coronavirus, now known as the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), causing the disease COVID-19. This virus belongs to the family of viruses called Coronaviridae [1]. Two viruses of this family, i.e., Severe Acute Respiratory Syndrome Coronavirus 1 (SARS-CoV-1) and Middle East Respiratory Syndrome (MERS) coronavirus, have previously caused epidemics. The mortality rate for these two viruses was high, with SARS at 11% and MERS at over 30% [2]. COVID-19 has a lower mortality rate of 5.7% [3]. However, its transmission rate is much higher. The virus spreads exponentially with new cases being identified worldwide, resulting in the WHO declaring it a global pandemic on March 11, 2020 [4]. Six months after its first appearance, the virus has taken the world by storm, with over 10 million confirmed cases and over 500,000 deaths. It has forced governments worldwide to shut their borders and close various institutions to reduce the spread of the virus [5]. As a result, the world is heading towards an economic recession [6]. Nevertheless, there is a bigger problem: research suggests that about 40–45% of the patients are asymptomatic, i.e., they do not show any symptoms of the disease [7]. These asymptomatic patients, unaware of their infection, silently contribute to the spread of the virus. Thus, widespread testing and isolation are the only solution in this situation, as emphasized by the WHO [8].

At present, Reverse Transcription Polymerase Chain Reaction (RT-PCR) is the most widely used method for testing COVID-19 patients [9]. The process involves detecting viral Ribonucleic Acid (RNA) in a respiratory specimen collected from the patient. The RT-PCR method has several drawbacks. It has low sensitivity for detecting COVID-19 [10].
Moreover, the result varies for the same patient at distinct points in time, even during the patient's diagnosis and treatment [11]. Hence, repeated tests are often required to get an accurate result [12]. The process is also time-consuming, and requires specific material and equipment which are not easily accessible [13, 14].

Radiological images like Chest X-Ray (CXR) and Computed Tomography (CT) can be used as an alternative to RT-PCR, as COVID-19 primarily affects the respiratory system of the human body [15]. In COVID-19, the most common CXR and CT findings are Ground Glass Opacity (GGO) and lung consolidation [16]. CT was found helpful in early diagnosis of COVID-19, even when the patient did not show any respiratory symptoms or fever [17]. In some cases, GGO was seen in CT images of patients who initially tested negative using RT-PCR tests but later tested positive for COVID-19 [18]. One experiment shows that the sensitivity of CT for detecting COVID-19 is 98%, compared to only 71% for RT-PCR [10]. Although CT's sensitivity for detecting COVID-19 is higher than that of RT-PCR, CT is not widely available, and it requires intense decontamination after scanning each COVID-19 patient, which disrupts the service. This is why the American College of Radiology (ACR) suggests using portable chest radiography to minimize the risk of cross-infection [19]. CXR is widely available, and the infection control issue is much smaller than with CT [20].

Even though the radiological imaging techniques to detect COVID-19 give better sensitivity and can be done faster compared to RT-PCR, the system still relies heavily on radiologists for the detection of disease.
However, there is a global shortage of radiologists [21], and therefore, Artificial Intelligence (AI)-based diagnostic systems can reduce the pressure on radiologists and help in faster diagnosis [22].

Deep learning is a sub-field of AI that is inspired by the brain's structure and function. It uses Artificial Neural Networks (ANN) to perform different tasks, including identifying objects from images, speech recognition, sentiment analysis, and many more [23]. It can also be applied to various types of medical images like X-ray, CT Scan, Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET). These images can be segmented, denoised, and used to classify diseases [24-28]. Convolutional Neural Network (CNN), a variant of ANN, has been used before to detect diseases from CXR [29]. It has also been applied to classify lung nodules from CT images [30].

A CNN has parameters and hyperparameters. Parameters are the weights and biases of a neural network that are learned during the training process, whereas hyperparameters control the learning process of the neural network. Examples of hyperparameters are the number of layers or the number of nodes in each layer of a neural network. A neural network has many hyperparameters, each of which can take a wide range of values. Selecting the right values of hyperparameters is very important, as the neural network's performance is greatly affected by this choice. However, choosing hyperparameters manually is an arduous and time-consuming task. There are some automatic methods for tuning hyperparameters, e.g., random search and grid search [31]. However, random search does not guarantee optimal results when the hyperparameter space is large, and grid search is computationally expensive. Bayesian optimization is a more efficient method of tuning hyperparameters [32, 33]. It can give better results in fewer function evaluations compared to grid and random search [34].
Using Bayesian optimization, it is also possible to take humans out of the loop of tuning hyperparameters [35]. In this paper, we used Bayesian optimization to tune the hyperparameters of the neural network. Our proposed network can classify X-ray images into three classes: normal, pneumonia, and COVID-19.

Data imbalance is a common problem for COVID-19 studies [36]. The number of COVID-19 samples is much smaller than the number of samples of other classes. To solve this issue, most of the early research on the diagnosis of COVID-19 from CXR used data augmentation [37-41] and class resampling [42-46]. In our approach, instead of using data augmentation or class resampling, we used a weighted loss function that ensures that the COVID-19 cases are given more importance while updating a neural network's parameters using the gradient of the calculated loss.

The workflow of our proposed framework is presented in Fig. 1.
Fig. 1
Workflow of proposed framework for detection of COVID-19 using CNN
At first, a dataset with three classes (normal, pneumonia, and COVID-19) was chosen. Then, the data were pre-processed by performing resize and rescale operations. After pre-processing, the data were split into train, test, and validation sets. The train and validation sets were used for model development and tuning. Models were built using four powerful CNN architectures: VGG16, MobileNetV2, InceptionV3, and Xception. These architectures were tuned using the Bayesian optimization technique to find the best-performing model. The best-performing models obtained from each architecture were then evaluated on the test data. To evaluate the models, confusion matrix, accuracy, precision, recall, specificity, F1-score, Matthews correlation coefficient, and ROC curve were used. Based on the results obtained using these evaluation metrics, the best model was identified.

The overall contributions of the paper are listed below:

- We developed a novel model for diagnosis of COVID-19 called COVIDXception-Net based on Bayesian optimization of the Xception-Net architecture [47].
- We performed extensive experiments to determine the impact of Bayesian optimization in building our model. In addition, we demonstrated how each hyperparameter influences our model using Partial Dependence Plots.
- We proposed to use a weighted loss function to deal with the skewed distribution of samples due to the scarcity of the COVID-19 data.

The rest of the paper is organized as follows: Sect. “Related Works” discusses the related works that have been done in this field. The work methodology is discussed in Sect. “Our Approach”. Section “Experimental Results” presents the results of the conducted experiments. Section “Discussion” discusses the findings from experimental results. Section “Conclusion” concludes the paper.
Related Works
The authors in [48] sought to analyze the findings obtained during the SARS and MERS outbreaks to help combat the COVID-19 pandemic. Both SARS and MERS were caused by distinct strains of the coronavirus family and produced similar respiratory symptoms. The authors have identified a substantial overlap of the imaging features of COVID-19 with that of SARS and MERS. They also found abnormalities in the early stage chest images of 85% of patients, suggesting that imaging techniques can help early diagnosis of infection.

In [49], the authors used CNN techniques to classify COVID-19 cases from CXR images. They used transfer learning to classify three classes from the CXR images: COVID-19, bacterial pneumonia, and normal. The CXR images were collected from publicly available medical repositories. The authors developed a neural network model that yielded the best accuracy of 96.78%, which suggests the feasibility of using such techniques for the diagnosis of COVID-19 cases.

The authors in [38] also explored the feasibility of using deep learning techniques for the diagnosis of COVID-19 from CXR. For this purpose, the authors created a public dataset of CXR images that combines three existing publicly available databases and images collected from various recent publications on this subject. The dataset comprises 190 cases of COVID-19, 1345 cases of viral pneumonia, and 1341 normal images. To mitigate the problem of large data requirements for CNNs, the authors used transfer learning to develop a classifier using four different pre-trained models: AlexNet, ResNet18, DenseNet201, and SqueezeNet. The authors developed both two-class and three-class classification models. The two-class classification model can classify normal and COVID-19 X-rays, whereas the three-class classification model can also classify viral pneumonia X-rays along with normal and COVID-19 X-rays. Both the models were trained with and without augmentation.
The authors found that CheXNet achieved the best accuracy of 97.74% on the three-class classification problem when the dataset was not augmented, and DenseNet achieved the best accuracy of 97.94% when the dataset was augmented.

In [50], the authors developed an open-source deep CNN named COVID-Net to detect COVID-19 cases from CXR images. COVID-Net classifies the CXR images into one of three classes: no infection, non-COVID-19 infection, and COVID-19 infection. Its architecture comprises a lightweight design pattern consisting of a mix of convolution layers with diverse kernel sizes and grouping configurations. The performance of COVID-Net was evaluated by comparing it with two popular architectures, VGG19 and ResNet50. COVID-Net achieved a test accuracy of 93.3%, which was higher than the other two. The authors also developed an open-source benchmarking dataset termed COVIDx, which was used for training and evaluation purposes. The COVIDx dataset consists of 13,975 CXR images from 13,870 patients, which were compiled from publicly available data repositories.

The authors in [37] did similar work to automate the diagnosis of COVID-19 using deep learning. They used the COVIDx dataset developed in [50]. To deal with the shortage of COVID-19 images in the dataset, they augmented the input images by vertical flipping, random rotation within 15 degrees, and varying lighting conditions. The authors used the ResNet50 CNN architecture pre-trained on the ImageNet dataset. They trained the network in three stages using a progressive resizing technique. In each of the three stages, the images were resized, starting from the smallest size and ending with the largest. It was possible to train the model in fewer epochs using this progressive resizing technique. They achieved an accuracy of 96.23% after training for only 41 epochs.
The authors named their resultant CNN architecture COVID-ResNet.

The authors in [39] used Bayes-SqueezeNet to diagnose COVID-19 from CXR images. They applied different augmentation techniques like shearing, adding noise, and increasing and decreasing brightness on the COVID-19 images to eliminate the negative effect of the imbalanced distribution of the raw dataset. The SqueezeNet architecture was pre-trained on ImageNet. They improved the SqueezeNet architecture by tuning hyperparameters using Bayesian optimization, and achieved an overall 98.3% accuracy, 98.3% correctness, 98.3% completeness, 99.1% specificity, 98.3% F1-score, and 97.4% Matthews correlation coefficient.

Most of the early studies to diagnose COVID-19 from CXR images use CNNs. CNNs need a considerable amount of data to work efficiently, but a large amount of data is not yet available, as the COVID-19 situation is relatively new. CNNs have another disadvantage: they lose spatial information between image instances. To mitigate these problems, the authors in [51] proposed a model based on Capsule Networks named COVID-CAPS. COVID-CAPS is capable of operating on small datasets. A small dataset comprising CXR images of COVID-19, bacterial pneumonia, viral pneumonia, and normal samples was used to train the model. Before training the model, the authors binarized the dataset's labels to COVID positive and COVID negative. Then binary classification was performed using COVID-CAPS to diagnose COVID and non-COVID cases. Using this small dataset, the model achieved an accuracy of 95.7%, sensitivity of 90%, and specificity of 95.8%.

As mentioned earlier, the primary challenge in any COVID-19 detection is the data imbalance issue due to the scarcity of COVID-19 image data. Therefore, the existing methods resort to different approaches, e.g., oversampling, under-sampling, data augmentation, and weighted loss, to address the issue of skewness in the data.
In oversampling, the number of images of one or more minority classes is increased by replication or selective data augmentation. In contrast, under-sampling reduces the number of images of one or more majority classes by random elimination. Data augmentation is also used to increase the number of samples and diversity by modifying the original data. A weighted loss function controls the learning process by assigning different weights to different parts of the loss function based on how often each class appears in the entire dataset.

A summary of the related works is given in Table 1, where we observe that the data imbalance issue was not addressed in [52]. The works in [37–39, 45, 50, 53] performed oversampling using different data augmentation techniques, whereas [45, 51, 53] used a weighted loss function to address the issue of data imbalance. The proposed method deals with the data imbalance issue in a way similar to the works in [45, 51, 53, 54], using a weighted loss function. Even though the loss function used is quite similar, there are some differences in the methodology and experimentation. Zhu et al. [54] developed a model incorporating weighted loss to classify severe and non-severe COVID-19 cases from CT scan images, which is a different task from ours. Punn et al. [45] also used a weighted loss function to detect COVID-19 from CXR. However, their task was slightly different, since they tried to classify CXR images into four classes: normal, COVID-19, pneumonia, and tuberculosis. Afshar et al. [51] developed a model using the Capsule network to distinguish between COVID and non-COVID images. Our task most closely resembles that of Goodwin et al. [53], who developed a model that classifies CXR images into one of the three classes: normal, pneumonia, and COVID-19. They used both a weighted loss function and data augmentation. An ensemble of 14 architectures was used to make predictions.
However, they did not use any algorithm for optimizing hyperparameters, whereas we used Bayesian optimization. Our model outperforms theirs in terms of accuracy and other evaluation metrics.
Table 1
Summary of the literature

| Reference | Method | Data augmentation | Method to resolve data imbalance | Transfer learning | Hyperparameter optimization |
|---|---|---|---|---|---|
| [52] | DenseNet | No | – | Yes | No |
| [50] | Tailored CNN | Yes | Oversampling | Yes | No |
| [51] | Capsule networks | No | Weighted loss | No | No |
| [37] | ResNet50 | Yes | Oversampling | Yes | No |
| [38] | Sgdm-SqueezeNet | Yes | Oversampling | Yes | No |
| [39] | COVIDiagnosis-Net | Yes | Oversampling | Yes | Yes |
| [53] | Model ensemble | Yes | Oversampling & Weighted loss | Yes | No |
| [45] | NASNet-Large | Yes | Oversampling & Weighted loss | Yes | No |
| Our approach | Bayes-XceptionNet | No | Weighted loss | Yes | Yes |

'–' implies 'not addressed'
In a similar manner to ours, Ucar and Korkmaz [39] used Bayesian optimization to tune the hyperparameters of neural networks. However, they tuned a different set of hyperparameters, i.e., initial learning rate, momentum value and L2-regularization, whereas we tuned learning rate, number of dense layers, number of dense nodes, and activation function. The approach in [39] tuned only a single architecture, whereas we tuned four architectures and presented a quantitative comparative analysis of different models after the optimization. We obtained better results in terms of accuracy and other evaluation metrics in comparison to the method in [39].
Our Approach
We use a deep learning approach to classify CXR images into three classes—normal, pneumonia, and COVID-19. Here, we formulate the problem and describe our proposed model for solving the problem. We discuss the three important components of our solution approach namely feature extraction, learning classifier, and hyperparameter tuning using Bayesian optimization.
Problem Formulation
Assume we have a labeled training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$ that consists of $N$ CXR images $x_i$ of $k = 3$ distinct classes, $y_i \in \{1, 2, 3\}$, that represent normal, pneumonia, and COVID-19. Given a test X-ray sample $\tilde{x}$, our goal is to assign a label $\tilde{y}$ by learning an end-to-end deep learning model on the training set, $\mathcal{T}$.
Proposed Framework
In this paper, the Bayesian optimization technique is used to optimize the neural network. The overall process consists of three components: feature extraction, classification, and hyperparameter optimization. Feature extraction and classification are done by the CNN, and hyperparameter optimization is done using the Bayesian optimization technique. At first, the CNN is trained using the training data and the validation error is calculated. The validation error is used to update the Gaussian process model. After the Gaussian process model is updated, the expected improvement function proposes a new set of hyperparameter values. The expected improvement function decides whether to take the next set of hyperparameter values randomly or by exploiting the Gaussian process model. The new set of hyperparameter values is used to update the CNN model. Then, the CNN model is trained using the training data. The process continues for b iterations. After b iterations, the best model is picked based on the validation error and the model is evaluated on the test set. Figure 2 illustrates our proposed approach. Each of the three components is discussed in detail in the next sections.
Fig. 2
Feature extraction, classification, and hyperparameter optimization
Feature Extraction
CNN architectures have two basic parts—the feature extraction and the classification. The purpose of feature extraction layers is to extract important features from the data. In this study, feature extraction layers from four different CNN architectures: VGG16 [55], MobileNetV2 [56], InceptionV3 [57], Xception [47] are used. All these architectures are pre-trained on ImageNet [58].
Classification
The purpose of the classification layers is to learn to classify the data based on the features obtained using the feature extraction layer. The classification layers consist of one or more fully connected layers. Each layer has some number of nodes. Various non-linear activation functions are used in each layer to learn complex function mapping from input to output. In this study, Bayesian optimization technique was used to learn the configuration of the classification layer.
Hyperparameter Tuning Using Bayesian Optimization
At the core of Bayesian optimization, we have the Bayes theorem. According to Bayes theorem:

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$

Here, $H$ is the hypothesis, $E$ is the evidence, $P(H)$ is the prior probability, $P(E \mid H)$ is the likelihood, and $P(H \mid E)$ is the posterior probability.

The actual or true function that the Bayesian optimization technique tries to estimate is called the objective function. Bayesian optimization optimizes this function without knowing its gradient by taking sample points from the hyperparameter space. From the result of evaluating the function at these sample points, it tries to estimate the objective function. This estimation of the objective function is the surrogate function. It acts as the prior of our objective function. In what follows, $\hat{f}$ denotes the surrogate function and $f$ denotes the objective function.

Bayesian optimization incorporates prior belief about the objective function $f$ using the surrogate function $\hat{f}$. It then updates the prior by drawing samples from $f$ to get a posterior that approximates $f$ better. A Gaussian Process (GP) is used as the surrogate function in our experiments. There are two reasons for choosing a GP: it is cheap to evaluate, and it approximates the objective function very well.

While choosing the sample points for evaluating the function, Bayesian optimization makes an 'educated guess'. The acquisition function helps Bayesian optimization to make these educated guesses. It decides which point will be sampled next, based on the mean and variance of the surrogate function. It also handles the trade-off between exploration and exploitation. Exploration is the desire to look at an area or point where the variance of the surrogate function is high, i.e., high uncertainty. Exploitation is the desire to look at an area or point where the mean of the surrogate function is high, i.e., high promise.

There are different types of acquisition functions like Probability of Improvement (PI), Expected Improvement (EI), and Upper Confidence Bound (UCB).
EI is used as the acquisition function in our experiments.

EI can be defined by the equation:

$$EI(x) = \mathbb{E}\left[\max\left(f(x) - f(x^{+}),\, 0\right)\right]$$

Here, $x^{+}$ represents the location of the best sample and $f(x^{+})$ represents the value of the best sample so far.

Using the GP model, we can evaluate the expected improvement:

$$EI(x) = \begin{cases} \left(\mu(x) - f(x^{+}) - \xi\right)\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases} \qquad Z = \frac{\mu(x) - f(x^{+}) - \xi}{\sigma(x)}$$

Here, $\mu(x)$ represents the mean of the GP posterior predictive at $x$ and $\sigma(x)$ represents the standard deviation of the GP posterior predictive at $x$. $\Phi$ is the Cumulative Distribution Function (CDF) of the standard normal distribution and $\phi$ is the Probability Density Function (PDF) of the standard normal distribution. The first term, $\left(\mu(x) - f(x^{+}) - \xi\right)\Phi(Z)$, maintains exploitation and the second term, $\sigma(x)\,\phi(Z)$, maintains exploration. The amount of exploration is controlled by $\xi$. If $\xi$ is larger, the amount of exploration is larger. If $\xi$ is smaller, the amount of exploration is smaller.

To summarize the discussion so far: at first, an approximation of the objective function, called the surrogate function, is built. This function acts as the prior. Then, another function called the acquisition function is used to decide whether to look at an area of high uncertainty or high promise. Based on the decision, the function is evaluated at that point and a posterior is obtained.

In this paper, the objective function is the validation loss and the sample points are values of the hyperparameters that are allowed to be tuned. The hyperparameters that we tune are the number of dense layers, the number of dense nodes, the learning rate, and the activation function.

The algorithm of our proposed framework is shown in Algorithm 1.

The proposed framework takes as input a defined hyperparameter space $\mathcal{X}$, an objective function $f(x)$ which can be evaluated by training a neural network, and the maximum number of iterations $n$. The outputs are $x_{best}$ and $f_{best}$, which are the best hyperparameter configuration and the best objective value, respectively. The first step is to select an initial configuration. Based on the initial configuration, a neural network is built and the objective value is calculated by training the neural network.
After that, the variables $x_{best}$ and $f_{best}$, and the training set $D$ which is used to update the surrogate model, are initialized. The training set $D$ contains the hyperparameter configurations and their objective values obtained until iteration $i$. After initialization, a loop runs for $n$ iterations. A new hyperparameter configuration $x_i$ is generated at each iteration by maximizing the acquisition function (EI in our case). Afterwards, the objective value $f(x_i)$ is calculated for this hyperparameter configuration. This new configuration and objective value are then added to the training set $D$, and the surrogate model is updated. If the calculated objective is less than $f_{best}$, both $x_{best}$ and $f_{best}$ are updated. After running for $n$ iterations, the algorithm returns the best hyperparameter configuration and the best objective value associated with it.
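The closed-form expected improvement used as the acquisition function can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the paper: the function name `expected_improvement`, the default `xi=0.01`, and the maximization convention (higher objective is better) are assumptions. For a quantity that should be minimized, such as the validation loss, the objective can be negated before calling it.

```python
import math


def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI at one candidate point (hypothetical helper).

    mu, sigma : mean and standard deviation of the surrogate's posterior
                prediction at the candidate point
    f_best    : best objective value observed so far (maximization convention)
    xi        : exploration parameter; the default value is an assumption
    """
    if sigma <= 0.0:
        # No predictive uncertainty: no expected improvement at this point.
        return 0.0
    improvement = mu - f_best - xi
    z = improvement / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    # First term rewards a high predicted mean (exploitation),
    # second term rewards high predictive uncertainty (exploration).
    return improvement * cdf + sigma * pdf


# Two candidates with the same predicted mean: the more uncertain one scores higher.
ei_certain = expected_improvement(mu=0.5, sigma=0.1, f_best=0.5, xi=0.0)
ei_uncertain = expected_improvement(mu=0.5, sigma=0.4, f_best=0.5, xi=0.0)
```

In a full search loop, this score would be evaluated over candidate hyperparameter configurations and the highest-scoring candidate would be trained next.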
Dealing with Imbalanced Data: Weighted Loss
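Since we have three classes in the problem under study, we use the categorical cross-entropy loss function given as follows:

$$L(\text{total}) = L_{\text{normal}} + L_{\text{pneumonia}} + L_{\text{COVID}}$$

Here, $L(\text{total})$ is the total categorical cross-entropy loss. $L_{\text{normal}}$, $L_{\text{pneumonia}}$, and $L_{\text{COVID}}$ are the cross-entropy losses for normal, pneumonia, and COVID samples, respectively.

However, since the dataset is imbalanced, we use the following weighted loss function instead, which helps to ensure each class is given the same relative importance [59]:

$$L(\text{total}) = w_{\text{normal}} L_{\text{normal}} + w_{\text{pneumonia}} L_{\text{pneumonia}} + w_{\text{COVID}} L_{\text{COVID}} \tag{7}$$

In Eq. 7, the weight of each class is calculated using the following formula:

$$w_c = \frac{N}{k \, n_c} \tag{8}$$

Here, $N$ is the total number of observations in the training dataset and $k$ is the total number of classes, whereas $w_c$ and $n_c$ denote the weight and total number of observations in class $c$, respectively.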
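To make the effect of the class weights concrete, the sketch below computes them from the training-set counts reported in Table 2. It assumes the weight takes the balanced form N / (k · n_c), which is consistent with the symbol definitions above but is a reconstruction, and the helper names `class_weights` and `weighted_cross_entropy` are hypothetical.

```python
import math


def class_weights(counts):
    """Balanced class weights w_c = N / (k * n_c) (assumed form of the weights)."""
    total = sum(counts.values())   # N: number of training observations
    k = len(counts)                # k: number of classes
    return {label: total / (k * n) for label, n in counts.items()}


def weighted_cross_entropy(probs, true_class, weights):
    """Weighted cross-entropy for one sample, given predicted class probabilities."""
    return -weights[true_class] * math.log(probs[true_class])


# Training-set sizes taken from Table 2.
train_counts = {"normal": 1261, "pneumonia": 1265, "covid": 140}
weights = class_weights(train_counts)

# The same prediction error costs more when the true class is the rare one.
probs = {"normal": 0.5, "pneumonia": 0.3, "covid": 0.2}
loss_if_normal = weighted_cross_entropy({"normal": 0.2, "pneumonia": 0.3, "covid": 0.5}, "normal", weights)
loss_if_covid = weighted_cross_entropy(probs, "covid", weights)
```

With these counts, the COVID-19 weight is roughly nine times the weight of either majority class, so a misclassified COVID-19 image contributes correspondingly more to the gradient.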
Experimental Results
In this section, we present the experimental setup, evaluation process, results, and two ablation studies.
Dataset Description
We used the dataset compiled by Chowdhury et al. [38]. To create the dataset, normal, pneumonia, and COVID-19 X-ray images from four different datasets were combined. Out of 220 COVID-19 X-ray images, 85 were collected from the Italian Society of Medical and Interventional Radiology (SIRM) COVID-19 Database [60] and the rest were collected from different articles by Cohen et al. [61] and Chowdhury et al. [38]. The remaining images of viral pneumonia and normal X-rays were collected from the labeled CXR dataset on Mendeley [62]. The dataset was chosen because it is open source and annotated by expert radiologists. Random samples of normal, viral pneumonia, and COVID-19 X-rays from the dataset are shown in Fig. 3.
Fig. 3
Random samples of CXR from the dataset
The dataset was randomly split into train, test, and validation sets. The distribution of data is shown in Table 2.
Table 2
Class distribution of the dataset

| Class | Total | Train | Validation | Test |
|---|---|---|---|---|
| Normal | 1341 | 1261 | 20 | 60 |
| Viral Pneumonia | 1345 | 1265 | 20 | 60 |
| COVID-19 | 220 | 140 | 20 | 60 |
To train the deep learning model, the images were resized to the input size of each network, with one size used for the VGG16, MobileNetV2, and InceptionV3 architectures and another for the Xception architecture. In terms of percentage, only 7.2% of X-ray images were of COVID-19. That means the dataset is highly imbalanced. As mentioned earlier, we used the weighted loss function in Eq. 7 to address this issue.
Implementation Details
The details of training configurations are presented in Table 3.
Table 3
Training configuration

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss function | Weighted categorical cross-entropy |
| Number of epochs | 250 |
| Early stopping | Yes |
| Patience | 5 |
| Transfer learning | Yes |
| Mini-batch size | 8 |
The experiments were performed on a laptop with Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz, NVIDIA GeForce RTX 2080 GPU, and 32 GB RAM. All the architectures were trained using Adam optimizer [63], and a weighted categorical cross-entropy loss function was used. The equation for the weighted loss function is given in Eq. 7. The weight for each class is calculated using Eq. 8. The number of epochs was set to 250 with patience of 5 for early stopping. The mini-batch size was set to 8. All the networks were pre-trained on ImageNet [58].
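The interaction between the epoch budget of 250 and the patience of 5 can be illustrated with a small sketch. It assumes that early stopping monitors the validation loss and stops after `patience` consecutive epochs without a new minimum; the text does not state the monitored quantity, and the function `early_stopping_epoch` and the loss values below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, max_epochs=250):
    """Return (stop_epoch, best_epoch, best_loss) with 1-indexed epochs.

    val_losses is a precomputed sequence of per-epoch validation losses; in
    real training each value would come from evaluating the model after an epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch, best_loss
    # Ran out of epochs (or data) before the patience was exhausted.
    return min(len(val_losses), max_epochs), best_epoch, best_loss


# Made-up loss curve: the best value occurs at epoch 3, then five epochs
# pass without improvement, so training stops at epoch 8.
example_losses = [1.0, 0.8, 0.6, 0.7, 0.65, 0.61, 0.62, 0.63, 0.5]
stop_epoch, best_epoch, best_loss = early_stopping_epoch(example_losses)
```

With a patience this small, most runs would be expected to end well before the 250-epoch limit, which keeps each call of the hyperparameter search relatively cheap.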
Hyperparameter Optimization of Neural Networks Using Bayesian Search
In our experiments, the Bayesian optimization technique was used to tune four hyperparameters of the neural networks. These hyperparameters were the number of dense layers, the number of dense nodes, the learning rate, and the activation function. Feature extraction layers were taken from the VGG16, MobileNetV2, InceptionV3, and Xception architectures.

Figure 4 presents the convergence plot of the four optimized models.
Fig. 4
Convergence plot for Bayesian search on four architectures
The Y-axis of the plot indicates the minimum value of the objective function f(x), i.e., the validation loss, after n calls or trials. The X-axis represents the number of calls to the objective function. The validation losses after 20 iterations for InceptionV3, MobileNetV2, and Xception were very close. The VGG16 model achieved a slightly higher validation loss.

The details of each iteration of Bayesian optimization on the Xception architecture are presented in Table 4. The table presents two things: the observed value of the objective function and the hyperparameters of the model. In our experiments, the objective function was the validation loss, and the model hyperparameters were learning rate, number of dense layers, number of dense nodes, and activation function. Each row of the table corresponds to the observed value of the objective function in a particular iteration and the model hyperparameters used to get that observed value. It is observed that the best or minimum value of the objective function was 0.0006, which was found in the 20th iteration. The model hyperparameters at that iteration were learning rate 6.92e-05, number of dense layers 4, number of dense nodes 107, and tanh activation.
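The quantity on the Y-axis of a convergence plot is the running minimum of the objective over successive calls. As a small illustration, the sketch below computes it for the Xception search using the objective values listed in Table 4; the variable names are chosen here for illustration only.

```python
from itertools import accumulate

# Validation losses for the 20 Bayesian-search calls on Xception (from Table 4).
objective_values = [
    1.1405, 1.0885, 1.0323, 1.1380, 1.1202,
    1.0938, 1.1024, 0.1052, 1.0944, 1.0821,
    1.1208, 0.0011, 0.0383, 1.0984, 0.0536,
    0.0434, 0.0383, 1.1647, 0.0160, 0.0006,
]

# Lowest loss seen after each call: this is the curve a convergence plot draws.
running_min = list(accumulate(objective_values, min))

# 1-indexed call at which the overall best loss was observed.
best_call = objective_values.index(min(objective_values)) + 1
```

The curve stays near 1.03 for the first seven calls, drops at call 8 and again at call 12, and reaches its final value only at call 20, which matches the best iteration reported in the text.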
Table 4
Iteration results for Bayesian optimization on Xception architecture with model hyperparameters
| Iteration no. | Objective | Learning rate | No. of dense layers | No. of dense nodes | Activation |
|---|---|---|---|---|---|
| 1 | 1.1405 | 1e-25 | 1 | 1 | tanh |
| 2 | 1.0885 | 6.62e-11 | 5 | 129 | relu |
| 3 | 1.0323 | 3.88e-10 | 3 | 45 | tanh |
| 4 | 1.1380 | 6.55e-19 | 3 | 122 | tanh |
| 5 | 1.1202 | 6.60e-16 | 5 | 51 | relu |
| 6 | 1.0938 | 1.61e-16 | 6 | 22 | relu |
| 7 | 1.1024 | 6.92e-14 | 5 | 79 | relu |
| 8 | 0.1052 | 1.04e-07 | 4 | 81 | relu |
| 9 | 1.0944 | 4.44e-23 | 3 | 29 | relu |
| 10 | 1.0821 | 2.59e-20 | 2 | 49 | tanh |
| 11 | 1.1208 | 3.61e-20 | 3 | 135 | tanh |
| 12 | 0.0011 | 3.11e-05 | 6 | 124 | relu |
| 13 | 0.0383 | 2.53e-06 | 3 | 109 | relu |
| 14 | 1.0984 | 1.85e-06 | 6 | 1 | relu |
| 15 | 0.0536 | 6.87e-04 | 6 | 98 | relu |
| 16 | 0.0434 | 2.65e-04 | 5 | 121 | tanh |
| 17 | 0.0383 | 6.34e-06 | 6 | 93 | relu |
| 18 | 1.1647 | 6.51e-02 | 1 | 150 | relu |
| 19 | 0.0160 | 7.68e-06 | 4 | 98 | relu |
| 20 | 0.0006 | 6.92e-05 | 4 | 107 | tanh |
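To make the link between Table 4 and a convergence plot concrete, the short sketch below computes the running minimum of the objective values listed in Table 4. This is the quantity a convergence plot displays on its Y-axis for the Xception search. The function name running_minimum is illustrative and not taken from the paper.

```python
# Objective (validation loss) per iteration, copied from Table 4.
objective = [
    1.1405, 1.0885, 1.0323, 1.1380, 1.1202, 1.0938, 1.1024, 0.1052, 1.0944, 1.0821,
    1.1208, 0.0011, 0.0383, 1.0984, 0.0536, 0.0434, 0.0383, 1.1647, 0.0160, 0.0006,
]

def running_minimum(values):
    """Best (lowest) value seen after each call to the objective."""
    out, best = [], float("inf")
    for v in values:
        best = min(best, v)
        out.append(best)
    return out

convergence = running_minimum(objective)

# 1-based index of the iteration that produced the overall minimum.
best_iteration = objective.index(min(objective)) + 1
```

The curve drops sharply at iterations 8 and 12 and reaches its final value only at iteration 20, which matches the best configuration reported in the text.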
Figure 5 presents three partial dependence plots for Bayesian optimization on the Xception architecture with three hyperparameters: number of dense layers, number of dense nodes, and learning rate. A Partial Dependence Plot (PDP) relates the target response to one or more target features. In this paper, the target response was the validation loss, and the target features were learning rate, number of dense layers, number of dense nodes, and activation function. The plot has two extreme regions:
The yellow regions in the plots are the regions where the function value is minimum, i.e., the validation loss is minimum.
The blue regions in the plots are the regions where the function value is maximum, i.e., the validation loss is maximum.
We are interested in the yellow regions, as these are the regions where the validation loss is minimum. Each black dot represents a point in the search space that was sampled. The red star shows where the best value of the hyperparameter was found.
Fig. 5
Partial dependence plots
From the PDP with the number of dense layers and the learning rate, we observe that the model performed worse, i.e., the validation loss was higher, when the learning rate was low. Performance improved as the learning rate increased and dropped again after reaching a learning rate of . Performance was not much affected by the number of dense layers.
From the PDP with the number of dense nodes and the learning rate, we observe a similar pattern for the learning rate: the performance was better when the learning rate was high. However, this time, the number of dense nodes had a large impact. The validation loss decreased as the number of dense nodes increased. The impact was much smaller when the learning rate was low; when the learning rate was high, changing the number of dense nodes strongly affected the performance of the model. Model performance was better when the learning rate was above . The best values of the learning rate and the number of dense nodes were found at the red star point.
From the PDP with the number of dense layers and the number of dense nodes, we observe that the model performance depended more on the number of dense nodes. When the number of dense nodes was small, the performance was poor regardless of the number of layers. Model performance was better when the number of dense nodes was between 80 and 100. The best values of the number of dense layers and the number of dense nodes were found at the red star point.
Evaluation of the Optimized Models
In this subsection, we introduce the metrics used to evaluate the optimized models and discuss the results of the evaluations.
Evaluation Metrics
Accuracy, precision (correctness), recall (completeness), specificity, F1-score, Matthews Correlation Coefficient (MCC), confusion matrix, and Receiver-Operating Characteristic (ROC) curve were used to evaluate the models.
Accuracy is the ratio of the number of correctly classified samples to the total number of samples. Throughout, the numbers of true-positive, true-negative, false-positive, and false-negative samples are counted separately for each class c.
Precision, or correctness, is the ratio of the number of correct positive results to the number of all predicted positives.
Recall, or completeness, is the ratio of the number of correct positive results to the number of all samples that should be identified as positives.
Specificity is the ratio of the number of correct negative results to the number of all samples that should be identified as negatives.
F1-score is the harmonic mean of the precision and recall of class c. It takes values from 0 to 1: a score of 1 indicates a perfect model, and a score of 0 indicates that the model fails completely.
Precision, recall, specificity, and F1-score are all asymmetric evaluation metrics; that is, their values change when the class labels are swapped. Matthews Correlation Coefficient (MCC) is a symmetric evaluation metric: it does not change when the class labels are swapped. It takes values between −1 and 1, where −1 means the model misclassified all the data, 0 means the predictions are no better than random, and 1 means the model classified all the data perfectly.
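The definitions above can be checked numerically with a short sketch that derives one-vs-rest counts from a confusion matrix and applies the standard formulas for each metric. The 3×3 matrix below is an assumption: it is reconstructed from the results reported for the Xception model later in this paper (60 test images per class, one COVID-19 image predicted as normal). The function names are illustrative.

```python
import math

def per_class_counts(cm, c):
    """One-vs-rest TP, TN, FP, FN for class index c.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[c][c]
    fn = sum(cm[c]) - tp
    fp = sum(row[c] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn

def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0

def class_metrics(cm, c):
    tp, tn, fp, fn = per_class_counts(cm, c)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc = safe_div(tp * tn - fp * fn,
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}

def overall_accuracy(cm):
    """Correctly classified samples divided by all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

# Rows and columns ordered as COVID-19, normal, pneumonia (assumed matrix).
cm = [[59, 1, 0],
      [0, 60, 0],
      [0, 0, 60]]
covid = class_metrics(cm, 0)
normal = class_metrics(cm, 1)
```

Rounded to three decimals, the values computed for the COVID-19 and normal classes agree with the per-class rows of Table 7, and averaging the per-class values over the three classes gives the overall row.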
Evaluation Results
The classification performance of all four optimized models is presented in Table 5. The Xception model outperforms all the other models in terms of accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient on the test set.
Table 5
Classification performance of four optimized models
| Model | ACC | PRE | REC | SPE | F1 | MCC |
|---|---|---|---|---|---|---|
| VGG16 | 0.989 | 0.989 | 0.989 | 0.994 | 0.989 | 0.983 |
| MobileNetV2 | 0.978 | 0.978 | 0.978 | 0.988 | 0.978 | 0.967 |
| InceptionV3 | 0.978 | 0.978 | 0.978 | 0.988 | 0.978 | 0.967 |
| Xception | 0.994 | 0.995 | 0.994 | 0.997 | 0.994 | 0.992 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient
The confusion matrices for the four optimized models are presented in Fig. 6.
Fig. 6
Confusion matrix of four optimized models
From the confusion matrices, it is observed that none of the models predicted all 60 COVID-19 cases correctly. For the other two classes (normal and pneumonia), VGG16 misclassified 1 case, and MobileNetV2 and InceptionV3 misclassified 3 cases, whereas the Xception model did not misclassify any. The Xception model misclassified only 1 X-ray in total: the actual X-ray was of COVID-19, but the model predicted it as normal. The confusion matrices thus show that the Xception model made fewer errors than the other three models.
The training loss, training accuracy, validation loss, and validation accuracy of the four best-performing models found by optimizing VGG16, MobileNetV2, InceptionV3, and Xception using the Bayesian optimization technique are shown in Fig. 7.
Fig. 7
Loss and accuracy plots of best models obtained from each architecture
All the models were set to train for 250 epochs. However, since early stopping was used, training was stopped for any model whose validation loss did not improve for five consecutive epochs. The final validation losses achieved by the MobileNetV2, InceptionV3, and Xception models were very close. The validation loss for VGG16 was slightly higher than that of the other three models. At some point during training, all models reached a validation accuracy of 1.0, except for the VGG16 model, which reached a maximum validation accuracy of 0.95.
A Receiver-Operating Characteristic (ROC) curve shows the performance of the model at different thresholds. The Area Under the Curve (AUC) is used to summarize the findings of an ROC curve. It represents the ability of a model to distinguish among different classes. The ROC curves for the VGG16, MobileNetV2, InceptionV3, and Xception models are depicted in Fig. 8.
Fig. 8
Receiver-Operating Characteristic (ROC) Curve
From Fig. 8, it is observed that the Xception model has the highest AUC value, 1.0, for the normal and pneumonia classes. A large AUC value for a particular class signifies that the model discriminates that class from the other two classes very well. The AUC value of 1.0 for the normal class implies that, on the test set, the Xception model discriminates normal samples from pneumonia and COVID-19 samples perfectly, and the AUC value of 1.0 for the pneumonia class implies that it discriminates pneumonia samples from normal and COVID-19 samples perfectly. The AUC values for the normal and pneumonia classes for VGG16, MobileNetV2, and InceptionV3 are less than 1, which means that their discrimination of each of these classes from the other two is not perfect. For the COVID-19 class, the Xception and InceptionV3 models have the highest AUC value, 0.9999. Thus, although these two models do not discriminate the COVID-19 class from the other two classes perfectly, they discriminate it very well. With AUC values of 0.9986 and 0.9971, respectively, the VGG16 and MobileNetV2 models do not discriminate between the COVID-19 class and the other two classes as well as the Xception and InceptionV3 models do.
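The one-vs-rest AUC values discussed above can be interpreted as the probability that a randomly chosen sample of the class of interest receives a higher score than a randomly chosen sample of any other class. The sketch below computes the AUC directly from that definition. The scores and labels are hypothetical and serve only to illustrate the computation; the function name auc_one_vs_rest is not from the paper.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC of `positive` against all other classes.

    scores: model score for the `positive` class, one per sample
    labels: true class label, one per sample
    Counts the fraction of (positive, negative) pairs that are ranked
    correctly; ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical COVID-19 scores for six test images.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = ["covid", "covid", "normal", "covid", "pneumonia", "normal"]
auc_covid = auc_one_vs_rest(scores, labels, "covid")
```

In this toy example one COVID-19 image scores below one normal image, so 8 of the 9 pairs are ranked correctly. An AUC of 1.0 corresponds to the case in which every sample of the class outranks every sample of the other classes.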
Complexity Comparison
The comparison of the complexity of four optimized models along with two existing approaches is shown in Table 6. To compare the complexity among different models, we took into account training time, inference time, model size, number of parameters, Multiply-Accumulate (MAC) Operations, and Floating-Point Operations Per Second (FLOPS).
Table 6
Comparison of complexity of different models
| Model | Training time | Inference time | Model size | Number of parameters | MACs | FLOPS |
|---|---|---|---|---|---|---|
| VGG16 | 2882.3 s | 26.8 ms | 219.8 MB | 18.31 M | 30.7 G | 15.4 G |
| MobileNetV2 | 5982.9 s | 28.9 ms | 32.1 MB | 2.63 M | 0.613 G | 0.307 G |
| InceptionV3 | 5815.5 s | 33.7 ms | 335.1 MB | 27.86 M | 5.71 G | 2.85 G |
| Xception | 2255.0 s | 26.8 ms | 513.9 MB | 42.8 M | 16.8 G | 8.41 G |
| CheXNet [38] | 5532.1 s | 40 ms | 65.7 MB | 28.1 M | 5.7 G | 2.85 G |
| VGG19 [49] | 4907.3 s | 27 ms | 277.8 MB | 23.14 M | 39.0 G | 19.5 G |
Training Time: MobileNetV2 took the most time to train (5982.9 s). InceptionV3 took a similar amount of time (5815.9 s). Xception took only 2255.0 s, which is the least.
Inference Time: Among the four trained models, the inference time for the VGG16 model and the Xception model is the lowest, at 26.8 ms. MobileNetV2 and InceptionV3 took 28.9 ms and 33.7 ms, respectively.
Model Size: MobileNetV2 is the most lightweight model, with a size of only 32.1 MB. VGG16 comes second with a size of 219.8 MB. InceptionV3 and Xception are 335.1 MB and 513.9 MB, respectively.
Number of Parameters: MobileNetV2 has the fewest parameters (2.63 M). The number of parameters in VGG16, InceptionV3, and Xception is 18.31 M, 27.86 M, and 42.8 M, respectively.
Multiply-Accumulate (MAC) Operations: VGG16 requires the most MAC operations, whereas MobileNetV2 requires the fewest. The number of MAC operations for VGG16, MobileNetV2, InceptionV3, and Xception is 30.7 G, 0.613 G, 5.71 G, and 16.8 G, respectively.
Floating-Point Operations Per Second (FLOPS): VGG16 requires the highest FLOPS, whereas MobileNetV2 requires the least. The number of FLOPS for VGG16, MobileNetV2, InceptionV3, and Xception is 15.4 G, 0.307 G, 2.85 G, and 8.41 G, respectively.
From Table 6, it is observed that, among the four models that we optimized using the Bayesian optimization technique, the Xception model is the most expensive in terms of model size and number of parameters, whereas the VGG16 model is the most expensive in terms of MACs and FLOPS. The MobileNetV2 model is the least expensive in terms of model size, number of parameters, MACs, and FLOPS. Even though MobileNetV2 is the least expensive model, Table 5 shows that it is also, together with InceptionV3, the lowest-performing model, whereas the VGG16 and Xception models are the best-performing ones. Hence, a clear trade-off between the complexity of the models and their performance is observed. It is possible to design lightweight models, but at the expense of performance. In the healthcare sector, there is little room for error. Hence, deploying a larger model that performs better is the natural choice.
The complexity of the four optimized models was also compared with two approaches from the literature. The CheXNet [38] and VGG19 [49] models were trained on our dataset. The VGG19 model [49] is more expensive in terms of MACs and FLOPS than our models, and the inference time of the CheXNet model [38] is the highest (40 ms). From Table 10, it is observed that the Xception model outperforms these two models in terms of accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient.
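The trade-off described above can be read directly from the tabulated numbers. The sketch below copies the values for the four optimized models from Table 6, adds the test accuracy from Table 5, and looks up the cheapest and costliest model for each cost measure. The dictionary keys and helper names are illustrative only.

```python
# Cost figures from Table 6 and test accuracy from Table 5.
models = {
    "VGG16":       {"train_s": 2882.3, "infer_ms": 26.8, "size_mb": 219.8,
                    "params_m": 18.31, "macs_g": 30.7,  "acc": 0.989},
    "MobileNetV2": {"train_s": 5982.9, "infer_ms": 28.9, "size_mb": 32.1,
                    "params_m": 2.63,  "macs_g": 0.613, "acc": 0.978},
    "InceptionV3": {"train_s": 5815.5, "infer_ms": 33.7, "size_mb": 335.1,
                    "params_m": 27.86, "macs_g": 5.71,  "acc": 0.978},
    "Xception":    {"train_s": 2255.0, "infer_ms": 26.8, "size_mb": 513.9,
                    "params_m": 42.8,  "macs_g": 16.8,  "acc": 0.994},
}

def cheapest(metric):
    """Model with the smallest value of the given cost measure."""
    return min(models, key=lambda name: models[name][metric])

def costliest(metric):
    """Model with the largest value of the given cost measure."""
    return max(models, key=lambda name: models[name][metric])

most_accurate = max(models, key=lambda name: models[name]["acc"])
```

Here cheapest("size_mb") and cheapest("macs_g") both return MobileNetV2, while most_accurate is Xception, which is also the model with the largest size and parameter count.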
Table 10
Comparison between the proposed model with previous state-of-the-art methods on the same dataset
| Study | No. of classes | ACC | PRE | REC | SPE | F1 | MCC |
|---|---|---|---|---|---|---|---|
| [38] | 3 | 0.983 | 0.983 | 0.983 | 0.992 | 0.983 | 0.975 |
| [49] | 3 | 0.961 | 0.961 | 0.961 | 0.981 | 0.961 | 0.942 |
| COVIDXception-Net | 3 | 0.994 | 0.995 | 0.994 | 0.997 | 0.994 | 0.992 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient
COVIDXception-Net: Best Model Obtained Using Bayesian Search
The Xception architecture tuned using the Bayesian optimization technique performed the best. We named this architecture COVIDXception-Net. The backbone of the COVIDXception-Net architecture is the Xception network, which is used to extract features from the images. The feature extraction network is followed by fully connected dense layers that perform classification. The architecture of COVIDXception-Net is shown in Fig. 9. The architecture takes as input an X-ray image of size . The input layer is followed by two convolution layers with the ReLU activation function. After the first two convolution layers, three types of convolution blocks (ConvA, ConvB, and ConvC) are linearly stacked one after another. Instead of regular convolution operations, these blocks mostly use depthwise separable convolution operations, which is the salient feature of the Xception architecture. The advantage of depthwise separable convolution over regular convolution is that it requires far fewer parameters. Because it uses fewer parameters, it is also less prone to overfitting and is faster.
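The parameter saving of depthwise separable convolution mentioned above can be illustrated with a simple count. A regular k × k convolution learns one k × k filter for every pair of input and output channels. A depthwise separable convolution learns one k × k filter per input channel and then mixes channels with a 1 × 1 pointwise convolution. The channel sizes below are hypothetical and are not taken from the COVIDXception-Net architecture; bias terms are omitted.

```python
def conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (bias omitted).

    Depthwise step: one k x k filter per input channel.
    Pointwise step: a 1 x 1 convolution mapping c_in to c_out channels.
    """
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
regular = conv_params(3, 128, 256)               # 294912 weights
separable = separable_conv_params(3, 128, 256)   # 1152 + 32768 = 33920 weights
ratio = regular / separable
```

For this hypothetical layer the separable version needs roughly 8.7 times fewer weights than the regular convolution.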
Fig. 9
Proposed COVIDXception-Net architecture
In block ConvA, there are two depthwise separable convolution layers. Batch normalization is used after the first layer. The output of the second depthwise separable convolution layer enters a max-pooling layer. A skip connection with a convolution layer is also used in this block. The outputs of the max-pooling layer and of the convolution layer on the skip connection are added. Three ConvA blocks are stacked one after another. The output from the third ConvA block enters the ConvB block.
In the ConvB block, three depthwise separable convolution layers with batch normalization are used. The first two depthwise separable convolution layers use ReLU non-linearity, but the third one does not use any non-linearity. The output of the third depthwise separable convolution layer is added to the input of the ConvB block using a skip connection. Eight ConvB blocks are linearly stacked one after another. The output of the eighth ConvB block enters the ConvC block.
The input of the ConvC block passes through two depthwise separable convolution layers, followed by a max-pooling layer. The output from the max-pooling layer is added to the convolved input of the ConvC block using a skip connection. The result of the addition then passes through two depthwise separable convolution layers.
The output of ConvC enters the global average pooling layer, where the feature extraction part of the network ends. The output of the global average pooling layer is flattened and enters the classification layers of the COVIDXception-Net.
The classification part of the COVIDXception-Net architecture consists of four dense layers. Each dense layer consists of 107 neurons with the tanh activation function. The final layer of the COVIDXception-Net architecture consists of three neurons with softmax non-linearity, representing the three classes: normal, pneumonia, and COVID-19.
The classification performance for the COVIDXception-Net is presented in Table 7.
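The classification part described above (four dense layers of 107 tanh neurons followed by a three-neuron softmax layer) can be sketched as a plain forward pass. This is only an illustration of the layer structure under stated assumptions: the weights are random rather than trained, the input feature width of 16 is a toy value and not the width of the actual pooled Xception features, and all function names are hypothetical.

```python
import math
import random

def tanh(z):
    return [math.tanh(v) for v in z]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases, activation):
    """Fully connected layer: activation(W x + b); `weights` is a list of rows."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return activation(z)

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def build_head(n_features, rng, n_hidden_layers=4, n_nodes=107, n_classes=3):
    sizes = [n_features] + [n_nodes] * n_hidden_layers + [n_classes]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def head_forward(features, layers):
    x = features
    for w, b in layers[:-1]:      # four tanh dense layers
        x = dense(x, w, b, tanh)
    w, b = layers[-1]             # final three-way softmax layer
    return dense(x, w, b, softmax)

rng = random.Random(0)
layers = build_head(n_features=16, rng=rng)   # 16 is a toy feature width
probs = head_forward([rng.random() for _ in range(16)], layers)
```

The output probs contains three probabilities that sum to 1, one each for the normal, pneumonia, and COVID-19 classes.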
Table 7
Classification performance for the COVIDXception-Net model
| Class | ACC | PRE | REC | SPE | F1 | MCC |
|---|---|---|---|---|---|---|
| COVID | 0.983 | 1.000 | 0.983 | 1 | 0.992 | 0.988 |
| Normal | 1.000 | 0.984 | 1.000 | 0.992 | 0.992 | 0.988 |
| Pneumonia | 1.000 | 1.000 | 1.000 | 1 | 1.000 | 1 |
| Overall | 0.994 | 0.995 | 0.994 | 0.997 | 0.994 | 0.992 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient
Both the normal and pneumonia classes were classified with an accuracy of 1.0. However, one COVID-19 sample was misclassified as normal, which reduced the accuracy for the COVID-19 class to 0.983. Precision and recall for the COVID-19 class were 1.0 and 0.983, respectively. The F1-score for the pneumonia class was 1.0, which means that both precision and recall for pneumonia were 1.0. Specificity for the COVID-19 class is 1 because no non-COVID-19 sample was predicted as COVID-19. The COVIDXception-Net model also achieved a high overall Matthews correlation coefficient. The overall classification accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient for the model were 0.994, 0.995, 0.994, 0.997, 0.994, and 0.992, respectively.
Ablation Study
In this section, we discuss two ablation studies that we performed to evaluate the impact of Bayesian optimization and weighted loss function on our model.
Impact of Bayesian Search
To evaluate the impact of optimization using Bayesian search, we compared it with random search. Figure 10 presents a convergence plot of the Xception architecture tuned using both Bayesian and random search. The final converged validation loss for Bayesian search was 0.0006, which was lower than the final converged validation loss for random search, 0.0641791.
Fig. 10
Convergence plot for Bayesian and random search on Xception architecture
Table 8 presents the classification performance of Xception architecture tuned using both Bayesian and random search. Xception architecture tuned using Bayesian search outperformed the same architecture tuned using random search in terms of accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient.
Table 8
Classification performance for Bayesian and random search
| Method | ACC | PRE | REC | SPE | F1 | MCC |
|---|---|---|---|---|---|---|
| Bayesian search | 0.994 | 0.995 | 0.994 | 0.997 | 0.994 | 0.992 |
| Random search | 0.95 | 0.952 | 0.95 | 0.975 | 0.95 | 0.926 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient
Both Fig. 10 and Table 8 suggest that Bayesian optimization performed considerably better than random search of the hyperparameters.
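For reference, the random-search baseline can be sketched in a few lines. Random search draws every configuration independently and ignores the results of earlier trials, whereas Bayesian search uses the losses observed so far to decide which configuration to try next. The bounds of the search space below are hypothetical (the paper's search ranges are not stated in this section), and toy_objective is only a stand-in for training the network and returning its validation loss.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; all bounds are hypothetical."""
    return {
        "learning_rate": 10 ** rng.uniform(-6, -2),   # log-uniform sampling
        "dense_layers": rng.randint(1, 6),
        "dense_nodes": rng.randint(1, 150),
        "activation": rng.choice(["relu", "tanh"]),
    }

def random_search(objective, n_calls, seed=0):
    """Evaluate n_calls independent random configurations; keep the best."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_calls):
        cfg = sample_config(rng)
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

def toy_objective(cfg):
    """Stand-in for 'train the model and return the validation loss'."""
    lr_term = (math.log10(cfg["learning_rate"]) + 4) ** 2
    node_term = abs(cfg["dense_nodes"] - 100) / 100
    return lr_term + node_term

best_cfg, best_val = random_search(toy_objective, n_calls=20)
```

Because each draw is independent, a promising region found early does not attract later samples, which is one plausible reason for the slower convergence of random search seen in Fig. 10.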
Impact of Weighted Loss Function
The Xception architecture was also trained without the weighted loss function to see how this affected the performance. From Table 9, it is observed that accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient all dropped when the weighted loss function was not used. This indicates that the weighted loss function helps the neural network generalize.
Table 9
Classification performance of Xception model with and without weighted loss
| Method | ACC | PRE | REC | SPE | F1 | MCC |
|---|---|---|---|---|---|---|
| Xception without weighted loss | 0.983 | 0.984 | 0.983 | 0.991 | 0.983 | 0.975 |
| Xception with weighted loss | 0.994 | 0.995 | 0.994 | 0.997 | 0.994 | 0.992 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient
Qualitative Analysis
To understand the decision-making process of our COVIDXception-Net model, we implemented class activation mapping using the Gradient-weighted Class Activation Mapping (Grad-CAM) [64] technique. Class activation mapping provides a way of visually understanding the rationale behind a certain prediction. It also helps to identify whether the model is leveraging the right features to make predictions or is reaching the right decision by relying on erroneous features.
To generate a class activation map, an X-ray sample is forward propagated through the COVIDXception-Net model. Then, a heatmap is generated from the gradient information flowing into the final convolutional layer of the model for that X-ray sample. The heatmap represents the regions responsible for a certain prediction by the COVIDXception-Net model. The generated heatmap is then superimposed over the original X-ray image to identify the important regions for that prediction.
Class activation maps generated for some sample test images from the dataset, along with the predictions, are shown in Fig. 11. The heatmaps in the figure are indicated by the red and yellow areas, with red indicating more significant and yellow indicating slightly less significant regions of the X-ray responsible for the prediction. From Fig. 11, it is observed that the heatmaps generated from the COVIDXception-Net architecture highlight the salient regions of the X-ray, i.e., the region surrounding the lungs, rather than erroneous visual indicators or imaging artifacts.
Fig. 11
Class activation map generated for different X-ray samples with COVIDXception-Net architecture
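The heatmap computation described above can be sketched on toy data. In Grad-CAM, each channel of the final convolutional feature map is weighted by the spatial average of the gradient of the class score with respect to that channel; the weighted channels are summed and passed through a ReLU. The sketch below assumes the feature maps and gradients have already been extracted from the network; the 2 × 2 arrays are hypothetical, and the upsampling and overlay on the X-ray are omitted.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap scaled to [0, 1].

    feature_maps: K channels, each an H x W list of lists (activations of
                  the final convolutional layer)
    gradients:    same shape; gradient of the class score with respect to
                  each activation
    """
    k = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel importance: global average of the gradient over each channel.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum over channels, followed by ReLU.
    cam = [[max(0.0, sum(alphas[c] * feature_maps[c][i][j] for c in range(k)))
            for j in range(w)] for i in range(h)]
    # Normalize so the strongest region has value 1.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Toy example with two channels of size 2 x 2.
feature_maps = [[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)
```

In this toy case the second channel has a negative weight, so the locations where only that channel is active are suppressed by the ReLU and do not appear in the heatmap.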
Comparison with the State-of-the-Art Methods
The problem with comparing works related to COVID-19 is that most studies used different datasets, and their splits into train, validation, and test sets are not publicly available. For this reason, we trained and tested the approaches of other studies on our dataset. The CheXNet model of Chowdhury et al. [38] was trained and tested on our dataset and obtained an accuracy of 98.3% on the test set. The VGG19 model of Ioannis et al. [49] was also trained and tested on our dataset and obtained an accuracy of 96.1%. Our proposed model, COVIDXception-Net, achieved an accuracy of 99.4%. The comparison is presented in Table 10.
A comparison of our model with state-of-the-art methods on similar datasets is presented in Table 11. The nature of the dataset was similar in all cases, but the distribution of data and the evaluation protocols were different. Some studies used cross-validation, whereas others split the entire dataset into train, validation, and test sets. The datasets containing three classes had normal, pneumonia, and COVID-19 X-rays. The datasets containing two classes had normal and COVID-19 X-rays. Li and Zhu [52] developed a DenseNet-based model to classify three classes. They used a very small dataset of 377 X-rays and obtained an overall accuracy of 0.889 on the test set. Wang and Wong [50] used a tailored CNN and a large dataset of 13,962 X-ray images. The tailored CNN model obtained an overall accuracy of 0.923, precision of 0.913, and recall of 0.887. Afshar et al. [51] proposed a capsule network-based binary classification model, which achieved an accuracy of 0.957 and recall of 0.90. Farooq and Hafeez [37] developed a ResNet50-based model to detect four classes: normal, COVID-19, viral pneumonia, and bacterial pneumonia. Using data augmentation and transfer learning, they achieved an overall accuracy of 0.962, precision of 0.969, recall of 0.969, and F1-score of 0.969. Chowdhury et al. [38] tried different CNN architectures and used both transfer learning and data augmentation to improve the performance of the model. Using the DenseNet architecture, they obtained an overall accuracy of 0.979, precision of 0.979, recall of 0.979, and F1-score of 0.979. Ucar and Korkmaz [39] used Bayes-SqueezeNet, with both data augmentation and transfer learning to improve performance and speed up learning. Their Bayes-SqueezeNet model achieved 0.983 overall accuracy, 0.983 precision, 0.983 recall, 0.991 specificity, 0.983 F1-score, and 0.974 Matthews correlation coefficient. Narin et al. [65] trained a ResNet50 model to distinguish between normal and COVID-19 X-rays. They used a small dataset of 100 X-ray images. Their model achieved an accuracy of 0.98, precision of 1.000, recall of 0.96, and F1-score of 0.980. We used a dataset of 2,906 X-ray images without any data augmentation. Our model achieved 0.994 overall accuracy, 0.995 precision, 0.994 recall, 0.997 specificity, 0.994 F1-score, and 0.992 Matthews correlation coefficient. Compared to the previous results, our proposed model COVIDXception-Net achieved the highest overall accuracy, recall, specificity, F1-score, and Matthews correlation coefficient.
Table 11
Comparison between the proposed model with previous state-of-the-art methods on similar datasets
| Study | Method | No. of classes | ACC | PRE | REC | SPE | F1 | MCC |
|---|---|---|---|---|---|---|---|---|
| [52] | DenseNet | 3 | 0.889 | – | – | – | – | – |
| [50] | Tailored CNN | 3 | 0.923 | 0.913 | 0.887 | – | 0.900 | – |
| [51] | Capsule network | 2 | 0.957 | – | 0.900 | 0.958 | – | – |
| [37] | ResNet50 | 4 | 0.962 | 0.969 | 0.969 | – | 0.969 | – |
| [38] | DenseNet | 3 | 0.979 | 0.979 | 0.979 | 0.990 | 0.979 | – |
| [39] | SqueezeNet | 3 | 0.983 | 0.983 | 0.983 | 0.991 | 0.983 | 0.974 |
| [65] | ResNet50 | 2 | 0.980 | 1.000 | 0.960 | – | 0.980 | – |
| COVIDXception-Net | Xception | 3 | 0.994 | 0.995 | 0.994 | 0.997 | 0.994 | 0.992 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient. ’–’ indicates ’not mentioned’
A comparison of our model with state-of-the-art methods for the COVID-19 class is presented in Table 12. Wang and Wong [50] achieved a COVID-19 class accuracy of 0.933, precision of 0.989, and recall of 0.910 using their proposed model COVID-Net. Farooq and Hafeez [37] achieved a COVID-19 class accuracy, precision, recall, and F1-score of 1.000 using their proposed model COVID-ResNet, but their test set contained only 8 COVID-19 images, which is too few to meaningfully represent the COVID-19 cases that may arise in the real world. Ucar and Korkmaz [39] also achieved a COVID-19 class accuracy of 1.000, but their test set was augmented. The raw test set contained only 10 COVID-19 images, which is also too few to meaningfully represent the entire population. We used 60 non-augmented COVID-19 images in our test set, of which only 1 image was misclassified as normal; hence, we obtained a COVID-19 class accuracy of 0.983. Our model also achieved a precision and specificity of 1 and an overall accuracy of 0.994, which is the highest among the compared methods.
Table 12
Comparison between proposed model with previous state-of-the-art methods for COVID-19 class
| Study | Method | ACC | PRE | REC | SPE | F1 | MCC | Overall Accuracy |
|---|---|---|---|---|---|---|---|---|
| [50] | Tailored CNN | 0.933 | 0.989 | 0.910 | – | – | – | 0.933 |
| [37] | ResNet50 | 1.000 | 1.000 | 1.000 | – | 1.000 | – | 0.962 |
| [39] | SqueezeNet | 1.000 | 0.994 | 1.000 | 0.997 | 0.997 | 0.995 | 0.983 |
| COVIDXception-Net | Xception | 0.983 | 1.000 | 0.983 | 1.000 | 0.992 | 0.988 | 0.994 |
’ACC’ refers to accuracy. ’PRE’ refers to precision or correctness. ’REC’ refers to recall or completeness. ’SPE’ refers to specificity or true-negative rate. ’F1’ refers to F1-score. ’MCC’ refers to Matthews Correlation Coefficient. ’–’ indicates ’not mentioned’
Discussion
In this study, we developed an Xception network-based architecture, referred to as COVIDXception-Net, for the diagnosis of COVID-19 from chest X-ray (CXR) images. We applied the Bayesian optimization technique to four pre-trained architectures: VGG16, MobileNetV2, InceptionV3, and Xception. The results were compared using the convergence plot, accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient. We found that the model based on the Xception architecture performed best. To tackle the class imbalance problem, a weighted loss function was used. COVIDXception-Net achieved a satisfying performance on the test set, with an accuracy of 99.4%. The impact of each hyperparameter that we tuned on the COVIDXception-Net was analyzed using multiple partial dependence plots. We compared our model with previous state-of-the-art methods and found that it outperforms them in terms of accuracy, precision, recall, and F1-score on our dataset. When the dataset is similar in nature but not the same, our model still outperforms previous state-of-the-art methods in terms of accuracy, recall, and F1-score.
Our proposed model, COVIDXception-Net, misclassified one COVID-19 CXR as normal. The misclassified case is shown in Fig. 12. The ground truth label of this CXR was COVID-19 in our dataset. We asked five doctors to diagnose this CXR independently and found that the case was ambiguous for them as well. Two doctors said that the patient was infected with COVID-19, but that the infection was at an early stage. One doctor said the patient was not infected with COVID-19 but might have other lung conditions. Another doctor said the patient was in normal condition. The last doctor declined to diagnose, saying he could not do so without knowing the patient history. The rest of the COVID-19 chest X-rays were classified correctly by our COVIDXception-Net model.
Fig. 12
Misclassified COVID-19 case
Two ablation studies were performed to measure the impact of Bayesian optimization and the weighted loss function on our model. In the first ablation study, the impact of Bayesian optimization was measured by comparing it with random search. The two techniques were compared using convergence plots, accuracy, precision, recall, and F1-score. The best model obtained from Bayesian search achieved an accuracy of 99.4%, whereas the best model obtained from random search achieved an accuracy of 95%. Bayesian optimization also significantly improved the final model's precision, recall, and F1-score. In the second ablation study, COVIDXception-Net was trained using a regular loss function instead of a weighted loss function. An accuracy of 98.3% was obtained with the regular loss function, compared to 99.4% with the weighted loss function. From these two ablation studies, we concluded that Bayesian search can be used to efficiently tune the hyperparameters of a neural network, and that the weighted loss improves the performance of our network.
The complexities of the optimized models were calculated in terms of training time, inference time, model size, number of parameters, MACs, and FLOPS. Even though COVIDXception-Net is somewhat expensive in terms of model size and number of parameters, it can perform inference much faster, and it is the best-performing model in terms of accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient. In healthcare, a mistake can have serious consequences. Therefore, it makes more sense to deploy the best-performing model than a lightweight, computationally cheaper model such as MobileNetV2. Gradient-weighted Class Activation Mapping (Grad-CAM) was used to perform a qualitative analysis of the COVIDXception-Net model. By analyzing class activation maps for different X-ray samples, we observed that the model used important regions of the X-ray, i.e., regions surrounding the lungs, to make predictions.
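The idea behind the weighted loss in the second ablation study can be illustrated with a minimal sketch of class-weighted cross-entropy. The weighting scheme here, weights inversely proportional to class frequency, is an assumption chosen for illustration, as this section does not state the exact weights used. The function names and the toy label distribution are likewise hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Assumed heuristic: weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Mean over samples of -weights[y] * log(p[y])."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(max(p[y], eps))
    return total / len(labels)


# Hypothetical toy data: class 0 = normal (majority), class 1 = COVID-19 (minority).
train_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
class_weights = balanced_class_weights(train_labels)

# The same uncertain prediction (p = 0.5) is penalized more on the minority class.
uncertain = [[0.5, 0.5]]
loss_on_covid = weighted_cross_entropy(uncertain, [1], class_weights)
loss_on_normal = weighted_cross_entropy(uncertain, [0], class_weights)
print(class_weights, loss_on_covid, loss_on_normal)
```

With these toy labels the minority class receives a weight four times that of the majority class, so an equally uncertain prediction on a COVID-19 sample contributes four times as much to the loss. This is the mechanism by which a weighted loss gives under-represented cases more importance during training.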
Conclusion
In this study, we introduced COVIDXception-Net, a deep convolutional neural network that can diagnose COVID-19 from chest X-ray images without any human intervention. We improved the Xception architecture using Bayesian optimization to diagnose COVID-19 accurately. To tackle the class imbalance problem, a weighted loss function was used. The complexities of the optimized models were compared in terms of training time, inference time, model size, number of parameters, MACs, and FLOPS. COVIDXception-Net achieved satisfactory performance on the test set and outperformed state-of-the-art methods in overall accuracy, recall, and F1-score. We also performed two ablation studies to measure the impact of Bayesian optimization and the weighted loss function on our model, and found that these techniques significantly improved its performance. A qualitative analysis of COVIDXception-Net using Grad-CAM suggests that the model leverages salient regions of the X-ray to make predictions. A future research direction is to develop a model that can classify chest X-rays by mild, moderate, and severe symptoms of COVID-19, which would be helpful for managing patients in a clinical setting.