Literature DB >> 36157353

A multi-class classification framework for disease screening and disease diagnosis of COVID-19 from chest X-ray images.

Ebenezer Jangam1,2, Chandra Sekhara Rao Annavarapu2, Aaron Antonio Dias Barreto3.   

Abstract

To accurately diagnose multiple lung diseases from chest X-rays, the critical aspect is to identify lung diseases with high sensitivity and specificity. This study proposed a novel multi-class classification framework that minimises either false positives or false negatives, which is useful in computer-aided diagnosis or computer-aided detection, respectively. To minimise false positives or false negatives, we generated the respective stacked ensembles from pre-trained models and fully connected layers using a selection metric and a systematic method. The diversity of the base classifiers was based on the diverse sets of false positives or false negatives they generated. The proposed multi-class framework was evaluated on two chest X-ray datasets, and its performance was compared with that of existing models and the base classifiers. Moreover, we used LIME (Local Interpretable Model-agnostic Explanations) to locate the regions focused on by the multi-class classification framework.
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


Keywords:  COVID-19; Deep learning; Multi-class classification; Stacked ensemble; Transfer learning

Year:  2022        PMID: 36157353      PMCID: PMC9490695          DOI: 10.1007/s11042-022-13710-5

Source DB:  PubMed          Journal:  Multimed Tools Appl        ISSN: 1380-7501            Impact factor:   2.577


Introduction

Chest X-rays can be used to diagnose multiple diseases such as effusion, pneumonia, infiltration, nodule, and cardiomegaly. With the outbreak of COVID-19, the significance of computer-aided diagnosis (CAD) systems that assist radiologists in decision making and disease diagnosis has increased dramatically, and over the past decade CAD has played a key role in diagnosing lung diseases. During the COVID-19 pandemic, the real-time reverse transcription-polymerase chain reaction (RT-PCR) method was used to test whether a patient is COVID-19-positive. However, the drawback of RT-PCR is its high false-negative rate, which leads to the further spread of the virus [3, 16, 31]. The alternatives to RT-PCR for the diagnosis of COVID-19 are CT scans and chest X-ray scans. A chest X-ray scan is an affordable option that takes less time than RT-PCR and CT scans, and diagnosis of symptomatic individuals using CT and chest X-ray scans yielded significantly fewer false negatives than RT-PCR [8, 37, 43, 71]. However, medical image analysis poses two main challenges: a lack of large datasets and feature extraction. The first can be addressed using transfer learning; the second can be addressed using pre-trained deep-learning models and fully connected layers. In the design of a deep neural network, the width and depth are two factors that affect the model's performance: the width refers to the number of neurons in the fully connected layers, whereas the depth refers to the number of trainable layers. In addition to the challenges mentioned above, detecting multiple lung diseases by distinguishing the features of COVID-19 pneumonia from those of other pneumonia is challenging [39, 40] because the characteristics of COVID-19 are similar to those of other kinds of pneumonia [20].
Traditional feature learning approaches may not discern the dynamic characteristics in COVID-19 chest X-ray and CT scans [4]. However, convolutional neural networks (CNNs) have been used in the past to automatically extract features for complex tasks [58]; therefore, deep learning (DL) using CNNs is a promising option for automatic feature extraction for a disease like COVID-19. DL models, however, need a large amount of data for accurate prediction of the disease, and although data augmentation techniques have been used to increase the size of datasets, they did not improve the performance of DL models significantly. Transfer learning (TL) techniques can therefore be employed to improve the performance of DL models when the required amount of data is not available. For instance, the weights of CNN models pre-trained on large-scale datasets, such as ImageNet, can be transferred to the task of COVID-19 diagnosis. Moreover, some studies reported that pre-trained CNN models such as the Visual Geometry Group network (VGG)-19 [61], the residual network (ResNet) [70], and the densely connected convolutional network (DenseNet) [21] gave high accuracy in the detection of COVID-19 from chest CT and chest X-ray images. This study used DL, TL, and stacking concepts to design an ensemble that distinguishes COVID-19 pneumonia from other pneumonia and normal chest X-ray images. The contributions of the study are as follows: (1) we proposed a novel multi-class classification framework that generates a pair of classifiers, a disease screening model and a disease diagnosis model; (2) we proposed a new pair of selection measures to select accurate and diverse base classifiers from the pool of base classifiers for the disease screening and disease diagnosis models, respectively; and (3) we explored the trade-offs between recall and precision to select the optimal threshold that results in high recall and accuracy.
We used explainable Artificial Intelligence (AI) to illustrate that the proposed model focuses on the right areas of the chest X-ray images when making predictions. The study is organized as follows. Preliminaries needed for the study are explained in Section 2. Existing models for multi-class classification related to COVID-19 are summarised in Section 3. The systematic approach, selection metrics, and the proposed multi-class classification framework are explained in Section 4. The data and methodology of the experiment are explained in Section 5. The results obtained from the experiment are analyzed in Section 6. The performance of the proposed stacked ensemble is compared with the existing models in Section 7. Section 8 concludes the study.

Preliminaries

This section explains the fundamentals used in this study.

Deep learning

Deep learning (DL) [38] is a subset of machine learning that is based on artificial neural networks (ANNs). A network can contain multiple hidden layers, depending on the complexity of the problem under consideration. One requirement of DL is that it needs large datasets for training.

Transfer learning

When large datasets are not available for training, transfer learning [50] is a promising solution. Transfer learning helps by avoiding the need to initiate the learning process from scratch. The weights learned from conventional large datasets are transferred to a model used to address a problem of interest. There are two primary advantages of transfer learning: saving time during training and solving a problem for which large datasets are not available.
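The weight-transfer idea above can be sketched in a few lines. This is a minimal, pure-Python illustration, not the paper's implementation: weights learned on a large source dataset (e.g. ImageNet) are copied into a new model and frozen, and only a freshly initialised classification head is left trainable. All layer names and weight values here are illustrative assumptions.

```python
# Sketch of transfer learning: reuse frozen source weights, train only
# a new task-specific head. Names and values are illustrative.

def build_transfer_model(pretrained_weights, num_classes):
    # Copy the convolutional base weights and mark them frozen.
    model = {name: {"weights": list(w), "trainable": False}
             for name, w in pretrained_weights.items()}
    # Append a new, trainable fully connected head for the target task.
    model["fc_head"] = {"weights": [0.0] * num_classes, "trainable": True}
    return model

# Stand-in for weights transferred from a model trained on ImageNet.
imagenet_weights = {"conv1": [0.2, -0.1], "conv2": [0.05, 0.3]}
model = build_transfer_model(imagenet_weights, num_classes=3)
trainable_layers = [n for n, layer in model.items() if layer["trainable"]]
```

Only the head appears in `trainable_layers`, which is what saves training time: gradients need only be computed for the new layers.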

Convolutional neural network (CNN)

A CNN has two components: a convolutional base and a classifier. The convolutional base captures the features of an image, and the classifier assigns a label to each example based on the features learned by the convolutional base. In medical image analysis, CNNs have exhibited better performance than clinical experts in various tasks [56]. In the case of chest radiography, massive datasets are available, so it is feasible to use a CNN to learn the features of a disease when those features are known; however, for diseases with ambiguous characteristics, a CNN alone cannot capture the features [69]. In chest imaging, diverse algorithms have been developed to process the data. Training deep learning models on CT images is a complex task because a 3D signal is involved.

Stacking

Stacking is an ensemble learning technique in which heterogeneous base classifiers are combined using a meta-classifier to provide better performance on a given task. The key phases in the generation of a stacking ensemble are (1) the generation of a pool of base classifiers, (2) the selection of the base classifiers that form the stacked ensemble, and (3) the combination of the outputs of the base classifiers using a meta-classifier [57].
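A toy sketch of the combination phase may help. Here the heterogeneous base classifiers are stand-in threshold rules (in the paper they are pre-trained CNNs with fully connected layers), and the meta-classifier is a weighted average of their outputs; all thresholds and weights are illustrative assumptions.

```python
# Toy stacking sketch: three stand-in base classifiers combined by a
# weighted-average meta-classifier. Thresholds/weights are illustrative.

def base_a(x):  # stand-in for one heterogeneous base classifier
    return 1 if x > 0.4 else 0

def base_b(x):
    return 1 if x > 0.6 else 0

def base_c(x):
    return 1 if x > 0.5 else 0

BASE_POOL = (base_a, base_b, base_c)  # phase 1: pool of base classifiers

def meta_predict(x, weights=(0.5, 0.3, 0.2)):
    # Phase 3: combine base outputs via a weighted average, threshold 0.5.
    score = sum(w * clf(x) for w, clf in zip(weights, BASE_POOL))
    return 1 if score >= 0.5 else 0
```

The meta-classifier can outvote any single base classifier, which is the source of the ensemble's robustness.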

Pre-trained models

A pre-trained model is a model that was previously trained on a large benchmark dataset to solve a similar but different problem. Because training a new model on a large dataset is expensive, it is common to reuse existing pre-trained models. Although many pre-trained models exist [6], the following models were selected based on their popularity and performance in various classification tasks.

Visual geometry group (VGG)-19

VGG-19 [61] is a variation of a CNN, proposed by the Visual Geometry Group at the University of Oxford, that contains 19 layers and nearly 144 million parameters. The authors [61] investigated the impact of the depth of a CNN on its accuracy; after a thorough evaluation, optimal performance was found at a depth of 16–19 layers. VGG-19 has been used successfully for localization and classification tasks.

Residual network (ResNet)-101

ResNet-101 [70] is a variation of CNN with 101 layers that uses the concept of a skip connection to address the problem of vanishing/exploding gradients; this problem is prevalent in plain deep networks. A skip connection feeds the input from one layer to the following layer without any modification. ResNet-101 has been successfully applied to image classification, localization, and detection tasks.

Densely connected convolutional networks (DenseNet)-169

DenseNet-169 [21] is a variation of a CNN that addressed the vanishing/exploding gradient problem by ensuring maximum information and gradient flow. Each layer in every dense block is connected to every other layer in a feed-forward fashion. In addition to mitigating the vanishing gradient problem, DenseNet-169 has other advantages, such as the promotion of feature reuse, a reduction in the number of parameters, and the promotion of feature propagation.

Wide residual network (Wide ResNet)-50-2

Wide ResNet-50-2 [73] is a modified residual network that works by decreasing the depth and increasing the width. Wide ResNet-50-2 has a depth of 50 and a width of two with approximately 69 million parameters.

Literature review

In this section, we focus on existing multi-class classification techniques that distinguish COVID-19 pneumonia from other types of pneumonia and from pneumonia-free chest X-ray images. Existing models proposed for the multi-class classification of COVID-19 pneumonia are listed in Table 1.
Table 1

Multi-class classification models

Study | Data | Method | Contribution
Ibrahim et al. [24] | Chest X-rays combined from different sources | AlexNet | Two-way, three-way and four-way classification
Nishio et al. [48] | 1,248 images taken from two public chest X-ray datasets | VGG-16 | CADx system for evaluation of COVID-19 pneumonia, non-COVID-19 pneumonia and healthy images
Asif et al. [7] | Mixed dataset of CXR and CT scan images | Deep CNN based on InceptionV3 | Three-class classification of COVID-19 pneumonia, non-COVID-19 pneumonia and healthy images
Khan et al. [35] | Mixed dataset of CXR images | Modification of the Xception architecture and transfer learning | Three-class classification of COVID-19 CXR images
Chowdhury et al. [13] | Mixed dataset of CXR images | Image augmentation, transfer learning and multiple pre-trained models | Comparison of the performance of different pre-trained models
Shelke et al. [60] | Indian dataset of CXR images | Pneumonia detection using DenseNet-161, COVID-19 detection using ResNet-18 and VGG-16 | Classification of COVID-19 pneumonia from normal, pneumonia and tuberculosis
Bassi and Attux [9] | CXR images | Pre-trained models fine-tuned on ImageNet and the NIH ChestX-ray14 dataset | Classification of COVID-19 pneumonia, other pneumonia and normal images
Karakanis et al. [32] | CXR images and synthetic images | GAN for synthetic images; lightweight ResNet8 for COVID-19 detection; Grad-CAM for heatmap generation | Three-class classification of COVID-19 pneumonia from normal and pneumonia
Ibrahim et al. [23] | 33,676 CXR and CT images from RSNA and SIRM | ResNet152V2 and VGG19 | For classification of COVID-19 pneumonia from lung cancer and pneumonia, VGG19 provided better accuracy
Ibrahim et al. [23] | CXR images from two different datasets | Evaluation using ResNet, MobileNet, DenseNet and InceptionV3 | Comparison of the accuracy of five pre-trained models; DenseNet121 provided better accuracy
Karar et al. [33] | CXR images | VGG16, ResNet50V2 and DenseNet169 | VGG16, ResNet50V2 and DenseNet169 provided better performance for COVID-19, viral pneumonia and bacterial pneumonia, respectively
Zebin et al. [74] | CXR images | VGG16, ResNet50 and EfficientNetB0, with gradient class activation mapping for progress monitoring | In addition to classification, disease monitoring of COVID-19 was performed
Gupta et al. [17] | COVID-19 Radiography dataset and Chest X-ray dataset | Integrated stacking of pre-trained models | Classification using integrated stacking of pre-trained models
Ismael and Şengür [27] | Chest X-ray images collected from multiple datasets | ResNet and VGG for feature extraction and SVM for classification | Hybrid model for classification
Rahimzadeh and Attar [53] | 11,302 CXR images collected from two public datasets | Concatenation of Xception and ResNet50V2 | Three-class classification
Mahmud et al. [44] | Balanced dataset of CXR images collected from two datasets | DNN based on depthwise dilated convolution; features extracted from different resolutions of X-rays are jointly converged by a stacking algorithm | Classification using CovXNet with feature extraction, stacking, and gradient-based activation mapping
Hussain et al. [22] | Assembled CXR images and CT scans | Modification of existing architecture | Classification using CoroDet
Abbas et al. [1] | CXR images from JSRT and other public datasets | Decompose using AlexNet; transfer and compose using AlexNet, VGG19, ResNet, GoogleNet and SqueezeNet | Proposed the Decompose, Transfer and Compose (DeTraC) method
Most of the existing studies used TL because the available datasets were small. Some studies used a single pre-trained computer vision model for multi-class classification. The AlexNet model was used for the classification of COVID-19 pneumonia, viral pneumonia, bacterial pneumonia, and normal CXR scans [24]. The VGG-16 model was used for the classification of three categories of CXR scans: COVID-19 pneumonia, non-COVID-19 pneumonia, and healthy specimens [48]. The DCNN-based Inception V3 model was used for the detection of coronavirus-pneumonia-infected patients using chest X-ray radiographs [7]. One study presented a deep CNN model based on the Xception architecture for the classification of normal, pneumonia-bacterial, pneumonia-viral, and COVID-19 chest X-ray images [35]. The DarkNet model was used in another study [49] as the classifier for the 'you only look once' (YOLO) real-time object detection system. Other studies explored multiple pre-trained models for multi-class classification and reported the best-performing ones. According to [13], DenseNet-201 outperforms other deep CNN networks. Another study [60] classified chest X-rays into four categories, viz. normal, pneumonia, tuberculosis (TB), and COVID-19; further, the X-rays indicating COVID-19 were classified by severity into mild, medium, and severe categories using ResNet-18. The deep learning model VGG-16 gave better results for classifying pneumonia, TB, and normal images, whereas DenseNet-161 gave better accuracy for segregating normal, pneumonia, and COVID-19 images. Models pre-trained on ImageNet and the NIH ChestX-ray14 dataset were used for fine-tuning the classification model in the study [9]. Two deep learning models following a lightweight architecture were proposed in another study [32] and proved to be more robust and reliable in COVID-19 detection than a baseline ResNet8.
A study [23] evaluated various pre-trained models for diagnosing COVID-19, pneumonia, and lung cancer from a combination of chest X-ray and CT images, and found that the combination of VGG-19 and a CNN model outperformed three other combinations. A fine-tuned DenseNet-121 achieved high accuracy for four-class classification [30]. The study [33] considered 11 pre-trained convolutional neural network models; the results illustrated that the VGG-16, ResNet50V2, and DenseNet-169 models gave better accuracy for the COVID-19, viral (non-COVID-19) pneumonia, and bacterial pneumonia classes, respectively. A study [74] used multiple pre-trained convolutional backbones, such as VGG-16, ResNet-50, and EfficientNetB0; the highest overall detection accuracy was achieved by EfficientNetB0, and a gradient class activation mapping technique was used to highlight the regions of the input image that were important for predictions. Multiple pre-trained models have also been combined to achieve better performance. For instance, InstaCovNet-19 was a combination of ResNet-101, Inception v3, Xception, MobileNetV2, and NASNet [17]. Another study [27] used pre-trained deep CNN models (ResNet-18, ResNet-50, ResNet-101, VGG-16, and VGG-19) for feature extraction and a support vector machine (SVM) as the classifier. A study proposed the concatenation of the Xception and ResNet50V2 networks for classifying X-ray images into normal, pneumonia, and COVID-19 [53]. Some studies have proposed variations of existing architectures, such as CovXNet [44], CoroDet [22], and decompose, transfer and compose (DeTraC) [1], for the classification of COVID-19 chest X-ray images. Some models propose a methodology comprising multiple steps; for example, a three-step model is presented in [10]. The first step is to detect the presence of pneumonia in the given chest X-ray image, and the second step is to distinguish between COVID-19 and other pneumonia.
The last step is aimed at localizing the areas in the X-ray that are symptomatic of the presence of COVID-19 [10]. In [45], lung ultrasound images were used for the detection of COVID-19; the authors proposed a CNN with fewer learning parameters and employed a multi-layer fusion approach to increase the efficiency of the model. In [26], the authors concluded that, for the diagnosis of COVID-19, chest CT is sensitive and moderately specific, chest X-ray is moderately sensitive and moderately specific, while ultrasound is sensitive but not specific; there is therefore a higher probability of a false positive with ultrasound than with chest X-ray. For this reason, we chose datasets consisting of chest X-ray images. Most deep learning models developed to detect COVID-19 are trained on either chest CT scans or chest X-ray images, but in [46] a CNN model was proposed that can be used for both. In [66], the authors used a CNN to extract individual image-level representations and a graph convolutional network to learn relation-aware features; deep feature fusion was then used to fuse the two. In [68], the authors first used pre-trained models to learn features and proposed a novel transfer feature learning algorithm that treats the number of layers to be removed as a hyper-parameter. They then proposed a selection algorithm to determine the two best models, each characterized by a pre-trained model and the number of layers removed, and finally used deep CCT fusion with discriminant correlation analysis to fuse the features from the two models. Some studies [64] have focused on lightweight deep learning architectures for the binary classification of pneumonia and normal images.
The development of lightweight architectures reduces the response time and computational resources required for disease detection. Feature extraction has also been used to reduce the computational cost of detecting COVID-19: the authors of [55] used a CNN for feature extraction and the Marine Predators algorithm to select relevant features. A hybrid approach using parallel fusion and optimization of deep learning models for COVID-19 detection was proposed in [34]. The authors of [5] proposed a mask extraction method based on multi-agent deep reinforcement learning (DRL) for COVID-19 CT images. Because using AI for the efficient detection of COVID-19 from chest X-ray images or CT scans requires patient records from hospitals or test records from testing facilities, privacy becomes a major issue; federated machine learning is therefore a promising area to explore in this context. The authors of [2] studied the efficacy of federated learning versus traditional learning using a descriptive dataset and chest X-ray images, and concluded that federated machine learning gives better prediction accuracy but a higher execution time than traditional machine learning. In [25, 47], a lightweight, shallow CNN-tailored architecture was proposed to detect COVID-19 from chest X-rays; the model was designed with fewer parameters than other deep learning models and produced no false negatives. Researchers [36] have used structured model pruning to improve model memory usage and speed on TPUs without losing accuracy, especially for small datasets. Several studies used a single dataset for the evaluation of COVID-19 detection models [12, 14, 19, 41, 42, 51, 52, 59, 62, 67, 72]; however, it is advisable to evaluate on multiple datasets before drawing conclusions about a model's performance.
Models [7, 13, 18, 23, 24, 30, 33, 35, 48, 60, 74] were trained for multi-class classification; however, the single best-performing pre-trained model was considered the best model. A few models combined features from different pre-trained models [17, 27, 53], but there is no guarantee that the highlighted combination was the best one. Models [1, 22, 44] were modifications of existing architectures, and further modifications could improve their performance. Models [12, 14, 19, 41, 42, 51, 52, 59, 62, 67, 72] that detected COVID-19 based on binary classification did not consider other pneumonia cases and were evaluated using a single dataset. Moreover, none of the existing models focused on the minimisation of false positives, which is crucial in the case of COVID-19 screening. The reduction of false positives is crucial when diagnosing individuals, as it prevents them from undergoing unnecessary treatment and allows financial and medical resources to be utilized more effectively. Although a few papers address the minimization of false positives [29, 42], the topic, and specifically the consequences of false positives, received insufficient coverage. Stacked ensembles have been designed for medical image analysis to provide high accuracy [29] and high recall along with that accuracy [28].
We propose a multi-class framework for disease screening and diagnosis using a systematic approach. Although many deep learning models exist, the focus of this paper is the selection of efficient models for a given task: we propose a selection metric and a systematic method to select efficient base classifiers to form the stacked ensemble. Existing works used pre-trained models for COVID-19 detection; some compared selected pre-trained models and reported the better-performing ones, while others modified existing pre-trained architectures. Most existing works do not focus on the minimisation of false positives or false negatives, which is a crucial aspect of computer-aided diagnosis. We therefore designed stacked ensembles that minimise false positives or false negatives.

Proposed multi-class classification framework

This section presents an overview of the method used to generate stacked ensembles in a multi-class classification framework with high accuracy, followed by the selection measures used to select base classifiers. The details of the proposed multi-class classification framework are presented in the last subsection. An overview of the proposed multi-class classification framework is presented in Fig. 1.
Fig. 1

Overview of generation of multi-class classification framework


Method for the generation of multi-class classification framework

The generation of a stacked ensemble comprises three phases: generation, selection, and aggregation. First, a set of pre-trained models that perform well in the detection of COVID-19 is selected, and each pre-trained model is used to generate multiple base classifiers by appending fully connected layers. Second, efficient base classifiers that generate distinct sets of false positives or false negatives, depending on the selection mode, are selected; the base classifiers are arranged in descending order according to their performance and their ability to provide distinct false positives. Finally, the meta-classifier collects the outputs of the base classifiers and combines them using a weighted average to predict the class of a given sample.
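The three phases can be sketched as follows. The pool entries, selection-metric scores, and weights below are illustrative stand-ins (the hypothetical names suggest a pre-trained backbone plus a number of fully connected layers), not the values used in the paper.

```python
# Sketch of generation -> selection -> aggregation with stand-in values.

pool = [  # phase 1: generated pool of (name, selection-metric score, clf)
    ("vgg19_fc3",     0.91, lambda x: 1 if x > 0.50 else 0),
    ("resnet101_fc3", 0.88, lambda x: 1 if x > 0.45 else 0),
    ("wrn50_fc3",     0.86, lambda x: 1 if x > 0.55 else 0),
    ("densenet_fc1",  0.79, lambda x: 1 if x > 0.70 else 0),
]

# Phase 2: arrange in descending order of the selection metric, keep top 3.
selected = sorted(pool, key=lambda m: m[1], reverse=True)[:3]

# Phase 3: the meta-classifier combines outputs by a weighted average.
def ensemble_predict(x, weights=(0.4, 0.35, 0.25)):
    score = sum(w * clf(x) for w, (_, _, clf) in zip(weights, selected))
    return 1 if score >= 0.5 else 0
```

In the paper, phase 2 additionally requires the chosen classifiers to produce distinct false positives or false negatives; that diversity term is folded into the selection-metric score sorted on here.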

Selection metrics

A pair of selection metrics was designed for the multi-class classification of COVID-19, in which accuracy, recall, and precision play a key role. Suppose there are N examples in the validation set, of which N1 are positive and N0 are negative, so that N = N0 + N1. The selection metric is designed such that diverse and efficient base classifiers are chosen to form the disease screening model and the disease diagnosis model. The diversity of a pair of base classifiers ci and cj is measured by the distinct sets of false positives (or false negatives) that they generate; when the false positives of a model are minimised, the precision of that model increases. The following notation was adopted for designing the selection metric: N is the total number of samples in the dataset; N1 is the number of positive samples; N0 is the number of negative samples; ci is the i-th base classifier; N0(a,b) is the number of negative-class samples with predictions a and b from classifiers ci and cj, respectively; N1(a,b) is the number of positive-class samples with predictions a and b from classifiers ci and cj, respectively; N1,q(a,b) and N0,q(a,b) are the corresponding counts restricted to disease class q; and ai, ri, and pi are the accuracy, recall, and precision of base classifier ci.
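One way to make the "distinct false positives" idea concrete is to count the validation samples on which exactly one of the two classifiers produces a false positive. This is a hedged illustration of the diversity notion, not the paper's exact per-class formula; the prediction vectors below are made up.

```python
# Diversity of a classifier pair measured over their false-positive sets.

def false_positive_set(preds, labels):
    # Indices predicted positive whose true label is negative.
    return {i for i, (p, y) in enumerate(zip(preds, labels))
            if p == 1 and y == 0}

def fp_diversity(preds_i, preds_j, labels):
    fp_i = false_positive_set(preds_i, labels)
    fp_j = false_positive_set(preds_j, labels)
    return len(fp_i ^ fp_j)  # symmetric difference: FPs unique to one model

labels  = [0, 0, 0, 1, 1]
preds_i = [1, 0, 0, 1, 1]   # one false positive (sample 0)
preds_j = [0, 1, 0, 1, 0]   # one false positive (sample 1)
```

Identical classifiers score zero diversity, so selecting pairs with high diversity pushes the ensemble towards members whose errors do not overlap; the analogous measure over false-negative sets serves the screening model.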

Selection metric for disease screening model

Let the number of classes be k. Among the k classes, k − 1 denote different diseases, and the remaining class corresponds to healthy (normal) images. Base classifiers with the following characteristics were selected to form the disease screening model, where q ranges from 1 to k − 1: the accuracy ai of the base classifier ci should be high; the recall ri of the base classifier ci should be high; and the pair of base classifiers under consideration should give different sets of false negatives for each class. In disease screening mode, the false negatives corresponding to each disease class and the false positives corresponding to the normal class should be minimised. For a disease class q, the diversity of a pair of classifiers ci and cj is measured over their false negatives; for the normal class, it is measured over their false positives. The selection metric combines these diversity terms with the accuracy and recall of the base classifiers.

Selection metric for disease diagnosis model

Base classifiers with the following features were selected to form the disease diagnosis model: the accuracy ai of the base classifier ci should be high; the precision pi of the base classifier ci should be high; and the pair of base classifiers under consideration should give different sets of false positives for each class. Let the number of classes be k; among them, k − 1 denote different diseases and the remaining class corresponds to healthy (normal) images. In disease diagnosis mode, the false positives corresponding to each disease class and the false negatives corresponding to the normal class should be minimised. For a disease class q, the diversity of a pair of classifiers ci and cj is measured over their false positives; for the normal class, it is measured over their false negatives. The selection metric combines these diversity terms with the accuracy and precision of the base classifiers.
Algorithm to generate the stacked ensemble for disease screening.

Proposed multi-class classification framework

We proposed a multi-class classification framework comprising two models: a disease screening model and a disease diagnosis model. The two models are presented as separate architectures because either one can be selected based on the context: the disease screening model minimises the false negatives corresponding to disease samples, and the disease diagnosis model minimises the false positives corresponding to disease samples. The architecture of each model is explained in this section.
Algorithm to generate the stacked ensemble for disease diagnosis.

Disease screening model

We selected three base classifiers ds1, ds2, ds3 based on the selection-measure values according to Algorithm 1. The VGG-19 model combination was selected as the first base classifier when sorted by the selection metric. The first fully connected layer reduces the output of VGG-19 to a single column of 1,000 rows, which the second fully connected layer reduces further to a column of 200 rows; the final fully connected layer reduces the output to three rows. The activation function of the first and second fully connected layers is the rectified linear unit (ReLU), and that of the third fully connected layer is the sigmoid. The first fully connected layer is followed by a dropout layer, which reduces the chance of over-fitting. The second base classifier in the disease screening model is ResNet-101 with three fully connected layers, and the third is Wide ResNet-50-2 with three fully connected layers. The three base classifiers are connected to the meta-classifier using weighted averaging. The architecture of the disease screening model is shown in Fig. 2.
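The dimensions of the fully connected head just described (output → 1,000 → 200 → 3, ReLU on the first two layers, sigmoid on the last) can be sketched as follows. This is a shape-only, pure-Python stand-in: the linear layer averages its input instead of applying learned weights, and the 4,096-dimensional input is an assumption, so only the dimensions are meaningful.

```python
import math

# Shape-only sketch of the screening head: 1,000 -> 200 -> 3.

def linear(v, out_dim):
    return [sum(v) / len(v)] * out_dim  # stand-in for a dense layer

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def screening_head(vgg_features):
    h1 = relu(linear(vgg_features, 1000))  # FC1 + ReLU (dropout at train time)
    h2 = relu(linear(h1, 200))             # FC2 + ReLU
    return sigmoid(linear(h2, 3))          # FC3 + sigmoid: 3 class scores

scores = screening_head([0.5] * 4096)  # input dimension is an assumption
```

The sigmoid output yields one score per class (COVID-19 pneumonia, other pneumonia, normal), each in (0, 1), which the meta-classifier then combines by weighted averaging.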
Fig. 2

Disease screening model architecture


Disease diagnosis model

The disease diagnosis model was generated according to Algorithm 2. The first base classifier was VGG-19 appended with three fully connected layers. The first fully connected layer reduces the column vector of 1,000 rows to 500 rows, the second reduces it to 200 rows, and the last reduces it to 3 rows. The first and second fully connected layers use the ReLU activation, and the third uses a sigmoid. A dropout layer is added after the first fully connected layer to reduce the chance of overfitting. The second base classifier comprised Wide ResNet-50-2 with one fully connected layer, and the third comprised Wide ResNet-50-2 with three fully connected layers. The meta-classifier takes the outputs of the three base classifiers as input and produces the predicted class using weighted averaging, as depicted in Fig. 3; a sigmoid activation is used to predict the class of the given image.
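The meta-classifier's weighted averaging step can be sketched as follows. The actual weight values are not given in this excerpt, so the uniform weights used below are purely illustrative.

```python
def weighted_average_predict(prob_vectors, weights):
    """Combine the class-probability vectors of the base classifiers by
    weighted averaging and return the index of the winning class
    (0 = Normal, 1 = Pneumonia, 2 = COVID-19)."""
    total = sum(weights)
    k = len(prob_vectors[0])
    avg = [sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
           for c in range(k)]
    return max(range(k), key=avg.__getitem__)

# Three base classifiers disagree; the weighted vote settles on COVID-19.
base_outputs = [
    [0.05, 0.15, 0.80],  # VGG-19 with three FC layers
    [0.10, 0.60, 0.30],  # Wide ResNet-50-2 with one FC layer
    [0.05, 0.20, 0.75],  # Wide ResNet-50-2 with three FC layers
]
pred = weighted_average_predict(base_outputs, weights=[1.0, 1.0, 1.0])  # -> 2
```
This averaging is what lets the other two base classifiers compensate when one base classifier misclassifies an image.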
Fig. 3

Disease diagnosis model architecture


Combined model

The disease screening and disease diagnosis models were designed as two separate architectures so that either can be selected based on the context. If both functions (screening and diagnosis) are needed simultaneously, the two models can be combined into a single model with four base classifiers, two of which are common to both functions.

Data and methodology

The details of the data and methodology used in the experiment are explained in this section.

Datasets

Two different chest X-ray datasets were used to evaluate the multi-class classification framework. The first is a large chest X-ray image dataset, and the second contains a limited number of chest X-ray images. The two datasets are summarized in Table 2.
Table 2

Datasets

Name | Dataset1 [15] | Dataset2 [11]
Total images | 15,153 | 6,118
COVID-19 positive | 3,616 | 262
Viral pneumonia | 1,345 | 1,583
Normal images | 10,192 | 4,273
Images used for training | 13,953 (9,792 N, 945 P, 3,216 C) | 5,600 (4,100 N, 1,400 P, 100 C)
Images used for validation | 600 (200 of each class) | 218 (73 N, 83 P, 62 C)
Images used for testing | 600 (200 of each class) | 300 (100 of each class)

N : Normal, P : Pneumonia, C : COVID-19

COVID-19 Radiography Database [15]: a large dataset containing three classes of chest X-ray images.
Source: https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
Number of images: 15,153 (COVID-19 positive: 3,616; viral pneumonia: 1,345; normal or healthy: 10,192)
Training / validation / testing split: 13,953 / 600 / 600

Chest X-ray images pneumonia and COVID-19 [11]: a relatively small dataset with a limited number of COVID-19 pneumonia chest X-ray images.
Source: https://www.kaggle.com/masumrefat/chest-xray-images-pneumonia-and-covid19
Number of images: 6,118 (COVID-19 positive: 262; viral pneumonia: 1,583; normal or healthy: 4,273)
Training / validation / testing split: 5,600 / 218 / 300

Hyper-parameter tuning and data augmentation

During the validation phase, we used a grid search to set the hyper-parameter values. We used the PyTorch framework for the experiments, with the Adam optimizer and a cross-entropy loss function. During the training phase, the batch size was initialized to four and doubled until a memory-related error was encountered; the final batch size was the maximum value that did not trigger such an error. The number of epochs was initialized to 50 and incremented in steps of ten until the training and validation accuracies remained constant. The following values were investigated for each hyper-parameter:

Random resized crop size: 128, 200, 224
Random resized crop scale: (0.5, 1.0), (1.0, 0.5), (0.5, 0.5)
Random rotation angle: [-3°, 3°], [-5°, 5°], [-10°, 10°]
Random horizontal flip probability: 0.3, 0.5, 0.7
Batch size: 4, 8, 16, 32
Number of epochs: 50, 60, 70, 80, 90, 100
Learning rate: 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003

The following hyper-parameter values were used throughout the experiment: 100 epochs; learning rate 1e-3; batch size 16; random resized crop size 224; random resized crop scale (0.5, 1.0); random rotation angle [-5°, 5°]; random horizontal flip probability 0.5. Tenfold cross-validation was used to avoid overfitting, and a dropout layer was added after the first fully connected layer of each base classifier to reduce the chance of overfitting further.
Moreover, because the proposed models are stacked ensembles, the final output is aggregated from multiple models, which also reduces the chance of overfitting.
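The grid search over the listed values can be sketched as follows; the `validation_score` callback, which stands in for training a configuration and evaluating it on the validation set, is hypothetical.

```python
from itertools import product

# Hyper-parameter grids listed in the text
grid = {
    "crop_size": [128, 200, 224],
    "crop_scale": [(0.5, 1.0), (1.0, 0.5), (0.5, 0.5)],
    "rotation": [3, 5, 10],          # symmetric range [-x deg, +x deg]
    "hflip_p": [0.3, 0.5, 0.7],
    "batch_size": [4, 8, 16, 32],
    "epochs": [50, 60, 70, 80, 90, 100],
    "lr": [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003],
}

def configurations(grid):
    """Yield every combination of hyper-parameter values (the full grid)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(grid, validation_score):
    """Return the configuration with the best validation score."""
    return max(configurations(grid), key=validation_score)
```
With the grids above, the search space contains 3 × 3 × 3 × 3 × 4 × 6 × 6 = 11,664 configurations; the values the paper ultimately selected correspond to crop_size=224, crop_scale=(0.5, 1.0), rotation=5, hflip_p=0.5, batch_size=16, epochs=100, lr=0.001.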

Evaluation metrics

Five evaluation metrics were used to measure the performance of the proposed model. For a single class, the binary classification metrics are defined as follows, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives:

Precision: the fraction of positive predictions that belong to the positive class, Precision = TP / (TP + FP).
Recall: the fraction of positive examples in the dataset that are predicted positive, Recall = TP / (TP + FN).
Specificity: the fraction of negative examples in the dataset that are predicted negative, Specificity = TN / (TN + FP).
F1 Score: the harmonic mean of precision and recall, F1 = 2 · Precision · Recall / (Precision + Recall).
Accuracy: the fraction of the total predictions that are correct, Accuracy = (TP + TN) / (TP + TN + FP + FN).

The metrics for multi-class classification were obtained by macro averaging: precision, recall, F1 score, and accuracy were first calculated for each class and then averaged over the three classes.
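The definitions above translate directly into code. This sketch derives the per-class counts from true and predicted labels in a one-vs-rest view and then macro-averages, as the paper describes.

```python
def binary_counts(y_true, y_pred, cls):
    """TP, FP, TN, FN for one class in a one-vs-rest view."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, tn, fn

def class_metrics(y_true, y_pred, cls):
    """Precision, recall, specificity and F1 for a single class."""
    tp, fp, tn, fn = binary_counts(y_true, y_pred, cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

def macro_average(y_true, y_pred, classes):
    """Average each metric over the classes (macro averaging)."""
    per_class = [class_metrics(y_true, y_pred, c) for c in classes]
    return {m: sum(d[m] for d in per_class) / len(classes)
            for m in per_class[0]}
```
Macro averaging weights every class equally, which matters here because the normal class is far larger than the COVID-19 class in both datasets.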

Experimental results

The results obtained in the experiments are presented in this section, which is divided into two subsections. The first subsection analyzes the results obtained by the proposed stacked ensembles on two chest X-ray datasets. The second subsection evaluates the proposed model at different threshold values.

Performance analysis of the proposed multi-class classification framework

The proposed framework was evaluated using two chest X-ray datasets: the COVID-19 Radiography Database [15] and Chest X-ray images pneumonia and COVID-19 [11]. The following notation is used for the models: M_0 denotes the base classifier formed from a pre-trained model M with a softmax layer appended; M_1 denotes M with one additional fully connected layer and a softmax layer; M_2 denotes M with two additional fully connected layers and a softmax layer. The candidates considered for the pre-trained model M were ResNet-101, DenseNet-169, VGG-19, and WideResNet-50-2, selected on the basis of their performance as reported in existing studies. The overall precision is calculated as a weighted average of the per-class precisions, with the COVID-19 class given a higher weight than the other two classes; the same weighting is applied to recall. When the first dataset, the COVID-19 Radiography Database [15], was used for evaluation, the train-validation-test split was as follows: the test set contained 600 images to assess the generality of the model, a validation set of 600 images was used during model development, and the remaining 13,953 images formed the training set. The precision values of the base classifiers and the proposed models are compared in Table 3. The base classifiers achieved high precision for each class; the disease diagnosis model yielded fewer false positives than the base classifiers and high precision for the Pneumonia and COVID-19 classes, whereas the disease screening model minimized the false positives of the normal class and therefore achieved high precision for that class. Recall values for the three classes are given in Table 4.
The disease screening model achieved high recall for the Pneumonia and COVID-19 classes, while the disease diagnosis model yielded high recall for the normal (healthy) class. When the average precision, recall, accuracy, and F1 score in Table 6 are considered, both the screening and diagnosis models outperformed the base classifiers: the disease diagnosis model had the highest average precision, and the disease screening model had the highest average recall. However, the margin of difference is small because this dataset contains enough images for training (Tables 5 and 6).
Table 3

Comparison of multi-class classification framework’s precision with base classifiers on COVID-19 Radiography Database [15]

Model | Precision-Normal | Precision-Pneumonia | Precision-COVID-19
VGG-19_0 | 0.9091 | 1 | 0.99
VGG-19_1 | 0.9346 | 0.9947 | 0.9848
VGG-19_2 | 0.9132 | 1 | 0.9898
DenseNet-169_0 | 0.9171 | 1 | 0.9747
DenseNet-169_1 | 0.8924 | 0.9945 | 0.9897
DenseNet-169_2 | 0.9337 | 0.9648 | 0.9366
ResNet-101_0 | 0.9458 | 0.9945 | 0.9213
ResNet-101_1 | 0.8959 | 1 | 0.9522
ResNet-101_2 | 0.9471 | 0.9896 | 0.96
WideResNet-50-2_0 | 0.9327 | 1 | 0.9655
WideResNet-50-2_1 | 0.9561 | 0.9701 | 0.9794
WideResNet-50-2_2 | 0.9843 | 0.9707 | 0.9608
Disease diagnosis model | 0.9346 | 1 | 1
Disease screening model | 0.9852 | 0.9947 | 0.9476

Bold entries denote highest values for a specific evaluation metric (Column)

Table 4

Comparison of multi-class classification framework’s recall with base classifiers on COVID-19 Radiography Database [15]

Model | Recall-Normal | Recall-Pneumonia | Recall-COVID-19
VGG-19_0 | 1 | 0.9 | 0.99
VGG-19_1 | 1 | 0.94 | 0.97
VGG-19_2 | 1 | 0.92 | 0.975
DenseNet-169_0 | 0.995 | 0.925 | 0.965
DenseNet-169_1 | 0.995 | 0.905 | 0.965
DenseNet-169_2 | 0.915 | 0.96 | 0.96
ResNet-101_0 | 0.96 | 0.9 | 0.995
ResNet-101_1 | 0.99 | 0.85 | 0.995
ResNet-101_2 | 0.985 | 0.95 | 0.96
WideResNet-50-2_0 | 0.97 | 0.945 | 0.98
WideResNet-50-2_1 | 0.98 | 0.975 | 0.95
WideResNet-50-2_2 | 0.94 | 0.995 | 0.98
Disease diagnosis model | 1 | 0.98 | 0.95
Disease screening model | 1 | 0.93 | 0.995

Bold entries denote highest values for a specific evaluation metric (Column)

Table 6

Comparison of proposed multi-class classification framework with base classifiers on COVID-19 Radiography Database [15]

Model | Precision | Recall | Accuracy | F1 Score
VGG-19_0 | 0.9722 | 0.97 | 0.9633 | 0.9711
VGG-19_1 | 0.9747 | 0.97 | 0.97 | 0.9723
VGG-19_2 | 0.9732 | 0.9675 | 0.965 | 0.9704
DenseNet-169_0 | 0.9666 | 0.9625 | 0.9617 | 0.9646
DenseNet-169_1 | 0.9666 | 0.9575 | 0.955 | 0.962
DenseNet-169_2 | 0.9429 | 0.9488 | 0.945 | 0.9458
ResNet-101_0 | 0.9457 | 0.9625 | 0.9517 | 0.954
ResNet-101_1 | 0.9501 | 0.9575 | 0.945 | 0.9538
ResNet-101_2 | 0.9642 | 0.9638 | 0.965 | 0.964
WideResNet-50-2_0 | 0.9659 | 0.9688 | 0.965 | 0.9673
WideResNet-50-2_1 | 0.9713 | 0.9638 | 0.9683 | 0.9675
WideResNet-50-2_2 | 0.9691 | 0.9738 | 0.9717 | 0.9714
Disease diagnosis model | 0.9836 | 0.97 | 0.9767 | 0.9768
Disease screening model | 0.9688 | 0.98 | 0.975 | 0.9744

Bold entries denote highest values for a specific evaluation metric (Column)

Table 5

Comparison of multi-class classification framework’s specificity with base classifiers on COVID-19 Radiography Database [15]

Model | Specificity-Normal | Specificity-Pneumonia | Specificity-COVID-19
VGG-19_0 | 0.9497 | 1 | 0.9948
VGG-19_1 | 0.9646 | 0.9975 | 0.9923
VGG-19_2 | 0.9523 | 1 | 0.9948
DenseNet-169_0 | 0.9545 | 1 | 0.9871
DenseNet-169_1 | 0.9397 | 0.9975 | 0.9948
DenseNet-169_2 | 0.9673 | 0.9817 | 0.9665
ResNet-101_0 | 0.9718 | 0.9974 | 0.9563
ResNet-101_1 | 0.9413 | 1 | 0.9735
ResNet-101_2 | 0.9720 | 0.9949 | 0.9797
WideResNet-50-2_0 | 0.9649 | 1 | 0.9821
WideResNet-50-2_1 | 0.9772 | 0.9847 | 0.9899
WideResNet-50-2_2 | 0.9923 | 0.9846 | 0.9797
Disease diagnosis model | 0.965 | 1 | 1
Disease screening model | 0.9923 | 0.9975 | 0.9723

Bold entries denote highest values for a specific evaluation metric (Column)

When the second dataset, Chest Xray Images PNEUMONIA and Covid-19 [11], was used for evaluation, the train-validation-test split was as follows: the test set contained 300 images to assess the generality of the model, a validation set of 218 images was used during model development, and the remaining 5,600 images formed the training set (Tables 7 and 8).
Table 7

Confusion Matrix of the Disease Diagnosis Model on the COVID-19 Radiography Database [15]

Predicted Labels
True labels | Normal | Pneumonia | COVID-19 | Total
Normal | 200 | 0 | 0 | 200
Pneumonia | 4 | 196 | 0 | 200
COVID-19 | 10 | 0 | 190 | 200
Total | 214 | 196 | 190 | 600
Table 8

Confusion Matrix of the Disease Screening Model on the COVID-19 Radiography Database [15]

Predicted Labels
True labels | Normal | Pneumonia | COVID-19 | Total
Normal | 200 | 0 | 0 | 200
Pneumonia | 3 | 186 | 11 | 200
COVID-19 | 0 | 1 | 199 | 200
Total | 203 | 187 | 210 | 600
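The per-class precision and recall reported in Tables 3 and 4 for the screening model can be recomputed directly from the confusion matrix in Table 8 (columns give predicted totals, rows give true totals), which serves as a useful sanity check on the tables.

```python
classes = ["Normal", "Pneumonia", "COVID-19"]
# Table 8: rows = true labels, columns = predicted labels
cm = [
    [200,   0,   0],   # Normal
    [  3, 186,  11],   # Pneumonia
    [  0,   1, 199],   # COVID-19
]

def precision(cm, k):
    """Diagonal entry over its column sum (all predictions of class k)."""
    return cm[k][k] / sum(row[k] for row in cm)

def recall(cm, k):
    """Diagonal entry over its row sum (all true members of class k)."""
    return cm[k][k] / sum(cm[k])

# Reproduces the screening-model rows of Tables 3 and 4:
# precision -> 0.9852, 0.9947, 0.9476 ; recall -> 1.0, 0.93, 0.995
results = [(round(precision(cm, k), 4), round(recall(cm, k), 4))
           for k in range(len(classes))]
```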
The precision values of the base classifiers and the proposed models on the second dataset are compared in Table 9. The base classifiers achieved high precision for each class; the disease diagnosis model yielded fewer false positives than the base classifiers and high precision for the Pneumonia and COVID-19 classes, whereas the disease screening model minimized the false positives of the normal class and therefore achieved high precision for that class. Recall values for the three classes are given in Table 10. The disease screening model achieved high recall for the Pneumonia and COVID-19 classes, while the disease diagnosis model yielded high recall for the normal (healthy) class. When the average precision, recall, accuracy, and F1 score in Table 12 are considered, both the screening and diagnosis models outperformed the base classifiers: the disease diagnosis model had the highest average precision, and the disease screening model had the highest average recall. The margin of difference between the stacked ensembles and the base classifiers is larger on this second dataset than on the first, implying that the stacked ensemble is more effective with limited data (Tables 11, 12, 13 and 14).
Table 9

Comparison of multi-class classification framework’s precision with base classifiers on Chest Xray Images PNEUMONIA and Covid-19 dataset [11]

Model | Precision-Normal | Precision-Pneumonia | Precision-COVID-19
VGG-19_0 | 0.9151 | 0.9515 | 1
VGG-19_1 | 0.949 | 0.8205 | 0.9882
VGG-19_2 | 0.9417 | 0.9612 | 1
DenseNet-169_0 | 0.9375 | 0.8919 | 1
DenseNet-169_1 | 0.9381 | 0.9074 | 1
DenseNet-169_2 | 0.9684 | 0.9252 | 1
ResNet-101_0 | 0.9327 | 0.97 | 0.9896
ResNet-101_1 | 0.96 | 0.9252 | 1
ResNet-101_2 | 0.9886 | 0.8772 | 1
WideResNet-50-2_0 | 0.96 | 0.9245 | 1
WideResNet-50-2_1 | 0.9588 | 0.9159 | 1
WideResNet-50-2_2 | 0.9681 | 0.9083 | 1
Disease diagnosis model | 0.9794 | 0.934 | 1
Disease screening model | 0.9897 | 1 | 0.9429

Bold entries denote highest values for a specific evaluation metric (Column)

Table 10

Comparison of the proposed multi-class classification framework’s recall with base classifiers on Chest Xray Images PNEUMONIA and Covid-19 dataset [11]

Model | Recall-Normal | Recall-Pneumonia | Recall-COVID-19
VGG-19_0 | 0.97 | 0.98 | 0.91
VGG-19_1 | 0.93 | 0.96 | 0.84
VGG-19_2 | 0.97 | 0.99 | 0.94
DenseNet-169_0 | 0.9 | 0.99 | 0.93
DenseNet-169_1 | 0.91 | 0.98 | 0.95
DenseNet-169_2 | 0.92 | 0.99 | 0.98
ResNet-101_0 | 0.97 | 0.97 | 0.95
ResNet-101_1 | 0.96 | 0.99 | 0.93
ResNet-101_2 | 0.87 | 1 | 0.98
WideResNet-50-2_0 | 0.98 | 0.96 | 0.94
WideResNet-50-2_1 | 0.98 | 0.9567 | 0.96
WideResNet-50-2_2 | 0.91 | 0.99 | 0.97
Disease diagnosis model | 0.95 | 0.99 | 0.97
Disease screening model | 0.96 | 0.98 | 0.99

Bold entries denote highest values for a specific evaluation metric (Column)

Table 12

Comparison of proposed multi-class classification framework with base classifiers using chest X-ray images pneumonia and COVID-19 [11]

Model | Precision | Recall | Accuracy | F1 Score
VGG-19_0 | 0.9666 | 0.9425 | 0.9533 | 0.9544
VGG-19_1 | 0.9365 | 0.8925 | 0.91 | 0.914
VGG-19_2 | 0.9757 | 0.96 | 0.9667 | 0.9678
DenseNet-169_0 | 0.9573 | 0.9375 | 0.94 | 0.9473
DenseNet-169_1 | 0.9614 | 0.9475 | 0.9467 | 0.9544
DenseNet-169_2 | 0.9734 | 0.9675 | 0.9633 | 0.9704
ResNet-101_0 | 0.9705 | 0.96 | 0.9633 | 0.9652
ResNet-101_1 | 0.9713 | 0.9525 | 0.96 | 0.9618
ResNet-101_2 | 0.9665 | 0.9575 | 0.95 | 0.962
WideResNet-50-2_0 | 0.9711 | 0.955 | 0.96 | 0.963
WideResNet-50-2_1 | 0.9687 | 0.9575 | 0.9567 | 0.963
WideResNet-50-2_2 | 0.9691 | 0.96 | 0.9567 | 0.9645
Disease diagnosis model | 0.9783 | 0.97 | 0.97 | 0.9742
Disease screening model | 0.9689 | 0.98 | 0.9767 | 0.9744

Bold entries denote highest values for a specific evaluation metric (Column)

Table 11

Comparison of multi-class classification framework’s specificity with base classifiers on Chest Xray Images PNEUMONIA and Covid-19 dataset [11]

Model | Specificity-Normal | Specificity-Pneumonia | Specificity-COVID-19
VGG-19_0 | 0.9545 | 0.9741 | 1
VGG-19_1 | 0.973 | 0.8939 | 0.9947
VGG-19_2 | 0.9698 | 0.9795 | 1
DenseNet-169_0 | 0.9697 | 0.9385 | 1
DenseNet-169_1 | 0.9698 | 0.949 | 1
DenseNet-169_2 | 0.985 | 0.9596 | 1
ResNet-101_0 | 0.9648 | 0.9846 | 0.9949
ResNet-101_1 | 0.9796 | 0.9594 | 1
ResNet-101_2 | 0.995 | 0.9296 | 1
WideResNet-50-2_0 | 0.9796 | 0.9596 | 1
WideResNet-50-2_1 | 0.9798 | 0.9545 | 1
WideResNet-50-2_2 | 0.9849 | 0.9495 | 1
Disease diagnosis model | 0.9899 | 0.9648 | 1
Disease screening model | 0.995 | 1 | 0.97

Bold entries denote highest values for a specific evaluation metric (Column)

Table 13

Confusion Matrix of the Disease Diagnosis Model on the Chest Xray Images PNEUMONIA and Covid-19 dataset [11]

Predicted Labels
True labels | Normal | Pneumonia | COVID-19 | Total
Normal | 95 | 5 | 0 | 100
Pneumonia | 1 | 99 | 0 | 100
COVID-19 | 1 | 2 | 97 | 100
Total | 97 | 106 | 97 | 300
Table 14

Confusion Matrix of the Disease Screening Model on the Chest Xray Images PNEUMONIA and Covid-19 dataset [11]

Predicted Labels
True labels | Normal | Pneumonia | COVID-19 | Total
Normal | 96 | 0 | 4 | 100
Pneumonia | 0 | 98 | 2 | 100
COVID-19 | 1 | 0 | 99 | 100
Total | 97 | 98 | 105 | 300

Evaluation of the multi-class classification framework under varied thresholds

For the COVID-19 Radiography Database [15], as the threshold rises, the precision of all classes initially increases and then decreases; the recall of COVID-19 decreases, while the recall of the other two classes increases. This is because a higher threshold increases the number of false negatives for COVID-19 and decreases it for the other two classes. The F1 score and accuracy are maximal at a threshold of 0.3. Tables 15, 16 and 17 show the experimental results for the first dataset [15], which are plotted in Fig. 4.
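The mechanism behind this trade-off can be sketched with toy scores (not the paper's data): raising the cut-off on the predicted COVID-19 probability converts borderline positives into false negatives, so COVID-19 recall falls, while the predictions that clear the bar tend to be more reliable.

```python
def pr_at_threshold(scores, labels, t):
    """Precision and recall for the positive (COVID-19) class when an image
    is called positive iff its predicted probability is >= t.
    labels: 1 = COVID-19, 0 = other."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    1,    0,    1,    0]
low = pr_at_threshold(scores, labels, 0.3)   # permissive: high recall
high = pr_at_threshold(scores, labels, 0.7)  # strict: high precision
```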
Table 15

Precision of the proposed model on COVID-19 Radiography Database [15] under varied threshold

Threshold | Precision-Normal | Precision-Pneumonia | Precision-COVID-19
0.1 | 0.9041 | 0.9944 | 0.9754
0.2 | 0.9005 | 0.9944 | 0.985
0.3 | 0.9434 | 0.9948 | 0.9949
0.4 | 0.9302 | 0.9948 | 0.9948
0.5 | 0.9259 | 0.9948 | 0.9948
0.6 | 0.9091 | 0.9948 | 0.9947
0.7 | 0.8969 | 0.9948 | 0.9946
0.8 | 0.8734 | 0.9948 | 0.9944
0.9 | 0.8621 | 0.9845 | 0.9943

Bold entries denote highest values for a specific evaluation metric (Column)

Table 16

Recall of the proposed model on COVID-19 Radiography Database [15] under varied threshold

Threshold | Recall-Normal | Recall-Pneumonia | Recall-COVID-19
0.1 | 0.99 | 0.885 | 0.99
0.2 | 0.995 | 0.89 | 0.985
0.3 | 1 | 0.95 | 0.98
0.4 | 1 | 0.95 | 0.965
0.5 | 1 | 0.95 | 0.96
0.6 | 1 | 0.95 | 0.94
0.7 | 1 | 0.95 | 0.925
0.8 | 1 | 0.95 | 0.895
0.9 | 1 | 0.95 | 0.87

Bold entries denote highest values for a specific evaluation metric (Column)

Table 17

Performance of the proposed model on COVID-19 Radiography Database [15] under varied thresholds

Threshold | Precision | Recall | Accuracy | F1 Score
0.1 | 0.958 | 0.955 | 0.955 | 0.9565
0.2 | 0.96 | 0.9567 | 0.9567 | 0.9583
0.3 | 0.9777 | 0.9767 | 0.9767 | 0.9772
0.4 | 0.9733 | 0.9717 | 0.9717 | 0.9725
0.5 | 0.9718 | 0.97 | 0.97 | 0.9709
0.6 | 0.9662 | 0.9633 | 0.9633 | 0.9648
0.7 | 0.9621 | 0.9583 | 0.9583 | 0.9602
0.8 | 0.9542 | 0.9483 | 0.9483 | 0.9513
0.9 | 0.9469 | 0.94 | 0.94 | 0.9435

Bold entries denote highest values for a specific evaluation metric (Column)

Fig. 4

Variation of Precision, Recall, Accuracy and F1 score with threshold on COVID-19 Radiography Database [15]

For the Chest X-ray images pneumonia and COVID-19 [11] dataset, the precision of the COVID-19 class remains constant as the threshold increases, while that of the other two classes decreases. The recall of the COVID-19 class decreases with an increasing threshold, while that of the other two classes remains constant. The accuracy and F1 score are maximal at a threshold of 0.1. Tables 18, 19 and 20 show the evaluation metrics obtained by varying the threshold, and the results are plotted in Fig. 5.
Table 18

Performance of the proposed model on the chest X-ray images pneumonia and COVID-19 [11] under varied threshold

Threshold | Precision-Normal | Precision-Pneumonia | Precision-COVID-19
0.1 | 0.9694 | 0.9519 | 1
0.2 | 0.9596 | 0.9519 | 1
0.3 | 0.9596 | 0.9519 | 1
0.4 | 0.9596 | 0.9519 | 1
0.5 | 0.9596 | 0.9519 | 1
0.6 | 0.9596 | 0.9519 | 1
0.7 | 0.95 | 0.9429 | 1
0.8 | 0.95 | 0.934 | 1
0.9 | 0.9406 | 0.9252 | 1

Bold entries denote highest values for a specific evaluation metric (Column)

Table 19

Performance of the proposed model on the chest X-ray images pneumonia and COVID-19 [11] under varied threshold

Threshold | Recall-Normal | Recall-Pneumonia | Recall-COVID-19
0.1 | 0.95 | 0.99 | 0.98
0.2 | 0.95 | 0.99 | 0.97
0.3 | 0.95 | 0.99 | 0.97
0.4 | 0.95 | 0.99 | 0.97
0.5 | 0.95 | 0.99 | 0.97
0.6 | 0.95 | 0.99 | 0.97
0.7 | 0.95 | 0.99 | 0.95
0.8 | 0.95 | 0.99 | 0.94
0.9 | 0.95 | 0.99 | 0.92

Bold entries denote highest values for a specific evaluation metric (Column)

Table 20

Performance of the proposed model on the chest X-ray images pneumonia and COVID-19 [11] under varied threshold

Threshold | Precision | Recall | Accuracy | F1 Score
0.1 | 0.9737 | 0.9733 | 0.9733 | 0.9736
0.2 | 0.9705 | 0.97 | 0.97 | 0.9703
0.3 | 0.9705 | 0.97 | 0.97 | 0.9703
0.4 | 0.9705 | 0.97 | 0.97 | 0.9703
0.5 | 0.9705 | 0.97 | 0.97 | 0.9703
0.6 | 0.9705 | 0.97 | 0.97 | 0.9703
0.7 | 0.9643 | 0.9633 | 0.9633 | 0.9638
0.8 | 0.9613 | 0.96 | 0.96 | 0.9607
0.9 | 0.9533 | 0.9533 | 0.9533 | 0.9543

Bold entries denote highest values for a specific evaluation metric (Column)

Fig. 5

Variation of Precision, Recall, Accuracy and F1 score with threshold on the chest X-ray images pneumonia and COVID-19 [11]


Evaluation of the proposed model under varied noise levels

The performance of the disease diagnosis model and the disease screening model on the two datasets was explored under varied noise levels. The results are listed in Tables 21, 22, 23 and 24. As seen from these tables, performance decreases as the percentage of noise increases, which is expected. Both models performed well when the noise percentage was less than 2.5%.
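The noise experiment can be emulated as follows. The tables report only a noise percentage, not a noise type, so the salt-and-pepper corruption used here is an illustrative assumption.

```python
import random

def add_salt_pepper(pixels, noise_pct, seed=0):
    """Corrupt `noise_pct` percent of the pixels of a flattened grayscale
    image (values 0-255) by setting them to salt (255) or pepper (0).
    Salt-and-pepper is an assumed noise model; the paper states only the
    percentage of corrupted pixels."""
    rng = random.Random(seed)
    out = list(pixels)
    n = round(len(out) * noise_pct / 100.0)
    for i in rng.sample(range(len(out)), n):  # n distinct pixel positions
        out[i] = rng.choice((0, 255))
    return out

clean = [128] * 1000                 # dummy 1000-pixel image
noisy = add_salt_pepper(clean, 2.5)  # 2.5% of pixels corrupted
```
Evaluating the trained models on test images corrupted at 2.5%, 5%, 10% and 30% would reproduce the sweep reported in Tables 21 to 24.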
Table 21

Performance of the Disease Diagnosis Model on COVID-19 Radiography Database [15] under varied noise levels

Noise Percentage | Precision | Recall | Accuracy | F1 Score
0 | 0.9836 | 0.97 | 0.9767 | 0.9768
2.5 | 0.9687 | 0.9513 | 0.96 | 0.9599
5 | 0.913 | 0.9013 | 0.91 | 0.9071
10 | 0.8184 | 0.8013 | 0.81 | 0.8097
30 | 0.6185 | 0.605 | 0.605 | 0.6117
Table 22

Performance of the Disease Screening Model on COVID-19 Radiography Database [15] under varied noise levels

Noise Percentage | Precision | Recall | Accuracy | F1 Score
0 | 0.9688 | 0.98 | 0.975 | 0.9744
2.5 | 0.9389 | 0.9425 | 0.9417 | 0.9407
5 | 0.8834 | 0.8988 | 0.8917 | 0.891
10 | 0.7973 | 0.8122 | 0.8048 | 0.8047
30 | 0.5833 | 0.5913 | 0.5833 | 0.5873
Table 23

Performance of the Disease Diagnosis Model on the chest X-ray images pneumonia and COVID-19 [11] under varied noise levels

Noise Percentage | Precision | Recall | Accuracy | F1 Score
0 | 0.9783 | 0.97 | 0.97 | 0.9742
2.5 | 0.9353 | 0.9275 | 0.9233 | 0.9314
5 | 0.8804 | 0.8675 | 0.8667 | 0.8739
10 | 0.8037 | 0.77 | 0.7733 | 0.7865
30 | 0.6385 | 0.6025 | 0.6 | 0.62
Table 24

Performance of the Disease Screening Model on the chest X-ray images pneumonia and COVID-19 [11] under varied noise levels

Noise Percentage | Precision | Recall | Accuracy | F1 Score
0 | 0.9689 | 0.98 | 0.9767 | 0.9744
2.5 | 0.923 | 0.9375 | 0.93 | 0.9302
5 | 0.8684 | 0.8825 | 0.8733 | 0.8754
10 | 0.7886 | 0.7975 | 0.7967 | 0.793
30 | 0.628 | 0.64 | 0.6367 | 0.634

Explainable AI using LIME

In this study, we used Local Interpretable Model-agnostic Explanations (LIME) [54] to interpret the results produced by our proposed model. LIME is model-agnostic: it can provide interpretations for the output of any machine learning model, and it helps us understand which regions the model attends to while making predictions. LIME first generates samples of a particular test image by perturbing the pixels of the original image, then generates a prediction for each sample. Next, it computes a weight for each sample using the cosine distance to the original image. Finally, it fits a linear classifier to the generated samples and their predictions, and uses the weights of this classifier to select the most important features. The LIME interpretations of a few images belonging to the pneumonia and COVID-19 classes are shown in Figs. 6 and 7. In Fig. 6, the regions marked in red are areas that may contain COVID-19-pneumonia and hence decrease the probability of the Other Pneumonia class, while the regions marked in green increase the probability of the Other Pneumonia class.
Fig. 6

Pneumonia Images : Output from LIME

Fig. 7

COVID-19 Images : Output from LIME

In Fig. 7, the regions marked in green are areas that increase the probability of COVID-19-pneumonia, and the regions marked in red are areas that decrease it. In the images above, the regions marked in green fall within the lungs; therefore, our model was looking at the right areas when making predictions.

Comparison with existing models

The multi-class classification framework’s performance was compared with that of existing models [14, 35, 49, 63, 65] evaluated on the same datasets under similar conditions; as shown in Table 25, the proposed framework performed better. The better performance can be ascribed to the following reasons. Unlike the other models or ensembles, our ensemble was generated systematically using a selection metric, so that the other two base classifiers compensated for misclassifications made by any one base classifier.
Table 25

Comparison between the proposed model and models proposed in previous research papers

Model | Accuracy | F1 Score
CNN Model [63] | 0.8422 | 0.8421
Automated Model [49] | 0.8702 | 0.8737
COVID NET Model [65] | 0.924 | 0.9
CORONET Model [35] | 0.95 | 0.956
CVDNET Model [14] | 0.9669 | 0.9668
Proposed framework | 0.9767 | 0.9774
Another reason for the better performance was the inclusion of fully connected layers. The accuracy and F1 score of the proposed model are compared with those of the pre-trained models in Figs. 8 and 9, respectively.
Fig. 8

F1 Score of Different models on different datasets

Fig. 9

Accuracy of Different models on different datasets

The experimental results reported in this study should be considered in light of some limitations. First, the margin between the proposed stacked ensembles and the base classifiers is not large, because the base classifiers considered here already performed well on the two chest X-ray datasets. Second, the proposed model was not evaluated in terms of memory requirements, number of parameters, or execution time.

Conclusion and future work

A multi-class classification framework for disease detection was designed using stacked ensembles of pre-trained models. One stacked ensemble was generated for disease screening mode, and another was designed for diagnosis mode, with a different base-classifier selection metric for each mode. The proposed framework was trained and tested on two chest X-ray image datasets containing three classes of images, and it outperformed both the base classifiers and existing models on the two datasets. The average recall and precision of the proposed framework were 98%. The main objective of the study was to develop deep-learning-based stacked ensembles useful for building efficient computer-aided diagnosis systems. Unlike existing studies, this study focused on minimising either false negatives or false positives, depending on the context, using stacked ensembles of pre-trained models and fully connected layers. We designed selection metrics and a systematic method to generate efficient stacked ensembles that minimise false positives or false negatives.

In future work, lightweight deep learning architectures can be used as base models to reduce computational time and resources. Federated learning can be used to preserve the privacy of different datasets and to reduce local computation. The proposed framework can also be extended to the classification of multiple lung diseases detectable from chest X-rays and CT scans; based on the characteristics of each disease, suitable metrics can be selected to generate the corresponding ensemble.
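The conclusion notes that different base-classifier selection metrics were designed for the screening and diagnosis modes. The paper's exact formulas are not reproduced here; the sketch below only illustrates the general idea of a mode-dependent metric, with purely assumed weights (2:1) and the assumption that class 0 denotes "normal" while any other class counts as disease-positive.

```python
import numpy as np

def selection_score(y_true, y_pred, mode="screening"):
    """Score a candidate base classifier for ensemble selection.

    'screening' penalises false negatives (missed disease) more heavily;
    'diagnosis' penalises false positives. The 2:1 weighting is an
    illustrative assumption, not the paper's exact metric.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos_t, pos_p = y_true != 0, y_pred != 0   # class 0 assumed "normal"
    fn = (pos_t & ~pos_p).sum()               # disease predicted as normal
    fp = (~pos_t & pos_p).sum()               # normal predicted as disease
    w_fn, w_fp = (2.0, 1.0) if mode == "screening" else (1.0, 2.0)
    return 1.0 - (w_fn * fn + w_fp * fp) / (w_fn + w_fp) / len(y_true)

# Two hypothetical classifiers with the same number of errors but
# different error types: one misses disease, the other raises alarms.
y_true = [0, 0, 1, 1, 2, 2]
misses_disease = [0, 0, 0, 1, 2, 0]   # 2 false negatives
false_alarms = [1, 2, 1, 1, 2, 2]     # 2 false positives
print(selection_score(y_true, false_alarms, "screening"),
      selection_score(y_true, misses_disease, "screening"))
```

Under this metric, screening mode prefers the classifier whose errors are false alarms, while diagnosis mode prefers the one whose errors are missed cases, so the two modes select different base classifiers for their respective ensembles.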
References (showing 8 of 56)

1.  COVID-19 detection and disease progression visualization: Deep learning on chest X-rays for classification and coarse localization.

Authors:  Tahmina Zebin; Shahadate Rezvy
Journal:  Appl Intell (Dordr)       Date:  2020-09-12       Impact factor: 5.086

2.  Review on COVID-19 diagnosis models based on machine learning and deep learning approaches.

Authors:  Zaid Abdi Alkareem Alyasseri; Mohammed Azmi Al-Betar; Iyad Abu Doush; Mohammed A Awadallah; Ammar Kamal Abasi; Sharif Naser Makhadmeh; Osama Ahmad Alomari; Karrar Hameed Abdulkareem; Afzan Adam; Robertas Damasevicius; Mazin Abed Mohammed; Raed Abu Zitar
Journal:  Expert Syst       Date:  2021-07-28       Impact factor: 2.812

3.  CT Imaging of the 2019 Novel Coronavirus (2019-nCoV) Pneumonia.

Authors:  Junqiang Lei; Junfeng Li; Xun Li; Xiaolong Qi
Journal:  Radiology       Date:  2020-01-31       Impact factor: 11.105

4.  A lightweight deep learning architecture for the automatic detection of pneumonia using chest X-ray images.

Authors:  Megha Trivedi; Abhishek Gupta
Journal:  Multimed Tools Appl       Date:  2021-12-27       Impact factor: 2.577

5.  Diagnosis of the Coronavirus disease (COVID-19): rRT-PCR or CT?

Authors:  Chunqin Long; Huaxiang Xu; Qinglin Shen; Xianghai Zhang; Bing Fan; Chuanhong Wang; Bingliang Zeng; Zicong Li; Xiaofen Li; Honglu Li
Journal:  Eur J Radiol       Date:  2020-03-25       Impact factor: 3.528

6.  CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest x-ray images.

Authors:  Asif Iqbal Khan; Junaid Latief Shah; Mohammad Mudasir Bhat
Journal:  Comput Methods Programs Biomed       Date:  2020-06-05       Impact factor: 5.428

7.  Diagnostic performance of chest CT to differentiate COVID-19 pneumonia in non-high-epidemic area in Japan.

Authors:  Yuki Himoto; Akihiko Sakata; Mitsuhiro Kirita; Takashi Hiroi; Ken-Ichiro Kobayashi; Kenji Kubo; Hyunjin Kim; Azusa Nishimoto; Chikara Maeda; Akira Kawamura; Nobuhiro Komiya; Shigeaki Umeoka
Journal:  Jpn J Radiol       Date:  2020-03-30       Impact factor: 2.374

8.  COVID-19 image classification using deep features and fractional-order marine predators algorithm.

Authors:  Ahmed T Sahlol; Dalia Yousri; Ahmed A Ewees; Mohammed A A Al-Qaness; Robertas Damasevicius; Mohamed Abd Elaziz
Journal:  Sci Rep       Date:  2020-09-21       Impact factor: 4.379
