
COVID-19 detection on Chest X-ray images: A comparison of CNN architectures and ensembles.

Abstract

COVID-19 became a global pandemic only four months after its first detection. It is crucial to detect this disease as soon as possible to decrease its spread. The use of chest X-ray (CXR) images became an effective screening strategy, complementary to the reverse transcription-polymerase chain reaction (RT-PCR). Convolutional neural networks (CNNs) are often used for automatic image classification and they can be very useful in CXR diagnostics. In this paper, 21 different CNN architectures are tested and compared in the task of identifying COVID-19 in CXR images. They were applied to the COVIDx8B dataset, a large COVID-19 dataset with 16,352 CXR images coming from patients of at least 51 countries. Ensembles of CNNs were also employed and they showed better efficacy than individual instances. The best individual CNN instance results were achieved by DenseNet169, with an accuracy of 98.15% and an F1 score of 98.12%. These were further increased to 99.25% and 99.24%, respectively, through an ensemble with five instances of DenseNet169. These results are higher than those obtained in recent works using the same dataset.

Entities: Chemical

Keywords: Chest X-ray images; Convolutional neural networks; Transfer learning

Year: 2022 PMID: 35615621 PMCID: PMC9122742 DOI: 10.1016/j.eswa.2022.117549

Source DB: PubMed Journal: Expert Syst Appl ISSN: 0957-4174 Impact factor: 8.665

Introduction

COVID-19 is an infectious disease caused by the Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) (Khan et al., 2021). It became a global pandemic less than four months after its first detection in December 2019 in Wuhan, China (Monshi et al., 2021). As of February 2022, over 434 million confirmed cases and almost 6 million deaths had been reported to the World Health Organization (World Health Organization, 2022). Early detection of positive COVID-19 cases is critical for avoiding the virus’s spread. The most common technique for diagnosing COVID-19 is the reverse transcription-polymerase chain reaction (RT-PCR). It detects SARS-CoV-2 in respiratory specimens collected with nasopharyngeal or oropharyngeal swabs. However, RT-PCR testing is expensive, time-consuming, and shows poor sensitivity (Monshi et al., 2021, Mostafiz et al., 2020), especially in the first days of exposure to the virus (Long et al., 2020). Up to 54% of COVID-19 patients may have an initial negative RT-PCR result (Arevalo-Rodriguez et al., 2020). Patients that receive a false negative diagnosis may contact and infect other people before they are tested again. Therefore, it is important to have alternative methods to detect the disease, such as Chest X-ray (CXR) images. CXR equipment is widely available in hospitals and CXR images are cheap and fast to acquire. They can be inspected by radiologists to find visual indicators of the virus (Feng et al., 2020). In the past decade, the rise of deep learning methods (Goodfellow et al., 2016, LeCun et al., 2015, Schmidhuber, 2015), especially convolutional neural networks (CNNs), was responsible for many advances in automatic image classification (Krizhevsky et al., 2012). CXR diagnosis using deep learning methods is a mechanism that can be explored to overcome the limitations of RT-PCR: insufficient test kits, waiting time for test results, and test costs (Mostafiz et al., 2020).
Many studies concerning the application of CNNs to COVID-19 diagnosis on CXR images have been published since last year (Abbas et al., 2021, Alawad et al., 2021, Chhikara et al., 2021, Heidari et al., 2020, Hira et al., 2021, Ismael and Şengür, 2021, Jia et al., 2021, Karthik et al., 2021, Khan et al., 2021, Mohammad Shorfuzzaman, 2020, Monshi et al., 2021, Mostafiz et al., 2020, Narin et al., 2021, Nigam et al., 2021). However, most of them used relatively small and more homogeneous datasets. In this paper, the COVIDx8B dataset (Zhao et al., 2021) is used. It has 16,352 CXR images, of which 2,358 are COVID-19 positive and the remaining are from both healthy and pneumonia patients. Released in March 2021, this dataset is composed of images from six other open-source chest radiography datasets. Therefore it is larger and more heterogeneous than earlier available datasets. However, only a few works have used this dataset so far (Dominik, 2021, Pavlova et al., 2021, Zhao et al., 2021). A recent survey on applications of artificial intelligence in the COVID-19 pandemic (Khan et al., 2021) reviewed dozens of papers, including papers on CNNs applied to CXR images, and all of them used earlier available datasets which are smaller than COVIDx8B. In this paper, a comparison of different CNN models applied to the COVIDx8B dataset is presented, including popular architectures such as VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016a), DenseNet (Huang et al., 2017), and EfficientNet (Tan & Le, 2019). They were all trained under the same conditions with the training and test subsets defined by the dataset authors. The initial weights of all models were set to those pre-trained on the ImageNet dataset (Russakovsky et al., 2015), which is commonly used in transfer learning scenarios (Oquab et al., 2014). The accuracy, sensitivity (TPR), precision (PPV), and F1 score were evaluated using the test subset.
Later, the continuous outputs of some models (before the classification layer) were combined in ensembles to overcome individual limitations and provide better classification results. The remainder of this paper is organized as follows. Section 2 shows related work, in which CNNs were used to detect COVID-19 on CXR images. Section 3 presents the COVIDx8B dataset. Section 4 shows the CNN architectures employed in this paper. Section 5 shows the computer simulations comparing the proposed models and other recent approaches from the literature for COVID-19 classification on CXR images using the same dataset. Section 6 shows the computer simulations with CNN ensembles, improving the classification performance of individual models. Finally, the conclusions are drawn in Section 7.

Related work

Many studies have investigated the use of machine learning techniques to detect COVID-19. Many of the researchers used CNN techniques and CXR images and faced challenges due to the lack of available datasets (Alawad et al., 2021). While many authors provided tables comparing results achieved in different works, the comparisons are not fair, since the datasets used are frequently different and pose different levels of challenge. Therefore, the related works are described here with a focus on which architectures have been used to handle the problem of COVID-19 detection on CXR images and on the size of the evaluated datasets. Nigam et al. (2021) used VGG16, DenseNet121, Xception, NASNet, and EfficientNet on a dataset with 16,634 images. Though this dataset is slightly larger than COVIDx8B, unfortunately, the authors did not make it publicly available. The highest accuracy was 93.48%, obtained with EfficientNetB7. Ismael and Şengür (2021) used ResNet18, ResNet50, ResNet101, VGG16, and VGG19 for deep feature extraction and support vector machines (SVM) for CXR image classification. The highest accuracy was 94.7%, obtained with ResNet50. However, they used a small dataset with only 380 CXR images. Abbas et al. (2021) validated a deep CNN called Decompose, Transfer, and Compose (DeTraC) for COVID-19 CXR image classification with 93.1% accuracy. They used a combination of two small datasets, totaling 196 images. Hira et al. (2021) used the AlexNet, GoogleNet, ResNet-50, Se-ResNet-50, DenseNet121, Inception V4, Inception ResNet V2, ResNeXt-50, and Se-ResNeXt-50 architectures. Se-ResNeXt-50 achieved the highest classification accuracy of 99.32%. They used a combination of four datasets, totaling 8,830 CXR images. Alawad et al. (2021) used VGG16 both as a stand-alone classifier and as a feature extractor for SVM, Random-Forests (RF), and Extreme-Gradient-Boosting (XGBoost) classifiers. The VGG16 and VGG16+SVM models provided the best performance with 99.82% accuracy.
They used a combination of five datasets, totaling 7,329 CXR images. Narin et al. (2021) used ResNet50, ResNet101, ResNet152, InceptionV3, and Inception-ResNetV2. ResNet50 achieved the highest classification performance with 96.1%, 99.5%, and 99.7% accuracy on three different datasets, totaling 7,406 CXR images. Monshi et al. (2021) focused on data augmentation and CNN hyperparameter optimization, increasing VGG19 and ResNet50 accuracy. They also proposed CovidXrayNet, a model based on EfficientNet-B0, which achieved an accuracy of 95.82% on an earlier version of the COVIDx dataset with 15,496 CXR images. Heidari et al. (2020) focused on preprocessing algorithms to improve the performance of VGG16. They used a dataset with 8,474 CXR images and reached 94.5% accuracy. Jia et al. (2021) proposed a modified MobileNet to classify CXR and CT images. They applied their method to a CXR dataset with 7,592 CXR images and achieved 99.3% accuracy. They also applied it to an earlier version of COVIDx with 13,975 CXR images, achieving 95.0% accuracy. Karthik et al. (2021) proposed a custom CNN architecture which they called Channel-Shuffled Dual-Branched (CSDB). They achieved an accuracy of 99.80% on a combination of seven datasets, totaling 15,265 images. Mostafiz et al. (2020) used a hybridization of CNN (ResNet50) and discrete wavelet transform (DWT) features. The random forest-based bagging approach was used for classification. They combined different datasets and used data augmentation techniques to produce a total of 4,809 CXR images and achieved 98.5% accuracy. Mohammad Shorfuzzaman (2020) used VGG16, ResNet50V2, Xception, MobileNet, and DenseNet121 in a transfer learning scenario. They collected CXR images from different sources to compose a dataset with images. The best accuracy (98.15%) was achieved with ResNet50V2.
They also made an ensemble of the four best models (ResNet50V2, Xception, MobileNet, and DenseNet121) with the final output obtained by majority voting, raising the accuracy to 99.26%. Chhikara et al. (2021) proposed an InceptionV3-based model and applied it to three different datasets with 11,244, 8,246, and 14,486 CXR images, respectively. The model reached an accuracy of 97.7%, 84.95%, and 97.03% on the mentioned datasets, respectively. Pavlova et al. (2021) proposed the COVIDx8B dataset, which they claim is the largest and most diverse COVID-19 CXR dataset in open access form, and the COVID-Net CXR-2 model, a CNN specially tailored for COVID-19 detection on CXR images using machine-driven design, which achieved an accuracy of 95.5%. Zhao et al. (2021) used ResNet50V2 to classify the COVIDx8B dataset with an accuracy of 96.5% in the best scenario. Dominik (2021) proposed a lightweight architecture called BaseNet and achieved an accuracy of 95.50% on COVIDx8B. He also used an ensemble composed of BaseNet, VGG16, VGG19, ResNet50, DenseNet121, and Xception to achieve 97.75% accuracy. It was further increased to 99.25% using an optimal classification threshold.

Dataset

Most of the early research regarding COVID-19 detection on CXR images suffered from the lack of available datasets (Alawad et al., 2021). The authors would frequently combine different smaller datasets, so fairly comparing the results was impossible. COVIDx8B is a large and heterogeneous COVID-19 CXR benchmark dataset with 16,352 CXR images coming from patients of at least 51 countries (Pavlova et al., 2021). It is constructed with images extracted from six open-source chest radiography datasets, which are shown in Table 1. Notice that the sum of the images in the source datasets is much larger than the size of COVIDx8B since not all of them were selected by the authors. Example images from the COVIDx8B dataset are shown in Fig. 1.

Table 1

List of datasets that compose the COVIDx8B benchmark dataset.

Source dataset	Size	Reference
Covid-chestxray-dataset	950	Cohen et al. (2020)
COVID-19 Chest X-ray Dataset Initiative	55	Chung (2020b)
Actualmed COVID-19 Chest X-ray Dataset Initiative	238	Chung (2020a)
COVID-19 Radiography Database-Version 3	21,165	Chowdhury et al., 2020, Rahman et al., 2021
RSNA Pneumonia Detection Challenge	29,684	Wang et al. (2017)
RSNA International COVID-19 Open Radiology Database (RICORD)	1,257	Tsai et al. (2021)

Fig. 1

Examples of CXR images from the COVIDx8B dataset. The first row shows COVID-19 negative patient cases, and the second row shows COVID-19 positive patient cases.

Though COVIDx8B does not include information on patients’ demographics, half of its source datasets do. The covid-chestxray-dataset has registers from male patients and registers from female patients. The average age is years old. The COVID-19 positive registers are from male and female patients, with an average age of years old. The Fig. 1 COVID-19 Chest X-ray Dataset Initiative has only registers, most of them do not indicate sex. Among the remaining, there are male patients and female patients. Only patients have their exact age registered and the average is years old. All patients with the exact age described are COVID-19 positive or unlabeled. The RSNA International COVID-19 Open Radiology Database (RICORD) only has COVID-19 positive cases. They come from male and female patients, with an average age of years old. Four of the source datasets have both COVID-19 positive and negative cases. The RSNA Pneumonia Detection Challenge has only COVID-19 negative cases (non-COVID pneumonia, normal, etc.) and the RSNA International COVID-19 Open Radiology Database (RICORD) has only COVID-19 positive cases. The COVIDx8B training subset is composed of 15,952 images, of which 2,158 are COVID-19 positive and 13,794 are COVID-19 negative. The negative group also includes images of patients with non-COVID-19 pneumonia, which poses a major challenge as it is usually difficult to distinguish between COVID-19 and non-COVID-19 pneumonia. The test subset has 200 COVID-19 positive images from different patients and 200 COVID-19 negative images. In the negative group, images are from healthy patients. The other images are from non-COVID pneumonia patients. The test images were randomly selected from international patient groups curated by the Radiological Society of North America (RSNA) (Tsai et al., 2021, Wang et al., 2017).
The images were annotated by an international group of scientists and radiologists from different institutes around the world. The test set was selected in such a way to ensure no patient overlap between training and test sets (Pavlova et al., 2021).

CNN architectures

This section presents the CNN architectures explored in this work. It also describes the layers added to complete the models and perform the CXR images classification. Table 2 shows the tested architectures, some of their characteristics, and their respective literature references.

Table 2

CNN architectures, some of their characteristics, and their references.

Model	Input Image Resolution	Output of Last Conv. Layer	Trainable Parameters	Reference
DenseNet121	224 × 224	7×7×1024	7,216,770	Huang et al. (2017)
DenseNet169	224 × 224	7×7×1664	12,911,234	Huang et al. (2017)
DenseNet201	224 × 224	7×7×1920	18,585,218	Huang et al. (2017)
EfficientNetB0	224 × 224	7×7×1280	4,335,998	Tan and Le (2019)
EfficientNetB1	240 × 240	8×8×1280	6,841,634	Tan and Le (2019)
EfficientNetB2	260 × 260	9×9×1408	8,062,212	Tan and Le (2019)
EfficientNetB3	300 × 300	10×10×1536	11,090,218	Tan and Le (2019)
InceptionResNetV2	299 × 299	8×8×1536	54,670,178	Szegedy et al. (2017)
InceptionV3	299 × 299	8×8×2048	22,293,410	Szegedy et al. (2016)
MobileNet	224 × 224	7×7×1024	3,469,890	Howard et al. (2017)
MobileNetV2	224 × 224	7×7×1280	2,552,322	Sandler et al. (2018)
NASNetMobile	224 × 224	7×7×1056	4,504,084	Zoph et al. (2018)
ResNet101	224 × 224	7×7×2048	43,077,890	He et al. (2016a)
ResNet101V2	224 × 224	7×7×2048	43,053,954	He et al. (2016b)
ResNet152	224 × 224	7×7×2048	58,744,578	He et al. (2016a)
ResNet152V2	224 × 224	7×7×2048	58,712,962	He et al. (2016b)
ResNet50	224 × 224	7×7×2048	24,059,650	He et al. (2016a)
ResNet50V2	224 × 224	7×7×2048	24,044,418	He et al. (2016b)
VGG16	224 × 224	7×7×512	14,846,530	Simonyan and Zisserman (2015)
VGG19	224 × 224	7×7×512	20,156,226	Simonyan and Zisserman (2015)
Xception	299 × 299	10×10×2048	21,332,010	Chollet (2017)

The output of the last convolutional layer of the original CNN is fed to a global average pooling layer. It is followed by a dense layer of neurons using the ReLU (Rectified Linear Unit) activation function, a dropout layer with a 20% rate, and a softmax classification layer. This proposed architecture is illustrated in Fig. 2, which is annotated with the horizontal and vertical input size of the CNN (image size) and with the three dimensions of the CNN output in its last convolutional layer. These values depend on the original CNN architecture and they are indicated in Table 2. The table also shows the number of trainable parameters in each CNN architecture, including both their original layers and the dense layers added for COVID-19 classification.

Fig. 2

The proposed CNN Transfer Learning architecture.
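
The parameter cost of the layers added on top of each CNN can be checked with a few lines of arithmetic. The sketch below is only an illustration under two assumptions that do not come from the text: a hidden width of 256 neurons for the added dense layer, and `VGG16_BASE_PARAMS` as the parameter count of the Keras VGG16 convolutional base without its top layers. Under these assumptions the total agrees with the VGG16 row of Table 2.

```python
def head_parameter_count(channels, hidden_units, num_classes=2):
    """Trainable parameters of the layers added after the CNN backbone.

    Global average pooling and dropout have no trainable parameters, so only
    the dense layer and the softmax classification layer contribute.
    """
    dense = channels * hidden_units + hidden_units          # weights + biases
    classifier = hidden_units * num_classes + num_classes   # weights + biases
    return dense + classifier


# Assumption: parameter count of the Keras VGG16 convolutional base (no top).
VGG16_BASE_PARAMS = 14_714_688
# Assumption: hidden width of the added dense layer (hypothetical value).
HIDDEN_UNITS = 256

# VGG16 outputs 7x7x512, so global average pooling yields 512 features.
vgg16_head = head_parameter_count(channels=512, hidden_units=HIDDEN_UNITS)
vgg16_total = VGG16_BASE_PARAMS + vgg16_head

print(vgg16_head)   # parameters in the added layers
print(vgg16_total)  # compare with the VGG16 row of Table 2
```

The same function can be applied to any row of Table 2 by passing the last dimension of the "Output of Last Conv. Layer" column as `channels`.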


CNN comparison

In this section, the computer simulations comparing the CNN models applied to the COVIDx8B dataset are presented. All simulations were performed using Python and TensorFlow on three desktop computers with NVIDIA GeForce GPU boards: GTX 970, GTX 1080, and RTX 2060 SUPER, respectively. No pre-processing was applied, except for those steps pre-defined by each CNN architecture, which basically consist of resizing the image to the CNN input size and normalizing the input range. In all tested scenarios, each CNN had its weights initially set to those pre-trained on the ImageNet dataset (Russakovsky et al., 2015), which has millions of images and hundreds of classes. This dataset is frequently used in transfer learning scenarios. The training phase was conducted using the Adam optimizer (Kingma & Ba, 2014). The learning rate was set to $10^{-5}$ for the original CNN layers and $10^{-3}$ for the dense layers proposed in this work. The idea is to allow bigger weight changes in the classification layers, which need to be trained from scratch, while only fine-tuning the CNN layers, taking advantage of the weights previously learned from the ImageNet dataset. From the training subset, 20% of the images are randomly taken to compose the validation subset, using stratification to keep the same class proportions. Since the training subset is unbalanced, different class weights were defined for each class: 0.5782 and 3.6960 for negative and positive classes, respectively. These values were calculated based on the TensorFlow documentation: $w_c = \frac{1}{n_c} \cdot \frac{N}{2}$, where $w_c$ is the class weight, $n_c$ is the amount of examples belonging to class $c$, and $N$ is the total amount of examples. All models were trained for up to epochs. An early stopping criterion was set to interrupt the training phase if the loss on the validation set did not decrease during the last epochs. The final weights are always restored to those adjusted in the epoch that achieved the lowest validation loss.
For each CNN model, the training phase was performed five times with different training/validation splits, generating five instances with different adjusted weights. The same five training/validation splits were used for all models. Each instance was evaluated on the test subset and the following measures were obtained: accuracy (ACC), sensitivity (TPR), precision (PPV), and F1 score. The results are shown in Table 3. Each value is the average of the measures obtained on the five different instances of each model. The standard deviation is also presented. Results of the same evaluation applied to the training and validation subsets are available in Appendix.
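
The two class weights quoted in this section can be reproduced from the training subset counts given in Section 3. The sketch below assumes the balanced weighting in which each class weight equals the total number of examples divided by the number of classes times the class count; the function name and the generalization to an arbitrary number of classes are illustrative choices, not part of the original experiments.

```python
def class_weights(class_counts):
    """Balanced class weights: total / (number_of_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training subset counts reported for COVIDx8B.
train_counts = {"negative": 13_794, "positive": 2_158}

weights = class_weights(train_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

A dictionary of this kind, keyed by integer class index instead of label, is the form usually passed to the training routine so that errors on the minority class weigh more in the loss.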

Table 3

Comparison of 21 different CNN models applied to the COVIDx8B dataset. Each model is executed five times. The highest values for each measure are highlighted in bold.

Model	ACC		TPR		PPV		F1
	Mean	S.D.	Mean	S.D.	Mean	S.D.	Mean	S.D.
DenseNet169	0.9815	0.0056	0.9700	0.0138	0.9930	0.0075	0.9812	0.0058
EfficientNetB2	0.9760	0.0049	0.9600	0.0141	0.9918	0.0051	0.9756	0.0052
InceptionResNetV2	0.9755	0.0099	0.9590	0.0246	0.9919	0.0051	0.9749	0.0106
InceptionV3	0.9750	0.0065	0.9520	0.0144	0.9979	0.0041	0.9744	0.0069
MobileNet	0.9710	0.0060	0.9430	0.0136	0.9990	0.0021	0.9701	0.0064
EfficientNetB0	0.9705	0.0051	0.9510	0.0086	0.9896	0.0033	0.9699	0.0053
EfficientNetB3	0.9700	0.0163	0.9470	0.0337	0.9927	0.0051	0.9690	0.0177
DenseNet201	0.9695	0.0176	0.9400	0.0342	0.9989	0.0022	0.9683	0.0186
ResNet152V2	0.9695	0.0244	0.9420	0.0510	0.9970	0.0040	0.9679	0.0268
ResNet152	0.9660	0.0223	0.9370	0.0443	0.9947	0.0033	0.9644	0.0243
DenseNet121	0.9630	0.0053	0.9270	0.0103	0.9989	0.0022	0.9616	0.0057
Xception	0.9615	0.0077	0.9230	0.0154	1.0000	0.0000	0.9599	0.0083
VGG19	0.9580	0.0198	0.9170	0.0385	0.9989	0.0023	0.9558	0.0216
EfficientNetB1	0.9570	0.0224	0.9240	0.0413	0.9892	0.0075	0.9551	0.0242
ResNet50	0.9545	0.0172	0.9090	0.0344	1.0000	0.0000	0.9520	0.0192
VGG16	0.9525	0.0123	0.9090	0.0282	0.9958	0.0052	0.9501	0.0138
ResNet101V2	0.9530	0.0302	0.9100	0.0643	0.9959	0.0050	0.9497	0.0342
MobileNetV2	0.9485	0.0172	0.9030	0.0359	0.9935	0.0019	0.9457	0.0190
ResNet101	0.9410	0.0170	0.8830	0.0333	0.9988	0.0023	0.9370	0.0190
ResNet50V2	0.9280	0.0075	0.8590	0.0153	0.9966	0.0046	0.9226	0.0087
NASNetMobile	0.8530	0.0653	0.7090	0.1317	0.9960	0.0034	0.8212	0.0918

Average	0.9569	0.0162	0.9178	0.0334	0.9957	0.0036	0.9536	0.0187
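
The four measures reported in Table 3 all follow from the confusion matrix of a binary classifier. The sketch below shows the standard definitions; the counts in the example are hypothetical and chosen only for illustration, they are not taken from any instance evaluated in this paper.

```python
def classification_measures(tp, fn, fp, tn):
    """Accuracy, sensitivity (TPR), precision (PPV), and F1 score.

    tp/fn count the COVID-19 positive images classified correctly/incorrectly,
    fp/tn count the negative images classified incorrectly/correctly.
    Assumes at least one positive image and one positive prediction.
    """
    acc = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return {"ACC": acc, "TPR": tpr, "PPV": ppv, "F1": f1}


# Hypothetical confusion matrix on a balanced test set of 400 images.
measures = classification_measures(tp=194, fn=6, fp=2, tn=198)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Because PPV depends only on false positives, a model can reach a PPV of 1.0 while still missing many positive cases, which is why TPR and F1 separate the models in Table 3 more clearly than PPV does.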

It is worth noticing that most related work only shows the results of a single execution on each tested CNN architecture. This may lead to wrong conclusions as there is always some expected variance on multiple executions of neural networks, which are stochastic by nature. DenseNet169 achieved the highest accuracy (98.15%), TPR (97.00%), and F1 score (98.12%) among all the tested models. The highest PPV (100%) was achieved by the Xception and ResNet50 models. EfficientNetB2 achieved the second-best accuracy, TPR, and F1 score. Compared to other recent approaches applied to the same dataset, DenseNet169, EfficientNetB2, and InceptionResNetV2 achieved the best accuracy, TPR, and F1 score, as shown in Table 4. It is worth noticing that EfficientNetB2 has fewer trainable parameters (8.06 million) than all the other architectures in this comparison, including the Covid-Net CXR-2 (9.2 million), which was specially tailored for the COVIDx8B dataset.

Table 4

Model	ACC	TPR	PPV	F1	Source
DenseNet169	0.9815	0.9700	0.9930	0.9812	this paper
EfficientNetB2	0.9760	0.9600	0.9918	0.9756	this paper
InceptionResNetV2	0.9755	0.9590	0.9919	0.9749	this paper
InceptionV3	0.9750	0.9520	0.9979	0.9744	this paper
VGG16 (ImageNet)	0.9750	0.9500	1.0000	0.9744	Dominik (2021)
Covid-Net	0.9400	0.9350	1.0000	0.9664	Pavlova et al. (2021)
DenseNet121 (ChestXray)	0.9650	0.9350	0.9947	0.9639	Dominik (2021)
ResNet50V2 (Bit-M)	0.9650	0.9300	1.0000	0.9637	Zhao et al. (2021)
Covid-Net CXR-2	0.9630	0.9550	0.9700	0.9624	Pavlova et al. (2021)
VGG19 (ImageNet)	0.9625	0.9250	1.0000	0.9610	Dominik (2021)
ResNet-50 (ImageNet)	0.9575	0.9200	0.9946	0.9558	Dominik (2021)
DenseNet121 (ImageNet)	0.9575	0.9150	1.0000	0.9556	Dominik (2021)
Xception (ImageNet)	0.9550	0.9100	1.0000	0.9529	Dominik (2021)
ResNet50V2 (Bit-S)	0.9480	0.8950	1.0000	0.9446	Zhao et al. (2021)
ResNet50V2 (Random)	0.9280	0.8550	1.0000	0.9218	Zhao et al. (2021)
ResNet50	0.9050	0.8850	0.9220	0.9031	Pavlova et al. (2021)

There are some common characteristics among the two best-performing CNN architectures. DenseNet and EfficientNet are newer approaches (2017 and 2019, respectively) than VGG (2015) and ResNet (2016). DenseNet and EfficientNet also focus on architecture efficiency to use fewer trainable parameters than the earlier approaches. In this case, the strategy used in these newer models was more suitable for these types of CXR images. Unfortunately, many related works compared fewer and/or earlier models only. Therefore, future studies should consider a wider variety of models to verify whether this tendency is confirmed. In particular, from the related work section, only Nigam et al. (2021) and Monshi et al. (2021) explored EfficientNet, but they also reported good results with it, showing this is a promising architecture for CXR images. Table 5 compares results reported in individual papers described in Section 2, where the authors are motivated to use a setup such that their algorithm is the best performing, with the best result found in this paper for an individual CNN architecture, in which there is no motivation to implement optimizations to boost any particular architecture. Despite that, the best result from this paper is still in the top half of the best accuracy ranking. For each paper, the CNN architecture used and the dataset size are provided for reference.

Table 5

Comparison of different CNN-based models applied to different COVID-19 datasets found in individual papers and the best result by an individual CNN model applied to the COVIDx8B dataset in this paper.

Reference	Architecture	Dataset Size	Accuracy
Alawad et al. (2021)	VGG16	7,329	99.82%
Karthik et al. (2021)	CSDB	15,265	99.80%
Hira et al. (2021)	Se-ResNeXt-50	8,830	99.32%
Jia et al. (2021)	MobileNet	7,592	99.30%
Mostafiz et al. (2020)	ResNet50	4,809	98.50%
Narin et al. (2021)	ResNet50	7,406	98.43%
this paper	DenseNet169	16,352	98.15%
Chhikara et al. (2021)	InceptionV3	11,244	97.70%
Chhikara et al. (2021)	InceptionV3	14,486	97.03%
Monshi et al. (2021)	EfficientNetB0	15,496	95.82%
Jia et al. (2021)	MobileNet	13,975	95.00%
Ismael and Şengür (2021)	ResNet50	380	94.70%
Heidari et al. (2020)	VGG16	8,474	94.50%
Nigam et al. (2021)	EfficientNetB7	16,634	93.48%
Abbas et al. (2021)	DeTraC	196	93.10%
Chhikara et al. (2021)	InceptionV3	8,246	84.95%

Table 4: Comparison of the best four models tested in this paper (in italic) with other recently proposed models applied to the COVIDx8B dataset. The highest values for each measure are highlighted in bold. The results obtained by other authors were compiled from the respective cited references.

CNN ensembles

This section presents the computer simulations with ensembles of different CNN models and ensembles of multiple instances of the same model. All the ensembles experiments used the output of the last dense layer, just before the softmax activation function. Therefore, for each image, each model will output two continuous values, which can be interpreted as the probability of each class. Then, the output of the ensemble will be the average of its members’ output. The same weights trained for the experiments in Section 5 were used for the experiments in this section. In the first ensemble experiment, the two models that achieved the best individual F1 score (DenseNet169 and EfficientNetB2) were combined in the first ensemble configuration. The second ensemble configuration adds the third-best model (InceptionResNetV2). The third ensemble configuration adds the fourth-best model (InceptionV3) and so on, with up to seven models. Then, in the last ensemble configuration, all the models are combined. In this first experiment, only one instance of each model composes each ensemble, thus there are five ensembles for each configuration. Table 6 shows the average and standard deviation of the measures obtained for each ensemble configuration.

Table 6

Ensembles of CNN models applied to the COVIDx8B dataset. Each ensemble configuration is executed five times with different instances of the models. The highest values for each measure are highlighted in bold.

Models	ACC		TPR		PPV		F1
	Mean	S.D.	Mean	S.D.	Mean	S.D.	Mean	S.D.
Top 2 models	0.9855	0.0024	0.9730	0.0040	0.9980	0.0025	0.9853	0.0025
Top 3 models	0.9885	0.0034	0.9770	0.0068	1.0000	0.0000	0.9884	0.0035
Top 4 models	0.9870	0.0019	0.9740	0.0037	1.0000	0.0000	0.9868	0.0019
Top 5 models	0.9865	0.0020	0.9730	0.0040	1.0000	0.0000	0.9863	0.0021
Top 6 models	0.9880	0.0010	0.9760	0.0020	1.0000	0.0000	0.9879	0.0010
Top 7 models	0.9865	0.0025	0.9730	0.0051	1.0000	0.0000	0.9863	0.0026
All models	0.9775	0.0032	0.9550	0.0063	1.0000	0.0000	0.9770	0.0033
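
The averaging rule used by these ensembles can be sketched in a few lines. The function names and the member outputs below are hypothetical; the sketch only assumes that every member produces two continuous values per image, ordered as (negative, positive), and that the class with the larger averaged value is selected.

```python
def average_outputs(member_outputs):
    """Average the per-class outputs of all ensemble members.

    member_outputs[m][i] is the (negative, positive) output of member m
    for image i; every member must score the same images in the same order.
    """
    n_members = len(member_outputs)
    averaged = []
    for per_image in zip(*member_outputs):  # outputs of all members for one image
        neg = sum(out[0] for out in per_image) / n_members
        pos = sum(out[1] for out in per_image) / n_members
        averaged.append((neg, pos))
    return averaged


def predict_labels(averaged):
    """Return 1 (COVID-19 positive) when the averaged positive output is larger."""
    return [1 if pos > neg else 0 for neg, pos in averaged]


# Hypothetical outputs of three members for two images.
member_a = [(0.45, 0.55), (0.20, 0.80)]
member_b = [(0.45, 0.55), (0.30, 0.70)]
member_c = [(0.95, 0.05), (0.10, 0.90)]

averaged = average_outputs([member_a, member_b, member_c])
labels = predict_labels(averaged)
print(labels)
```

In this example the first image is classified as negative even though two of the three members lean slightly positive, because the third member is far more confident. This is the main practical difference between averaging continuous outputs and majority voting.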

The best accuracy, TPR, and F1 score were achieved when the three best models were combined (DenseNet169, EfficientNetB2, and InceptionResNetV2). All the ensembles, except for the one with the best two models, achieved a PPV of 100%. Except for the ensemble of all models, all the other ensembles achieved higher accuracy, TPR, and F1 scores than the best individual model. For the second ensemble experiment, the five instances of each model are combined to form an ensemble. It is expected that five instances, even if they are from the same model, will improve the measures by alleviating the randomness effects of the training. Table 7 shows the measures obtained with these ensembles for each model. It also shows the gain obtained by the ensemble when compared to the average of the single instances. All models benefited from the ensembles. The highest measures were obtained by DenseNet169, with an F1 score of 99.24% and an accuracy of 99.25%. This is the same accuracy obtained by Dominik (2021) using an ensemble of multiple models and an optimized threshold. To the best of my knowledge, this is the highest accuracy achieved on this dataset at the time this paper is being written.

Table 7

Models	ACC		TPR		PPV		F1
	Mean	Gain	Mean	Gain	Mean	Gain	Mean	Gain
DenseNet169	0.9925	1.12%	0.9850	1.55%	1.0000	0.70%	0.9924	1.14%
EfficientNetB2	0.9850	0.92%	0.9750	1.56%	0.9949	0.31%	0.9848	0.94%
InceptionResNetV2	0.9875	1.23%	0.9750	1.67%	1.0000	0.82%	0.9873	1.27%
InceptionV3	0.9800	0.51%	0.9600	0.84%	1.0000	0.21%	0.9796	0.53%
MobileNet	0.9825	1.18%	0.9650	2.33%	1.0000	0.10%	0.9822	1.25%
EfficientNetB0	0.9750	0.46%	0.9600	0.95%	0.9897	0.01%	0.9746	0.48%
EfficientNetB3	0.9850	1.55%	0.9750	2.96%	0.9949	0.22%	0.9848	1.63%
DenseNet201	0.9825	1.34%	0.9650	2.66%	1.0000	0.11%	0.9822	1.44%
ResNet152V2	0.9900	2.11%	0.9800	4.03%	1.0000	0.30%	0.9899	2.27%
ResNet152	0.9800	1.45%	0.9650	2.99%	0.9948	0.01%	0.9797	1.59%
DenseNet121	0.9725	0.99%	0.9450	1.94%	1.0000	0.11%	0.9717	1.05%
Xception	0.9625	0.10%	0.9250	0.22%	1.0000	0.00%	0.9610	0.11%
VGG19	0.9700	1.25%	0.9400	2.51%	1.0000	0.11%	0.9691	1.39%
EfficientNetB1	0.9725	1.62%	0.9500	2.81%	0.9948	0.57%	0.9719	1.76%
ResNet50	0.9650	1.10%	0.9300	2.31%	1.0000	0.00%	0.9637	1.23%
VGG16	0.9550	0.26%	0.9100	0.11%	1.0000	0.42%	0.9529	0.29%
ResNet101V2	0.9650	1.26%	0.9300	2.20%	1.0000	0.41%	0.9637	1.47%
MobileNetV2	0.9650	1.74%	0.9350	3.54%	0.9947	0.12%	0.9639	1.92%
ResNet101	0.9575	1.75%	0.9150	3.62%	1.0000	0.12%	0.9556	1.99%
ResNet50V2	0.9350	0.75%	0.8700	1.28%	1.0000	0.34%	0.9305	0.86%
NASNetMobile	0.8750	2.58%	0.7500	5.78%	1.0000	0.40%	0.8571	4.37%

Average	0.9683	1.20%	0.9383	2.28%	0.9983	0.26%	0.9666	1.38%
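
The text does not spell out how the gain columns are computed. The values in Table 7 are consistent with a relative gain, that is, the ensemble measure divided by the mean of the single instances, minus one. The sketch below checks this reading on the DenseNet169 row; the function name is hypothetical and the relative-gain interpretation is an assumption inferred from the reported numbers.

```python
def relative_gain(ensemble_value, single_mean):
    """Relative improvement of the ensemble over the single-instance mean, in percent."""
    return (ensemble_value / single_mean - 1.0) * 100.0


# DenseNet169: mean of five single instances (Table 3) and ensemble (Table 7).
single = {"ACC": 0.9815, "TPR": 0.9700, "PPV": 0.9930, "F1": 0.9812}
ensemble = {"ACC": 0.9925, "TPR": 0.9850, "PPV": 1.0000, "F1": 0.9924}

gains = {
    name: round(relative_gain(ensemble[name], single[name]), 2)
    for name in single
}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```

Under this reading, a model with a low single-instance mean, such as NASNetMobile, shows a larger percentage gain for the same absolute improvement.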

For the third and last ensembles experiment, the first experiment is repeated, but now using all the five instances of each model in the ensemble. Table 8 shows the measures obtained with each ensemble and the gain obtained by these ensembles when compared to the ensembles which used only a single instance of each model. In this case, there were only small differences and some of them were negative. Therefore, the best ensemble overall is still the one with multiple instances of DenseNet169.

Table 8

Ensembles of CNN models applied to the COVIDx8B dataset. Each ensemble configuration has five instances of each participant model. The highest values for each measure and the highest gains in comparison with the ensembles of single instances for each model are highlighted in bold.

Models	ACC Mean	ACC Gain	TPR Mean	TPR Gain	PPV Mean	PPV Gain	F1 Mean	F1 Gain
Top 2 models	0.9850	−0.05%	0.9700	−0.31%	1.0000	0.20%	0.9848	−0.05%
Top 3 models	0.9875	−0.10%	0.9750	−0.20%	1.0000	0.00%	0.9873	−0.11%
Top 4 models	0.9875	0.05%	0.9750	0.10%	1.0000	0.00%	0.9873	0.05%
Top 5 models	0.9875	0.10%	0.9750	0.21%	1.0000	0.00%	0.9873	0.10%
Top 6 models	0.9875	−0.05%	0.9750	−0.10%	1.0000	0.00%	0.9873	−0.06%
Top 7 models	0.9875	0.10%	0.9750	0.21%	1.0000	0.00%	0.9873	0.10%
All models	0.9775	0.00%	0.9550	0.00%	1.0000	0.00%	0.9770	0.00%

Average	0.9857	0.01%	0.9714	−0.01%	1.0000	0.03%	0.9855	0.00%


Conclusions

In this paper, different CNN architectures were applied to the detection of COVID-19 in CXR images. The comparison was performed using COVIDx8B, a large and heterogeneous COVID-19 CXR image dataset composed of six open-source CXR datasets. To get more reliable results, the training was repeated five times for each model with different training and validation splits, whereas most related works tested fewer models and performed only a single execution of each one. CNN ensembles were also explored, combining both different models and multiple instances of the same model. DenseNet169 achieved the best accuracy and F1 score, both as a single instance and as an ensemble of five instances. The classification accuracies were 98.15% and 99.25% for the single instance and the ensemble, respectively, and the corresponding F1 scores were 98.12% and 99.24%. These results are better than those achieved in recent works that used the same dataset. The simulations performed for this paper add more evidence of the efficacy of CNNs in detecting COVID-19 in CXR images, which is important for assisting quick diagnosis and limiting the spread of the disease. Moreover, these experiments may guide future research, as they tested a large number of CNN architectures and identified which of them produce the best results for this particular task.

CRediT authorship contribution statement

Fabricio Aparecido Breve: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table A.9

Classification accuracy (ACC) achieved by the CNN architectures when applied to the train, validation, and test subsets individually.

Model	Train	Validation	Test
DenseNet169	0.9951	0.9794	0.9815
EfficientNetB2	0.9936	0.9793	0.9760
InceptionResNetV2	0.9835	0.9681	0.9755
InceptionV3	0.9960	0.9784	0.9750
MobileNet	0.9936	0.9788	0.9710
EfficientNetB0	0.9894	0.9761	0.9705
EfficientNetB3	0.9948	0.9803	0.9700
ResNet152V2	0.9945	0.9757	0.9695
DenseNet201	0.9971	0.9816	0.9695
ResNet152	0.9923	0.9783	0.9660
DenseNet121	0.9962	0.9806	0.9630
Xception	0.9909	0.9777	0.9615
VGG19	0.9922	0.9804	0.9580
EfficientNetB1	0.9802	0.9697	0.9570
ResNet50	0.9955	0.9806	0.9545
ResNet101V2	0.9909	0.9707	0.9530
VGG16	0.9913	0.9772	0.9525
MobileNetV2	0.9987	0.9808	0.9485
ResNet101	0.9923	0.9803	0.9410
ResNet50V2	0.9859	0.9662	0.9280
NASNetMobile	0.9798	0.9660	0.8530

Table A.10

Classification sensitivity (TPR) achieved by the CNN architectures when applied to the train, validation, and test subsets individually.

Model	Train	Validation	Test
DenseNet169	0.9987	0.9611	0.9700
EfficientNetB2	0.9936	0.9662	0.9600
InceptionResNetV2	0.9830	0.9491	0.9590
InceptionV3	0.9965	0.9458	0.9520
EfficientNetB0	0.9760	0.9361	0.9510
EfficientNetB3	0.9935	0.9662	0.9470
MobileNet	0.9957	0.9440	0.9430
ResNet152V2	0.9928	0.9375	0.9420
DenseNet201	0.9964	0.9454	0.9400
ResNet152	0.9854	0.9509	0.9370
DenseNet121	0.9973	0.9421	0.9270
EfficientNetB1	0.9750	0.9505	0.9240
Xception	0.9847	0.9333	0.9230
VGG19	0.9874	0.9338	0.9170
ResNet101V2	0.9882	0.9278	0.9100
VGG16	0.9915	0.9324	0.9090
ResNet50	0.9918	0.9338	0.9090
MobileNetV2	0.9933	0.9241	0.9030
ResNet101	0.9701	0.9162	0.8830
ResNet50V2	0.9758	0.8921	0.8590
NASNetMobile	0.8874	0.8255	0.7090

Table A.11

Classification precision (PPV) achieved by the CNN architectures when applied to the train, validation, and test subsets individually.

Model	Train	Validation	Test
Xception	0.9502	0.9057	1.0000
ResNet50	0.9759	0.9242	1.0000
MobileNet	0.9587	0.9040	0.9990
VGG19	0.9566	0.9228	0.9989
DenseNet121	0.9754	0.9169	0.9989
DenseNet201	0.9822	0.9215	0.9989
ResNet101	0.9727	0.9366	0.9988
InceptionV3	0.9748	0.8998	0.9979
ResNet152V2	0.9682	0.8904	0.9970
ResNet50V2	0.9244	0.8630	0.9966
NASNetMobile	0.9602	0.9159	0.9960
ResNet101V2	0.9501	0.8701	0.9959
VGG16	0.9474	0.9044	0.9958
ResNet152	0.9598	0.8964	0.9947
MobileNetV2	0.9973	0.9337	0.9935
DenseNet169	0.9672	0.8946	0.9930
EfficientNetB3	0.9702	0.8977	0.9927
InceptionResNetV2	0.9077	0.8408	0.9919
EfficientNetB2	0.9631	0.8925	0.9918
EfficientNetB0	0.9486	0.8928	0.9896
EfficientNetB1	0.8924	0.8470	0.9892

Table A.12

Classification F1 score achieved by the CNN architectures when applied to the train, validation, and test subsets individually.

Model	Train	Validation	Test
DenseNet169	0.9825	0.9266	0.9812
EfficientNetB2	0.9774	0.9273	0.9756
InceptionResNetV2	0.9428	0.8905	0.9749
InceptionV3	0.9855	0.9221	0.9744
MobileNet	0.9768	0.9233	0.9701
EfficientNetB0	0.9619	0.9139	0.9699
EfficientNetB3	0.9815	0.9304	0.9690
DenseNet201	0.9892	0.9331	0.9683
ResNet152V2	0.9801	0.9128	0.9679
ResNet152	0.9722	0.9226	0.9644
DenseNet121	0.9862	0.9292	0.9616
Xception	0.9670	0.9190	0.9599
VGG19	0.9716	0.9281	0.9558
EfficientNetB1	0.9312	0.8954	0.9551
ResNet50	0.9837	0.9287	0.9520
VGG16	0.9688	0.9173	0.9501
ResNet101V2	0.9679	0.8966	0.9497
MobileNetV2	0.9953	0.9287	0.9457
ResNet101	0.9714	0.9262	0.9370
ResNet50V2	0.9493	0.8773	0.9226
NASNetMobile	0.9209	0.8675	0.8212
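The four measures reported in Tables A.9 to A.12 (ACC, TPR, PPV, and F1) follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions. The counts in the usage line are hypothetical: they assume a balanced test subset of 400 images and were chosen because they reproduce the DenseNet169 ensemble values of Table 7.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)            # accuracy
    tpr = tp / (tp + fn)                             # sensitivity (recall)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0       # precision, guarded against no positive predictions
    f1 = 2 * tp / (2 * tp + fp + fn)                 # harmonic mean of PPV and TPR
    return {"ACC": acc, "TPR": tpr, "PPV": ppv, "F1": f1}

# Hypothetical counts: 200 positive and 200 negative test images,
# 3 missed positives and no false alarms.
m = binary_metrics(tp=197, fp=0, tn=200, fn=3)
print({k: round(v, 4) for k, v in m.items()})
# {'ACC': 0.9925, 'TPR': 0.985, 'PPV': 1.0, 'F1': 0.9924}
```

With these counts, a PPV of 100% means no negative image was flagged as positive, so all remaining errors are missed COVID-19 cases.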

19 in total

Review 1. Deep learning in neural networks: an overview.

Authors: Jürgen Schmidhuber
Journal: Neural Netw Date: 2014-10-13

Review 2. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

3. An automatic approach based on CNN architecture to detect Covid-19 disease from chest X-ray images.

Authors: Swati Hira; Anita Bai; Sanchit Hira
Journal: Appl Intell (Dordr) Date: 2020-11-27 Impact factor: 5.086

4. Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks.

Authors: Ali Narin; Ceren Kaya; Ziynet Pamuk
Journal: Pattern Anal Appl Date: 2021-05-09 Impact factor: 2.580

5. The RSNA International COVID-19 Open Radiology Database (RICORD).

Authors: Emily B Tsai; Scott Simpson; Matthew P Lungren; Michelle Hershman; Leonid Roshkovan; Errol Colak; Bradley J Erickson; George Shih; Anouk Stein; Jayashree Kalpathy-Cramer; Jody Shen; Mona Hafez; Susan John; Prabhakar Rajiah; Brian P Pogatchnik; John Mongan; Emre Altinmakas; Erik R Ranschaert; Felipe C Kitamura; Laurens Topff; Linda Moy; Jeffrey P Kanne; Carol C Wu
Journal: Radiology Date: 2021-01-05 Impact factor: 11.105