Dan Nguyen1,2, Fernando Kay3, Jun Tan2, Yulong Yan2, Yee Seng Ng3, Puneeth Iyengar2, Ron Peshock3, Steve Jiang1,2.
Abstract
Since the outbreak of the COVID-19 pandemic, worldwide research efforts have focused on using artificial intelligence (AI) technologies on various medical data of COVID-19-positive patients in order to identify or classify various aspects of the disease, with promising reported results. However, concerns have been raised over their generalizability, given the heterogeneous factors in training datasets. This study aims to examine the severity of this problem by evaluating deep learning (DL) classification models trained to identify COVID-19-positive patients on 3D computed tomography (CT) datasets from different countries. We collected one dataset at UT Southwestern (UTSW) and three external datasets from different countries: CC-CCII Dataset (China), COVID-CTset (Iran), and MosMedData (Russia). We divided the data into two classes: COVID-19-positive and COVID-19-negative patients. We trained nine identical DL-based classification models by using combinations of datasets with a 72% train, 8% validation, and 20% test data split. The models trained on a single dataset achieved accuracy/area under the receiver operating characteristic curve (AUC) values of 0.87/0.826 (UTSW), 0.97/0.988 (CC-CCCI), and 0.86/0.873 (COVID-CTset) when evaluated on their own dataset. The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better. However, the performance dropped close to an AUC of 0.5 (random guess) for all models when evaluated on a different dataset outside of its training datasets. Including MosMedData, which only contained positive labels, into the training datasets did not necessarily help the performance of other datasets. Multiple factors likely contributed to these results, such as patient demographics and differences in image acquisition or reconstruction, causing a data shift among different study cohorts.Entities:
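The abstract describes a per-dataset 72%/8%/20% train/validation/test split. A minimal sketch of such a patient-level split is below; the function name, seed, and the use of Python's `random` module are illustrative assumptions, not details from the paper.

```python
import random

def split_patients(patient_ids, seed=0):
    """Shuffle patient IDs and split them 72% train / 8% validation /
    20% test, as described in the abstract. The seed is an assumption."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.72 * n)
    n_val = int(0.08 * n)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_patients(range(100))
# len(train), len(val), len(test) == 72, 8, 20
```

Splitting at the patient level (rather than per scan) avoids leaking scans from the same patient across the train and test sets.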
Keywords: COVID-19; SARS-CoV-2; classification; computed tomography; convolutional neural network; deep learning; generalizability
Year: 2021 PMID: 34268489 PMCID: PMC8275994 DOI: 10.3389/frai.2021.694875
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Summary of data used in the study. These datasets include full volumetric CT scans of the patients.
| Dataset | Origin | Details | # Patients | # 3D scans | Label | Available at |
|---|---|---|---|---|---|---|
| UTSW | UT Southwestern Medical Center | CT vendors: Philips, Toshiba, GE Medical Systems; image resolution: 512 × 512; pixel size range: 0.45–0.83 mm; slice thickness range: 0.9–3 mm; format: DICOM | 101 | 101 | COVID-19 positive | *See footnote |
| | | | 118 | 118 | Infection (negative) | |
| | | | 118 | 118 | Findings unrelated to infection (negative) | |
| China Consortium of Chest CT Image Investigation (CC-CCII) Dataset | Sun Yat-sen Memorial Hospital and Third Affiliated Hospital of Sun Yat-sen University, Guangzhou; The First Affiliated Hospital of Anhui Medical University, Anhui; West China Hospital, Sichuan; Nanjing Renmin Hospital, Nanjing; Yichang Central People's Hospital, Hubei; Renmin Hospital of Wuhan University, Wuhan (all China) | CT vendor: unreported; image resolution: mostly 512 × 512 (a few were 128 × 128); pixel size range: unreported; slice thickness range: 1–5 mm | 929 | 1544 | COVID-19 positive | |
| | | | 964 | 1556 | Common pneumonia (negative) | |
| | | | 849 | 1078 | Normal lung (negative) | |
| COVID-CTset | Negin Medical Center, Sari, Iran | CT vendor: Siemens; image resolution: 512 × 512; pixel size range: unreported; slice thickness range: unreported | 95 | 281 | COVID-19 positive | |
| | | | 282 | 1068 | Normal lung (negative) | |
| MosMedData | Municipal hospitals in Moscow, Russia | CT vendor: Toshiba; image resolution: 512 × 512; pixel size range: unreported; slice thickness: 1 mm | 254 | 254 | CT-0: not consistent with pneumonia (can include both COVID-19 positive and negative) | |
| | | | 684 | 684 | CT-1: mild (COVID-19 positive) | |
| | | | 125 | 125 | CT-2: moderate (COVID-19 positive) | |
| | | | 45 | 45 | CT-3: severe (COVID-19 positive) | |
| | | | 2 | 2 | CT-4: critical (COVID-19 positive) | |
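The per-dataset labels in the table collapse into the two classes used for training (COVID-19 positive vs. negative). A hypothetical mapping is sketched below; the exact label strings are taken from the table, but the handling of MosMedData's ambiguous CT-0 category (excluded here, returning `None`) is an assumption, consistent with the abstract's note that only positive MosMedData labels were used.

```python
# Hypothetical mapping from per-dataset labels (as listed in the table)
# to the binary classes used for training. CT-0 is deliberately absent:
# it mixes positives and negatives, so we treat it as excluded.
LABEL_TO_BINARY = {
    "COVID-19 positive": 1,
    "Infection (negative)": 0,
    "Findings unrelated to infection (negative)": 0,
    "Common pneumonia (negative)": 0,
    "Normal lung (negative)": 0,
    "CT-1": 1,  # MosMedData: mild
    "CT-2": 1,  # MosMedData: moderate
    "CT-3": 1,  # MosMedData: severe
    "CT-4": 1,  # MosMedData: critical
}

def to_binary(label):
    """Return 1 for COVID-19 positive, 0 for negative, None if excluded."""
    return LABEL_TO_BINARY.get(label)
```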
FIGURE 1. Slice view of example CTs from each dataset. Red arrows show patchy ground-glass opacities with round morphology, which are typical findings in COVID-19 pneumonia.
FIGURE 2. Schematic of the deep learning architecture used in the study. Black numbers represent the feature shape of each layer prior to the flattening operation. Red numbers represent the number of features at each layer.
FIGURE 3. Confusion matrices on the test data for each of the models trained on a single dataset. Each row represents the datasets that the model was trained on, and each column represents the datasets that the model was evaluated on. Note that MosMedData does not have any negative label data. The labeling threshold used for each model is indicated on the lower left of each confusion matrix (t = #). The “Average” confusion matrix is an equally weighted average among the datasets, while the “Combined” confusion matrix is calculated from all samples from the datasets.
FIGURE 4. Confusion matrices on the test data for each of the models trained on multiple datasets. Each row represents the datasets that the model was trained on, and each column represents the datasets that the model was evaluated on. The labeling threshold used for each model is indicated on the lower left of each confusion matrix (t = #). The “Average” confusion matrix is an equally weighted average among the datasets, while the “Combined” confusion matrix is calculated from all samples from the datasets.
FIGURE 5. ROC curves on the test data for the models trained on single datasets. Each row represents the datasets that the model was trained on, and each column represents the datasets that the model was evaluated on. The error band and the error value in the reported AUC represent 1 standard deviation.
FIGURE 6. ROC curves on the test data for the models trained on multiple datasets. Each row represents the datasets that the model was trained on, and each column represents the datasets that the model was evaluated on. The error band and the error value in the reported AUC represent 1 standard deviation.
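The captions above refer to a per-model labeling threshold (t = #) applied to the classifier's output score before building each confusion matrix. A minimal sketch of that thresholding step is below; the function name, matrix layout, and example scores are illustrative, not from the paper.

```python
def confusion_matrix(y_true, scores, t):
    """Build a 2x2 confusion matrix [[TN, FP], [FN, TP]] by labeling a
    scan positive when its model score reaches the threshold t
    (the 't = #' value shown in each figure panel)."""
    tn = fp = fn = tp = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= t else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]
```

For example, `confusion_matrix([0, 0, 1, 1], [0.2, 0.7, 0.4, 0.9], 0.5)` labels the 0.7-score negative as a false positive and the 0.4-score positive as a false negative.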
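The AUC values reported in the abstract and in the ROC figures equal the probability that a randomly chosen positive scan receives a higher score than a randomly chosen negative one (the Mann–Whitney U statistic); an AUC of 0.5 corresponds to random guessing. A standalone sketch of that computation is below; the pairwise formulation and the half-credit treatment of tied scores are standard choices, not details from the paper.

```python
def auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` ranks 3 of 4 positive/negative pairs correctly, giving 0.75.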