João Pedrosa, Guilherme Aresta, Carlos Ferreira, Catarina Carvalho, Joana Silva, Pedro Sousa, Lucas Ribeiro, Ana Maria Mendonça, Aurélio Campilho.
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has impacted healthcare systems across the world. Chest radiography (CXR) can be used as a complementary method for diagnosing/following COVID-19 patients. However, the experience level and workload of technicians and radiologists may affect the decision process. Recent studies suggest that deep learning can be used to assess CXRs, providing an important second opinion for radiologists and technicians in the decision process, and super-human performance in the detection of COVID-19 has been reported in multiple studies. In this study, the clinical applicability of deep learning systems for COVID-19 screening was assessed by testing their performance in the detection of COVID-19. Specifically, four datasets were used: (1) a collection of multiple public datasets (284,793 CXRs); (2) the BIMCV dataset (16,631 CXRs); (3) COVIDGR (852 CXRs); and (4) a private dataset (6,361 CXRs). All datasets were collected retrospectively and consist of only frontal CXR views. A ResNet-18 was trained on each of the datasets for the detection of COVID-19. It is shown that a high dataset bias was present, leading to high performance in intradataset train-test scenarios (area under the curve > 0.98 on the collection of public datasets). However, significantly lower performances were obtained in interdataset train-test scenarios (area under the curve 0.55-0.84). A subset of the data was then assessed by radiologists for comparison to the automatic systems. Fine-tuning with radiologist annotations significantly increased performance across datasets (area under the curve 0.61-0.88) and improved the attention on clinical findings in positive COVID-19 CXRs. Nevertheless, tests on CXRs from different hospital services indicate that the screening performance of CXR and automatic systems is limited (area under the curve < 0.6 on emergency service CXRs). However, COVID-19 manifestations can be accurately detected when present, motivating the use of these tools for evaluating disease progression in mild to severe COVID-19 patients.
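The abstract's core pipeline (an ImageNet-pretrained ResNet-18 fine-tuned for binary COVID-19 detection on frontal CXRs) can be sketched as follows. This is a minimal illustration assuming PyTorch/torchvision; the dataset layout `cxr_train/{negative,positive}`, image size, and hyperparameters are placeholders, not the authors' exact configuration.

```python
# Minimal sketch of a ResNet-18 fine-tuned for binary COVID-19 detection on CXRs.
# Paths, class layout and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ImageNet-pretrained backbone; replace the classifier head with a single logit.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)
model = model.to(device)

# Grayscale CXRs are replicated to 3 channels to match the pretrained stem.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder layout: cxr_train/{negative,positive}/*.png
train_set = ImageFolder("cxr_train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(10):
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.float().unsqueeze(1).to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```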
Year: 2022 PMID: 35449199 PMCID: PMC9022741 DOI: 10.1038/s41598-022-10568-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Summary of the types of deep learning architecture (left), ratio of studied COVID-19 CXRs (center) and performance of automatic COVID-19 diagnosis in CXR up to July 2020 (right). Performance metrics: AUC - area under the ROC curve; Acc. - accuracy; F1 - F1 score; Prec. - precision; Sens. - sensitivity; Spec. - specificity (number of works).
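For reference, all of the metrics abbreviated in Figure 1 can be computed with scikit-learn from a vector of predicted probabilities and binary labels; the arrays below are toy values for illustration only.

```python
# Metrics from Figure 1 computed with scikit-learn (toy data for illustration).
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.2, 0.9, 0.5])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 decision threshold

auc  = roc_auc_score(y_true, y_prob)              # AUC (threshold-free)
acc  = accuracy_score(y_true, y_pred)             # Acc.
f1   = f1_score(y_true, y_pred)                   # F1
prec = precision_score(y_true, y_pred)            # Prec.
sens = recall_score(y_true, y_pred)               # Sens. (recall on positives)
spec = recall_score(y_true, y_pred, pos_label=0)  # Spec. (recall on negatives)
print(f"AUC={auc:.3f} Acc={acc:.3f} F1={f1:.3f} "
      f"Prec={prec:.3f} Sens={sens:.3f} Spec={spec:.3f}")
```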
Figure 2. Summary of the study. Each (letter.number) indicates the data/model used for performing the task encoded by the respective color.
Number of CXRs per dataset for each of the three classes after exclusion of non-frontal CXRs, with the number of CXRs annotated by radiologists in parentheses.
| Dataset | Normal | Pathological | COVID-19 |
|---|---|---|---|
| Mixed | | | |
| CheXpert | 21,214 (0) | 169,833 (7) | 0 (0) |
| ChestXRay-8 | 60,361 (188) | 880 (38) | 0 (0) |
| COVID-19 IDC | 9 (1) | 55 (50) | 618 (205) |
| COVIDx | 1 (0) | 0 (0) | 93 (4) |
| RSNA-PDC | 8,851 (251) | 17,833 (291) | 0 (0) |
| SAVE LIVES | 0 (0) | 0 (0) | 4,889 (171) |
| SERAM | 0 (0) | 5 (5) | 37 (37) |
| | 0 (0) | 0 (0) | 114 (8) |
| BIMCV | | | |
| PADCHEST | 4,349 (78) | 9,811 (170) | 0 (0) |
| COVID-19+ | 0 (0) | 0 (0) | 2,471 (41) |
| COVIDGR | 426 (150) | | 426 (150) |
| CHVNGE | 5,626 (529) | | 735 (68) |
Patient and CXR acquisition characteristics for each dataset. Data are shown as absolute number (percentage). Age is shown as median [minimum; maximum]. * indicates that the information could not be obtained for all subjects; NA indicates that the data is not available.
| | Mixed | BIMCV | COVIDGR | CHVNGE |
|---|---|---|---|---|
| Gender | | | | |
| Male | 165,123 (58.0) | 8,615 (51.8) | 0 (0.0) | 3,340 (52.5) |
| Female | 118,863 (41.7) | 8,015 (48.2) | 0 (0.0) | 3,021 (47.5) |
| Unknown/Other | 807 (0.3) | 1 (0.0) | 852 (100.0) | 0 (0.0) |
| Age | 58 [1;106]* | 64 [1;101]* | NA | 66 [2;100] |
| View | | | | |
| AP | 199,606 (70.1) | 4,214 (25.3) | 0 (0.0) | 82 (1.3) |
| PA | 84,070 (29.5) | 12,160 (73.1) | 852 (100.0) | 66 (1.0) |
| Unknown | 1,117 (0.4) | 257 (1.5) | 0 (0.0) | 6,213 (97.7) |
| CXR equipment | | | | |
| Agfa CR | 2,280 (0.8) | 346 (2.1) | 0 (0.0) | 0 (0.0) |
| Agfa DX | 0 (0.0) | 247 (1.5) | 0 (0.0) | 0 (0.0) |
| Canon DX | 0 (0.0) | 45 (0.3) | 0 (0.0) | 0 (0.0) |
| Carestream CR | 0 (0.0) | 9 (0.1) | 0 (0.0) | 0 (0.0) |
| Carestream DX | 0 (0.0) | 97 (0.6) | 0 (0.0) | 59 (0.9) |
| FUJI CR | 0 (0.0) | 155 (0.9) | 0 (0.0) | 2,164 (34.0) |
| FUJI DX | 0 (0.0) | 4 (0.0) | 0 (0.0) | 436 (6.9) |
| GE DX | 0 (0.0) | 155 (0.9) | 0 (0.0) | 0 (0.0) |
| GMM DX | 0 (0.0) | 269 (1.6) | 0 (0.0) | 0 (0.0) |
| ImagingDynamics CR | 0 (0.0) | 3,668 (22.1) | 0 (0.0) | 0 (0.0) |
| KONICA CR | 0 (0.0) | 423 (2.5) | 0 (0.0) | 0 (0.0) |
| KONICA DX | 0 (0.0) | 89 (0.5) | 0 (0.0) | 0 (0.0) |
| Philips CR | 1,019 (0.4) | 8,131 (48.9) | 0 (0.0) | 0 (0.0) |
| Philips DX | 0 (0.0) | 2,711 (16.3) | 0 (0.0) | 0 (0.0) |
| Samsung DX | 0 (0.0) | 0 (0.0) | 0 (0.0) | 3,702 (58.2) |
| SIEMENS CR | 1,575 (0.6) | 261 (1.6) | 0 (0.0) | 0 (0.0) |
| SIEMENS DX | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| Unknown/Other | 279,919 (98.3) | 21 (0.1) | 852 (100.0) | 0 (0.0) |
Figure 3. Confusion matrices between ground truth labels and radiologist annotations. (a) Comparison between ground truth and radiologists' consensus on each dataset; (b) inter- and intraobserver variability across all datasets (left and right, respectively). N - Normal; P - Not indicative of COVID-19 (pathological); C - Indicative of COVID-19; U - Undetermined. Cases annotated as Compromised are not shown. Color intensity corresponds to the percentage of cases within each column.
Agreement between the radiologists' consensus and the ground truth, considering as positives only CXRs marked as Indicative of COVID-19 (C) or as Indicative of COVID-19 or Undetermined (C + U). p: p-value obtained with the McNemar test.
| Dataset | C | | | C + U | | |
|---|---|---|---|---|---|---|
| | Prec. | Recall | p | Prec. | Recall | p |
| Mixed | 0.79 | 0.28 | < 0.0001 | 0.49 | 0.63 | < 0.0001 |
| BIMCV | 0.67 | 0.19 | < 0.0001 | 0.30 | 0.75 | < 0.0001 |
| COVIDGR | 1.00 | 0.26 | < 0.0001 | 0.93 | 0.54 | < 0.0001 |
| CHVNGE | 0.44 | 0.13 | < 0.0001 | 0.30 | 0.54 | < 0.0001 |
| All | 0.79 | 0.25 | < 0.0001 | 0.49 | 0.60 | < 0.0001 |
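A hedged sketch of how McNemar p-values like those above can be obtained: the test compares two paired binary readings (e.g., radiologist consensus versus ground truth) through their 2x2 disagreement table. This uses the `statsmodels` implementation; the arrays are toy data, not the study's annotations.

```python
# McNemar test on paired binary decisions (toy data, statsmodels implementation).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

reader_a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])  # e.g., ground truth label
reader_b = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])  # e.g., radiologist consensus

# Contingency table over paired decisions: rows = reader A, cols = reader B.
table = np.array([
    [np.sum((reader_a == 1) & (reader_b == 1)), np.sum((reader_a == 1) & (reader_b == 0))],
    [np.sum((reader_a == 0) & (reader_b == 1)), np.sum((reader_a == 0) & (reader_b == 0))],
])
result = mcnemar(table, exact=True)  # exact binomial version for small counts
print(f"McNemar statistic={result.statistic:.1f}, p-value={result.pvalue:.4f}")
```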
Inter- and intraobserver variability of radiologist annotations, considering as positives only CXRs marked as Indicative of COVID-19 (C) or as Indicative of COVID-19 or Undetermined (C + U). p: p-value obtained with the McNemar test (NP: insufficient statistical power).
| | C | | | C + U | | |
|---|---|---|---|---|---|---|
| | Acc. | κ | p | Acc. | κ | p |
| Interobserver | 0.90 | 0.44 | 0.0021 | 0.82 | 0.58 | NP |
| Intraobserver | | | | | | |
| Radiologist 1 | 0.96 | 0.80 | NP | 0.81 | 0.56 | NP |
| Radiologist 2 | 0.87 | 0.50 | NP | 0.83 | 0.63 | NP |
| Consensus | 0.95 | 0.77 | NP | 0.88 | 0.71 | NP |
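Assuming the unlabeled second column in each block above is Cohen's κ (the symbol appears to have been lost in extraction; κ is the standard chance-corrected agreement measure in reader studies), both agreement figures can be computed as below. The readings are toy data.

```python
# Raw agreement (accuracy) and chance-corrected agreement (Cohen's kappa)
# between two readings of the same CXRs; toy data for illustration.
from sklearn.metrics import accuracy_score, cohen_kappa_score

read_1 = [1, 0, 1, 1, 0, 0, 1, 0]  # first reading
read_2 = [1, 0, 0, 1, 0, 1, 1, 0]  # second reading of the same CXRs

acc = accuracy_score(read_1, read_2)       # raw agreement
kappa = cohen_kappa_score(read_1, read_2)  # agreement corrected for chance
print(f"Acc={acc:.2f}, kappa={kappa:.2f}")
```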
Figure 4. ROC of each of the models on each of the datasets. Average model performance across all folds is shown as a line (full line: full dataset used for testing; dashed line: separate test set for each fold) and the shaded region corresponds to the 95% confidence interval. Left: model performance on all CXRs; right: model and radiologists' performance on annotated CXRs. The rightmost plots are a zoomed version of the gray shaded region of the left plot. Radiologists' performance is shown when considering as positives only CXRs marked as Indicative of COVID-19 (C) or Indicative of COVID-19 and Undetermined (C + U).
Model AUC for each dataset and cross-validation fold. Bold indicates the highest AUC per dataset and fold.
| Test dataset | | | | | |
|---|---|---|---|---|---|
| Mixed | 0.6248 | 0.9294 | 0.9741 | 0.9890 | 0.9192 |
| | 0.7161 | 0.9144 | 0.9590 | 0.9326 | 0.9017 |
| | 0.7139 | 0.7708 | 0.9604 | 0.9785 | 0.9007 |
| | 0.6072 | 0.8275 | 0.9783 | 0.9840 | 0.9694 |
| | 0.6078 | 0.9535 | 0.9718 | 0.9853 | 0.9635 |
| BIMCV | 0.6354 | 0.6572 | 0.6929 | 0.7357 | 0.6264 |
| | 0.6564 | 0.6659 | 0.7005 | 0.6887 | 0.6813 |
| | 0.6405 | 0.6510 | 0.7009 | 0.6952 | 0.6858 |
| | 0.6491 | 0.6861 | 0.6861 | 0.7160 | 0.7249 |
| | 0.6399 | 0.6815 | 0.6965 | 0.6755 | 0.6740 |
| COVIDGR | 0.8441 | 0.8272 | 0.8409 | 0.7400 | 0.8384 |
| | 0.7802 | 0.6768 | 0.7868 | 0.8060 | 0.7774 |
| | 0.8065 | 0.8180 | 0.7936 | 0.7697 | 0.8209 |
| | 0.7871 | 0.8215 | 0.8143 | 0.7701 | 0.8004 |
| | 0.7432 | 0.7116 | 0.8227 | 0.7991 | 0.7625 |
| CHVNGE | 0.5926 | 0.6244 | 0.5730 | 0.5810 | 0.6986 |
| | 0.6085 | 0.6025 | 0.5971 | 0.5969 | 0.6518 |
| | 0.5549 | 0.6178 | 0.5971 | 0.5887 | 0.6137 |
| | 0.5465 | 0.5891 | 0.6292 | 0.6049 | 0.6604 |
| | 0.6256 | 0.5674 | 0.5573 | 0.6297 | 0.6594 |
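A sketch of the per-fold aggregation behind Figure 4 and the table above: ROC curves from each cross-validation fold are interpolated on a common FPR grid and summarized as a mean curve with a 95% band. Fold outputs are simulated here; the band formula (1.96 standard errors) is a common convention, not necessarily the authors' exact procedure.

```python
# Aggregate per-fold ROC curves on a common FPR grid (simulated fold outputs).
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
fpr_grid = np.linspace(0, 1, 101)
tprs, aucs = [], []

for fold in range(5):
    y_true = rng.integers(0, 2, 200)                                # simulated labels
    y_prob = np.clip(y_true * 0.4 + rng.normal(0.3, 0.25, 200), 0, 1)  # simulated scores
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    tprs.append(np.interp(fpr_grid, fpr, tpr))  # interpolate TPR on the shared grid
    aucs.append(auc(fpr, tpr))

mean_tpr = np.mean(tprs, axis=0)                       # average ROC curve
ci95 = 1.96 * np.std(tprs, axis=0) / np.sqrt(len(tprs))  # 95% band half-width
print(f"AUC per fold: {np.round(aucs, 3)}; mean={np.mean(aucs):.3f}")
```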
Statistical significance of differences in AUC between readers (models and radiologists) for each dataset according to the DeLong test. Bold indicates statistical significance with Bonferroni correction (separate corrected thresholds for the whole test set and the annotated test set).
| | Whole test set | | | | Annotated test set | | | |
|---|---|---|---|---|---|---|---|---|
| | Mixed | BIMCV | COVIDGR | CHVNGE | Mixed | BIMCV | COVIDGR | CHVNGE |
| | < | < | 0.6026 | 0.0481 | < | < | 0.2155 | 0.0376 |
| | < | < | 0.2237 | < | 0.9545 | 0.6534 | 0.5636 | |
| | < | < | 0.1189 | < | < | 0.3035 | 0.5661 | |
| | < | < | 0.1163 | < | 0.7054 | 0.4333 | | |
| | < | < | 0.1080 | < | 0.5516 | 0.7556 | 0.1077 | |
| | < | < | 0.0102 | 0.0613 | < | 0.2324 | 0.0119 | |
| | < | < | 0.6802 | < | 0.3088 | < | 0.5906 | 0.2749 |
| | < | < | 0.0319 | 0.1726 | < | 0.0813 | 0.0125 | |
| | < | < | 0.2073 | 0.0425 | < | 0.0433 | 0.0957 | |
| | < | < | < | 0.7269 | 0.1279 | | | |
| | < | < | 0.2827 | 0.0139 | < | 0.1835 | 0.6108 | 0.6498 |
| | < | 0.0277 | 0.0055 | < | < | 0.8138 | 0.1589 | 0.1793 |
| | < | < | 0.0067 | < | 0.0127 | 0.0893 | 0.2426 | 0.0070 |
| | < | 0.0059 | < | 0.0234 | 0.7022 | 0.3261 | 0.0528 | |
| | < | < | 0.2475 | < | 0.3714 | 0.4898 | 0.1156 | |
| Radiologists | - | - | - | - | < | 0.0594 | 0.2566 | 0.0028 |
| Radiologists | - | - | - | - | < | < | 0.7046 | 0.1386 |
| Radiologists | - | - | - | - | < | 0.3913 | 0.0137 | 0.0092 |
| Radiologists | - | - | - | - | < | 0.2694 | 0.7197 | 0.3691 |
| Radiologists | - | - | - | - | < | 0.4794 | 0.0331 | 0.0212 |
| Radiologists | - | - | - | - | < | 0.0053 | 0.1322 | 0.2046 |
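There is no DeLong test in the standard Python scientific stack. The sketch below is the widely used fast midrank formulation (after Sun & Xu, 2014) for comparing two correlated AUCs on the same test samples; it is an assumption about how such p-values can be reproduced, not the authors' code.

```python
# Fast DeLong test for two correlated AUCs (midrank formulation; a hedged sketch).
import numpy as np
from scipy import stats

def delong_test(y_true, prob_a, prob_b):
    """Two-sided p-value for AUC(prob_a) vs AUC(prob_b) on the same samples."""
    y_true = np.asarray(y_true)
    pos, neg = y_true == 1, y_true == 0
    m, n = int(pos.sum()), int(neg.sum())
    aucs, v_pos, v_neg = [], [], []
    for p in (np.asarray(prob_a, float), np.asarray(prob_b, float)):
        r_all = stats.rankdata(p)        # midranks over all samples
        r_pos = stats.rankdata(p[pos])   # midranks within positives
        r_neg = stats.rankdata(p[neg])   # midranks within negatives
        aucs.append((r_all[pos].sum() - m * (m + 1) / 2) / (m * n))
        v_pos.append((r_all[pos] - r_pos) / n)        # structural components (positives)
        v_neg.append(1.0 - (r_all[neg] - r_neg) / m)  # structural components (negatives)
    s_pos, s_neg = np.cov(np.array(v_pos)), np.cov(np.array(v_neg))
    var = ((s_pos[0, 0] + s_pos[1, 1] - 2 * s_pos[0, 1]) / m
           + (s_neg[0, 0] + s_neg[1, 1] - 2 * s_neg[0, 1]) / n)
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs, 2 * stats.norm.sf(abs(z))

# Toy usage: two models scored on the same 200 CXRs.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
pa = np.clip(y * 0.4 + rng.normal(0.3, 0.25, 200), 0, 1)
pb = np.clip(y * 0.3 + rng.normal(0.35, 0.25, 200), 0, 1)
(auc_a, auc_b), p = delong_test(y, pa, pb)
print(f"AUC A={auc_a:.3f}, AUC B={auc_b:.3f}, DeLong p={p:.4f}")
```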
Calibration, in terms of expected calibration error (ECE), of each model on each dataset for the 5 test folds. Lower ECE values indicate better calibration. Bold indicates the lowest value in each row.
| Test dataset | | | | | |
|---|---|---|---|---|---|
| Mixed | 0.0082 | 0.6994 | 0.0883 | 0.0448 | 0.0438 |
| | 0.6977 | 0.1161 | 0.0904 | 0.0224 | 0.0265 |
| | 0.6021 | 0.2232 | 0.0291 | 0.0242 | 0.0245 |
| | 0.7106 | 0.0648 | 0.0498 | 0.0294 | 0.0243 |
| | 0.0241 | 0.7152 | 0.0633 | 0.0217 | 0.0251 |
| BIMCV | 0.1469 | 0.1992 | 0.1145 | 0.1234 | 0.1639 |
| | 0.1304 | 0.2277 | 0.1326 | 0.1442 | 0.0929 |
| | 0.2074 | 0.2724 | 0.1535 | 0.1255 | 0.1350 |
| | 0.1295 | 0.2237 | 0.1555 | 0.1194 | 0.1052 |
| | 0.1382 | 0.2209 | 0.2260 | 0.1334 | 0.1489 |
| COVIDGR | 0.4447 | 0.2275 | 0.1029 | 0.4238 | 0.1902 |
| | 0.4591 | 0.1790 | 0.1961 | 0.4977 | 0.2853 |
| | 0.3522 | 0.1970 | 0.2810 | 0.4321 | 0.2300 |
| | 0.4344 | 0.2442 | 0.3393 | 0.4055 | 0.2572 |
| | 0.4373 | 0.2008 | 0.3581 | 0.4595 | 0.2026 |
| CHVNGE | 0.1202 | 0.6024 | 0.1281 | 0.0650 | 0.1039 |
| | 0.0965 | 0.5253 | 0.1523 | 0.1136 | 0.0552 |
| | 0.0941 | 0.5427 | 0.2279 | 0.0913 | 0.1131 |
| | 0.0990 | 0.6492 | 0.1636 | 0.0815 | 0.1447 |
| | 0.1046 | 0.6215 | 0.2241 | 0.0779 | 0.1075 |
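For reference, a minimal ECE sketch under one common convention: the predicted positive-class probability is split into 10 equal-width bins, and ECE is the occupancy-weighted mean gap between each bin's observed positive rate and its mean predicted probability. The paper does not state its exact binning, so this is an illustrative assumption.

```python
# Expected calibration error (ECE): weighted mean |observed rate - mean confidence|
# over equal-width probability bins. A minimal sketch; 10 bins assumed.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, float)
    y_prob = np.asarray(y_prob, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so that prob == 1.0 is included
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())  # |observed - predicted|
            ece += mask.mean() * gap                              # weight by bin occupancy
    return ece

# Toy usage with simulated probabilities.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
prob = np.clip(y * 0.4 + rng.normal(0.3, 0.2, 500), 0, 1)
print(f"ECE = {expected_calibration_error(y, prob):.4f}")
```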
Figure 5. Examples of CXRs of COVID-19 positive patients from the CHVNGE dataset (first column) and the corresponding GradCAM++ activations on the COVID-19 class for two of the models (second and third columns). The first two rows correspond to CXRs correctly classified with high probability, the third row corresponds to a CXR correctly classified with average probability and the fourth row corresponds to a CXR incorrectly classified with low probability.
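GradCAM++ overlays like those in Figure 5 can be produced with the open-source `pytorch-grad-cam` package (one common implementation; the paper does not state the authors' tooling). The sketch below assumes a ResNet-18 as in the earlier snippet, with `layer4[-1]` as the usual target layer; the input tensors are random stand-ins for a preprocessed CXR.

```python
# GradCAM++ heatmap on a ResNet-18 using the pytorch-grad-cam package
# (pip install grad-cam). Inputs are random stand-ins for a real CXR.
import numpy as np
import torch
from torchvision import models
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

input_tensor = torch.randn(1, 3, 224, 224)                 # preprocessed CXR stand-in
rgb_img = np.random.rand(224, 224, 3).astype(np.float32)   # display image in [0, 1]

# Last residual block is the conventional CAM target layer for ResNets.
cam = GradCAMPlusPlus(model=model, target_layers=[model.layer4[-1]])
# Target the COVID-19 output; index 0 would be the single logit of a binary head.
heatmap = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(0)])[0]
overlay = show_cam_on_image(rgb_img, heatmap, use_rgb=True)  # HxWx3 uint8 overlay
```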
Figure 6. Equipment-wise evaluation on the CHVNGE dataset. Top left: model performance on all CXRs. Top right: model and radiologist performance on CXRs annotated by radiologists (Carestream not shown due to limited data). Average model performance for all folds is shown as a line and the shaded region corresponds to the 95% confidence interval. Radiologists' performance is shown considering as positives only CXRs marked as Indicative of COVID-19 (C) or Indicative of COVID-19 and Undetermined (C + U).
AUC for each CXR equipment on CHVNGE for each cross-validation fold. Bold indicates the highest AUC for each fold.
| FUJI CR | Samsung | Carestream | FUJI DX |
|---|---|---|---|
| 0.7808 | 0.5618 | 0.6768 | |
| 0.8145 | 0.5710 | 0.7053 | |
| 0.7770 | 0.5676 | 0.7061 | |
| 0.7908 | 0.5934 | 0.7291 | |
| 0.7975 | 0.5890 | 0.6783 | |
Statistical significance of differences in AUC between different CXR equipment on CHVNGE according to the Venkatraman test. Bold indicates statistical significance with Bonferroni correction.
| Equipment 1 | Equipment 2 | p |
|---|---|---|
| FUJI CR | Samsung | < |
| FUJI CR | Carestream | 1.0000 |
| FUJI CR | FUJI DX | 0.2357 |
| Samsung | Carestream | |
| Samsung | FUJI DX | 0.0294 |
| Carestream | FUJI DX | 0.8490 |
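Venkatraman's test is a permutation test on entire ROC curves; a ready implementation exists in R's pROC package (`roc.test(..., method = "venkatraman")`). As a simplified Python stand-in, the sketch below permutes equipment membership and tests the AUC difference between two unpaired equipment subsets. This is plainly an approximation of the idea, not Venkatraman's exact statistic.

```python
# Permutation test on the AUC difference between two unpaired subsets
# (a simplified stand-in for Venkatraman's ROC permutation test).
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_auc_test(y1, p1, y2, p2, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    y1, p1, y2, p2 = map(np.asarray, (y1, p1, y2, p2))
    observed = abs(roc_auc_score(y1, p1) - roc_auc_score(y2, p2))
    y, p, n1 = np.concatenate([y1, y2]), np.concatenate([p1, p2]), len(y1)
    valid = hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(y))
        a, b = idx[:n1], idx[n1:]
        # Skip shuffles that leave a group without both classes (AUC undefined).
        if y[a].min() == y[a].max() or y[b].min() == y[b].max():
            continue
        valid += 1
        diff = abs(roc_auc_score(y[a], p[a]) - roc_auc_score(y[b], p[b]))
        hits += diff >= observed
    return (hits + 1) / (valid + 1)

# Toy usage: two simulated equipment subsets of different sizes.
rng = np.random.default_rng(1)
y1 = rng.integers(0, 2, 120); p1 = np.clip(y1 * 0.35 + rng.normal(0.3, 0.25, 120), 0, 1)
y2 = rng.integers(0, 2, 80);  p2 = np.clip(y2 * 0.15 + rng.normal(0.4, 0.25, 80), 0, 1)
print(f"p = {permutation_auc_test(y1, p1, y2, p2):.4f}")
```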
Statistical significance of differences in ROC between the model and radiologists for each CXR equipment on CHVNGE according to the DeLong test. Bold indicates statistical significance.
| | FUJI CR | Samsung | FUJI DX |
|---|---|---|---|
| Radiologists | 0.2699 | | |