John R Zech, Marcus A Badgeley, Manway Liu, Anthony B Costa, Joseph J Titano, Eric Karl Oermann.
Abstract
BACKGROUND: There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task. METHODS AND …
Year: 2018 PMID: 30399157 PMCID: PMC6219764 DOI: 10.1371/journal.pmed.1002683
Source DB: PubMed Journal: PLoS Med ISSN: 1549-1277 Impact factor: 11.069
Baseline characteristics of datasets by site.
| Characteristic | IU | MSH | NIH |
|---|---|---|---|
| Patient demographics | | | |
| No. patient radiographs | 3,807 | 42,396 | 112,120 |
| No. patients | 3,683 | 12,904 | 30,805 |
| Age, mean (SD), years | 49.6 (17.0) | 63.2 (16.5) | 46.9 (16.6) |
| No. females (%) | 643 (57.3%) | 18,993 (44.8%) | 48,780 (43.5%) |
| Image diagnosis frequencies | | | |
| Pneumonia, No. (%) | 39 (1.0%) | 14,515 (34.2%) | 1,353 (1.2%) |
| Emphysema, No. (%) | 62 (1.6%) | 1,308 (3.1%) | 2,516 (2.2%) |
| Effusion, No. (%) | 142 (3.7%) | 19,536 (46.1%) | 13,307 (11.9%) |
| Consolidation, No. (%) | 26 (0.7%) | 25,318 (59.7%) | 4,667 (4.2%) |
| Nodule, No. (%) | 104 (2.7%) | 569 (1.3%) | 6,323 (5.6%) |
| Atelectasis, No. (%) | 307 (8.1%) | 16,713 (39.4%) | 11,535 (10.3%) |
| Edema, No. (%) | 45 (1.2%) | 7,144 (16.9%) | 2,303 (2.1%) |
| Cardiomegaly, No. (%) | 328 (8.6%) | 14,285 (33.7%) | 2,772 (2.5%) |
| Hernia, No. (%) | 46 (1.2%) | 228 (0.5%) | 227 (0.2%) |
*Sex data available for 1,122/3,807 IU and 42,383/42,396 MSH radiographs; age data available for 112,077/112,120 NIH radiographs.
Abbreviations: IU, Indiana University Network for Patient Care; MSH, Mount Sinai Hospital; NIH, National Institutes of Health Clinical Center; No., number.
Fig 1. Pneumonia models evaluated on internal and external test sets.
A model trained using both MSH and NIH data (MSH + NIH) had higher performance on the combined MSH + NIH test set than on either subset individually or on fully external IU data. IU, Indiana University Network for Patient Care; MSH, Mount Sinai Hospital; NIH, National Institutes of Health Clinical Center.
Internal and external pneumonia screening performance for all train, tune, and test hospital system combinations.
Parentheses show 95% CIs.
| Train/Tune Site | Comparison Type | Test Site | AUC | Accuracy | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|---|---|
| NIH | Internal | NIH | 0.750 (0.721–0.778) | 0.255 (0.250–0.261) | 0.951 (0.917–0.973) | 0.247 (0.241–0.253) | 0.015 (0.013–0.017) | 0.998 (0.996–0.999) |
| NIH | External | MSH | 0.695 (0.683–0.706) | 0.476 (0.465–0.486) | 0.950 (0.942–0.958) | 0.212 (0.201–0.223) | 0.401 (0.390–0.413) | 0.884 (0.866–0.901) |
| NIH | External | IU | 0.725 (0.644–0.807) | 0.190 (0.178–0.203) | 0.974 (0.865–0.999) | 0.182 (0.170–0.195) | 0.012 (0.009–0.017) | 0.999 (0.992–1.000) |
| NIH | Superset* | MSH + NIH | 0.773 (0.766–0.780) | 0.462 (0.456–0.467) | 0.950 (0.942–0.957) | 0.403 (0.397–0.409) | 0.160 (0.155–0.166) | 0.985 (0.983–0.987) |
| NIH | Superset* | MSH + NIH + IU | 0.787 (0.780–0.793) | 0.470 (0.464–0.475) | 0.950 (0.942–0.957) | 0.418 (0.413–0.424) | 0.148 (0.144–0.153) | 0.987 (0.985–0.989) |
| MSH | Internal | MSH | 0.802 (0.793–0.812) | 0.617 (0.607–0.628) | 0.950 (0.942–0.958) | 0.432 (0.419–0.446) | 0.482 (0.469–0.495) | 0.940 (0.930–0.949) |
| MSH | External | NIH | 0.717 (0.687–0.746) | 0.184 (0.179–0.190) | 0.951 (0.917–0.973) | 0.175 (0.170–0.180) | 0.014 (0.012–0.016) | 0.997 (0.994–0.998) |
| MSH | External | IU | 0.756 (0.674–0.838) | 0.099 (0.089–0.109) | 0.974 (0.865–0.999) | 0.090 (0.081–0.099) | 0.011 (0.008–0.015) | 0.997 (0.984–1.000) |
| MSH | Superset* | MSH + NIH | 0.862 (0.856–0.868) | 0.562 (0.557–0.568) | 0.950 (0.942–0.957) | 0.516 (0.510–0.522) | 0.190 (0.184–0.197) | 0.989 (0.987–0.990) |
| MSH | Superset* | MSH + NIH + IU | 0.871 (0.865–0.877) | 0.577 (0.572–0.582) | 0.950 (0.942–0.957) | 0.537 (0.532–0.543) | 0.180 (0.174–0.185) | 0.990 (0.989–0.992) |
| MSH + NIH | Internal | MSH + NIH | 0.931 (0.927–0.936) | 0.732 (0.727–0.737) | 0.950 (0.942–0.957) | 0.706 (0.700–0.711) | 0.279 (0.271–0.288) | 0.992 (0.990–0.993) |
| MSH + NIH | Subset† | NIH | 0.733 (0.703–0.762) | 0.243 (0.237–0.249) | 0.951 (0.917–0.973) | 0.234 (0.229–0.240) | 0.015 (0.013–0.017) | 0.997 (0.996–0.999) |
| MSH + NIH | Subset† | MSH | 0.805 (0.796–0.814) | 0.630 (0.619–0.640) | 0.950 (0.942–0.958) | 0.451 (0.438–0.465) | 0.491 (0.478–0.504) | 0.942 (0.933–0.951) |
| MSH + NIH | External | IU | 0.815 (0.745–0.885) | 0.238 (0.224–0.252) | 0.974 (0.865–0.999) | 0.230 (0.217–0.244) | 0.013 (0.009–0.018) | 0.999 (0.994–1.000) |
| MSH + NIH | Superset* | MSH + NIH + IU | 0.934 (0.929–0.938) | 0.732 (0.727–0.737) | 0.950 (0.942–0.957) | 0.709 (0.703–0.714) | 0.258 (0.250–0.266) | 0.993 (0.991–0.994) |
*Superset = a test dataset containing data from the same distribution (hospital system) as the training data as well as external data.
† Subset = a test dataset containing data from fewer distributions (hospital systems) than the training data.
Abbreviations: AUC, area under the receiver operating characteristic curve; IU, Indiana University Network for Patient Care; MSH, Mount Sinai Hospital; NIH, National Institutes of Health Clinical Center; NPV, negative predictive value; PPV, positive predictive value.
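The wide swings in PPV and NPV across test sites in the table above follow directly from Bayes' rule at each site's pneumonia prevalence, with sensitivity held near 0.95 at every operating point. The sketch below (a standard calculation, not code from the paper) approximately recovers the NIH and MSH internal-test predictive values from the reported sensitivity, specificity, and the prevalences in the baseline characteristics table:

```python
def ppv_npv(sens, spec, prev):
    """Positive/negative predictive value from sensitivity, specificity,
    and disease prevalence, via Bayes' rule."""
    tp = sens * prev              # true positives per unit population
    fp = (1 - spec) * (1 - prev)  # false positives
    fn = (1 - sens) * prev        # false negatives
    tn = spec * (1 - prev)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# NIH internal test: sens 0.951, spec 0.247, pneumonia prevalence ~1.2%
print(ppv_npv(0.951, 0.247, 0.012))  # PPV ~0.015, NPV ~0.998

# MSH internal test: sens 0.950, spec 0.432, prevalence ~34.2%
print(ppv_npv(0.950, 0.432, 0.342))  # PPV ~0.47, NPV ~0.94
```

The near-identical sensitivity at both sites makes the contrast clean: the two-orders-of-magnitude PPV gap is driven almost entirely by prevalence, not model quality.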
Fig 2. A CNN trained to predict hospital system detects both general and specific image features.
(A) We obtained activation heatmaps from our trained model and averaged over a sample of images to reveal which subregions tended to contribute to a hospital system classification decision. Many different subregions strongly predicted the correct hospital system, with especially strong contributions from image corners. (B-C) On individual images, which have been normalized to highlight only the most influential regions and not all those that contributed to a positive classification, we note that the CNN has learned to detect a metal token that radiology technicians place on the patient in the corner of the image field of view at the time they capture the image. When these strong features are correlated with disease prevalence, models can leverage them to indirectly predict disease. CNN, convolutional neural network.
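The paper does not publish its heatmap-averaging code; the sketch below illustrates the idea with Zhou-style class activation maps (CAM), where each map is a classifier-weighted sum of the final convolutional feature maps and maps are averaged over images to expose consistently influential subregions. All shapes, names, and the random stand-in data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def class_activation_map(features, class_weights):
    """CAM: weighted sum of final conv feature maps (C, H, W) -> (H, W)."""
    return np.tensordot(class_weights, features, axes=1)

# Stand-in data: 100 "images", 512 feature maps of 7x7 (hypothetical shapes)
n_images, C, H, W = 100, 512, 7, 7
weights = rng.normal(size=C)  # classifier weights for one hospital-system class
heatmaps = np.stack([
    class_activation_map(rng.normal(size=(C, H, W)), weights)
    for _ in range(n_images)
])

# Average over images, as in panel A, to find regions that tend to drive
# the hospital-system decision regardless of the individual radiograph
mean_map = heatmaps.mean(axis=0)

# Min-max normalize for display, as in panels B-C
norm = (mean_map - mean_map.min()) / (mean_map.max() - mean_map.min())
```

In the real pipeline the feature maps would come from the trained network's last convolutional layer, and the averaged map would be upsampled and overlaid on the radiograph.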
Fig 3. Assessing how prevalence differences in aggregated datasets encouraged confounder exploitation.
(A) Five cohorts of 20,000 patients were systematically subsampled to differ only in relative pneumonia risk based on the clinical training data sites. Model performance was assessed on test data from the internal hospital systems (MSH, NIH) and from an external hospital system (IU). (B) Although models perform better in internal testing in the presence of extreme prevalence differences, this benefit is not seen when applied to data from new hospital systems. The natural relative risk of disease at MSH, indicated by a vertical line, is quite imbalanced. IU, Indiana University Network for Patient Care; MSH, Mount Sinai Hospital; NIH, National Institutes of Health Clinical Center; ROC, receiver operating characteristic; RR, relative risk.
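The five-cohort construction in panel A amounts to stratified subsampling to target prevalences at each site. The sketch below shows one such cohort; the patient IDs, cohort sizes, and the 20%/10% split (giving a relative risk of 2 between sites) are hypothetical illustrations, not the paper's actual cohort definitions:

```python
import random

def subsample_to_prevalence(cases, controls, n_total, prevalence, seed=0):
    """Draw a fixed-size cohort with a chosen pneumonia prevalence."""
    rng = random.Random(seed)
    n_cases = round(n_total * prevalence)
    n_controls = n_total - n_cases
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)

# Hypothetical patient IDs per site
msh_cases, msh_controls = list(range(5000)), list(range(5000, 30000))
nih_cases, nih_controls = list(range(30000, 32000)), list(range(32000, 80000))

# One 20,000-patient cohort with RR = 2 between sites:
# 20% prevalence in the MSH half vs 10% in the NIH half
cohort = (subsample_to_prevalence(msh_cases, msh_controls, 10_000, 0.20)
          + subsample_to_prevalence(nih_cases, nih_controls, 10_000, 0.10, seed=1))
assert len(cohort) == 20_000
```

Varying only the two target prevalences while holding total cohort size fixed isolates relative disease risk as the experimental variable, which is what lets panel B attribute the internal-test gains at extreme imbalance to confounder exploitation rather than to more training data.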