Marcus A Badgeley, John R Zech, Luke Oakden-Rayner, Benjamin S Glicksberg, Manway Liu, William Gale, Michael V McConnell, Bethany Percha, Thomas M Snyder, Joel T Dudley.
Abstract
Hip fractures are a leading cause of death and disability among older adults. They are also the most commonly missed diagnosis on pelvic radiographs, and delayed diagnosis leads to higher cost and worse outcomes. Computer-aided diagnosis (CAD) algorithms have shown promise for helping radiologists detect fractures, but the image features underpinning their predictions are notoriously difficult to understand. In this study, we trained deep-learning models on 17,587 radiographs to classify fracture, 5 patient traits, and 14 hospital process variables. All 20 variables could be individually predicted from a radiograph, with the best performances on scanner model (AUC = 1.00), scanner brand (AUC = 0.98), and whether the order was marked "priority" (AUC = 0.79). Fracture was predicted moderately well from the image alone (AUC = 0.78) and better when image features were combined with patient data (AUC = 0.86, DeLong paired AUC comparison, p = 2e-9) or with patient plus hospital process data (AUC = 0.91, p = 1e-21). Fracture prediction on a test set with fracture risk balanced across patient variables was significantly lower than on a random test set (AUC = 0.67, DeLong unpaired AUC comparison, p = 0.003); on a test set with fracture risk balanced across both patient and hospital process variables, the model performed no better than chance (AUC = 0.52, 95% CI 0.46-0.58), indicating that these variables were the main source of the model's fracture predictions. A single model that directly combines image features with patient and hospital process data outperforms a Naive Bayes ensemble of an image-only model's prediction with the same patient and hospital process data. If CAD algorithms are inexplicably leveraging patient and process variables in their predictions, it is unclear how radiologists should interpret those predictions in the context of other known patient data.
Further research is needed to illuminate deep-learning decision processes so that computers and clinicians can effectively cooperate.
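The AUC comparisons reported above use DeLong tests; a closely related, simpler alternative is a paired bootstrap over test cases. A minimal stdlib-only sketch (illustrative, not the authors' code; the scores are made up) of computing AUC and the bootstrap distribution of the AUC difference between an image-only and a multimodal model:

```python
import random

def auc(labels, scores):
    """Mann-Whitney AUC: probability a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_delta(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Bootstrap distribution of AUC(model b) - AUC(model a), resampling cases
    jointly so the comparison stays paired."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # a resample needs both classes for AUC to exist
            continue
        a = auc(ys, [scores_a[i] for i in idx])
        b = auc(ys, [scores_b[i] for i in idx])
        deltas.append(b - a)
    return deltas
```

The fraction of `deltas` at or below zero gives a one-sided bootstrap p-value for the hypothesis that the multimodal model is no better than the image-only one.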
Keywords: Computer science; Radiography; Statistics
Year: 2019 PMID: 31304378 PMCID: PMC6550136 DOI: 10.1038/s41746-019-0105-1
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 The main source of variation in whole radiographs is explained by the device used to capture the radiograph. a Schematic of the Inception-v3 deep-learning model used to featurize radiographs into an embedded 2048-dimensional representation. Inception model architecture schematic derived from https://cloud.google.com/tpu/docs/inception-v3-advanced. b Data were collected from two sources. Variables were categorized as pathology (gold), image (IMG, yellow), patient (PT, pink), or hospital process (HP, green). Italicized variables are not known at the time of image acquisition and are not used as explanatory variables. c Radiographs projected into clusters by t-Distributed Stochastic Neighbor Embedding (t-SNE), showing how the unsupervised distribution of clusters relates to hip fracture and the categorical variables
Fig. 2 Deep learning predicts all patient and hospital process variables from a radiograph. a Deep-learning image models predict binarized forms of 14 HP variables, 5 PT variables, and hip fracture. Error bars indicate the 95% confidence intervals of 2000 bootstrapped samples. b Deep-learning regression models predict eight continuous variables from hip radiographs. Each dot represents one radiograph, and the purple lines are linear models of actual versus predicted values. c ROC, ROC +/− bootstrap confidence intervals, and precision-recall curves for deep-learning models that predict fracture from combinatorial predictor sets of IMG, PT, and HP variables. Crosshairs indicate the best operating point on the ROC and PRC curves
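The "best operating point" marked by crosshairs on an ROC curve is conventionally the threshold maximizing Youden's J statistic (TPR − FPR); the paper does not state its exact criterion, so this is an assumption. A stdlib-only sketch:

```python
def best_operating_point(labels, scores):
    """Return (J, threshold, TPR, FPR) maximizing Youden's J = TPR - FPR.

    Sweeps each distinct score as a 'predict positive if score >= t' threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        tpr, fpr = tp / n_pos, fp / n_neg
        j = tpr - fpr
        if best is None or j > best[0]:
            best = (j, t, tpr, fpr)
    return best
```

On a perfectly separated toy set this recovers the threshold that yields TPR = 1 and FPR = 0; on the fracture models above, the crosshair would sit wherever the gain in sensitivity stops paying for the loss in specificity.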
Cohort Characteristics after various Sampling Routines
| Cohort | cs-train | cs-test | cc-Random | cc-Dem | cc-Pt | cc-PtHp |
|---|---|---|---|---|---|---|
| Sampling | Cross-sectional | Cross-sectional | Case–control | Case–control | Case–control | Case–control |
| Matching | NA | NA | NA | Age + gender | PT | PT + HP |
| Partition | Train | Test | Test | Test | Test | Test |
| No. of radiographs | 17,587 | 5970 | 416 | 405 | 416 | 411 |
| No. of patients | 6768 | 2256 | 275 | 252 | 217 | 186 |
| No. of scanners | 11 | 11 | 10 | 9 | 8 | 6 |
| No. of scanner manufacturers | 4 | 4 | 4 | 4 | 4 | 4 |
| Age, mean (SD), years | 61 (22) | 61 (22) | 67 (24) | 75 (20) | 75 (21) | 74 (19) |
| Female frequency, no. (%) | 11,647 (66) | 3873 (65) | 260 (62) | 249 (61) | 263 (63) | 253 (62) |
| Fracture frequency, no. (%) | 572 (3) | 207 (3) | 207 (50) | 207 (51) | 207 (50) | 207 (50) |
| BMI, mean (SD) | 28 (7) | 28 (7) | 25 (5) | 25 (5) | 24 (5) | 24 (4) |
| Fall frequency, no. (%) | 3214 (18) | 1139 (19) | 133 (32) | 160 (40) | 174 (42) | 165 (40) |
| Pain frequency, no. (%) | 9010 (51) | 2960 (50) | 164 (39) | 137 (34) | 117 (28) | 104 (25) |
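The case-control cohorts in the table (e.g. cc-Dem, matched on age + gender) can be built by exact matching on discretized covariates, pairing each fracture case with an unused control that shares the same covariate values. A hypothetical sketch; the field names (`age_band`, `sex`) and the matching-without-replacement scheme are illustrative assumptions, not the study's actual pipeline:

```python
import random

def match_controls(cases, controls, keys, seed=0):
    """Pair each case with one unused control agreeing on every field in `keys`."""
    rng = random.Random(seed)
    pool = {}  # covariate tuple -> available controls
    for c in controls:
        pool.setdefault(tuple(c[k] for k in keys), []).append(c)
    for bucket in pool.values():
        rng.shuffle(bucket)  # randomize which control gets used first
    pairs = []
    for case in cases:
        bucket = pool.get(tuple(case[k] for k in keys), [])
        if bucket:  # cases with no remaining match are dropped
            pairs.append((case, bucket.pop()))
    return pairs
```

Dropping unmatched cases is why the matched cohorts above shrink as the matching criteria tighten (cc-PtHp has fewer patients and scanners than cc-Random).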
Fig. 3 Deep learning hip fracture from radiographs succeeds until all patient and hospital process variables are controlled for. a The association between each metadata variable and fracture, colored by how the test cohort was sampled; asterisks (*) indicate a Fisher's exact test with p < 0.05. b ROC and d precision-recall curves for the image classifier tested on differentially sampled test sets; the best operating point is indicated with crosshairs, and (*) marks a 95% confidence interval that does not include 0.5. c Summary of b with 95% bootstrap confidence intervals
Fig. 4 Deep learning a compendium of patient data, either by directly combining image features, PT, and HP variables in multimodal models, or by secondarily ensembling image-only model predictions with PT and HP variables. a Experiment schematic of the CAD simulation scenario wherein a physician secondarily integrates image-only predictions with other clinical data (modeled as a Naive Bayes ensemble). b ROC and c precision-recall curves for classifiers tested on differentially sampled test sets; the best operating point is indicated with crosshairs. d Summary of b with 95% bootstrap confidence intervals
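The Naive Bayes ensemble in panel a treats the image-only probability and a probability derived from PT/HP data as conditionally independent evidence, so their log-odds contributions relative to the prior simply add. A stdlib sketch of that combination rule (an illustration of the general technique, not the authors' implementation):

```python
import math

def logit(p):
    """Log-odds of probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def naive_bayes_ensemble(p_image, p_tabular, prior):
    """Posterior under conditional independence: the prior's log-odds plus each
    model's log-odds shift away from the prior."""
    z = logit(prior) + (logit(p_image) - logit(prior)) + (logit(p_tabular) - logit(prior))
    return 1 / (1 + math.exp(-z))
```

Because the independence assumption discards interactions between image features and PT/HP variables, a single multimodal model that sees all inputs jointly can do better, which is consistent with the comparison in panels b-d.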