Pranav Rajpurkar, Jeremy Irvin, Robyn L Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P Langlotz, Bhavik N Patel, Kristen W Yeom, Katie Shpanskaya, Francis G Blankenberg, Jayne Seekins, Timothy J Amrhein, David A Mong, Safwan S Halabi, Evan J Zucker, Andrew Y Ng, Matthew P Lungren.
Abstract
BACKGROUND: Chest radiograph interpretation is critical for the detection of thoracic diseases, including tuberculosis and lung cancer, which affect millions of people worldwide each year. This time-consuming task typically requires expert radiologists to read the images, leading to fatigue-based diagnostic error and a lack of diagnostic expertise in areas of the world where radiologists are unavailable. Recently, deep learning approaches have achieved expert-level performance on medical image interpretation tasks, powered by large network architectures and fueled by the emergence of large labeled datasets. The purpose of this study was to investigate the performance of a deep learning algorithm in detecting pathologies in chest radiographs compared with practicing radiologists.
Year: 2018 PMID: 30457988 PMCID: PMC6245676 DOI: 10.1371/journal.pmed.1002686
Source DB: PubMed Journal: PLoS Med ISSN: 1549-1277 Impact factor: 11.069
Fig 1. ROC curves of radiologists and algorithm for each pathology on the validation set.
Each plot illustrates the ROC curve of the deep learning algorithm (purple) and practicing radiologists (green) on the validation set, on which the majority vote of 3 cardiothoracic subspecialty radiologists served as ground truth. Individual radiologist (specificity, sensitivity) points are also plotted, where the unfilled triangles represent radiology resident performances and the filled triangles represent BC radiologist performances. The ROC curve of the algorithm is generated by varying the discrimination threshold (used to convert the output probabilities to binary predictions). The radiologist ROC curve is estimated by fitting an increasing concave curve to the radiologist operating points (see S1 Appendix). BC, board-certified; ROC, receiver operating characteristic.
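The thresholding procedure the caption describes is standard: each candidate discrimination threshold binarizes the output probabilities, yielding one (1 − specificity, sensitivity) point on the curve. A minimal sketch, with hypothetical `y_true`/`y_prob` arrays standing in for the validation labels and the algorithm's outputs:

```python
# Sketch of how an ROC curve is traced by sweeping the discrimination
# threshold over a model's output probabilities. `y_true` and `y_prob`
# are illustrative stand-ins, not the study's data.
import numpy as np

def roc_points(y_true, y_prob):
    """Return (FPR, TPR) arrays, one point per candidate threshold."""
    thresholds = np.unique(y_prob)[::-1]      # sweep from high to low
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        y_pred = y_prob >= t                  # binary prediction at threshold t
        tpr.append(((y_pred == 1) & (y_true == 1)).sum() / pos)  # sensitivity
        fpr.append(((y_pred == 1) & (y_true == 0)).sum() / neg)  # 1 - specificity
    return np.array(fpr), np.array(tpr)

# Toy usage with simulated labels and probabilities:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.3 + rng.random(500) * 0.7, 0, 1)
fpr, tpr = roc_points(y_true, y_prob)
print(f"AUC over traced points ≈ {np.trapz(tpr, fpr):.3f}")
```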
Radiologists and algorithm AUC with CIs.
| Pathology | Radiologists AUC (95% CI) | Algorithm AUC (95% CI) | Algorithm − Radiologists Difference (99.6% CI)ᵃ | Advantage |
|---|---|---|---|---|
| Atelectasis | 0.808 (0.777 to 0.838) | 0.862 (0.825 to 0.895) | 0.053 (0.003 to 0.101) | Algorithm |
| Cardiomegaly | 0.888 (0.863 to 0.910) | 0.831 (0.790 to 0.870) | −0.057 (−0.113 to −0.007) | Radiologists |
| Consolidation | 0.841 (0.815 to 0.870) | 0.893 (0.859 to 0.924) | 0.052 (−0.001 to 0.101) | No difference |
| Edema | 0.910 (0.886 to 0.930) | 0.924 (0.886 to 0.955) | 0.015 (−0.038 to 0.060) | No difference |
| Effusion | 0.900 (0.876 to 0.921) | 0.901 (0.868 to 0.930) | 0.000 (−0.042 to 0.040) | No difference |
| Emphysema | 0.911 (0.866 to 0.947) | 0.704 (0.567 to 0.833) | −0.208 (−0.508 to −0.003) | Radiologists |
| Fibrosis | 0.897 (0.840 to 0.936) | 0.806 (0.719 to 0.884) | −0.091 (−0.198 to 0.016) | No difference |
| Hernia | 0.985 (0.974 to 0.991) | 0.851 (0.785 to 0.909) | −0.133 (−0.236 to −0.055) | Radiologists |
| Infiltration | 0.734 (0.688 to 0.779) | 0.721 (0.651 to 0.786) | −0.013 (−0.107 to 0.067) | No difference |
| Mass | 0.886 (0.856 to 0.913) | 0.909 (0.864 to 0.948) | 0.024 (−0.041 to 0.080) | No difference |
| Nodule | 0.899 (0.869 to 0.924) | 0.894 (0.853 to 0.930) | −0.005 (−0.058 to 0.044) | No difference |
| Pleural thickening | 0.779 (0.740 to 0.809) | 0.798 (0.744 to 0.849) | 0.019 (−0.056 to 0.094) | No difference |
| Pneumonia | 0.823 (0.779 to 0.856) | 0.851 (0.781 to 0.911) | 0.028 (−0.087 to 0.125) | No difference |
| Pneumothorax | 0.940 (0.912 to 0.962) | 0.944 (0.915 to 0.969) | 0.004 (−0.040 to 0.051) | No difference |
ᵃThe AUC difference was calculated as the AUC of the algorithm minus the AUC of the radiologists. To account for multiple hypothesis testing, the Bonferroni-corrected CI (1 − 0.05/14; 99.6%) around the difference was computed.
The nonparametric bootstrap was used to estimate the variability around each of the performance measures; 10,000 bootstrap replicates from the validation set were drawn, and each performance measure was calculated for the algorithm and the radiologists on these same 10,000 bootstrap replicates. This produced a distribution for each estimate, and the 95% bootstrap percentile intervals (2.5th and 97.5th percentiles) are reported.
Abbreviations: AUC, area under the receiver operating characteristic curve; CI, confidence interval.
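A minimal sketch of the bootstrap procedure the notes describe: resample the validation set with replacement, recompute the AUC difference on each replicate, and take percentile intervals; the Bonferroni-corrected 99.6% level follows from 1 − 0.05/14. Here `y_true`, `p_algo`, and `p_rads` are hypothetical placeholders, not the study's data or code:

```python
# Nonparametric bootstrap percentile interval for the AUC difference
# (algorithm minus radiologists), with Bonferroni correction across
# the 14 pathologies tested.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_diff_ci(y_true, p_algo, p_rads, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        if len(np.unique(y_true[idx])) < 2:       # AUC needs both classes
            continue
        diffs.append(roc_auc_score(y_true[idx], p_algo[idx])
                     - roc_auc_score(y_true[idx], p_rads[idx]))
    diffs = np.asarray(diffs)
    a = alpha / 14                                # Bonferroni: 1 - 0.05/14 ≈ 99.6%
    lo, hi = np.percentile(diffs, [100 * a / 2, 100 * (1 - a / 2)])
    return lo, hi

# Toy usage with simulated scores:
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
algo = np.clip(y * 0.4 + rng.random(500) * 0.6, 0, 1)
rads = np.clip(y * 0.3 + rng.random(500) * 0.7, 0, 1)
print(bootstrap_diff_ci(y, algo, rads, n_boot=2000))
```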
Fig 2. Performance measures of the algorithm and radiologists on the validation set for mass, nodule, consolidation, and effusion.
Each plot shows the diagnostic measures of the algorithm (purple diamond), the micro-average resident radiologist (unfilled orange diamond), the micro-average BC radiologist (filled orange diamond), individual resident radiologists (unfilled green diamonds), and individual BC radiologists (filled green diamonds). Each diamond has a vertical bar denoting the 95% CI of the estimate, computed using 10,000 bootstrap replicates. The ground truth used to compute each metric was the majority vote of 3 cardiothoracic subspecialty radiologists on each image in the validation set. Kappa refers to Cohen's kappa, and F1 denotes the F1 score. BC, board-certified; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value.
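All of the measures plotted in Fig 2 can be derived from a 2 × 2 confusion matrix against the majority-vote ground truth. A sketch under that assumption (variable names are illustrative, not from the study's code):

```python
# Diagnostic measures from binary predictions vs. binary ground truth.
import numpy as np

def diagnostic_measures(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)                 # sensitivity (recall)
    spec = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    f1 = 2 * tp / (2 * tp + fp + fn)      # F1 score
    # Cohen's kappa: observed agreement corrected for chance agreement
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (po - pe) / (1 - pe)
    return dict(sensitivity=sens, specificity=spec, PPV=ppv,
                NPV=npv, F1=f1, kappa=kappa)

# Toy usage:
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(diagnostic_measures(y_true, y_pred))
```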
Fig 3. Interpreting network predictions using CAMs.
In the original chest radiograph images (left), the pink arrows and circles highlight the locations of the abnormalities; these indicators were not present when the images were input to the algorithm. (a) Frontal chest radiograph (left) demonstrates 2 upper-lobe pulmonary masses in a patient with both right- and left-sided central venous catheters. The algorithm correctly classified and localized both masses, as indicated by the heat maps. (b) Frontal chest radiograph demonstrates airspace opacity in the right lower lobe, consistent with pneumonia. The algorithm correctly classified and localized the abnormality. More examples can be found in S2 Fig. CAM, class activation mapping.
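CAM itself is a simple construction: the heat map is the classifier-weighted sum of the network's final convolutional feature maps for the pathology of interest, then upsampled onto the radiograph. A minimal sketch; the shapes and names here are assumptions for illustration, not the study's implementation:

```python
# Class activation mapping (CAM): weight each final-layer feature map
# by the linear classifier's weight for one class, sum, and normalize.
import numpy as np

def class_activation_map(features, class_weights):
    """
    features:      (C, H, W) activations from the last conv layer
    class_weights: (C,) classifier weights for one pathology
    returns:       (H, W) heat map scaled to [0, 1]
    """
    cam = np.tensordot(class_weights, features, axes=1)  # sum_c w_c * f_c
    cam = np.maximum(cam, 0)          # keep positive evidence (display choice)
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize for overlay
    return cam  # upsample (e.g., bilinearly) to the radiograph's resolution

# Toy usage with hypothetical feature-map dimensions:
feats = np.random.rand(1024, 7, 7).astype(np.float32)
w = np.random.randn(1024).astype(np.float32)
print(class_activation_map(feats, w).shape)  # (7, 7)
```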