Yu-Xing Tang, You-Bao Tang, Yifan Peng, Ke Yan, Mohammadhadi Bagheri, Bernadette A Redd, Catherine J Brandon, Zhiyong Lu, Mei Han, Jing Xiao, Ronald M Summers.
Abstract
As one of the most ubiquitous diagnostic imaging tests in medical practice, chest radiography requires timely reporting of potential findings and diagnosis of diseases in the images. Automated, fast, and reliable detection of diseases based on chest radiography is a critical step in radiology workflow. In this work, we developed and evaluated various deep convolutional neural networks (CNN) for differentiating between normal and abnormal frontal chest radiographs, in order to help alert radiologists and clinicians to potential abnormal findings as a means of work list triaging and reporting prioritization. A CNN-based model achieved an AUC of 0.9824 ± 0.0043 (with an accuracy of 94.64 ± 0.45%, a sensitivity of 96.50 ± 0.36% and a specificity of 92.86 ± 0.48%) for normal versus abnormal chest radiograph classification. The CNN model obtained an AUC of 0.9804 ± 0.0032 (with an accuracy of 94.71 ± 0.32%, a sensitivity of 92.20 ± 0.34% and a specificity of 96.34 ± 0.31%) for normal versus lung opacity classification. Classification performance on the external dataset showed that the CNN model is likely to be highly generalizable, with an AUC of 0.9444 ± 0.0029. The CNN model pre-trained on cohorts of adult patients and fine-tuned on pediatric patients achieved an AUC of 0.9851 ± 0.0046 for normal versus pneumonia classification. Pretraining with natural images demonstrates benefit for a moderate-sized training image set of about 8500 images. The remarkable performance in diagnostic accuracy observed in this study shows that deep CNNs can accurately and effectively differentiate normal and abnormal chest radiographs, thereby providing potential benefits to radiology workflow and patient care. © This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply 2020.
Keywords: Biomedical engineering; Radiography
Year: 2020 PMID: 32435698 PMCID: PMC7224391 DOI: 10.1038/s41746-020-0273-z
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Classification performance metrics for different CNN architectures on the NIH “ChestX-ray 14” database.
| Models | AUC | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | F1 score | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| AlexNet (P) | 0.9741 ± 0.0050 | 94.18 ± 0.47 | 87.70 ± 0.56 | 87.66 ± 0.61 | 94.10 ± 0.41 | 0.9091 ± 0.0057 | 90.85 ± 0.48 |
| AlexNet (S) | 0.9684 ± 0.0043 | 92.65 ± 0.45 | 87.99 ± 0.41 | 87.94 ± 0.57 | 92.68 ± 0.38 | 0.9023 ± 0.0052 | 90.25 ± 0.45 |
| VGG16 (P) | 0.9797 ± 0.0039 | 94.03 ± 0.36 | 90.74 ± 0.41 | 90.56 ± 0.45 | 94.14 ± 0.43 | 0.9226 ± 0.0038 | 92.34 ± 0.40 |
| VGG16 (S) | 0.9742 ± 0.0044 | 93.42 ± 0.40 | 91.46 ± 0.46 | 91.18 ± 0.50 | 93.63 ± 0.46 | 0.9228 ± 0.0040 | 92.41 ± 0.42 |
| VGG19 (P) | 0.9842 ± 0.0036 | 97.09 ± 0.39 | 87.99 ± 0.35 | 88.42 ± 0.41 | 96.97 ± 0.43 | 0.9255 ± 0.0035 | 92.41 ± 0.33 |
| VGG19 (S) | 0.9757 ± 0.0054 | 94.49 ± 0.59 | 88.86 ± 0.49 | 88.90 ± 0.56 | 94.46 ± 0.47 | 0.9161 ± 0.0048 | 91.59 ± 0.50 |
| ResNet18 (P) | 0.9824 ± 0.0043 | 96.50 ± 0.36 | 92.86 ± 0.48 | 92.84 ± 0.55 | 96.52 ± 0.30 | 0.9463 ± 0.0041 | 94.64 ± 0.45 |
| ResNet18 (S) | 0.9766 ± 0.0034 | 96.63 ± 0.41 | 85.09 ± 0.33 | 85.97 ± 0.47 | 96.39 ± 0.36 | 0.9099 ± 0.0034 | 90.70 ± 0.38 |
| ResNet50 (P) | 0.9837 ± 0.0048 | 96.94 ± 0.50 | 88.42 ± 0.61 | 88.78 ± 0.73 | 96.83 ± 0.39 | 0.9268 ± 0.0055 | 92.56 ± 0.54 |
| ResNet50 (S) | 0.9775 ± 0.0057 | 94.32 ± 0.54 | 90.59 ± 0.66 | 90.43 ± 0.75 | 94.42 ± 0.44 | 0.9233 ± 0.0059 | 92.40 ± 0.60 |
| Inception-v3 (P) | 0.9866 ± 0.0041 | 97.38 ± 0.35 | 87.57 ± 0.48 | 88.11 ± 0.55 | 97.26 ± 0.27 | 0.9250 ± 0.0051 | 92.33 ± 0.42 |
| Inception-v3 (S) | 0.9796 ± 0.0034 | 95.08 ± 0.32 | 89.58 ± 0.35 | 89.58 ± 0.42 | 95.08 ± 0.23 | 0.9225 ± 0.0047 | 92.25 ± 0.37 |
| DenseNet121 (P) | 0.9871 ± 0.0057 | 97.40 ± 0.53 | 87.55 ± 0.68 | 88.09 ± 0.74 | 97.27 ± 0.33 | 0.9251 ± 0.0056 | 92.34 ± 0.56 |
| DenseNet121 (S) | 0.9801 ± 0.0044 | 95.10 ± 0.38 | 90.01 ± 0.49 | 90.00 ± 0.61 | 95.11 ± 0.27 | 0.9248 ± 0.0041 | 92.49 ± 0.44 |
CNN model predictions were compared with the consensus labels of three board-certified radiologists.
AUC area under the receiver operating characteristic curve, PPV positive predictive value (or precision), NPV negative predictive value.
P: model weights were initialized from the ImageNet pre-trained model. S: random initialization of model weights, i.e., training from scratch.
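All of the table's threshold-dependent metrics derive from the same four confusion-matrix counts. A minimal sketch of those relationships (the counts below are hypothetical, not the paper's; "abnormal" is the positive class):

```python
def classification_metrics(tp, fn, tn, fp):
    """Threshold-dependent metrics from confusion-matrix counts.

    tp/fn: abnormal radiographs classified correctly/incorrectly;
    tn/fp: normal radiographs classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)                 # true positive rate (recall)
    specificity = tn / (tn + fp)                 # true negative rate
    ppv = tp / (tp + fp)                         # positive predictive value (precision)
    npv = tn / (tn + fn)                         # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for a balanced 1000-abnormal / 1000-normal test set
m = classification_metrics(tp=965, fn=35, tn=929, fp=71)
```

Note that F1 combines PPV and sensitivity only, so it can differ noticeably from accuracy when the two classes are imbalanced.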
Fig. 1 Receiver operating characteristic curves (ROCs) for different ImageNet pre-trained CNN architectures versus radiologists on the NIH “ChestX-ray 14” dataset.
a Labels voted by the majority of radiologists as the ground-truth reference standard. b Labels derived from text reports as the ground-truth reference standard. Radiologists’ performance levels are represented as single points (or a cross for the attending radiologist who wrote the radiology report). AUC area under the curve.
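The majority-vote reference standard in panel a can be sketched as follows (illustrative only; the label strings are assumptions, and an odd number of reads guarantees no tie for a binary label):

```python
from collections import Counter

def consensus_label(reads):
    """Consensus ground-truth label: majority vote over an odd number of radiologist reads."""
    return Counter(reads).most_common(1)[0][0]

consensus_label(["abnormal", "normal", "abnormal"])  # -> "abnormal"
```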
Fig. 2 Comparison of model performance with different radiologists.
a Performance of different CNN architectures with different input image sizes on the NIH “ChestX-ray 14” dataset. CNN weights were initialized from the ImageNet pre-trained models. Performances are not significantly different among the input image sizes. The error bars represent the standard deviations around the mean values. b True positive rate (sensitivity) and false positive rate (1 − specificity) of different radiologists (#1, #2, #3, and #4) against different ground-truth labels. Left depicts performance comparisons when setting the consensus of radiologists as ground truth. Right depicts comparisons when setting labels from the attending radiologist as ground truth. AR attending radiologist, CR consensus of radiologists (vote by the majority of three board-certified radiologists), AI the artificial intelligence model (ResNet18 CNN model shown here).
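Unlike the single operating points plotted for each radiologist, AUC is threshold-free: it equals the probability that a randomly chosen abnormal case receives a higher abnormality score than a randomly chosen normal case (the Mann-Whitney formulation of the area under the ROC curve). A self-contained sketch with made-up scores:

```python
def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up abnormality scores for 3 abnormal and 3 normal radiographs
auc = auc_mann_whitney([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])  # -> 8/9
```

The quadratic pairwise loop is fine for illustration; in practice a rank-based computation handles large test sets efficiently.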
Fig. 3 Model performance on different datasets and tasks.
a Confusion matrices of VGG-19 CNN model performance on different datasets. Left: RSNA challenge dataset for normal versus abnormal classification. Middle: RSNA challenge dataset for normal versus lung opacity classification. Right: Indiana dataset for normal versus abnormal classification. b ROCs of CNN models’ performance on different datasets and tasks. Pre-train: CNNs pre-trained on the NIH “ChestX-ray 14” dataset for normal versus abnormal classification as weight initialization.
Fig. 4 Model visualization.
For each group of images, the left panel is the original radiograph and the right panel is the heatmap overlaid on the radiograph. Areas where the heatmap peaks indicate regions predicted as abnormal with high probability. a Findings: right lung pneumothorax. All four radiologists labeled it as abnormal. The CNN model predicts it as abnormal with a confidence score of 0.9977. b Impression: no evidence of lung infiltrate thoracic infection. Two of four radiologists labeled it as normal; the other two labeled it as abnormal. Model prediction: abnormal, score: 0.9040. c Impression: normal chest. All four radiologists labeled it as normal. Model prediction: normal, abnormality score: 0.0506. d Findings: minimally increased lung markings are noted in the lung bases and perihilar area consistent with fibrosis unchanged. Two of four radiologists labeled it as abnormal; the other two labeled it as normal. Model prediction: normal, score: 0.1538. Findings and impressions were extracted from the associated text reports.