Mike Voets, Kajsa Møllersen, Lars Ailo Bongo.
Abstract
We attempted to reproduce the results of "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs", published in JAMA 2016; 316(22), using publicly available data sets. We re-implemented the main method of the original study because its source code is not available. The original study used non-public fundus images from EyePACS and three hospitals in India for training. We used a different EyePACS data set from Kaggle. The original study used the benchmark data set Messidor-2 to evaluate the algorithm's performance. We used another distribution of the Messidor-2 data set, since the original data set is no longer available. In the original study, ophthalmologists re-graded all images for diabetic retinopathy, macular edema, and image gradability. We have one diabetic retinopathy grade per image for our data sets, and we assessed image gradability ourselves. We were not able to reproduce the original study's results with publicly available data. Our algorithm's area under the receiver operating characteristic curve (AUC) of 0.951 (95% CI, 0.947-0.956) on the Kaggle EyePACS test set and 0.853 (95% CI, 0.835-0.871) on Messidor-2 did not come close to the AUC of 0.99 reported for both test sets in the original study. This gap may be caused by our use of a single grade per image or by differences in the data. This study shows the challenges of reproducing deep learning results and the need for more replication and reproduction studies to validate deep learning methods, especially for medical image analysis. Our source code and instructions are available at: https://github.com/mikevoets/jama16-retina-replication.
Year: 2019 PMID: 31170223 PMCID: PMC6553744 DOI: 10.1371/journal.pone.0217541
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Ungradable images.
Examples of images that are ungradable because they are out of focus, underexposed, or overexposed.
Fig 2. Grading tool.
Screenshot of grading tool used to assess gradability for all images.
Fig 3. Reproduced results (AUC).
Area under receiver operating characteristic curve for the reproduced algorithm.
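For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen image with referable disease receives a higher score than a randomly chosen image without it. The following is a minimal sketch of that definition, not the code used in the study; the function name `roc_auc` and the toy labels and scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth grades (1 = disease present).
    scores: iterable of model scores, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the study): 3 of 4 pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of images, which is fine for illustration; a rank-based implementation from a statistics library would be the usual choice on full test sets.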
Reproduced results.
Performance of the reproduction on the test sets, compared to results from the original study. The results of the original study are shown in parentheses.
| Test set | High-sensitivity operating point | High-specificity operating point | AUC |
|---|---|---|---|
| Kaggle EyePACS test | 90.6 (97.5)% sens., 84.7 (93.4)% spec. | 83.6 (90.3)% sens., 92.0 (98.1)% spec. | 0.951 (0.991) |
| Messidor-2 | 81.8 (96.1)% sens., 71.2 (93.9)% spec. | 68.7 (87.0)% sens., 88.5 (98.5)% spec. | 0.853 (0.990) |
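Each test set above has two sensitivity/specificity pairs because the same model scores are thresholded at two different operating points. The sketch below shows how one threshold yields one such pair. It is an illustration under assumed names and toy data (`sensitivity_specificity`, the labels, scores, and thresholds are all hypothetical) and does not reproduce how the study selected its thresholds.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive.

    labels: iterable of 0/1 ground-truth grades (1 = disease present).
    scores: iterable of model scores, higher meaning more likely positive.
    """
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): three diseased and four healthy images.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# A low threshold favours sensitivity, a high one favours specificity.
low = sensitivity_specificity(labels, scores, threshold=0.25)
high = sensitivity_specificity(labels, scores, threshold=0.65)
print(low)
print(high)
```

Lowering the threshold catches more diseased images at the cost of more false alarms, which is the trade-off the two operating points in the table express.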