Jaakko Sahlsten, Joel Jaskari, Jyri Kivinen, Lauri Turunen, Esa Jaanio, Kustaa Hietala, Kimmo Kaski.
Abstract
Diabetes is a globally prevalent disease that can cause visible microvascular complications, such as diabetic retinopathy and macular edema, in the retina of the human eye. Retinal images are today used for manual disease screening and diagnosis. This labor-intensive task could greatly benefit from automatic detection using deep learning techniques. Here we present a deep learning system that identifies referable diabetic retinopathy comparably to or better than the systems presented in previous studies, although we use only a small fraction of the images (<1/4) in training, aided by higher image resolutions. We also provide novel results for five different screening and clinical grading systems for diabetic retinopathy and macular edema classification, including state-of-the-art results for accurately classifying images according to the clinical five-grade diabetic retinopathy scale and, for the first time, the four-grade diabetic macular edema scale. These results suggest that a deep learning system could increase the cost-effectiveness of screening and diagnosis while attaining higher than recommended performance, and that the system could be applied in clinical examinations requiring finer grading.
Year: 2019 PMID: 31341220 PMCID: PMC6656880 DOI: 10.1038/s41598-019-47181-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Dataset summary.
| Grading system | Patients/Label | Training | Tuning | Primary validation |
|---|---|---|---|---|
| RDR | Patients, No. | 8694 | 1313 | 2477 |
| RDR | Images Total, No. (%) | 24806 (100) | 3706 (100) | 7118 (100) |
| RDR | Images Grade 0, No. (%) | 13895 (56.0) | 2079 (56.1) | 4031 (56.6) |
| RDR | Images Grade 1, No. (%) | 10911 (44.0) | 1627 (43.9) | 3087 (43.4) |
| PIRC | Patients, No. | 8770 | 1259 | 2455 |
| PIRC | Images Total, No. (%) | 24941 (100) | 3560 (100) | 7129 (100) |
| PIRC | Images Grade 0, No. (%) | 11160 (44.7) | 1573 (44.2) | 3229 (45.3) |
| PIRC | Images Grade 1, No. (%) | 2793 (11.2) | 408 (11.5) | 842 (11.8) |
| PIRC | Images Grade 2, No. (%) | 9221 (37.0) | 1312 (36.9) | 2597 (36.4) |
| PIRC | Images Grade 3, No. (%) | 1480 (5.9) | 225 (6.3) | 382 (5.4) |
| PIRC | Images Grade 4, No. (%) | 287 (1.2) | 42 (1.2) | 79 (1.1) |
| RDME | Patients, No. | 8669 | 1281 | 2534 |
| RDME | Images Total, No. (%) | 24651 (100) | 3675 (100) | 7304 (100) |
| RDME | Images Grade 0, No. (%) | 20819 (84.5) | 3113 (84.7) | 6162 (84.4) |
| RDME | Images Grade 1, No. (%) | 3832 (15.5) | 562 (15.3) | 1142 (15.6) |
| PIMEC | Patients, No. | 8708 | 1242 | 2534 |
| PIMEC | Images Total, No. (%) | 24791 (100) | 3535 (100) | 7304 (100) |
| PIMEC | Images Grade 0, No. (%) | 20958 (84.5) | 2974 (84.1) | 6162 (84.4) |
| PIMEC | Images Grade 1, No. (%) | 1531 (6.2) | 237 (6.7) | 465 (6.4) |
| PIMEC | Images Grade 2, No. (%) | 1566 (6.3) | 222 (6.2) | 438 (6.0) |
| PIMEC | Images Grade 3, No. (%) | 736 (3.0) | 102 (2.9) | 239 (3.3) |
| QRDR | Patients, No. | 10232 | 1466 | 2926 |
| QRDR | Images Total, No. (%) | 28787 (100) | 4109 (100) | 8226 (100) |
| QRDR | Images Grade 0, No. (%) | 3827 (13.3) | 533 (13.0) | 1132 (13.8) |
| QRDR | Images Grade 1, No. (%) | 14005 (48.7) | 1991 (48.5) | 4009 (48.7) |
| QRDR | Images Grade 2, No. (%) | 10955 (38.1) | 1585 (38.6) | 3085 (37.5) |
Class distributions, shown as counts and percentages, for the training, tuning, and primary validation divisions of the dataset used in the experiments.
Comparison of classification results for referable diabetic retinopathy.
| Author | Train samples | Validation samples | Input size | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| Gulshan | 118419 | 8788 | 299 | 0.991 (0.988–0.993)ᵃ | 0.903 (0.875–0.927)ᵃ | 0.981 (0.978–0.985)ᵃ |
| Ting | 76370 | 71896 | 512 | 0.936 (0.925–0.943)ᵇ | 0.905 (0.873–0.930)ᶜ | 0.916 (0.910–0.922)ᶜ |
| Ours | 28512 | 7118 | 2095 | 0.987 (0.984–0.989)ᵃ | 0.896 (0.885–0.907)ᵃ | 0.974 (0.969–0.979)ᵃ |
The train and validation samples refer to the numbers of images in the respective sets, and the input size refers to the image width and height in pixels. The tuning set is included in the train sample size. Our sensitivity and specificity are calculated at the 0.900 sensitivity operating point, for comparison at an operating point similar to those of Gulshan et al.[4] and Ting et al.[5].
ᵃ 95% exact CI calculated with the Clopper-Pearson method.
ᵇ 95% asymptotic, bias-corrected CI calculated with cluster-bootstrap on the patient level.
ᶜ 95% asymptotic CI calculated for each logit with a cluster sandwich estimator on the patient level.
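The exact Clopper-Pearson interval named in the footnote can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names are hypothetical, and the count of correctly flagged images (2766 of the 3087 RDR-positive validation images) is an assumption back-calculated from the reported sensitivity of 0.896.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

def _root_of_increasing(g, iters=60):
    """Bisection on [0, 1] for a function g that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_of_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2.0)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_of_increasing(
        lambda p: alpha / 2.0 - binom_cdf(k, n, p))
    return lower, upper

# Assumed counts: 3087 RDR-positive validation images, 2766 of them detected.
low, high = clopper_pearson(2766, 3087)
print(f"sensitivity {2766 / 3087:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```

Under these assumed counts the interval should land close to the (0.885–0.907) reported for our model, although the exact counts behind the published figures are not given in the text.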
Figure 1. ROC curves for classifying nonreferable vs. referable diabetic retinopathy and nonreferable vs. referable diabetic macular edema on the primary validation set and the Messidor set. (A) NRDR/RDR classification on the primary validation set (N = 7118). (B) NRDR/RDR classification on the Messidor set (N = 1200). (C) NRDME/RDME classification on the primary validation set (N = 7304). (D) NRDME/RDME classification on the Messidor set (N = 1200). Diabetic retinopathy is shown in (A,B) and diabetic macular edema in (C,D). ROC curves are shown for input image sizes of 256 × 256, 299 × 299, 512 × 512, 1024 × 1024 and 2095 × 2095 pixels. AUC is shown in parentheses in the legend.
Classification results for PIRC, QRDR and PIMEC with varying input image sizes on the primary validation set.
| Grading system | Input size | Macro-AUC | Accuracy | Quadratic-Weighted Kappa |
|---|---|---|---|---|
| PIRC | 256 | 0.901 | 0.751 | 0.772 |
| PIRC | 299 | 0.919 | 0.785 | 0.834 |
| PIRC | 512 | 0.951 | 0.838 | 0.894 |
| PIRC | 1024 | 0.961 | | |
| PIRC | 2095* | | 0.869 | 0.910 |
| PIRC | 6 × 512ᵃ | 0.958 | | 0.904 |
| QRDR | 256 | 0.977 | 0.912 | 0.901 |
| QRDR | 299 | 0.981 | 0.922 | 0.914 |
| QRDR | 512 | 0.989 | 0.937 | 0.930 |
| QRDR | 1024 | | | |
| QRDR | 2095* | | 0.925 | 0.914 |
| QRDR | 6 × 512ᵃ | | | |
| PIMEC | 256 | 0.959 | 0.928 | 0.813 |
| PIMEC | 299 | 0.970 | 0.923 | 0.803 |
| PIMEC | 512 | 0.979 | 0.935 | 0.832 |
| PIMEC | 1024 | 0.978 | | 0.846 |
| PIMEC | 2095* | | 0.934 | |
| PIMEC | 6 × 512ᵃ | | | |
Macro-AUC refers to the area under the macro-average of the per-class ROC curves, computed in a one-vs-all manner.
*Model trained using instance normalization layers and an optimizer with accumulation of 15 mini-batches.
ᵃ Ensemble of six classifiers trained on the same data with the same input size.
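The one-vs-all macro-AUC described in the note can be illustrated with a short sketch. It assumes the macro value is the unweighted mean of the per-class AUCs, which matches the area under a macro-averaged ROC curve when the per-class curves are averaged over a common false-positive-rate axis. The function names and toy probabilities are hypothetical and are not taken from the study.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_auc(probs, y_true, num_classes):
    """Mean of one-vs-all AUCs; also returns the per-class values."""
    per_class = []
    for c in range(num_classes):
        scores = [row[c] for row in probs]             # predicted probability of class c
        labels = [1 if y == c else 0 for y in y_true]  # class c against all others
        per_class.append(binary_auc(scores, labels))
    return sum(per_class) / num_classes, per_class

# Toy three-class example (made-up numbers).
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
y_true = [0, 0, 1, 1, 2, 2]
macro, per_class = macro_auc(probs, y_true, num_classes=3)
print(macro, per_class)
```

The pairwise comparison is quadratic in the number of samples, which is fine for a sketch; a rank-based computation would be the usual choice for validation sets with thousands of images.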
Figure 2. ROC curves for the best performing model, selected by macro-AUC, in each of the multiclass classification tasks (PIRC, PIMEC and QRDR). (A) PIRC classification on the primary validation set (N = 7129) with input size 2095 × 2095. (B) PIMEC classification on the primary validation set (N = 7304) with input size 2095 × 2095. (C) QRDR classification on the primary validation set (N = 8226) with input size 1024 × 1024. ROC curves are shown for each class in a one-vs-all strategy, together with the macro-average ROC. The positive class is marked in the legend, with AUC shown in parentheses.
Classification results for RDR and RDME with varying input image sizes on the Messidor dataset.
| Grading system | Input size | AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|---|
| RDR | 256 | 0.905 (0.887–0.921) | 0.826 (0.790–0.858) | 0.872 (0.845–0.896) | 0.853 (0.831–0.872) |
| RDR | 299 | 0.940 (0.926–0.953) | 0.906 (0.877–0.930) | 0.824 (0.794–0.852) | 0.859 (0.838–0.878) |
| RDR | 512 | 0.966 (0.954–0.976) | | 0.811 (0.780–0.840) | 0.868 (0.848–0.887) |
| RDR | 1024* | 0.957 (0.944–0.968) | 0.853 (0.820–0.883) | 0.955 (0.937–0.969) | 0.912 (0.894–0.927) |
| RDR | 2095*,** | | 0.859 (0.826–0.888) | | |
| RDR | 6 × 512ᵃ | 0.965 (0.953–0.974) | 0.920 (0.903–0.935) | 0.871 (0.851–0.889) | 0.892 (0.873–0.909) |
| RDME | 256 | 0.917 (0.900–0.932) | 0.619 (0.553–0.683) | 0.969 (0.956–0.979) | 0.903 (0.885–0.919) |
| RDME | 299 | 0.925 (0.908–0.939) | 0.633 (0.566–0.696) | 0.975 (0.964–0.984) | 0.911 (0.893–0.926) |
| RDME | 512 | 0.931 (0.915–0.944) | | 0.989 (0.980–0.994) | |
| RDME | 1024* | 0.888 (0.869–0.905) | 0.606 (0.539–0.670) | 0.991 (0.983–0.996) | 0.918 (0.901–0.933) |
| RDME | 2095*,** | | 0.597 (0.530–0.662) | | 0.917 (0.900–0.932) |
| RDME | 6 × 512ᵃ | | 0.575 (0.547–0.603) | | 0.916 (0.899–0.931) |
Classification on the Messidor set[15]. Sensitivity, specificity and accuracy are measured at the 0.900 sensitivity operating point of the tuning set. 95% exact Clopper-Pearson confidence intervals are given in parentheses.
*Messidor images upscaled from an input image size of 900 × 900 pixels using bicubic interpolation.
**Model trained using instance normalization layers and an optimizer with accumulation of 15 mini-batches.
ᵃ Ensemble of six classifiers trained on the same data with the same input size.
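Fixing the operating point on the tuning set and then measuring sensitivity, specificity and accuracy on another set can be sketched as below. The selection rule (the highest threshold that still reaches the target sensitivity, with scores at or above the threshold counted as referable) is an assumption, since the notes only state that the 0.900 sensitivity operating point of the tuning set was used. All names and scores are illustrative.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.900):
    """Highest score threshold whose sensitivity on this set is at least `target`."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("no positive samples")
    # Number of positives that must be flagged; the epsilon guards float round-off.
    needed = max(1, math.ceil(target * len(pos) - 1e-9))
    return pos[needed - 1]

def evaluate(scores, labels, threshold):
    """Sensitivity, specificity and accuracy when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / (tp + tn + fp + fn)

# Made-up tuning scores: ten referable images followed by ten nonreferable ones.
tune_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
               0.50, 0.40, 0.30, 0.25, 0.10, 0.05, 0.58, 0.15, 0.35, 0.45]
tune_labels = [1] * 10 + [0] * 10
thr = threshold_for_sensitivity(tune_scores, tune_labels)

# Made-up external scores: the threshold is reused unchanged.
ext_scores = [0.9, 0.6, 0.5, 0.3, 0.1, 0.7, 0.2, 0.4]
ext_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(thr, evaluate(tune_scores, tune_labels, thr), evaluate(ext_scores, ext_labels, thr))
```

Because the threshold is frozen on the tuning data, the sensitivity measured on a different set can drift away from 0.900, which is consistent with the RDME sensitivities of roughly 0.6 in the Messidor table.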
Confusion matrices for PIRC, PIMEC and QRDR classification tasks with varying input size on the primary validation set.
| Input size | True class | PIRC 0 | PIRC 1 | PIRC 2 | PIRC 3 | PIRC 4 | PIMEC 0 | PIMEC 1 | PIMEC 2 | PIMEC 3 | QRDR 0 | QRDR 1 | QRDR 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 256 | 0 | 2982 | 147 | 99 | 0 | 1 | 6053 | 49 | 38 | 22 | 1084 | 45 | 3 |
| 256 | 1 | 500 | 250 | 92 | 0 | 0 | 98 | 332 | 32 | 3 | 18 | 3736 | 255 |
| 256 | 2 | 343 | 187 | 2040 | 27 | 0 | 80 | 61 | 259 | 38 | 0 | 401 | 2684 |
| 256 | 3 | 13 | 8 | 279 | 82 | 0 | 30 | 10 | 66 | 133 | | | |
| 256 | 4 | 3 | 0 | 51 | 24 | 1 | | | | | | | |
| 299 | 0 | 3027 | 142 | 60 | 0 | 0 | 5977 | 79 | 70 | 36 | 1093 | 38 | 1 |
| 299 | 1 | 438 | 283 | 121 | 0 | 0 | 64 | 353 | 42 | 6 | 35 | 3717 | 257 |
| 299 | 2 | 197 | 198 | 2128 | 74 | 0 | 63 | 45 | 295 | 35 | 0 | 311 | 2774 |
| 299 | 3 | 6 | 5 | 215 | 154 | 2 | 21 | 9 | 95 | 114 | | | |
| 299 | 4 | 4 | 0 | 45 | 25 | 5 | | | | | | | |
| 512 | 0 | 3178 | 36 | 15 | 0 | 0 | 6015 | 51 | 78 | 18 | 1108 | 22 | 2 |
| 512 | 1 | 296 | 440 | 104 | 2 | 0 | 60 | 374 | 29 | 2 | 50 | 3786 | 173 |
| 512 | 2 | 114 | 201 | 2088 | 193 | 1 | 57 | 49 | 317 | 15 | 1 | 271 | 2813 |
| 512 | 3 | 3 | 3 | 119 | 256 | 1 | 19 | 8 | 91 | 121 | | | |
| 512 | 4 | 4 | 0 | 21 | 40 | 14 | | | | | | | |
| 1024 | 0 | 3106 | 95 | 28 | 0 | 0 | 6078 | 45 | 29 | 10 | 1115 | 16 | 1 |
| 1024 | 1 | 149 | 579 | 111 | 0 | 3 | 63 | 364 | 36 | 2 | 78 | 3732 | 199 |
| 1024 | 2 | 54 | 195 | 2257 | 89 | 2 | 70 | 55 | 268 | 45 | 0 | 218 | 2867 |
| 1024 | 3 | 4 | 2 | 138 | 222 | 16 | 30 | 6 | 67 | 136 | | | |
| 1024 | 4 | 4 | 0 | 16 | 18 | 41 | | | | | | | |
| 2095* | 0 | 3185 | 27 | 16 | 1 | 0 | 6024 | 68 | 51 | 19 | 1069 | 63 | 0 |
| 2095* | 1 | 169 | 517 | 156 | 0 | 0 | 42 | 388 | 29 | 6 | 22 | 3926 | 61 |
| 2095* | 2 | 81 | 143 | 2304 | 69 | 0 | 47 | 69 | 262 | 60 | 1 | 472 | 2612 |
| 2095* | 3 | 3 | 1 | 188 | 189 | 1 | 14 | 6 | 71 | 148 | | | |
| 2095* | 4 | 4 | 0 | 25 | 47 | 3 | | | | | | | |
Ground truth shown in rows and predicted classes in columns. PIRC classes (0 = no apparent DR, 1 = mild NPDR, 2 = moderate NPDR, 3 = severe NPDR, 4 = PDR), PIMEC classes (0 = no apparent DME, 1 = mild DME, 2 = moderate DME, 3 = severe DME) and QRDR classes (0 = ungradable, 1 = NRDR, 2 = RDR).
*Model trained using instance normalization layers, instead of batch normalization, and optimizer updates accumulated over 15 mini-batches.
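Accuracy and quadratic-weighted kappa can be recomputed directly from a confusion matrix. The sketch below uses the first PIRC matrix listed above (rows are ground truth, columns are predictions) and the standard kappa definition with weights (i − j)² / (K − 1)². The function names are illustrative, and whether the authors used exactly this formulation is an assumption.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def quadratic_weighted_kappa(cm):
    """Cohen's kappa with quadratic disagreement weights (rows: truth, columns: prediction)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            observed += w * cm[i][j]
            # Expected count if truth and prediction were independent.
            expected += w * row_tot[i] * col_tot[j] / n
    return 1.0 - observed / expected

# First PIRC confusion matrix from the table above.
pirc = [
    [2982, 147,   99,  0, 1],
    [ 500, 250,   92,  0, 0],
    [ 343, 187, 2040, 27, 0],
    [  13,   8,  279, 82, 0],
    [   3,   0,   51, 24, 1],
]
print(round(accuracy(pirc), 3), round(quadratic_weighted_kappa(pirc), 3))
```

For this matrix the two values come out at about 0.751 and 0.772, the accuracy and kappa that the results table reports for PIRC at input size 256.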
Classification results for NRDR/RDR and NRDME/RDME with varying input image sizes on the primary validation set.
| Grading system | Input size | AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|---|
| RDR | 256 | 0.961 (0.956–0.965) | 0.895 (0.884–0.906) | 0.913 (0.904–0.921) | 0.905 (0.898–0.912) |
| RDR | 299 | 0.970 (0.966–0.974) | 0.896 (0.884–0.906) | 0.946 (0.938–0.953) | 0.924 (0.918–0.930) |
| RDR | 512 | 0.979 (0.975–0.982) | 0.900 (0.888–0.910) | 0.963 (0.956–0.968) | 0.935 (0.929–0.941) |
| RDR | 1024 | 0.984 (0.981–0.987) | | 0.970 (0.964–0.975) | |
| RDR | 2095* | 0.987 (0.984–0.989) | 0.896 (0.885–0.907) | 0.974 (0.969–0.979) | 0.940 (0.935–0.946) |
| RDR | 6 × 512ᵃ | 0.984 (0.981–0.987) | 0.904 (0.897–0.911) | 0.971 (0.967–0.975) | 0.942 (0.936–0.947) |
| RDME | 256 | 0.976 (0.973–0.980) | 0.891 (0.871–0.908) | 0.953 (0.948–0.958) | 0.943 (0.938–0.949) |
| RDME | 299 | 0.981 (0.977–0.984) | 0.891 (0.871–0.908) | 0.960 (0.955–0.965) | 0.949 (0.944–0.954) |
| RDME | 512 | 0.986 (0.983–0.989) | 0.890 (0.870–0.907) | 0.976 (0.972–0.980) | 0.963 (0.958–0.967) |
| RDME | 1024 | 0.986 (0.983–0.989) | | 0.974 (0.970–0.978) | 0.966 (0.961–0.970) |
| RDME | 2095* | | 0.905 (0.887–0.922) | | |
| RDME | 6 × 512ᵃ | | 0.904 (0.897–0.910) | | |
Sensitivity, specificity and accuracy are measured at the 0.900 sensitivity operating point of the tuning set. 95% exact Clopper-Pearson confidence intervals are given in parentheses. The AUC and specificity for RDR at input size 2095 are taken from the "Ours" row of the comparison table for referable diabetic retinopathy, which reports the same model on the same validation set.
*Model trained using instance normalization layers and an optimizer with accumulation of 15 mini-batches.
ᵃ Ensemble of six classifiers trained on the same data with the same input size.
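The accumulation of 15 mini-batches mentioned in the footnote can be illustrated on a toy problem. The sketch assumes plain gradient descent on a one-parameter linear model with gradients averaged over equally sized mini-batches. The actual optimizer, loss and averaging rule of the study are not stated here, so every name and number below is hypothetical.

```python
def grad_mse(w, batch):
    """Gradient with respect to w of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd_step_accumulated(w, mini_batches, lr):
    """One parameter update from gradients accumulated over several mini-batches."""
    accumulated = 0.0
    for batch in mini_batches:
        accumulated += grad_mse(w, batch)  # no update yet, only accumulate
    return w - lr * accumulated / len(mini_batches)

# Thirty toy samples of y = 3x, split into 15 mini-batches of two samples each.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 31)]
mini_batches = [data[i:i + 2] for i in range(0, len(data), 2)]

w_accumulated = sgd_step_accumulated(0.0, mini_batches, lr=0.01)
w_large_batch = 0.0 - 0.01 * grad_mse(0.0, data)  # one step on all 30 samples at once
print(w_accumulated, w_large_batch)
```

With equal mini-batch sizes the accumulated update matches a single large-batch update up to floating-point rounding, which lets large input images be trained with only a few of them in memory at a time. Instance normalization computes its statistics per sample, so it does not break this equivalence the way batch-level statistics would.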