Michelle Y T Yip1,2, Gilbert Lim1,3, Zhan Wei Lim3, Quang D Nguyen1, Crystal C Y Chong1, Marco Yu1, Valentina Bellemo1, Yuchen Xie1, Xin Qi Lee1, Haslina Hamzah1, Jinyi Ho1, Tien-En Tan1, Charumathi Sabanayagam1,2, Andrzej Grzybowski4,5, Gavin S W Tan1,2, Wynne Hsu3, Mong Li Lee3, Tien Yin Wong1,2, Daniel S W Ting1,2,6.
Abstract
Deep learning (DL) has been shown to be effective in developing diabetic retinopathy (DR) algorithms, potentially easing the financial and manpower challenges that hinder the implementation of DR screening. However, our systematic review of the literature reveals that few studies have examined the impact of different factors on these DL algorithms, factors that are important for clinical deployment in real-world settings. Using 455,491 retinal images, we evaluated two technical and three image-related factors in the detection of referable DR. For technical factors, we evaluated the performance of four DL models (VGGNet, ResNet, DenseNet, Ensemble) and two computational frameworks (Caffe, TensorFlow); for image-related factors, we evaluated image compression level (reducing image file size to 350, 300, 250, 200, and 150 KB), number of fields (7-field, 2-field, 1-field), and media clarity (pseudophakic vs phakic eyes). The four DL models showed comparable diagnostic performance in detecting referable DR (AUC 0.936-0.944). For the VGGNet model, the two computational frameworks achieved the same AUC (0.936). Performance dropped when image size decreased below 250 KB (AUC 0.936 vs 0.900, p < 0.001) and improved with a greater number of fields (dataset 1: 2-field vs 1-field, AUC 0.936 vs 0.908, p < 0.001; dataset 2: 7-field vs 2-field vs 1-field, AUC 0.949 vs 0.911 vs 0.895). DL performed better in pseudophakic than in phakic eyes (AUC 0.918 vs 0.833, p < 0.001). Image-related factors play a more significant role than technical factors in determining diagnostic performance, underscoring the importance of robust training and testing datasets for DL training and deployment in real-world settings.
Keywords: Population screening; Translational research
Year: 2020 PMID: 32219181 PMCID: PMC7090044 DOI: 10.1038/s41746-020-0247-1
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1Study selection.
Flowchart detailing the systematic literature review conducted to identify suitable studies that have evaluated technical and/or image-related factors that may influence the performance of a DL algorithm in detection of DR.
Technical and image-related challenges to the development of deep learning algorithms for ocular disease detection.
| Category | Challenge | Research question | Paper addressing this question | Answer to research question |
|---|---|---|---|---|
| Technical | Newer convolutional neural networks with an increasing number and complexity of layers may allow for greater depth of analysis but may intensify the burden on hardware processing power and memory. | Does altering the convolutional neural network architecture affect performance? | Current paper | No. Different neural networks do not affect performance. |
| | Differences between computational frameworks in flexibility, applicability, speed, and ease of use may affect choice. | Does altering the computational framework affect performance? | Current paper | No. Different computational frameworks do not affect performance. |
| Image-related | Lack of access to high-quality retinal images due to poor fundus camera specifications, reduced storage space, or compression for tele-ophthalmology. | Does altering the level of compression of the input data affect performance? | Current paper | Yes. Reducing image size below 250 KB drops performance significantly. |
| | Different groups in various countries may possess datasets with varying numbers of fundus fields of view due to disparities in protocols, resources, and manpower. | Does altering the number of fundus fields of view of the input data affect performance? | Current paper | Yes. Performance drops in descending order from 7-field to 2-field to 1-field. |
| | The presence of cataract may impede proper visualization of the fundus and lead to inaccurate diagnosis due to media opacity, light scatter, and aberrations. | Does previous cataract surgery affect performance? | Current paper | Yes. Presence of media opacity in phakic eyes reduces performance. |
| | The range of retinal cameras available to capture fundus images, in terms of camera specifications and requirement for mydriasis, may introduce variability in field of view and image quality. | Does altering the retinal camera used affect performance? | Ting et al. | No. Different retinal cameras do not affect performance. |
| | Ethnic differences in eyes affect optical systems' ability to capture the posterior pole and the identification of the norm (e.g., pigmentation, optic disc size, vasculature). | Do images from various ethnic groups affect performance? | Ting et al. | No. Images from different ethnic groups do not affect performance. |
| | Different populations vary in prevalence rates of ocular disease, affecting the dataset used for validation and the utility of a clinical test deployed in that population. | Does deployment in populations with different disease prevalence rates affect performance? | Ting et al. | No. Deployment in populations with different prevalence rates does not affect performance. |
| | Ocular diseases do not develop in isolation, as many share similar risk factors and occur concurrently in the same patient; distinguishing between manifestations of different diseases is therefore paramount. | Do concurrent related ocular diseases affect performance in detection of an individual disease? | Ting et al. | No. Other existing diseases do not affect the algorithm's ability to detect individual diseases accurately. |
| | The type of study (population-based, clinic-based, or screening cohort) used to collect retinal images may influence the patient demographics of the datasets. | Does the type of study affect performance? | Ting et al. | No. The type of study does not affect performance. |
| | Different countries may use different reference standards for grading of diabetic retinopathy (e.g., grader or ophthalmologist), a product of the resource allocation, expertise, and training available. | Does the reference standard used for labeling of images affect performance? | Ting et al. | No. Different reference standards do not affect performance. |
| | Large datasets in the target population may be scarce and insufficient for the training required for a high-performing algorithm. | Does a smaller training dataset affect performance? | Gulshan et al. | Yes. Datasets below 60,000 images produce large drops in performance. |
| | With the large number of images required for training, time constraints and reduced access to high-quality retinal cameras may limit the use of large high-resolution images for training of deep learning systems. | Does the image size of the training dataset affect performance? | Sahlsten et al. | Yes. Higher-resolution training images produce better performance but increase training time. |
| | Mydriasis may provide greater visualization for photographic capture of the posterior pole, potentially influencing the quality of fundus photographs. | Do mydriatic photographs improve performance compared with non-mydriatic images? | Gulshan et al. | No. Mydriasis does not significantly improve performance. |
Characteristics of studies included in the systematic review.
| First author, reference | Factor addressed (training/testing dataset) | Data points | Training dataset | Number of images (training dataset) | Testing dataset | Number of images (testing dataset) | Outcome measures | Results | Implications | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gulshan | Dataset size (% of total training dataset of 103,698) (training) | 0.2% | EyePACS | 207 | EyePACS | 24,360 | SP (at pre-set 97% SN) | SP 38% | 60,000 images may be the minimum training dataset size needed for maximum performance | |||||
| 2% | 2073 | 61% | ||||||||||||
| 10% | 10,369 | 77% | ||||||||||||
| 20% | 20,739 | 86% | ||||||||||||
| 30% | 31,109 | 91% | ||||||||||||
| 40% | 41,479 | 98% | ||||||||||||
| 50% | 51,849 | 100% | ||||||||||||
| 60% | 62,218 | 96% | ||||||||||||
| 70% | 72,588 | 97% | ||||||||||||
| 80% | 82,958 | 100% | ||||||||||||
| 90% | 93,328 | 99% | ||||||||||||
| 100% | 103,698 | 100% | ||||||||||||
| Mydriasis (testing) | Mydriatic | EyePACS | 128,175 | EyePACS-1 | 4236 | SN SP | SN 89.6% | SP 97.9% | Mydriasis may not be required for optimal performance | |||||
| Non-Mydriatic | 4534 | 90.9% | 98.5% | |||||||||||
| Both | 8770 | 90.1% | 98.2% | |||||||||||
| Ting | Retinal cameras (testing) | Canon | SiDRP | 76,370 | BES | 1052 | AUC SN SP | AUC 0.929 | SN 94.4% | SP 88.5% | Different types of retinal cameras do not affect the performance | |||
| Topcon | CUHK | 1254 | 0.948 | 99.3% | 83.1% | |||||||||
| Carl Zeiss | HKU | 7706 | 0.964 | 100% | 81.3% | |||||||||
| Fundus Vue | Guangdong | 15,798 | 0.949 | 98.7% | 81.6% | |||||||||
| Study type (testing) | Clinic-based | SiDRP | 76,370 | CUHK | 1254 | AUC SN SP | AUC 0.948 | SN 99.3% | SP 83.1% | The study type does not affect the performance in detection of disease | ||||
| Community-based | BES | 1052 | 0.929 | 94.4% | 88.5% | |||||||||
| Population-based | Guangdong | 15,798 | 0.949 | 98.7% | 81.6% | |||||||||
| Reference standard (testing) | Retinal specialists | SiDRP | 76,370 | CUHK | 1254 | AUC SN SP | AUC 0.948 | SN 99.3% | SP 83.1% | Provided that, at a minimum, professional graders with ≥7 years' experience perform the grading, performance may not be affected | |||
| Ophthalmologists | BES | 1052 | 0.929 | 94.4% | 88.5% | |||||||||
| Optometrists | HKU | 7706 | 0.964 | 100% | 81.3% | |||||||||
| Graders | RVEEH | 2302 | 0.983 | 98.9% | 92.2% | |||||||||
| Prevalence rate (testing) | 5.5% (BES) | SiDRP | 76,370 | BES | 1052 | AUC SN SP | AUC 0.929 | SN 94.4% | SP 88.5% | Lower prevalence rate does not greatly affect performance | ||||
| 8.1% (SCES) | SCES | 1936 | 0.919 | 100% | 76.3% | |||||||||
| 12.9% (AFEDS) | AFEDS | 1968 | 0.980 | 98.8% | 86.5% | |||||||||
| Concurrent diseases (testing) | Mixed pathologies | SiDRP | 76,370 | DR | 37,001 | AUC SN SP | AUC 0.936 | SN 90.5% | SP 91.6% | Concurrent ocular pathologies in the same image do not affect the model's detection of either disease | |||
| AMD | 773 | 0.942 | 96.4% | 87.2% | ||||||||||
| Glaucoma | 56 | 0.931 | 93.2% | 88.7% | ||||||||||
| Ethnicity (testing) | Malay | SiDRP | 76,370 | SIMES | 3052 | AUC SN SP | AUC 0.889 | SN 97.1% | SP 82.0% | Despite differences in the retina between ethnicities, performance in detection is not influenced | |||
| Indian | SINDI | 4512 | 0.917 | 99.3% | 73.3% | |||||||||
| Chinese | SCES | 1936 | 0.919 | 100% | 76.3% | |||||||||
| African American | AFEDS | 1968 | 0.980 | 98.8% | 86.5% | |||||||||
| White | RVEEH | 2302 | 0.983 | 98.9% | 92.2% | |||||||||
| Hispanic | Mexico | 1172 | 0.950 | 91.8% | 84.8% | |||||||||
| Bawankar | Mydriasis (testing) | Non-mydriatic (vs ETDRS mydriatic reference standard) | Eye-PACS1, India | 80,000 | India | 1084 | SN SP | SN 91.2% | SP 96.9% | Despite the testing dataset being non-mydriatic, the DLS performed highly against the mydriatic 7-field ETDRS grading reference standard | |||
| Burlina | Dataset size (training) | Real | AREDS | 119,090 | AREDS | 13,302 | AUC AC | AUC 0.971 | AC 91.1% | Creating proxy datasets using GANs may provide a solution for those with limited access to large numbers of images | |||
| Synthetic | Image generated with GANs | 119,090 | 0.924 | 82.9% | ||||||||||
| Sahlsten | Image pixel size (training) | 256×256 | Digifundus Ltd (Finland) | 24,806 | Digifundus Ltd (Finland) | 7118 | AUC | AUC 0.961 | Training with higher-resolution images may improve performance | |||||
| 299×299 | 24,806 | 0.970 | ||||||||||||
| 512×512 | 24,806 | 0.979 | ||||||||||||
| 1024×1024 | 24,806 | 0.984 | ||||||||||||
| 2095×2095 | 24,806 | 0.987 | ||||||||||||
| Bellemo | Ethnicity (testing) | African | SiDRP | 76,370 | Zambia | 4504 | AUC SN SP | AUC 0.973 | SN 92.3% | SP 89.0% | Differences in ethnicity between training and testing datasets do not affect performance | |||
| Ting | Prevalence rate (testing) | 4.1% (VTDR) | SiDRP | 76,370 | Pooled dataset (SiDRP, SIMES, SINDI, SCES, BES, AFEDS, CUHK, DMP) | 93,293 | AUC | AUC 0.950 | Prevalence rates of diseases may be estimated accurately by the DLS | |||||
| 6.5% (RDR) | 0.963 | |||||||||||||
| 15.9% (ADR) | 0.863 | |||||||||||||
AUC area under the receiver operating characteristic curve, AC accuracy, SN sensitivity, SP specificity, EyePACS Eye Picture Archive Communication System, SiDRP Singapore's National Integrated Diabetic Retinopathy Screening Program, BES Beijing Eye Study, CUHK Chinese University of Hong Kong, HKU Hong Kong University, RVEEH Royal Victorian Eye and Ear Hospital, AFEDS African American Eye Disease Study, SCES Singapore Chinese Eye Study, SIMES Singapore Malay Eye Study, SINDI Singapore Indian Eye Study, DMP Diabetes Management Project Melbourne, DLS deep learning system, ETDRS Early Treatment Diabetic Retinopathy Study, AREDS Age-Related Eye Disease Study, DR diabetic retinopathy, AMD age-related macular degeneration, VTDR vision-threatening diabetic retinopathy, RDR referable diabetic retinopathy, ADR any diabetic retinopathy, GAN generative adversarial network.
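As a concrete illustration of the outcome measures abbreviated above, here is a minimal Python sketch (not from the paper) of sensitivity (SN), specificity (SP), and the Mann-Whitney formulation of AUC, computed from a model's predicted probabilities:

```python
import numpy as np

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """SN = TP/(TP+FN); SP = TN/(TN+FP) at a given operating threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

def auc_mann_whitney(y_true, y_prob):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as 0.5 — equivalent to the area under the ROC curve."""
    y_true = np.asarray(y_true)
    pos = np.asarray(y_prob)[y_true == 1]
    neg = np.asarray(y_prob)[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Unlike AUC, the reported SN/SP pairs depend on the operating threshold chosen for the classifier, which is why the tables quote them alongside the threshold-free AUC.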
Effect of technical factors, specifically convolutional neural networks and computational frameworks.
| | | | Convolutional neural networks | | | | Computational frameworks | |
|---|---|---|---|---|---|---|---|---|
| | | | VGGNet | ResNet | DenseNet | Ensemble | Caffe | TensorFlow |
| SiDRP | Value (95% CI) | AUC | 0.938 (0.929–0.945) | 0.936 (0.927–0.944) | 0.941 (0.933–0.947) | 0.944 (0.938–0.950) | 0.936 (0.927–0.944) | 0.938 (0.929–0.945) |
| | | P value for AUC comparison | Reference | 0.581 | 0.410 | 0.02 | Reference | 0.736 |
| | | Sensitivity | 92.1% (89.2–94.5%) | 91.9% (88.9–94.3%) | 92.8% (90.0–95.1%) | 94.0% (91.3–96.0%) | 90.5% (87.3–93.1%) | 92.1% (89.2–94.5%) |
| | | Specificity | 91.0% (90.7–91.3%) | 90.9% (90.6–91.2%) | 90.9% (90.6–91.2%) | 90.7% (90.4–91.0%) | 91.9% (91.6–92.2%) | 91.0% (90.7–91.3%) |
P values were calculated by the bootstrap method.
The dataset used for evaluation of the different computational frameworks and convolutional neural networks was the Singapore Integrated Diabetic Retinopathy Programme (SiDRP) 2014 to 2015. When evaluating the impact of the convolutional neural network (CNN) on DL algorithm performance, the computational framework was held constant (TensorFlow) for fair comparison. Similarly, when evaluating the different computational frameworks, the CNN was held constant (VGGNet) to isolate the independent variable.
AUC area under receiver operating curve, CI confidence Interval, SiDRP Singapore integrated Diabetic Retinopathy Programme.
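The bootstrap comparison described in the table footnote can be sketched as follows — a minimal, hypothetical Python implementation (not the authors' code): resample the test set with replacement, recompute the AUC difference between two models on each resample, and read a two-sided p-value off the sign of the resampled differences.

```python
import numpy as np

def auc(y_true, y_prob):
    # Mann-Whitney AUC, ties counted as 0.5
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    return ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

def bootstrap_auc_pvalue(y_true, prob_a, prob_b, n_boot=2000, seed=0):
    """Compare two models' AUCs on the same test set by bootstrapping."""
    rng = np.random.default_rng(seed)
    observed = auc(y_true, prob_a) - auc(y_true, prob_b)
    diffs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample with replacement
        yb = y_true[idx]
        if yb.min() == yb.max():             # need both classes present
            continue
        diffs.append(auc(yb, prob_a[idx]) - auc(yb, prob_b[idx]))
    diffs = np.asarray(diffs)
    # two-sided p-value: how often the resampled difference flips sign
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

The same machinery yields the confidence intervals in the tables by taking percentiles of the bootstrap distribution of a single model's AUC.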
Fig. 2Retinal image examples.
a Our results showed that different CNNs produce concordant classification of referable and non-referable DR; these two images exhibit this agreement. b Likewise, the choice of computational framework does not significantly affect performance: many images, such as those depicted above, are correctly classified as non-referable or referable DR by either framework. c Altering the image compression level does significantly affect the DL model's performance beyond the threshold of 250 KB, with a drop in sensitivity and specificity. These two photographs illustrate examples where a referable DR image is correctly identified as referable by the DL model under mild compression (i.e., a true positive case) but is misclassified as non-referable with further compression beyond 250 KB (i.e., a false negative case), supporting the drop in sensitivity beyond the 250 KB threshold. Similarly, for a case of non-referable DR, higher compression causes a previously correctly classified image to be misclassified (i.e., a previously true negative result, now falsely classified as positive), supporting the drop in specificity. d Another change to the image characteristics, in this case the field of view, showed reduced sensitivity and specificity when using 1-field instead of 2-field images. This example of referable DR has significant lesions in the inferonasal quadrant that would likely be missed using only a macula-centered image, supporting the drop in sensitivity with the solitary use of 1-field images. Conversely, this example of a healthy retina captured dust particles in the superonasal and inferonasal quadrants that may have been misinterpreted by the DL algorithm as lesions, prompting misclassification as referable DR and supporting the drop in specificity.
Effect of image-related factors, specifically compression levels.
| | | | Compression level (image file size) | | | | |
|---|---|---|---|---|---|---|---|
| | | | 350 KB | 300 KB | 250 KB | 200 KB | 150 KB |
| SiDRP | Value (95% CI) | AUC | 0.936 (0.927–0.944) | 0.921 (0.908–0.932) | 0.900 (0.885–0.913) | 0.896 (0.881–0.910) | 0.891 (0.876–0.905) |
| | | P value for AUC comparison | Reference | 0.261 | <0.001 | <0.001 | <0.001 |
| | | Sensitivity | 90.5% (87.3–93.1%) | 85.9% (82.2–89.0%) | 83.5% (79.7–86.9%) | 85.6% (81.9–88.8%) | 90.5% (87.3–93.1%) |
| | | Specificity | 91.9% (91.6–92.2%) | 92.5% (92.3–92.8%) | 88.8% (88.5–89.2%) | 85.3% (84.9–85.7%) | 72.4% (71.9–72.8%) |
P values were calculated by the bootstrap method, taking 350 KB as the reference for comparison.
Dataset used for evaluation of different compression levels is Singapore integrated Diabetic Retinopathy Programme (SiDRP) 2014 to 2015.
AUC area under receiver operating curve, CI confidence interval, KB kilobytes, SiDRP Singapore integrated Diabetic Retinopathy Programme.
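The paper does not describe how the fixed file-size targets above (350 down to 150 KB) were produced, so the following is an assumption: one plausible approach is to search for the highest JPEG quality whose encoded output fits the byte budget. In this hypothetical sketch, `encode` stands for any callable mapping a quality setting to the encoded byte count (e.g., Pillow's `Image.save` into an in-memory buffer); a stub keeps the example dependency-free.

```python
def max_quality_under_budget(encode, budget_bytes, q_lo=1, q_hi=95):
    """Binary search for the largest JPEG quality whose encoded size fits
    within budget_bytes, assuming encoded size grows with quality."""
    best = None
    while q_lo <= q_hi:
        q = (q_lo + q_hi) // 2
        if encode(q) <= budget_bytes:
            best = q          # fits the budget: try a higher quality
            q_lo = q + 1
        else:
            q_hi = q - 1      # too large: lower the quality
    return best               # None if even the lowest quality is too big

# Stub encoder for illustration: pretend each quality step costs 4 KB.
quality = max_quality_under_budget(lambda q: q * 4096, 250 * 1024)
```

With Pillow, `encode` might be `lambda q: len(buf.getvalue())` after `img.save(buf, "JPEG", quality=q)` on a fresh `io.BytesIO` buffer — again, an illustrative assumption rather than the authors' procedure.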
Effect of image-related factors, specifically previous cataract surgery.
| | | | Lens status | |
|---|---|---|---|---|
| | | | Phakic | Pseudophakic |
| SEED | Value (95% CI) | AUC | 0.833 (0.811–0.853) | 0.918 (0.887–0.940) |
| | | P value for AUC comparison | Reference | <0.001 |
| | | Sensitivity | 91.1% (84.6–95.5%) | 93.4% (85.3–97.8%) |
| | | Specificity | 76.1% (73.8–78.3%) | 84.2% (81.4–86.8%) |
P value was calculated by the bootstrap method, using the phakic eyes as the reference.
The dataset used for evaluation of phakia compared with pseudophakia was the Singapore Epidemiology of Eye Diseases study, which comprises the Singapore Malay Eye Study, Singapore Indian Eye Study, and Singapore Chinese Eye Study.
AUC area under receiver operating curve, CI confidence interval, SEED Singapore Epidemiology of Eye Diseases study.
Fig. 3Heatmaps generated for compressed images.
Heatmaps showing the ‘hot’ areas on which the DL algorithm focuses its attention when making a diagnostic assessment of the retinal image, created using the Integrated Gradients method[66]. The colors overlaid on the greyscale retinal image show the region of interest, with red marking peak areas and blue marking background areas of the region of interest. The white box isolates an area of the image to illustrate the difference between the 350 KB and 150 KB versions. a A fundus photo of a healthy retina provided to the DL model as a 350 KB image. This was correctly classified by the DL model as a healthy retina with no DR; the heatmap shows focus on the normal optic disc and vasculature. b The same healthy retina compressed to 150 KB. This was misclassified by the DL algorithm as a retina with referable DR; the heatmap shows regions of interest beyond the normal optic disc. The magnification of one of these anomalous regions depicts pixelation artifacts, identified by the white arrows and ovals, which are mistaken for a pathological manifestation of DR, resulting in the false positive.
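For readers unfamiliar with the attribution method behind these heatmaps, here is a toy numpy sketch of Integrated Gradients (a simplified illustration, not the paper's implementation): accumulate the model's gradient along the straight-line path from a baseline image to the input, then scale by the input minus the baseline. In the paper, the function being attributed would be the DL model's referable-DR score and `x` the retinal image.

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Midpoint Riemann-sum approximation of the Integrated Gradients
    path integral; grad_fn(z) returns dF/dz for the model score F."""
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints of the path
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps
```

For a linear score F(x) = w·x the gradient is constant, so the attribution reduces exactly to (x − baseline) · w, and the attributions sum to F(x) − F(baseline) — the completeness property that makes the resulting heatmaps interpretable as a decomposition of the model's output.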
Fig. 4Convolutional neural networks investigated.
The architecture of convolutional neural networks (CNNs) is based on a few general principles. The network is composed of mathematically weighted neurons forming sequential layers, through which signal flows from the input layer to the output layer. For this study, each input image was pre-processed by scaling to a fixed template of 512 × 512 pixels in resolution. These images were then represented as a matrix of Red Green Blue (RGB) values in the input layer. Sequential convolutions were conducted by superimposing a weighted kernel over these input maps, with our study using a 3 × 3 weighted kernel with subsequent max-pooling. The output layer uses a softmax classifier to generate probability values for the pre-defined output classes[15,32,52]. a VGGNet is the oldest CNN in this comparison, released in 2014. Despite its standard uniform architecture of 16 layers, it has had great success at feature extraction[53]. b ResNet has been highly favored since its introduction in 2015, with its atypical architecture using skip residual connections (visualized as blue arrows) to bypass signals across layers. This allows an increase in layers without compromising ease of training, yielding a supra-human top-5 error rate of 3.6%[54]. c DenseNet is a newer CNN released in 2017 that has been shown to outperform ResNet. Its architecture builds on a principle similar to ResNet's but uses a dense connectivity pattern in which each layer receives information from all preceding layers, as shown by the green arrows. This allows concatenation of sequential layers, compacting the network into a ‘denser’ configuration[40]. d Ensemble is a combination of the three networks' per-eye probability outputs, obtained by taking their mean.
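The ensembling step in panel d — averaging the three networks' per-eye probability outputs — can be sketched in a few lines of numpy. This is an illustrative sketch; the softmax step is an assumption about how each network's raw logits become the probabilities that are averaged.

```python
import numpy as np

def softmax(z):
    """Convert raw logits to a probability vector."""
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ensemble_probability(logits_per_model):
    """Mean of the per-model class probabilities, one logit vector per CNN
    (e.g., VGGNet, ResNet, DenseNet) for the same eye."""
    probs = np.stack([softmax(z) for z in logits_per_model])
    return probs.mean(axis=0)    # per-class mean probability
```

Averaging probabilities (rather than logits) keeps the combined output a valid probability distribution, and the table above shows this simple ensemble edging out each individual CNN (AUC 0.944 vs 0.936–0.941).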
Effect of image-related factors, specifically fundus fields of view.
| | | | Fundus fields of view | | |
|---|---|---|---|---|---|
| | | | 7-field (ETDRS standard) | 2-field (optic disc- and macula-centered) | 1-field (macula-centered) |
| SiDRP | Value (95% CI) | AUC | | 0.936 (0.927–0.944) | 0.908 (0.894–0.920) |
| | | P value for AUC comparison | | Reference | <0.001 |
| | | Sensitivity | | 90.5% (87.3–93.1%) | 89.4% (86.0–92.2%) |
| | | Specificity | | 91.9% (91.6–92.2%) | 89.4% (89.0–89.7%) |
| AFEDS | Value (95% CI) | AUC | 0.949 (0.923–0.968) | 0.911 (0.877–0.937) | 0.895 (0.852–0.931) |
| | | P value for AUC comparison | Reference | <0.001 | <0.001 |
| | | Sensitivity | 90.0% (81.9–95.3%) | 82.6% (72.9–89.9%) | 78.4% (67.3–87.1%) |
| | | Specificity | 86.5% (84.6–88.3%) | 84.4% (82.3–86.3%) | 86.1% (84.0–88.0%) |
P values were calculated by the bootstrap method.
The datasets used for evaluation of the different fundus fields of view were the Singapore Integrated Diabetic Retinopathy Programme (SiDRP) 2014 to 2015, used to evaluate 2-field and 1-field images, and the African American Eye Disease Study, used to evaluate 7-field ETDRS-standard retinal images in addition to 2-field and 1-field.
AUC area under receiver operating curve, CI confidence interval, ETDRS Early Treatment Diabetic Retinopathy Study, SiDRP Singapore’s national integrated Diabetic Retinopathy Screening Program, AFEDS African American Eye Disease Study.