
Artificial intelligence and deep learning in ophthalmology.

Daniel Shu Wei Ting¹, Louis R Pasquale², Lily Peng³, John Peter Campbell⁴, Aaron Y Lee⁵, Rajiv Raman⁶, Gavin Siew Wei Tan⁷, Leopold Schmetterer⁷,⁸,⁹,¹⁰, Pearse A Keane¹¹, Tien Yin Wong⁷.

Abstract

Artificial intelligence (AI) based on deep learning (DL) has sparked tremendous global interest in recent years. DL has been widely adopted in image recognition, speech recognition and natural language processing, but is only beginning to impact on healthcare. In ophthalmology, DL has been applied to fundus photographs, optical coherence tomography and visual fields, achieving robust classification performance in the detection of diabetic retinopathy and retinopathy of prematurity, the glaucoma-like disc, macular oedema and age-related macular degeneration. DL in ocular imaging may be used in conjunction with telemedicine as a possible solution to screen, diagnose and monitor major eye diseases for patients in primary care and community settings. Nonetheless, there are also potential challenges with DL application in ophthalmology, including clinical and technical challenges, explainability of the algorithm results, medicolegal issues, and physician and patient acceptance of the AI 'black-box' algorithms. DL could potentially revolutionise how ophthalmology is practised in the future. This review provides a summary of the state-of-the-art DL systems described for ophthalmic applications, potential challenges in clinical deployment and the path forward. © Author(s) (or their employer(s)) 2019. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.


Keywords: glaucoma; imaging; public health; retina; telemedicine


Year: 2018 PMID: 30361278 PMCID: PMC6362807 DOI: 10.1136/bjophthalmol-2018-313173

Source DB: PubMed Journal: Br J Ophthalmol ISSN: 0007-1161 Impact factor: 4.638

Introduction

Artificial intelligence (AI) is the fourth industrial revolution in mankind’s history.1 Deep learning (DL) is a class of state-of-the-art machine learning techniques that has sparked tremendous global interest in the last few years.2 DL uses representation-learning methods with multiple levels of abstraction to process input data without the need for manual feature engineering, automatically recognising the intricate structures in high-dimensional data through projection onto a lower dimensional manifold.2 Compared with conventional techniques, DL has been shown to achieve significantly higher accuracies in many domains, including natural language processing, computer vision3–5 and voice recognition.6 In medicine and healthcare, DL has been primarily applied to medical imaging analysis, in which DL systems have shown robust diagnostic performance in detecting various medical conditions, including tuberculosis from chest X-rays,7 8 malignant melanoma on skin photographs9 and lymph node metastases secondary to breast cancer from tissue sections.10 DL has similarly been applied to ocular imaging, principally fundus photographs and optical coherence tomography (OCT). Major ophthalmic diseases which DL techniques have been used for include diabetic retinopathy (DR),11–15 glaucoma,11 16 age-related macular degeneration (AMD)11 17 18 and retinopathy of prematurity (ROP).19 DL has also been applied to estimate refractive error and cardiovascular risk factors (eg, age, blood pressure, smoking status and body mass index).20 21 A primary benefit of DL in ophthalmology could be in screening, such as for DR and ROP, for which well-established guidelines exist. Other conditions, such as glaucoma and AMD, may also require screening and long-term follow-up. However, screening requires tremendous manpower and financial resources from healthcare systems, in both developed countries and in low-income and middle-income countries. 
The use of DL, coupled with telemedicine, may be a long-term solution to screen and monitor patients within primary eye care settings. This review summarises new DL systems for ophthalmology applications, potential challenges in clinical deployment and potential paths forward.

DL applications in ophthalmology

Diabetic retinopathy

Globally, 600 million people will have diabetes by 2040, with a third having DR.22 A pooled analysis of 22 896 people with diabetes from 35 population-based studies in the USA, Australia, Europe and Asia (between 1980 and 2008) showed that the overall prevalence of any DR (in type 1 and type 2 diabetes) was 34.6%, with 7% having vision-threatening diabetic retinopathy.22 Screening for DR, coupled with timely referral and treatment, is a universally accepted strategy for blindness prevention. DR screening can be performed by different healthcare professionals, including ophthalmologists, optometrists, general practitioners, screening technicians and clinical photographers. The screening methods comprise direct ophthalmoscopy,23 dilated slit lamp biomicroscopy with a hand-held lens (90 D or 78 D),24 mydriatic or non-mydriatic retinal photography,23 teleretinal screening,25 and retinal video recording.26 Nonetheless, DR screening programmes are challenged by issues related to implementation, availability of human assessors and long-term financial sustainability.27

Over the past few years, DL has revolutionised the diagnostic performance in detecting DR.2 Using this technique, many groups have shown excellent diagnostic performance (table 1).14 Abràmoff et al 14 showed that a DL system was able to achieve an area under the receiver operating characteristic curve (AUC) of 0.980, with sensitivity and specificity of 96.8% and 87.0%, respectively, in the detection of referable DR (defined as moderate non-proliferative DR or worse, including diabetic macular oedema (DMO)) on the Messidor-2 data set. Similarly, Gargeya and Leng15 reported an AUC of 0.97 using cross-validation on the same data set, and 0.94 and 0.95 in two independent test sets (Messidor-2 and E-Ophtha).
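The three metrics quoted throughout this review (AUC, sensitivity and specificity) can be computed from a model's output scores as in the minimal sketch below. The labels and scores are hypothetical toy values, not data from any cited study, and the function names are illustrative only. The AUC is computed here through its rank interpretation (the probability that a randomly chosen diseased eye scores higher than a randomly chosen healthy one), which is one of several equivalent ways to obtain it.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when score >= threshold is called 'referable'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = referable DR, 0 = non-referable.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.15, 0.10, 0.05]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarises performance across all thresholds, which is why a single system can be reported at more than one operating point.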

Table 1

DL systems	Year	Test data sets	Test images (n)	CNN	AUC	Sensitivity (%)	Specificity (%)
Referable diabetic retinopathy
Abràmoff et al 14	2016	Messidor-2	1748	AlexNet/VGG	0.98	96.80	87.00
Gulshan et al 12	2016	Messidor-2	1748	Inception-V3	0.99	87	98.50
						96.10	93.90
		EyePACS-1	9963		0.991	90.30	98.10
						97.50	93.40
Gargeya and Leng15	2017	Kaggle images	75 137	Customised CNN	0.97	NA	NA
		E-Ophtha	463		0.96	NA	NA
		Messidor-2	1748		0.94	NA	NA
Ting et al 11	2017	SiDRP 14–15	71 896	VGG-19	0.936	90.50	91.60
		Guangdong	15 798		0.949	98.70	81.60
		SIMES	3052		0.889	97.10	82.00
		SINDI	4512		0.917	99.3	73.3
		SCES	1936		0.919	100	76.30
		BES	1052		0.929	94.40	88.50
		AFEDS	1968		0.98	98.80	86.50
		RVEEH	2302		0.983	98.90	92.20
		Mexican	1172		0.95	91.80	84.80
		CUHK	1254		0.948	99.3	83.10
		HKU	7706		0.964	100	81.30
Abràmoff et al 28	2018	10 primary care practice sites from the USA	892 patients	AlexNet/VGG	NA	87.2	90.7
Glaucoma suspect*
Ting et al 11	2017	SiDRP 14–15	71 896	VGG-19	0.942	96.40	93.20
Li et al 16	2018	Guangdong	48 116		0.986	95.60	92.00
Age-related macular degeneration
Ting et al 11	2017	SiDRP 14–15	35 948	VGG-19	0.932	93.20	88.70
Burlina et al 17	2017	AREDS	120 656	AlexNet, OverFeat	0.940–0.96	NA	NA
Grassmann et al 18	2018	AREDS	120 656	AlexNet, GoogleNet, VGG, Inception-V3, ResNet, Inception-ResNet-V2	NA	84.20	94.30
Retinopathy of prematurity
Brown et al 19	2018	i-ROP	100	Inception-V1 and U-Net	NA	100	94

The diagnostic performance is not comparable between the different DL systems given the different data sets used in the individual study.

*Definition of glaucoma suspect: (1) Ting et al 11—vertical cup to disc ratio of 0.8 or greater, and any glaucomatous disc changes; (2) Li et al 16—vertical cup to disc ratio of 0.7 or greater, and any glaucomatous disc changes.

AFEDS, African American Eye Disease Study; AREDS, Age-Related Eye Disease Study; AUC, area under the receiver operating characteristic curve; BES, Beijing Eye Study; CNN, convolutional neural network; CUHK, Chinese University Hong Kong; DL, deep learning; SiDRP 14–15, Singapore Integrated Diabetic Retinopathy Screening Programme; HKU, Hong Kong University; NA, not available; RVEEH, Royal Victorian Eye and Ear Hospital; SCES, Singapore Chinese Eye Study; SIMES, Singapore Malay Eye Study; SINDI, Singapore Indian Eye Study.

Summary table for the different DL systems in the detection of referable diabetic retinopathy, glaucoma suspect, age-related macular degeneration and retinopathy of prematurity using fundus photographs.

More recently, Gulshan and colleagues12 from Google AI Healthcare reported another DL system with excellent diagnostic performance. The DL system was developed using 128 175 retinal images, graded between 3 and 7 times for DR and DMO by a panel of 54 US licensed ophthalmologists and ophthalmology residents between May and December 2015. The test set consisted of approximately 10 000 images retrieved from two publicly available databases (EyePACS-1 and Messidor-2), graded by at least seven US board-certified ophthalmologists with high intragrader consistency. The AUC was 0.991 and 0.990 for EyePACS-1 and Messidor-2, respectively (table 1).

Although a number of groups have demonstrated good results using DL systems on publicly available data sets, these systems were not tested in real-world DR screening programmes. In addition, the generalisability of a DL system to populations of different ethnicities, and to retinal images captured using different cameras, remains uncertain. Ting et al 11 reported a clinically acceptable diagnostic performance of a DL system, developed and tested using the Singapore Integrated Diabetic Retinopathy Programme over a 5-year period, and 10 external data sets recruited from 6 different countries, including Singapore, China, Hong Kong, Mexico, USA and Australia. The DL system, developed using the DL architecture VGG-19, was reported to have an AUC, sensitivity and specificity of 0.936, 90.5% and 91.6%, respectively, in detecting referable DR. For vision-threatening DR, the corresponding statistics were 0.958, 100% and 91.1%. The AUC ranged from 0.889 to 0.983 for the 10 external data sets (n=40 752 images). More recently, the DL system developed by Abràmoff et al 28 obtained US Food and Drug Administration approval for the diagnosis of DR. It was evaluated in a prospective, although observational, setting, achieving 87.2% sensitivity and 90.7% specificity.28

Age-related macular degeneration

AMD is a major cause of vision impairment in the elderly population globally. The Age-Related Eye Disease Study (AREDS) classified AMD stages into none, early, intermediate and late AMD.29 The American Academy of Ophthalmology recommends that people with intermediate AMD should be seen at least once every 2 years. It is projected that 288 million patients may have some form of AMD by 2040,30 with approximately 10% having intermediate AMD or worse.29 With the ageing population, there is an urgent clinical need for a robust DL system to screen these patients for further evaluation in tertiary eye care centres.

Ting et al 11 reported a clinically acceptable DL system diagnostic performance in detecting referable AMD (table 1). Specifically, the DL system was trained and tested using 108 558 retinal images from 38 189 patients. Fovea-centred images without macula segmentation were used in this study. Given that this was the DR screening population, there were relatively few patients with referable AMD. For the other two studies,17 18 DL systems were developed using the AREDS data set, with a high number of referable AMD cases (intermediate AMD or worse). Using a fivefold cross-validation, Burlina et al 17 reported a diagnostic accuracy of between 88.4% and 91.6%, with an AUC of between 0.94 and 0.96. Unlike Ting et al,11 the authors presegmented the macula region prior to training and testing, with an 80/20 split between the training and testing in each fold. In terms of the DL architecture, both AlexNet and OverFeat were used, with AlexNet yielding better performance. Using the same AREDS data set, Grassmann et al 18 reported a sensitivity of 84.2% in the detection of any AMD. In this study, the authors used six convolutional neural networks (AlexNet, GoogleNet, VGG, Inception-V3, ResNet and Inception-ResNet-V2) to train different models. Data augmentation was also used to increase the diversity of the data set and to reduce the risk of overfitting.

For the AREDS data set, all the photographs were captured as analogue photographs and digitised later. Whether this affects the DL system’s performance remains uncertain. In addition, none of the three abovementioned studies reported external validation results for the individual DL systems.
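To illustrate the fivefold cross-validation with an 80/20 split per fold described above, the following is a minimal sketch in plain Python. Several details are assumptions rather than anything stated in the cited studies: the identifiers are hypothetical, the split is made by patient rather than by image (a common precaution so that images from one patient never appear on both sides of a split), and the horizontal flip is only one simple example of data augmentation, not necessarily the one the authors used.

```python
import random


def k_fold_splits(patient_ids, k=5, seed=0):
    """Yield (train_ids, test_ids); each patient lands in exactly one test fold."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)          # fixed seed for reproducible folds
    folds = [ids[i::k] for i in range(k)]     # k roughly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [pid for j, fold in enumerate(folds) if j != i for pid in fold]
        yield train, test


def horizontal_flip(image):
    """Augmentation example: mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


# Hypothetical cohort of 100 patients: each fold trains on 80 and tests on 20.
patients = [f"patient_{n:03d}" for n in range(100)]
splits = list(k_fold_splits(patients, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={len(train)} test={len(test)}")
```

Augmentation would normally be applied to the training portion of each fold only, so that the held-out 20% remains an unmodified estimate of performance.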

DMO, choroidal neovascularisation and other macular diseases

OCT has had a transformative effect on the management of macular diseases, specifically neovascular AMD and DMO. OCT also provides a near-microscopic view of the retina in vivo with quick acquisition protocols revealing structural detail that cannot be seen using other ophthalmic examination techniques. Thus, the number of macular OCTs has grown from 4.3 million in 2012 to 6.4 million in 2016 in the US Medicare population alone, and will most likely continue to grow worldwide.31 From a DL perspective, macular OCTs possess a number of attractive qualities as a modality for DL. First is the explosive growth in the number of macular OCTs that are routinely collected around the world. This large number of OCTs is required to train DL systems where having many training examples can aid in the convergence of many-layered networks with millions of parameters. Second, macular OCTs have dense three-dimensional structural information that is usually consistently captured. Unlike real-world images or even colour fundus photographs, the field of view of the macula and the foveal fixation is usually consistent from one volume scan to another. This lowers the complexity of the computer vision task significantly and allows networks to reach meaningful performance with smaller data sets. Third, OCTs provide structural detail that is not easily visible using conventional imaging techniques and provide an avenue for uncovering novel biomarkers of the disease. One of the first applications of DL to macular OCTs was in automated classification of AMD. 
Approximately 100 000 OCT B-scans were used to train a DL classifier based on VGG-16 to achieve an AUC of 0.97 (table 2).32 A few studies used a technique known as transfer learning, where a neural network is pretrained on ImageNet and subsequently trained on OCT B-scans for retinal disease classification.33–35 Of note, these initial studies involve the use of two-dimensional DL models trained on single OCT B-scans rather than three-dimensional models trained on OCT volumes. This may be a barrier to their potential clinical applicability.

Table 2

Summary table for the different DL systems in the detection of retinal diseases using OCT

DL systems	Year	Disease	OCT machines	Test images	CNN	AUC	Accuracy (%)	Sensitivity (%)	Specificity (%)
Lee et al 13 32	2017	Exudative AMD	Spectralis	20 613	VGG-16	0.928	87.60	84.60	91.50
Trader et al 33	2018	Exudative AMD	Spectralis	100	Inception-V3	0.980	100	NA	NA
Kermany et al 34	2018	CNV	Spectralis	1000	Inception-V3
		DMO
		Drusen
		1. Multiclass comparison				0.999	96.50	97.80	97.40
		2. Limited model				0.988	93.40	96.60	94.00
		3. Binary model
		CNV vs normal				1	100	100	100
		DMO vs normal				0.999	98.20	96.80	99.60
		Drusen vs normal				0.999	99	98	99.20
De Fauw et al 43	2018	Urgent, semiurgent, routine and observation only	Topcon	997 patients	1. Deep segmentation network using U-Net	Urgent referral: 0.992	94.5
		Normal, CNV, macular oedema, FTMH, PTMH, CSR, VMT, GA, drusen, ERM	Spectralis	116 patients	2. Deep classification network using a custom 29 CNN layers with 5 pooling layers	Urgent referral: 0.999	96.6

The diagnostic performance is not comparable between the different DL systems given the different data sets used in the individual study. AUC for specific conditions: CNV 0.993; macular oedema 0.990; normal 0.995; FTMH 1.00; PTMH 0.999; CSR 0.995; VMT 0.980; GA 0.990; drusen 0.967; and ERM 0.966.

AMD, age-related macular degeneration; AUC, area under the receiver operating characteristic curve; CNN, convolutional neural network; CNV, choroidal neovascularisation; CSR, central serous chorioretinopathy; DL, deep learning; DMO, diabetic macular oedema; ERM, epiretinal membrane; FTMH, full-thickness macula hole; GA, geographic atrophy; NA, not available; OCT, optical coherence tomography; PTMH, partial thickness macula hole; VMT, vitreomacular traction.

DL has also had a transformative impact in boundary and feature-level segmentation using neural networks that have been developed for semantic segmentation such as the U-Net.36 Specifically, these networks have been trained to segment intraretinal fluid cysts and subretinal fluid on OCT B-scans.13 37 38 Deep convolutional networks surpassed traditional methods in the quality of segmentation of retinal anatomical boundaries.39–41 Similar approaches have also been used to segment the foveal avascular zone on en-face OCT angiography (OCTA) images.42 More recently, DeepMind and the Moorfields Eye Hospital have combined the power of neural networks for both segmentation and classification tasks using a novel AI framework. In this approach, a segmentation network is first used to delineate a range of 15 different retinal morphological features and OCT acquisition artefacts. 
The output of this network is then passed to a classification network which makes a referral triage decision from four categories (urgent, semiurgent, routine, observation) and classifies the presence of 10 different OCT pathologies (choroidal neovascularisation (CNV), macular oedema without CNV, drusen, geographic atrophy, epiretinal membrane, vitreomacular traction, full-thickness macular hole, partial thickness macular hole, central serous retinopathy and ‘normal’).43 Using this approach, the Moorfields-DeepMind system reports a performance on par with experts for these classification tasks (although in a retrospective setting). Moreover, the generation of an intermediate tissue representation by the first, segmentation network means that the framework can be generalised across OCT systems from multiple different vendors without prohibitive requirements for retraining. In the near term, this DL system will be implemented in an existing real-world clinical pathway—the rapid access ‘virtual’ clinics that are now widely used for triaging of macular disease in the UK.44 In the longer term, the system could be used in triaging patients outside the hospital setting, particularly as OCT systems are increasingly being adopted by optometrists in the community.45

Glaucoma

The global prevalence of glaucoma for people aged 40–80 is 3.4%, and by the year 2040 it is projected there will be approximately 112 million affected individuals worldwide.46 Clinicians and patients alike would welcome improvements in disease detection, assessment of progressive structural and functional damage, treatment optimisation so as to prevent visual disability, and accurate long-term prognosis. Glaucoma is an optic nerve disease characterised by excavation and erosion of the neuroretinal rim that clinically manifests itself by increased optic nerve head (ONH) cupping. Yet, because the ONH area varies by fivefold, there is virtually no cup to disc ratio (CDR) that defines pathological cupping, hampering disease detection.47 Li et al 16 and Ting et al 11 trained computer algorithms to detect the glaucoma-like disc, defined as a vertical CDR of 0.7 or greater and 0.8 or greater, respectively. Investigators have also applied machine learning methods to distinguish glaucomatous nerve fibre layer damage from normal scans on wide-angle OCTs (9×12 mm).48 Future opportunities include training a neural network to identify the disc that would be associated with manifest visual field (VF) loss across the spectrum of disc size, as our current treatment strategies are aligned with slowing disease progression. Furthermore, DL could be used to detect progressive structural optic nerve changes in glaucoma. In glaucoma, retinal ganglion cell axons atrophy in a confined space within the ONH and ophthalmologists typically rely on low dimensional psychophysical data to detect the functional consequences of that damage. The outputs from these tests typically provide reliability parameters, age-matched normative comparisons and summary global indices, but more detailed analysis of this functional data is lacking. Elze et al 49 developed an unsupervised computer program to analyse VF that recognises clinically relevant VF loss patterns and assigns a weighting coefficient for each of them (figure 1). 
This method has proven useful in the detection of early VF loss from glaucoma.50 Furthermore, a myriad of computer programs to detect VF progression exist, ranging from assessment of global indices over time to point-wise analyses, to sectoral VF analysis; however, these approaches are often not aligned with clinical ground truth nor with one another.51 52 Yousefi et al 53 developed a machine-based algorithm that detected VF progression earlier than these conventional strategies. More machine learning algorithms that provide quantitative information about regional VF progression can be expected in the future.
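As a point of comparison for the conventional progression strategies mentioned above, the sketch below fits an ordinary least-squares trend to a global index over time and, separately, to each test location (a point-wise analysis). It is an illustration only: the use of mean deviation in decibels as the global index, the visit schedule and all values are hypothetical assumptions, and the machine learning methods cited in the text are not reproduced here.

```python
def least_squares_slope(times, values):
    """Ordinary least-squares slope of values regressed on times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


def pointwise_slopes(times, fields):
    """Slope at each test location; fields[v][i] is location i at visit v."""
    n_locations = len(fields[0])
    return [
        least_squares_slope(times, [visit[i] for visit in fields])
        for i in range(n_locations)
    ]


# Hypothetical follow-up: years since baseline and a global index (assumed dB).
years = [0.0, 1.0, 2.0, 3.0, 4.0]
mean_deviation_db = [-2.0, -2.5, -3.0, -3.5, -4.0]
global_slope = least_squares_slope(years, mean_deviation_db)

# Hypothetical two-location fields: location 0 is stable, location 1 worsens.
fields = [[-1.0, -2.0], [-1.0, -3.0], [-1.0, -4.0], [-1.0, -5.0], [-1.0, -6.0]]
local_slopes = pointwise_slopes(years, fields)
print(f"global slope: {global_slope} per year; point-wise: {local_slopes}")
```

The toy example shows why global and point-wise approaches can disagree: a single worsening location may be diluted in a global average while being obvious in a point-wise analysis.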

Figure 1

Archetype analysis with 16 visual field (VF) archetypes (ATs) that were derived from an unsupervised computer algorithm described by Elze et al. 49

Although intraocular pressure (IOP)-lowering has been shown to be therapeutically effective in delaying glaucoma progression, some studies have demonstrated that disease progression is still inevitable,54–56 suggesting that we have not arrived at optimised treatment regimens for the various forms of glaucoma. Kazemian et al 57 developed a clinical forecasting tool that uses tonometric and VF data to project disease trajectories at different target IOPs. Further refinement of this tool that integrates other ophthalmic and non-ophthalmic data would be useful to establish target IOPs and the best strategies to achieve them on a case-by-case basis. Finally, it is documented that patients with newly diagnosed glaucoma harbour fears of going blind58; perhaps, the use of machine learning that incorporates genome-wide data, lifestyle behaviour and medical history into a forecasting algorithm will allow early prognostication regarding the future risk of requiring invasive surgery or losing functional vision from glaucoma. As machine learning algorithms are revised, the practising ophthalmologist will have a host of tools available to diagnose glaucoma, detect disease progression and identify optimised treatment strategies using a precision medicine approach. In an ideal future scenario, they may also have clinical forecasting tools that inform patients as to their overall prognosis and expected clinical course with or without treatment.

Retinopathy of prematurity

ROP is a leading cause of childhood blindness worldwide, with an annual incidence of ROP-related blindness of 32 000 worldwide.59 The regional epidemiology of the disease varies based on a number of factors, including the number of preterm births, neonatal mortality of preterm children and capacity to monitor exposure to oxygen. ROP screening either directly via ophthalmoscopic examination or telemedical evaluation using digital fundus photography can identify the earliest signs of severe ROP, and with timely treatment can prevent most cases of blindness from ROP.60 61 Due to the high number of preterm births, reductions in neonatal mortality, and limited capacity for oxygen monitoring and ROP screening, the highest burden of blinding ROP today is in low-income and middle-income countries.62 There are two main barriers to effective implementation of ROP screening: (1) the diagnosis of ROP is subjective, with significant interexaminer variability in the diagnosis leading to inconsistent application of evidence-based interventions63; and (2) there are too few trained examiners in many regions of the world.64 Telemedicine has emerged as a viable model to address the latter problem, at least in regions where the cost of a fundus camera is not prohibitive, by allowing a single physician to virtually examine infants over a large geographical area. However, telemedicine itself does not solve the subjectivity problem in ROP diagnosis. Indeed, the acute-phase ROP study found nearly 25% of telemedicine examinations by trained graders required adjudication because the graders disagreed on one of three criteria for clinically significant ROP.65 There have been a number of early attempts to use DL for automated diagnosis of ROP,19 66 which could potentially address both implementation barriers for ROP screening. 
Most recently, Brown et al 19 reported the results of a fully automated DL system that could diagnose plus disease, the most important feature of severe ROP, with an AUC of 0.98 compared with a consensus reference standard diagnosis combining image-based diagnosis and ophthalmoscopy (table 1). When directly compared with eight international experts in ROP diagnosis, the i-ROP DL system agreed with the consensus diagnosis more frequently than six out of eight experts. Subsequent work found that the i-ROP DL system could also produce a severity score for ROP that demonstrated promise for objective monitoring of disease progression, regression and response to treatment.67 When compared with the same set of 100 images ranked in order of disease severity by experts, the algorithm had 100% sensitivity and 94% specificity in the detection of pre-plus or worse disease.
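The interexaminer variability discussed above is commonly quantified with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between two graders; the choice of this statistic and the three-level grades are illustrative assumptions, and the gradings are hypothetical toy data rather than results from the cited studies. The same function could compare an algorithm's output against a consensus reference standard.

```python
from collections import Counter


def cohens_kappa(grader_a, grader_b):
    """Chance-corrected agreement between two graders on the same images."""
    n = len(grader_a)
    observed = sum(1 for a, b in zip(grader_a, grader_b) if a == b) / n
    counts_a = Counter(grader_a)
    counts_b = Counter(grader_b)
    categories = set(counts_a) | set(counts_b)
    # Agreement expected if both graders assigned labels independently
    # according to their own marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical gradings of ten images by two examiners.
grader_a = ["plus", "plus", "pre-plus", "normal", "normal",
            "normal", "pre-plus", "normal", "plus", "normal"]
grader_b = ["plus", "pre-plus", "pre-plus", "normal", "normal",
            "pre-plus", "pre-plus", "normal", "plus", "normal"]

kappa = cohens_kappa(grader_a, grader_b)
print(f"kappa = {kappa:.3f}")
```

In this toy example the graders agree on 8 of 10 images, but kappa is lower than 0.8 because some of that agreement would be expected by chance alone.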

Potential challenges

Despite the high level of accuracy of the AI-based models in many of the diseases in ophthalmology, there are still many clinical and technical challenges for clinical implementation and real-time deployment of these models in clinical practice (table 3). These challenges could arise in different stages in both the research and clinical settings. First, many of the studies have used training data sets from relatively homogeneous populations.12 14 15 AI training and testing using retinal images is often subject to numerous variabilities, including width of field, field of view, image magnification, image quality and participant ethnicities. Diversifying the data set in terms of ethnicities and image-capture hardware could help to address this challenge.11

Table 3

The clinical and technical challenges in building and deploying deep learning (DL) techniques from ‘bench to bedside’

Steps	Potential challenges
1. Identification of training data sets	Patients’ consent and confidentiality issues. Varying standards and regulations between the different institutional review boards. Small training data sets for rare disease (eg, ocular tumours) or common diseases that are not captured in routine (eg, cataracts).
2. Validation and testing data sets	Lack of sample size—not sufficiently powered. Lack of generalisability—not tested widely in different populations or on data collected from different devices.
3. Explainability of the results	Demonstration of the regions ‘deemed’ abnormal by DL. Methods to generate heat maps—occlusion tests, class activation, integrated gradient method, soft attention map and so on.
4. Clinical deployment of DL Systems	Recommendation of the potential clinical deployment sites. Application of regulatory approval from health authorities (eg, US Food and Drug Administration, Europe CE marking and so on). Conducting prospective clinical trials. Medical rebate scheme and medicolegal requirement. Ethical challenges.

Another challenge in the development of AI models in ophthalmology has been the limited availability of large amounts of data, both for rare diseases (eg, ocular tumours) and for common diseases which are not imaged routinely in clinical practice, such as cataracts. Furthermore, there are diseases such as glaucoma and ROP where there will be disagreement and interobserver variability in the definition of the disease phenotype. An algorithm learns from what it is presented with. The software is unlikely to produce accurate outcomes if the training set of images given to the AI tool is too small or not representative of real patient populations. More evidence on ways of getting high-quality ground-truth labels is required for different imaging tools. 
Krause et al68 reported that adjudication grades by retina specialists were a more rigorous reference standard than a majority decision, especially to detect artefacts and missed microaneurysms in DR, and improved the algorithm performance. Second, many AI groups have reported robust diagnostic performance for their DL systems, although some papers did not show how the power calculation was performed for the independent data sets. A power calculation should take the following into consideration: the prevalence of the disease, type 1 and 2 errors, CIs, desired precision and so on. It is important to first preset the desired operating threshold on the training set, followed by analysis of performance metrics such as sensitivity and specificity on the test set to assess calibration of the algorithm. Third, large-scale adoption of AI in healthcare is still not on the horizon as clinicians and patients are still concerned about AI and DL being ‘black-boxes’. In healthcare, it is not only the quantitative algorithmic performance but also the underlying features through which the algorithm classifies disease that are important to improve physician acceptance. 
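The second challenge above (sizing the test set and fixing the operating threshold before testing) can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any cited study: the sample size function uses a common precision-based formula for estimating sensitivity within a chosen CI half-width given disease prevalence and does not model type 2 error, and the function names, target values and toy scores are all hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_sensitivity(expected_sensitivity, precision, prevalence,
                                confidence=0.95):
    """Total subjects needed so the sensitivity CI half-width is <= precision."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diseased_needed = (z ** 2 * expected_sensitivity
                       * (1 - expected_sensitivity) / precision ** 2)
    # Only a fraction `prevalence` of recruited subjects will have the disease.
    return math.ceil(diseased_needed / prevalence)


def threshold_for_target_sensitivity(labels, scores, target_sensitivity):
    """Highest threshold reaching the target sensitivity on development data."""
    positive_scores = sorted((s for y, s in zip(labels, scores) if y == 1),
                             reverse=True)
    needed = max(1, math.ceil(target_sensitivity * len(positive_scores)))
    return positive_scores[needed - 1]


def sensitivity_specificity(labels, scores, threshold):
    """Metrics at a fixed threshold (score >= threshold is called positive)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical planning values: 90% sensitivity, +/-5% precision, 10% prevalence.
n_required = sample_size_for_sensitivity(0.90, 0.05, 0.10)

# Threshold is fixed on development data, then applied unchanged to test data.
dev_labels = [1, 1, 1, 1, 0, 0, 0, 0]
dev_scores = [0.9, 0.8, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1]
threshold = threshold_for_target_sensitivity(dev_labels, dev_scores, 0.75)

test_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_scores = [0.85, 0.70, 0.55, 0.65, 0.62, 0.30, 0.20, 0.10]
test_sens, test_spec = sensitivity_specificity(test_labels, test_scores, threshold)
print(n_required, threshold, test_sens, test_spec)
```

Keeping the threshold fixed between the two data sets is the point of the exercise: choosing it on the test set itself would make the reported sensitivity and specificity optimistic.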
Generating heat maps highlighting the regions of influence on the image which contributed to the algorithm conclusion may be a first step (figure 2), although such maps are often challenging to interpret (what does it mean if a map highlights an area of vitreous on an OCT of a patient with drusen?).69 They may also struggle to deal with negations (what would it mean to highlight the most important part of an ophthalmic image that demonstrates that there is no disease present?).70 71 An alternative approach has been used for the DL system developed by the Moorfields Eye Hospital and DeepMind—in this system, the generation of an intermediate tissue representation by a segmentation network is used to highlight for the clinician (and quantify) the relevant areas of retinal pathology (figure 3).43 It is also important to highlight that ‘interpretability’ of DL systems may mean different things to a healthcare professional than to a machine learning expert. Although it seems likely that interpretable algorithms will be more readily accepted by ophthalmologists, future applied clinical research will be necessary to determine whether this is the case and whether it leads to tangible benefits for patients in terms of clinical effectiveness.
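One of the heat map methods mentioned in this review, the occlusion test, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in any cited system: the image is a plain grid of numbers, and `toy_score` is a hypothetical stand-in for a trained DL model's disease score.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Occlusion test: mask one patch at a time and record how much the
    model score drops. Larger drops mark regions the score depends on."""
    rows, cols = len(image), len(image[0])
    baseline = score_fn(image)
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            # Copy the image and blank out the current patch.
            occluded = [row[:] for row in image]
            for r in range(r0, min(r0 + patch, rows)):
                for c in range(c0, min(c0 + patch, cols)):
                    occluded[r][c] = fill
            drop = baseline - score_fn(occluded)
            # Assign the score drop to every pixel in the patch.
            for r in range(r0, min(r0 + patch, rows)):
                for c in range(c0, min(c0 + patch, cols)):
                    heat[r][c] = drop
    return heat


def toy_score(image):
    """Hypothetical stand-in for a DL model: total image brightness."""
    return sum(sum(row) for row in image)


# Hypothetical 4x4 'image' with a bright 2x2 lesion in the top-right corner.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
heat = occlusion_map(image, toy_score, patch=2)
```

In this toy case only the patch covering the lesion changes the score, so it is the only region highlighted. With a real network, the interpretive difficulties noted above still apply, since a highlighted region shows influence on the output rather than clinical meaning.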

Figure 2

Some examples of heat maps showing the abnormal areas in the retina. (A) Severe non-proliferative diabetic retinopathy (NPDR); (B) geographic atrophy in advanced age-related macular degeneration (AMD) on fundus photographs11; and (C) diabetic macular oedema on optical coherence tomography.

Figure 3

A representative screenshot from the output of the Moorfields-DeepMind deep learning system for optical coherence tomography segmentation and classification. In this case, the system correctly diagnoses a case of central serous retinopathy with secondary choroidal neovascularisation and recommends urgent referral to an ophthalmologist. Through the creation of an intermediate tissue representation (seen here as two-dimensional thickness maps for each morphological parameter), the system provides ‘explainability’ for the ophthalmologist.

Lastly, the current AI screening systems for DR have been developed and validated using two-dimensional images, which lack stereoscopic qualities, making identification of elevated lesions such as retinal traction challenging. Incorporating information from multimodal imaging in future AI algorithms may potentially address this challenge. In addition, the medicolegal aspects and the regulatory approvals vary in different countries and settings, and more work will be needed in these areas. An important challenge to the clinical adoption of AI-based technology is whether patients will entrust clinical care to machines. Keel et al72 evaluated the patient acceptability of AI-based DR screening within an endocrinology outpatient setting and reported that 96% of participants were satisfied or very satisfied with the automated screening model. However, in different populations and settings, patient acceptability of AI-based screening may vary and may pose a challenge to its implementation.

Conclusions

DL is a state-of-the-art machine learning technique that has revolutionised the AI field. For ophthalmology, DL has shown clinically acceptable diagnostic performance in detecting many retinal diseases, in particular DR and ROP. Future research is crucial in evaluating the clinical deployment and cost-effectiveness of different DL systems in clinical practice. To improve clinical acceptance of DL systems, it is important to unravel the ‘black-box’ nature of DL using existing and future methodologies. Although there are challenges ahead, DL will likely impact on the practice of medicine and ophthalmology in the coming decades.
