Automated retinopathy of prematurity screening using deep neural networks.

Jianyong Wang¹, Rong Ju², Yuanyuan Chen¹, Lei Zhang¹, Junjie Hu¹, Yu Wu¹, Wentao Dong³, Jie Zhong⁴, Zhang Yi⁵.

Abstract

BACKGROUND: Retinopathy of prematurity (ROP) is the leading cause of childhood blindness worldwide. An automated ROP detection system is urgently needed, and it appears to be a safe, reliable, and cost-effective complement to human experts.
METHODS: An automated ROP detection system called DeepROP was developed by using Deep Neural Networks (DNNs). ROP detection was divided into ROP identification and grading tasks. Two specific DNN models, i.e., Id-Net and Gr-Net, were designed for identification and grading tasks, respectively. To develop the DNNs, large-scale datasets of retinal fundus images were constructed by labeling the images of ROP screenings by clinical ophthalmologists.
FINDINGS: On the test dataset, Id-Net achieved a sensitivity of 96.62% (95% CI, 92.29%-98.89%) and a specificity of 99.32% (95% CI, 96.29%-99.98%) for ROP identification, while Gr-Net attained sensitivity and specificity values of 88.46% (95% CI, 76.56%-95.65%) and 92.31% (95% CI, 81.46%-97.86%), respectively, on the ROP grading task. On another 552 cases, the developed DNNs outperformed some human experts. In a clinical setting, the sensitivity and specificity values of DeepROP for ROP identification were 84.91% (95% CI, 76.65%-91.12%) and 96.90% (95% CI, 95.49%-97.96%), respectively, whereas the corresponding measures for ROP grading were 93.33% (95% CI, 68.05%-99.83%) and 73.63% (95% CI, 68.05%-99.83%), respectively.
INTERPRETATION: We constructed large-scale ROP datasets with adequate clinical labels and proposed novel DNN models. The DNN models can learn ROP features directly from big data. The developed DeepROP has the potential to be an efficient and effective system for automated ROP screening. FUND: National Natural Science Foundation of China under Grants 61432012 and U1435213.

Year: 2018 PMID： 30166272 PMCID： PMC6156692 DOI： 10.1016/j.ebiom.2018.08.033

Source DB: PubMed Journal: EBioMedicine ISSN： 2352-3964 Impact factor: 8.143

Evidence before this study

Before this study, we visited several experienced ophthalmologists and neonatologists in different hospitals, such as Sichuan Academy of Medical Sciences and Sichuan Provincial People's Hospital and Chengdu Women & Children's Central Hospital. Retinopathy of prematurity (ROP) is known to be the leading cause of childhood blindness worldwide. An automated ROP detection system is urgently needed, and it appears to be a safe, reliable, and cost-effective complement to human experts. We searched Google Scholar and Baidu Scholar with the terms “Retinopathy of prematurity”, “ROP”, “retinal fundus images”, “ROP detection”, “automated ROP detection”, etc., and obtained several related works in JAMA, JAMA Ophthalmology, Nature Biomedical Engineering, Cell, etc. These works indicated that deep neural networks (DNNs) could be used to develop an automated ROP detection system. DNNs can learn abstract features directly from big data without the explicit knowledge of experts and have yielded impressive results in medical image analysis, including ROP detection. Automated ROP detection systems using DNNs require further assessment and validation before they can improve ROP care.

Added value of this study

The added value of this study lies in three aspects. First, we constructed a large-scale ROP dataset with adequate clinical labels, which is sufficient to develop robust DNN models that can deliver satisfactory performance. Second, we developed and evaluated two specific deep neural networks on the constructed dataset for ROP detection. Third, we integrated the developed DNN models with a cloud platform to implement the automated ROP screening system and tested the system with several methods.

Implications of all the available evidence

First, the dataset constructed in this study was labeled with adequate clinical labels. An automated ROP screening system developed on such a dataset would be more valuable in clinical treatment. Future research may devote more effort to exploring clinical labels more suitable than “none ROP” and “plus-disease”. The dataset would also be a good tutorial for education on ROP. Second, we demonstrated the good performance of DNNs on ROP detection in retinal fundus images. The method could be applied to other medical image analysis tasks, such as detection of diabetic retinopathy and classification of OCT images. Third, we developed and tested the proposed automated ROP screening system, which would improve the quality of ROP care, since it can be used in routine ROP screening in hospitals and can facilitate cooperation among hospitals to detect ROP immediately and reliably. Fourth, an interesting direction for future work is to show whether the features that the DNN models depend on to assign predictions to ROP cases are the previously known features. This may reveal some new, previously unknown features.

Introduction

Retinopathy of prematurity (ROP) is an ocular disease observed in infants with low birth weight, especially premature infants. The retinal blood vessels of an infant suffering from ROP develop abnormally [1], which affects sight and may even lead to blindness in severe cases, e.g., due to retinal detachment. ROP is treatable by laser photocoagulation if the diagnosis and treatment are timely [2]. Therefore, ROP screening programs play an important role in ROP detection and clinical treatment. ROP screening for high-risk children is mainly performed by pediatric ophthalmologists or retinopathy specialists using bedside fundus photography devices. With the increasing number of high-risk children in the world and their uneven geographical distribution, this difficult, traditional method is unable to meet the need [3]. Besides, the number of ophthalmologists who are competent and willing to undertake ROP screening is gradually decreasing [4]. Kemper [5] reported that about 36% of the neonatologists in the USA have been unable to transfer a child to a NICU of lower acuity or closer to the child's home because there are no specialists available there for ROP screening, and 34% of them have needed to delay discharge because outpatient follow-up for either screening or treatment of ROP is not available near the family's home. ROP is the leading cause of childhood blindness worldwide [6,7]. The situation is much more severe in developing countries, such as China. High neonatal survival rates and a large number of infants with low birth weight increase the number of premature infants at risk of ROP [8], and medical resources are scarce. An automated ROP detection system is urgently needed, and it appears to be a safe, reliable, and cost-effective complement [9,10] to the efforts of ROP specialists, capable of increasing patient access to screening and focusing the resources of the current ophthalmic community on infants with potentially vision-threatening disease.
In these ROP screening programs, images of the retinal fundus are captured using fundus photography devices, such as RetCam3, and are used by the ophthalmologist to make a diagnosis. To implement an automated ROP detection system, an algorithm that analyzes the retinal fundus images and detects ROP is needed. ROP detection can be posed as a classification problem in machine learning. Traditional algorithms [9–12] analyze the retinal fundus images with handcrafted features, such as vessel dilation/tortuosity and venular features, and discern retinal from choroidal images. These methods require the specification of rules, which is unsuitable for ROP detection because ROP symptomatology is inadequately understood. Deep neural networks (DNNs) [13–15], a branch of artificial intelligence, have lately yielded impressive results, such as AlphaGo [16]. They avoid the aforementioned “feature engineering” by automatically learning latent features from big data, that is, a large set of labeled data consisting of images and corresponding labels. Therefore, DNNs reduce the need for explicit features identified by experts. They have achieved success in a variety of image analysis applications [3,16–24]. Most relevant research has been limited to ROP with plus-disease [3,21]. For example, DNNs including a preprocessing network with a U-net-like architecture and a classification network with an Inception V1-like architecture achieved a sensitivity of 93% with 94% specificity for diagnosis of plus disease [21]. Generally speaking, ROP with plus-disease is much more distinguishable, and many published datasets are limited to this problem. Plus is clearly an important feature in ROP; however, it is not sufficient to define ROP by itself. More effort should be devoted to constructing datasets that take further clinical features, such as stage and zone, into account.
Moreover, the number of images in published datasets is not sufficient to develop robust DNN models that can deliver satisfactory performance. In this study, we first constructed a large-scale ROP dataset with adequate clinical labels. The labels are “normal”, “minor ROP”, and “severe ROP”, which are defined on the clinical features of plus, stage, and zone together. In clinical treatment, patients with “minor ROP” are required to have a reexamination after two weeks, whereas patients with “severe ROP” are required to be treated by the clinical ophthalmologist immediately. An automated ROP screening system developed on such a dataset would be more valuable in clinical treatment than systems developed on datasets with “plus” and “non-plus” labels. Second, we developed two specific DNN models on the constructed dataset. Third, we integrated the developed DNN models with a cloud platform to implement the automated ROP screening system and tested the system with several methods.

Methods

In this study, we developed an automated ROP screening system using DNN models. It could be used for telemedicine ROP screening and could facilitate cooperation among hospitals. This section presents the methods used to construct the ROP detection datasets, to develop and evaluate the DNN models, and to test the automated ROP screening system.

Image labeling

First, a large set of retinal fundus images was collected from the ROP screenings of premature babies. During each ROP screening, several retinal fundus images are captured; these are split into two sets, one per eye, since the health of each eye may differ. Each set is defined as a case and is labeled independently. The cases were labeled by four clinical ophthalmologists following the pipeline in Fig. 1. In the initial blind-reading phase, each case was annotated separately by three ophthalmologists. A case was annotated as “ROP” if ROP was found in any of its retinal fundus images; otherwise, it was annotated as “normal.” Each ROP case was then graded as “minor ROP” or “severe ROP” according to its severity. Therefore, two datasets were constructed: an identification dataset and a grading dataset. In the subsequent non-blind phase, if the labels assigned by the initial experts differed, the annotations were evaluated by a fourth, more experienced clinical ophthalmologist. The fourth ophthalmologist discarded cases for which a reliable diagnosis could not be made. The labels used in this study, i.e., “normal”, “minor ROP”, and “severe ROP”, were assigned based not only on the International Committee for the Classification of Retinopathy of Prematurity (ICROP) system [25], but also on the requirements of clinical treatment described in Cryotherapy for Retinopathy of Prematurity (CRYO-ROP) and Early Treatment ROP (ETROP) [2,25,26]. The identification labels, “normal” and “ROP”, denote whether the clinical features of ROP are found in the retinal fundus images. The grading labels, “minor ROP” and “severe ROP”, indicate the severity of the ROP disease. The clinical features of “minor ROP” are Zone II or III and stage 1 or 2, while the clinical features of “severe ROP” are threshold disease, type I, type II, or AP-ROP, and stage 4 or 5. Detailed definitions of the clinical features of ROP can be found in the literature [2,25,26].

Fig. 1

The image labeling process.

Deep neural networks

ROP detection consists of two subtasks. One is an identification task, which aims to distinguish ROP cases from normal cases. The other is a grading task, where cases of ROP are classified as minor or severe. Both subtasks are binary classification problems in machine learning, which can be formulated as

$$y = F(X), \quad X = \{x_i \mid i = 1, 2, \ldots, n\},$$

where $n$ is the number of images of case $X$ and $x_i$ is the $i$-th image. $y \in \{0, 1\}$ is the corresponding label of $X$. Our goal is to approximate the function $F$. The methods in the literature [3,21] cannot be applied directly to our work, since detection in this study was performed on each case rather than on each image independently. Four challenges make it difficult for the human eye as well as for DNN models to identify and grade ROP from retinal fundus images: (1) the number of images varies between cases, depending on the patient and the photographer; (2) images of the retinal fundus are usually poor in quality because of motion blur and illumination artifacts, and the color images vary significantly in appearance; (3) only a few images in an ROP case contain the distinguishing features that determine ROP, while the others resemble images from other cases; (4) the features that determine ROP lie in very small local regions of the image. To deal with these problems, an ROP detection algorithm was developed using deep convolutional neural networks (see Fig. 2). An identification network (Id-Net) and a grading network (Gr-Net) were developed to approximate the mapping $F$ for the two subtasks, respectively. Id-Net and Gr-Net have a similar network architecture (see Fig. 3). It was constructed by introducing a feature-binding block into the well-known Inception-BN network [17], which delivered state-of-the-art performance in the large-scale image classification challenge “ImageNet.”

Fig. 2

The workflow of ROP detection by using two DNNs. Case X was first identified by Id-Net and subsequently graded by Gr-Net if it was a ROP case.

Fig. 3

The architecture of the DNNs used in the algorithm. The feature-binding block was implemented by an element-wise max operation. The architectures of the Conv Factory, Inception A, and Inception B modules are identical to the corresponding modules in the Inception-BN network [17].

The proposed Id-Net and Gr-Net consist of three blocks. The first is a feature extractor block, which is identical to the subnetwork from the input layer to the second Inception-B block in the Inception-BN network [17]. Given a case $X$, the feature extractor block takes each image $x_i$ one by one and outputs the corresponding feature map $s_i$. The second block is the feature-binding block, which binds the set of feature maps $S = \{s_i \mid i = 1, 2, \ldots, n\}$ into one feature map $S(X)$. The feature fusing method used in our networks is the max operation, defined by $S(X)_j = \max \{s_{i,j} \mid i = 1, 2, \ldots, n\}$, where $s_{i,j}$ is the $j$-th element of the $i$-th feature map, and $S(X)_j$ is the $j$-th element of feature map $S(X)$. The feature-binding block can reduce the negative effects of the varying number of images in each case $X$, and can enhance the feature maps of distinguishable images while suppressing similar images that are found in many cases belonging to different categories. The third block is the classification block, which consists of several convolution layers and a softmax layer. It learns deep abstract features from feature map $S(X)$ and maps the final feature to the corresponding label of the input case. Id-Net and Gr-Net need to be trained on the large-scale dataset to obtain the network parameters. Training a neural network refers to changing its parameters to improve network performance by deploying a learning algorithm, such as the backpropagation (BP) algorithm.
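The feature-binding step can be illustrated with a minimal sketch. It assumes each feature map has already been flattened into a list of numbers of equal length; the function name `bind_features` and the toy values are hypothetical and are not taken from the released source code.

```python
from typing import List, Sequence


def bind_features(feature_maps: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over the per-image feature maps of one case.

    Assumption: every feature map is flattened to the same length.
    """
    if not feature_maps:
        raise ValueError("a case must contain at least one image")
    length = len(feature_maps[0])
    if any(len(s) != length for s in feature_maps):
        raise ValueError("all feature maps must have the same shape")
    # zip(*...) groups the j-th elements of all maps; max picks the strongest.
    return [max(column) for column in zip(*feature_maps)]


# Two toy cases with a different number of images (2 and 3).
case_a = [[0.1, 0.9, 0.3], [0.4, 0.2, 0.8]]
case_b = [[0.1, 0.9, 0.3], [0.4, 0.2, 0.8], [0.0, 0.5, 0.1]]
print(bind_features(case_a))  # [0.4, 0.9, 0.8]
print(bind_features(case_b))  # [0.4, 0.9, 0.8]
```

The output has the same size regardless of how many images the case contains, and an extra image with weak activations does not change the result, which is the property the text attributes to the max operation.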
The network performance is described by a cost function that measures the distance between the network's output and the real label. After training, the neural network is expected to produce an output that is close to the corresponding real label. The training processes of Id-Net and Gr-Net were identical except for the training dataset used. Id-Net was trained on the identification dataset, whereas Gr-Net was trained on the grading dataset. The training process consisted of two phases. The first was offline pre-training: an Inception-BN network was trained on ImageNet. In the second phase, the learned parameters were used as the initial parameters of Id-Net and Gr-Net, which were then fine-tuned on the corresponding ROP dataset. Here, the retinal fundus images were resized from 1200 × 1600 × 3 to 240 × 320 × 3, and the pixel values were normalized into the range [−1, 1]. Cross-entropy was used as the cost function. The optimization method was AdaDelta [27] with the default parameters lr = 1.0, rho = 0.9, eps = 1e-06. The proposed DNN models were implemented using TensorFlow (https://www.tensorflow.org), and the source code is available on GitHub at https://github.com/qmiwang/deeprop.
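The normalization, cost function, and optimizer named above can be sketched in plain Python. This is an illustration only, not the released implementation: it assumes 8-bit pixel values (0 to 255), uses the standard AdaDelta update rule on a single scalar parameter, and all function and class names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize_pixel(value: int) -> float:
    """Map an 8-bit intensity (0..255, an assumption) linearly to [-1, 1]."""
    return value / 127.5 - 1.0


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy cost for one case with an integer class label."""
    return -math.log(probs[label])


class AdaDelta:
    """Standard AdaDelta update for one scalar parameter."""

    def __init__(self, lr: float = 1.0, rho: float = 0.9, eps: float = 1e-6):
        self.lr, self.rho, self.eps = lr, rho, eps
        self.avg_sq_grad = 0.0   # running average of squared gradients
        self.avg_sq_delta = 0.0  # running average of squared updates

    def step(self, param: float, grad: float) -> float:
        self.avg_sq_grad = self.rho * self.avg_sq_grad + (1 - self.rho) * grad ** 2
        delta = -(math.sqrt(self.avg_sq_delta + self.eps)
                  / math.sqrt(self.avg_sq_grad + self.eps)) * grad
        self.avg_sq_delta = self.rho * self.avg_sq_delta + (1 - self.rho) * delta ** 2
        return param + self.lr * delta


print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
print(cross_entropy(softmax([0.0, 0.0]), 1))     # ln(2), the cost of a 50/50 guess

# Toy optimization: minimize f(x) = x^2, whose gradient is 2x.
opt, x = AdaDelta(), 1.0
for _ in range(5):
    x = opt.step(x, 2 * x)
```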

Experts comparisons

To further evaluate the performance of the developed DNN models, the retinal fundus images captured during routine clinical ROP screenings in Jan. 2018 at Chengdu Women & Children's Central Hospital were used to compare the DNN models' predictions with the diagnoses of three human experts. Having developed a labeling system, we instructed three experts with adequate clinical experience in ROP detection to independently make a referral decision on each case using only the retinal fundus images. For each case, the true label was determined by a vote among the labels given by the three experts; that is, the label of a case is the one given by at least two experts. The confusion tables and error rates of the DNN model and the three experts were calculated with reference to the true labels. Furthermore, we computed the kappa values between the DNN model and each of the three experts.
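The voting rule described above can be sketched as follows. The function name `majority_label` is hypothetical, and returning `None` when all three experts disagree is an assumption, since the text does not say how such cases were handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by at least two experts, else None.

    Assumption: a three-way disagreement yields no label; the source
    does not describe this situation.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


print(majority_label(["normal", "minor ROP", "minor ROP"]))  # minor ROP
print(majority_label(["normal", "minor ROP", "severe ROP"]))  # None
```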

Testing of the DeepROP system under clinical settings

The ROP detection algorithm was integrated into a cloud computing platform for automated ROP detection in images of the retinal fundus. The platform is called DeepROP. It can be used for telemedicine ROP screening and cooperation among hospitals. In general, the workflow consists of the following stages (see Fig. 4):

Fig. 4

The workflow of developed DeepROP system in the clinical test. The cloud-based platform consists of the proposed DNN models and the DeepROP website as the user interface.

Step 1: Uploading images of the retinal fundus. In a hospital that has ROP screening devices, prematurely born infants undergo ROP screening programs. The photographer, doctor, or anyone who can obtain images of the retinal fundus can access the automated ROP detection platform and upload the images through a website.

Step 2: Automated ROP detection. The automated ROP detection platform saves the retinal fundus images and automatically divides them into two cases: images of the left eye and those of the right eye. Given a case, Id-Net takes the images as input and generates the ROP identification result. The detection process is complete if the given case is determined to be normal. Otherwise, Gr-Net is executed to determine the severity of ROP. Finally, an initial diagnostic report is generated automatically. This process does not require any human assistance, which mitigates the shortage of adequately trained clinicians for ROP identification.

Step 3: ROP expert evaluation. The diagnostic report contains a confidence score from the automated ROP detection platform. If the confidence score is below a threshold, the results should be evaluated by human experts. They can check the report online, and one report can be evaluated by several experts. Our cloud-based platform thus supports cooperation among hospitals for ROP diagnosis.

The website was assessed in six hospitals for routine ROP screening programs. The clinical ophthalmologists uploaded images captured during routine clinical ROP screening programs without any data selection or preprocessing, and the DeepROP system generated the diagnosis automatically. The specificity and sensitivity of the model were calculated using the diagnosis by the clinical ophthalmologist as the standard label.
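The three stages above can be sketched as a small cascade. Everything here is illustrative: the `Model` interface, the 0.5 decision cut-off, the 0.9 review threshold, the definition of the confidence score, and the assumption that each image arrives tagged with its eye are hypothetical, because the text does not specify them.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical model interface: maps the images of one case to the probability
# of the positive class (ROP for Id-Net, severe ROP for Gr-Net).
Model = Callable[[Sequence[str]], float]


def split_by_eye(images: Sequence[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (eye, image_id) pairs into one case per eye (eye tag is assumed)."""
    cases: Dict[str, List[str]] = {"left": [], "right": []}
    for eye, image_id in images:
        cases[eye].append(image_id)
    return cases


def screen_case(images: Sequence[str], id_net: Model, gr_net: Model,
                review_threshold: float = 0.9) -> dict:
    """Run Id-Net, then Gr-Net only if the case is identified as ROP."""
    p_rop = id_net(images)
    if p_rop < 0.5:
        label, confidence = "normal", 1.0 - p_rop
    else:
        p_severe = gr_net(images)
        if p_severe >= 0.5:
            label, confidence = "severe ROP", p_severe
        else:
            label, confidence = "minor ROP", 1.0 - p_severe
    return {
        "label": label,
        "confidence": confidence,
        # Low-confidence reports are flagged for expert evaluation (Step 3).
        "needs_expert_review": confidence < review_threshold,
    }


cases = split_by_eye([("left", "img1"), ("right", "img2"), ("left", "img3")])
report = screen_case(cases["left"], lambda imgs: 0.97, lambda imgs: 0.95)
print(report["label"], report["needs_expert_review"])  # severe ROP False
```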

Results

In this study, we constructed large-scale ROP datasets of retinal fundus images and developed two DNN models for the detection of ROP. One is Id-Net, which aims to distinguish normal cases from ROP cases, and the other is Gr-Net, which classifies ROP cases as either minor ROP or severe ROP. The two DNN models were trained and evaluated on the constructed datasets and then integrated into a cloud-based platform for automated ROP detection, which was assessed in routine ROP screening in multiple hospitals in China.

Large-scale ROP datasets

We collected a large number of ROP screening images from Chengdu Women & Children's Central Hospital. Following the proposed image labeling method, 349 cases were dropped from the identification dataset and 222 cases were dropped from the grading dataset. The proposed datasets consist of three parts: developing data used to build the DNN models, data for expert comparison, and data collected from the web. The numbers of cases and images are listed in Table 1. The cases in the developing data and the data for expert comparison were collected from screenings of 869 infants. Summaries of the birth weights and gestational ages of the infants are presented in Fig. 5. As expected, most of them had low birth weights or were premature. There were 93 boys and 78 girls, while the genders of the other infants were unknown since they were not recorded during the ROP screening.

Table 1

Summaries of datasets used in this study.

	#patients	Identification task				Grading task
		normal		ROP		Minor ROP		Severe ROP
		#cases	#images	#cases	#images	#cases	#images	#cases	#images
Developing data	605	1484	7559	742	5967	260	1834	260	2305
Data for expert comparison	264	501	2068	51	293	31	173	20	120
Data from web	404	838	4251	106	657	91	565	15	92

Fig. 5

The distributions of birth weight and gestation ages of the infants. “UK” denotes the set of infants whose birth weights or gestation ages were not provided. A, each bar represents the number of sets of infants whose birth weights were in the given range. B, each bar represents the number of sets of infants whose gestation ages were within the given weeks.

To the best of our knowledge, this is the largest dataset of retinal fundus images for ROP detection. A comparison of our dataset with published datasets is given in Table 2. Our dataset is more heterogeneous, as it contains images of more patients and more cases of ROP screening. It would be a good tutorial for ROP education because it covers various instances of different ROP severities. Furthermore, the scale of the dataset is key to the success of automated ROP grading methods, such as DNNs. The labels used in our dataset were designed according to the clinical treatment of ROP patients in Sichuan, China. The dataset is thus more suitable for patients in China.

Table 2

Summaries of different datasets for ROP detection.

	Patients	Cases	Images	Labels
Canada[3]	35	347	1459	normal/plus
London[3]	–	–	106	normal/plus
Dataset[8]	–	–	77	Normal/plus/preplus
Ours	1273	3722	20,795	normal/minor/severe
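As a quick consistency check, the totals in the “Ours” row of Table 2 can be recomputed from the identification-task columns of Table 1. The dictionary layout below is only a convenience for this sketch; the numbers are copied from Table 1.

```python
# Rows of Table 1:
# (patients, normal cases, normal images, ROP cases, ROP images)
table1 = {
    "Developing data": (605, 1484, 7559, 742, 5967),
    "Data for expert comparison": (264, 501, 2068, 51, 293),
    "Data from web": (404, 838, 4251, 106, 657),
}

patients = sum(row[0] for row in table1.values())
cases = sum(row[1] + row[3] for row in table1.values())
images = sum(row[2] + row[4] for row in table1.values())

print(patients, cases, images)  # 1273 3722 20795
```

The three sums match the patients, cases, and images reported for the present dataset in Table 2.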

Performance of the model

To train and evaluate the DNN models, the developing dataset was split randomly. 298 cases (149 normal and 149 ROP) and 104 cases (52 minor ROP and 52 severe ROP) were selected randomly from the developing dataset to evaluate the performance of Id-Net and Gr-Net, respectively. The other cases were used to train Id-Net and Gr-Net. To deal with class imbalance, during training we oversampled the classes with fewer cases so that each class had the same number of cases in each task. The proportions of false positives and missed cases of the optimized DNNs were recorded with reference to the annotations provided by the four ophthalmologists. The results are presented in Fig. 6, Table 3, and Table 4. In the identification task, the developed Id-Net distinguished normal cases from ROP cases with a sensitivity of 96.64% (95% CI, 92.34%–98.90%) and a specificity of 99.33% (95% CI, 96.32%–99.98%). The area under the ROC curve was 99.49%. In the grading task, Gr-Net attained sensitivity and specificity values of 88.46% (95% CI, 76.56%–95.65%) and 92.31% (95% CI, 81.46%–97.86%), respectively. The area under the ROC curve was 95.08%.
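The oversampling mentioned above could look like the following sketch. The text does not say how oversampling was performed, so random duplication of minority-class cases is an assumption, and the function name `oversample` and the toy case identifiers are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Case = Tuple[str, int]  # (case identifier, class label)


def oversample(cases: Sequence[Case], seed: int = 0) -> List[Case]:
    """Randomly duplicate minority-class cases until all classes are equal in size."""
    rng = random.Random(seed)
    by_label: Dict[int, List[Case]] = {}
    for case in cases:
        by_label.setdefault(case[1], []).append(case)
    target = max(len(group) for group in by_label.values())
    balanced: List[Case] = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement from the smaller class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy training set: 6 normal cases (label 0) and 2 ROP cases (label 1).
train = [(f"n{i}", 0) for i in range(6)] + [(f"r{i}", 1) for i in range(2)]
balanced = oversample(train)
print(Counter(label for _, label in balanced))  # both classes have 6 cases
```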

Fig. 6

Receiver operating characteristic (ROC) curves of DNN models for identification task (Left) and grading task (Right).

Table 3

Confusion table of Id-Net for identification task.

		Predicted label
		Normal	ROP
True label	Normal	148	1
True label	ROP	5	144

Table 4

Confusion table of Gr-Net for grading task.

		Predicted label
		Minor ROP	Severe ROP
True label	Minor ROP	48	4
True label	Severe ROP	6	46
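The sensitivity and specificity values quoted for the test split follow directly from Tables 3 and 4, as the sketch below shows. It treats ROP (Table 3) and severe ROP (Table 4) as the positive class. The source does not state which confidence-interval method was used; the exact binomial (Clopper-Pearson) interval computed here by bisection is an assumption, and all function names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Root of an increasing function on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x: int, n: int, alpha: float = 0.05):
    """Clopper-Pearson interval for x successes out of n (assumed method)."""
    # Lower bound solves P(X >= x | p) = alpha / 2, which increases with p.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper bound solves P(X <= x | p) = alpha / 2; negated to be increasing.
    upper = 1.0 if x == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


def sens_spec(tp: int, fn: int, tn: int, fp: int):
    return tp / (tp + fn), tn / (tn + fp)


# Table 3 (Id-Net): 144 ROP found, 5 missed, 148 normal correct, 1 false alarm.
id_sens, id_spec = sens_spec(tp=144, fn=5, tn=148, fp=1)
# Table 4 (Gr-Net): 46 severe found, 6 missed, 48 minor correct, 4 false alarms.
gr_sens, gr_spec = sens_spec(tp=46, fn=6, tn=48, fp=4)

print(f"{id_sens:.2%} {id_spec:.2%}")  # 96.64% 99.33%
print(f"{gr_sens:.2%} {gr_spec:.2%}")  # 88.46% 92.31%
print(exact_ci(144, 149))  # interval for Id-Net sensitivity
```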

Comparison of the model with human experts

The dataset for expert comparison consisted of the retinal fundus images captured during routine clinical ROP screenings in Jan. 2018 at Chengdu Women & Children's Central Hospital; it was used to compare the developed DNNs' referral decisions with the decisions made by human experts. It contains 2361 images of 552 cases from 264 patients. The distributions of the normal, minor ROP, and severe ROP cases are presented in Table 1. The inter-rater kappa values among the DNN models and the three experts are presented in Table 5. Furthermore, the confusion tables and error rates of the DNN models and the three experts are presented in Table 6, Table 7, Table 8, Table 9, and Fig. 7. The developed DNN models outperformed expert 1, whose agreement with the voted label was lower (kappa of 0.53 versus 0.68 for the DNN models; Table 5).

Table 5

Inter-rater KAPPA values.

	DNN Models	Expert1	Expert2	Expert3	Label
DNN Models	1	0.34	0.70	0.70	0.68
Expert 1	0.34	1	0.32	0.42	0.53
Expert 2	0.70	0.32	1	0.69	0.80
Expert 3	0.70	0.42	0.69	1	0.89
Label	0.68	0.53	0.80	0.89	1

Table 6

The confusion table of the DNNs on the dataset used for the comparison of the model with experts.

		Predicted Label
		Normal	Minor	Severe
True Label	Normal	487	11	3
	Minor	6	16	9
	Severe	2	1	17

Table 7

The confusion table of the expert 1 on the dataset used for the comparison of the model with experts.

		Predicted label
		Normal	Minor	Severe
True Label	Normal	499	0	2
	Minor	15	16	0
	Severe	5	15	0

Table 8

The confusion table of the expert 2 on the dataset used for the comparison of the model with experts.

		Predicted label
		Normal	Minor	Severe
True Label	Normal	492	9	0
	Minor	1	20	10
	Severe	0	0	20

Table 9

The confusion table of the expert 3 on the dataset used for the comparison of the model with experts.

		Predicted label
		Normal	Minor	Severe
True label	Normal	489	12	0
	Minor	0	31	0
	Severe	0	0	20

Fig. 7

The error rates of the experts and the developed DNN model.

Furthermore, the average time per case used by the DNNs is around 2 s, compared with around 30–60 s for human experts, so prediction by the DNN models is considerably faster.
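The kappa values against the voted label in Table 5 can be recomputed from the confusion tables. The sketch below assumes the reported statistic is the unweighted Cohen's kappa, which the text does not state explicitly; the function names are hypothetical, and the matrices are copied from Tables 6 and 7.

```python
from typing import List


def cohen_kappa(confusion: List[List[int]]) -> float:
    """Unweighted Cohen's kappa from a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


def error_rate(confusion: List[List[int]]) -> float:
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1 - correct / total


# Rows: true label (normal, minor, severe); columns: predicted label.
dnn = [[487, 11, 3], [6, 16, 9], [2, 1, 17]]       # Table 6
expert1 = [[499, 0, 2], [15, 16, 0], [5, 15, 0]]   # Table 7

print(round(cohen_kappa(dnn), 2), round(cohen_kappa(expert1), 2))  # 0.68 0.53
print(f"{error_rate(dnn):.2%} {error_rate(expert1):.2%}")          # 5.80% 6.70%
```

Both kappa values agree with the “Label” column of Table 5, and the error rates show the DNN models making fewer errors than expert 1 on these 552 cases.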

DeepROP website and clinical test

The proposed ROP detection algorithm was integrated with a cloud computing platform that provides automated ROP detection and report generation by the proposed networks, sample search and download, report evaluation, and interaction between patients and ophthalmologists. The website was assessed in six hospitals for routine ROP screening programs: Chengdu Women & Children's Central Hospital, Xin Hua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Ya An People's Hospital, People's Hospital of De Yang City, Affiliated Hospital of North Sichuan Medical College, and Affiliated Hospital of Southwest Medical University. The clinical ophthalmologists uploaded images captured during routine clinical ROP screening programs without any data selection or preprocessing. From March 3 to December 28, 2017, 472 ROP screenings were conducted on 404 infants in Chengdu Women & Children's Central Hospital. The clinical ophthalmologist uploaded images from these 472 screenings to the website. DeepROP automatically divided the images into 944 cases according to eye. For each case, it made two predictions, for identification and grading, respectively, and generated an initial diagnostic report that was evaluated by a clinical ophthalmologist. The ophthalmologist edited the report if his/her diagnosis was inconsistent with the initial report by DeepROP. For comparison, the diagnosis by the clinical ophthalmologist was taken as the standard label. The ophthalmologist diagnoses and DeepROP predictions were then recorded. The performance of DeepROP in this test is shown in Table 10.

Table 10

Summaries of the performance of DNNs used in DeepROP website in clinical test.

	Id-Net	Gr-Net
Sensitivity	84.91%(95%CI, 76.65%–91.12%)	93.33%(95%CI, 68.05%–99.83%)
Specificity	96.90%(95%CI, 95.49%–97.96%)	73.63%(95%CI, 68.05%–99.83%)
Accuracy	95.55%(95%CI, 94.03%–96.77%)	76.42% (95%CI, 67.18%–84.12%)
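The accuracy values in Table 10 can be related to the sensitivity and specificity by a short worked calculation. It assumes that the positive class is ROP for Id-Net and severe ROP for Gr-Net, and that the class counts P and N are those of the “Data from web” row of Table 1 (106 ROP and 838 normal cases; 15 severe and 91 minor cases).

```latex
\begin{align*}
\text{Accuracy} &= \frac{\text{Sensitivity}\cdot P + \text{Specificity}\cdot N}{P + N} \\[4pt]
\text{Id-Net:}\quad
  & \frac{0.8491 \times 106 + 0.9690 \times 838}{944}
    \approx \frac{90 + 812}{944} \approx 95.55\% \\[4pt]
\text{Gr-Net:}\quad
  & \frac{0.9333 \times 15 + 0.7363 \times 91}{106}
    \approx \frac{14 + 67}{106} \approx 76.42\%
\end{align*}
```

Under these assumptions, both results reproduce the accuracy row of Table 10.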

Discussions

Automated ROP grading algorithms are desirable for ROP screening and treatment. Dataset plays a crucial role in the training of automated ROP grading algorithms. In general, a dataset used for DNNs must contain a large number of images of ROP screenings that cover various ROP features and different severities, and have been captured from an adequately large number of patients to reduce the individual effect and avoid the over-fitting of the learning algorithm to some specific feature. Otherwise, the performance of the learning algorithm worsens when applied to new images of ROP screening. The datasets constructed in this study are advantageous in terms of the above properties over other datasets. Moreover, both the identification dataset and grading dataset were provided, which enabled the process to grade cases of ROP for accurate treatment, rather than simply identify ROP. The results showed that DNNs can be trained using large-scale datasets, without the need to specify features by experts, to automatically detect ROP in images of the retinal fundus with high sensitivity and accuracy. Compared with human experts with adequate clinical experience on ROP diagnose, the developed DNNs obtained comparable performance and made the detection prediction very efficiently. Further, it provides the same diagnosis on a given image every time, which is difficult for a human ophthalmologist. Multiple DNN models can be trained and integrated to simulate the consultation of several experts as well. This will be tested in our future work. Furthermore, the proposed method in this study is more suitable to be deployed in the automated ROP screening system than the method in the literature [21]. Because our method detects the ROP per case no matter how many images there are. The clinical setting test highlighted the impressive performance of the DeepROP platform. Results suggest that whilst ROP identification worked very well prospectively, the ROP grading does not do so well. 
There are two possible reasons. One is that the features of “minor ROP” and “severe ROP” cases are less distinguishable from each other than the features of “Normal” and “ROP” cases. The other is that fewer cases were used to develop the Gr-Net than the Id-Net, so the Gr-Net generalizes less well than the Id-Net.

The clinical setting test is an important step toward telemedicine ROP screening programs and multihospital collaboration, for several reasons. First, it enables ROP pre-screening and evaluation of patients in non-specialized hospitals. Second, it breaks the bottleneck in data usage caused by the isolation of data in individual hospitals and accelerates the collection of a large-scale dataset for fine-grained ROP grading. Third, the cloud-based platform can collect new data, which will increase the coverage of ROP features and improve the ROP detection algorithm. For example, in the Web-based test in this study, 4908 images of 944 cases from 404 infants were collected in 10 months. This collection is larger than published datasets.

This work has three limitations. First, the number of “severe ROP” cases is not sufficient to ensure generalization performance in the clinical test. Second, the developed system can only evaluate the severity of ROP; in future work, we plan to collect more data and grade ROP in more fine-grained classes, such as “plus disease,” “stage,” and “zone.” Third, it remains an open question whether the predictions of the cloud platform influence the diagnoses of ophthalmologists, compared with clinical testing without the platform. Future work should test the platform more widely.
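
To make the per-case decision and the idea of integrating several models concrete, the following is a minimal sketch and not the DeepROP implementation. It assumes that each model outputs one ROP probability per image, that a case is scored by its most suspicious image, and that several models are combined by averaging their case-level scores. These rules, the 0.5 threshold, the toy numbers, and all function names are hypothetical choices made for illustration. Sensitivity and specificity follow their standard definitions.

```python
from typing import Sequence, Tuple


def case_probability(image_probs: Sequence[float]) -> float:
    """Score a case from any number of per-image ROP probabilities.

    Assumption (not from the paper): the case score is the maximum
    image-level probability, so the number of images does not matter.
    """
    if len(image_probs) == 0:
        raise ValueError("a case needs at least one image")
    return max(image_probs)


def ensemble_case_probability(
    per_model_image_probs: Sequence[Sequence[float]],
) -> float:
    """Average the case scores of several models (a simulated 'consultation').

    Assumption: plain averaging; other combination rules are possible.
    """
    if len(per_model_image_probs) == 0:
        raise ValueError("at least one model is required")
    case_probs = [case_probability(p) for p in per_model_image_probs]
    return sum(case_probs) / len(case_probs)


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = ROP, 0 = normal)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Made-up toy data: each case holds the image probabilities of two models.
    cases = [
        ([0.10, 0.92, 0.30], [0.20, 0.85]),  # labelled ROP
        ([0.05, 0.10], [0.15, 0.08, 0.12]),  # labelled normal
        ([0.40, 0.45], [0.35]),              # labelled ROP
    ]
    labels = [1, 0, 1]
    threshold = 0.5  # hypothetical operating point
    preds = [int(ensemble_case_probability(c) >= threshold) for c in cases]
    sens, spec = sensitivity_specificity(labels, preds)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Taking the maximum keeps the decision independent of the number of images per case, at the cost of being sensitive to a single false-positive image; averaging over images or over models trades that sensitivity for stability.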

16 in total

Review 1. The International Classification of Retinopathy of Prematurity revisited.

Authors:
Journal: Arch Ophthalmol Date: 2005-07

2. Computerized analysis of retinal vessel width and tortuosity in premature infants.

Authors: Clare M Wilson; Kenneth D Cocker; Merrick J Moseley; Carl Paterson; Simon T Clay; William E Schulenburg; Monte D Mills; Anna L Ells; Kim H Parker; Graham E Quinn; Alistair R Fielder; Jeffrey Ng
Journal: Invest Ophthalmol Vis Sci Date: 2008-04-11 Impact factor: 4.799

Review 3. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

4. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs.

Authors: Varun Gulshan; Lily Peng; Marc Coram; Martin C Stumpe; Derek Wu; Arunachalam Narayanaswamy; Subhashini Venugopalan; Kasumi Widner; Tom Madams; Jorge Cuadros; Ramasamy Kim; Rajiv Raman; Philip C Nelson; Jessica L Mega; Dale R Webster
Journal: JAMA Date: 2016-12-13 Impact factor: 56.272

5. Mastering the game of Go without human knowledge.

Authors: David Silver; Julian Schrittwieser; Karen Simonyan; Ioannis Antonoglou; Aja Huang; Arthur Guez; Thomas Hubert; Lucas Baker; Matthew Lai; Adrian Bolton; Yutian Chen; Timothy Lillicrap; Fan Hui; Laurent Sifre; George van den Driessche; Thore Graepel; Demis Hassabis
Journal: Nature Date: 2017-10-18 Impact factor: 49.962

6. Characteristics of infants with severe retinopathy of prematurity in countries with low, moderate, and high levels of development: implications for screening programs.

Authors: Clare Gilbert; Alistair Fielder; Luz Gordillo; Graham Quinn; Renato Semiglia; Patricia Visintin; Andrea Zin
Journal: Pediatrics Date: 2005-04-01 Impact factor: 7.124

7. Automated Diagnosis of Plus Disease in Retinopathy of Prematurity Using Deep Convolutional Neural Networks.

Authors: James M Brown; J Peter Campbell; Andrew Beers; Ken Chang; Susan Ostmo; R V Paul Chan; Jennifer Dy; Deniz Erdogmus; Stratis Ioannidis; Jayashree Kalpathy-Cramer; Michael F Chiang
Journal: JAMA Ophthalmol Date: 2018-07-01 Impact factor: 7.389

8. Childhood blindness.

Authors: P G Steinkuller; L Du; C Gilbert; A Foster; M L Collins; D K Coats
Journal: J AAPOS Date: 1999-02 Impact factor: 1.220

Review 9. Retinal, visual, and refractive development in retinopathy of prematurity.

Authors: Anne Moskowitz; Ronald M Hansen; Anne B Fulton
Journal: Eye Brain Date: 2016-05-20

10. Computer-Based Image Analysis for Plus Disease Diagnosis in Retinopathy of Prematurity: Performance of the "i-ROP" System and Image Features Associated With Expert Diagnosis.

Authors: Esra Ataer-Cansizoglu; Veronica Bolon-Canedo; J Peter Campbell; Alican Bozkurt; Deniz Erdogmus; Jayashree Kalpathy-Cramer; Samir Patel; Karyn Jonas; R V Paul Chan; Susan Ostmo; Michael F Chiang
Journal: Transl Vis Sci Technol Date: 2015-11-30 Impact factor: 3.283

15 in total

1. Diving Deep into Deep Learning: An Update on Artificial Intelligence in Retina.

Authors: Brian E Goldhagen; Hasenin Al-Khersan
Journal: Curr Ophthalmol Rep Date: 2020-06-07

2. Automatic zoning for retinopathy of prematurity with semi-supervised feature calibration adversarial learning.

Authors: Yuanyuan Peng; Zhongyue Chen; Weifang Zhu; Fei Shi; Meng Wang; Yi Zhou; Daoman Xiang; Xinjian Chen; Feng Chen
Journal: Biomed Opt Express Date: 2022-03-09 Impact factor: 3.562

3. Automated identification of retinopathy of prematurity by image-based deep learning.

Authors: Yan Tong; Wei Lu; Qin-Qin Deng; Changzheng Chen; Yin Shen
Journal: Eye Vis (Lond) Date: 2020-08-01

Review 4. Artificial intelligence for retinopathy of prematurity.

Authors: Rebekah H Gensure; Michael F Chiang; John P Campbell
Journal: Curr Opin Ophthalmol Date: 2020-09 Impact factor: 3.761

5. ROPtool analysis of plus and pre-plus disease in narrow-field images: a multi-image quadrant-level approach.

Authors: Marguerite C Weinert; David K Wallace; Sharon F Freedman; J Wayne Riggins; Keith J Gallaher; S Grace Prakalapakorn
Journal: J AAPOS Date: 2020-03-27 Impact factor: 1.220

6. Practice Guidelines for Ocular Telehealth-Diabetic Retinopathy, Third Edition.

Authors: Mark B Horton; Christopher J Brady; Jerry Cavallerano; Michael Abramoff; Gail Barker; Michael F Chiang; Charlene H Crockett; Seema Garg; Peter Karth; Yao Liu; Clark D Newman; Siddarth Rathi; Veeral Sheth; Paolo Silva; Kristen Stebbins; Ingrid Zimmer-Galler
Journal: Telemed J E Health Date: 2020-03-25 Impact factor: 3.536

7. Automated Explainable Multidimensional Deep Learning Platform of Retinal Images for Retinopathy of Prematurity Screening.

Authors: Ji Wang; Jie Ji; Mingzhi Zhang; Jian-Wei Lin; Guihua Zhang; Weifen Gong; Ling-Ping Cen; Yamei Lu; Xuelin Huang; Dingguo Huang; Taiping Li; Tsz Kin Ng; Chi Pui Pang
Journal: JAMA Netw Open Date: 2021-05-03