
DeepDRiD: Diabetic Retinopathy-Grading and Image Quality Estimation Challenge.

Ruhan Liu1,2, Xiangning Wang3, Qiang Wu3, Ling Dai1,2, Xi Fang4, Tao Yan5, Jaemin Son6, Shiqi Tang7, Jiang Li8, Zijian Gao9, Adrian Galdran10, J M Poorneshwaran11, Hao Liu9, Jie Wang12, Yerui Chen13, Prasanna Porwal14, Gavin Siew Wei Tan15, Xiaokang Yang2, Chao Dai16, Haitao Song2, Mingang Chen17, Huating Li18,19, Weiping Jia18,19, Dinggang Shen20,21, Bin Sheng1,2, Ping Zhang22,23,24.   

Abstract

We describe the "Diabetic Retinopathy (DR)—Grading and Image Quality Estimation Challenge," held in conjunction with ISBI 2020, which comprised three sub-challenges aimed at developing deep learning models for DR image quality assessment and grading. The scientific community responded positively, with 34 submissions from 574 registrations. For the challenge, we provided the DeepDRiD dataset, containing 2,000 regular DR images (500 patients) and 256 ultra-widefield images (128 patients), both with image quality and DR grading annotations. We discuss the details of the top three algorithms in each sub-challenge. The weighted kappa for DR grading ranged from 0.93 to 0.82, and the accuracy for image quality evaluation ranged from 0.70 to 0.65. The results show that image quality assessment remains a target for further exploration. We have also released the DeepDRiD dataset on GitHub to help develop automatic systems and improve human judgment in DR screening and diagnosis.
© 2022 The Author(s).


Keywords:  artificial intelligence; challenge; deep learning; diabetic retinopathy; fundus image; image quality analysis; retinal image; screening; ultra-widefield

Year:  2022        PMID: 35755875      PMCID: PMC9214346          DOI: 10.1016/j.patter.2022.100512

Source DB:  PubMed          Journal:  Patterns (N Y)        ISSN: 2666-3899


Introduction

Diabetic retinopathy (DR) is the most common complication of diabetes; it leads to vision loss in adults and mainly affects the working-age population.1, 2, 3, 4 Approximately 600 million people are estimated to have diabetes by 2040, and one-third of them are expected to have DR. DR is diagnosed by visually inspecting a retinal fundus image for the presence of one or more retinal lesions, such as microaneurysms, hemorrhages, soft exudates, and hard exudates. An internationally accepted method of grading DR levels classifies DR into non-proliferative DR (NPDR) and proliferative DR (PDR). NPDR is the early stage of DR and is characterized by the presence of microaneurysms, whereas PDR is an advanced stage that can lead to severe vision loss. The number and degree of retinal lesions vary across DR grades; the specific grading standards for NPDR and PDR are listed in Table 1. Furthermore, Figure S1 shows the presentation of different levels of DR.
Table 1

International Clinical DR Severity Scale

Disease severity level | Description | Findings observable on dilated ophthalmoscopy
Grade 0 | no apparent retinopathy | no abnormalities
Grade 1 | mild NPDR | microaneurysms only
Grade 2 | moderate NPDR | more than just microaneurysms but less than severe NPDR
Grade 3 | severe NPDR | any of the following: more than 20 intraretinal hemorrhages in each of 4 quadrants; definite venous beading in more than 2 quadrants; prominent intraretinal microvascular abnormalities in more than 1 quadrant; and no signs of PDR
Grade 4 | PDR | one or more of the following: neovascularization; vitreous/preretinal hemorrhage

PDR, proliferative diabetic retinopathy; NPDR, non-proliferative diabetic retinopathy.

DR has severe implications, such as blindness, and population-wide screening is still hampered by several factors.6, 7, 8, 9, 10 First, DR screening places a cumbersome burden on ophthalmologists. Second, healthcare workers often lack adequate training, resulting in low accuracy in DR grading. Therefore, computer-aided diagnostic tools are needed to assist manual screening, reduce the burden on ophthalmologists, and help trained providers grade fundus images more accurately.12, 13, 14, 15, 16 Recent studies have collected raw fundus images and obtained accurate pixel- or image-level expert annotations;17, 18, 19 these efforts play an important role in helping the research community develop, validate, and compare DR grading algorithms. Large numbers of raw fundus images and their corresponding physician annotations have important clinical implications for developing robust automated DR grading models.

In medical image analysis, grand challenges present substantial opportunities to quickly advance state-of-the-art methods. Organizers define a clinically relevant task and build a sufficiently large and diverse dataset so that participants can develop algorithms for solving one or several clinically related problems. Moreover, the algorithms proposed by participants are evaluated consistently, allowing a fair performance comparison. Many successful challenges have been organized in recent years, specifically in the DR field, e.g., IDRiD, Kaggle 2015, Messidor, Kaggle 2009, ROC, E-Ophtha, and DiaretDB. In 2018, we organized the "Diabetic Retinopathy—Segmentation and Grading Challenge" (IDRiD) to grade DR levels, segment fundus lesions, and locate retinal landmarks (macula and optic disc) in regular fundus images.
The development of our automated DR screening system was further refined during and after the competition; the details of the model development are described in Dai et al.'s work. When developing the system, we found several problems hindering the practicality of automatic DR screening. First, low-quality fundus images, with significant artifacts and poorly lit areas, increase training difficulty. Moreover, fundus images from different devices pose a challenge to the stability of automated screening systems. Finally, dual views of regular fundus images were rarely provided in previous DR challenges. To address these limitations, we organized the "Diabetic Retinopathy—Grading and Image Quality Estimation Challenge" (DeepDRiD) at ISBI 2020 with three sub-challenges: (1) DR grading of regular fundus images of varying quality, (2) image quality assessment for gradability, and (3) ultra-widefield (UWF) DR grading to assess transfer across devices. The challenge was set up as follows. In DeepDRiD, we presented regular fundus photographs of the left and right eyes of each patient in dual views (macula-centered and optic-disc-centered) for system development by participants. We provided a detailed quality assessment score for each image in the dataset, and a sub-challenge to assess image quality in four aspects: artifact, clarity, field definition, and overall score. We also prepared a dataset of UWF fundus photographs containing one double-view shot per patient; this dataset offered the possibility of developing, validating, and testing DR screening systems across multiple devices. In this paper, we discuss the details of the three best algorithms in each sub-challenge. All participants used convolutional neural networks; their algorithms differed mainly in neural network architecture, training strategy, and pre- and post-processing methods.
By examining their models, we validate the performance of image quality and DR grading, and we also summarize the relationship between these two tasks.

Methods

Materials

For the DeepDRiD challenge, the regular fundus images included patients from several projects in Shanghai: the Shanghai Diabetic Complication Screening Project, the Nicheng Diabetes Screening Project, and the Nation-wide Screening for Complications of Diabetes. The remaining images, both regular fundus and UWF retinal images, were collected by retinal specialists at the outpatient ophthalmology clinic of the Sixth People's Hospital of Shanghai Jiao Tong University in China. From the thousands of examinations available, we randomly selected 2,000 regular fundus images from 500 patients to form the regular fundus dataset. Each patient in this dataset has four fundus images, two per eye, centered on the macula and on the optic disc; an example patient is shown in Figure S2A. Furthermore, 256 UWF images from another 128 patients formed our UWF dataset. The study was approved by the Ethics Committee of Shanghai Sixth People's Hospital and conducted in accordance with the Declaration of Helsinki. Informed consent was obtained from participants. The study was registered on the Chinese Clinical Trials Registry (ChiCTR.org.cn) under the identifier ChiCTR2000031184.

In constructing the DeepDRiD dataset, we performed the following procedures to ensure image quality and the accuracy of diagnostic labels. Original retinal images were uploaded to an online platform, and the images of each eye were assigned separately to two authorized ophthalmologists, who labeled them on an online reading platform, giving image quality assessment scores and a graded DR diagnosis. A third ophthalmologist, serving as the senior supervisor, confirmed or corrected the diagnosis when the results were contradictory. The final grading result depended on consistency among these three ophthalmologists.
Clinically, five levels of DR are distinguished, based on the International Clinical DR (ICDR) classification scale: (1) no apparent retinopathy (grade 0), (2) mild NPDR (grade 1), (3) moderate NPDR (grade 2), (4) severe NPDR (grade 3), and (5) PDR (grade 4). Furthermore, the major factors affecting fundus image quality assessment are image artifact, low clarity, and low field definition, as shown in Figure S2B. The specific criteria of image quality assessment and DR grading can be seen in Tables 1 and 2. The DeepDRiD dataset is available to the public (Mendeley Data: https://doi.org/10.5281/zenodo.6452623).
Table 2

Image quality scoring criteria

Type | Image quality specification | Score
Artifact | no artifacts | 0
Artifact | artifacts are outside the aortic arch, with scope less than ¼ of the image | 1
Artifact | artifacts do not affect the macular area, with range less than ¼ | 4
Artifact | artifacts cover more than ¼ but less than ½ of the image | 6
Artifact | artifacts cover more than ½ without fully covering the posterior pole | 8
Artifact | artifacts cover the entire posterior pole | 10
Clarity | only the level I vascular arch is visible | 1
Clarity | the level II vascular arch and a small number of lesions are visible | 4
Clarity | the level III vascular arch and some lesions are visible | 6
Clarity | the level III vascular arch and most lesions are visible | 8
Clarity | the level III vascular arch and all lesions are visible | 10
Field definition | does not include the optic disc and macula | 1
Field definition | contains either the optic disc or the macula | 4
Field definition | contains both the optic disc and the macula | 6
Field definition | the optic disc or macula is outside 1 papillary diameter but within 2 papillary diameters of the center | 8
Field definition | the optic disc and macula are within 1 papillary diameter of the center | 10
Overall quality | quality is not good enough for the diagnosis of retinal diseases | 0
Overall quality | quality is good enough for the diagnosis of retinal diseases | 1
For regular fundus images, the data were split into 60% for training (Regular Set-A: 300 patients, 1,200 images), 20% for open testing (Regular Set-B: 100 patients, 400 images), and 20% for final evaluation (Regular Set-C: 100 patients, 400 images). The UWF data were divided into UWF Set-A (77 patients, 154 images), UWF Set-B (25 patients, 50 images), and UWF Set-C (26 patients, 52 images). Set-A and Set-B of the regular fundus and UWF images were provided to participants for model development, and Set-C was used as an online validation set to evaluate final performance. In addition, we provided patient-level DR grading results, including a comprehensive assessment of the DR grading results of both eyes. The distribution of DR severity in the regular fundus image dataset (Regular Set-A, Set-B, and Set-C) is shown in Table 3. For the 256-image UWF fundus dataset, we collected DR grading labels according to the ICDR classification scale; the grading procedure for ophthalmologists was the same as for the regular fundus images. In this dataset, we obtained only two UWF fundus images per patient, one from each eye, and provided a DR grading level for each optic-disc-centered UWF image. Example UWF fundus images at different DR levels are shown in Figure S3. Detailed information on the DeepDRiD dataset can be found in Note S1.
Table 3

Basic characteristics of the patients in the DeepDRiD dataset (mean ± SD)

Characteristic | Regular Set-A | Regular Set-B | Regular Set-C | UWF Set-A | UWF Set-B | UWF Set-C
No. of images | 1,200 | 400 | 400 | 154 | 50 | 52
No. of participants | 300 | 100 | 100 | 77 | 25 | 26
Male (%) | 51.00 | 44.00 | 46.00 | 54.55 | 57.69 | 48.00
Age (years) | 70.63 ± 7.70 | 65.13 ± 1.89 | 61.36 ± 7.23 | 74.64 ± 4.86 | 64.96 ± 1.71 | 58.28 ± 4.88
BMI (kg m−2) | 25.17 ± 3.13 | 24.88 ± 3.21 | 25.01 ± 2.58 | 24.90 ± 2.89 | 25.19 ± 2.61 | 24.06 ± 3.30
Waist (cm) | 90.15 ± 9.24 | 88.36 ± 9.75 | 88.03 ± 8.87 | 88.43 ± 9.07 | 92.00 ± 8.86 | 84.73 ± 7.55
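The patient-level split described above (keeping all four images of a patient in the same subset) can be sketched as follows; `split_patients` is a hypothetical helper for illustration, not the organizers' actual code:

```python
import random

def split_patients(patient_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split patient IDs (not images) into Set-A/B/C so that all
    images of one patient land in the same set, as in DeepDRiD."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n = len(ids)
    n_a = round(fractions[0] * n)
    n_b = round(fractions[1] * n)
    return ids[:n_a], ids[n_a:n_a + n_b], ids[n_a + n_b:]

set_a, set_b, set_c = split_patients(range(500))
# 500 regular-fundus patients -> 300 / 100 / 100 patients
```

Splitting by patient ID rather than by image avoids leakage of a patient's other views between the training and evaluation sets.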

Challenge setup

The DeepDRiD challenge comprised several stages, providing a well-organized process to facilitate its success. Figure 1 depicts the workflow of the overall organization of the challenge. The challenge was officially announced on the ISBI 2020 website on October 25, 2019. Following the DR challenge held with ISBI in 2018, we decided to promote progress further through a second challenge using a new dataset (DeepDRiD). The challenge was subdivided into three tasks as follows:
Figure 1

Workflow of the ISBI 2020: Diabetic Retinopathy—Grading and Image Quality Estimation Challenge

Sub-challenge 1, DR disease grading: classification of fundus images according to the severity level of diabetic retinopathy using dual-view retinal fundus images. Sub-challenge 2, image quality estimation: fundus quality assessment for overall image quality, artifacts, clarity, and field definition. Sub-challenge 3, UWF fundus DR grading: exploring the generalizability of a DR grading system. Robust and generalizable models were expected to be developed to solve practical clinical issues. We set up a website to share information about the challenge and provide an interface for all challenge-related issues; it is accessible at https://isbi.deepdr.org. On the website, participants could register and find a general overview of the challenge, including the deadlines, a brief description of the biomedical background of the problem, a description of the dataset, the rules of the challenge, the evaluation metrics, and Python code snippets for accessing the images and annotations. Finally, participants could submit their results and access a forum to ask questions and provide comments. The challenge consisted of an open-testing round (Regular Set-B and UWF Set-B) for teams to refine and calibrate their models, and a final evaluation round (Regular Set-C and UWF Set-C). Participants were granted access to the dataset, forum, and submission system after they registered and accepted the rules of the challenge; anonymous participation was not allowed. The complete DeepDRiD datasets were shared publicly (Mendeley Data: https://doi.org/10.5281/zenodo.6452623). The challenge aimed for a fair comparison of algorithms: given the many public fundus image datasets available, participants were allowed to use other data sources but were required to state which data they used.
The participants had to submit their results as CSV files through the challenge website. The deadline for submissions was March 4, 2020. A maximum of three submissions was allowed per participant, each accompanied by a four-page ISBI-style paper describing the method. The three submissions had to be methodologically different; resubmissions with simple hyper-parameter tuning were not allowed. During the workshop at ISBI 2020, we presented the challenge results and invited the top three teams to present their methods. The results, presentations, and algorithms of participants were shared on the challenge website after the workshop. Subsequently, the challenge was reopened for registration and submissions. In the analysis of submitted results, we used the quadratic weighted kappa (κ) as the assessment metric for the sub-challenges. In sub-challenge 2, the overall quality was additionally evaluated by accuracy. Details of the evaluation method can be found in the supplemental information: evaluation metrics.
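The quadratic weighted kappa is a standard agreement metric; a minimal, dependency-free sketch (not the organizers' exact evaluation script) is:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa between two lists of integer grades 0..n_classes-1."""
    n = len(y_true)
    # observed confusion matrix and marginal histograms
    obs = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    hist_t = [sum(row) for row in obs]
    hist_p = [sum(obs[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic disagreement weight
            expected = hist_t[i] * hist_p[j] / n      # chance-agreement count
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

# perfect agreement gives kappa = 1
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # -> 1.0
```

The quadratic weights penalize a two-grade error four times as heavily as a one-grade error, which suits ordinal labels such as DR severity.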

Results

We had 574 registered participants before March 1, 2020, when the test dataset was released. The teams explored a wide range of machine learning and deep learning models, including CatBoost, LightGBM, XGBoost, VGG, ResNet, SE-ResNeXt, and EfficientNet, as well as combinations of several types of models. In total, 34 teams submitted their models to our challenge. To help the participating teams avoid overfitting, we also provided a separate validation set (Regular Set-B and UWF Set-B) during the competition for validating model results. Figure 2 shows the ranked scores for the three sub-challenges.
Figure 2

Bar chart for leaderboard in three sub-challenges

The colored bars indicate the top three teams in each challenge.


Summary of competing solutions

To keep the paper concise, we present only the methodology and results of the top three best-performing algorithms in each sub-challenge. We discuss the algorithms of the nine teams in terms of the following steps: data preprocessing, data augmentation, model pre-training, and the training strategies of the classifying deep learning models. We provide a summary and then discuss each of these four steps; the detailed models and specific training strategies are shown in Note S2. We also summarize the commonalities behind the good results achieved by these teams. All of them took the background of medical expertise into account and considered the diagnostic processes of professional physicians in data preprocessing, model training, and the integration of final results. Generalization performance was improved by pre-training models on the extensive regular fundus images available in public datasets, together with DR grading results from professional physicians. Exploiting task-to-task correlation, knowledge transfer from source to target data enabled the models to learn important information quickly. Simultaneous training and ensembling of multiple models, using established deep learning training strategies, further improved model performance. The winning teams of the three sub-challenges have different characteristics. In sub-challenge 1 (DR grading using regular fundus images), the winning team did not use complex data preprocessing and augmentation operations but relied on advanced deep learning training techniques. In sub-challenge 2 (image quality assessment), the winning team designed its model around rich data preprocessing and augmentation operations. The winning team in sub-challenge 3 (UWF DR grading) won through pre-training and knowledge transfer from large-scale data.

Data preprocessing

We analyzed the different preprocessing steps used in each of the three sub-challenges: DR grading based on regular fundus images, image quality assessment based on regular fundus images, and DR grading based on UWF fundus images. For DR severity grading of regular fundus images, public datasets providing large numbers of regular fundus images and their DR grading results, such as IDRiD, Kaggle 2015, Messidor, Kaggle 2009, ROC, E-Ophtha, DiaretDB, and REFUGE 2, were used. Previous studies introduced general preprocessing methods to improve the model performance of DR grading. Some teams adopted these preprocessing algorithms, including Ben's preprocessing method, image transformation based on bilinear interpolation, and reducing the black edges of fundus images. In the image quality assessment task, the preprocessing algorithms used by participants were fundamentally the same as in the DR severity grading task, both being based on regular fundus images. Moreover, owing to the differences between regular and UWF fundus images, the preprocessing steps differed between the two DR grading tasks: for the UWF fundus, all teams used a center-cut method to remove the edges of the images. For more details, we refer to Table 4.
Table 4

Differences in preprocessing

RK | Cut | Color | Resize | Filling

Sub-challenge 1: DR grading
1 | N | N | N | N
2 | black edge | Ben's35 | Bi(512) | N
3 | black edge | N | Bi(1,024) | N

Sub-challenge 2: image quality assessment
1 | N | N | Bi(512) | N
2 | N | N | N | N
3 | black edge | N | N | flip

Sub-challenge 3: DR grading based on UWF fundus
1 | center | N | N | N
2 | center | N | N | N
3 | N | N | N | N

Black edge, cut the black edges in the fundus; center, preserve the center of the image as input; Ben's, Ben's preprocessing algorithm; Bi(i), use bilinear interpolation to resize the fundus image to i pixels; flip, use a symmetrical flip pattern to fill the black edges; N, this strategy was not used; RK, rank.
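Two of the preprocessing steps in Table 4, black-edge cropping and center cutting, can be sketched as follows on grayscale images stored as 2-D lists; these are illustrative implementations under simplifying assumptions, not any team's actual code (Ben's preprocessing, which subtracts a Gaussian-blurred copy of the image, is omitted):

```python
def crop_black_edges(img, threshold=10):
    """Remove near-black borders around the circular fundus region.
    `img` is a 2-D list of grayscale pixel rows; pixels above
    `threshold` count as foreground."""
    rows = [i for i, row in enumerate(img) if max(row) > threshold]
    cols = [j for j in range(len(img[0]))
            if max(row[j] for row in img) > threshold]
    r0, r1, c0, c1 = rows[0], rows[-1], cols[0], cols[-1]
    return [row[c0:c1 + 1] for row in img[r0:r1 + 1]]

def center_cut(img, frac=0.8):
    """Keep the central `frac` of each dimension, as the UWF teams did
    to discard the distorted periphery."""
    h, w = len(img), len(img[0])
    dh, dw = round(h * (1 - frac) / 2), round(w * (1 - frac) / 2)
    return [row[dw:w - dw] for row in img[dh:h - dh]]
```

In practice the same logic is applied channel-wise to RGB arrays with NumPy, but the slicing idea is identical.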

Data augmentation

The color distribution of fundus images can influence the robustness of convolutional neural networks (CNNs). Most teams used data augmentation methods, such as color adjustment, mirroring, and rotation, to maintain the generality of their CNNs. In sub-challenge 1, two teams adopted the same mirroring scheme (horizontal flip, vertical flip, and combined horizontal and vertical flip) and used rotation augmentation with different rotation angles. As most teams in sub-challenges 2 and 3 transferred networks pre-trained for sub-challenge 1, they did not use data augmentation methods. In sub-challenge 2, only the top-ranked team used additional mixing-based augmentation methods (CutMix, RICAP, and Mixup). Furthermore, the top-ranked team in sub-challenge 3 adopted a rich set of augmentation strategies. The results of sub-challenges 2 and 3 show that, although pre-trained network transfer helps a network learn new tasks quickly, data augmentation is still helpful for improving results. Table 5 shows the detailed data augmentation methods adopted by the nine teams.
Table 5

Differences in data augmentation

RK | Mirroring | Rotation | Color | Other

Sub-challenge 1: DR grading
1 | N | N | N | N
2 | H/V/HV | R: −30, +30 | N | N
3 | H/V/HV | R: −20, +20 | N | ID/N/R/ET/GT/AT

Sub-challenge 2: image quality assessment
1 | N | N | CM/RC/MU | N
2 | N | N | N | N
3 | N | N | N | N

Sub-challenge 3: DR grading based on UWF fundus
1 | H/V/HV | R: −20, +20 | N | ID/N/R/ET/GT/AT
2 | N | N | N | N
3 | N | N | N | RCC

H, horizontal flip; V, vertical flip; HV, horizontal and vertical flip; R: (min degree, max degree), rotation angle range; ID, image disturbance; N (in "Other"), noise; R (in "Other"), resize; ET, elastic transformation; GT, grid transformation; AT, affine transformation; RCC, random center cut; CM, RC, and MU, the CutMix, RICAP, and Mixup methods from the references; N (elsewhere), not used; RK, rank.
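The mirroring and rotation scheme in Table 5 can be sketched as follows; for a dependency-free illustration, rotation is restricted to quarter turns, whereas the teams used arbitrary angles of ±20° or ±30° with interpolation:

```python
import random

def augment(img, rng=random):
    """Randomly apply the mirroring/rotation scheme described above.
    `img` is a 2-D list of pixel rows."""
    if rng.random() < 0.5:                      # horizontal flip (H)
        img = [row[::-1] for row in img]
    if rng.random() < 0.5:                      # vertical flip (V)
        img = img[::-1]
    k = rng.randrange(4)                        # quarter-turn rotation
    for _ in range(k):
        # rotate 90 degrees clockwise: reverse rows, then transpose
        img = [list(row) for row in zip(*img[::-1])]
    return img
```

Because flips and rotations only rearrange pixels, the augmented image keeps exactly the same pixel values, which is why such transforms are safe for grading labels.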

Model pre-training

Many eye diseases are diagnosed from fundus images, so public fundus image datasets are commonly used to build models for different eye diseases. In our challenge, most teams chose model pre-training to improve their models' ability. Table 6 shows the model pre-training details.
Table 6

Differences in model pre-training

RK | Pre-training dataset

Sub-challenge 1: DR grading
1 | Kaggle2015 + APTOS
2 | Kaggle2015 + APTOS
3 | labeled and unlabeled datasets

Sub-challenge 2: image quality assessment
1 | ImageNet
2 | Kaggle2015
3 | private fundus lesion segmentation data

Sub-challenge 3: DR grading based on UWF fundus
1 | labeled and unlabeled datasets
2 | Kaggle2015 + APTOS
3 | N

The public datasets used are Kaggle2015 and APTOS. Labeled: Kaggle2015, APTOS, and IDRiD; unlabeled: REFUGE, Messidor, and E-Ophtha. N, no pre-training; RK, rank.

Classifying deep learning models

In all three sub-challenges, the teams used modern deep learning models to construct their classification frameworks. Most teams adopted EfficientNet as their backbone and obtained strong performance, whereas some selected the classical ResNet and its variant SE-ResNeXt. One team in sub-challenge 2 used a private dataset of regular fundus images with pixel-level structure labels and selected UNet and VGG as their classification models. Most teams chose standard classification loss functions, such as cross-entropy (CE) loss, L1 loss, and smooth L1 loss; one team in sub-challenge 2 proposed and adopted a cost-sensitive loss. The different training strategies the teams used to develop their models are detailed in Table 7.
Table 7

Differences in deep learning models

RK | Model frameworks | Loss function | Training strategies

Sub-challenge 1: DR grading
1 | EfficientNet33 | SL1 | MMoE + GMP + ES + OHEM + CV + O + T
2 | EfficientNet33 | SL1 + CE + DV + PL | CV + TTA
3 | EfficientNet33 | L1 + CE(5 class) | PLT

Sub-challenge 2: image quality assessment
1 | SE-ResNeXt32 | CE | TL
2 | ResNet31 | CS + L1 | TL
3 | VGG,30 UNet39 | CE | TL

Sub-challenge 3: DR grading based on UWF fundus
1 | EfficientNet33 | L1 + CE(5 class) | PLT
2 | EfficientNet33 | SL1 | MMoE + GMP + ES + OHEM + CV + O + T
3 | EfficientNet33 | CE | TL

SL1, smooth L1 loss; CE, cross-entropy loss; DV, dual-view loss; PL, patient-level loss; CS, cost-sensitive loss; L1, L1 loss; CE(5 class), mean loss of 5 classes (one versus others); MMoE, multi-gate mixture of experts; GMP, generalized mean pooling; OHEM, online hard example mining; CV, cross-validation; O, oversampling; ES, early stopping; TL, transfer learning; TTA, test-time augmentation; PLT, pseudo-labeled and labeled training; RK, rank.
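Two of the less common losses in Table 7 can be sketched in scalar form; SL1 is the standard smooth L1, and CE(5 class) follows the footnote's description as the mean of five one-versus-others binary cross-entropies (an illustrative reading, not any team's actual code):

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (SL1): quadratic near zero, linear for large errors,
    applied here to a scalar grade prediction."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def ce_5class(probs, label):
    """CE(5 class): mean one-versus-others binary cross-entropy over
    the 5 DR grades. `probs` are predicted class probabilities."""
    losses = []
    for k, p in enumerate(probs):
        y = 1.0 if k == label else 0.0
        p = min(max(p, 1e-7), 1 - 1e-7)     # clamp for numerical safety
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(probs)
```

Treating the grade as a scalar regression target (SL1, L1) exploits the ordinal structure of DR severity, while the per-class CE terms keep the discrete class boundaries sharp; the winning teams combined both views.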

Solution results

To evaluate the competing teams' models fairly, the quadratic weighted kappa (κ) score was used to rank the algorithms. κ ranged from 0.9303 to 0.9033 for the nine participating teams in sub-challenge 1, from 0.6981 to 0.6938 for sub-challenge 2, and from 0.9062 to 0.6437 for sub-challenge 3. In sub-challenge 1, almost all teams achieved good performance (>0.90); in sub-challenge 2, almost all teams achieved unsatisfactory performance (<0.70). This may be partly because the competing teams did not adequately account for the uneven class distribution and the relatively small, hard-to-extract differences between classes. In sub-challenge 3, the teams' scores were spread evenly across the range (0.60–0.90); by considering the correlation between different types of fundus images, better accuracy was achieved by a team using transfer learning and sliding-window learning. The scores of the top three teams in each sub-challenge are given in Table 8, and a summary of the participation of the nine teams across all sub-challenges in Table S4. The performance of the proposed methods on the final validation set (Set-C) is shown for six subtasks (divided into three sub-challenges), and the leaderboard ranks in the three sub-challenges are also illustrated.
Table 8

DeepDRiD online leaderboard

Rank | Team | Affiliation | Score

Sub-challenge 1
1 | Xi Fang et al. | Shanghai Jiao Tong University | 0.9303
2 | Jiang Li et al. | Shanghai Jiao Tong University | 0.9262
3 | Jaemin Son et al. | VUNO Inc. | 0.9232

Sub-challenge 2
1 | Poorneshwaran J M et al. | Healthcare Technology Innovation Center | 0.6981
2 | Adrian Galdran et al. | Bournemouth University | 0.6950
3 | Yerui Chen et al. | Nanjing University of Science and Technology | 0.6938

Sub-challenge 3
1 | Jaemin Son et al. | VUNO Inc. | 0.9062
2 | Xi Fang et al. | Shanghai Jiao Tong University | 0.8620
3 | Jie Wang et al. | Beihang University | 0.8230

Sub-challenge 1: DR disease grading

This section presents the performance of all competing solutions in the DR grading task using regular fundus photographs. The results received from the participating teams were analyzed using κ as the validation measure, calculated on the validation set (Regular Set-C) for each technique. Of the 34 teams in the challenge, 11 participated in sub-challenge 1; of these, 9 (see Table S5) performed well in the DR grading task and were invited to the challenge workshop. The top three groups were those of Xi Fang, Jiang Li, and Jaemin Son. The classification results of the three teams show that all of their models achieved good classification performance, with sensitivity and specificity comparable with physicians across grades from normal to PDR. In addition, the classification results showed a slightly higher degree of confusion for mild lesions than for moderate and severe lesions. In Note S3, we detail the model performance and result analysis.

Sub-challenge 2: Image quality estimation

This task was evaluated using the validation algorithm described in Note S2 on Set-C, covering four aspects of image quality: artifacts, clarity, field definition, and overall quality. The algorithms produced scores for these four aspects. The best-performing solution in sub-challenge 2 was proposed by Poorneshwaran J M's team, followed by those of Adrian Galdran and Yerui Chen. Teams that performed poorly on several of the image quality sub-tasks also achieved low overall accuracy. The main reason appears to be inaccurate differentiation of image quality levels, owing to the uneven class distribution in the dataset and the relatively high similarity between classes. Detailed results can be seen in Note S3.

Sub-challenge 3: UWF fundus DR grading

The results for DR grading of UWF fundus images were obtained with the same evaluation method as sub-challenge 1, using κ. Table S6 summarizes the performance of all participating algorithms in the UWF fundus DR grading task. Jaemin Son's team developed the winning method, with Jaemin Son, Xi Fang, and Jie Wang as the top three performers in this task.

Discussion

Summary of holding and analyzing the challenge

In this paper, we presented the details of the DeepDRiD challenge, including the dataset, the evaluation metrics for the sub-challenges, the organization of the challenge, and the solutions and results of the participating teams on all sub-challenges. The sub-challenges covered grading DR severity, quality detection and assessment of fundus photographs, and DR grading of UWF fundus images. With 34 teams participating in the challenge and reporting results, we consider the challenge successful. We did our best to create a relevant, stimulating, and fair competition for advancing the collective knowledge of the research community. The best methods for DR severity grading shared a considerable number of common techniques: (1) efficient feature extraction through data augmentation, (2) transfer learning from large amounts of fundus data with and without physician labels, and (3) loss function modification. In addition, many grading networks used an EfficientNet-based framework to learn grading features quickly and efficiently, which improved model performance. Rich parameter-tuning and model-fusion methods also provided new ideas for further solving the DR grading problem. In the quality assessment task, the accuracy of image quality detection ranged between 0.68 and 0.70. These results did not reach the performance required for clinically feasible automatic screening of good-quality fundus images; therefore, much work remains in image quality assessment. Considering the misclassification cases, attention must be paid to features of both artifacts and clarity to improve the overall assessment results. In sub-challenge 3, the results of five teams were used for evaluation. We observe that knowledge transfer from readily available regular fundus images has a very significant effect on the DR grading task for UWF fundus images.

Limitations of the study

This challenge provided data collected in routine clinical practice using an acquisition protocol that was consistent across all images. The images were acquired with the same camera after pupil dilation and annotated according to the quality-assessment protocol. Several experts jointly evaluated the images in this dataset, and images on which the experts disagreed were excluded. Even after these efforts to provide the best possible data, the annotation process (especially for image quality) remained inherently subjective. Manual judgment is thus a limiting factor in method development, especially for methods trained and evaluated in a supervised manner. Our challenge provides the potential to develop solutions for DR lesion grading, fundus image quality assessment, and DR grading on UWF fundus images. Despite the complexity of the tasks and a development window of only 1.5 months, the challenge received a very positive response from the community. Nevertheless, there is still room for improvement, especially in the evaluation of image quality. Therefore, although the competition is over, the dataset remains publicly available for research purposes, to attract more researchers to study the problem and develop new solutions that meet current and future clinical standards.

Insights on the future directions

Based on our analysis of the organization and results of this challenge, we propose the following directions for future work. First, almost all teams used deep learning models as their main framework, and the results show that these models achieve good performance, demonstrating their great potential for this problem. Second, for pre-training, almost all teams used a wide range of general fundus images for model pre-training and parameter transfer. This reflects, on the one hand, the relatively broad research interest and the large number of publicly available datasets for this problem and, on the other hand, the significant impact of pre-training on model performance. Finally, across the subtasks, the winning models exhibited different characteristics: some focused more on preprocessing and augmentation, others on model architecture and training procedures. This suggests that developing models for specific medical problems requires more problem-specific analysis.

Suggestions for organizing medical grand challenges

To help the research community better organize medical grand challenges, we also offer a few suggestions. First, the motivation for a challenge should come from clinicians' real-world problems. For example, all three subtasks of our challenge arose from difficulties encountered during automated deep learning-based DR screening. In addition, reasonable and compliant access to data prior to organizing a challenge requires communicating and collaborating with clinicians as early as possible. Second, the organization, promotion, and conduct of a challenge should be as diverse as possible: diversity of organizers, diversity of participants, etc. (from different countries and regions, with different professional backgrounds, and so on). Finally, a long-term research foundation will help organizers run the competition better and sustainably guide its direction.

Conclusion

By leveraging hospital research data and physician resources, we provide a finely labeled dataset of realistic DR screening scenarios that demonstrates the diagnostic potential of the DeepDRiD challenge models on conventional DR grading, DR image quality assessment, and ultra-widefield fundus DR grading. These models achieved diagnostic performance comparable with that of general ophthalmologists on DR grading and made preliminary attempts at image quality assessment. Furthermore, these deep learning prediction models and their training strategies can be used to enhance the diagnostic capabilities of healthcare workers and improve the accuracy of DR screening in real screening scenarios. Nevertheless, there is still clear opportunity to further improve the models from this competition. We believe that, with access to higher-quality and more comprehensive image quality assessment data, as well as a wider range of challenge participants, more accurate models could be developed.

Experimental procedures

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Bin Sheng (shengbin@sjtu.edu.cn).

Materials availability

This study did not generate any new materials.

Data and code availability

The DeepDRiD dataset is available at https://github.com/deepdrdoc/DeepDRiD (Zenodo: https://doi.org/10.5281/zenodo.6452623).

Evaluation metrics

For the DR disease-grading tasks in sub-challenges 1 and 3, the quadratic weighted kappa (κ) was used as the evaluation metric to assess the performance of the participating algorithms. Submissions were scored by the quadratic weighted kappa, κ, which measures the agreement between two ratings (ground-truth results and submitted results). This metric varies from 0 (random agreement between raters) to 1 (complete agreement between raters); if there is less agreement between the raters than expected by chance, it can go below 0. The quadratic weighted kappa was calculated between the expected/known scores and the predicted scores. The results had five possible ratings: 0, 1, 2, 3, and 4. The quadratic weighted kappa is calculated as follows. First, a C × C histogram matrix, O, is constructed, such that O_{i,j} corresponds to the number of records that had an actual rating i and received a predicted rating j. A C × C matrix of weights, w, is calculated based on the difference between the actual and predicted rating scores. A C × C histogram matrix of expected ratings, E, is calculated, assuming no correlation between rating scores. It is the outer product of the actual ratings' histogram vector and the predicted ratings' histogram vector, normalized such that E and O have the same sum. From these three matrices, the quadratic weighted kappa is calculated as

κ = 1 − (Σ_{i,j} w_{i,j} O_{i,j}) / (Σ_{i,j} w_{i,j} E_{i,j}).

The weight penalization is defined by w_{i,j} = |i − j|^p / (C − 1)^p, where C is the number of classes. The values p = 1 and p = 2 lead to linear and quadratic penalizations, respectively. The value of κ lies in the interval [−1, 1], where −1 means perfect symmetric disagreement and 1 means perfect agreement. In sub-challenge 2, the scoring metric was classification accuracy, defined as

Accuracy = (N_TP + N_TN) / N,

where N_TP is the number of true positive samples, N_TN is the number of true negative samples, and N is the total number of samples.
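The computation above can be sketched in plain Python. This is a minimal illustrative implementation of the quadratic weighted kappa and the accuracy metric as described, not the organizers' official scoring code:

```python
def quadratic_weighted_kappa(actual, predicted, num_classes=5):
    """Quadratic weighted kappa between two lists of integer ratings."""
    n = len(actual)
    # Observed histogram matrix O: O[i][j] counts actual rating i, predicted j.
    O = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        O[a][p] += 1
    # Quadratic penalty weights: w[i][j] = (i - j)^2 / (C - 1)^2.
    w = [[(i - j) ** 2 / (num_classes - 1) ** 2 for j in range(num_classes)]
         for i in range(num_classes)]
    # Expected matrix E: outer product of the two rating histograms,
    # normalized so that E and O have the same sum (= n).
    hist_a = [actual.count(k) for k in range(num_classes)]
    hist_p = [predicted.count(k) for k in range(num_classes)]
    E = [[hist_a[i] * hist_p[j] / n for j in range(num_classes)]
         for i in range(num_classes)]
    num = sum(w[i][j] * O[i][j]
              for i in range(num_classes) for j in range(num_classes))
    den = sum(w[i][j] * E[i][j]
              for i in range(num_classes) for j in range(num_classes))
    return 1.0 - num / den

def accuracy(actual, predicted):
    """Fraction of correctly classified samples."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
print(quadratic_weighted_kappa([0, 0, 1, 1], [1, 1, 0, 0]))        # -1.0
```

The two printed cases exercise the endpoints of the metric: identical ratings give κ = 1 (the numerator sum is zero), while perfectly swapped ratings give κ = −1.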