
Diagnostic Efficacy and Therapeutic Decision-making Capacity of an Artificial Intelligence Platform for Childhood Cataracts in Eye Clinics: A Multicentre Randomized Controlled Trial.

Haotian Lin¹, Ruiyang Li¹, Zhenzhen Liu¹, Jingjing Chen¹, Yahan Yang¹, Hui Chen¹, Zhuoling Lin¹, Weiyi Lai¹, Erping Long¹, Xiaohang Wu¹, Duoru Lin¹, Yi Zhu¹,², Chuan Chen¹,², Dongxuan Wu³, Tongyong Yu³, Qianzhong Cao¹, Xiaoyan Li¹, Jing Li¹, Wangting Li¹, Jinghui Wang¹, Mingmin Yang⁴, Huiling Hu⁴, Li Zhang⁵, Yang Yu⁶, Xuelan Chen⁶, Jianmin Hu⁶, Ke Zhu⁷, Shuhong Jiang⁸, Yalin Huang⁹, Gang Tan¹⁰, Jialing Huang¹¹, Xiaoming Lin¹, Xinyu Zhang¹, Lixia Luo¹, Yuhua Liu¹, Xialin Liu¹, Bing Cheng¹, Danying Zheng¹, Mingxing Wu¹, Weirong Chen¹, Yizhi Liu¹.

Abstract

BACKGROUND: CC-Cruiser is an artificial intelligence (AI) platform developed for diagnosing childhood cataracts and providing risk stratification and treatment recommendations. The high accuracy of CC-Cruiser was previously validated using specific datasets. The objective of this study was to compare the diagnostic efficacy and treatment decision-making capacity between CC-Cruiser and ophthalmologists in real-world clinical settings.
METHODS: This multicentre randomized controlled trial was performed in five ophthalmic clinics in different areas across China. Pediatric patients (aged ≤ 14 years) without a definitive diagnosis of cataracts or history of previous eye surgery were randomized (1:1) to receive a diagnosis and treatment recommendation from either CC-Cruiser or senior consultants (with over 5 years of clinical experience in pediatric ophthalmology). The experts who provided a gold standard diagnosis, and the investigators who performed slit-lamp photography and data analysis were blinded to the group assignments. The primary outcome was the diagnostic performance for childhood cataracts with reference to cataract experts' standards. The secondary outcomes included the evaluation of disease severity and treatment determination, the time required for the diagnosis, and patient satisfaction, which was determined by the mean rating. This trial is registered with ClinicalTrials.gov (NCT03240848).
FINDINGS: Between August 9, 2017 and May 25, 2018, 350 participants (700 eyes) were randomly assigned for diagnosis by CC-Cruiser (350 eyes) or senior consultants (350 eyes). The accuracies of cataract diagnosis and treatment determination were 87.4% and 70.8%, respectively, for CC-Cruiser, which were significantly lower than 99.1% and 96.7%, respectively, for senior consultants (p < 0.001, OR = 0.06 [95% CI 0.02 to 0.19]; and p < 0.001, OR = 0.08 [95% CI 0.03 to 0.25], respectively). The mean time for receiving a diagnosis from CC-Cruiser was 2.79 min, which was significantly less than 8.53 min for senior consultants (p < 0.001, mean difference 5.74 [95% CI 5.43 to 6.05]). The patients were satisfied with the overall medical service quality provided by CC-Cruiser, particularly with its time-saving feature in cataract diagnosis.
INTERPRETATION: CC-Cruiser was less accurate than senior consultants in diagnosing childhood cataracts and making treatment decisions. However, the medical service provided by CC-Cruiser was less time-consuming and achieved a high level of patient satisfaction. CC-Cruiser has the capacity to assist human doctors in clinical practice in its current state.
FUNDING: National Key R&D Program of China (2018YFC0116500) and the Key Research Plan for the National Natural Science Foundation of China in Cultivation Project (91846109).


Keywords: Artificial intelligence; Childhood cataracts; Multicentre randomized controlled trial; Ophthalmology

Year: 2019 PMID: 31143882 PMCID: PMC6510889 DOI: 10.1016/j.eclinm.2019.03.001

Source DB: PubMed Journal: EClinicalMedicine ISSN: 2589-5370

Research in context

Evidence before this study

Advances in medical AI are expected to provide high-quality medical services and alleviate the shortage of medical resources. We searched PubMed for clinical trials published in all languages between Jan 1, 2000, and Dec 20, 2018, using the search terms “artificial intelligence”, “diagnose/diagnosis”, and “treatment”. We also searched the reference lists of the retrieved articles. Previous studies of medical AI, such as diagnosis of skin diseases, breast cancer, retinopathy, cataract, glaucoma and cardiovascular diseases, have mainly relied on machine learning with specific screened datasets or on observational studies to assess AI performance. However, our literature review found that none of the available evidence had evaluated the efficacy of diagnostic medical AI using large-scale unscreened clinical data in a real-world comparative trial.

Added value of this study

To the best of our knowledge, this study is the first multicentre randomized controlled trial to compare the diagnostic accuracy and efficiency of medical AI to that of senior consultants in real-world clinical settings. The results of this study suggest that the accuracy of diagnosis and treatment decision-making of medical AI is lower than that of senior consultants. However, our medical AI requires less time for diagnosis and still achieves a high level of patient satisfaction in eye clinics.

Implications of all the available evidence

The study demonstrates that medical AI has the capacity to assist human doctors in clinical practice. However, the real-world diagnostic performance of all medical AI must be evaluated in controlled clinical trials before regular clinical application.

Introduction

The implementation of artificial intelligence (AI), including robotic surgery, medical imaging, and automated diagnosis, has become increasingly popular in the modern medical industry [1], [2], [3], [4]. For example, IBM-Watson, a question-answering computer system, can provide diagnosis and treatment suggestions for breast cancer [5]. An AI generated through deep convolutional neural network (CNN) algorithms can effectively screen skin disease and classify skin cancer at a level comparable to dermatologists [6]. Medical AI has significant advantages in continuous lifelong learning from human experts, convenient open-source sharing, and efficient decision-making [4], [7]. The traditional medical care service modality has limited capacity for providing high-quality healthcare to large populations, as experienced clinicians require extensive training [8], [9], [10]. In contrast, medical AI can imitate and replace the primary work of human doctors through deep learning, and provide medical guidance to multiple hospitals simultaneously, especially in less-developed and remote areas [11], [12]. Therefore, advances in medical AI are expected to provide high-quality medical services and alleviate the uneven distribution of medical resources [13], [14]. Previous studies on the application of medical AI, such as detecting diabetic retinopathy, macular degeneration, glaucoma, and cardiovascular diseases, mainly focused on machine learning by screening images collected from specific datasets [15], [16], [17], [18], [19]. However, the efficacy of medical AI in disease diagnosis and therapeutic decision-making has not been evaluated using large-scale unfiltered clinical data in a real-world comparative trial. At present, inaccurate diagnoses and inappropriate treatment decisions are common, especially among patients with rare diseases, mainly due to insufficient medical resources in non-specialized hospitals [20], [21], [22].
Childhood cataract is a rare disease that can cause irreversible vision loss without urgent early diagnosis and treatment [23], [24]. Moreover, the diagnosis and treatment of cataract are mainly based on morphology, and AI has shown great advantages in image recognition [1]. Thus, childhood cataract is a suitable test case for the exploration of a medical AI. CC-Cruiser is an ophthalmic AI platform developed by Zhongshan Ophthalmic Center (ZOC) for diagnosing childhood cataracts and providing risk stratification and treatment guidance [25]. This collaborative cloud platform enables patient data sharing between individual hospitals for data integration and patient screening. CC-Cruiser was trained on a dataset including 410 ocular images of childhood cataracts and 476 images of normal eyes from the Childhood Cataract Program of the Chinese Ministry of Health (CCPMOH), a specialized care centre for rare diseases in China. In addition, the high accuracy of CC-Cruiser was previously validated in an in silico test, a website-based study using eye images from websites, a ‘finding a needle in a haystack’ test (a test using a dataset with a normal lens: cataract ratio of 100:1), and a small clinical trial [25]. Here, we performed a multicentre diagnostic randomized controlled trial [26], the final step in evaluating the clinical difference between AI diagnostic procedures using CC-Cruiser and traditional eye clinics. We also investigated patients' feedback regarding the medical services provided by CC-Cruiser and senior consultants.

Methods

Study Design and Participants

This is a large, multicentre, parallel-group, randomized controlled trial performed in five Chinese ophthalmic clinics. The Consolidated Standards of Reporting Trials (CONSORT) guidelines were followed in our study [27]. The leading study centre of this trial is the ZOC, located in Guangzhou in southern China. The other four eye clinics are affiliated with Shenzhen Eye Hospital, the Central Hospital of Wuhan, the Second Affiliated Hospital of Fujian Medical University, and Kaifeng Eye Hospital. We selected these collaborating hospitals from different areas to represent the diversity of healthcare settings across China. Participants were recruited by the investigators according to standard inclusion criteria of the ophthalmic clinics in these hospitals. Participants were eligible for the study if they were less than 14 years old, with or without eye symptoms, and had no history of previous eye surgery. All participants were required to undergo slit-lamp photography, and sedatives such as chloral hydrate were used when necessary. Patients who already had a definitive diagnosis of cataract, other ocular abnormalities or ocular trauma were excluded. Written informed consent was obtained at enrollment from at least one guardian of each participating child, and the principles outlined in the Declaration of Helsinki were followed throughout the study. The study protocol was approved by the ethics committee of ZOC and the institutional review boards at all collaborating centres, including Shenzhen Eye Hospital, the Central Hospital of Wuhan, the Second Affiliated Hospital of Fujian Medical University, and Kaifeng Eye Hospital. This trial is registered with ClinicalTrials.gov (NCT03240848).

Randomization and Masking

The participants were randomized (1:1) to receive a diagnosis from either CC-Cruiser or senior consultants, with both eyes of each participant assigned to the same group. Centralized randomization was done via a random number generating program with no stratification factors to avoid selection bias. Investigators in each study centre assessed the eligibility of each patient. If the patient met the inclusion criteria, the investigator sent the patient's information to a study coordinator, and the coordinator notified the investigators about the allocated group. Slit-lamp photography and patient recruitment were performed in each participating clinic by trained clinical staff. The clinical staff, investigators involved in data management and analysis, and experts providing the gold standard diagnosis by consensus in each clinic were blinded to the group assignments to help prevent ascertainment bias. The study participants, senior consultants, the study coordinator and study personnel responsible for randomization were not masked.
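The allocation step can be illustrated with a minimal sketch. The trial states only that a central random number generating program was used, with 1:1 allocation at the participant level and no stratification. The shuffled-list approach, the function name `allocate_participants`, and the seed below are assumptions made for illustration and are not the trial's actual program.

```python
import random
from collections import Counter


def allocate_participants(n_participants, seed=None):
    """Return a shuffled 1:1 allocation list, one entry per participant.

    Both eyes of a participant follow the participant's arm, so the
    allocation is done per participant, not per eye.
    """
    if n_participants % 2:
        raise ValueError("exact 1:1 allocation needs an even number of participants")
    half = n_participants // 2
    arms = ["AI"] * half + ["SC"] * half  # AI = CC-Cruiser, SC = senior consultants
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility here
    rng.shuffle(arms)
    return arms


allocation = allocate_participants(350, seed=2017)
counts = Counter(allocation)
eyes_per_arm = {arm: 2 * n for arm, n in counts.items()}
print(counts, eyes_per_arm)
```

Shuffling a fixed list guarantees equal arm sizes, whereas drawing each assignment independently would only balance the arms on average.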

Procedures

The CC-Cruiser platform at the Children's Cataract Center of the ZOC was connected with all collaborating clinics through the internet. A CC-Cruiser website (https://www.cc-cruiser.com/version1) has been established with a demonstration video of guidelines and instructions. Registered users can upload new cases to CC-Cruiser, and the output will include: diagnosis (normal lens versus cataract), comprehensive evaluation (opacity area, density and location), and treatment recommendation (surgery versus follow-up). Senior consultants with at least 5 years of clinical experience in pediatric ophthalmology provided initial diagnoses in each centre. The investigators created a profile for every eligible and consenting participant and documented their demographic information and baseline clinical characteristics, including sex, date of birth, family history of cataract, and eye symptoms. The participating investigators and clinical staff at each centre received standardized training for the study procedures before the trial. Before group assignment, all eligible participants underwent slit-lamp photography with pupil dilation under a unified standard of diffused light, appropriate slit-lamp illumination intensity, and normal eye position. The clinical staff made no more than three attempts for each eye. The investigators used sedatives (chloral hydrate) for 43 very young patients who would otherwise not cooperate with this examination. The participants in the AI group were assigned to the AI clinic after slit-lamp photography. The investigators sent images of the ocular anterior segment to CC-Cruiser and received the initial diagnoses (normal lens versus cataract) with comprehensive evaluations of disease severity (lens opacity and the opacity area, density, and location) and treatment suggestions (surgery versus follow-up). The investigators calculated the time required for visiting CC-Cruiser and receiving initial diagnoses.
The participants in the senior consultants group were assigned to the regular ophthalmic clinic. The senior consultants provided patients with initial diagnostic reports including the disease severity and treatment decision. The investigators also calculated the time required for the diagnostic process. After receiving an initial diagnosis, all the participants with identification numbers masked received a gold standard diagnosis from an expert panel including three cataract experts with more than 10 years of clinical experience in ophthalmology. The expert panel performed the slit-lamp examination and reached a consensus to make a final definitive diagnosis and treatment-decision for every patient. After the initial diagnostic report and standard diagnosis, the participants and their guardians were asked to complete a questionnaire regarding their satisfaction with diagnostic accuracy and efficiency.

Outcomes

The primary outcome was the accuracy of the diagnosis (normal lens versus cataract). Because there is no available international classification system for pediatric cataracts, the reference standard for the evaluation of pediatric cataracts is the diagnosis from the cataract experts. The investigators compared the diagnostic accuracy of CC-Cruiser to that of the senior consultants using the gold standard diagnoses from the cataract experts. The secondary outcomes included the evaluation of the disease severity, the time required for making the diagnosis, and patient satisfaction. The disease severity was comprehensively evaluated with the opacity area (extensive versus limited), density (dense versus non-dense), location (central versus peripheral), and treatment recommendations (surgery versus follow-up). The opacity area was defined as extensive when the opacity covered more than 50% of the pupil; otherwise, it was defined as limited. The opacity density was defined as dense when the opacity fully disrupted fundus imaging; otherwise, it was defined as non-dense. The opacity location was defined as central when the opacity fully covered the visual axis area; otherwise, it was defined as peripheral. Because the diagnosis was based on the slit-lamp image of the ocular anterior segment, the time required for diagnosis was calculated from the beginning of image acquisition to the completion of initial diagnostic reports and treatment recommendations by CC-Cruiser or senior consultants. The level of patient satisfaction was evaluated and analyzed via a seven-item questionnaire. A score of 1 indicated disagree; 2 indicated neutral; 3 indicated agree; and 4 indicated strongly agree. Both the number and percentage of participants who responded to each item were documented, and the mean rating for each item was calculated.
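The three opacity grading definitions above can be written as explicit rules. The following is a minimal sketch. The class `OpacityFindings`, its field names, and the function `grade_opacity` are hypothetical and chosen for illustration; they do not describe how CC-Cruiser or the expert panel derived these inputs.

```python
from dataclasses import dataclass


@dataclass
class OpacityFindings:
    # Hypothetical inputs; the paper does not specify how they are measured.
    pupil_coverage_fraction: float  # share of the pupil covered by the opacity, 0.0 to 1.0
    disrupts_fundus_imaging: bool   # True if the opacity fully disrupts fundus imaging
    covers_visual_axis: bool        # True if the opacity fully covers the visual axis area


def grade_opacity(findings):
    """Apply the area, density, and location definitions from the Outcomes section."""
    return {
        # Extensive only when MORE than 50% of the pupil is covered.
        "area": "extensive" if findings.pupil_coverage_fraction > 0.5 else "limited",
        "density": "dense" if findings.disrupts_fundus_imaging else "non-dense",
        "location": "central" if findings.covers_visual_axis else "peripheral",
    }


example = grade_opacity(OpacityFindings(0.5, False, True))
print(example)
```

The boundary case is deliberate: coverage of exactly 50% is graded as limited, because the definition requires more than 50% for an extensive opacity.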

Statistical Analysis

Using the data from a comparative test with CC-Cruiser [25], we calculated that a sample size of at least 700 eyes (assuming a 1:1 allocation ratio, 350 eyes in each group) was required to compare diagnostic accuracy between CC-Cruiser and senior consultants based on the expected accuracy of 90% in the AI arm and 95% in the senior consultants arm, an 80% statistical power, and a 5% statistical significance level [27], [28], [29], [30]. The study analyses followed a comprehensive, prespecified statistical analysis plan. Demographic and clinical data were recorded at baseline. Baseline demographics and disease characteristics were statistically analyzed to confirm that all 350 participants (700 eyes) were well randomized into the two study groups. The intention-to-treat population was identical to the per-protocol population in this trial because no patients discontinued or withdrew after recruitment. Our primary analysis therefore included all patients as originally allocated after randomization. The analysis of diagnostic accuracy was at eye level, and bilateral eyes in the same person were separately analyzed in the same group. We calculated the sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV) of CC-Cruiser and the senior consultants with reference to the gold standard (the cataract experts). The correct diagnosis of cataract was further analyzed with comprehensive evaluations of disease severity and treatment recommendations. The generalized estimating equations (GEE) method, an extension of the quasi-likelihood approach, is being increasingly used to analyze longitudinal and other correlated data, especially when they are binary or in the form of counts [31]. Because the two eyes of one participant constitute correlated data, we used GEE to identify significant differences in the accuracy, true positive fraction (TPF), and false positive fraction (FPF) between CC-Cruiser and the senior consultants. The TPF is equivalent to sensitivity, and the FPF is equivalent to 1-specificity. The time required by CC-Cruiser and the senior consultants was assessed by the Mann–Whitney U test. Patient satisfaction with the medical service was also calculated as the mean rating with standard deviation. The Mann–Whitney U test was performed to identify significant differences in the responses to each question between the two groups. The criterion for significance was set at α = 0.05. For all models, the results are expressed as an estimate of the effect size as an odds ratio (OR) with 95% CIs and p-values. All statistical analyses were performed with SPSS (version 20; SPSS, Inc., Chicago, IL, USA).
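For readers who want to see how the stated design inputs (90% versus 95% expected accuracy, 80% power, 5% significance level) feed a sample size calculation, here is a minimal sketch using a common normal-approximation formula for comparing two independent proportions. The paper does not state which formula or software produced the figure of 350 eyes per group. The choice of formula, the function name `eyes_per_arm`, and the `two_sided` switch are therefore assumptions. The sketch also ignores the correlation between the two eyes of one participant, so its output need not match the published figure.

```python
from math import ceil
from statistics import NormalDist


def eyes_per_arm(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per arm for comparing two independent proportions.

    Uses the unpooled-variance normal approximation:
    n = (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Expected accuracies from the Statistical Analysis section.
n_two_sided = eyes_per_arm(0.90, 0.95)
n_one_sided = eyes_per_arm(0.90, 0.95, two_sided=False)
print(n_two_sided, n_one_sided)
```

The result is sensitive to the choice between a one-sided and a two-sided test and to the variance formula, which is why a reported sample size is best read together with the method used to derive it.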

Results

Between August 9, 2017, and May 25, 2018, 353 patients were screened for eligibility (Fig. 1). After screening, three very young children were excluded because they could not take chloral hydrate and undergo slit-lamp photography. The remaining 350 participants (700 eyes) were randomly assigned to either the AI group (350 eyes) or the senior consultant group (350 eyes). No participant withdrew from the study after randomization. All 350 participants (700 eyes) were included in the analysis. The baseline demographics and disease characteristics, including sex, age, family history, eye symptoms, patients with cataracts, eyes with cataracts and severity of cataract, were comparable between the two groups (Table 1).

Fig. 1

Trial profile.

AI = artificial intelligence.

Table 1

Baseline demographics and disease characteristics.

	AI group (N = 175 P/350 E)	SC group (N = 175 P/350 E)	p-Value
Sex
Male	77 (44.0%)	82 (46.9%)	p = 0.591a
Female	98 (56.0%)	93 (53.1%)	p = 0.591a

Age (years)	6.58 (0.45)	5.89 (0.45)	p = 0.124b
Family history of cataracts
No	165 (94.3%)	163 (93.1%)	p = 0.660a
Yes	10 (5.7%)	12 (6.9%)	p = 0.660a
Eye symptoms
No	96 (54.9%)	93 (53.1%)	p = 0.748a
Yes	79 (45.1%)	82 (46.9%)	p = 0.748a
Patients with cataracts
Normal	106 (60.6%)	100 (57.1%)
Monocular cataracts	31 (17.7%)	28 (16.0%)	p = 0.527a
Bilateral cataract	38 (21.7%)	47 (26.9%)
Eyes with cataracts
No	243 (69.4%)	228 (65.1%)	p = 0.342c
Yes	107 (30.6%)	122 (34.9%)	p = 0.342c
Opacity area
Extensive	71 (66.4%)	85 (69.7%)	p = 0.764c
Limited	36 (33.6%)	37 (30.3%)	p = 0.764c
Density
Dense	69 (64.5%)	77 (63.1%)	p = 0.696c
Non-dense	38 (35.5%)	45 (36.9%)	p = 0.696c
Location
Central	77 (72.0%)	83 (68.0%)	p = 0.776c
Peripheral	30 (28.0%)	39 (32.0%)	p = 0.776c
Treatment recommendations
Surgery	63 (58.9%)	76 (62.3%)	p = 0.575c
Follow-up	44 (41.1%)	46 (37.7%)	p = 0.575c

Data are presented as the number n (%) or mean (standard deviations). Percentages do not add up to 100% in some cases because of rounding. The χ2 test was performed to compare the characteristics of sex, family history of cataracts, eye symptoms and patients with cataracts between the AI group and the senior consultant group. An independent samples t-test was performed to compare age between the two groups. The generalized estimating equation was performed to compare the eyes with cataracts, disease severity, and treatment recommendations. None of the baseline characteristics differed significantly at the 0.05 level between groups. AI = artificial intelligence. SC = senior consultant. P = participants. E = eyes.

a: χ2 test.

b: t-Test.

c: Generalized estimating equation.

With reference to the cataract experts' standards, the sensitivity, specificity, accuracy, PPV, and NPV of the diagnoses (normal lens versus cataract) were 89.7%, 86.4%, 87.4%, 74.4%, and 95.0%, respectively, for CC-Cruiser, compared to 98.4%, 99.6%, 99.1%, 99.2%, and 99.1%, respectively, for the senior consultants (Table 2). The diagnostic accuracy and TPF for childhood cataracts for CC-Cruiser were significantly lower (p < 0.001, OR = 0.06 [95% CI 0.02 to 0.19]; and p = 0.012, OR = 0.14 [95% CI 0.03 to 0.65], respectively) and the FPF for CC-Cruiser was significantly higher than those for the senior consultants (p < 0.001, OR = 43.05 [95% CI 5.42 to 341.70]) (Table 2). CC-Cruiser was significantly less accurate in diagnosing cataracts than senior consultants. The percentages of correct comprehensive evaluations of lens opacity including the opacity area, density, and location were 90.6%, 80.2%, and 77.1%, respectively, in the AI group, compared to 93.3%, 85.0%, and 87.5%, respectively, in the senior consultants group (Table 3). 
Compared to senior consultants, CC-Cruiser exhibited no statistical difference when evaluating the opacity area, density, and opacity location (p = 0.463, 0.286, and 0.130, respectively) (Table 3). The treatment recommendations (surgery versus follow-up) provided by CC-Cruiser were significantly less accurate than those provided by the senior consultants (70.8% vs. 96.7%, p < 0.001, OR = 0.08 [95% CI 0.03 to 0.25], Table 3).
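As a rough cross-check of the reported odds ratios for accuracy, an unadjusted odds ratio with a Wald confidence interval can be computed from a 2×2 table. The sketch below is not the GEE model used in the paper: it ignores the correlation between the two eyes of one participant, so its intervals can differ from the published ones. The eye counts are not reported directly. They are reconstructed here, as an assumption, from the reported accuracies and group sizes (306 of 350 and 347 of 350 eyes for diagnosis; 68 of 96 and 116 of 120 eyes for treatment). The function name `odds_ratio_wald` is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_wald(a_yes, a_no, b_yes, b_no, level=0.95):
    """Unadjusted odds ratio (group A versus group B) with a Wald confidence interval."""
    odds_ratio = (a_yes / a_no) / (b_yes / b_no)
    se_log_or = sqrt(1 / a_yes + 1 / a_no + 1 / b_yes + 1 / b_no)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Reconstructed counts (assumption): correct versus incorrect eyes in each group.
diagnosis_or, diagnosis_lo, diagnosis_hi = odds_ratio_wald(306, 44, 347, 3)
treatment_or, treatment_lo, treatment_hi = odds_ratio_wald(68, 28, 116, 4)
print(round(diagnosis_or, 2), round(treatment_or, 2))
```

An odds ratio well below 1 means the odds of a correct result were lower for CC-Cruiser than for the senior consultants, which is the direction reported in the text.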

Table 2

Diagnostic performance regarding childhood cataract.

	Sensitivity	Specificity	Accuracy	Positive predictive value	Negative predictive value	Accuracy difference (p-value, OR [95% CI])	TPF difference (p-value, OR [95% CI])	FPF difference (p-value, OR [95% CI])
CC-Cruiser	89.7%	86.4%	87.4%	74.4%	95.0%	− 11.7 (p < 0.001, OR = 0.06 [95% CI 0.02 to 0.19])	− 8.7 (p = 0.012, OR = 0.14 [95% CI 0.03 to 0.65])	13.2 (p < 0.001, OR = 43.05 [95% CI 5.42 to 341.70])
Senior consultants	98.4%	99.6%	99.1%	99.2%	99.1%

Eyes were the units of analysis (N = 700). There were 350 eyes in the CC-Cruiser group and 350 eyes in the senior consultant group. OR = odds ratio. CI = confidence interval. TPF = true positive fraction. TPF is equivalent to sensitivity. FPF = false positive fraction. FPF is equivalent to 1-specificity. We performed a diagnostic accuracy analysis with reference to the cataract specialists' standards. The TPF and FPF of diagnosis (normal lens versus cataract) were 89.7% and 13.6%, respectively, for CC-Cruiser and 98.4% and 0.4%, respectively, for the senior consultants. The generalized estimating equation (GEE) was performed to identify significant differences in accuracy, TPF, and FPF between CC-Cruiser and the senior consultants. GEE results (adjusted) and logistic regression results (unadjusted) for cataract diagnosis are presented in Supplementary table 1 to show the impact of clustering at the participant level.
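The five measures in Table 2 all follow from a single 2×2 confusion matrix per group. The sketch below shows the arithmetic. The cell counts are not reported in the table. They are reconstructed here, as an assumption, from the reported percentages and the Table 1 denominators (107 cataract and 243 normal eyes in the AI group; 122 cataract and 228 normal eyes in the senior consultant group). The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Return the standard diagnostic measures, as percentages, from a 2x2 table."""
    return {
        "sensitivity": 100 * tp / (tp + fn),  # true positive fraction (TPF)
        "specificity": 100 * tn / (tn + fp),  # 1 - false positive fraction (FPF)
        "accuracy": 100 * (tp + tn) / (tp + fn + fp + tn),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
    }


# Reconstructed counts (assumption), with the expert panel as the reference standard.
cc_cruiser = diagnostic_metrics(tp=96, fn=11, fp=33, tn=210)
senior_consultants = diagnostic_metrics(tp=120, fn=2, fp=1, tn=227)

for name, metrics in [("CC-Cruiser", cc_cruiser), ("Senior consultants", senior_consultants)]:
    print(name, {key: round(value, 1) for key, value in metrics.items()})
```

With these counts, the 33 false positives against 96 true positives in the AI group explain why the PPV is much lower than the sensitivity or specificity.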

Table 3

Comprehensive evaluations of childhood cataract and treatment recommendations.

	Sensitivity	Specificity	Accuracy	Accuracy difference (p-value, OR [95% CI])	TPF difference (p-value, OR [95% CI])	FPF difference (p-value, OR [95% CI])
Opacity area
CC-Cruiser	91.3%	88.9%	90.6%	− 2.7 (p = 0.460, OR = 0.66 [95% CI 0.22 to 1.98])	− 2.8 (p = 0.564, OR = 0.68 [95% CI 0.18 to 2.56])	2.5 (p = 0.439, OR = 2.11 [95% CI 0.32 to 14.05])
Senior consultants	94.1%	91.4%	93.3%
Density
CC-Cruiser	85.3%	67.9%	80.2%	− 4.8 (p = 0.286, OR = 0.64 [95% CI 0.28 to 1.45])	3.5 (p = 0.867, OR = 1.09 [95% CI 0.40 to 2.97])	22.8 (p = 0.042, OR = 4.24 [95% CI 1.05 to 17.13])
Senior consultants	81.8%	90.7%	85.0%
Location
CC-Cruiser	84.2%	50.0%	77.1%	− 10.4 (p = 0.130, OR = 0.52 [95% CI 0.22 to 1.21])	− 7.4 (p = 0.351, OR = 0.59 [95% CI 0.20 to 1.78])	28.4 (p = 0.134, OR = 2.91 [95% CI 0.72 to 11.71])
Senior consultants	91.6%	78.4%	87.5%
Treatment
CC-Cruiser	86.7%	44.4%	70.8%	− 25.9 (p < 0.001, OR = 0.08 [95% CI 0.03 to 0.25])	− 8.0 (p = 0.247, OR = 0.44 [95% CI 0.11 to 1.77])	55.6 (p-value and OR not calculable)
Senior consultants	94.7%	100.0%	96.7%

Eyes were the units of analysis. A total of 216 eyes (correctly diagnosed as cataracts in both groups, 96 eyes in the CC-Cruiser group and 120 eyes in the senior consultant group) were further analyzed by comprehensive evaluation of lens opacity, including the opacity area (extensive versus limited), density (dense versus non-dense), and location (central versus peripheral), and the recommended treatment (surgery versus follow-up) with reference to the cataract specialists' standards. OR = odds ratio. CI = confidence interval. TPF = true positive fraction. TPF is equivalent to sensitivity. FPF = false positive fraction. FPF is equivalent to 1-specificity. The generalized estimating equation was performed to identify significant differences in the accuracy, TPF, and FPF of the opacity area, density, and location and the treatment recommendations between CC-Cruiser and the senior consultants. The p-value and OR of the difference in FPF of treatment between the two groups could not be calculated because of the 100% specificity for senior consultants. GEE results (adjusted) and logistic regression results (unadjusted) for the evaluation of cataract and treatment are presented in Supplementary table 1 to show the impact of clustering at the participant level.

The time required for CC-Cruiser to make a diagnosis and treatment recommendation was less than that required for the senior consultants (2.79 min vs. 8.53 min, p < 0.001, mean difference 5.74 [95% CI 5.43 to 6.05], Table 4).

Table 4

Time required for the diagnostic process of CC-Cruiser and senior consultants.

Group	Mean time (minutes)	Standard deviation	95% CI lower	95% CI upper	Mean difference (p-value, 95% CI)
CC-Cruiser	2.79	1.11	2.64	2.96	5.74 (p < 0.001, 95% CI 5.43 to 6.05)
Senior consultants	8.53	1.75	8.27	8.78	5.74 (p < 0.001, 95% CI 5.43 to 6.05)

Three hundred and fifty patients were included in the analysis (175 participants in the CC-Cruiser group and 175 participants in the senior consultant group). The Mann–Whitney U test was performed to compare the time required. Significant differences in time required were observed between the CC-Cruiser and senior consultant groups (p < 0.001). CI = confidence interval.
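
The reported 95% CI for the mean difference can be approximately reproduced from the summary statistics in Table 4. The following is a minimal sketch, assuming a normal approximation with unpooled variances and 175 participants per group. The function name `mean_difference_ci` is illustrative, and this is not necessarily the procedure used in the trial analysis.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Return (difference, lower, upper) for mean_b - mean_a.

    Assumption: normal approximation with unpooled variances. This is an
    illustrative choice, not a method stated in the source.
    """
    diff = mean_b - mean_a
    # Standard error of the difference between two independent means.
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Summary statistics from Table 4 (minutes); 175 participants per group.
diff, lower, upper = mean_difference_ci(2.79, 1.11, 175, 8.53, 1.75, 175)
print(f"{diff:.2f} ({lower:.2f} to {upper:.2f})")
```

With the Table 4 values this prints `5.74 (5.43 to 6.05)`, which agrees with the reported interval.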

At the end of the study, 345 participants completed the evaluation questionnaire (172 in the CC-Cruiser group and 173 in the senior consultant group). The guardians of five participants did not complete the questionnaire for personal reasons. The responses to each statement are summarized in Table 5. The questionnaire response rates were 98.3% for the AI group and 98.9% for the senior consultant group. The patients had high satisfaction levels regarding the medical services provided by CC-Cruiser, especially for the time required for diagnosis. The mean rating for overall satisfaction with CC-Cruiser was 3.47 ± 0.501, which was higher than that for the senior consultants (3.31 ± 0.543, p = 0.007, Table 5), suggesting that patients preferred the medical AI service to that of human doctors.

Table 5

Questionnaire provided to the participants with their responses to the clinical service.

Question	Response in the AI group (N = 172)				Mean rating (SD)	Response in the SC group (N = 173)				Mean rating (SD)	p-Value
Question	1	2	3	4	Mean rating (SD)	1	2	3	4	Mean rating (SD)	p-Value
The initial diagnosis of the eye clinic was credible.	5.2% (9)	15.9% (27)	32.0% (55)	47.1% (81)	3.21 (0.893)	0	3.5% (6)	58.4% (101)	38.2% (66)	3.35 (0.546)	p = 0.679
The initial therapeutic decision of the eye clinic was credible.	4.7% (8)	18.0% (31)	28.5% (49)	48.8% (84)	3.22 (0.902)	0	5.8% (10)	57.0% (98)	37.6% (65)	3.32 (0.578)	p = 0.972
The initial diagnosis of the eye clinic was consistent with that of the experts.	3.5% (6)	18.6% (32)	21.5% (37)	56.4% (97)	3.31 (0.896)	0	2.9% (5)	37.6% (65)	59.5% (103)	3.57 (0.552)	p = 0.053
The initial therapeutic decision of the eye clinic was consistent with that of the experts.	3.5% (6)	23.3% (40)	17.4% (30)	55.8% (96)	3.26 (0.918)	0	4.0% (7)	37.6% (65)	58.4% (101)	3.54 (0.575)	p = 0.042
I was satisfied with the time required to wait for CC-Cruiser/senior consultants in this eye clinic.	0	0.6% (1)	41.9% (72)	57.6% (99)	3.57 (0.508)	0	5.2% (9)	50.9% (88)	43.9% (76)	3.39 (0.586)	p = 0.005
I was satisfied with the time required to make the diagnosis and provide treatment recommendations by CC-Cruiser/senior consultants.	0	0	43.6% (75)	56.4% (97)	3.56 (0.497)	0	0.5% (1)	59.5% (103)	39.9% (69)	3.38 (0.554)	p = 0.002
Overall, I was satisfied with this medical service provided in this eye clinic.	0	0	52.9% (91)	47.1% (81)	3.47 (0.501)	0	4.0% (7)	61.3% (106)	34.7% (60)	3.31 (0.543)	p = 0.007

Data are presented as the number (%) or mean (standard deviation). Percentages do not add up to 100% in some cases because of rounding. Three hundred and forty-five patients were included in the analysis (172 in the CC-Cruiser group and 173 in the senior consultant group). Pediatric participants and at least one of their guardians were asked to complete the questionnaire together. Five participants' guardians were unwilling to complete the questionnaires because of personal reasons. The survey questions used a 4-point scale (1, disagree; 2, neutral; 3, agree; and 4, strongly agree). The Mann–Whitney U test was performed to identify significant differences in responses to each question between the two groups. AI = artificial intelligence. SC = senior consultant. SD = standard deviation.
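
The mean ratings and standard deviations in Table 5 can be recomputed from the response counts. The sketch below assumes the sample standard deviation (divisor n − 1) was reported; the helper name `likert_mean_sd` is hypothetical, and the counts are taken from the last row of Table 5 (overall satisfaction).

```python
from math import sqrt


def likert_mean_sd(counts):
    """Mean and sample SD (n - 1 divisor) from a {score: count} mapping."""
    n = sum(counts.values())
    mean = sum(score * k for score, k in counts.items()) / n
    # Sum of squared deviations, weighted by how often each score occurs.
    ss = sum(k * (score - mean) ** 2 for score, k in counts.items())
    return mean, sqrt(ss / (n - 1))


# Overall satisfaction counts from Table 5 (scores 1 to 4).
ai_overall = {1: 0, 2: 0, 3: 91, 4: 81}   # AI group, N = 172
sc_overall = {1: 0, 2: 7, 3: 106, 4: 60}  # SC group, N = 173

ai_mean, ai_sd = likert_mean_sd(ai_overall)
sc_mean, sc_sd = likert_mean_sd(sc_overall)
print(f"AI: {ai_mean:.2f} ({ai_sd:.3f})")
print(f"SC: {sc_mean:.2f} ({sc_sd:.3f})")
```

Under this assumption the sketch gives 3.47 (0.501) for the AI group and 3.31 (0.543) for the SC group, in line with the last row of Table 5.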


Discussion

In this study, we showed that CC-Cruiser was less accurate than the senior consultants in diagnosing childhood cataracts and making treatment decisions in clinical practice. However, compared with the senior consultants, CC-Cruiser required less time for diagnosis and achieved a high level of patient satisfaction. These results highlight the clinical importance of diagnostic randomized controlled trials for evaluating the real-world performance of CC-Cruiser before regular use in outpatient settings.

The real-world diagnostic accuracy of CC-Cruiser was lower than that reported in our previous study conducted with screening datasets [25]. Although CC-Cruiser was highly accurate in evaluating 306 standard images of the ocular anterior segment, it misdiagnosed and inaccurately evaluated cataracts more often than the senior consultants for 43 images of poor quality in this clinical trial, which could be attributed to several reasons. First, some pediatric patients could not cooperate sufficiently and fix their eyes on the camera because of photophobia or lack of attention, so the slit lamp could not be focused properly on the lens. Second, the eyelids and eyelashes could obscure the lens, compromising the quality of the captured images. Third, if the reflective point was focused near the visual axis, the features at the reflective point on the lens could not be accurately extracted, leading to a misdiagnosis of cataract and a higher false positive fraction for CC-Cruiser. Fourth, the strong illumination of the slit lamp may produce artefactual lens opacities, which was another reason for the higher false positive fraction for CC-Cruiser. However, these problems could usually be identified by the senior consultants, who could adjust the focus point manually and evaluate the opacity from different sites or angles of the lens.
The higher false positive fraction may increase the burden and cost of medical resources and may result in physical or mental harm to patients. In addition, although diagnosis by CC-Cruiser at the current stage may still need input from clinicians (including the use of sedative drugs) to ensure the quality of image capture, we believe that further improvement in the autofocus technology of medical AI will yield higher diagnostic accuracy with less need for human input. For example, improved recognition of the reflective point on the lens could substantially reduce the false positive rate.

Previous studies indicated that AI-facilitated diagnosis can alleviate doctors' workload and contribute to the provision of high-quality medical care to patients in need [3], [12]. Here, we showed that in clinical application, the medical AI platform was superior to human doctors in terms of shortened diagnostic time. Consistently, the participants in the CC-Cruiser group felt that they received a faster diagnosis, and the waiting time required for the outpatient visit was significantly reduced. AI has been reported to have the potential to reduce costs in health care economies [32]. With the widespread application of AI technology in health care, its economic cost is also expected to be lower than that of human doctors, since only the costs of developing and operating the system are incurred. Therefore, AI technology is a promising modality for providing high-quality health services to large populations in a time- and cost-effective manner [13].

Patients' satisfaction with medical AI has not been fully studied. Laure et al. assessed patient satisfaction with rheumatoid arthritis (RA) care using Sanoia, an e-health website [33].
The authors showed discordance between patient satisfaction and access to the AI platform, primarily because RA is a chronic disease, and patients may lose interest in using Sanoia and become less dedicated to regular disease self-management when the disease is in remission [33]. However, childhood cataract can be vision-threatening if early diagnosis and appropriate management are not provided [24]. Therefore, parents of pediatric patients are eager to access a highly efficient medical service for diagnosis and treatment decisions. Our study showed that overall patient satisfaction with CC-Cruiser was slightly higher than that with the senior consultants, indicating that patients had a good experience with the AI medical service. This satisfaction may be due to patients' curiosity about or interest in medical AI, or to the fact that patients need a balance between diagnostic accuracy and diagnostic time and are more willing to receive a medical service that is less time-consuming and has acceptable diagnostic accuracy. Our results suggest that CC-Cruiser achieves satisfaction ratings at least comparable to those of human doctors. Therefore, CC-Cruiser, at its current stage, has shown potential to assist human doctors in clinical applications. In future studies, we will focus on improving the accuracy of CC-Cruiser to increase patient satisfaction.

The strengths of the study include its randomized, controlled design, a large sample size, and data collection from five eye clinics across China. However, our trial has several limitations. First, as patients without symptoms such as blurred vision were less willing to participate in the study, we may have missed some patients with slightly opaque lenses. Therefore, assessment of early-stage cataract by CC-Cruiser needs further improvement. Second, CC-Cruiser provided treatment suggestions without considering the patients' general conditions.
Therefore, a small proportion (six cases) of the treatment recommendations provided by CC-Cruiser were not consistent with those made by the experts, even though lens opacity had been accurately evaluated. Further improvement of the capacity for treatment determination will require consideration of non-ophthalmic factors, such as age and health status [34]. Third, our AI system relies on computing power and internet access, so widespread application of CC-Cruiser may be difficult in developing areas without stable internet. However, remote locations with internet access can still benefit from the medical service provided by CC-Cruiser. Fourth, this trial was effectively a cluster randomized controlled trial (clustered at the level of the pediatric patient), because randomization was performed at the level of patients whereas observation and analysis were performed at the level of eyes. However, when the sample size was calculated, the intra-cluster correlation between the two eyes of one child was not accounted for, and an individually randomized controlled trial design was assumed. As a result, the statistical power is likely lower than the anticipated 0.8, since cluster randomized controlled trials require a larger sample size than individually randomized trials to achieve the same statistical power.

In conclusion, this is the first clinical randomized controlled trial to validate the diagnostic accuracy and efficiency of an AI system in eye clinics. This represents the first clinical trial of this kind to robustly evaluate the clinical application of medical AI. CC-Cruiser was less accurate than senior human consultants in diagnosing childhood cataracts and making treatment decisions, but has the capacity to assist human doctors in clinical practice in its current state. Further efforts will be required to perform clinical controlled trials that appropriately evaluate the real-world diagnostic performance of medical AI.
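
The fourth limitation above can be illustrated with the standard design effect for equal-sized clusters, 1 + (m − 1) × ICC, where m is the cluster size (two eyes per child here). The following is a minimal sketch: the intra-cluster correlation is not reported in this section, so the ICC values below are hypothetical, and the function names are illustrative.

```python
def design_effect(cluster_size, icc):
    """Standard design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_units, cluster_size, icc):
    """Number of independent units that n_units clustered units are worth."""
    return n_units / design_effect(cluster_size, icc)


# 700 eyes analysed, two eyes per child. The ICC values are hypothetical,
# chosen only to show the range between the two extremes.
for icc in (0.0, 0.5, 1.0):
    n_eff = effective_sample_size(700, 2, icc)
    print(f"ICC = {icc:.1f}: effective sample size = {n_eff:.0f}")
```

At the extremes, an ICC of 0 leaves all 700 eyes as independent observations, whereas an ICC of 1 reduces the effective sample size to 350, the number of children.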
The following are the supplementary data related to this article.

Contributors

HTL and YZL contributed to the concept of the study and critically reviewed the manuscript. YZL, HTL, RYL, ZZL, HC, YHY, and JJC designed the study and did the literature search. ZLL, WYL, EPL, XHW, DRL, DXW, TYY, QZC, XYL, JL, WTL, JHW, MMY, HLH, LZ, YY, XLC, JMH, KZ, XML, XYZ, LXL, YHL, XLL, BC, DYZ, MXW, and WRC contributed to the data collection. JLH contributed to the design of statistical analysis plan. RYL, YHY, HTL, and JLH did the data analysis and data interpretation. HTL and RYL drafted the manuscript. HTL, RYL, ZZL, HC, YHY, YZ, and CC critically revised the manuscript. HTL provided research funding, coordinated the research and oversaw the project. All the authors reviewed the manuscript for important intellectual content and approved the final manuscript.

Role of the Funding Source

The source of funding played no role in the design of the study protocol, data collection, data analysis, data interpretation, writing of the report, or the decision to submit the manuscript for publication. The corresponding author had full access to all data in the study and assumes final responsibility for the decision to submit the manuscript for publication. All authors approved the decision to submit.

Declaration of Interests

The authors declare no competing financial interests.
