Aaron Y Lee, Ryan T Yanagihara, Cecilia S Lee, Marian Blazes, Hoon C Jung, Yewlin E Chee, Michael D Gencarella, Harry Gee, April Y Maa, Glenn C Cockerham, Mary Lynch, Edward J Boyko.
Abstract
OBJECTIVE: With rising global prevalence of diabetic retinopathy (DR), automated DR screening is needed for primary care settings. Two automated artificial intelligence (AI)-based DR screening algorithms have U.S. Food and Drug Administration (FDA) approval. Several others are under consideration while in clinical use in other countries, but their real-world performance has not been evaluated systematically. We compared the performance of seven automated AI-based DR screening algorithms (including one FDA-approved algorithm) against human graders when analyzing real-world retinal imaging data.

RESEARCH DESIGN AND METHODS: This was a multicenter, noninterventional device validation study evaluating a total of 311,604 retinal images from 23,724 veterans who presented for teleretinal DR screening at the Veterans Affairs (VA) Puget Sound Health Care System (HCS) or Atlanta VA HCS from 2006 to 2018. Five companies provided seven algorithms, including one with FDA approval, that independently analyzed all scans, regardless of image quality. The sensitivity/specificity of each algorithm when classifying images as referable DR or not were compared with original VA teleretinal grades and a regraded arbitrated data set. Value per encounter was estimated.
Year: 2021 PMID: 33402366 PMCID: PMC8132324 DOI: 10.2337/dc20-1877
Source DB: PubMed Journal: Diabetes Care ISSN: 0149-5992 Impact factor: 19.112
Demographic factors and baseline clinical characteristics of the study population
| | Seattle | Atlanta | Total |
|---|---|---|---|
| Patients, n | 13,439 | 10,285 | 23,724 |
| Age (years) | | | |
| Mean (SD) | 62.20 (10.91) | 63.46 (10.14) | 62.75 (10.60) |
| Range | 21–97 | 24–98 | 21–98 |
| Male sex | 12,724 (94.68) | 9,795 (95.24) | 22,519 (94.92) |
| Race | | | |
| White | 9,482 (70.56) | 4,678 (45.48) | 14,160 (59.69) |
| African American | 1,642 (12.22) | 5,085 (49.44) | 6,727 (28.35) |
| Asian | 383 (2.85) | 34 (0.33) | 417 (1.76) |
| Other | 605 (4.50) | 90 (0.88) | 695 (2.93) |
| Unknown | 1,327 (9.87) | 398 (3.87) | 1,725 (7.27) |
| Encounters, n | 21,797 | 13,104 | 34,901 |
| Retinopathy grade | | | |
| No DR | 15,270 (70.05) | 11,166 (85.21) | 26,436 (75.75) |
| Mild NPDR | 2,364 (10.85) | 957 (7.31) | 3,321 (9.51) |
| Moderate NPDR | 494 (2.27) | 311 (2.37) | 805 (2.31) |
| Severe NPDR | 110 (0.50) | 153 (1.17) | 263 (0.75) |
| PDR | 22 (0.10) | 193 (1.47) | 215 (0.62) |
| Ungradable | 3,537 (16.23) | 324 (2.47) | 3,861 (11.06) |
| Images, n | 199,142 | 112,462 | 311,604 |
Data are n (%) unless otherwise indicated.
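As a reading aid, the retinopathy-grade rows can be checked against the encounter totals. The short Python sketch below uses only the counts from the Total column of the table above. The names `grade_counts`, `percent_of_encounters`, and `check_table` are hypothetical helpers introduced here and are not part of the study. It assumes the percentages are expressed relative to the number of encounters, which is consistent with the table but not stated explicitly in it.

```python
# Counts copied from the "Total" column of the table above.
ENCOUNTERS_TOTAL = 34_901

grade_counts = {
    "No DR": 26_436,
    "Mild NPDR": 3_321,
    "Moderate NPDR": 805,
    "Severe NPDR": 263,
    "PDR": 215,
    "Ungradable": 3_861,
}

# Percentages as printed in the table, kept for side-by-side comparison.
reported_percent = {
    "No DR": 75.75,
    "Mild NPDR": 9.51,
    "Moderate NPDR": 2.31,
    "Severe NPDR": 0.75,
    "PDR": 0.62,
    "Ungradable": 11.06,
}


def percent_of_encounters(count: int, total: int = ENCOUNTERS_TOTAL) -> float:
    """Return count as a percentage of total encounters (assumed denominator)."""
    return 100.0 * count / total


def check_table() -> bool:
    """Return True if the grade counts add up to the encounter total."""
    return sum(grade_counts.values()) == ENCOUNTERS_TOTAL


if __name__ == "__main__":
    print("Grade counts sum to encounter total:", check_table())
    for grade, count in grade_counts.items():
        pct = percent_of_encounters(count)
        print(f"{grade}: {pct:.2f}% (table: {reported_percent[grade]:.2f}%)")
```

Recomputed values may differ from the printed ones in the last decimal place because of rounding.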
Figure 1. The relative screening performance of AI algorithms. Using the full-image data set (A), the sensitivity, specificity, NPV, and PPV of each algorithm are shown using the original teleretinal grader as the reference standard. These analyses were repeated separately using color fundus photographs obtained from Atlanta (B) and Seattle (C).
Figure 2. Relative performance of human grader compared with AI algorithms. The relative performance of the VA teleretinal grader (Human) and algorithms A–G in screening for referable DR using the arbitrated data set at different thresholds of DR. A: Sensitivity and specificity of each algorithm compared with a human grader with 95% CI bars against a subset of double-masked arbitrated grades in screening for referable DR in images with mild NPDR or worse and ungradable image quality. B–D: Only gradable images were used. The VA teleretinal grader is compared with the AI sensitivities, with 95% CIs, at different thresholds of disease, including moderate NPDR or worse (B), severe NPDR or worse (C), and PDR (D). *P ≤ 0.05, **P ≤ 0.001, ***P ≤ 0.0001.
Figure 3. Value per encounter of AI algorithms meeting the sensitivity threshold. The value per encounter with 95% CI bars of algorithms E, F, and G. Only algorithms that achieved equivalent sensitivity to the VA teleretinal graders in screening for referable DR in images regraded as moderate NPDR or worse in the arbitrated data set were carried forward. The value per encounter of each algorithm if optometrists (Optom) or ophthalmologists (Ophth) were to implement this system into their clinical practice to make a normal profit on the basis of geographical location or the combined data set is shown. ATL, Atlanta; SEA, Seattle; TOT, total (Atlanta and Seattle).
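For readers less familiar with the four measures reported in Figure 1 (sensitivity, specificity, PPV, and NPV), the following minimal Python sketch shows how they follow from a binary comparison against a reference standard. It is an illustration of the standard definitions only, not the study's analysis code. The function `screening_metrics`, the `ScreeningMetrics` class, and the toy labels are hypothetical, and the example data are invented for demonstration rather than drawn from the study.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningMetrics:
    sensitivity: float  # TP / (TP + FN)
    specificity: float  # TN / (TN + FP)
    ppv: float          # TP / (TP + FP)
    npv: float          # TN / (TN + FN)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def screening_metrics(reference: Sequence[int], predicted: Sequence[int]) -> ScreeningMetrics:
    """Compare binary algorithm output (1 = referable) with a reference standard."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            tp += 1
        elif ref and not pred:
            fn += 1
        elif not ref and pred:
            fp += 1
        else:
            tn += 1
    return ScreeningMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


if __name__ == "__main__":
    # Invented toy labels: 1 = referable DR, 0 = not referable.
    reference_grades = [1, 1, 1, 0, 0, 0, 0, 0]
    algorithm_output = [1, 1, 0, 1, 0, 0, 0, 0]
    print(screening_metrics(reference_grades, algorithm_output))
```

In the study, the reference standard was either the original VA teleretinal grade or the arbitrated regrade, so the same algorithm output can yield different values depending on which reference is used.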