Reliability of Retinal Pathology Quantification in Age-Related Macular Degeneration: Implications for Clinical Trials and Machine Learning Applications.

Philipp L Müller^1,2,3, Bart Liefers^1,4,5, Tim Treis^6, Filipa Gomes Rodrigues^1,2, Abraham Olvera-Barrios^1,2, Bobby Paul^7, Narendra Dhingra^8, Andrew Lotery^9, Clare Bailey^10, Paul Taylor^11, Clarisa I Sánchez^4,5,12, Adnan Tufail^1,2.

Abstract

Purpose: To investigate the interreader agreement for grading of retinal alterations in age-related macular degeneration (AMD) using a reading center setting.
Methods: In this cross-sectional case series, spectral-domain optical coherence tomography (OCT; Topcon 3D OCT, Tokyo, Japan) scans of 112 eyes of 112 patients with neovascular AMD (56 treatment naive, 56 after three anti-vascular endothelial growth factor injections) were analyzed by four independent readers. Imaging features specific for AMD were annotated using a novel custom-built annotation platform. Dice score, Bland-Altman plots, coefficients of repeatability, coefficients of variation, and intraclass correlation coefficients were assessed.
Results: Loss of ellipsoid zone, pigment epithelium detachment, subretinal fluid, and drusen were the most abundant features in our cohort. Subretinal fluid, intraretinal fluid, hypertransmission, descent of the outer plexiform layer, and pigment epithelium detachment showed highest interreader agreement, while detection and measures of loss of ellipsoid zone and retinal pigment epithelium were more variable. The agreement on the size and location of the respective annotation was more consistent throughout all features.
Conclusions: The interreader agreement depended on the respective OCT-based feature. A selection of reliable features might provide suitable surrogate markers for disease progression and possible treatment effects focusing on different disease stages.
Translational Relevance: This might give opportunities for a more time- and cost-effective patient assessment and improved decision making as well as have implications for clinical trials and training machine learning algorithms.

Year: 2021 PMID: 34003938 PMCID: PMC7938003 DOI: 10.1167/tvst.10.3.4

Source DB: PubMed Journal: Transl Vis Sci Technol ISSN: 2164-2591 Impact factor: 3.283

Introduction

Age-related macular degeneration (AMD) is a leading cause of legal blindness in the industrialized world. Concerning advanced disease manifestations, a dry stage defined by the presence of retinal pigment epithelial (RPE) atrophy (called geographic atrophy [GA]) can be distinguished from or complicated by a neovascular (nAMD) form typically characterized by the presence of choroidal neovascularization (CNV). While both forms of late-stage AMD are associated with the risk of visual loss, an effective treatment for GA development and progression is still pending. However, various therapeutic approaches are being tested in different stages of preclinical and clinical trials. To accelerate clinical testing, meaningful, validated clinical endpoints are needed. Most interventional trials currently rely on the progression of GA, which is an endpoint accepted by regulators. However, the most effective upcoming therapeutic approach might be directed to earlier disease stages. Therefore, ideal surrogate markers should identify early disease-associated alterations before the hitherto unknown point of no return. In contrast to the color fundus photography and fundus autofluorescence-based definition of GA, the Classification of Atrophy Meetings (CAM) group (as an international consensus) recently used optical coherence tomography (OCT) imaging to redefine the phenotypic end stage of AMD as complete RPE and outer retinal atrophy (RORA). They not only included alterations of the outer retina into the definition but also reported preceding OCT features for AMD. Furthermore, a current study dealing with RORA in mitochondriopathies described a consistent sequence of these OCT features in the development of RORA representing different disease stages. Accordingly, they could bear great potential as future clinical surrogate markers. However, the reliability of the detection and quantification of some of these features has not yet been systematically and comprehensively investigated.
Nevertheless, they have already been implemented by reading centers for current and upcoming observational and interventional trials.
Concerning nAMD, therapy with intraocular injection of anti–vascular endothelial growth factor (VEGF) has been shown to be effective and reduces the risk of visual loss. However, the number and cost of the required visits place a significant burden on health care systems, medical personnel, and patients, particularly in light of growing patient numbers due to demographic changes and rising life expectancy. Therefore, personalized interval and treatment strategies (i.e., “treat and extend”) are used more commonly in current clinical settings. In this context, objective and reliable features to determine disease activity are crucial. OCT is typically used for monitoring as it provides cross-sectional images of the retina that allow identifying the presence as well as the extent of these features. Usually, the feature identification is performed manually by human investigators. Machine learning (ML) applications are progressively entering this field, especially in the context of potential deployment of in-home or remote OCT monitoring. However, the “gold standard” by which these algorithms are trained and validated is conventionally human grading. This raises questions concerning the reliability, subjectivity, and bias of the resulting treatment decisions. In this study, we therefore investigate the reliability of the grading of defined OCT features commonly found in the development of RORA and/or in the presence of CNV secondary to AMD in order to provide estimates for human interreader agreement for each of these features. Thereby, we focus on the detection as well as the size and the overlap of the particular annotations.

Methods

This retrospective cross-sectional case series was performed at the Moorfields Eye Hospital NHS Foundation Trust (London, UK). To identify patients with AMD, the OCT images were linked to the diagnosis of the electronic medical records (EMR) database (Medisoft, Leeds, UK) of five centers in the United Kingdom using pseudonymized identifiers. The data pseudonymization was undertaken by the EMR vendor independently before export to the study team. The pseudonymization key that was generated to allow linkage of EMR to OCT data remained with the EMR vendor at the clinical site and was not accessible to the study team, and all patient identifiers were removed. This means that the data received by the study team were effectively fully anonymized on receipt to prevent any possible identification of individual patients or treatment sites by the investigators. The imaging data comprised 6-mm × 6-mm foveal-centered OCT volume scans (128 or 256 scans per volume), resulting in a resolution of either 512 × 128 A-scans or 256 × 256 A-scans. They were obtained by spectral-domain OCT (Topcon, Tokyo, Japan) using standardized scan protocols. Any other additional ocular pathology (including prior clinically significant macular edema), prior unlicensed bevacizumab injections, intraocular surgery within 90 days, or prior macular or panretinal photocoagulation led to exclusion. Thereby, this study included imaging data of 112 eyes of 112 patients with AMD at different disease stages. Half of these eyes were treatment naive, and the others were imaged after three anti-VEGF injections. Active neovascularization was present in 70 eyes. Of the remaining 42 eyes, 12 and 30 were graded as intermediate and late AMD, respectively. There were 60 right and 52 left eyes included. The mean ± SD age was 81.4 ± 8.18 years (range, 51–98 years). The study was in adherence with the Declaration of Helsinki.
The institutional review board ruled that approval was not required for this study, because all data were effectively completely anonymized before being released to our study team to perform this research.

Image Analysis

To assess the reliability of grading retinal alterations in AMD, a single OCT B-scan per eye was randomly selected for annotation (including both foveal and eccentric scans). The other B-scans were available to give additional context if needed. Annotations were performed by four independently trained retinal specialists, each masked to the results of the others, using a custom-built platform (Supplementary Fig. S1). All retinal abnormalities were to be delineated using (1) the definition of features as well as the images (as standard examples) of CAM reports, and (2) unpublished (additional) descriptions of features based on the Classification of Atrophy Meeting from January 2019 in Milan, Italy (the corresponding CAM Report 5 is currently under review). The platform provided default labels for the most common abnormalities (including those described by the CAM group) and allowed the readers to add labels not covered by the default setup (as free text). The latter was used only once by one reader (annotating a single microaneurysm). Depending on the feature, it was annotated as area, lateral extent, or number (i.e., single dots for features with pointwise presentation), and the same annotation type was used by all readers. Preset default labels included drusen, loss of ellipsoid zone (EZ), intraretinal hyperreflective foci (HRF), hypertransmission of OCT signal (HT), hyporeflective wedges, intraretinal fluid (IRF), descent of the outer plexiform layer (OPL), outer retinal tubulations, pigment epithelial detachment (PED), loss of retinal pigment epithelium (RPE), reticular pseudodrusen (RPD), subretinal fluid (SRF), subretinal hyperreflective material (SRHM), and sub-RPE plaques (Supplementary Fig. S1). The annotated images were then evaluated using Python (version 3.8.2).
To obtain the area measures in square millimeters and lateral extent measures in millimeters, the extracted values of annotated features (i.e., in pixels² and pixels) were multiplied by the individual scaling factor depending on the scanning protocol. Further statistical analysis was exclusively made for features present in at least 20 annotated B-scans (respectively, eyes) to ensure reliable results.
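The pixel-to-millimeter conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes one pixel column per A-scan across the 6-mm scan width stated in the Methods, and the axial pixel spacing is a purely hypothetical placeholder because the source does not report it. All function names are illustrative.

```python
def lateral_mm_per_pixel(scan_width_mm: float, n_ascans: int) -> float:
    """Lateral size of one pixel column, assuming one column per A-scan."""
    if n_ascans <= 0:
        raise ValueError("n_ascans must be positive")
    return scan_width_mm / n_ascans


def extent_px_to_mm(extent_px: float, scan_width_mm: float, n_ascans: int) -> float:
    """Convert a lateral extent measured in pixels to millimeters."""
    return extent_px * lateral_mm_per_pixel(scan_width_mm, n_ascans)


def area_px_to_mm2(area_px: float, scan_width_mm: float, n_ascans: int,
                   axial_mm_per_pixel: float) -> float:
    """Convert an annotated area in pixels^2 to mm^2.

    One pixel covers (lateral spacing) x (axial spacing) square millimeters.
    """
    lateral = lateral_mm_per_pixel(scan_width_mm, n_ascans)
    return area_px * lateral * axial_mm_per_pixel


# Hypothetical axial spacing in mm per pixel; NOT taken from the source.
EXAMPLE_AXIAL_MM_PER_PIXEL = 0.0026

# The two scan protocols named in the Methods (6 mm wide, 512 or 256 A-scans).
extent_512 = extent_px_to_mm(256, scan_width_mm=6.0, n_ascans=512)
extent_256 = extent_px_to_mm(256, scan_width_mm=6.0, n_ascans=256)
```

The same pixel count corresponds to twice the lateral distance under the 256 A-scan protocol, which is why a per-protocol scaling factor is needed before readers or eyes can be compared.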

Statistical Analysis

The software environment R (version 4.0.2; The R Foundation for Statistical Computing, Vienna, Austria) was used for interreader correlations. To compare the reliability of feature detection, Fleiss κ coefficients were used. To measure the agreement in the annotated feature size, lateral extent, or number, intraclass correlation coefficients (ICCs, one-way random), 95% coefficients of repeatability, and coefficients of variation (CVs) were determined. To account for the unbalanced number of readings per sample, a linear mixed-effects model was used. Bland–Altman plots were generated from slices with annotations of at least two readers for visualization of limits of agreement. Spearman's rank correlation coefficients (ρ) were calculated between the absolute differences and the mean values to evaluate whether measurement variability increases with lesion size or number. To measure overlap in annotated areas, we calculated the Dice similarity metric using Python (version 3.8.2; Python Software Foundation, Wilmington, Delaware, USA) whenever more than one reader annotated the same feature within a respective B-scan. It is defined as the size of the intersection of two areas divided by their average individual size, ranging from 0 (indicating no spatial overlap) to 1 (indicating complete overlap). For area measures, overlap was calculated on the pixel level. For lateral extent measures, only the lateral location of the feature was taken into account. The mean Dice coefficients per feature are reported. Because of the focal nature of HRF, the Dice coefficient was not regarded as an appropriate metric for their annotations.
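The Dice definition given above (intersection divided by the average of the two sizes) can be sketched in a few lines. This is an illustrative implementation, not the authors' script. Representing an annotation as a set of (row, column) pixel coordinates is an assumption, as is averaging over all reader pairs, since the source does not state how pairs were pooled.

```python
from itertools import combinations


def dice(region_a, region_b) -> float:
    """Dice similarity: |A & B| / ((|A| + |B|) / 2), i.e. 2|A & B| / (|A| + |B|)."""
    a, b = set(region_a), set(region_b)
    if not a and not b:
        return float("nan")  # undefined when neither reader annotated anything
    return 2 * len(a & b) / (len(a) + len(b))


def lateral_footprint(pixels):
    """Columns covered by an annotation; used for lateral-extent features,
    where only the lateral location is taken into account."""
    return {col for _row, col in pixels}


def mean_pairwise_dice(regions_by_reader) -> float:
    """Average Dice over all reader pairs (pooling scheme is an assumption)."""
    scores = [dice(a, b) for a, b in combinations(regions_by_reader, 2)]
    if not scores:
        raise ValueError("need annotations from at least two readers")
    return sum(scores) / len(scores)
```

For area features the sets hold every annotated pixel. For lateral-extent features such as EZ loss, each annotation is first reduced with `lateral_footprint` so that only the horizontal position contributes to the overlap.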

Results

In 111 of the included 112 OCT B-scans, at least one pathologic feature was annotated. Hyporeflective wedges (n = 1), microaneurysm (n = 1), outer retinal tubulations (n = 5), RPD (n = 16), and sub-RPE plaques (n = 3) were present but excluded from analysis due to their rarity in the respective scans. In total, 10 features were used for further analysis (Table 1). Out of the latter group, EZ loss, drusen, and PED were the most abundant features.

Table 1.

Interreader Agreement of Feature Detection

Grading Parameter	n	κ Coefficient	95% CI
Drusen	85	0.367	0.292–0.443
Drusen_def.	64	0.613	0.537–0.689
EZ loss	108	0.260	0.185–0.336
HRF	71	0.422	0.246–0.497
HT	29	0.746	0.671–0.822
IRF	50	0.621	0.545–0.696
OPL descent	20	0.611	0.536–0.687
PED	77	0.598	0.522–0.674
RPE loss	76	0.160	0.085–0.236
SRF	45	0.823	0.747–0.898
SRHM	51	0.357	0.282–0.433

n = overall number of B-scans annotated with the respective feature by at least one reader. CI, confidence interval; Drusen_def., drusen with a minimum size of 1558.6 µm² in the respective B-scan.
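For readers unfamiliar with the κ statistic reported in Table 1, the following is a minimal sketch of Fleiss' κ for binary feature detection (present or absent per B-scan). It assumes every B-scan was graded by the same number of readers, which may not hold for the study data, and the function names are illustrative rather than taken from the authors' R code.

```python
def detections_to_counts(detections):
    """Turn per-scan reader votes (1 = feature present, 0 = absent)
    into [n_present, n_absent] count rows."""
    return [[sum(row), len(row) - sum(row)] for row in detections]


def fleiss_kappa(counts) -> float:
    """Fleiss' kappa from a subjects x categories table of rater counts."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")
    n_categories = len(counts[0])
    # Overall proportion of ratings falling into each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(n_categories)]
    # Observed agreement per subject, then averaged.
    p_obs = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
             for row in counts]
    p_bar = sum(p_obs) / n_subjects
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return float("nan")  # all ratings in one category; kappa is undefined
    return (p_bar - p_exp) / (1 - p_exp)
```

With four readers, a scan on which two readers mark the feature and two do not contributes the least observed agreement, so features that are often split this way end up with low κ even when they are common.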

Feature detection at the B-scan level (i.e., the individual lesion level, which is important when investigating progression) revealed variable interreader agreement (Table 1). The most reliable results could be found in SRF and IRF, which account for neovascular complications, as well as the features HT, OPL descent, and PED. Only slight to moderate interreader agreement could be found in the detection of EZ loss and RPE loss, quite similar to drusen grading. However, setting a threshold of 1558.6 µm² as minimum drusen area (derived from the Age-Related Eye Disease Study [AREDS] definition of a minimal drusen diameter of 63 µm) to exclude so-called drupelets led to a reduced number of annotated B-scans (n = 64) and to a significantly increased κ coefficient of drusen grading, indicating substantial interreader agreement.
The evaluation of interreader agreement concerning the size, lateral extension, or number of annotated features at the B-scan level revealed more consistent results. All ICC values ranged from moderate to excellent correlation (Table 2). The focality measures of HRF (i.e., the number of individual annotated spots) yielded the lowest ICC, which was still above 0.50. The features with the highest scores for interreader agreement of annotated size, lateral extension, or number were PED, SRF, HT, and OPL descent in our cohort (ICC > 0.85, Fig. 1). Similar to the feature detection, exclusion of drupelets led to a higher interreader agreement of grading of drusen size (Table 2).
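The text states only that the 1558.6 µm² threshold was derived from the 63-µm AREDS diameter, not how. One geometric reading that reproduces the number is the area of a semicircle of that diameter (a dome-shaped cross-section on a B-scan). The sketch below uses that reading as an explicit assumption; the filtering helper and its per-annotation application are likewise illustrative.

```python
import math

AREDS_MIN_DRUSEN_DIAMETER_UM = 63.0  # stated in the text


def semicircle_area_um2(diameter_um: float) -> float:
    """Area of a half disc with the given diameter.

    ASSUMPTION: the source does not state this geometric model; it is used
    here only because it reproduces the reported threshold.
    """
    radius = diameter_um / 2.0
    return math.pi * radius ** 2 / 2.0


def drop_drupelets(drusen_areas_um2, threshold_um2):
    """Keep only drusen annotations at or above the minimum area."""
    return [area for area in drusen_areas_um2 if area >= threshold_um2]


threshold = semicircle_area_um2(AREDS_MIN_DRUSEN_DIAMETER_UM)
kept = drop_drupelets([400.0, 1558.7, 9000.0], threshold)
```

Whether the authors applied the threshold per annotated druse or to the summed drusen area of a B-scan is not specified in this section, so the per-annotation filter above should be read as one possible interpretation.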

Table 2.

Interreader Agreement of Size, Lateral Extension, or Number of Annotated Features

Grading Parameter	CoR	CV, %	ICC (95% CI)
Drusen	0.098^a	55.0	0.687 (0.534–0.792)
Drusen_def.	0.094^a	48.5	0.788 (0.670–0.868)
EZ loss	3.446^b	42.4	0.573 (0.415–0.695)
HRF_focality	9.388	64.5	0.527 (0.267–0.699)
HT	0.625^b	24.1	0.936 (0.880–0.968)
IRF	0.121^a	81.8	0.713 (0.525–0.831)
OPL descent	0.763^b	16.2	0.884 (0.739–0.952)
PED	0.134^a	17.6	0.972 (0.959–0.981)
RPE loss	2.157^b	44.8	0.614 (0.345–0.766)
SRF	0.103^a	46.5	0.938 (0.900–0.964)
SRHM	0.234^a	53.9	0.793 (0.644–0.880)

CoR, 95% coefficient of repeatability; CV, coefficient of variation; ICC, intraclass correlation coefficient.

^a Values indicate mm².

^b Values indicate mm.
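As a reading aid for Table 2, the three statistics can be sketched from a one-way ANOVA decomposition. This is a simplified, balanced-design illustration (every scan measured by the same number of readers). The study itself used a linear mixed-effects model in R to handle unbalanced readings, and the exact CoR and CV formulas are not given in this section, so the common definitions below (CoR = 1.96 × √2 × within-subject SD; CV = within-subject SD / grand mean) are assumptions.

```python
import math


def one_way_anova(measurements):
    """Between- and within-subject mean squares for a subjects x readers table."""
    n_subjects = len(measurements)
    k = len(measurements[0])
    if n_subjects < 2 or k < 2:
        raise ValueError("need at least two subjects and two readers")
    if any(len(row) != k for row in measurements):
        raise ValueError("balanced design required in this sketch")
    grand = sum(sum(row) for row in measurements) / (n_subjects * k)
    means = [sum(row) / k for row in measurements]
    msb = k * sum((m - grand) ** 2 for m in means) / (n_subjects - 1)
    msw = sum((x - m) ** 2 for row, m in zip(measurements, means) for x in row)
    msw /= n_subjects * (k - 1)
    return msb, msw, grand, k


def icc_one_way(measurements) -> float:
    """ICC(1,1), one-way random effects, single measurement."""
    msb, msw, _grand, k = one_way_anova(measurements)
    return (msb - msw) / (msb + (k - 1) * msw)


def coefficient_of_repeatability(measurements) -> float:
    """95% CoR, assumed here as 1.96 * sqrt(2) * within-subject SD."""
    _msb, msw, _grand, _k = one_way_anova(measurements)
    return 1.96 * math.sqrt(2) * math.sqrt(msw)


def within_subject_cv_percent(measurements) -> float:
    """CV in percent, assumed here as within-subject SD over the grand mean."""
    _msb, msw, grand, _k = one_way_anova(measurements)
    return 100.0 * math.sqrt(msw) / grand
```

Under these definitions, the CoR carries the unit of the measurement (mm² or mm in Table 2), while ICC and CV are unitless, which is why only the CoR column needs the unit footnotes.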

Figure 1.

OCT-based feature annotation. An OCT B-scan (left) and the respective feature annotation of each reader (right) are demonstrated as example. IRF (blue), SRF (orange), and PED (green) revealed high interreader agreement, while annotations of EZ loss (red) and intraretinal HRF (yellow) significantly differed in size and number between the readers. However, the location of annotated features within the B-scan was quite similar throughout all features.

The Bland–Altman plots did not reveal systematic interreader discrepancies. The mean difference between measurements by different readers was consistently around 0, and no pair of readers consistently showed higher or lower interreader agreement than the others (Fig. 2 and Supplementary Figs. S2–S11). However, the interreader variability increased with annotated area or number according to Spearman's rank correlation coefficient (ρ) for absolute differences and mean values for measures of drusen (ρ = 0.317 to ρ = 0.828, P < 0.001 to P = 0.049), PED (ρ = 0.316 to ρ = 0.605, P < 0.001 to P = 0.042), and HRF (ρ = 0.509 to ρ = 0.761, P < 0.001 to P = 0.018). The area measures of IRF (ρ = 0.311 to ρ = 0.755, P < 0.001 to P = 0.139), SRF (ρ = 0.326 to ρ = 0.517, P = 0.003 to P = 0.062), and SRHM (ρ = 0.150 to ρ = 0.436, P = 0.170 to P = 0.708), as well as lateral distance measures of EZ loss (ρ = 0.010 to ρ = 0.297, P = 0.021 to P = 0.936), HT (ρ = 0.021 to ρ = 0.550, P = 0.027 to P = 0.921), OPL descent (ρ = 0.036 to ρ = 0.455, P = 0.066 to P = 0.964), and RPE loss (ρ = 0.108 to ρ = 0.748, P < 0.001 to P = 0.818), did not show this correlation.

Figure 2.

Interreader agreement. The Bland–Altman plots demonstrate the interreader agreement between two exemplary readers (readers 1 and 4) for measures of drusen, EZ loss, intraretinal HRF, HT, IRF, OPL descent, PED, RPE loss, SRF, and SRHM. The measurement differences (diff.) are plotted against their mean. The solid line indicates the mean difference and the dashed lines indicate the 95% limits of agreement. There were no systematic differences between the readers. Bland–Altman plots for the interreader agreement between each pair of all readers can be found in Supplementary Figures S2 to S11.

More reliable than size, extent, or number of annotated features, the Dice coefficients revealed consistent values over 0.5 (up to >0.75, Table 3) for all features. This indicated a distinct overlap of annotated regions and therefore uniform localization of the features (Fig. 1).

Table 3.

Interreader Agreement of Location of Annotated Features

Grading Parameter	Dice	95% CI
Drusen	0.539	0.507–0.570
EZ loss	0.632	0.606–0.658
HT	0.696	0.646–0.745
IRF	0.549	0.508–0.591
OPL descent	0.720	0.658–0.782
PED	0.764	0.740–0.787
RPE loss	0.650	0.598–0.701
SRF	0.664	0.632–0.697
SRHM	0.612	0.552–0.671

Discussion

In this study, we systematically investigated the reliability of grading an extensive number of structural OCT features associated with different stages of AMD in a reading center setting. The presented findings provided evidence for the dependence of interreader agreement on the respective annotated feature. Hence, the appropriate selection of features has the potential to provide suitable surrogate markers for disease progression and possible therapeutic effects on different disease stages in upcoming interventional trials.
Clinical surrogate markers are needed to accelerate future interventional trials. Best-corrected visual acuity loss does not always constitute a useful endpoint in clinical trials for AMD due to its high interindividual variability, its psychophysical nature, and phenomena such as foveal noninvolvement. Nevertheless, most interventional trials for neovascular AMD currently rely on this feature. In contrast, studies for dry AMD usually use morphologic endpoints like GA (e.g., by semiautomated delineation in fundus autofluorescence imaging) or RORA (defined by OCT imaging), which are accepted as endpoints by regulators. However, atrophic lesions represent the end stage of AMD, and the most effective upcoming therapeutic approach might be directed to earlier disease stages, which is difficult to extrapolate from preclinical data. Ideal surrogate markers, therefore, should be readily captured, reflect the current disease stage, be reliable, and ideally be predictive for long-term progression based on short-term changes. As OCT is the most widely available digital imaging modality in modern ophthalmology, it has already been implemented in routine patient assessment and most clinical trial designs for retinopathies.
For neovascular AMD, the analysis of IRF and SRF is used to evaluate disease activity and treatment indication besides drop of vision, presence of bleedings, or leakage in angiography. It has been shown to be an objective and sensitive measure that might even precede functional impairment and be faster to perform and/or more comfortable than invasive imaging technology like angiography or fundus photography. For dry AMD, multimodal assessment (including OCT) of drusen, pigment epithelial alterations, or signs of RORA is indispensable in the differential diagnosis and analysis of disease progression. The evaluation of additional or individual OCT features could therefore be effectively carried out. A current publication showed a consistent sequence of OCT features in the development of RORA secondary to maternally inherited diabetes and deafness (MIDD), indicating that these features represent different disease stages. Given that MIDD is a mitochondriopathy and mitochondrial dysfunction is considered part of the pathophysiology in AMD, results obtained in that model disease might be partly transferred to AMD. Indeed, an international consensus published by the CAM group indicated that most of these features are associated with RORA development secondary to AMD. It also described features like EZ loss, RPE loss, HT, OPL descent, HRF, and SRHM. However, the reliability of these features has not yet been comprehensively investigated by this group.
Reliability might be the most important prerequisite to define a surrogate marker for patient assessment and future interventional clinical trials. Rather low interreader agreement was found in the detection of EZ loss and RPE loss. The reliability of the size and location of both feature annotations, however, was distinctly higher, although the ICC did not reach the levels of previously published data (0.75 for RPE loss).
However, that earlier study used another OCT device (Spectralis HRA-OCT; Heidelberg Engineering, Heidelberg, Germany) that might have led to better image quality. Some of the differences between readers might be due to inaccurate delineation of lesion borders since loss and attenuation of RPE and/or EZ might merge (Fig. 1). Interestingly, the average relative difference between two readers for RPE loss was reported there as 72.4, which was significantly higher than the CV (44.8) in our study, while both measures are thought to be independent of lesion size. Concerning HRF, the variable number might derive from the size of the feature. Readers might have simply overlooked small features, leading to no more than moderate reliability (Fig. 1).
As these features with low interrater agreement might be inherently problematic for humans to detect and quantify on OCT images, their utility as surrogate markers in clinical studies is limited. In this context, automated artificial intelligence–based feature detection is likely to be more consistent and precise than human graders. The application of deep learning and its broader family, ML, might be a way forward in harnessing these potential surrogate markers. However, ML algorithms are trained, and their performance is judged, against the human “gold standard,” which, if unreliable, may be problematic. Different approaches try to address this problem: (1) Prerequisites for reliable gradings are precise definitions and grading protocols as well as proper annotation platforms (respectively, software environments). (2) Training an ML algorithm on gradings from multiple graders could converge these gradings to an average grader, which would mitigate part of the subjectivity. (3) A consensus grading (e.g., from a consensus meeting or by averaging gradings or by adjudicating inconsistencies) might be considered “superhuman” (i.e., better than a single grader).
This superhuman grading could be used to develop a model that produces results at the same quality. (4) The use of additional data (e.g., other modalities or follow-up images) may allow for improved grading. (5) By using super-quality imaging (e.g., higher-resolution OCT), more reliable gradings might be obtained, which could then be transferred to standard-quality imaging for model development. Moreover, ML is likely to be the only way to quantify the large volumes of dense OCT raster scans that are being generated in clinical trial reading centers, busy clinical practices, and emerging home/remote OCT devices.
More consistent results could be found for SRF and IRF. Here, our results revealed high interreader agreement in all three investigated parameters (detection, size, and location; Fig. 1). This was in line with previously published data. Despite different data sets, the ICCs between readers described here were higher than the ICCs derived from intermodality reliability between spectral-domain and time-domain OCT. Given that both features reflect neovascular activity and guide the indication for anti-VEGF treatment (besides other clinical features, including hemorrhage and loss of vision), this might be of particular importance.
A recent study investigated the interreader agreement of PED size measures and reported an ICC of over 0.99. The slightly higher ICC value compared with our study (0.972) might be traced back to the fact that the earlier study included only 20 eyes with a definite presence of PED and did not simultaneously assess other retinal alterations. The possible impact of reader fatigue (number of images and/or features) might be worth investigating in a future study. We noted a high reliability of the HT feature, supporting previously published data. In contrast, no previous report has systematically investigated interreader agreement of OPL descent.
Given its high reliability (Tables 1–3) and its appearance in the development of RORA, OPL descent warrants further investigation to explore its potential as a surrogate marker in future clinical trials as well as for training ML algorithms.
Interestingly, the reliability of OCT-based feature annotation for SRF, HT, and PED, for example, reached the reliability of grading atrophic lesions in fundus autofluorescence imaging in different diseases, including AMD. However, OCT imaging uses less energetic infrared light that minimizes potential light toxicity and is more comfortable for the patient. Furthermore, OCT imaging does not rely on pupil dilation, and devices are more common than fundus autofluorescence imaging devices. In this context, OCT scans were selected in a randomized manner in our study. A previous study revealed that more eccentric scan locations might lead to less reliable results. Therefore, evaluating central scans only might have led to even higher interreader agreement. Nevertheless, additional features of summation images like shape-descriptive parameters or dynamic flow signal could give further information, suggesting a multimodal assessment as a gold standard in AMD diagnosis and study design at the current stage of imaging technology.
It has been shown by AREDS that the number and size of drusen might predict progression of AMD. Furthermore, we could show that the AREDS definition of minimum drusen size makes sense not only in the context of color fundus photography but also for OCT grading, as the so-called drupelets (diameter <63 µm) have an unclear pathologic importance, and their exclusion led to a significant increase in interreader agreement (Tables 1 and 2). If, nevertheless, a delineation of drupelets is aimed for, an automated artificial intelligence–based feature detection is likely to show improved performance over human graders, similar to the abovementioned small feature of HRF.
More recently, a focus was set on the predictive value as well as the complicated delineation of drusen in the presence of RPD (also termed subretinal drusenoid deposits). In this context, the low number of patients with RPD (which led to exclusion from further analysis) is a limitation of our study, and future studies focusing on interrater reliability of drusen, including this particular feature, are warranted. Besides drusen and RPD, the presentation of HRF and the baseline atrophic lesion size were also reported to affect future progression rate. Concerning exudative complications, the predictive value of SRF has been controversially discussed, while the extent of central retinal thickening and IRF is thought to represent the neovascular activity and therefore visual outcome. Therefore, it might be hypothesized that some of the additionally presented imaging features could also be predictive for neovascular or dry AMD progression. However, the image feature description in this study was based on retrospective cross-sectional data, and it was beyond the scope of this study to evaluate the accuracy of predictive factors. Still, if noted to be present, the consistency of size and location of most imaging features has the potential to provide the framework for further prospective studies. These prospective studies would allow further evaluation of the predictive value, which might give more insights into the pathophysiology of AMD and allow for effective study design as presented before for different parameters in AMD or other retinopathies.
A further limitation of this study is the use of OCT imaging devices from a single manufacturer. Different OCT imaging devices might produce different scanning artifacts or image quality. Thereby, the annotation and, hence, the reliability of single features might be different on large-scale real-world data.
As there is no gold standard, it cannot be excluded that features have been missed and that other data sets could provide additional conclusions. To minimize this possibility, we relied on trained retinal specialists to identify and interpret the features, and the opportunity to add additional features was given at all time points during annotation (Supplementary Fig. S1). Finally, readers might have utilized the contextual B-scans differently, which was not recorded. However, the variability of their approach and annotations reflects the human variability, which was part of the purpose of this study. An evaluation of how human readers use additional images for grading might be an interesting question for a future study, especially in the context of multimodal approaches to retinal diseases.
In conclusion, this study evaluated the reliability of annotations of multiple OCT features representing different disease stages in a reading center setup. The inclusion of objective and reliable features like SRF, IRF, HT, OPL descent, or PED into future studies might enable multiple surrogate markers representing different disease stages within a single image. This might open up numerous new opportunities for evaluating disease progression and possible treatment effect in AMD, possibly leading to a more time- and cost-effective interpretation, further insights into the pathomechanisms, enhanced individualized patient assessment, and improved training of ML applications. Emerging advances in artificial intelligence training and validation may allow for a higher consistency in performance than human graders, suggesting a wider variety of reliable surrogate markers and potential benefits in the future.
