
Evaluating the Alzheimer's disease data landscape.

Colin Birkenbihl1,2, Yasamin Salimi1,2, Daniel Domingo-Fernández1,2, Simon Lovestone3, Holger Fröhlich1,2, Martin Hofmann-Apitius1,2.

Abstract

INTRODUCTION: Numerous studies have collected Alzheimer's disease (AD) cohort data sets. To achieve reproducible, robust results in data-driven approaches, an evaluation of the present data landscape is vital.
METHODS: Previous efforts relied exclusively on metadata and literature. Here, we evaluate the data landscape by directly investigating nine patient-level data sets generated in major clinical cohort studies.
RESULTS: The investigated cohorts differ in key characteristics, such as demographics and distributions of AD biomarkers. Analyzing the ethnoracial diversity revealed a strong bias toward White/Caucasian individuals. We described and compared the measured data modalities. Finally, the available longitudinal data for important AD biomarkers were evaluated. All results are explorable through our web application ADataViewer (https://adata.scai.fraunhofer.de).
DISCUSSION: Our evaluation exposed critical limitations in the AD data landscape that impede comparative approaches across multiple data sets. Comparison of our results to those gained by metadata-based approaches highlights that thorough investigation of real patient-level data is imperative to assess a data landscape.
© 2020 The Authors. Alzheimer's & Dementia: Translational Research & Clinical Interventions published by Wiley Periodicals, Inc. on behalf of Alzheimer's Association.


Keywords:  Alzheimer's disease; FAIR data; biomarker; clinical study; cohort; cohort study; data; data access; data set; data sharing; data viewer; data‐driven; dementia; disease modeling; magnetic resonance imaging; open‐science; patient level data

Year:  2020        PMID: 33344750      PMCID: PMC7744022          DOI: 10.1002/trc2.12102

Source DB:  PubMed          Journal:  Alzheimers Dement (N Y)        ISSN: 2352-8737


BACKGROUND

In the field of Alzheimer's disease (AD) research, numerous cohort studies have been conducted, and their collected data build the basis for a plethora of research projects. However, each of these studies only reflects patients of a particular subpopulation defined by inclusion and exclusion criteria. This is becoming especially relevant with respect to the increasing popularity of data-driven approaches and machine learning. After analyzing a single cohort, it is mandatory to demonstrate that results are reproducible in independent, external data originating from distinct cohort studies. Furthermore, it is essential to conduct comparative analyses across data sets to assess whether the observed patterns are robust. Such systematic data-driven approaches are, however, hampered because patient-level data are often difficult to access or entirely inaccessible. Moreover, we have limited knowledge about how the distinct cohort data sets available in our field compare to each other on a qualitative (eg, overlap of measured variables) as well as quantitative level (eg, values encountered in the data). Thus, to leverage the full potential of collected patient-level data, it is important to characterize the clinical AD data landscape in detail.

Metadata‐driven evaluations of the Alzheimer's disease data landscape

Evaluating a data landscape involves organizing and comparing data sets to: (1) qualitatively assess their collected data modalities and variables, and (2) quantitatively describe the demographics of the study population and distributions of measured variables. Such characterization provides a detailed overview of the data accessibility and supports the design of research projects and future cohort studies. Finally, evaluating a data landscape inherently exposes potential flaws with regard to interoperability between existing data sets and underrepresentation of important disease or population characteristics.

RESEARCH IN CONTEXT

Systematic review: The authors reviewed relevant literature through bibliographic search engines. Relevant cohort data sets were discovered through data portals, data publications, and citations in the literature. Applications were filed for 18 cohort data collections, of which 9 were successful. Interpretation: The presented results illustrate the current state of the Alzheimer's disease (AD) data landscape from a patient-level, data-centric perspective, whereas previous investigations relied solely on provided cohort metadata. This investigation exposes limitations in data availability and interoperability and establishes a detailed overview of the resources current data sets provide for data-driven analyses. Future directions: This work emphasizes the need for a common semantic framework for patient-level AD data to enable the community to work across cohort data sets and ultimately to generate robust scientific insights that advance AD research.

In the AD field, previous studies have attempted to establish a comprehensive view of the AD data landscape and to demonstrate how cohort data sets relate to each other. For example, the European Medical Information Framework (EMIF) collected metadata of AD cohort studies by providing data owners with a questionnaire in which they could specify the variables contained in their data sets. The resulting metadata are presented through the EMIF-Catalog. Similarly, the Real world Outcomes across the Alzheimer's Disease spectrum for better care: Multi-modal data Access Platform (ROADMAP) project generated an overview of the clinical outcomes and data modalities collected in several European AD cohort studies. By analyzing metadata (partially originating from the EMIF-Catalog), ROADMAP created the ROADMAP Data Cube, a web application that shows the availability of AD-related outcomes in a selected set of European dementia cohorts (https://datacube.roadmap-alzheimer.org).
Lawrence et al., on the other hand, opted for a literature‐based approach to assess the AD data landscape. The authors reviewed publications corresponding to AD cohort data sets and gathered the contained information.

Moving beyond metadata through data‐level investigations

All of the above‐mentioned undertakings attempted to evaluate the AD data landscape solely on the basis of metadata and literature, without investigating the underlying patient‐level data. However, reviewing study protocols can only explain the original design of a given study and thereby neglects unforeseen changes in procedures or participant recruitment throughout study runtime. The alternative approach is a patient‐level and data‐driven evaluation of the AD data landscape, which is a tedious and time‐consuming endeavor. The first hurdle of such an approach is gaining access to a sufficient number of cohort data sets. Data access typically requires completing an application procedure with numerous legal requirements and considerations. If access is granted, intensive manual curation and investigation of data follow. Although difficult to establish, a comprehensive data‐driven view on the AD data landscape is crucial, since reliance exclusively on metadata assumes that these metadata correctly describe the underlying data sets and that the data sets are complete. In contrast, a patient‐level and data‐driven evaluation (1) is not subject to these assumptions, (2) allows for a quantitative investigation of important cohort statistics, and (3) illustrates the amount and quality of the data accessible to the field.

Novelty and impact of this work

In this work, we aimed to assess the current AD data landscape through meticulous investigation and curation of accessible cohort data sets at the data level rather than relying solely on metadata and/or literature. To accomplish this task, we tracked down, accessed, investigated, and compared nine of the major clinical cohort study data sets available in the AD field. Here, we comprehensively describe the acquired data and show which data modalities we found in the data sets as well as their overlaps with other studies. In addition, we assessed the longitudinal follow-up at the biomarker level and demonstrated to what extent current AD data cover the progression of the disease. Furthermore, we compared the content we observed in these data sets with the reported findings of metadata-based approaches. Finally, we made all results available through ADataViewer (https://adata.scai.fraunhofer.de), an interactive web portal that allows researchers to explore the AD data landscape generated based on the investigated data sets.

METHODS

Investigated cohorts

We aimed to acquire as many major AD cohort studies as possible to allow for a thorough investigation of the data landscape. We considered only data sets that were downloadable, thereby excluding data portals with restricted data access from our investigations. Most of the data sets we accessed were shared after completing an official data request process. We applied for access to 18 distinct AD cohort data sets. By the time this work was submitted for publication, we had been granted access to nine (Table 1). We discuss the reasons behind failed data access applications in the Supplementary Text. Notably, not all of the accessed data sets are observational cohort studies in the strict sense; for more information, please see the Supplementary Text.
TABLE 1

The investigated AD cohorts and their references

Cohort     | Consortium                                                             | Reference
A4         | Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease             | 9
ADNI       | The Alzheimer's Disease Neuroimaging Initiative                        | 10
ANMerge    | AddNeuroMed                                                            | 11
AIBL       | The Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing | 12
EMIF-1000  | European Medical Information Framework                                 | 13
EPAD v1500 | European Prevention of Alzheimer's Dementia                            | 14
JADNI      | Japanese Alzheimer's Disease Neuroimaging Initiative                   | 15
NACC       | The National Alzheimer's Coordinating Center                           | 16
ROSMAP     | The Religious Orders Study and Memory and Aging Project                | 17
It is important to be aware that not all of these studies followed the same design or goals. Each study enforced its own recruitment criteria and enrolled participants following distinct selection processes. Although some aimed for a case-control setting and included a substantial number of AD patients in their cohort, others deliberately excluded them to focus on early disease progression. Therefore, the cohort data sets are all subject to inherent biases.

Generating the summary statistics

To illustrate the content of the data sets, we characterized the demographics of each cohort and described the encountered statistical distributions of important AD biomarkers. The demographic variables we considered are participant age, sex, and completed years of education. The choice of AD biomarkers we compared between cohorts is motivated in the Supplementary Text. In addition, we assessed the diversity of ethnoracial groups in the acquired AD cohorts, since it is known that ethnoracial factors may impact AD and related findings. More detailed definitions of the ethnoracial groups can be found in the Supplementary Text. For numerical variables, we describe the encountered distributions using the 25%, 50%, and 75% quantiles of the raw measurements. For categorical variables, we describe the proportion of study participants falling into each category. In some data sets, certain variables were reported numerically only if the measurement fell within a defined value range (eg, 400 to 1700). If a measurement lay outside this range, the exact number was not reported but replaced with the cutoff (eg, ">1700"). To allow for calculations, we considered these values to be equal to the mentioned cutoff (here, 1700).
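The quantile computation with range-capped values can be sketched as follows; this is a minimal illustration, not the authors' actual code, and the example measurements are hypothetical:

```python
import pandas as pd

def summarize_numeric(values):
    """Compute the 25%/50%/75% quantiles of a variable, replacing
    range-capped entries such as ">1700" or "<400" with the cutoff
    itself so that calculations remain possible."""
    cleaned = []
    for v in values:
        if isinstance(v, str) and v[0] in {"<", ">"}:
            cleaned.append(float(v[1:]))  # ">1700" -> 1700.0
        else:
            cleaned.append(float(v))
    return pd.Series(cleaned).quantile([0.25, 0.5, 0.75]).tolist()

# Hypothetical capped measurements (values are illustrative only)
print(summarize_numeric([520, 880, ">1700", 1310, "<400", 990]))
```

Treating a capped value as equal to its cutoff slightly biases the upper quantile toward the cutoff, which is the trade-off the text accepts in exchange for keeping these observations in the calculation.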

Generating the data availability map

While establishing a data landscape, it is of high interest to identify the data modalities that were measured in the underlying studies and to compare their overlaps. However, assessing the availability of data modalities in clinical cohort data sets is not straightforward. This process involves intensive, meticulous manual curation of the acquired data sets and requires the definition of curation criteria specifying under which circumstances each data modality is considered "available." Furthermore, it is often necessary to define a gradual categorization representing the degree of availability. For example, exclusively measuring two specific single nucleotide polymorphisms (SNPs) is not equal to conducting genome-wide genotyping of individuals. Similarly, distributing normalized brain volumes summed over both hemispheres is less informative than providing the underlying raw magnetic resonance (MR) images. The latter would enable researchers to process the images according to their needs, whereas the former impedes interoperability with other data sets due to differences in the employed image-processing pipelines. This could hamper certain analyses such as systematic comparisons across cohorts or validation approaches. To enable a meaningful, comparable assessment of the availability of data modalities, we established criteria for categorizing the availability of each modality into three discrete stages (Supplementary Table S1): stage 0, no data were available for the respective modality; stage 1, data were partially available; and stage 2, more complete data or unprocessed raw data were available.
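A minimal sketch of how such three-stage scoring might be encoded is shown below; the modality names and flags are illustrative placeholders, not the actual criteria of Supplementary Table S1:

```python
# Stage 0: no data; stage 1: partial/processed data; stage 2: complete/raw data.
# The flags and modality names are hypothetical, standing in for the
# per-modality criteria defined in Supplementary Table S1.
def score_modality(has_any_data, has_raw_or_complete):
    if not has_any_data:
        return 0
    return 2 if has_raw_or_complete else 1

# Hypothetical availability flags for one cohort
cohort_modalities = {
    "genotyping": (True, False),   # only selected SNPs measured -> stage 1
    "mri":        (True, True),    # raw MR images shared        -> stage 2
    "autopsy":    (False, False),  # no data                     -> stage 0
}
scores = {m: score_modality(*flags) for m, flags in cohort_modalities.items()}
print(scores)  # {'genotyping': 1, 'mri': 2, 'autopsy': 0}
```

Collecting such scores per cohort yields exactly the kind of cohort-by-modality matrix shown later in Figure 2A.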

Investigating longitudinal follow‐up across studies

To assess how far existing cohort data sets cover the time dimension of AD, we conducted a thorough investigation of their respective longitudinal follow-up. For each cohort, we evaluated how many participants were assessed at each follow-up visit and implicitly analyzed the drop-out over the study runtime. Since not all measurements were performed at each visit and not every individual participated in all sample collections, we further focused on the follow-up and coverage of important AD biomarkers. Determining the amount of available longitudinal data per biomarker provides insight into how much information we can exploit to model and ultimately understand patterns of AD progression. As of publication of this article, EPAD and NACC are still subject to ongoing data collection, while ADNI received funding to extend its study and continue participant recruitment.

RESULTS

Investigation of the AD data landscape

Altogether, we investigated data from nine studies comprising a total of 60,004 assessed study participants. Table 2 shows how these participants were distributed among the analyzed cohorts. With NACC being the exception (n = 40,858), all studies recruited individuals in the low thousands (n ≈ 1200 to 3600). According to their diagnosis, participants could be separated into three groups: cognitively healthy controls, patients with mild cognitive impairment (MCI), and patients with AD. Seven of the investigated studies based their diagnoses on the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) criteria, which significantly increases the interoperability between those data sets, since AD follows the same semantic description. Depending on each study's goals, the recruitment process focused on enrolling more or fewer individuals falling into specific diagnosis groups.
TABLE 2

Description of the investigated cohorts

Cohort     | N     | Healthy | MCI  | AD    | N with 2+ visits | Follow-up interval (months) | Location              | Diagnostic criteria AD
A4         | 6943  | 6943    | 0    | 0     | 0*               | ≈8                          | US, Canada, Australia | AD patients excluded
ADNI       | 2249  | 813     | 1016 | 389   | 1978 (88%)       | 6                           | USA, Canada           | NINCDS-ADRDA
AIBL       | 1378  | 803     | 134  | 181   | 1019 (74%)       | 18                          | Australia             | NINCDS-ADRDA
ANMerge    | 1702  | 793     | 397  | 512   | 1254 (74%)       | 12                          | Europe                | NINCDS-ADRDA
EMIF       | 1221  | 386     | 526  | 201   | 0                | no follow-up                | Europe                | NINCDS-ADRDA
EPAD v1500 | 1500  | 1410    | 80   | 3     | 0*               | 6                           | Europe                | NINCDS-ADRDA
JADNI      | 537   | 151     | 233  | 149   | 518              | 6                           | Japan                 | NINCDS-ADRDA
NACC       | 40858 | 15894   | 3649 | 11761 | 27657 (68%)      | 12                          | US                    | UDS Form D1
ROSMAP     | 3627  | 2514    | 898  | 203   | 3335 (92%)       | 12                          | US                    | NINCDS-ADRDA

NOTE: The numbers of diagnosed subjects do not always add up to N, since patients with different dementia diagnoses (eg, Lewy body or frontotemporal dementia) were excluded. N, total number of participants; Healthy/MCI/AD, number of participants with the respective diagnosis at study baseline; N with 2+ visits, number of study participants for whom data for at least two time points are available; Follow-up interval, approximate regular time interval between participant visits. *Longitudinal data have been collected but are not yet released.

Although no data are shared through our web-portal, information on how to access the data sets can be found at https://adata.scai.fraunhofer.de/cohorts.

Characterization of the cohorts

Investigation of the cohort demographics revealed considerable differences in key demographic characteristics between the acquired cohorts. EPAD, for example, recruited a comparatively young and primarily non-symptomatic cohort, whereas participants of ANMerge and ROSMAP were significantly older (Table 3). Across all cohorts, the age range spans roughly from 60 (lowest 25% quantile) to 85 years (highest 75% quantile). Theoretically, this opens the opportunity to construct a pseudo-continuum of 25 years of disease history. Furthermore, in most studies we observed a general tendency for more female than male participants to enroll. Overall, most individuals included in the AD cohort studies were highly educated (≈14 years on average). As previously mentioned by Whitwell et al., a high level of education can act as cognitive reserve, possibly concealing a prodromal manifestation of AD. Many of the demographic differences found between studies may result from distinct recruitment criteria, which, again, mirror the individual study goals. Although distinct recruitment criteria lead to a broader sampling of the AD population, they reduce the direct comparability between data sets because they inevitably introduce bias into the data. One key example is recruiting specifically for participants with AD risk factors (eg, APOE genotype). This could significantly bias the patterns exhibited in the data in comparison to another data set with a lower proportion of APOE ε4-positive participants.
TABLE 3

Distribution of demographic variables and key AD biomarkers encountered in each cohort

Cohort  | Female % | Age        | Education  | APOE ε4 % | MMSE       | CDR           | CDR-SB        | Hippocampus      | A-beta          | t-Tau         | p-Tau
A4      | 57.7     | 68, 71, 75 | 14, 16, 18 | 34.3      | 28, 29, 30 | 0.0, 0.0, 0.0 | 0.0, 0.0, 0.0 | 6, 7, 7          |                 |               |
ADNI    | 47       | 68, 73, 78 | 14, 16, 18 | 45.6      | 26, 28, 29 | 0.0, 0.5, 0.5 | 0.0, 1.0, 2.0 | 5948, 6864, 7651 | 596, 854, 1396  | 193, 258, 350 | 17, 24, 34
AIBL    | 57.9     | 67, 73, 79 | 10, 12, 15 | 36        | 26, 28, 30 | 0.0, 0.0, 0.5 | 0.0, 0.0, 1.0 | 3, 3, 3          | 445, 567, 802   | 238, 366, 516 | 43, 64, 81
ANMerge | 59.3     | 71, 77, 81 | 8, 11, 14  | 38.8      | 24, 28, 29 | 0.0, 0.5, 0.5 | 0.0, 0.5, 4.0 | 5311, 6270, 7142 |                 |               |
EMIF    | 46.2     | 62, 68, 74 | 9, 12, 15  | 46.8      | 25, 28, 29 | 0.5, 0.5, 0.5 |               | 6357, 7223, 8004 | 385, 525, 739   | 160, 278, 504 | 37, 52, 74
EPAD    | 56.9     | 60, 66, 71 | 12, 15, 17 | 37.7      | 28, 29, 30 | 0.0, 0.0, 0.0 | 0.0, 0.0, 0.0 | 4413, 4808, 5182 | 899, 1319, 1700 | 162, 201, 252 | 13, 17, 22
JADNI   | 52.7     | 66, 72, 77 | 12, 12, 16 | 46.1      | 24, 26, 29 | 0.0, 0.5, 0.5 | 0.0, 1.5, 3.0 | 5260, 6133, 7132 | 254, 315, 454   | 67, 101, 138  | 36, 48, 73
NACC    | 57.2     | 65, 72, 79 | 12, 16, 18 | 40.6      | 23, 27, 29 | 0.0, 0.5, 0.5 | 0.0, 1.0, 4.0 | 43.5%            | 46.5%           | 43.9%         | 43.9%
ROSMAP  | 72.8     | 73, 79, 84 | 14, 16, 18 | 25.1      | 27, 29, 30 |               |               |                  |                 |               |

NOTE: We show the 25%, 50%, and 75% quantiles of numerical variables at baseline. Categorical variables are given as the proportion of participants falling into the respective category. APOE ε4 %, proportion of participants with at least one APOE ε4 allele. For Hippocampus, A-beta, t-Tau, and p-Tau, NACC values are given as the proportion of "abnormal" observations.

To further highlight one potential bias in AD data, we analyzed the ethnoracial diversity encountered in the investigated AD cohorts (Figure 1). An aggregated analysis of all acquired data sets demonstrates that most of the recruited individuals come from a White/Caucasian background (79.3%). The second largest group was Black/African descendants with 11.5%, followed by participants of Latin/Hispanic heritage with 5.6%. Here, we would like to point out that these findings are heavily influenced by the study location and the number of enrolled participants per study. Because the majority of the studies have been conducted in the United States, their locally exhibited ethnoracial diversity overshadows signals from European cohorts. However, the analogous plots for each European cohort show not only a similar, but an even more extreme tendency toward White/Caucasian individuals (EPAD: 99% White; ANMerge: 98.5% White; see https://adata.scai.fraunhofer.de/ethnicity).
FIGURE 1

Combined ethnoracial diversity found across the investigated AD cohorts. Table S2 shows the individual compositions of each cohort

The ethnoracial composition of the investigated cohorts reflects the diversity of the populations from which the participants were recruited. Nonetheless, our results elucidate that there is a substantial bias toward White/Caucasian individuals in AD data sets and a severe underrepresentation of other ethnoracial groups, which, in turn, could be problematic for developing personalized treatments.

Availability of data modalities

To analyze which modalities are available in our investigated cohorts and to explore the overlaps between them, we assigned a score of availability per data modality according to our previously described criteria (Table S1). In Figure 2A, we show an overview of the data modalities and their availability score in all acquired cohort data sets. Commonly assessed modalities throughout all studies were demographic variables (eg, age, sex, and education) as well as clinical assessments (eg, Mini Mental State Examination [MMSE]). Regarding these two modalities, eight studies were assigned the availability score 2, with EMIF and AIBL being the only exceptions due to missing ethnoracial information. Cerebrospinal fluid (CSF) biomarker measurements were found to be present in all data sets but ANMerge. With regard to autopsy data, only ROSMAP contained a detailed collection, ranging from simple measurements such as brain weight to comprehensive brain proteomics and transcriptomics. Although seven studies released some structural MRI data, three of those limited the shared data to processed MRI features (eg, brain volumes). In our case, only ADNI, NACC, JADNI, EPAD, and ANMerge granted access to the raw images.
FIGURE 2

Interoperability of AD data sets. A, Availability of data modalities scored based on the defined criteria. The criteria are explained in Supplementary Table 1. B, Equivalence of clinical assessment variables across cohorts. PET = positron emission tomography

Although the purpose of this section is to provide a comprehensive overview of the availability of data modalities, we would like to emphasize that the presented results depend strongly on our defined curation criteria, and different criteria could lead to deviating results. In addition, all investigated data sets could hold more information than we presented here. Because we looked exclusively into those patient-level data that were indeed shared with us, it is possible that we missed modalities or resources that exist but were not shared (eg, MRI images). Our results can be explored at https://adata.scai.fraunhofer.de/modality.

Metadata investigation versus data investigations

To establish how our observations of data availability differed from results gained by solely investigating metadata, we qualitatively compared our findings to the metadata presented in the EMIF-Catalog. Only four of our investigated studies were listed: ADNI, ANMerge, EMIF, and EPAD. Although the majority of our findings are in concordance with the EMIF-Catalog, deviations between metadata and the real data exist. We encountered variables in the data sets that are reported as absent in the catalog (eg, Global Deterioration Scale in ANMerge) or were not listed at all. Other variables and even modalities are reported to be present, yet could not be found in the respective data set. For instance, the catalog states that post-mortem brain autopsy was performed in ANMerge, for which we could not find any evidence. Similar observations were made when comparing our findings to the review by Lawrence et al. Here, for example, the reported longitudinal follow-up of ANMerge is significantly shorter than what we observed in the data (reported: 12 months, data: 84 months). In addition, the reported number of participants with at least two visits does not match our findings (reported: 378, data: 1254 participants).

The presence of common modalities across cohorts does not imply that the measured variables are interoperable or even comparable on a semantic level. By mapping a variety of variables across the data sets, we established an overview of their interoperability (Figure 2B). We would like to emphasize that the current version of these mappings is not complete but a proof of concept that a semantic integration of these data sets is, in theory, possible. However, this integration is cumbersome and time-consuming, as many data sets exhibit low interoperability and distinct variable naming conventions. An in-depth view of the preliminary mappings is given at https://adata.scai.fraunhofer.de/feature_comparison.

Disease manifestation across cohorts

To evaluate how severely patients from each cohort have been affected by AD, we compared the distributions of both cognitive outcomes and key biomarkers for the cognitively affected patient subgroups (ie, participants with an MCI or AD diagnosis). Table 3 shows the distributions for each complete cohort including healthy controls, MCI, and AD patients. Analogous tables per diagnosis subgroup can be found at https://adata.scai.fraunhofer.de/cohorts. According to the MMSE scores, AD patients from AIBL (quantiles: 15, 20, 25), ANMerge (quantiles: 16, 21, 25), and NACC (quantiles: 16, 21, 25) showed the worst cognitive performance. ADNI (quantiles: 21, 23, 25) contained patients with fewer cognitive symptoms. The CDR Dementia Staging Instrument (CDR) Sum of Boxes (CDR-SOB) scores slightly shift the perspective. Here, ANMerge is the most affected cohort, with its 25%, 50%, and 75% quantiles of the CDR-SOB scores being 4, 6, and 9, respectively. AIBL patients scored 3.5, 5, and 7, which slightly contradicts the image painted by the MMSE scores. Again, ADNI shows the least cognitive symptoms, with CDR-SOB quantiles of 3, 4.5, and 5. A direct comparison of raw biomarker measurements between cohorts proved to be impossible, since the encountered values are on different scales and may be subject to batch effects. Thus, we analyzed how far measurements diverged from the respective control population in each cohort (Supplementary Text). The prerequisite for comparative approaches involving biomarker measurements across data sets is an alignment of their underlying data models (ie, making data interoperable). In our analysis, we found that each study had defined its own data model, and variable names differed between them. This forced us to individually map variables to their corresponding counterparts in other studies to enable comparisons in the first place (eg, combine "lh_hippo_volume" and "rh_hippo_volume" and map to "Hippocampus").
Another difficulty is that numerous data sets reported values of equivalent variables in different ways. For example, CSF biomarker measurements are reported as either normal (0) or abnormal (1) in NACC, whereas other studies provide numerical values that were capped at thresholds differing between studies (eg, ">1700"). All these factors lead to a severe lack of interoperability between data sets, which significantly limits comparative approaches and restricts them to more standardized variables such as clinical assessment scores.
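The kind of per-study variable mapping described above can be sketched as follows. Only "lh_hippo_volume" and "rh_hippo_volume" are taken from the text; all other column names, the `harmonize` helper, and the data values are hypothetical illustrations, not the authors' actual mapping tables:

```python
import pandas as pd

# Map study-specific column names onto one harmonized target name.
# The left/right hippocampal volume example comes from the text;
# "mmse_total" and "MMSCORE" are hypothetical alternative names.
def harmonize(df, mapping, combine=None):
    out = pd.DataFrame()
    for target, sources in mapping.items():
        cols = [c for c in sources if c in df.columns]
        if not cols:
            continue  # variable not measured in this study
        if combine and target in combine:
            out[target] = df[cols].sum(axis=1)  # eg, lh + rh volumes
        else:
            out[target] = df[cols[0]]
    return out

study_df = pd.DataFrame({
    "lh_hippo_volume": [2900, 3100],
    "rh_hippo_volume": [3000, 3200],
    "mmse_total": [28, 24],
})
mapping = {
    "Hippocampus": ["lh_hippo_volume", "rh_hippo_volume"],
    "MMSE": ["mmse_total", "MMSCORE"],
}
harmonized = harmonize(study_df, mapping, combine={"Hippocampus"})
print(harmonized)
```

Even such a simple sketch shows why the process is cumbersome: the mapping dictionary must be curated by hand for every study, and decisions such as summing hemispheres cannot be automated away.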

Longitudinal follow‐up

The majority of the investigated studies have collected longitudinal data in the form of repeated measurements. The intervals of data collection differed across studies (Table 2). Figure 3A displays the drop-out of study participants over time relative to the size of the cohort. In this analysis, participants were considered if at least one measurement was taken at the respective month. However, an individual's participation in some assessments does not imply that all biomarker values were acquired for that individual on all visits. Thus, we additionally investigated the number of study participants for whom selected AD biomarkers were measured over time (Figure 3). Plots for all of the investigated biomarkers can be found at https://adata.scai.fraunhofer.de/follow-up.
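The per-biomarker coverage curves can be derived roughly as below, assuming long-format visit records; the column names and values are hypothetical, not taken from any of the studies:

```python
import pandas as pd

# Long-format visit records: one row per participant and visit month.
# All names and values here are illustrative placeholders.
visits = pd.DataFrame({
    "participant": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "month":       [0,    12,   24,   0,    12,   0],
    "mmse":        [29,   28,   27,   25,   24,   30],
})

# Size of the cohort at baseline (month 0)
n_baseline = visits.loc[visits["month"] == 0, "participant"].nunique()

# Proportion of the baseline cohort with an MMSE measurement at each visit
coverage = (
    visits.dropna(subset=["mmse"])
          .groupby("month")["participant"].nunique()
          .div(n_baseline)
)
print(coverage)
```

Running the same computation per biomarker (eg, CSF amyloid beta instead of MMSE) and plotting `coverage` over `month` yields curves of the kind shown in Figure 3B and 3C.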
FIGURE 3

Longitudinal follow‐up as the proportion of participants at study baseline (ie, participants were aligned based on their first visit). A, At least one variable measured. B, CSF amyloid beta. C, MMSE scores. CSF = cerebrospinal fluid. MMSE = Mini Mental State Examination

One example biomarker that we selectively investigated is CSF amyloid beta, for which Figure 3B displays the longitudinal coverage. Comparing Figure 3B with Figure 3A demonstrates that CSF samples were, if at all, taken only from a small fraction of participants consistently over time. Summed over all investigated cohorts, only 273 participants (0.5%) underwent CSF sampling at baseline and again three years later. In contrast to CSF, cognitive assessments follow the drop-out curves quite closely (Figure 3C). Although these findings are not surprising given the invasiveness of CSF sample collection, they raise severe concerns regarding the robustness of statistical analyses based on CSF data. In turn, this again elucidates that comparative longitudinal approaches in the AD field are limited mainly to cognitive assessments or suffer from small sample sizes.

DISCUSSION

In this work, we established an overview of the AD data landscape by investigating patient-level data from nine major clinical AD cohort studies. Our results demonstrate that the individual data sets vary with respect to key characteristics, such as the number of enrolled participants per diagnosis, demographic composition, and distribution of important AD biomarkers. Assessing the ethnoracial diversity in the cohorts exposed a severe overrepresentation of White/Caucasian individuals compared to other ethnoracial backgrounds. To appraise the availability of modalities in each study, we categorized each modality based on the relative presence of data in each cohort. Another important finding is the limited number of longitudinal follow-up measurements for important AD biomarkers such as CSF amyloid beta. Finally, we made all results explorable through ADataViewer, an interactive web application that can help researchers identify cohort data sets that are suitable for their research.

Achieving data set interoperability through one common data model

Our analysis exposed major challenges that severely impede comparative approaches on AD cohort data. Although there has been work on standardizing data collection, as well as on guidelines defining an AD‐specific data model, we still experience a deficit in interoperability across AD data sets. The investigated cohort data sets neither followed a common naming system for variables nor represented values of the same measurement in an equal manner. On top of that, some studies shared only processed values instead of the underlying raw data. This further impedes interoperability, since differences in applied processing pipelines inevitably introduce systematic biases into the data. One promising approach to increase data set interoperability could be a comprehensive, AD‐specific common data model. Such a data model could support the alignment and mapping of variables by providing easy‐to‐follow guidelines and a dedicated interface for retrospective data harmonization.
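Retrospective harmonization against a common data model can be sketched as a per-cohort mapping from local variable names and value encodings to shared ones. The cohort labels, variable names, and codes below are purely illustrative assumptions, not an actual AD data model:

```python
import pandas as pd

# Illustrative per-cohort mappings from local variable names to a shared vocabulary.
VARIABLE_MAP = {
    "cohort_a": {"MMSCORE": "mmse_total", "PTGENDER": "sex"},
    "cohort_b": {"mmse": "mmse_total", "gender": "sex"},
}

# Value-level recoding, since cohorts encode the same measurement differently.
VALUE_MAP = {"sex": {"1": "male", "2": "female", "M": "male", "F": "female"}}

def harmonize(df: pd.DataFrame, cohort: str) -> pd.DataFrame:
    """Rename cohort-specific columns and recode values to the common model."""
    out = df.rename(columns=VARIABLE_MAP[cohort])
    for col, mapping in VALUE_MAP.items():
        if col in out.columns:
            out[col] = out[col].astype(str).map(mapping)
    return out

a = harmonize(pd.DataFrame({"MMSCORE": [28], "PTGENDER": ["1"]}), "cohort_a")
b = harmonize(pd.DataFrame({"mmse": [25], "gender": ["F"]}), "cohort_b")
```

After harmonization, rows from both cohorts share one schema and can be concatenated for cross-cohort analysis; in practice such mappings must be curated by hand, which is exactly the effort a common data model aims to reduce.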

Data limitations hamper disease modeling

In the context of personalized medicine, training models on predominantly White/Caucasian participants can lead to biased models. It is known that exhibited patterns of biomarker measurements differ across AD patients from distinct ethnoracial groups. Given that only limited data from non‐White participants are available, trained models could fail to learn such ethnoracial‐specific signals, which, in turn, would result in poor performance for individuals of non‐White background. As mentioned previously, the abundance of longitudinal CSF data was limited throughout all acquired data sets. One possible reason explaining participants' reluctance to provide CSF samples, especially repeatedly, is the invasiveness of the sampling procedure. Although cross‐sectional CSF biomarkers can support AD diagnosis, longitudinal measurements are fundamental to understanding disease progression at the biomarker level. Given the low CSF sample sizes currently available, it remains questionable whether longitudinal analyses of these data can generate robust insights on conversions between normal and abnormal values of CSF biomarkers.

Actionable knowledge through data‐driven landscapes

The evident contradictions found between our data‐driven investigation and the metadata‐based approaches (Section 3.4) can be divided into two types. Type 1 describes cases in which we found variables in the data sets that were reported as missing according to metadata resources. From this type of contradiction, we can conclude that approaches relying solely on metadata and literature potentially suffer in accuracy when estimating the real content available in cohort data sets. Contradiction type 2, on the other hand, represents cases in which metadata sources reported a variable to be present, while we were not able to find it in the underlying data. Type 2 contradictions do not lead to the same conclusion as type 1, since it is possible that the respective variables have simply not been shared with us. However, it is arguable how useful correct metadata is if the data it describes are not themselves available. We believe that our presented comparison highlights that, despite their significantly higher demand for time and effort, data‐driven investigations should be preferred when assessing a data landscape.
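The two contradiction types can be made operational as simple set differences between the variables reported by metadata resources and the columns actually present in the shared data (variable names below are illustrative):

```python
# Classify metadata/data contradictions as defined in the text:
#   type 1 = present in the data but reported missing by metadata,
#   type 2 = reported present by metadata but absent from the data.
def contradictions(metadata_vars: set, data_vars: set) -> dict:
    return {
        "type_1": data_vars - metadata_vars,  # data richer than metadata claims
        "type_2": metadata_vars - data_vars,  # metadata promises unavailable data
    }

result = contradictions(
    metadata_vars={"mmse", "csf_abeta", "apoe4"},
    data_vars={"mmse", "csf_abeta", "hippocampus_volume"},
)
```

Running this check per cohort yields exactly the discrepancy counts a data-driven landscape assessment reports, while a metadata-only approach would see neither set difference.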

Future perspectives

The observed differences in demographic characteristics and disease risk factors across studies could severely hamper the comparison and validation of findings across disparate cohorts, since they can significantly influence the patterns and trends exhibited in the data. Until now, only limited insight is available on how much the heterogeneous data landscape limits comparative approaches and cross‐cohort disease modeling on AD data. Further systematic investigations are required to ensure that results generated on AD data sets are robust and reproducible across multiple cohorts. To support such endeavors, we aim to improve the ADataViewer to include more data sets, variable mappings, and the results of systematic data set comparisons in the future.

COMPETING INTERESTS

The authors have nothing to declare.