OBJECTIVE: Managing registries with continual data collection poses challenges, such as following reproducible research protocols and guaranteeing data accessibility. The University of Kansas (KU) Alzheimer's Disease Center (ADC) maintains one such registry: Curated Clinical Cohort Phenotypes and Observations (C3PO). We created an automated and reproducible process by which investigators have access to C3PO data. MATERIALS AND METHODS: Data was input into Research Electronic Data Capture. Monthly, data part of the Uniform Data Set (UDS), that is data also collected at other ADCs, was uploaded to the National Alzheimer's Coordinating Center (NACC). Quarterly, NACC cleaned, curated, and returned the UDS to the KU Data Management and Statistics (DMS) Core, where it was stored in C3PO with other quarterly curated site-specific data. Investigators seeking to utilize C3PO submitted a research proposal and requested variables via the publicly accessible and searchable data dictionary. The DMS Core used this variable list and an automated SAS program to create a subset of C3PO. RESULTS: C3PO contained 1913 variables stored in 15 datasets. From 2017 to 2018, 38 data requests were completed for several KU departments and other research institutions. Completing data requests became more efficient; C3PO subsets were produced in under 10 seconds. DISCUSSION: The data management strategy outlined above facilitated reproducible research practices, which is fundamental to the future of research as it allows replication and verification to occur. CONCLUSION: We created a transparent, automated, and efficient process of extracting subsets of data from a registry where data was changing daily.
The push for data sharing and increased transparency has been gaining attention from federal organizations, global organizations, and peer-reviewed journals. In 2003, the National Institutes of Health released a statement on sharing research data in which it endorsed the sharing of research data as vital to furthering the goals of medicine: translating scientific research into “knowledge, products and procedures to improve human health.” In 2012, this concept was echoed by the World Health Organization in a report that stressed the need for international collaboration and public/private sector partnerships to continue the progress of finding solutions for diseases. Additionally, in 2016, the New England Journal of Medicine reaffirmed its support for transparency in data sharing and encouraged members to allow other researchers to have access to their data. While the importance of data sharing and transparency has gained recognition, there are certain situations where the traditional concepts of data sharing and transparency do not apply, thus causing challenges with reproducibility and replicability of research. One such situation is the data collection, storage, and distribution of longitudinal, prospective cohort registries.
Longitudinal, prospective cohort studies are becoming increasingly popular for many types of research, including epigenetics, epidemiology, and neurology. Generally, these studies have continual data collection. Additionally, the data collected over time might change with the addition and deletion of certain variables of interest due to evolving research requirements. Thus, these studies are generally considered to have dynamic data, as opposed to a clinical trial or experiment with fixed completion times. These changes can occur not only with the addition of new observations, but also with respect to the collection of new variables or similar variables with evolving formats. Such dynamic data issues are more common for registries than for clinical trials.
Dynamic data, while invaluable for gathering information and completing numerous research projects, creates many challenges with upholding standard reproducible research principles. These challenges include the lack of a finalized dataset, maintaining the integrity of the data as it is collected, and ensuring the accessibility of the data. Moreover, these dynamic datasets can become very large, thus requiring good big data practices. When managing a dynamic database, it is important for these challenges to be overcome in an efficient way so that the dataset can be accessed and utilized for a variety of research projects. While this need has been identified, few methods have been published in the literature that outline an efficient way to overcome these challenges.
The complexities of dynamic data exist in the context of the University of Kansas (KU) Alzheimer’s Disease Center (ADC). The Curated Clinical Cohort Phenotypes and Observations (C3PO) database contains clinical, biological, and neural imaging data captured across 935 participants of the KU ADC Clinical Cohort and collected since August 2011. The KU ADC recognized the importance of investigators completing and publishing innovative research, and therefore set a goal to create a seamless process by which investigators had timely access to a current and accurate version of C3PO. Prompted by the increase in data requests and the variety of data being utilized, this process was required to be flexible enough to curate different types of datasets yet robust enough to decrease data distribution errors.
The primary objective of this article was to describe the process by which the KU ADC enabled investigators to complete reproducible research when using the C3PO dataset. Specifically, this article will identify how the data was curated and stored, how investigators requested data for unique research projects, and how subsets of C3PO were generated for the investigators.
MATERIALS AND METHODS
At the time of this writing, the data in C3PO consisted of 1913 variables stored in 15 datasets on 935 patients (Table 1). There were 18 documents describing how the data was collected or generated. The data within C3PO includes various forms of cognitive testing, clinical assessments, blood analysis, imaging studies, and histological reports in addition to basic demographic information. Notably, C3PO contains data on mitochondrial function and distinct aspects of metabolism, such as physical fitness and body circumference measurements.
Table 1. Description of all datasets within C3PO that are publicly visible via R2D2
Dataset | Number of variables | Description
UDS | 1351 | UDS 2.0 and 3.0
Genotype | 3 | APOE status.
Haplotype | 1 | Mitochondrial haplogroup.
Blood draw | 6 | Provides information about the type of blood product to be stored.
Cybrid | 2 | Indicates the existence of a cybrid line.
Imaging | 13 | Indicates the type of images available.
Freesurfer imaging | 187 | Summary measures from MRI imaging.
DXA | 33 | Body composition.
Cognitive visits | 35 | Neuropsychological measures unique to KUMC.
CDR visits | 23 | Data unique to KUMC exclusive of neuropsychological.
Outline of the fulfillment of KU ADC resource requests from C3PO.
RESULTS
By using the framework outlined above, 15 data requests were completed in 2018 and 23 data requests were completed in 2017. Over a dozen departments within the University of Kansas Medical Center (KUMC), as well as several other research institutions, used this data for research. The requested datasets supported topics that ranged from how exercise prevented amyloid plaque development to the impact of mitochondrial function on Alzheimer’s Disease progression. The average time of data request completion decreased through automation of several steps. Notably, the SAS program took under 10 seconds to compile the requested variables. An example of one such data request using this framework is provided below. Additionally, a visual demonstration of this process is available in the Supplementary materials.
First, the investigator submitted their project proposal to the KU ADC. Once approved, the investigator worked with the clinical expert to clearly define the data in C3PO that would answer the research question by selecting variables in R2D2. On this same day, the data request was initiated and the DMS Core received an email with the list of requested variables. This list of requested variables was manually input into the SAS program. Before the data subset was sent to the investigator, the DMS Core examined the dataset for any protected health information (PHI) and verified its accuracy.
DISCUSSION
The data management strategy outlined above resulted in an efficient, effective, and semiautomated process of providing unique datasets for each project approved by the KU ADC. By having a streamlined procedure for curating C3PO every 3 months, accurate, current, near real-time data was made available to investigators. Importantly, coupled with speed, this framework allowed reproducible research to occur and ensured the protection of sensitive health information.
The novelty of our work is the scale, type, and setting. C3PO is a collection of unique datasets that contains 1913 variables collected on over 935 subjects. C3PO contains not only the prospective and longitudinal clinical data from the UDS, but also data on metabolic function, APOE genetics, autopsy findings, amyloid plaques, and magnetic resonance imaging (MRI) scans. R2D2 is both interactive and available to the public, thus facilitating transparency of the data and upholding best reproducible research practices. The combination of a dataset and data dictionary such as C3PO and R2D2 that is available to all researchers is less common. As there are limited peer-reviewed publications on dynamic data management strategies, the process outlined in this article serves as a solid foundation for how to approach dynamic datasets so that the integrity of clinical research, such as reproducibility, is upheld.
Before the creation of C3PO and R2D2, completing data requests was considerably more time consuming. Often, investigators would use a PDF of the entire data dictionary, which was nearly 150 pages long, to select variables for their project. They would then email the DMS Core this list of variables, and the DMS Core would have to verify the exact variables the investigator wanted. Frequently, multiple emails and phone calls between the DMS Core and the investigator would ensue, thus prolonging the time before investigators received the requested data. For example, there are three different variables in C3PO with distinct diabetic classification criteria. Before R2D2, investigators may not have realized there were different diabetic classifications within C3PO. Therefore, if they did not specify which diabetic classification they needed, the DMS Core would not know which variable to select, thus prompting further clarification and delaying the data request process. By using R2D2 and a clinical expert, investigators were better able to identify the appropriate variables for their project because the data standards and formats were publicly available for all variables. Overall, the data request process was expedited with the addition of C3PO and R2D2.
Beyond allowing investigators to more easily select variables, R2D2’s structure contains essential information necessary for creating the requested dataset. This information includes the dataset within C3PO that contains the requested variable and the secondary key needed for joining variables from distinct datasets within C3PO. As requests may only require data collected at baseline, data from multiple annual visits, data from a single time point, or a combination of these, the secondary key for each dataset within C3PO is not the same. Therefore, while the sequence of steps to create a requested dataset is dependent on the combination of variables, we were able to automate the merging of variables from C3PO by utilizing SAS software.
The algorithm used in the SAS software utilizes a user-defined macro function that systematically and iteratively searches through each dataset within C3PO to select the requested variables. This user-defined macro function allows for situational joining of subsets of datasets that is completed within seconds. While automation of this method required an investment of time to create, it resulted in two major advantages. First, the DMS Core could use the same, efficient SAS program to create the datasets (Supplementary File 1). Previously, the SAS program would have to be manually changed for each data request. This process was complicated and inefficient because the statistician had to either memorize or look up which datasets contained each of the requested variables and the corresponding secondary key. By using the same, automated SAS program, the DMS Core was able to increase their efficiency and minimize errors. For example, it used to take between 3 and 4 hours to complete a data request, but now it takes between 1 and 2 hours. Second, the automated SAS program could identify data management discrepancies, therefore helping with data quality checks. For example, if the data dictionary recorded a variable as being in the incorrect C3PO curated dataset, a note was written to the SAS Log, alerting the DMS Core of this mistake. This is extremely important for ensuring the quality of the data dictionary.
One of the most important aspects of this method is that it ensures the data request process follows reproducible research practices. This was done by both curating the data quarterly and using an automated program. Curating the data quarterly established an effective final dataset, and automating the SAS program allowed for replicability. Moreover, this project advanced reproducible research practices by increasing the access and transparency of the data by making R2D2 publicly available and searchable. This helped investigators take more ownership of the data. For example, investigators were able to select the exact variables they needed to answer their research hypothesis and knew the format of this data. Establishing standard reproducible research practices in studies with dynamic data is vital to the future of research in healthcare because if the results cannot be verified, then the credibility of the results decreases.
This process has two weaknesses. First, examining the accuracy of the created dataset and ensuring the privacy of protected health information cannot be automated. While it would be ideal for this step to be automated, it would be nearly impossible to do so because of the importance of protecting patients’ rights. Second, timely maintenance of technology and the data dictionary is necessary for maintaining the credibility and security of the data. External factors, such as SAS software updates, require a higher level of technical effort to maintain this program.
The main strengths of this method include replicability, transparency, efficiency, and accuracy in the context of dynamic data. All of these qualities uphold reproducible research standards. By employing this automated process, the KU ADC is better able to support investigators and resourcefully utilize the data the KU ADC cohort has provided. This process is a major step forward not only for reproducibility practices, but also for fostering positive collaboration across many disciplines at a large research institution.
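The automated merging step discussed in this section can be illustrated with a short sketch. The KU ADC program is written in SAS (Supplementary File 1) and is not reproduced here; the Python below is only a stand-in for the general idea. The data dictionary layout, the dataset contents, the variable names, and the build_subset function are all hypothetical. The sketch also assumes that a participant-level key is contained in the visit-level key, so that the most detailed key among the requested datasets can serve as the row index for the merged subset.

```python
from collections import defaultdict

def build_subset(requested, dictionary, datasets):
    """Return (rows, log) for the requested variables (illustrative only)."""
    log = []
    selected = defaultdict(list)  # (dataset name, join key) -> variables
    for var in requested:
        if var not in dictionary:
            log.append(f"NOTE: {var} is not in the data dictionary")
            continue
        dataset, key = dictionary[var]
        source_rows = datasets.get(dataset, [])
        columns = source_rows[0].keys() if source_rows else ()
        if var not in columns:
            # The dictionary points at the wrong dataset: log it, keep going.
            log.append(f"NOTE: {var} not found in dataset {dataset}")
            continue
        selected[(dataset, key)].append(var)
    if not selected:
        return [], log

    # Use the most detailed key among the selected datasets as the row index.
    spine_key = max((key for _, key in selected), key=len)
    spine = {
        tuple(row[k] for k in spine_key)
        for (dataset, key) in selected
        if key == spine_key
        for row in datasets[dataset]
    }
    rows = [dict(zip(spine_key, values)) for values in sorted(spine)]

    # Left-join each selected dataset onto the index using its own key.
    for (dataset, key), variables in selected.items():
        lookup = {tuple(r[k] for k in key): r for r in datasets[dataset]}
        for row in rows:
            match = lookup.get(tuple(row[k] for k in key), {})
            for var in variables:
                row[var] = match.get(var)
    return rows, log

# Hypothetical data dictionary: variable -> (dataset, join key columns).
DICTIONARY = {
    "apoe_status": ("genotype", ("participant_id",)),
    "cdr_global": ("cdr_visits", ("participant_id", "visit")),
    "haplogroup": ("genotype", ("participant_id",)),  # wrong on purpose
}
# Hypothetical frozen datasets, each a list of row dictionaries.
DATASETS = {
    "genotype": [
        {"participant_id": 1, "apoe_status": "e3/e4"},
        {"participant_id": 2, "apoe_status": "e3/e3"},
    ],
    "haplotype": [{"participant_id": 1, "haplogroup": "H"}],
    "cdr_visits": [
        {"participant_id": 1, "visit": 1, "cdr_global": 0.0},
        {"participant_id": 1, "visit": 2, "cdr_global": 0.5},
        {"participant_id": 2, "visit": 1, "cdr_global": 0.0},
    ],
}

rows, log = build_subset(
    ["apoe_status", "cdr_global", "haplogroup"], DICTIONARY, DATASETS
)
```

In this toy request, the participant-level genotype value is repeated on every visit row, and the misfiled haplogroup entry produces a log note instead of stopping the run, which mirrors the data dictionary check described above. One limitation of the sketch is that participants who appear only in participant-level datasets are dropped when a visit-level dataset is also requested; a production program would need to decide that behavior explicitly.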
CONCLUSION
We have described a process that allows for reproducible research in a longitudinal clinical cohort with dynamic data. This strategy utilizes quarterly data freezes to ensure that current snapshots of our dynamic data are available to investigators. Additionally, much of this process is automated, thus allowing for the data to be disseminated quickly and efficiently to investigators while ensuring the quality of the data.
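One way to picture how a quarterly freeze supports reproducibility is to map each request to the most recent freeze on or before the request date, so that the same request always resolves to the same snapshot. The following is a minimal sketch under that assumption; the freeze dates and the freeze_for helper are hypothetical and are not taken from the KU ADC workflow.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical quarterly freeze dates, kept in ascending order.
FREEZES = [
    date(2018, 1, 1),
    date(2018, 4, 1),
    date(2018, 7, 1),
    date(2018, 10, 1),
]

def freeze_for(request_date, freezes=FREEZES):
    """Return the latest freeze dated on or before request_date."""
    index = bisect_right(freezes, request_date)
    if index == 0:
        raise ValueError("no freeze exists on or before the request date")
    return freezes[index - 1]

snapshot = freeze_for(date(2018, 5, 17))
```

Recording the returned freeze date alongside the list of requested variables would be enough, in this sketch, to regenerate the same subset later even though data collection continues.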
FUNDING
This research was supported by NIH grant P30 AG035982 through the National Institute on Aging.
AUTHOR CONTRIBUTIONS
SLH, DPM, KM, EDV, JMB, and JDM contributed to the conception and overall design of C3PO and R2D2 in addition to overseeing data collection. KAM, SLH, DPM, and JDM contributed to the creation and execution of the methods used to facilitate and fulfill data requests. KAM, SLH, GH, and JDM drafted the initial manuscript. All authors were involved in critically revising the manuscript and approved the submitted manuscript.
DATA SHARING
Data contained within C3PO is available upon request pending appropriate scientific and safety review (https://redcap.kumc.edu/surveys/?s=wQMXHa). A complete list of variables contained within C3PO is publicly available (http://r2d2.kumc.edu/ADC/R2D2.jsp).
CONFLICT OF INTEREST
None to report.