Jelmer J Zondergeld1, Ron H H Scholten2, Barbara M I Vreede2, Roy S Hessels3, A G Pijl4, Jacobine E Buizer-Voskamp5, Menno Rasch6, Otto A Lange2, Coosje L S Veldkamp5.
Abstract
The YOUth cohort study aims to be a trailblazer for open science. Being a large-scale, longitudinal cohort following children in their development from gestation until early adulthood, YOUth collects a vast amount of data through a variety of research techniques. Data are collected through multiple platforms, including facilities managed by Utrecht University and the University Medical Center Utrecht. In order to facilitate appropriate use of its data by research organizations and researchers, YOUth aims to produce high-quality, FAIR data while safeguarding the privacy of participants. This requires an extensive data infrastructure, set up by collaborative efforts of researchers, data managers, IT departments, and the Utrecht University Library. In the spirit of open science, YOUth will share its experience and expertise in setting up a high-quality research data infrastructure for sensitive cohort data. This paper describes the technical aspects of our data and data infrastructure, and the steps taken throughout the study to produce and safely store FAIR and high-quality data. Finally, we will reflect on the organizational aspects that are conducive to the success of setting up such an enterprise, and we consider the financial challenges posed by individual studies investing in sustainable science.
Keywords: Cohort study; Data infrastructure; FAIR data; Information technology; Open science; Research data management
Year: 2020 PMID: 32906086 PMCID: PMC7481825 DOI: 10.1016/j.dcn.2020.100834
Source DB: PubMed Journal: Dev Cogn Neurosci ISSN: 1878-9293 Impact factor: 6.464
Fig. 1. Overview describing the systems composing the infrastructure and the flow of data through these systems. Note that metadata of all the data are stored in the RDP and that the RDP sources these metadata from the repositories in which the data are held. The RIA→RDP metadata pipeline represents the use of a pre-existing metadata connection between these systems, capable of extracting additional useful metadata computed by the RIA.
Abbreviations: CRC = Child Research Center; UMCU = University Medical Center Utrecht; SLIM = Study Logistics and Information Manager; RO = Research Online; RDP = Research Data Platform; WUR = Wageningen University; RIA = Research Imaging Architecture.
| Principle | Implementation |
|---|---|
| F1. (Meta)data are assigned a globally unique and persistent identifier | Published data sets are assigned a DOI. |
| F2. Data are described with rich metadata (defined by R1 below) | Currently the assigned metadata are DataCite 4.x compliant. However, as part of the grant referred to in Section |
| F3. Metadata clearly and explicitly include the identifier of the data they describe | Metadata entries in the UMCU RDP include the WEPV code. The metadata of published data sets list the DOI of the data. |
| F4. (Meta)data are registered or indexed in a searchable resource | WEPV metadata are stored in the UMCU RDP, which is a searchable resource. Additional metadata are stored in JSON format in Yoda (see Section |
| A1. (Meta)data are retrievable by their identifier using a standardized communications protocol | Data sets are retrievable from Yoda through a WebDAV and an HTTP interface, though only published data sets are retrievable using a DOI. See also F4. |
| A1.1 The protocol is open, free, and universally implementable | WebDAV and HTTP are open, free, and universally implementable. JSON is currently used internally. We are investigating future use of JSON-LD or RDF. The OAI-PMH endpoints will expose XML data. The data access protocol is publicly available. |
| A1.2 The protocol allows for an authentication and authorization procedure, where necessary | Authentication and authorization are necessary for these data. The authorization procedure is described in the publicly available data access protocol. Authentication is handled by Yoda through use of the open WebDAV protocol (and is therefore also available to machines). |
| A2. Metadata are accessible, even when the data are no longer available | For published data sets, a landing page with metadata about the data set remains available when the data set is no longer available. Metadata within the RDP remain available after data removal (e.g. after consent withdrawal). |
| I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. | As a result of the transition to DDI LifeCycle 3 (see F2), the semantic representation of our data will come to be based on a broadly applicable common language (e.g. through the use of harmonized variables and terminologies). |
| I2. (Meta)data use vocabularies that follow FAIR principles | Controlled vocabularies and reference improvements are part of the CID metadata harmonization plans. See Section |
| I3. (Meta)data include qualified references to other (meta)data | |
| R1. (Meta)data are richly described with a plurality of accurate and relevant attributes | Yes. See Section |
| R1.1. (Meta)data are released with a clear and accessible data usage license | Published data sets on Yoda have a mandatory License field. |
| R1.2. (Meta)data are associated with detailed provenance | Detailed provenance information (e.g. laboratory setups or test administration protocols) is available for all data. |
| R1.3. (Meta)data meet domain-relevant community standards | The initiated transition from the domain-agnostic DataCite descriptions to DDI LifeCycle 3 (see F2) incorporates subdiscipline-specific methodology descriptions and introduces a wider use of community-driven vocabularies. Moreover, the more detailed information about, e.g., the constraints and variables used substantially improves the reusability of the data as a whole. |
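
To illustrate rows F2 and R1.1 of the table above, the following is a minimal sketch of how a JSON metadata record might be checked for completeness before publication. It is not the YOUth or Yoda implementation. The six property names follow the mandatory properties of the DataCite 4.x schema; the `license` key, the `missing_fields` helper, and the example record (including its placeholder DOI) are assumptions introduced here for illustration only.

```python
import json

# Mandatory properties of the DataCite 4.x metadata schema.
DATACITE_MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_fields(record, require_license=True):
    """Return the names of required fields that are absent or empty.

    `require_license` mimics a repository-level rule that published data
    sets must carry a license; the key name "license" is hypothetical.
    """
    required = list(DATACITE_MANDATORY)
    if require_license:
        required.append("license")
    return [name for name in required if not record.get(name)]


# Hypothetical record: every value below is a placeholder, not real data.
example_record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example"},
    "creators": [{"name": "Example, Researcher"}],
    "titles": [{"title": "Example cohort data set"}],
    "publisher": "Example University",
    "publicationYear": 2020,
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    "license": "",  # left empty on purpose so the check reports it
}

if __name__ == "__main__":
    print("Missing fields:", missing_fields(example_record))
    print(json.dumps(example_record, indent=2))
```

A check of this kind could run before a data set is submitted for publication, so that a record lacking a license or another mandatory property is flagged early rather than rejected at publication time.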