The importance of adherence to international standards for depositing open data in public repositories.

Diego A Forero^1,2, Walter H Curioso^3, George P Patrinos^4,5,6.

Abstract

There has been considerable global interest in Open Science, which includes open data and methods, in addition to open access publications. It has been proposed that public availability of raw data increases the value and the possibility of confirmation of scientific findings, in addition to the potential of reducing research waste. Availability of raw data in open repositories facilitates the adequate development of meta-analyses and the cumulative evaluation of evidence for specific topics. In this commentary, we discuss key elements of data sharing in open repositories and we invite researchers around the world to deposit their data in them.

Keywords: Data repositories; Data reuse; Open data; Open science

Year: 2021 PMID: 34727971 PMCID: PMC8561348 DOI: 10.1186/s13104-021-05817-z

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Introduction

There is considerable global interest in Open Science, which includes open data and methods, in addition to open access (OA) publications [1, 2]. Several funding agencies in the United States and in Europe have mandates for open data generated in the research projects they support. In addition, an increasing number of scientific journals have policies encouraging or asking authors to provide data in open repositories [3]. In this commentary, we discuss key elements of data sharing in open repositories, from an international and interdisciplinary perspective [4].

Main text

Open research data

It has been proposed that public availability of raw data increases their value and the possibility of confirming scientific findings, improving reproducibility and replicability of results [5-8], in addition to helping to reduce research waste [9]. In this context, the Transparency and Openness Promotion (TOP) guidelines promote data transparency (https://www.cos.io/initiatives/top-guidelines) [7, 8]. It has been highlighted that there are several main types of research data repositories: institutional, disciplinary, multidisciplinary and project-specific [10]. Availability of raw data in open repositories facilitates the adequate development of meta-analyses, particularly individual patient data (IPD) meta-analyses [11], and the cumulative evaluation of evidence for specific topics [12], especially for high-dimensional data [13] (such as results from genomics, transcriptomics or epigenomics). In this context, certain research fields, such as genomics, have developed standards that facilitate and promote deposition of raw data [14]. A recent study showed, in a sample of 531,889 OA journal articles, that a minor fraction of papers included a link to data repositories and that those articles have a higher citation impact [3]. Another recent work analyzed 487 papers describing clinical trials and found that, although many declared that data were available, very few included data in repositories [15]. An analysis of 500 articles from 50 high-impact journals found that only a small fraction deposited their full raw data online [16]. In addition, in a sample of 49 published articles it was found that reluctance to share data was associated with weaker evidence and a higher number of errors in the reporting of statistical results [17]. Ioannidis and coworkers found that raw data unavailability led to a low rate of repeatability of microarray results from published articles [18].
The FAIR Guiding Principles have been proposed for scientific data management [19] and they involve four main categories: Findable (unique and persistent identifiers, in addition to rich metadata), Accessible (retrievable by their identifier), Interoperable (a broadly applicable language for data representation) and Reusable (a clear and accessible usage license) [19]. Metadata, the information containing the details of data organization, collection and preprocessing, are key for finding, using and citing files in open repositories [20]. Recently, Corpas et al. have provided several recommendations to comply with the FAIR principles, such as establishing an adequate consent framework, maximizing machine-readable data and selecting the most findable and accessible data repositories [21]. Broman et al. have proposed several valuable recommendations for the organization of data files, such as being consistent, choosing adequate names for variables, avoiding empty cells, creating data dictionaries and using standard file formats (such as comma-delimited files) [22]. In this context, it has been shown that the use of some commercial file formats, such as .xls files, has led to issues in data storage, such as changing gene symbols to dates [23].
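Some of the file-organization recommendations above can be checked automatically before deposition. The following Python sketch is only an illustration under stated assumptions: the function name check_table, the lowercase-with-underscores naming convention and the example table are hypothetical and are not taken from the cited works. It flags empty cells, column names that break the assumed convention, and values such as "1-Mar" that look like gene symbols converted to dates by spreadsheet software.

```python
import csv
import io
import re

# Values such as "1-Mar" or "2-Sep" often indicate gene symbols that a
# spreadsheet program has silently converted to dates.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)
# Assumed (hypothetical) naming convention: lowercase letters, digits, underscores.
GOOD_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def check_table(text):
    """Return a list of problems found in comma-delimited text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    for name in header:
        if not GOOD_NAME.match(name):
            problems.append(f"column name {name!r} breaks the naming convention")
    # Data lines start at line 2 because line 1 is the header.
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            value = value.strip()
            if value == "":
                problems.append(f"line {line_no}: empty cell in column {name!r}")
            elif DATE_LIKE.match(value):
                problems.append(
                    f"line {line_no}: {value!r} in column {name!r} "
                    "looks like a date-converted gene symbol"
                )
    return problems


# Hypothetical example table with three problems.
example = "gene_symbol,Expression Level\nTP53,2.4\n1-Mar,\n"
for problem in check_table(example):
    print(problem)
```

A check of this kind does not replace a data dictionary, but it can catch common problems before files are uploaded to a repository.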

Open access licenses and ethical aspects

There are several available OA licenses and the ones from Creative Commons (CC; https://creativecommons.org/about/cclicenses/) are frequently used [24]. CC BY is one of the least restrictive and requires attribution, CC BY-SA requires derivative works to be licensed under identical conditions, CC BY-ND does not allow derivative works, CC BY-NC does not allow commercial uses and CC BY-NC-ND allows neither derivative works nor commercial uses [24]. It has been recommended [25] that a CC0 license (a universal public domain dedication; https://creativecommons.org/share-your-work/public-domain/cc0) should be used for data sharing. There are several ethical aspects related to the sharing of data from human subjects, such as de-identification and having appropriate informed consent and approval by the institutional review boards [26-29]. In addition, in certain contexts, the use of controlled-access repositories, in which researchers need to apply to get access to the data, is advisable. In specific cases of highly sensitive information, there is the option of submitting processed data, such as summary statistics [25, 28]. The International Committee of Medical Journal Editors (ICMJE) has required, since 2017, that articles reporting the results of clinical trials include a data sharing statement [30]. There are two major examples of international sharing of patient data that have led to important scientific findings and collaborations [28]: the Alzheimer’s Disease Neuroimaging Initiative (ADNI; adni.loni.usc.edu) has led to more than 2,100 international publications [31] and The Cancer Imaging Archive (TCIA; cancerimagingarchive.net) has facilitated the generation of more than 1,100 international publications [32]. In some regions of the world, there is a need for further training for members of research ethics committees about the multiple advantages of sharing data for the advancement of health sciences research [27, 28].
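The license conditions described above can be summarized as a small lookup table. The Python sketch below is an illustration only: the License class and the licenses_allowing function are hypothetical names, and the table covers just the licenses mentioned in this section, with CC BY-NC-ND denoting the license that allows neither derivative works nor commercial uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    attribution: bool          # reusers must credit the creator
    share_alike: bool          # derivatives must carry the same license
    derivatives_allowed: bool  # adaptations may be shared
    commercial_allowed: bool   # commercial reuse is permitted


# Only the licenses discussed in the text; not the full Creative Commons suite.
LICENSES = {
    "CC0": License(False, False, True, True),
    "CC BY": License(True, False, True, True),
    "CC BY-SA": License(True, True, True, True),
    "CC BY-ND": License(True, False, False, True),
    "CC BY-NC": License(True, False, True, False),
    "CC BY-NC-ND": License(True, False, False, False),
}


def licenses_allowing(derivatives, commercial):
    """List the licenses that permit the requested kinds of reuse."""
    return [
        name
        for name, lic in LICENSES.items()
        if (lic.derivatives_allowed or not derivatives)
        and (lic.commercial_allowed or not commercial)
    ]


# Licenses under which a dataset can be adapted and reused commercially.
print(licenses_allowing(derivatives=True, commercial=True))
```

A table of this kind is a reading aid and not legal advice; the license deeds on the Creative Commons website remain the authoritative source.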

Recommendations for researchers around the globe

In Table 1 we present a selection of major data repositories (some of them are for general use and others are oriented to specific applications or data types), in order to provide readers with options for submitting their raw results [25]. Among them, the databases at the National Center for Biotechnology Information (NCBI) contain several billion records; some of the largest databases from NCBI are the ones for DNA and RNA sequences (more than 429 million records), gene expression profiles (more than 128 million records), single nucleotide polymorphisms (SNPs; more than 720 million records) and protein sequences (more than 874 million records) [33]. Regarding the databases from the European Bioinformatics Institute, the largest resources are the European Nucleotide and Genome-Phenome Archives, the PRoteomics IDEntifications and the ArrayExpress [34]. The Protein Data Bank has more than 140,000 entries [35] and the Image Data Resource stores different types of imaging data [36]. DataMed (datamed.org) is a search engine for data deposited in repositories [37], the Registry of Research Data Repositories (re3data.org) lists available repositories [10] and the European Data Portal (https://data.europa.eu/en) facilitates consolidation and search of open datasets from that region of the world [38]. The Research Data Alliance (RDA) is an international initiative promoting multiple aspects related to open data sharing (https://www.rd-alliance.org) [39].

Table 1

Information about selected major open data repositories

Repository	URL	Type of data	Features
OSF	https://www.osf.io	All types	Individual files must be 5 GB or less
Zenodo	http://www.zenodo.org	All types	50 GB per dataset
Figshare	https://www.figshare.com	All types	File uploads of up to 5 TB in size
Dryad	https://www.datadryad.org	All types	A limit of 300 GB per data publication
NCBI GEO	https://www.ncbi.nlm.nih.gov/geo	Array- and sequence-based data	Encourages submission of MIAME- and MINSEQE-compliant data
ArrayExpress	https://www.ebi.ac.uk/arrayexpress	Array- and sequence-based data	Encourages submission of MIAME- and MINSEQE-compliant data
Image Data Resource	https://www.idr.openmicroscopy.org	Life sciences image data	For file sizes larger than 1000 GB special planning is needed
Protein Data Bank	https://www.rcsb.org	Atomic-level, 3D structure data	It uses the PDBx/mmCIF file format

Researchers can identify the repository best suited to their data type and sharing needs, such as general repositories or platforms for specific types of data. The Registry of Research Data Repositories (re3data.org) provides a comprehensive list.

There is a need for more training about open science and data science [25], particularly in emerging economies, and a larger number of open data repositories is greatly needed in these regions of the world [40, 41]. In this context, the adequate implementation of standards for reporting of raw data for specific fields, such as the MIAME (Minimum Information About a Microarray Experiment) [14], is key in order to provide an adequate organization of files and inclusion of key metadata, with information such as description of the individuals/samples, experimental conditions and analyses [20]. Funding agencies and academic institutions from multiple countries are invited to consider the importance of open data in their policies and incentives [41, 42]. Although it is a common practice in several journals, editors and peer reviewers of even more international publications should enforce the guidelines asking authors of manuscripts to deposit raw data [12] and scientists from around the world are invited to deposit their data in open repositories [20, 25, 43]. These efforts could be particularly catalyzed by initiatives such as microattribution [44, 45], which provides researchers with incentives to openly share their data in the public domain, allowing not only open data sharing but also the possibility of reaching new scientific conclusions that would not be possible if these data were not made publicly available [44].
Such initiatives have already been implemented for data repositories, such as locus-specific databases [44], national/ethnic mutation databases [46], clinical databases and consortia [47] and scientific journals (https://www.nature.com/sdata).
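The metadata elements mentioned above (description of the individuals/samples, experimental conditions and analyses), together with an identifier and a usage license as in the FAIR principles, can be verified with a simple completeness check before a dataset is deposited. The Python sketch below is an assumption-laden illustration: the field names, the function missing_metadata and the example record (including its placeholder identifier) are hypothetical and do not reproduce the official MIAME checklist or any repository's submission schema.

```python
# Hypothetical required fields, chosen for illustration only; this is not
# the official MIAME checklist or a repository schema.
REQUIRED_FIELDS = (
    "identifier",
    "license",
    "sample_description",
    "experimental_conditions",
    "analysis_description",
)


def missing_metadata(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Hypothetical record: one field is blank and one is absent.
record = {
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "license": "CC0",
    "sample_description": "12 blood samples from adult volunteers",
    "experimental_conditions": "",
}
print(missing_metadata(record))  # ['experimental_conditions', 'analysis_description']
```

In practice, the list of required fields would be taken from the reporting standard of the relevant field and from the submission guidelines of the chosen repository.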

Outlook

In times of COVID-19, it is critical to have good quality data (including aspects of accessibility, timeliness and support for users, among others [48]) for proper decision-making. We need data of high quality that are reliable and trustworthy [49]. At the global level, initiatives like the Research Data Alliance COVID-19 Working Group involved 440 volunteer data experts to address several issues with data and software sharing to improve the response to the pandemic [49]. They provided recommendations and guidelines on data sharing [49]. However, several challenges have to be solved, particularly in emerging economies, such as: legal and policy issues, scarcity of coordination between research groups, lack of a culture of data sharing and ethical/privacy considerations, insufficiency of proper infrastructure (including high-speed Internet connectivity), deficiency in interoperability of platforms, shortage of data managers and data scientists and a scarcity of open data repositories to facilitate data sharing [50]. Recently, an examination of open government data portals for 60 countries found that the USA, the Czech Republic and Canada have the largest numbers of available datasets (more than 291,000, 136,000 and 85,000, respectively) [48]. In some cases, governments do not see the value of implementing open data repositories, even though they are an excellent means of promoting transparency [48] and accountability, and even a strategy to deal with corruption. We all play a role in this pandemic, and we need more collaboration between private and public agencies, interdisciplinary approaches, universities, non-governmental organizations, and civil society to promote an efficient use of open data repositories (as has been demonstrated recently in the pandemic [51]). In addition, investment in health information systems, interoperability and incentives are key components. Governments should also monitor and evaluate the impact of sharing data in repositories.
Finally, there is an important need to strengthen the capacities of biomedical personnel (particularly in emerging economies), in topics such as: data science, open data repositories, data intelligence, data protection regulations with multidisciplinary teams and collaboration between key stakeholders. As a very high number of publications about Open Science are written by authors from the Global North [8], more international articles about Open Data from the Global South are needed [1, 4, 52].

47 in total

1. Systematic documentation and analysis of human genetic variation in hemoglobinopathies using the microattribution approach.

Authors: Belinda Giardine; Joseph Borg; Douglas R Higgs; Kenneth R Peterson; Sjaak Philipsen; Donna Maglott; Belinda K Singleton; David J Anstee; A Nazli Basak; Barnaby Clark; Flavia C Costa; Paula Faustino; Halyna Fedosyuk; Alex E Felice; Alain Francina; Renzo Galanello; Monica V E Gallivan; Marianthi Georgitsi; Richard J Gibbons; Piero C Giordano; Cornelis L Harteveld; James D Hoyer; Martin Jarvis; Philippe Joly; Emmanuel Kanavakis; Panagoula Kollia; Stephan Menzel; Webb Miller; Kamran Moradkhani; John Old; Adamantia Papachatzopoulou; Manoussos N Papadakis; Petros Papadopoulos; Sonja Pavlovic; Lucia Perseu; Milena Radmilovic; Cathy Riemer; Stefania Satta; Iris Schrijver; Maja Stojiljkovic; Swee Lay Thein; Jan Traeger-Synodinos; Ray Tully; Takahito Wada; John S Waye; Claudia Wiemann; Branka Zukic; David H K Chui; Henri Wajcman; Ross C Hardison; George P Patrinos
Journal: Nat Genet Date: 2011-03-20 Impact factor: 38.330

2. Sharing biological data: why, when, and how.

Authors: Samantha L Wilson; Gregory P Way; Wout Bittremieux; Jean-Paul Armache; Melissa A Haendel; Michael M Hoffman
Journal: FEBS Lett Date: 2021-04 Impact factor: 4.124

3. Database resources of the National Center for Biotechnology Information.

Authors: Eric W Sayers; Jeffrey Beck; Evan E Bolton; Devon Bourexis; James R Brister; Kathi Canese; Donald C Comeau; Kathryn Funk; Sunghwan Kim; William Klimke; Aron Marchler-Bauer; Melissa Landrum; Stacy Lathrop; Zhiyong Lu; Thomas L Madden; Nuala O'Leary; Lon Phan; Sanjida H Rangwala; Valerie A Schneider; Yuri Skripchenko; Jiyao Wang; Jian Ye; Barton W Trawick; Kim D Pruitt; Stephen T Sherry
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

4. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository.

Authors: Kenneth Clark; Bruce Vendt; Kirk Smith; John Freymann; Justin Kirby; Paul Koppel; Stephen Moore; Stanley Phillips; David Maffitt; Michael Pringle; Lawrence Tarbox; Fred Prior
Journal: J Digit Imaging Date: 2013-12 Impact factor: 4.056

5. The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles.

Authors: Heather Piwowar; Jason Priem; Vincent Larivière; Juan Pablo Alperin; Lisa Matthias; Bree Norlander; Ashley Farley; Jevin West; Stefanie Haustein
Journal: PeerJ Date: 2018-02-13 Impact factor: 2.984

6. Collaboration in times of COVID-19: the urgent need for open-data sharing in Latin America.

Authors: Walter H Curioso; Gabriel Carrasco-Escobar
Journal: BMJ Health Care Inform Date: 2020-07

7. Open science challenges, benefits and tips in early career and beyond.

Authors: Christopher Allen; David M A Mehler
Journal: PLoS Biol Date: 2019-05-01 Impact factor: 8.029

8. The European Bioinformatics Institute in 2020: building a global infrastructure of interconnected data resources for the life sciences.

Authors: Charles E Cook; Oana Stroe; Guy Cochrane; Ewan Birney; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

9. The Image Data Resource: A Bioimage Data Integration and Publication Platform.

Authors: Eleanor Williams; Josh Moore; Simon W Li; Gabriella Rustici; Aleksandra Tarkowska; Anatole Chessel; Simone Leo; Bálint Antal; Richard K Ferguson; Ugis Sarkans; Alvis Brazma; Rafael E Carazo Salas; Jason R Swedlow
Journal: Nat Methods Date: 2017-06-19 Impact factor: 28.547

10. Smarter Open Government Data for Society 5.0: Are Your Open Data Smart Enough?

Authors: Anastasija Nikiforova
Journal: Sensors (Basel) Date: 2021-07-31 Impact factor: 3.576
