Maxim Grechkin, Hoifung Poon, Bill Howe.
Abstract
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release: we apply text mining to identify dataset references in published articles, then parse query results from repositories to determine whether the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
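The two steps described in the abstract (finding dataset references in article text, then checking whether the repository still reports each dataset as private) might be sketched as follows. This is a minimal illustration under stated assumptions, not the Wide-Open implementation: the accession pattern covers only GEO series ("GSE" plus digits) and SRA studies ("SRP" plus digits), the marker string "currently private" is an assumed stand-in for whatever the repository response contains, and `fetch_page` is a hypothetical caller-supplied lookup function.

```python
import re

# Assumed accession shapes: GEO series ("GSE" + digits) and SRA studies ("SRP" + digits).
ACCESSION_RE = re.compile(r"\b(?:GSE|SRP)\d{3,}\b")


def extract_accessions(article_text):
    """Return the unique accession-like tokens in order of first appearance."""
    seen = []
    for match in ACCESSION_RE.findall(article_text):
        if match not in seen:
            seen.append(match)
    return seen


def looks_private(page_text, marker="currently private"):
    """Assumed check: does the repository response contain the privacy marker?"""
    return marker.lower() in page_text.lower()


def find_overdue(article_text, fetch_page):
    """List accessions cited in a published article whose page still looks private.

    fetch_page is a hypothetical callable mapping an accession to response text.
    """
    return [
        accession
        for accession in extract_accessions(article_text)
        if looks_private(fetch_page(accession))
    ]


# Usage with a stubbed lookup instead of a real repository query.
article = "Data are deposited in GEO (GSE12345) and SRA (SRP000123); see also GSE12345."
stub_pages = {
    "GSE12345": "Accession GSE12345 is currently private.",
    "SRP000123": "Status: public.",
}
print(find_overdue(article, stub_pages.get))  # -> ['GSE12345']
```

Passing the lookup in as a function keeps the text-mining step testable without network access; a real checker would replace the stub with a query against the repository and would need to handle errors and rate limits.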
Year: 2017 PMID: 28594819 PMCID: PMC5464523 DOI: 10.1371/journal.pbio.2002477
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
Fig 1. Number of samples in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO).
Data underlying the figure are available as S1 Data.
Fig 2. Number of Gene Expression Omnibus (GEO) datasets overdue for release over time, as detected by Wide-Open.
Prior to this submission, we notified GEO of the standing list, which led to a dramatic drop in the number of overdue datasets (magenta portion), with 400 datasets released within the first week. Data underlying the figure are available as S2 Data.
Fig 3. Average delay from submission to release in the Gene Expression Omnibus (GEO).
Data underlying the figure are available as S3 Data.