Stephanie D Jurburg, Maximilian Konzack, Nico Eisenhauer, Anna Heintz-Buschart.
Abstract
As DNA sequencing has become more popular, the public genetic repositories where sequences are archived have experienced explosive growth. These repositories now hold invaluable collections of sequences, e.g., for microbial ecology, but whether these data are reusable has not been evaluated. We assessed the availability and state of 16S rRNA gene amplicon sequences archived in public genetic repositories (SRA, EBI, and DDBJ). We screened 26,927 publications in 17 microbiology journals, identifying 2,015 16S rRNA gene sequencing studies. Of these, 7.2% had not made their data public at the time of analysis. Among a subset of 635 studies sequencing the same gene region, 40.3% contained data which were not available or not reusable, and an additional 25.5% contained faults in data formatting or data labeling, creating obstacles for data reuse. Our study reveals gaps in data availability, identifies major contributors to data loss, and offers suggestions for improving data archiving practices.
Year: 2020 PMID: 32859925 PMCID: PMC7455719 DOI: 10.1038/s42003-020-01204-9
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Fig. 1 Popular locations for data storage.
Data for all studies which contained 16S rRNA amplicon sequencing (a), and the V3–V4 subset (b); n = 2656 and n = 635 studies, respectively. Across the full set of studies, those which contained amplicon sequences but did not deposit them (lighter yellow) were inferred by manually checking 150 randomly selected articles which neither contained INSDC accession numbers nor referred to alternative databases. For the V3–V4 subset, studies which contained the keywords “16S rRNA”, “515”, and “806” were selected. Studies that reported INSDC-compliant accession numbers which did not exist on any INSDC database are shown in lighter blue.
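The keyword selection used for the V3–V4 subset can be illustrated with a short sketch. It assumes that article texts are available as plain strings; the function name `matches_subset_keywords` and the three sample texts are hypothetical and are not taken from the study, whose actual screening procedure is not described in this record.

```python
# Keywords named in the figure caption for selecting the V3-V4 subset.
KEYWORDS = ("16S rRNA", "515", "806")


def matches_subset_keywords(text, keywords=KEYWORDS):
    """Return True only if every keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# Hypothetical article snippets, for illustration only.
articles = {
    "a": "We amplified the 16S rRNA gene with primers 515F and 806R.",
    "b": "Shotgun metagenomes were sequenced on a HiSeq instrument.",
    "c": "The 16S rRNA gene was amplified with primers 341F and 806R.",
}

selected = [name for name, text in articles.items() if matches_subset_keywords(text)]
print(selected)
```

Plain substring matching is deliberately simple: a string such as "1515" would also satisfy the "515" keyword, so a selection of this kind would likely still need manual verification.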
Fig. 2 Trends in community sequencing practices over time.
The number of amplicon sequencing studies in the V3–V4 subset (a). The proportion of these studies which were deposited in a single sequence file, a data deposition error associated with legacy sequence formats (b), significantly decreased over the period studied (evaluated with a Chi-squared test for trend in proportions). The proportion of each sequencing platform used across the studies changed over time (c). In a, the total number of articles for 2019 was estimated from the first two months of data (light gray). In b, the mean proportion for all years is indicated with a gray dashed line.
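A Chi-squared test for trend in proportions (often called the Cochran-Armitage trend test) can be sketched as below. This is an illustration under stated assumptions, not the authors' analysis: the function name `trend_test` is hypothetical, years are scored with equally spaced integers, and the counts are invented for the example and are not the study's data.

```python
import math


def trend_test(events, totals, scores=None):
    """Chi-squared test for a linear trend in proportions (1 degree of freedom).

    events[i] is the number of affected studies in group i, totals[i] is the
    group size, and scores[i] is the group score (defaults to 0, 1, 2, ...).
    Returns (chi_squared, p_value).
    """
    if scores is None:
        scores = list(range(len(totals)))
    n_all = sum(totals)
    p_all = sum(events) / n_all  # pooled proportion under the null hypothesis
    mean_score = sum(n * s for n, s in zip(totals, scores)) / n_all
    # Covariance-like statistic between the scores and the event counts.
    t_stat = sum(x * (s - mean_score) for x, s in zip(events, scores))
    variance = p_all * (1 - p_all) * sum(
        n * (s - mean_score) ** 2 for n, s in zip(totals, scores)
    )
    if variance == 0:
        raise ValueError("test undefined: no variation in outcomes or scores")
    chi_squared = t_stat ** 2 / variance
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_squared / 2))
    return chi_squared, p_value


# Invented counts: single-file deposits out of all studies in three periods.
chi_squared, p_value = trend_test(events=[8, 5, 2], totals=[10, 10, 10])
print(f"chi-squared = {chi_squared:.2f}, p = {p_value:.4f}")
```

The statistic only detects a monotonic, roughly linear change across the ordered groups; the sign of the trend has to be read from the proportions themselves.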
Fig. 3 The fate of microbiome community data.
An assessment of the data location and state of the 635 studies in the V3–V4 subset. Data loss was divided into four categories: loss due to data location, errors in data deposition, errors in data formatting, and errors in data labeling. Data were categorized as ‘reusable’ if no faults in the above four categories were found. Data were categorized as ‘partially usable’ if faults in data formatting or data labeling were likely to create obstacles in data reuse (i.e., if the data were not findable in the database due to mislabeling). Finally, data were categorized as ‘not available’ if they were not publicly available on INSDC databases, or if the datasets were missing data which precluded their reusability.
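The three outcome categories described in this caption can be expressed as a small decision rule. The sketch below is an interpretation, not the authors' code: the `StudyRecord` fields and the `classify` function are hypothetical names, and reducing each fault category to a single boolean flag is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    REUSABLE = "reusable"
    PARTIALLY_USABLE = "partially usable"
    NOT_AVAILABLE = "not available"


@dataclass
class StudyRecord:
    # Hypothetical flags summarizing the assessment of one study.
    public_on_insdc: bool          # data publicly available on an INSDC database
    missing_essential_data: bool   # missing data that precludes reuse
    formatting_fault: bool         # fault in data formatting
    labeling_fault: bool           # fault in data labeling


def classify(record):
    """Apply the caption's rules, checking the most severe outcome first."""
    if not record.public_on_insdc or record.missing_essential_data:
        return Status.NOT_AVAILABLE
    if record.formatting_fault or record.labeling_fault:
        return Status.PARTIALLY_USABLE
    return Status.REUSABLE


example = StudyRecord(
    public_on_insdc=True,
    missing_essential_data=False,
    formatting_fault=False,
    labeling_fault=True,
)
print(classify(example).value)
```

Checking availability before formatting and labeling means that a study with several faults is counted once, under its most severe category.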
Recommendations for the future improvement of data archiving practices.
| Studies affected | Issue | Recommendations |
| --- | --- | --- |
| 31.4% | Data is not readily accessible • Data is not deposited • Data is not deposited to INSDC-affiliated databases • Accession numbers are incorrect • Data is private • Metadata is private | Researchers: • Make deposited data available upon a manuscript’s publication. • Ensure accession numbers are correct in the published article. • Develop community standards on removal of identifying human reads and storage of clean microbiome data. Publishers: • Require that the sequencing data is available upon article submission, and remind authors to make the data publicly available by the time of publication. • Demand that datasets are deposited to the appropriate INSDC databases prior to submission in order to guarantee their long-term availability. Data archives: • Require that users select a date to make data public during the deposition process. |
| 23.6% | Changes in data formatting practices • Data is uploaded in legacy file formats • Single sequence files are uploaded for paired-end data | Researchers: • Ensure that a minimum set of data is provided in order to allow for reproducibility. This includes formally collecting and depositing metadata to include experiment, sample, and sequence information; and recording protocols using modern tools. Data archives: • Allow for the deposition of more diverse sequence file types (i.e., allow for the deposition of sequence metadata files). • Develop new standards which require the reporting of metadata on sequencing and sequence processing. Essential information such as DNA extraction, sequencing, computational processing, and data provenance should be providable via a DOI. • Use a common and precise language regarding ‘best practices’ for data deposition (e.g., the inclusion of primers). • Keep publicly available changelogs of database guidelines, so that users may understand how and why data was deposited in a particular format in the past. |
| 14.6% | Mislabeling • Amplicon sequences not listed as ‘amplicon’ • Single sequence files are uploaded for paired-end data | Researchers: • Become familiar with the terms associated with sequencing and sequence formats for proper data upload. • Interact proactively with database holders (i.e., helpdesk) to ensure that data deposition is done correctly. Publishers: • Demand that the metadata tables be included during article submission for peer review. Data archives: • Recognize that amplicon sequencing is an increasingly interdisciplinary technique, and continue the current trend towards improved documentation and explanations. In particular, users may benefit from more precise guidelines on what constitutes informative metadata for the purposes of archiving (e.g., listing the environment as ‘human’ vs. ‘human gut’, Supplementary Fig.). |
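The recommendation to ensure that accession numbers are correct in the published article could be supported by a simple format check before submission. The sketch below is only an illustration: the regular expressions are an assumption based on the commonly documented prefixes for INSDC project, study, sample, experiment, and run accessions, the function name `find_accessions` is hypothetical, and the accession numbers in the example sentence are made up. A format check of this kind cannot confirm that a record actually exists or is public in a database.

```python
import re

# Assumed patterns for common INSDC-style accession numbers.
ACCESSION_PATTERNS = {
    "project": re.compile(r"\bPRJ[EDN][A-Z]\d+\b"),
    "study": re.compile(r"\b[SED]RP\d{6,}\b"),
    "sample": re.compile(r"\b[SED]RS\d{6,}\b"),
    "experiment": re.compile(r"\b[SED]RX\d{6,}\b"),
    "run": re.compile(r"\b[SED]RR\d{6,}\b"),
}


def find_accessions(text):
    """Return a dict mapping accession type to the matching strings in text."""
    found = {}
    for kind, pattern in ACCESSION_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[kind] = hits
    return found


# Made-up data availability statement, for illustration only.
statement = (
    "Sequences were deposited under BioProject PRJNA123456 "
    "(runs SRR1234567 and SRR1234568)."
)
print(find_accessions(statement))
```

An empty result for a manuscript that reports sequencing data would be a prompt to recheck the data availability statement by hand.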