David A Yarmosh, Juan G Lopera, Nikhita P Puthuveetil, Patrick Ford Combs, Amy L Reese, Corina Tabron, Amanda E Pierola, James Duncan, Samuel R Greenfield, Robert Marlow, Stephen King, Marco A Riojas, John Bagnoli, Briana Benton, Jonathan L Jacobs.
Abstract
The availability of public genomics data has become essential for modern life sciences research, yet the quality, traceability, and curation of these data have significant impacts on a broad range of microbial genomics research. While microbial genome databases such as NCBI's RefSeq database leverage the scalability of crowd sourcing for growth, genomics data provenance and authenticity of the source materials used to produce data are not strict requirements. Here, we describe the de novo assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection (ATCC), each with full genomics data provenance relating to bioinformatics methods, quality control, and passage history. Comparative genomics analysis of ATCC standard reference genomes (ASRGs) revealed significant issues with regard to NCBI's RefSeq bacterial genome assemblies related to completeness, mutations, structure, strain metadata, and gaps in traceability to the original biological source materials. Nearly half of RefSeq assemblies lack details on sample source information, sequencing technology, or bioinformatics methods. Deep curation of these records is not within the scope of NCBI's core mission in supporting open science, which aims to collect sequence records that are submitted by the public. Nonetheless, we propose that gaps in metadata accuracy and data provenance represent an "elephant in the room" for microbial genomics research. Effectively addressing these issues will require raising the level of accountability for data depositors and acknowledging the need for higher expectations of quality among the researchers whose research depends on accurate and attributable reference genome data.
IMPORTANCE The traceability of microbial genomics data to authenticated physical biological materials is not a requirement for depositing these data into public genome databases. This creates significant risks for the reliability and data provenance of these important genomics research resources, the impact of which is not well understood. We sought to investigate this by carrying out a comparative genomics study of 1,113 ATCC standard reference genomes (ASRGs) produced by ATCC from authenticated and traceable materials using the latest sequencing technologies. We found widespread discrepancies in genome assembly quality, genetic variability, and the quality and completeness of the associated metadata among hundreds of reference genomes for ATCC strains found in NCBI's RefSeq database. We present a comparative analysis of de novo-assembled ASRGs, their respective metadata, and variant analysis using RefSeq genomes as a reference. Although assembly quality in RefSeq has generally improved over time, we found that significant quality issues remain, especially as related to genomic data and metadata provenance. Our work highlights the importance of data authentication and provenance for the microbial genomics community, and underscores the risks of ignoring this issue in the future.
Keywords: DNA sequencing; bioinformatics; comparative studies; culture collection; data provenance; genome analysis; genome authentication; genomes; genomics; microbial genomics
Year: 2022 PMID: 35491842 PMCID: PMC9241530 DOI: 10.1128/msphere.00077-22
Source DB: PubMed Journal: mSphere ISSN: 2379-5042 Impact factor: 5.029
FIG 1. Pipeline for end-to-end genomic data provenance. Source materials were obtained directly from the ATCC biorepository and tracked through to the final assembly and genome annotation. Upfront culture conditions varied depending on the species cultured, but downstream process steps were performed using standardized protocols for DNA extraction, library prep, sequencing, and bioinformatics. Each pipeline is hosted on One Codex’s cloud infrastructure.
FIG 2. Sequencing and quality metrics for 1,113 bacterial genome assemblies. (A) Illumina versus ONT reads for ASRGs before down-sampling; (B) N50 metrics versus genome size; (C) N50 normalized by genome size versus CheckM genome completion estimates; (D) diversity of GC content for all 1,113 ASRG assemblies.
FIG 3. Comparative metrics for 1,113 ASRGs versus RefSeq Assemblies. (A) Intersection of ASRGs versus RefSeq for strains labeled as being from ATCC. In parentheses are the total numbers of RefSeq assemblies, allowing for strain redundancy. (B) N50 variability of RefSeq versus ASRGs by sequencing technology. Note that the scale is 1E6. (C) Differences in contig counts for ASRG versus RefSeq assemblies. Positive values indicate that the RefSeq assembly had more contigs. (D) Ratios of ASRG N50 values (y axis) to RefSeq N50 values (“public,” x axis). Density along the diagonal indicates that many assemblies are similar, while density along the y axis indicates ASRGs with higher N50 values. (E) GC content for ASRGs (y axis) versus RefSeq (x axis). Nearly all assemblies have less than 0.1% difference in GC content. (F) Pairwise GC content differences between ASRGs and comparable RefSeq assemblies for the same strain.
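The assembly metrics named in the figure captions (N50, N50 normalized by genome size, and GC content) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `n50` and `gc_percent` and the toy contigs are hypothetical, and N50 is computed under its common definition (the length of the shortest contig in the set of longest contigs that together cover at least half of the assembly).

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half the assembly.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def gc_percent(sequences):
    """Return GC content (%) over all unambiguous A/C/G/T bases."""
    gc = 0
    acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


# Toy contigs (illustrative only, not real data): lengths 80, 60, 20, 10.
contigs = ["GC" * 40, "AT" * 30, "ACGT" * 5, "ACGTA" * 2]
lengths = [len(c) for c in contigs]

assembly_n50 = n50(lengths)
normalized_n50 = assembly_n50 / sum(lengths)  # N50 as a fraction of assembly size
assembly_gc = gc_percent(contigs)

print(f"N50 = {assembly_n50}")
print(f"N50 / genome size = {normalized_n50:.3f}")
print(f"GC content = {assembly_gc:.2f}%")
```

Dividing N50 by total assembly length, as in FIG 2C, makes contiguity comparable across genomes of different sizes; a value near 1 means that a single contig covers most of the genome.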