
Poor data stewardship will hinder global genetic diversity surveillance.

Rachel H Toczydlowski, Libby Liggins, Michelle R Gaither, Tanner J Anderson, Randi L Barton, Justin T Berg, Sofia G Beskid, Beth Davis, Alonso Delgado, Emily Farrell, Maryam Ghoojaei, Nan Himmelsbach, Ann E Holmes, Samantha R Queeno, Thienthanh Trinh, Courtney A Weyand, Gideon S Bradburd, Cynthia Riginos, Robert J Toonen, Eric D Crandall.

Abstract

Genomic data are being produced and archived at a prodigious rate, and current studies could become historical baselines for future global genetic diversity analyses and monitoring programs. However, when we evaluated the potential utility of genomic data from wild and domesticated eukaryote species in the world's largest genomic data repository, we found that most archived genomic datasets (86%) lacked the spatiotemporal metadata necessary for genetic biodiversity surveillance. Labor-intensive scouring of a subset of published papers yielded geospatial coordinates and collection years for only 33% (39% if place names were considered) of these genomic datasets. Streamlined data input processes, updated metadata deposition policies, and enhanced scientific community awareness are urgently needed to preserve these irreplaceable records of today's genetic biodiversity and to plug the growing metadata gap.
Copyright © 2021 the Author(s). Published by PNAS.

Keywords:  biodiversity; conservation; genomic; management; metadata

Year:  2021        PMID: 34404731      PMCID: PMC8403888          DOI: 10.1073/pnas.2107934118

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


Genomic data have never been more available. Researchers can now genotype thousands of loci or sequence whole genomes from virtually any species, and these data are deposited in open-access repositories. Although generated for diverse research purposes, many of these archived genomic data have immense reuse value for measuring genetic diversity, the raw material on which species’ health depends (1, 2). In principle, these data can provide time-stamped records for genetic diversity monitoring (3, 4) (supporting the goals of the United Nations Convention on Biological Diversity [CBD]) (5) and can be used to elucidate the evolutionary and ecological processes that shape biodiversity across the globe (6). Thus, raw genomic data in public repositories are inimitable historical resources, analogous to natural history museums, for the most fundamental level of biodiversity. However, reuse of genomic sequences also minimally requires information about the spatial and temporal context of the sampled organisms (7). Without appropriate archival practices that maintain links between genotypes, place, and time, these growing genomic resources will have limited real-world impact on genetic diversity surveillance.

To evaluate whether genomic data and spatiotemporal metadata are adequately archived, we conducted a structured search of publicly available data in the International Nucleotide Sequence Database Collaboration (INSDC) (8). Most scientific journals require authors to archive their genetic data in a permanent database, and the INSDC is the leading repository of raw genomic data. Data are submitted through one of three INSDC data centers (the DNA Data Bank of Japan, the European Molecular Biology Laboratory’s European Bioinformatics Institute, or the United States’ National Center for Biotechnology Information [NCBI], which includes the original sequence repository GenBank) and are propagated to the other two daily.
We accessed the INSDC records through the NCBI portal. We focused on wild and domesticated species, because these are the most common targets for biodiversity studies. Whereas most studies describing spatial and temporal patterns in genetic diversity include wild populations (6, 9), the CBD prioritizes conserving domesticated species (and their wild relatives) and aspires to detect temporal trends in the genetic diversity of stocks and cultivars (5).

As of October 2020, the Sequence Read Archive (SRA) of the INSDC contained 600 terabytes (1.63 quadrillion base pairs) of genomic data representing over 16,700 unique wild and domesticated eukaryotic species and 327,577 individual organisms (BioSamples, Fig. 1) in 5,043 datasets (BioProjects). Alarmingly, we found that genomic records for only 14% of these individuals included the spatiotemporal metadata required for genetic diversity monitoring. After removing 562 domesticated species, we were left with 233,639 sequenced individuals from putatively wild populations in 3,093 datasets. Individuals in 17% of these datasets had geospatial coordinates, 41% had collection years, and only 14% had both. With manual effort, approximate geospatial context could be inferred for individuals in about half of these 3,093 datasets: 51% had place names (e.g., Lake Mendota) and 66% had country names (Fig. 2). Still, only 38% had some location data and a collection year. Records from domesticated species had similar or more extreme levels of missing metadata compared to those from wild species (Fig. 2).
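The tallies above come down to checking, for each record, whether a location and a collection date are present. The following is a minimal Python sketch of that bookkeeping, not the authors' pipeline. The sample records, the field names `lat_lon` and `collection_date`, and the list of placeholder strings treated as missing are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical BioSample-like records; field names are assumptions
# loosely modeled on common INSDC attribute names.
samples = [
    {"id": "S1", "lat_lon": "43.10 N 89.42 W", "collection_date": "2015-07-01"},
    {"id": "S2", "lat_lon": None, "collection_date": "2012"},
    {"id": "S3", "lat_lon": "18.3 S 147.7 E", "collection_date": None},
    {"id": "S4", "lat_lon": "missing", "collection_date": "not collected"},
]

# Placeholder strings counted as absent (an illustrative list, not an official one).
MISSING = {"", "missing", "not collected", "not applicable", "na", "n/a", "unknown"}


def present(value):
    """Return True if a metadata value holds something other than a placeholder."""
    return value is not None and value.strip().lower() not in MISSING


def status(sample):
    """Classify one record by which spatiotemporal fields it carries."""
    has_place = present(sample.get("lat_lon"))
    has_time = present(sample.get("collection_date"))
    if has_place and has_time:
        return "both"
    if has_place:
        return "place_only"
    if has_time:
        return "time_only"
    return "neither"


counts = Counter(status(s) for s in samples)
shares = {label: 100 * n / len(samples) for label, n in counts.items()}

for label in ("both", "place_only", "time_only", "neither"):
    print(f"{label}: {counts[label]} records ({shares.get(label, 0.0):.0f}%)")
```

Treating placeholder strings as missing matters in practice, because a field that is filled with "not collected" is present in the database but useless for monitoring.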
Fig. 1.

Genomic-level sequence data are being added to the INSDC at an exponential rate across eukaryotic taxa. Colors represent the status of spatiotemporal metadata (latitude/longitude and collection year) for each individual (BioSample, n = 327,577). (Inset) Taxonomic breakdown of BioSamples. Percentages in outer rings sum to corresponding inner-ring totals. Unlabeled inner-ring slices correspond to “other” for the outer-ring taxa.

Fig. 2.

Most genomic-level sequence data in the INSDC lack critical metadata. (A) Status of metadata in the INSDC for wild and domesticated individuals (BioSamples, n = 327,577). Gray hashed box indicates datasets (BioProjects) with more than four wild individuals that lacked latitude/longitude and are addressed in B (n = 493). (B) Status of metadata for records inside hashed box in A after augmenting with metadata from associated publications. Left of black diamonds = present in INSDC.

To explore whether the levels of missing metadata that we report for putatively wild populations were inflated by including nonwild individuals, we tested how accurately our filters identified wild individuals. We randomly subsampled 200 datasets from the 3,093 datasets programmatically identified as “wild” and read their associated scientific publications. Based on this subsample, 70% of the datasets identified as wild by our filters were in fact from wild populations. Spatiotemporal metadata were present for only 13% (bootstrapped 95% CI: 6 to 20%) of these datasets, suggesting the 14% we report for the 3,093 putatively wild datasets is representative. Adding a searchable INSDC field that identifies wild-collected individuals would greatly benefit future genetic diversity syntheses and monitoring efforts.

We further investigated whether missing spatiotemporal metadata could be manually recovered from sources external to the INSDC.
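The bootstrapped interval reported for the validation subsample can be illustrated with a percentile bootstrap for a proportion. This is a generic sketch and not necessarily the procedure the authors used. The counts below (13 of 100 datasets with complete metadata), the number of resamples, and the random seed are hypothetical values chosen only to make the example concrete.

```python
import random


def bootstrap_ci(flags, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the mean of 0/1 flags."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(flags)
    # Resample the datasets with replacement and record each resampled proportion.
    props = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lower = props[int((alpha / 2) * n_boot)]
    upper = props[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical subsample: 1 = dataset has coordinates and collection year.
flags = [1] * 13 + [0] * 87
point = sum(flags) / len(flags)
lo, hi = bootstrap_ci(flags)

print(f"point estimate: {point:.2f}, 95% CI: {lo:.2f} to {hi:.2f}")
```

If the full-database estimate falls inside an interval computed this way from a hand-checked subsample, that supports treating the programmatic estimate as representative.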
We prioritized 848 genomic datasets (representing 94,416 individuals) deemed relevant for conservation monitoring, because they each described more than four putatively wild individuals. We located published scientific papers describing 739 (of 848) datasets. By reading these papers, we determined that 493 datasets (representing 57,396 individuals) reported genetic diversity for wild populations, and we increased metadata coverage for each category (Fig. 2). After these manual efforts, individuals in 63% of these datasets had geospatial coordinates, 40% had collection years, and 33% had both (39% if any type of location data were considered).

In summary, most depositions in the SRA lack sufficient spatiotemporal metadata to enable future reuse and genetic diversity monitoring. Even time-consuming manual efforts to recover these data (∼2,000 human hours here) are only partially successful. Working directly with individual authors is the only remaining strategy to potentially recover these missing metadata (e.g., from personal files or memory), and these missing metadata become increasingly difficult to recover with time since deposition (10). In cases where metadata were never collected or were lost, the genomic data may simply be unusable for future analyses. Assuming a sequencing cost of $50/individual, the lost investment from missing spatiotemporal metadata identified in this effort totals tens of millions of US dollars, and this amount will likely grow exponentially each year (Fig. 1). Moreover, this estimate ignores the cost of fieldwork and sampling and the fact that most past timepoints cannot be resampled.

The genetics community has long championed open-data publication. The INSDC databases originated in the early 1980s (8), and a combination of top-down mandates and recognition of open-data benefits helped ingrain open-data values in the research community.
Only in 2008, however, were the Minimum Information about any Sequence (MIxS) metadata standards formulated (11), which encouraged the community to provide metadata about what (taxonomy), where (georeferences and habitat type), when (collection date), how (sampling and sequencing protocols), and by whom a sample was collected. Initiatives from journals and funders such as the Joint Data Archiving Policy have improved genetic metadata quality (7). But our assessment of the INSDC highlights a gap between which metadata should be collected and archived and which metadata are collected and archived.

Solutions to the metadata gap require understanding of why metadata are missing. In some cases, metadata are not collected, as this contextual information is nonessential for the original study. In most cases, however, the intent of the original study suggests that metadata should exist, but researchers depositing the genomic data have either not followed the FAIR Guiding Principles for data stewardship (data should be findable, accessible, interoperable, and reusable; ref. 12) or have misfiled their metadata within the INSDC fields (Fig. 2). Although the INSDC was not designed to be a metadatabase for genetic diversity studies, and issues of data integrity will always persist in data repositories of this size (13), repositories have a responsibility to help researchers comply with community standards (sensu ref. 14).

Simpler deposition protocols would encourage researchers to link spatiotemporal metadata with sequence data of individuals. The metadata that we recovered, for example, will be accessioned to the Genomics Observatories Metadatabase (GEOME; ref. 15), which provides a user-friendly portal for researchers to upload MIxS-compliant, FAIR metadata (to GEOME), and genomic data (to the INSDC SRA). From GEOME, these metadata can easily be cross-walked into INSDC.
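A pre-deposition check of the "where" and "when" fields could catch misfiled or free-text values before they reach a repository. The sketch below is illustrative and is not part of GEOME or any INSDC tool. It assumes coordinates are written as decimal degrees with hemisphere letters (for example "43.10 N 89.42 W") and that a collection date contains a four-digit year; the function names and record keys are hypothetical.

```python
import re

# Assumed coordinate layout: "<lat> N|S <lon> E|W" in decimal degrees.
LAT_LON = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([EW])$")
# A four-digit year not embedded in a longer run of digits.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def parse_lat_lon(text):
    """Return (latitude, longitude) as signed floats, or None if unparseable."""
    match = LAT_LON.match(text.strip())
    if not match:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


def parse_year(text):
    """Return the first four-digit year found in a date string, or None."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


def check_record(record):
    """List the spatiotemporal fields that are missing or unparseable."""
    problems = []
    if parse_lat_lon(record.get("lat_lon") or "") is None:
        problems.append("lat_lon")
    if parse_year(record.get("collection_date") or "") is None:
        problems.append("collection_date")
    return problems


# A place name in the coordinate field is flagged rather than silently accepted.
print(check_record({"lat_lon": "Lake Mendota", "collection_date": "15-Jul-2015"}))
```

Flagging a place name in a coordinate field, instead of rejecting the whole record, leaves room for a curator to georeference it later.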
Incentivizing changes in researcher behavior may additionally require journals and funders to mandate the deposition of spatiotemporal metadata when they are relevant to reuse of the genomic data, and for data publications to be rewarded appropriately in hiring, promotion, and tenure decisions. We urge journals to join Molecular Ecology in encouraging authors to link spatiotemporal metadata to genetic sequence data generated for wild species and domesticated species where available (16). While the initial success of GenBank relied on maturing community consensus around the value of open data, today’s increasing rate of biodiversity loss (9) makes ongoing spatiotemporal metadata loss an urgent community issue.

We join others in calling for ambitious goals to safeguard genetic diversity (3, 7, 17) and the knowledge structures that will support this goal. Common to proposed genetic diversity monitoring agendas is a shared vision whereby agile pipelines would intake raw genomic data and produce outputs that directly inform conservation policies and decisions. Yet, without appropriate archival genomic data that include the spatiotemporal metadata, crucial information will be unavailable to such pipelines, and researchers will be unable to monitor genetic biodiversity or to reconstruct past baselines.

Our critical evaluation of whether publicly available genomic data could be used for meaningful biodiversity analyses and assessments shows that most records fall short. The identified metadata gap represents an irreplaceable loss of historical details. In 2019 alone, the SRA grew by 50%, with trillions of base pairs of DNA sequence added per day. Meanwhile, the world’s sixth mass extinction event is underway, with 35,000 species now listed as endangered (see the International Union for Conservation of Nature’s Red List of Threatened Species, https://www.iucnredlist.org/en). Now is the time to plug this metadata gap for the most foundational layer of biodiversity.
Our future ability to study, monitor, and conserve all levels of biodiversity depends on it.
  15 in total

1.  Genetics and the conservation of natural populations: allozymes to genomes.

Authors:  Fred W Allendorf
Journal:  Mol Ecol       Date:  2017-01-13       Impact factor: 6.185

2.  On the unreliability of published DNA sequences.

Authors:  Paul D Bridge; Peter J Roberts; Brian M Spooner; Gita Panchal
Journal:  New Phytol       Date:  2003-10       Impact factor: 10.151

3.  Building a global genomics observatory: Using GEOME (the Genomic Observatories Metadatabase) to expedite and improve deposition and retrieval of genetic data and metadata for biodiversity research.

Authors:  Cynthia Riginos; Eric D Crandall; Libby Liggins; Michelle R Gaither; Rodney B Ewing; Christopher Meyer; Kimberly R Andrews; Peter T Euclide; Benjamin M Titus; Nina Overgaard Therkildsen; Antonia Salces-Castellano; Lucy C Stewart; Robert J Toonen; John Deck
Journal:  Mol Ecol Resour       Date:  2020-10-27       Impact factor: 7.090

4.  Set ambitious goals for biodiversity and sustainability.

Authors:  Sandra Díaz; Noelia Zafra-Calvo; Andy Purvis; Peter H Verburg; David Obura; Paul Leadley; Rebecca Chaplin-Kramer; Luc De Meester; Ehsan Dulloo; Berta Martín-López; M Rebecca Shaw; Piero Visconti; Wendy Broadgate; Michael W Bruford; Neil D Burgess; Jeannine Cavender-Bares; Fabrice DeClerck; José María Fernández-Palacios; Lucas A Garibaldi; Samantha L L Hill; Forest Isbell; Colin K Khoury; Cornelia B Krug; Jianguo Liu; Martine Maron; Philip J K McGowan; Henrique M Pereira; Victoria Reyes-García; Juan Rocha; Carlo Rondinini; Lynne Shannon; Yunne-Jai Shin; Paul V R Snelgrove; Eva M Spehn; Bernardo Strassburg; Suneetha M Subramanian; Joshua J Tewksbury; James E M Watson; Amy E Zanne
Journal:  Science       Date:  2020-10-23       Impact factor: 47.728

5.  The Genomic Observatories Metadatabase.

Authors:  Benjamin Sibbett; Loren H Rieseberg; Shawn Narum
Journal:  Mol Ecol Resour       Date:  2020-11       Impact factor: 7.090

6.  The minimum information about a genome sequence (MIGS) specification.

Authors:  Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal:  Nat Biotechnol       Date:  2008-05       Impact factor: 54.908

7.  The International Nucleotide Sequence Database Collaboration.

Authors:  Guy Cochrane; Ilene Karsch-Mizrachi; Toshihisa Takagi
Journal:  Nucleic Acids Res       Date:  2015-12-10       Impact factor: 16.971

8.  The TRUST Principles for digital repositories.

Authors:  Dawei Lin; Jonathan Crabtree; Ingrid Dillo; Robert R Downs; Rorie Edmunds; David Giaretta; Marisa De Giusti; Hervé L'Hours; Wim Hugo; Reyna Jenkyns; Varsha Khodiyar; Maryann E Martone; Mustapha Mokrane; Vivek Navale; Jonathan Petters; Barbara Sierman; Dina V Sokolova; Martina Stockhause; John Westbrook
Journal:  Sci Data       Date:  2020-05-14       Impact factor: 6.444

9.  The FAIR Guiding Principles for scientific data management and stewardship.

Authors:  Mark D Wilkinson; Michel Dumontier; I Jsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal:  Sci Data       Date:  2016-03-15       Impact factor: 6.444

10.  Global determinants of freshwater and marine fish genetic diversity.

Authors:  Stéphanie Manel; Pierre-Edouard Guerin; David Mouillot; Simon Blanchet; Laure Velez; Camille Albouy; Loïc Pellissier
Journal:  Nat Commun       Date:  2020-02-10       Impact factor: 14.919
