The Genomic Observatories Metadatabase (GeOMe): A new repository for field and sampling event metadata associated with genetic samples.

John Deck¹, Michelle R Gaither², Rodney Ewing³, Christopher E Bird²,⁴, Neil Davies⁵,⁶, Christopher Meyer⁷, Cynthia Riginos⁸, Robert J Toonen², Eric D Crandall⁹.

Abstract

The Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/) is an open access repository for geographic and ecological metadata associated with biosamples and genetic data. Whereas public databases have served as vital repositories for nucleotide sequences, they do not accession all the metadata required for ecological or evolutionary analyses. GeOMe fills this need, providing a user-friendly, web-based interface for both data contributors and data recipients. The interface allows data contributors to create a customized yet standard-compliant spreadsheet that captures the temporal and geospatial context of each biosample. These metadata are then validated and permanently linked to archived genetic data stored in the National Center for Biotechnology Information's (NCBI's) Sequence Read Archive (SRA) via unique persistent identifiers. By linking ecologically and evolutionarily relevant metadata with publicly archived sequence data in a structured manner, GeOMe sets a gold standard for data management in biodiversity science.

Year: 2017 PMID: 28771471 PMCID: PMC5542426 DOI: 10.1371/journal.pbio.2002925

Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029

The missing metadata

Documenting patterns of global biodiversity and understanding how that diversity is generated and maintained are important steps towards mitigating the effects of anthropogenic stressors [1-3], whether local or global. Genetic data are key to this effort as these data can be used to: (a) identify cryptic diversity, (b) define population structure and associated management units, (c) identify hot spots of genetic diversity for the conservation of adaptive potential, (d) study the mechanisms driving patterns of biodiversity to identify regions of high evolutionary potential [4,5], and (e) monitor the flux of both intra- and interspecific genetic diversity at a particular site or within a particular region [6]. Whereas there have been several coordinated efforts to document patterns of species diversity (e.g., Global Biodiversity Information Facility [GBIF, http://www.gbif.org/; see Table 1 for acronym definitions], Ocean Biogeographic Information System [http://www.iobis.org/]), there have been fewer attempts to document and archive global patterns of genetic diversity. Notable efforts in this direction, however, include the Earth Microbiome Project [7,8] and Ocean Sampling Day [9], focusing on microbes, the Genomic Observatories Network (GO Network) of research sites focusing on entire ecosystems [10,11], and analyses of data archived in public repositories [12].

Table 1

Acronym definitions.

Category	Acronym	Name
Databases	Dryad	Dryad Digital Repository
	GBIF	Global Biodiversity Information Facility
	GeOMe	Genomic Observatories Metadatabase
	SRA	NCBI's Sequence Read Archive
Organizations	EMBL-EBI	European Bioinformatics Institute
	EMP	Earth Microbiome Project
	GO Network	Genomic Observatories Network
	GSC	Genomic Standards Consortium
	NCBI	US National Center for Biotechnology Information
	NSF	US National Science Foundation
	RCN	Research Coordination Network
	TDWG	Biodiversity Information Standards Organization, aka Taxonomic Databases Working Group
Standards	MIxS	GSC’s Minimum Information about any (x) marker Sequence
	RDF	Resource Description Framework
	DwC	Darwin Core, TDWG’s body of standards for sharing information about biological diversity
File formats	FASTA	Fast Alignment Search Tool-All
	FASTQ	Fast Alignment Search Tool-Quality
Tools	EZID	Tool for creating and managing globally-unique, long-term identifiers for data
	FIMS	Biocode Field Information Management System

While granting agencies and publishers enforce data accessibility and open access requirements for genetic data, they do not always require standardized metadata [13-15]. The public genetic repositories, such as NCBI and the European Bioinformatics Institute (EMBL-EBI), were established to store large volumes of sequence data. With vast capacity for storage and curation of genetic data, their role as repositories for the growing volume of genetic data is crucial; however, NCBI, for example, encourages but does not require the standardized metadata needed for ecological- or evolutionary-level analyses. Yet standards do exist for such metadata, notably thanks to the efforts of the Genomic Standards Consortium (GSC) [16] and the Biodiversity Information Standards Organization (known as “TDWG,” http://www.tdwg.org/). The GSC’s Minimum Information about any (x) marker Sequence (MIxS) standard [17] specifies a set of metadata standards for genetic data. Likewise, TDWG’s Darwin Core is a body of standards for describing and sharing biodiversity information [18]. However, neither NCBI nor EMBL-EBI currently enforces these standards or offers a portal for searching MIxS-compliant data. The problem is not only with the genetic repositories. The Dryad Digital Repository is an important resource that links data to their associated scientific publications and makes those data citable, yet Dryad does not enforce set standards or metadata requirements. New databases and repositories that accommodate specific disciplines and subfields are coming online, e.g., http://reefgenomics.org/ [19], but there remains no central cross-disciplinary repository that enforces MIxS standards for sequence data and requires submission of the associated metadata describing the ecological and geographic context of source tissues. This “metadata gap” means that vital information about sampling events, such as sampling location, date, habitat, and organism life history, is rarely reported. Instead, most of this information is left unpublished, greatly diminishing the potential value (reuse) of the data [13,14,20].

Filling the metadata gap: GeOMe

To fill the metadata gap for genetic sequence data, we have developed a web-based database and infrastructure to aid collaboration and the cross-dissemination of published genetic data (http://geome-db.org/). GeOMe can be easily expanded as necessary to accommodate an increasing diversity of data from various research communities. Early development began as part of the Moorea Biocode Project (http://biocode.berkeley.edu/, Moore Foundation) and continued under the National Science Foundation (NSF) Biological Science Collections Tracker project (http://biscicol.blogspot.com/). Development then proceeded under an NSF Research Coordination Network (RCN) grant [16], which led to the establishment of the GO Network [10,11] as a joint initiative of GSC and the Group on Earth Observations Biodiversity Observation Network [21]. The resulting informatics stack (Biocode Commons) reached its current level of development under the auspices of another NSF RCN (the Diversity of the Indo-Pacific Network, http://diversityindopacific.net/) and is now being expanded for the broader scientific community as GeOMe. GeOMe’s suite of tools gives investigators a platform to publish standardized metadata that capture the temporal, environmental, geospatial, and even scholarly context for each sample and its derivative genetic data. GeOMe’s user-friendly, web-based interface allows users, from student and single investigator–driven projects to large scientific consortia, to customize metadata templates using the Biocode Field Information Management System (FIMS) [22]. Users select from a set of fields constructed from standard Darwin Core terms (http://rs.tdwg.org/dwc/) to create a metadata template that best reflects their needs and can be reused across multiple projects within or between labs (Fig 1). Data field options include hypotheses about the taxon (if an individual organism) or taxa in the sample (e.g., bacteria) and information on sampling habitat, life history (if an individual organism), details of sampling location and time, and publications deriving from the data. GeOMe provides a set of customizable project-level metadata validation rules, which ensure that metadata are compliant with both Darwin Core and MIxS standards (i.e., each sample has a unique identifier and required fields are provided). Thus, research communities can easily design their own templates and validation rules to describe, for example, an environmental sample used in metagenomics, tissues associated with transcriptomics, or an individual organism’s genomic sequence. Once the metadata template has been created, no internet connection is required for template editing until the data are uploaded, and therefore the system can be used in remote locations and with any personal computer that employs spreadsheet software (Microsoft Excel and comma-separated value [CSV] formats are supported).
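The two project-level rules named above (each sample has a unique identifier, and required fields are provided) can be illustrated with a minimal Python sketch. This is not GeOMe's implementation: the function name `validate_rows`, the choice of required fields, and the sample data are assumptions made for illustration, although the column names themselves are standard Darwin Core terms.

```python
import csv
import io
from collections import Counter

# Hypothetical project rules (an assumption for this sketch, not GeOMe's
# actual configuration). The column names are Darwin Core terms.
REQUIRED_FIELDS = ["materialSampleID", "decimalLatitude", "decimalLongitude", "eventDate"]
ID_FIELD = "materialSampleID"


def validate_rows(rows):
    """Return a list of readable error messages; an empty list means valid."""
    errors = []
    # Rule 1: every required field must be filled in for every row.
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                errors.append(f"row {line_no}: missing required field '{field}'")
    # Rule 2: sample identifiers must be unique within the sheet.
    counts = Counter((row.get(ID_FIELD) or "").strip() for row in rows)
    for sample_id, n in counts.items():
        if sample_id and n > 1:
            errors.append(
                f"identifier '{sample_id}' appears {n} times; identifiers must be unique"
            )
    return errors


# Illustrative spreadsheet content exported as CSV (made-up values).
SAMPLE_CSV = """materialSampleID,decimalLatitude,decimalLongitude,eventDate
S1,-17.5,-149.8,2015-06-01
S2,,-149.8,2015-06-02
S1,-17.6,-149.9,2015-06-03
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
errors = validate_rows(rows)
for message in errors:
    print(message)
```

With the sample sheet shown, the sketch reports one missing latitude (row 3) and one duplicated identifier (`S1`). Collecting every problem before reporting, rather than stopping at the first, lets a contributor fix a spreadsheet in a single pass.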

Fig 1

The Genomic Observatories Metadatabase (GeOMe) workflow.

Steps in blue are those conducted within the Field Information Management System (FIMS) of GeOMe while those in white are independent of GeOMe.

The FIMS architecture (https://github.com/biocodellc) draws on community vocabularies (Darwin Core and MIxS), with terms stored internally as Uniform Resource Identifiers (URIs), as specified by the Resource Description Framework (RDF) model. Most user-supplied data are stored as attributes of a core “sample” and are joined to either Sanger-based sequence data (including the marker name and actual sequence) or high throughput sequence data (storing metadata associated with sequence data stored on NCBI’s SRA). RDF-based attributes and class names for samples and sequences are then indexed in a document-store database (ElasticSearch, http://www.elastic.co/) for fast retrieval.

To submit data to GeOMe (Fig 1), contributors upload a tab-delimited text file together with a Fast Alignment Search Tool-All (FASTA) file (for Sanger sequence data) or a list of Fast Alignment Search Tool-Quality (FASTQ) file names (for high throughput sequence data, in which each FASTQ file contains data from an individual sample). GeOMe then validates the dataset, checking that the minimum required fields are complete (following project-specific rules) and that sequence identifiers match metadata identifiers. When rules are violated, an informative and easy-to-interpret error message appears, prompting the user to fix the issue before proceeding. The contributor is also presented with a map of sampling localities to allow them to verify the geospatial information. Once validated, GeOMe assigns persistent, universally unique identifiers to each sample (EZID: California Digital Library; http://ezid.cdlib.org/), which are used for linking samples between GeOMe, NCBI, and other repositories. Sanger sequence data are stored as a text field within the database.
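The check that sequence identifiers match metadata identifiers can be sketched in Python as follows. The function names (`fasta_ids`, `compare_identifiers`) and the convention that the first whitespace-delimited token of each FASTA header is the sample identifier are assumptions made for illustration; the source does not specify how GeOMe parses headers.

```python
def fasta_ids(fasta_text):
    """Collect the first whitespace-delimited token of each FASTA header line.

    Treating that token as the sample identifier is an assumption of this sketch.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def compare_identifiers(metadata_ids, sequence_ids):
    """Report identifiers present on one side but not on the other."""
    metadata_set, sequence_set = set(metadata_ids), set(sequence_ids)
    return {
        "sequences_without_metadata": sorted(sequence_set - metadata_set),
        "metadata_without_sequences": sorted(metadata_set - sequence_set),
    }


# Illustrative input: two sequences, one of which has no metadata row.
FASTA = """>S1 CO1
ACGTACGT
>S3 CO1
TTGACCAA
"""

report = compare_identifiers(["S1", "S2"], fasta_ids(FASTA))
print(report)
```

Here `S3` has a sequence but no metadata row, and `S2` has metadata but no sequence. The sketch reports both directions; whether a sample without a sequence counts as an error would depend on project-specific rules.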
For high throughput sequence data, GeOMe provides the data contributor with a completed batch metadata file for NCBI’s SRA and an SRA BioSample file to facilitate submission of the data to the NCBI SRA. Once the data are uploaded to the SRA, GeOMe harvests the NCBI accession numbers, thereby creating a direct link between the genetic data, the sample EZID, and associated metadata. To maximize open access, metadata are available under a Creative Commons Zero license (CC0) and are automatically pushed to GBIF using a dedicated Integrated Publishing Toolkit (IPT, http://www.gbif.org/ipt) installation [23]. Finally, users can choose to embargo their uploaded datasets from public view for a period of up to 2 years from the date of submission. While we encourage all users to make their data immediately public and CC0 on upload, we recognize that GeOMe is useful in preparing and processing research outputs and, consequently, data may not be ready for public release.

GeOMe is designed for flexibility and persistence, using representational state transfer (REST) web services for communication between the database and the interface while also enabling potential third party applications to interact with the services. GeOMe’s web interface enables flexible searches based on any field and/or a geospatial bounding box (Fig 2). The GeOMe database may also be queried with a dedicated R package (geomedb; https://github.com/DIPnet/fimsR-access). GeOMe has also been designed so that it can be used in conjunction with the Biocode Laboratory Information Management System (LIMS; http://software.mooreabiocode.org) for the Geneious software platform (Biomatters, Incorporated). Sanger sequence data are available for download in FASTA format, while high throughput sequence data are provided as a list of SRA accession numbers. Associated metadata can then be downloaded in CSV and keyhole markup language (KML) formats.
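A search that combines a field filter with a geospatial bounding box can be sketched in Python as below. This runs over an in-memory list rather than GeOMe's REST services or its ElasticSearch index, and the `BoundingBox` class, the `search` function, and the sample records are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Latitude/longitude box in decimal degrees (hypothetical helper)."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        if not (self.south <= lat <= self.north):
            return False
        if self.west <= self.east:
            return self.west <= lon <= self.east
        # west > east means the box crosses the antimeridian (180 degrees).
        return lon >= self.west or lon <= self.east


def search(records, box, **field_filters):
    """Return records inside the box whose fields equal the given values."""
    hits = []
    for record in records:
        if not box.contains(record["decimalLatitude"], record["decimalLongitude"]):
            continue
        if all(record.get(name) == value for name, value in field_filters.items()):
            hits.append(record)
    return hits


# Illustrative records with made-up coordinates.
RECORDS = [
    {"materialSampleID": "S1", "genus": "Acanthaster",
     "decimalLatitude": -17.5, "decimalLongitude": -149.8},
    {"materialSampleID": "S2", "genus": "Acanthaster",
     "decimalLatitude": 21.3, "decimalLongitude": -157.8},
    {"materialSampleID": "S3", "genus": "Linckia",
     "decimalLatitude": -17.6, "decimalLongitude": -149.9},
]

box = BoundingBox(south=-20, west=-155, north=-15, east=-145)
hits = search(RECORDS, box, genus="Acanthaster")
print([record["materialSampleID"] for record in hits])
```

Only `S1` matches both the genus filter and the box. The antimeridian branch is relevant for Indo-Pacific sampling, where a box of interest may span longitude 180°.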
Already, the database contains metadata for >35,000 Sanger sequences across 233 species supplied from >50 participating laboratories. It has recently begun accepting metadata for high throughput FASTQ datasets. By using the FIMS architecture for metadata but continuing to store genetic sequence data at NCBI, we are helping to ensure long-term persistence of links between sequence data and their associated metadata while keeping the data searchable with NCBI’s Basic Local Alignment Search Tool (BLAST). We believe that this flexibility enables maximum integration with similar regional or discipline-specific data archival initiatives.

Fig 2

Screen shot of the Genomic Observatories Metadatabase (GeOMe) query system for Acanthaster planci, the crown of thorns sea star.

Each number indicates the number of specimens in the database from that location. When a group of specimens is selected, distinct samples are visible as a spiral radiating from the chosen location, and individual records report summary information about each sample.

Conclusion

A major challenge for biodiversity genomics research is the need to carry out physical sampling in the field (nucleotide sequences cannot be obtained remotely) and then to link biologically and ecologically important metadata with downstream data products, notably, published genetic sequences. No existing federated database provides this functionality. Yet, maintaining linkages among these data types is vital for data integration and analysis. Publicly archiving these metadata is essential to ensure scientific reproducibility and synthesis as well as to maximize potential reuse of sequence data as new techniques develop. Here, we provide a solution to the metadata gap: GeOMe. A bottom-up effort with buy-in from over 50 laboratories, our database is growing and adding new capacity while also setting the industry standard for metadata publication.

20 in total

1. Field information management systems for DNA barcoding.

Authors: John Deck; Joyce Gross; Steven Stones-Havas; Neil Davies; Rebecca Shapley; Christopher Meyer
Journal: Methods Mol Biol Date: 2012

Review 2. Global biodiversity conservation priorities.

Authors: T M Brooks; R A Mittermeier; G A B da Fonseca; J Gerlach; M Hoffmann; J F Lamoreux; C G Mittermeier; J D Pilgrim; A S L Rodrigues
Journal: Science Date: 2006-07-07 Impact factor: 47.728

3. A global map of human impact on marine ecosystems.

Authors: Benjamin S Halpern; Shaun Walbridge; Kimberly A Selkoe; Carrie V Kappel; Fiorenza Micheli; Caterina D'Agrosa; John F Bruno; Kenneth S Casey; Colin Ebert; Helen E Fox; Rod Fujita; Dennis Heinemann; Hunter S Lenihan; Elizabeth M P Madin; Matthew T Perry; Elizabeth R Selig; Mark Spalding; Robert Steneck; Reg Watson
Journal: Science Date: 2008-02-15 Impact factor: 47.728

4. Ecology. Toward a global biodiversity observing system.

Authors: R J Scholes; G M Mace; W Turner; G N Geller; N Jurgens; A Larigauderie; D Muchoney; B A Walther; H A Mooney
Journal: Science Date: 2008-08-22 Impact factor: 47.728

5. Linking big: the continuing promise of evolutionary synthesis.

Authors: Brian Sidlauskas; Ganeshkumar Ganapathy; Einat Hazkani-Covo; Kristin P Jenkins; Hilmar Lapp; Lauren W McCall; Samantha Price; Ryan Scherle; Paula A Spaeth; David M Kidd
Journal: Evolution Date: 2009-11-06 Impact factor: 3.694

6. Not the time or the place: the missing spatio-temporal link in publicly available genetic data.

Authors: Lisa C Pope; Libby Liggins; Jude Keyse; Silvia B Carvalho; Cynthia Riginos
Journal: Mol Ecol Date: 2015-06-23 Impact factor: 6.185

7. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

8. Darwin Core: an evolving community-developed biodiversity data standard.

Authors: John Wieczorek; David Bloom; Robert Guralnick; Stan Blum; Markus Döring; Renato Giovanni; Tim Robertson; David Vieglais
Journal: PLoS One Date: 2012-01-06 Impact factor: 3.240

9. The founding charter of the Genomic Observatories Network.

Authors: Neil Davies; Dawn Field; Linda Amaral-Zettler; Melody S Clark; John Deck; Alexei Drummond; Daniel P Faith; Jonathan Geller; Jack Gilbert; Frank Oliver Glöckner; Penny R Hirsch; Jo-Ann Leong; Chris Meyer; Matthias Obst; Serge Planes; Chris Scholin; Alfried P Vogler; Ruth D Gates; Rob Toonen; Véronique Berteaux-Lecellier; Michèle Barbier; Katherine Barker; Stefan Bertilsson; Mesude Bicak; Matthew J Bietz; Jason Bobe; Levente Bodrossy; Angel Borja; Jonathan Coddington; Jed Fuhrman; Gunnar Gerdts; Rosemary Gillespie; Kelly Goodwin; Paul C Hanson; Jean-Marc Hero; David Hoekman; Janet Jansson; Christian Jeanthon; Rebecca Kao; Anna Klindworth; Rob Knight; Renzo Kottmann; Michelle S Koo; Georgios Kotoulas; Andrew J Lowe; Viggó Thór Marteinsson; Folker Meyer; Norman Morrison; David D Myrold; Evangelos Pafilis; Stephanie Parker; John Jacob Parnell; Paraskevi N Polymenakou; Sujeevan Ratnasingham; George K Roderick; Naiara Rodriguez-Ezpeleta; Karsten Schonrogge; Nathalie Simon; Nathalie J Valette-Silver; Yuri P Springer; Graham N Stone; Steve Stones-Havas; Susanna-Assunta Sansone; Kate M Thibault; Patricia Wecker; Antje Wichels; John C Wooley; Tetsukazu Yahara; Adriana Zingone
Journal: Gigascience Date: 2014-03-07 Impact factor: 6.524

10. The GBIF integrated publishing toolkit: facilitating the efficient publishing of biodiversity data on the internet.

Authors: Tim Robertson; Markus Döring; Robert Guralnick; David Bloom; John Wieczorek; Kyle Braak; Javier Otegui; Laura Russell; Peter Desmet
Journal: PLoS One Date: 2014-08-06 Impact factor: 3.240
