Pathogen metadata platform: software for accessing and analyzing pathogen strain information.

Wenling E Chang¹, Matthew W Peterson², Christopher D Garay², Tonia Korves³.

Abstract

BACKGROUND: Pathogen metadata includes information about where and when a pathogen was collected and the type of environment it came from. Along with genomic nucleotide sequence data, this metadata is growing rapidly and becoming a valuable resource not only for research but for biosurveillance and public health. However, current freely available tools for analyzing this data are geared towards bioinformaticians and/or do not provide summaries and visualizations needed to readily interpret results.
RESULTS: We designed a platform to easily access and summarize data about pathogen samples. The software includes a PostgreSQL database that captures metadata useful for disease outbreak investigations, and scripts for downloading and parsing data from NCBI BioSample and BioProject into the database. The software provides a user interface to query metadata and obtain standardized results in an exportable, tab-delimited format. To visually summarize results, the user interface provides a 2D histogram for user-selected metadata types and mapping of geolocated entries. The software is built on the LabKey data platform, an open-source data management platform, which enables developers to add functionalities. We demonstrate the use of the software in querying for a pathogen serovar and for genome sequence identifiers.
CONCLUSIONS: This software enables users to create a local database for pathogen metadata, populate it with data from NCBI, easily query the data, and obtain visual summaries. Some of the components, such as the database, are modular and can be incorporated into other data platforms. The source code is freely available for download at https://github.com/wchangmitre/bioattribution .

Keywords: BioSample; Biosurveillance; Geocoding; Java; LabKey; Metadata; Pathogen; PostgreSQL

Year: 2016 PMID: 27634291 PMCID: PMC5025631 DOI: 10.1186/s12859-016-1231-2

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

With advances in DNA sequencing technology, whole genome sequencing of pathogen strains from disease outbreaks is becoming routine. These advances are resulting in enormous growth in the amount of publicly available pathogen nucleotide sequence data. One critical component of this data is high-quality metadata about biological samples. This metadata includes information about where the sample originated and the sample’s phenotypic properties. These features include, but are not limited to, geolocation data, isolation source, collection date, the organization performing collection, sample and strain names, and drug or vaccine resistance information. Pathogen sample metadata presents new opportunities for diagnostic and treatment discovery, biosurveillance, and public health investigations. In order for many of these opportunities to be realized, pathogen metadata needs to be made easily accessible to those beyond the bioinformatics community.
There has been significant growth in the capture and sharing of pathogen metadata. The Genomic Standards Consortium (GSC) has developed a set of “Minimal Information about any Sequence” (MIxS) checklists for genomes (MIGS), including checklists specifically for pathogen samples [1, 2]. Recently, a consortium of pathogen-sequencing institutions created a new metadata standard for pathogens, called the GSCID/BRC (Genome Sequencing Centers for Infectious Diseases and Bioinformatics Resource Centers) Project and Sample Application Standard [3]. Repositories for pathogen metadata have also been created. The National Center for Biotechnology Information (NCBI) maintains the BioSample and BioProject databases [4], which contain metadata about biological samples and projects, respectively. This data is typically submitted by investigators in concert with submission of nucleotide sequence data. The BioSample and BioProject databases exchange data with their European and Japanese counterparts [5]. The Pathosystems Resource Integration Center (PATRIC) and the Virus Pathogen Database and Analysis Resource (ViPR) also provide standardized metadata for some pathogenic bacterial and viral genomes, respectively [6, 7]. The Genomes Online Database (GOLD) [8], developed at the Department of Energy Joint Genome Institute, is a manually curated warehouse of metadata about sequencing experiments following the MIxS standards.
There have also been a number of tools developed to query and retrieve this metadata. The Entrez system at the NCBI [9] provides an interface for searching and filtering query results, and tools such as BioPython [10], BioPerl [11], and BioJava [11] provide functionality for interfacing with these web services. SRAdb enables access to the Sequence Read Archive metadata using R [12].
For biosurveillance and public health endeavors, there are advantages to hosting an independent data platform incorporating publicly available pathogen metadata. In particular, this allows institutions to integrate other data critical for the mission and analyze it in concert with NCBI sample data. For biosurveillance and public health, the joint analysis of pathogen metadata and epidemiological data will be particularly important. Institutions may also have additional pathogen sample data not associated with genomes, or sample data that they do not want to make public but do want to analyze in concert with publicly available data. Furthermore, a separate database allows institutions to customize the database by further standardizing data or adding data fields and tables.
This manuscript describes a web server application designed to make pathogen metadata readily accessible to biologists, biosurveillance analysts, and public health investigators without requiring computer programming. The software includes a database for the capture of pathogen metadata, scripts to populate the database with metadata from NCBI BioSample and BioProject, and a user interface to query, obtain standardized metadata, and visually summarize results.

Implementation

The sample metadata database schema

The sample metadata database is a PostgreSQL database designed to store information about pathogen samples. The schema captures information types that occur in BioSample and BioProject pathogen submissions, and uses many terms from MIxS. The tables in the database are summarized in Table 1. Additional file 1: Figure S1 shows the relationships between these tables, and the database is documented in detail in the BioAttDB_Documentation.pdf file provided with the software.

Table 1

Overview of the tables in the sample metadata database

Database table	Content
Sample	Identity of a sample, including strain name, serovar, and submission date
Collection	Where, when and what type of environment the sample was collected from
Human_Host	Information about the human host for clinical samples, such as age and gender
Non_Human_Host	Information about non-human hosts for environmental samples
Study_Method	Methods used for obtaining and identifying a sample
Project	Information about the project associated with the strain
Project_Sample	Links projects to samples
Owner	Information about the organization that submitted the information about a strain
Collection_Owner	Information about the organization that collected the strain
Project_Publication	Links a sample to publications by PubMed Identifier
Cross_Reference	Stores source and id pairs for documents and databases that reference a sample

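To make the relationships between a few of these tables concrete, the following is a minimal sketch of a subset of the schema. It is not the schema shipped with the software: the column names are assumptions chosen for illustration, and Python's built-in sqlite3 module stands in for PostgreSQL so that the example runs without a database server. The authoritative definitions are in BioAttDB_Documentation.pdf.

```python
import sqlite3

# Illustrative subset of the schema. Column names are assumptions, and
# SQLite is used only as a stand-in for PostgreSQL.
SCHEMA = """
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    strain_name TEXT,
    serovar TEXT,
    submission_date TEXT
);
CREATE TABLE collection (
    sample_id INTEGER REFERENCES sample(sample_id),
    geo_location TEXT,
    collection_date TEXT,
    isolation_source TEXT
);
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    title TEXT
);
CREATE TABLE project_sample (
    project_id INTEGER REFERENCES project(project_id),
    sample_id INTEGER REFERENCES sample(sample_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up sample."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO sample VALUES (1, 'demo-strain-1', 'Typhimurium', '2014-01-15')"
    )
    conn.execute("INSERT INTO collection VALUES (1, 'USA: NY', '2013', 'stool')")
    conn.execute("INSERT INTO project VALUES (10, 'Demo outbreak project')")
    conn.execute("INSERT INTO project_sample VALUES (10, 1)")
    return conn


def samples_for_project(conn, project_id):
    """Follow the Project_Sample link table from a project to its samples."""
    rows = conn.execute(
        """
        SELECT s.strain_name, s.serovar, c.geo_location, c.isolation_source
        FROM project_sample ps
        JOIN sample s ON s.sample_id = ps.sample_id
        LEFT JOIN collection c ON c.sample_id = s.sample_id
        WHERE ps.project_id = ?
        ORDER BY s.sample_id
        """,
        (project_id,),
    )
    return rows.fetchall()
```

The LEFT JOIN keeps samples that have no collection record, which is a design choice of this sketch rather than documented behavior of the platform.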

Scripts to import, parse, and standardize metadata from NCBI

The import of NCBI metadata into the metadata database is handled in four steps, each driven by a shell script. First, the DataDownload.sh script downloads the BioProject and BioSample XML files from the NCBI FTP server. Second, the DataSplit.sh script splits the single XML file provided by NCBI into multiple files, each containing a subset of the nodes relevant to the database schema, for more efficient parsing. Parsing itself is performed by a Java program, which uses a document object model (DOM) parser to map the XML files to Java classes, create tables, and load the data into the database. Because the parser depends on the structure of the XML, its code will need to be updated whenever NCBI changes the BioProject or BioSample XML schemas. Third, the DataMapping.sh script calls the parser to pre-parse the XML files and create a mapping between BioProject and BioSample files. Finally, the DataUpdate.sh script calls the parser twice: once to create the database, and once to load the data into it.
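The DOM-based parsing step can be sketched as follows. This is not the Java parser distributed with the software: it is a Python illustration using the standard library's xml.dom.minidom, and the embedded XML is a made-up, heavily simplified fragment whose element and attribute names are assumptions modeled on BioSample records. The accessions and strain names are placeholders.

```python
from xml.dom.minidom import parseString

# Made-up, simplified fragment. Real BioSample records carry many more
# elements, and the names used here are assumptions for illustration.
SAMPLE_XML = """<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Description>
      <Organism taxonomy_name="Listeria monocytogenes"/>
    </Description>
    <Attributes>
      <Attribute attribute_name="strain">demo-strain-1</Attribute>
      <Attribute attribute_name="collection_date">2013-06</Attribute>
      <Attribute attribute_name="isolation_source">cheese</Attribute>
    </Attributes>
  </BioSample>
  <BioSample accession="SAMN00000002">
    <Description>
      <Organism taxonomy_name="Salmonella enterica"/>
    </Description>
    <Attributes>
      <Attribute attribute_name="strain">demo-strain-2</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""

# Attribute names that this sketch maps to database columns.
WANTED = ("strain", "collection_date", "isolation_source")


def parse_biosamples(xml_text):
    """Walk the DOM and return one flat dict per BioSample element."""
    dom = parseString(xml_text)
    records = []
    for node in dom.getElementsByTagName("BioSample"):
        record = {"accession": node.getAttribute("accession"), "organism": None}
        organisms = node.getElementsByTagName("Organism")
        if organisms:
            record["organism"] = organisms[0].getAttribute("taxonomy_name")
        # Missing attributes stay None so every record has the same keys.
        for name in WANTED:
            record[name] = None
        for attr in node.getElementsByTagName("Attribute"):
            name = attr.getAttribute("attribute_name")
            if name in WANTED and attr.firstChild is not None:
                record[name] = attr.firstChild.data.strip()
        records.append(record)
    return records
```

A DOM parser holds the whole document in memory, which is one plausible reason for splitting the NCBI file into smaller files before parsing.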

LabKey module for database query and visualization

LabKey Server [13] is a data management platform designed for biological data. It is a modular, web-based Java application allowing users to create database schemas, queries, forms, and visualizations in support of research. Rather than requiring the user to load the data into LabKey’s schema, we have chosen to have the module interface with the metadata database. This allows investigators who may be using another system to interface with the database without having to use LabKey. For those using LabKey, the module provides a simple interface to query the metadata database and makes the data available via the LabKey APIs. The interface and query logic are written in HTML and JavaScript, and are easily extendable by the end user. Once a query is performed, results are displayed in a table and can be filtered, visualized, and exported using the capabilities built into LabKey. In addition to the built-in table and graph views from LabKey, the module adds the ability to summarize the results of a query in the form of a 2D histogram. The visualization, which is built using D3.js [14], creates a two-dimensional histogram using two variables selected by the user. The visualization is interactive, allowing the user to mouse over a cell to see the exact count for any given combination. In addition to the 2D histogram view, the software provides functionality to geocode based on any column in a List (LabKey’s user-created database tables) and display the results on a map. In this distribution, the geocoding and mapping are performed using a Google Maps API (https://developers.google.com/maps/), though this could be changed by the end user to use a geospatial analysis package of their choice.
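The counting that underlies a 2D histogram can be sketched in a few lines. This is a Python illustration of the idea only, not the module's D3.js code. The function names, the row format (a list of dicts), and the "not provided" label for missing values are assumptions.

```python
from collections import Counter


def histogram_2d(rows, x_key, y_key, missing="not provided"):
    """Count how many rows fall into each (x, y) combination."""
    counts = Counter()
    for row in rows:
        # Empty or absent values are grouped under a single label.
        x = row.get(x_key) or missing
        y = row.get(y_key) or missing
        counts[(x, y)] += 1
    return counts


def as_grid(counts):
    """Arrange the counts as a dense grid with one row per y value."""
    xs = sorted({x for x, _ in counts})
    ys = sorted({y for _, y in counts})
    grid = [[counts.get((x, y), 0) for x in xs] for y in ys]
    return xs, ys, grid
```

Each cell of the grid corresponds to one square of the histogram, and its value is the count a user would see on mouse-over.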

Results and discussion

In this section, we highlight two examples showing how the Pathogen Metadata Platform can be used in the investigation of disease outbreaks. In these examples, the database was populated with data downloaded from NCBI on October 27, 2014. The time needed to populate the database depends on the current size of BioSample and BioProject, connection speed, the parameters used for splitting, and processor speed. On our system, loading the database in May 2016, when it was 4.7 GB in size, took less than 16 h.

Identifying and Summarizing Strain Data for a Pathogen Species

In this example, there is a new disease outbreak and an investigator wants to determine whether there have been recent outbreaks that may be related. The investigator performs a search on the pathogen name using the basic query interface. Figure 1a shows a search for samples containing data from Listeria monocytogenes. The results are returned in the form of a LabKey table view, which contains information about the samples, including relevant metadata such as strain name, isolation source, collection date, serovar, as well as a reference to the accession number in the NCBI Sequence Read Archive (SRA). This table is then filtered to include only samples collected within the past three years, as shown in Fig. 1b. The table can be exported for use in a bioinformatics analysis pipeline in order to, for example, identify which strains are most closely related to the outbreak strain. Finally, the filtered data is summarized via the 2D histogram view. Figure 1c-d shows the creation of a 2D histogram showing the number of samples collected across years and isolation sources for insight into potential types of sources of the outbreak.
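The filter-and-export step can be approximated outside LabKey as well. The sketch below is an assumption-laden illustration, not the platform's code: it treats rows as dicts, assumes a column named collection_date, extracts the first four-digit number as the year (collection dates in submitted metadata are free text), and interprets "within the past three years" as a simple year cutoff.

```python
import csv
import io
import re


def collection_year(value):
    """Return the first standalone four-digit number in a date string, or None."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", value or "")
    return int(match.group(1)) if match else None


def filter_recent(rows, current_year, years_back=3):
    """Keep rows whose collection year is within years_back of current_year."""
    cutoff = current_year - years_back
    kept = []
    for row in rows:
        year = collection_year(row.get("collection_date"))
        if year is not None and year >= cutoff:
            kept.append(row)
    return kept


def to_tsv(rows, columns):
    """Serialize the selected columns as tab-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Rows without a recognizable year are dropped by this sketch. An investigator may prefer to keep them in a separate list for manual review.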

Fig. 1

Obtaining information about Listeria strains using the Pathogen Metadata Platform. a Querying for a pathogen name. b Filtering query results. c Selecting metadata types for a 2D histogram. d 2D histogram of counts for two metadata types

Obtaining and visualizing information about closely related pathogen strains

In this example, an investigator has sequenced a pathogen sample from a patient and performed phylogenetic analyses using RAxML [15], phylogenetic software that uses a maximum likelihood approach. The analysis identified 22 Salmonella enterica serovar Typhimurium genomes from NCBI that are closely related to the patient’s strain. The investigator wants to know where and what type of environments these closely related strains came from. Information about these strains can be obtained by using the SRA Search form within the LabKey module. SRA identifiers are entered as a comma-separated list (Fig. 2a), and the matching records are returned as a LabKey table (Fig. 2c). This table is then filtered and a 2D histogram summarizing isolation sources and collection years is created as in the previous example (Fig. 2b).
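A lookup by a comma-separated identifier list can be sketched as a parameterized query. This is not the module's JavaScript query logic: the table cross_reference_demo, its columns, and the identifiers are hypothetical, and sqlite3 again stands in for PostgreSQL (a PostgreSQL driver such as psycopg2 uses %s placeholders rather than ?).

```python
import sqlite3


def parse_id_list(text):
    """Split a comma-separated string into unique, trimmed identifiers."""
    ids = []
    for token in text.split(","):
        token = token.strip()
        if token and token not in ids:
            ids.append(token)
    return ids


def find_by_sra_ids(conn, ids):
    """Return (identifier, strain name) rows for the given SRA identifiers."""
    if not ids:
        return []
    # One placeholder per identifier keeps user input out of the SQL text.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT source_id, strain_name FROM cross_reference_demo "
        f"WHERE source = 'SRA' AND source_id IN ({placeholders}) "
        "ORDER BY source_id"
    )
    return conn.execute(sql, ids).fetchall()


def demo_connection():
    """Build an in-memory table with made-up identifiers for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cross_reference_demo "
        "(source TEXT, source_id TEXT, strain_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO cross_reference_demo VALUES (?, ?, ?)",
        [
            ("SRA", "SRR0000001", "demo-strain-1"),
            ("SRA", "SRR0000002", "demo-strain-2"),
            ("SRA", "SRR0000003", "demo-strain-3"),
        ],
    )
    return conn
```

Passing the identifiers as bound parameters, rather than concatenating them into the query, avoids SQL injection from the free-text search form.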

Fig. 2

Obtaining and visualizing metadata for Salmonella strains. a Querying for a set of sequence read archive ids. b 2D histogram of counts for isolation source and year. c Table of query results. d Map of collection locations of the strains

The collection locations of these strains are then mapped. To do this, the table of results is exported as a LabKey list. The “Strain Geography” tab within the LabKey Module allows the user to select this list, along with the column containing the location information to be passed to the geocoder. A map is then presented, with each strain with a location returned by the geocoder displayed as a point on the map (Fig. 2d). Here, we see that the majority of the closely-related strains found within the United States are located in the northeast.
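The geocoding step can be sketched as a function that turns a location column into map points and sets aside rows the geocoder cannot resolve. This is an illustration only: the geocoder here is a hypothetical lookup table with rough, made-up coordinates, standing in for the Google Maps API used by the distribution, and the row format and key names are assumptions.

```python
def geocode_rows(rows, location_column, geocoder):
    """Return (points, unresolved) for rows with a free-text location column.

    geocoder is any callable mapping a place string to a (lat, lon) tuple,
    or to None when the place cannot be resolved.
    """
    points, unresolved = [], []
    cache = {}  # avoid repeated lookups for the same place string
    for row in rows:
        place = (row.get(location_column) or "").strip()
        if not place:
            unresolved.append(row)
            continue
        if place not in cache:
            cache[place] = geocoder(place)
        coords = cache[place]
        if coords is None:
            unresolved.append(row)
        else:
            points.append(
                {"strain": row.get("strain"), "lat": coords[0], "lon": coords[1]}
            )
    return points, unresolved


# Hypothetical stand-in for a real geocoding service. Coordinates are rough.
DEMO_GAZETTEER = {"USA: New York": (43.0, -75.0), "USA: Ohio": (40.4, -82.9)}


def demo_geocoder(place):
    """Look up a place in the demo gazetteer, returning None if unknown."""
    return DEMO_GAZETTEER.get(place)
```

Only rows with resolved coordinates become points, which mirrors the statement that a strain appears on the map only when the geocoder returns a location.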

Relationship to other resources

The Pathogen Metadata Platform offers a few advantages relative to other currently available resources. First, once installed, the platform provides an easy way to query and obtain tables of standardized metadata. In this respect it is similar to capabilities offered in ViPR for some virus genomes [6], and in PATRIC for assembled bacterial genomes [7], but it provides access to all sample entries in BioSample, including the growing number associated with unassembled genomic data. Second, the platform integrates mapping of geographical locations for genomes from a large database. Available software for mapping geolocations of pathogen genomes includes Supramap, which superimposes phylogenies onto a map [16], and GoMap, which is currently implemented to map HIV strains with drug resistance mutation information [17]. Unlike these, the Pathogen Metadata Platform links mapping with all samples from BioSample, though without a DNA analysis component. In addition, the platform provides interactive 2D histograms to show the variables most strongly associated with the queried pathogen, such as the types of environments the pathogen is frequently collected from. Interactive summary figures for pathogen genome metadata have not yet been incorporated into other web server applications, but they provide a way to understand pathogen context quickly, especially when there are large numbers of genomes per species.

Conclusions

The Pathogen Metadata Platform provides functionalities for parsing and loading metadata from NCBI into a relational schema, as well as query and visualization capabilities. This open-source software is modular, such that some components can be individually incorporated into other platforms and modified for specific purposes. For example, the metadata database could be used with other software, and data from sources other than NCBI can be added to it. In addition, the software is extensible, and the LabKey platform provides the opportunity to develop modules for additional analyses. We believe this software will be particularly useful as a complement to DNA analyses, as it has been in our own research. The platform could be paired with easy-to-use DNA analysis software that assesses the relatedness of pathogen strains to enable biosurveillance and public health investigations.

Availability and requirements

Project Name: Pathogen Metadata Platform
Project Home Page: https://github.com/wchangmitre/bioattribution
Operating system: Linux
Programming Environment: Java, SQL
Requirements: A working installation of LabKey Server and PostgreSQL database server
License: Apache License

16 in total

1. D³: Data-Driven Documents.

Authors: Michael Bostock; Vadim Ogievetsky; Jeffrey Heer
Journal: IEEE Trans Vis Comput Graph Date: 2011-12 Impact factor: 4.579

2. The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification.

Authors: T B K Reddy; Alex D Thomas; Dimitri Stamatis; Jon Bertsch; Michelle Isbandi; Jakob Jansson; Jyothi Mallajosyula; Ioanna Pagani; Elizabeth A Lobos; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2014-10-27 Impact factor: 16.971

3. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

4. The minimum information about a genome sequence (MIGS) specification.

Authors: Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal: Nat Biotechnol Date: 2008-05 Impact factor: 54.908

5. BioJava: an open-source framework for bioinformatics in 2012.

Authors: Andreas Prlić; Andrew Yates; Spencer E Bliven; Peter W Rose; Julius Jacobsen; Peter V Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L Heuer; H Brandstätter-Müller; Philip E Bourne; Scooter Willis
Journal: Bioinformatics Date: 2012-08-09 Impact factor: 6.937

6. LabKey Server: an open source platform for scientific data integration, analysis and collaboration.

Authors: Elizabeth K Nelson; Britt Piehler; Josh Eckels; Adam Rauch; Matthew Bellew; Peter Hussey; Sarah Ramsay; Cory Nathe; Karl Lum; Kevin Krouse; David Stearns; Brian Connolly; Tom Skillman; Mark Igra
Journal: BMC Bioinformatics Date: 2011-03-09 Impact factor: 3.307

7. The Geogenomic Mutational Atlas of Pathogens (GoMAP) web system.

Authors: David P Sargeant; Michael W Hedden; Sandeep Deverasetty; Christy L Strong; Izua J Alaniz; Alexandria N Bartlett; Nicholas R Brandon; Steven B Brooks; Frederick A Brown; Flaviona Bufi; Monika Chakarova; Roxanne P David; Karlyn M Dobritch; Horacio P Guerra; Kelvy S Levit; Kiran R Mathew; Ray Matti; Dorothea Q Maza; Sabyasachy Mistry; Nemanja Novakovic; Austin Pomerantz; Timothy F Rafalski; Viraj Rathnayake; Noura Rezapour; Christian A Ross; Steve G Schooler; Sarah Songao; Sean L Tuggle; Helen J Wing; Sandy Yousif; Martin R Schiller
Journal: PLoS One Date: 2014-03-27 Impact factor: 3.240

8. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

9. SRAdb: query and use public next-generation sequencing data from within R.

Authors: Yuelin Zhu; Robert M Stephens; Paul S Meltzer; Sean R Davis
Journal: BMC Bioinformatics Date: 2013-01-17 Impact factor: 3.169

10. PATRIC, the bacterial bioinformatics database and analysis resource.

Authors: Alice R Wattam; David Abraham; Oral Dalay; Terry L Disz; Timothy Driscoll; Joseph L Gabbard; Joseph J Gillespie; Roger Gough; Deborah Hix; Ronald Kenyon; Dustin Machi; Chunhong Mao; Eric K Nordberg; Robert Olson; Ross Overbeek; Gordon D Pusch; Maulik Shukla; Julie Schulman; Rick L Stevens; Daniel E Sullivan; Veronika Vonstein; Andrew Warren; Rebecca Will; Meredith J C Wilson; Hyun Seung Yoo; Chengdong Zhang; Yan Zhang; Bruno W Sobral
Journal: Nucleic Acids Res Date: 2013-11-12 Impact factor: 16.971
