
BacWGSTdb, a database for genotyping and source tracking bacterial pathogens.

Abstract

Whole genome sequencing has become one of the routine methods in molecular epidemiological practice. In this study, we present BacWGSTdb (http://bacdb.org/BacWGSTdb), a bacterial whole genome sequence typing database which is designed for clinicians, clinical microbiologists and hospital epidemiologists. This database borrows the population structure from the current multi-locus sequence typing (MLST) scheme and adopts a hierarchical data structure: species, clonal complex and isolates. When users upload the pre-assembled genome sequences to BacWGSTdb, it offers the functionality of bacterial genotyping at both traditional MLST and whole-genome levels. More importantly, users are told which isolates in the public database are phylogenetically close to the query isolate, along with their clinical information such as host, isolation source, disease, collection time and geographical location. In this way, BacWGSTdb offers a rapid and convenient platform for worldwide users to address a variety of clinical microbiological issues such as source tracking bacterial pathogens.

Year: 2015 PMID: 26433226 PMCID: PMC4702769 DOI: 10.1093/nar/gkv1004

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Based on the premise that similar isolates may share similar medical traits, a prima facie concern shared by the global medical community is: ‘Have we seen the particular pathogen before? Where, when and what kind of disease is it associated with?’ To address this concern, a variety of genotyping methods have been developed, of which multi-locus sequence typing (MLST) has been considered the gold standard for many bacterial pathogens for over a decade (1,2). While MLST performs well in inter-lineage genotyping, it lacks the discriminatory power to differentiate closely related bacterial isolates (3,4). The advent of next-generation sequencing technologies has made it possible to obtain the entire bacterial genome at relatively modest cost and effort. Because its extremely high resolution allows source tracking of the same bacterial clone isolated from different patients, areas or periods, whole genome sequencing (WGS) has been increasingly used to solve a wide range of research problems concerning bacterial epidemiology, drug resistance, pathogenicity and evolution (5–9). To date, the US Food and Drug Administration (FDA) has granted WGS marketing authorization for investigating food-borne outbreaks in the USA. WGS is therefore expected to become, in the near future, a routine tool not only for basic research but also for clinical diagnostics and surveillance. Nevertheless, the development of sequencing technology alone is not sufficient to achieve this goal. An easy-to-use public database is also required for the international exchange of whole genome sequence typing (WGST) information on bacteria. Although the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) already offer a platform for storing the raw reads of WGS (10,11), it is far from convenient for investigators to work with the raw data, especially clinicians with limited bioinformatics skills.
More importantly, the provenance of many isolates, such as host, isolation source, disease, collection time and geographical location, has not always been submitted along with the genome sequences. Consequently, medical significance cannot be predicted even if a highly similar genome is found. From this perspective, a new tool designed for the clinicians, clinical microbiologists and hospital epidemiologists who monitor the emergence and outbreak of important bacterial pathogens is urgently needed. Generally, two strategies scale well to the genomic comparison of thousands of bacterial isolates: gene-by-gene genomic analysis and a reference genome-based single nucleotide polymorphism (SNP) strategy (12). The former is an MLST-like approach called whole-genome MLST (wgMLST), which indexes alleles for all coding sequences in the genome and therefore provides a highly scalable means of studying the sequence variation encoded within it. In theory, the gene-by-gene strategy is suitable for typing bacteria at a wide range of resolutions and might present a much more accurate phylogenetic relationship than traditional MLST (13,14). Currently the Bacterial Isolate Genome Sequence Database (BIGSdb), which was developed following this strategy, has been integrated into the PubMLST database (http://pubmlst.org) for a few species. Users can choose either the traditional MLST scheme or the new whole genome-based one according to their specific typing purpose (14–16). However, the phylogenetic analysis performed by this strategy is quite time-consuming and usually demands considerable computational resources, especially when handling hundreds of genome sequences together. In contrast, the reference genome-based SNP strategy borrows the population structure from the current MLST schemes and requires a reference genome for each of the clonal complexes.
The genomes of bacterial isolates are compared against the reference genome and the derived SNP data are used for further phylogenetic analysis. This algorithm assigns a higher identity between sequences differing by a single nucleotide and a lower identity between sequences with multiple differences. Although it is unsuited to some highly diverse bacterial species such as Pseudomonas aeruginosa, or to comparing relationships among distantly related lineages, the reference genome-based SNP strategy offers satisfactory resolution for differentiating isolates belonging to the same clone and is therefore suitable for analyzing the clonal structure of isolates emerging in an outbreak. In this study, we introduce a new database which offers the ability to extract MLST information from a bacterial genome sequence. More importantly, it also helps find isolates in the public database that are phylogenetically close to the query isolate, along with their clinical information. We believe this function is vital for source tracking bacterial pathogens during outbreak investigation in the era of genomic epidemiology.

DATABASE DESCRIPTION

Bacterial Whole Genome Sequence Typing Database (BacWGSTdb, http://bacdb.org/BacWGSTdb) aims to provide genotyping at both the traditional MLST and the WGST level. For this purpose, we borrowed the population structure from the current MLST schemes and specified a reference genome for each of the clonal complexes. A clonal complex is defined herein as a set of Sequence Types (STs) that differ by one or two alleles. The isolates stored in our database are first genotyped according to the MLST scheme, and an appropriate reference genome is chosen according to the typing result. The complete or draft genome of each isolate is then compared against the specified reference genome. The resulting SNP information is stored in the database together with the clinical information of the isolates, including host, isolation source, disease, collection time and geographical location. Figures 1 and 2 show the infrastructure and the general workflow of data processing of BacWGSTdb, respectively.
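
The clonal complex definition above can be illustrated with a minimal sketch. It assumes single-linkage grouping (an ST joins a complex if it differs from any member at no more than two loci), which is one common reading of the definition and not necessarily how BacWGSTdb assigns complexes. The ST names and allele numbers are invented for illustration.

```python
from itertools import combinations


def allele_differences(profile_a, profile_b):
    """Count the loci at which two MLST allele profiles differ."""
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def clonal_complexes(profiles, max_diff=2):
    """Group STs by single linkage (assumption): two STs are linked
    when their profiles differ at no more than max_diff loci."""
    parent = {st: st for st in profiles}

    def find(st):
        # Union-find lookup with path halving.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for st_a, st_b in combinations(profiles, 2):
        if allele_differences(profiles[st_a], profiles[st_b]) <= max_diff:
            parent[find(st_a)] = find(st_b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical seven-locus profiles; not taken from any real MLST scheme.
example_profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),
    "ST_C": (1, 1, 1, 1, 1, 3, 2),
    "ST_D": (5, 5, 5, 5, 5, 5, 5),
}
complexes = clonal_complexes(example_profiles)
```

Here ST_A, ST_B and ST_C fall into one complex and ST_D forms its own.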

Figure 1.

Database structure.

Figure 2.

Workflow of data processing. BacWGSTdb contains three main sections: Browse, Tools and Submission. They all adopt a hierarchical infrastructure: species, clonal complex (represented by a reference genome) and isolates. SNP data is the key component of the database and connects the three sections together.

Construction of the phylogenetic tree in BacWGSTdb relies on the Neighbor-Joining (NJ) algorithm. Although NJ is less accurate than other algorithms such as Maximum-Parsimony or Maximum-Likelihood, this disadvantage is minimized to a great extent when the compared strains are very close to each other. Meanwhile, the NJ algorithm runs significantly faster than the others, especially when handling the genomes of hundreds of isolates together (17,18). Retrieval in BacWGSTdb is therefore fast, which meets the need for real-time monitoring and identification of bacterial outbreaks.

BacWGSTdb has been implemented using MySQL 5.6 (http://www.mysql.org), PHP 5.5 (http://www.php.net) and Apache 2.4 (http://www.apache.org) on a Red Hat Enterprise Linux Server 6.0. The interface consists of webpages designed and implemented in HTML/CSS in a Linux environment. It has been tested in the Google Chrome, Mozilla Firefox, Apple Safari, Internet Explorer and Microsoft Edge web browsers. At the back end of BacWGSTdb, BLAST 2.2.26 is used for comparing the query genome with the MLST allele sequences (19). MUMmer 3.22 is used for alignment with the reference genome and the subsequent SNP identification (20). Indels and adjacent mismatches are not considered true SNPs and are pruned by self-developed Perl scripts. The phylogenetic tree is generated by Clearcut 1.0, a fast and open source implementation of the relaxed NJ algorithm, and displayed as Scalable Vector Graphics (SVG) in the web browser using TreeVector (21,22).

BacWGSTdb currently encompasses nine bacterial organisms of medical importance, i.e. Acinetobacter baumannii, Bacillus anthracis, Escherichia coli, Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae and Yersinia pestis, all of which can be described by a clonal population structure. At the present stage, genome sequences from the GenBank and PATRIC databases have been used to prepare BacWGSTdb (23,24). Genome sequences from the SRA and ENA databases with detailed strain information have also been incorporated (10,11). When a genome assembly was not available, the raw sequence reads were de novo assembled into a draft genome, which was then mapped to the reference genome. The obtained SNP data were stored in BacWGSTdb. The database will be updated periodically and new species can be easily added.
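
The pruning step mentioned above (discarding indels and adjacent mismatches) is done in BacWGSTdb by Perl scripts that are not shown in the paper. The following is only a rough Python sketch of the idea, under two assumptions: variants arrive as (position, reference base, alternative base) tuples on a single reference sequence, and "adjacent" means a variant at the immediately neighbouring reference position.

```python
def prune_snps(variants):
    """Keep only isolated single-base substitutions.

    variants: iterable of (position, ref, alt) tuples on one reference
    sequence. Anything that is not a one-base A/C/G/T substitution is
    treated as an indel and dropped (assumption about the input format).
    """
    bases = set("ACGT")
    ordered = sorted(variants)
    all_positions = {pos for pos, _, _ in ordered}
    kept = []
    for pos, ref, alt in ordered:
        is_substitution = ref in bases and alt in bases and ref != alt
        # A variant of any kind at pos-1 or pos+1 counts as adjacent.
        has_neighbor = (pos - 1) in all_positions or (pos + 1) in all_positions
        if is_substitution and not has_neighbor:
            kept.append((pos, ref, alt))
    return kept


# Invented example records; "." marks a deleted base here.
raw_variants = [
    (100, "A", "G"),
    (200, "C", "T"),
    (201, "G", "A"),
    (300, "A", "."),
    (400, "T", "C"),
]
true_snps = prune_snps(raw_variants)
```

In this example the mismatches at positions 200 and 201 are dropped as adjacent, and the record at position 300 is dropped as an indel.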

USAGE OF BacWGSTdb

Use of BacWGSTdb includes two major parts: TOOLS and BROWSE. One of the most important tools, Typing & Tracking, is designed for users who have sequenced the genome of their query isolates. After uploading a pre-assembled complete or draft genome sequence, users are told the MLST information of the isolate as well as the recommended reference genome. The query genome is then aligned against the reference genome and the SNP data are provided for download in standard Variant Call Format (VCF). Next, the SNP data are automatically compared with those deposited in the database, and the isolates most similar to the query are displayed (Figure 3). At the bottom of the result page, a phylogenetic tree is displayed to better reveal the phylogenetic relationship between the query isolate and the listed close isolates.
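
The comparison step in Typing & Tracking can be sketched as follows. This is an illustration, not the database's actual implementation: it assumes each isolate is represented as a mapping from reference position to the alternative base called against the same reference genome, and that a position absent from the mapping carries the reference base. The isolate names and data are invented, and regions missing from a draft assembly are not handled.

```python
def snp_distance(snps_a, snps_b):
    """Number of reference positions at which two isolates differ.

    Each argument maps a reference position to the alternative base;
    a missing position is assumed to match the reference.
    """
    positions = snps_a.keys() | snps_b.keys()
    return sum(1 for pos in positions if snps_a.get(pos) != snps_b.get(pos))


def closest_isolates(query, database, top_n=10):
    """Return the top_n (name, distance) pairs, nearest first."""
    ranked = sorted(
        (snp_distance(query, snps), name) for name, snps in database.items()
    )
    return [(name, dist) for dist, name in ranked[:top_n]]


# Hypothetical SNP profiles against one shared reference genome.
query_snps = {10: "A", 50: "T"}
stored_isolates = {
    "iso1": {10: "A", 50: "T"},
    "iso2": {10: "A"},
    "iso3": {10: "G", 70: "C", 90: "T"},
}
nearest = closest_isolates(query_snps, stored_isolates, top_n=2)
```

With these inputs the two nearest isolates are iso1 (0 differences) and iso2 (1 difference).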

Figure 3.

Usage example of the tool Typing & Tracking. Panel (A) shows the entry page of Typing & Tracking, in which users choose the species and MLST scheme and upload a query genome sequence. Panels (B–D) show the results of Typing & Tracking: Panel (B) lists the MLST information, the suggested reference genome and the SNP file for download; Panel (C) lists the ten isolates most similar to the query based on the number of different SNPs; Panel (D) shows the phylogenetic relationship between the query and the ten most similar isolates.

We have also developed a series of tools that supplement the key tool Typing & Tracking, including:
SNP-annotation: this tool helps users predict the effect of SNPs. Based on the genomic annotation stored within the database, each SNP uploaded by users is classified as synonymous, non-synonymous or intergenic.
Choose-refgenome-by-ST: with this tool, users input the ST number and are told which reference genome they should use.
Generate-SNP-by-genome: when users prefer a reference genome other than the recommended one, they can upload both the reference and the query genome and download the resulting SNP data.
Coordinate-conversion: if users did not generate their SNP data against the recommended reference genome but wish to use or submit the data in our database, they can convert the coordinates with this tool. Users are required to upload their SNP data and specify the two reference genomes. A pairwise alignment between the two genomes runs on the back-end server and the converted coordinates are then provided.

The BROWSE function is designed for visualizing and comparing isolates deposited in the database. When users browse BacWGSTdb, they first choose a reference genome (each reference genome represents a clonal complex) and then choose isolates of interest based on ST, host, clinical outcome, geographical location or any other attribute (Figure 4). Based on the SNP data against the same reference genome, an unrooted NJ tree is provided for guidance, which reflects the phylogenetic relationship between the selected isolates. Users can also upload their own SNP data (e.g., produced by Typing & Tracking) and compare it with those in the database to determine the phylogenetic position of their query isolate among the selected isolates. In addition to viewing the SVG-formatted phylogenetic tree directly in the browser, users can download the Newick-formatted tree files for examination and/or annotation in external tree-drawing applications.
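
As an illustration of the kind of classification the SNP-annotation tool performs, here is a simplified sketch. It assumes an uppercase reference sequence, 1-based inclusive gene coordinates, and genes on the forward strand only (reverse-strand genes are not handled). It uses the standard genetic code, whose amino acid assignments are the same as those of the bacterial code. The sequence and gene coordinates are invented.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(reference, genes, position, alt):
    """Classify a SNP as synonymous, non-synonymous or intergenic.

    reference: uppercase reference sequence.
    genes: list of (start, end) 1-based inclusive forward-strand CDS ranges.
    position: 1-based position of the SNP; alt: the substituted base.
    """
    for start, end in genes:
        if start <= position <= end:
            offset = position - start            # 0-based offset within the CDS
            codon_start = position - offset % 3  # 1-based start of the codon
            codon = reference[codon_start - 1:codon_start + 2]
            idx = offset % 3
            mutated = codon[:idx] + alt + codon[idx + 1:]
            if CODON_TABLE[codon] == CODON_TABLE[mutated]:
                return "synonymous"
            return "non-synonymous"
    return "intergenic"


# Toy reference: CCC | ATG GCT TAA | GG, with one gene at positions 4-12.
toy_reference = "CCCATGGCTTAAGG"
toy_genes = [(4, 12)]
effect_third_base = classify_snp(toy_reference, toy_genes, 9, "C")  # GCT -> GCC
effect_first_base = classify_snp(toy_reference, toy_genes, 7, "A")  # GCT -> ACT
effect_outside = classify_snp(toy_reference, toy_genes, 2, "T")
```

The change GCT to GCC keeps alanine and is reported as synonymous, GCT to ACT replaces alanine with threonine, and position 2 lies outside the gene.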

Figure 4.

Usage example of Browse. Panel (A), a snapshot of isolate information in the Browse page. When users want to incorporate their query SNP data into the phylogenetic analysis, the uploaded SNP file must be based on the same reference genome as the selected isolates. Panel (B), a phylogenetic tree based on the SNP data, which in this case contains all ST208OD isolates and the user query.

In the SUBMISSION page, users are encouraged to submit their own data to BacWGSTdb. They need to fill in some basic information about their isolate and upload the SNP data (e.g. produced by Typing & Tracking). The uploaded SNP data will appear in the BROWSE page 24 h after submission. Users can also contact the administrator to request curation. The SNP data can be prepared in two ways: users can map the raw WGS reads directly to a reference genome, or align the de novo-assembled contigs to the reference genome. We recommend the latter because the SNP data stored in BacWGSTdb are prepared in this way.

EXAMPLE

The following is an example of how to use BacWGSTdb. A. baumannii has emerged worldwide as an important nosocomial pathogen due to its global occurrence and its ability to develop antimicrobial resistance. Clonal dissemination is characteristic of this important bacterial pathogen, as revealed by previous studies (25–30). Currently, there are two MLST schemes available for A. baumannii, namely the MLST-OD scheme (associated with the Oxford Database, http://pubmlst.org) and the MLST-IP scheme (developed by the Institut Pasteur, http://bigsdb.web.pasteur.fr). The former has a higher resolution and the latter is relatively more conservative. In 2014, an outbreak of bacteremia caused by A. baumannii was detected in a tertiary hospital in Hangzhou, China. We therefore selected one isolate (ABMDR55) for WGS using an Illumina MiSeq sequencer, and the raw reads were assembled into contigs using CLC Genomics Workbench 8.0 software. We then analyzed the draft genome with the tool Typing & Tracking in BacWGSTdb. The result page confirmed that ABMDR55 belonged to ST208OD/ST2IP according to the two MLST schemes and that its appropriate reference genome was ACICU (ST437OD/ST2IP). The strains in the database closely related to ABMDR55 were all isolated in different Chinese cities and at different times. The close relationship among these isolates indicated that they probably belonged to the same clone, which had disseminated widely across space and time in China. The SNP file between ABMDR55 and ACICU was downloaded from the result page for further analysis. The entire analysis took <30 s (Figure 3). We then went to the BROWSE page to obtain the clinical information of isolates that were close to ABMDR55 (Figure 4). In this case, ST208OD is very likely to be a pandemic lineage, since a total of 240 ST208OD A. baumannii genome sequences had been compiled in BacWGSTdb. The strains belonging to ST208OD were isolated in different countries, such as Spain, Denmark, the Czech Republic, Iraq, Thailand, China, Japan and the USA. The earliest strain was isolated in 2002 and the most recent one in 2014. We selected all of the ST208OD isolates and uploaded the SNP file of ABMDR55 to perform the phylogenetic analysis. Generating the phylogenetic tree took <15 s. According to the derived NJ tree, ABMDR55 and the Chinese ST208OD isolates were grouped into an independent branch (Figure 4), which was consistent with a wide clonal dissemination of A. baumannii in China (27).

CONCLUDING REMARKS AND PERSPECTIVES

In light of the rising threats of antimicrobial resistance and emerging virulence among bacterial pathogens, BacWGSTdb represents a rapid and convenient tool for monitoring the emergence or dissemination of new clones and also for global collaboration on the molecular epidemiological investigation of medically important bacterial pathogens. BacWGSTdb will continue to improve, and additional features for analyzing WGS data are also under development.

30 in total

Review 1. Pathogen typing in the genomics era: MLST and the future of molecular epidemiology.

Authors: Marcos Pérez-Losada; Patricia Cabezas; Eduardo Castro-Nallar; Keith A Crandall
Journal: Infect Genet Evol Date: 2013-01-26 Impact factor: 3.342

2. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

3. Molecular epidemiology of carbapenem-nonsusceptible Acinetobacter baumannii in the United States.

Authors: Jennifer M Adams-Haduch; Ezenwa O Onuoha; Tatiana Bogdanovich; Guo-Bao Tian; Jonas Marschall; Carl M Urban; Brad J Spellberg; Diane Rhee; Diane C Halstead; Anthony W Pasculle; Yohei Doi
Journal: J Clin Microbiol Date: 2011-09-14 Impact factor: 5.948

4. Risk factors and outcome analysis of acinetobacter baumannii complex bacteremia in critical patients.

Authors: Hao-Yuan Lee; Chyi-Liang Chen; Si-Ru Wu; Chih-Wei Huang; Cheng-Hsun Chiu
Journal: Crit Care Med Date: 2014-05 Impact factor: 7.598

Review 5. MLST revisited: the gene-by-gene approach to bacterial genomics.

Authors: Martin C J Maiden; Melissa J Jansen van Rensburg; James E Bray; Sarah G Earle; Suzanne A Ford; Keith A Jolley; Noel D McCarthy
Journal: Nat Rev Microbiol Date: 2013-09-02 Impact factor: 60.633

6. BIGSdb: Scalable analysis of bacterial genome variation at the population level.

Authors: Keith A Jolley; Martin C J Maiden
Journal: BMC Bioinformatics Date: 2010-12-10 Impact factor: 3.169

7. GenBank.

Authors: Dennis A Benson; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2014-11-20 Impact factor: 19.160

8. New insights into dissemination and variation of the health care-associated pathogen Acinetobacter baumannii from genomic analysis.

Authors: Meredith S Wright; Daniel H Haft; Derek M Harkins; Federico Perez; Kristine M Hujer; Saralee Bajaksouzian; Michael F Benard; Michael R Jacobs; Robert A Bonomo; Mark D Adams
Journal: MBio Date: 2014-01-21 Impact factor: 7.867

9. NCBI BLAST: a better web interface.

Authors: Mark Johnson; Irena Zaretskaya; Yan Raytselis; Yuri Merezhuk; Scott McGinnis; Thomas L Madden
Journal: Nucleic Acids Res Date: 2008-04-24 Impact factor: 16.971

10. PATRIC, the bacterial bioinformatics database and analysis resource.

Authors: Alice R Wattam; David Abraham; Oral Dalay; Terry L Disz; Timothy Driscoll; Joseph L Gabbard; Joseph J Gillespie; Roger Gough; Deborah Hix; Ronald Kenyon; Dustin Machi; Chunhong Mao; Eric K Nordberg; Robert Olson; Ross Overbeek; Gordon D Pusch; Maulik Shukla; Julie Schulman; Rick L Stevens; Daniel E Sullivan; Veronika Vonstein; Andrew Warren; Rebecca Will; Meredith J C Wilson; Hyun Seung Yoo; Chengdong Zhang; Yan Zhang; Bruno W Sobral
Journal: Nucleic Acids Res Date: 2013-11-12 Impact factor: 16.971
