PGAdb-builder: A web service tool for creating pan-genome allele database for molecular fine typing.

Yen-Yi Liu¹, Chien-Shun Chiou¹, Chih-Chieh Chen²,³.

Abstract

With the advance of next-generation sequencing techniques, whole genome sequencing (WGS) is expected to become the optimal method for molecular subtyping of bacterial isolates. To use WGS as a general subtyping method for disease outbreak investigation and surveillance, the layout of WGS-based typing must be comparable among laboratories. Whole genome multilocus sequence typing (wgMLST) is an approach that meets this requirement. To apply wgMLST as a standard subtyping approach, a pan-genome allele database (PGAdb) for the population of a bacterial organism must first be established. We present a free web service tool, PGAdb-builder (http://wgmlstdb.imst.nsysu.edu.tw), for the construction of bacterial PGAdbs. The effectiveness of PGAdb-builder was tested by constructing a pan-genome allele database for Salmonella enterica serovar Typhimurium and applying the database to create a wgMLST tree for a panel of epidemiologically well-characterized S. Typhimurium isolates. The wgMLST-based approach performed as well as the SNP-based approach used in Leekitcharoenphon's study in discriminating between epidemiologically related and unrelated isolates.

Year: 2016; PMID: 27824078; PMCID: PMC5099940; DOI: 10.1038/srep36213

Source DB: PubMed; Journal: Sci Rep; ISSN: 2045-2322; Impact factor: 4.379

Molecular subtyping of bacterial isolates is fundamental to the epidemiologic study of infectious diseases. Subtyping methods used for disease outbreak investigation and surveillance across regions and countries must be standardized so that results can be compared across laboratories. Pulsed-field gel electrophoresis (PFGE) is a good example; it has been standardized and successfully implemented as a common subtyping tool in the foodborne disease surveillance network PulseNet [1]. Although PFGE is highly discriminatory for most bacterial organisms, it is labor-intensive and time-consuming, and it is sometimes insufficient for discerning among strains of highly clonal organisms. Multilocus variable-number tandem repeat analysis (MLVA) exhibits a much higher level of discrimination than PFGE among very closely related strains; however, MLVA is highly organism-specific, and comparing its results across laboratories is difficult [2,3].

With the advance of next-generation sequencing (NGS) techniques, whole genome sequencing (WGS) has become a practical and powerful subtyping tool for disease outbreak detection [4,5]. To use WGS as a standard subtyping tool for disease surveillance and the investigation of common outbreaks across regions or countries, the layout of fingerprints (genotypes) generated from WGS data must be comparable among laboratories. Currently, NGS platforms generally produce millions of short sequences (reads) for a bacterial strain. These reads can be further assembled into longer sequences (contigs) with various assemblers [6,7,8] and then annotated. A number of algorithms and approaches have been developed for analyzing WGS data [9–14]. Single nucleotide polymorphism (SNP) analysis is an approach frequently used to analyze WGS data for evolutionary studies and disease outbreak investigation [15,16,17]. To apply the SNP approach, a reference genome sequence is required for selecting SNPs from the WGS data of strains. When different reference sequences are used, different SNP sets are generally obtained, making the SNP profiles incomparable across laboratories.

Whole genome multilocus sequence typing (wgMLST) [14,18], an extension of traditional MLST [19], is considered an ideal approach for sorting out WGS data and generating genetic layouts that are portable and comparable among laboratories. To use wgMLST as a standard subtyping tool, a pan-genome allele database (PGAdb) for the population of a bacterial organism must first be established. In a PGAdb, genes (loci) and their sequence variants (alleles) are designated using a standardized numbering system. An allelic sequence consists of a series of numbers and is therefore portable and comparable across laboratories. We present a web service tool, PGAdb-builder, that can be used for the construction of bacterial pan-genome allele databases. In this paper, we demonstrate the function of PGAdb-builder by constructing an S. Typhimurium PGAdb and generating a wgMLST tree for a panel of epidemiologically well-characterized S. Typhimurium isolates, which were sequenced previously by DTU Food [20].

Methods and Implementation

The flowchart for the proposed PGAdb-builder is illustrated in Fig. 1. The PGAdb-builder server comprises two functional modules: Build_PGAdb, which creates a PGAdb, and Build_wgMLSTtree, which uses the PGAdb to generate allelic sequences from uploaded genome contigs and constructs genetic relatedness (wgMLST) trees from them. The details of the Build_PGAdb and Build_wgMLSTtree modules are described below.

Figure 1

The schematic work flow of PGAdb-builder.

Build_PGAdb

The Build_PGAdb module annotates the uploaded genome contigs by using the Prokka pipeline [21], a rapid bacterial genome annotation tool. Subsequently, the gff file created in the annotation process is processed to place proteins into orthologous clusters by using the Roary pipeline [22], a tool that can rapidly process a large-scale collection of genomes. In this module, paralogous genes are excluded from the pan-genome allele dataset. Each orthologous cluster consists of a protein family with 95% (adjustable between 90% and 99%) sequence identity. Each protein family is defined as a locus (gene). The orthologous proteins in each cluster are converted to nucleotide sequences by reference to the ffn file created in the annotation process to establish a pan-genome allele dataset. In this step, sequences in a locus that differ from each other by one or more nucleotides are defined as different alleles. The loci of a pan-genome allele dataset are then encoded with a prefix string of three alphabetic letters followed by an eight-digit serial number (e.g., SAL00000001, SAL00000002, …), and the alleles in each locus are assigned integers from 1 to n (e.g., 1, 2, 3, …, n).
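The locus and allele numbering described above can be illustrated with a minimal sketch. It is not the server's code: the function name `build_allele_db` and the input shape (one list of nucleotide sequences per orthologous cluster) are assumptions made for illustration, and the annotation and clustering steps performed by Prokka and Roary are taken as already done.

```python
def build_allele_db(clusters, prefix="SAL"):
    """Assign locus IDs and allele numbers to clustered sequences.

    clusters: list of lists of nucleotide sequences, one inner list per
              orthologous cluster (hypothetical input shape).
    Returns {locus_id: {sequence: allele_number}}.
    """
    db = {}
    for serial, sequences in enumerate(clusters, start=1):
        # Three-letter prefix followed by an eight-digit serial number.
        locus_id = f"{prefix}{serial:08d}"
        alleles = {}
        for seq in sequences:
            seq = seq.upper()
            # One or more mismatches define a new allele, so exact
            # string identity is enough to tell alleles apart.
            if seq not in alleles:
                alleles[seq] = len(alleles) + 1
        db[locus_id] = alleles
    return db


example_db = build_allele_db([["ATGAAA", "ATGAAA", "ATGAAG"], ["ATGCCC"]])
for locus_id, alleles in example_db.items():
    print(locus_id, alleles)
```

Numbering alleles in order of first appearance keeps the integers stable as long as the input order is fixed, which is one simple way to obtain identifiers that can be shared between laboratories.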

Build_wgMLSTtree

The Build_wgMLSTtree module compares the uploaded genome contigs of strains by using a PGAdb and constructs genetic relatedness trees (wgMLST trees). To create a wgMLST tree, the uploaded genome contigs are compared with the built PGAdb using BLASTN [23]. If an allele is present in a locus, the predefined allele number is assigned; if an allele is absent, “0” is assigned. After the allele-finding process is finished, an “allelic sequence” for each uploaded genome is created. A dendrogram with bootstrap values, which are calculated with the ETE toolkit [24], is then constructed from the allelic sequences with the PHYLIP program [25] using the UPGMA clustering algorithm.
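The following sketch shows how allelic sequences and a UPGMA tree could be derived from allele calls. It is a plain-Python stand-in, not the PHYLIP/ETE workflow used by the server: the helper names, the distance measure (fraction of loci with differing allele numbers, with an absent locus "0" counted like any other value), and the toy genomes are all assumptions, and bootstrap values are not computed.

```python
from itertools import combinations


def allelic_profile(genome_alleles, loci):
    """Allelic sequence of one genome; loci that were not found get 0."""
    return [genome_alleles.get(locus, 0) for locus in loci]


def allelic_distance(p, q):
    """Fraction of loci at which two profiles differ (assumed metric)."""
    return sum(a != b for a, b in zip(p, q)) / len(p)


def upgma(names, profiles):
    """Return a Newick string (topology only) built by UPGMA."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    dist = {frozenset((i, j)): allelic_distance(profiles[i], profiles[j])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        a, b = sorted(pair)
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # Size-weighted average distance to the merged cluster.
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            dist[frozenset((next_id, c))] = (
                (d_ac * size_a + d_bc * size_b) / (size_a + size_b))
        del dist[pair]
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    (label, _size), = clusters.values()
    return label + ";"


loci = ["SAL00000001", "SAL00000002", "SAL00000003"]
genomes = {  # hypothetical allele calls per genome
    "A": {"SAL00000001": 1, "SAL00000002": 1, "SAL00000003": 1},
    "B": {"SAL00000001": 1, "SAL00000002": 1, "SAL00000003": 2},
    "C": {"SAL00000001": 2, "SAL00000002": 3},  # third locus absent
}
profiles = [allelic_profile(g, loci) for g in genomes.values()]
tree = upgma(list(genomes), profiles)
print(tree)
```

Whether an absent locus should count as a difference or be ignored in pairwise comparisons is a design choice that the text does not specify; the sketch takes the simpler option.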

Implementation

The PGAdb-builder server integrates the Build_PGAdb and Build_wgMLSTtree modules through PHP scripts. The web pages were constructed using HTML, JavaScript, and PHP. The server runs on a Linux cluster with 2.40 GHz Intel Xeon processors comprising 24 cores. Dendrograms labeled with bootstrap values produced by the Build_wgMLSTtree module are displayed on the web page and can be downloaded in Newick and PDF formats.

Webserver

Input format

The two modules of PGAdb-builder accept genome contigs in the FASTA format (Fig. 2A). When the default parameter was used (protein sequence identity = 95%), Build_PGAdb required approximately 19 hours to construct a database in a test using 487 S. Typhimurium genomes (487ST_set). Build_PGAdb creates a Database ID after the process finishes. Build_wgMLSTtree required 4.5 hours to construct a wgMLST tree (with the pan-genome scheme) for 34 S. Typhimurium genomes by using the PGAdb when the default parameters (alignment coverage ≥ 90%; alignment identity ≥ 90%) were set. Users are encouraged to provide an e-mail address to receive a notification when a job finishes.
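As an illustration of the default Build_wgMLSTtree thresholds, the sketch below filters a single alignment hit. The text does not state how coverage is computed; here it is assumed to be the aligned length divided by the length of the database allele, and the function name and arguments are hypothetical.

```python
def passes_thresholds(aligned_length, allele_length, percent_identity,
                      min_coverage=90.0, min_identity=90.0):
    """Return True if a hit meets both default thresholds.

    Coverage is assumed to be aligned length / allele length, in percent.
    """
    coverage = 100.0 * aligned_length / allele_length
    return coverage >= min_coverage and percent_identity >= min_identity


print(passes_thresholds(950, 1000, 98.5))   # coverage 95%, identity 98.5%
print(passes_thresholds(850, 1000, 99.0))   # coverage 85% is below the cutoff
```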

Figure 2

The features of the PGAdb-builder server.

(A) Input page of the Build_PGAdb (right panel) and the Build_wgMLSTtree (left panel). (B) Output page of the Build_PGAdb. (C) Output page of the Build_wgMLSTtree.

Output format

The output of Build_PGAdb comprises (A) a summary of settings; (B) a pie chart illustrating the numbers (percentages) of loci for the core genome, dispensable genome, and unique genes in the PGAdb; (C) a checkbox menu for the selection of the user-defined scheme; and (D) buttons labeled “Go To wgMLSTtree” and “Download User Defined Scheme.” The user-defined scheme file can be used as the input for the Build_wgMLSTtree module through the “Upload User Defined Scheme” option. Through this mechanism, users can exchange their pan-genome databases by sharing their scheme files. The output of Build_wgMLSTtree includes (A) a summary of settings, (B) a genetic relatedness tree constructed using the scheme selected by the user, and (C) a summary of output files to download. Examples of Build_PGAdb and Build_wgMLSTtree outputs are shown in Fig. 2B,C, respectively.

Example Analysis

We tested the ability of the Build_PGAdb module to construct a PGAdb by using genome contigs of 487 Salmonella Typhimurium strains (487ST_set; Table S1), which were downloaded from the National Center for Biotechnology Information (NCBI) Genome database (https://www.ncbi.nlm.nih.gov/genome). The operation required approximately 19 hours on a Linux server with 2.40 GHz Intel Xeon processors comprising 24 cores. The S. Typhimurium PGAdb contained 27,011 loci, of which 12.5% (3,375 loci) belonged to the core genome, 44% (11,905 loci) belonged to the dispensable genome, and 43.5% (11,731 loci) were unique genes. In this step, we defined the core genome as genes present in at least 95% of the tested genomes, the dispensable genome as genes present in two or more but fewer than 95% of the genomes, and unique genes as genes present in only a single genome. The PGAdb from the 487ST_set was then used to construct a wgMLST tree for 34 epidemiologically well-characterized S. Typhimurium isolates by using the Build_wgMLSTtree module. The allelic sequences for the 34 isolates were formed on the basis of the 27,011 loci of the pan-genome. As illustrated in Fig. 3, the genetic relationships among the 34 isolates constructed using the wgMLST-based approach were highly concordant with the relationships of the isolates determined using the SNP-based method in a previous study [20].
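The three locus categories can be expressed as a small classification rule. This is a sketch under one stated assumption: "present in 95%" is read as "present in at least 95% of the genomes". The function name and arguments are hypothetical.

```python
def classify_locus(n_present, n_genomes, core_fraction=0.95):
    """Classify a locus by the number of genomes in which it occurs."""
    if n_present / n_genomes >= core_fraction:
        return "core"
    if n_present >= 2:
        return "dispensable"
    return "unique"


# With 487 genomes, a locus needs to occur in at least 463 of them
# to reach the 95% cutoff (463 / 487 is about 0.951).
for count in (487, 463, 462, 2, 1):
    print(count, classify_locus(count, 487))
```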

Figure 3

Dendrogram (genetic relatedness tree) for 34 epidemiologically well-characterized S. Typhimurium isolates sequenced by DTU Food [20]. Isolates from six foodborne disease outbreaks are marked.

Conclusion

The proposed online tool PGAdb-builder, comprising two modules, Build_PGAdb and Build_wgMLSTtree, was established to enable users to use WGS data to construct bacterial pan-genome allele databases and to apply the databases to create genetic relatedness trees for bacterial strains. A strong advantage of the PGAdb-builder server is that a built PGAdb with its user-defined scheme can be reused by uploading the downloaded “User defined scheme file” (UDS file), which records the database ID and the defined scheme. Through this mechanism, users can exchange their PGAdbs simply by sharing the UDS files. PGAdb-builder should be a useful online tool for the construction of bacterial pan-genome allele databases and genetic relatedness trees.

Additional Information

How to cite this article: Liu, Y.-Y. et al. PGAdb-builder: A web service tool for creating pan-genome allele database for molecular fine typing. Sci. Rep. 6, 36213; doi: 10.1038/srep36213 (2016).

25 in total

1. Genetic epidemiology of single-nucleotide polymorphisms.

Authors: A Collins; C Lonjou; N E Morton
Journal: Proc Natl Acad Sci U S A Date: 1999-12-21 Impact factor: 11.205

Review 2. Single nucleotide polymorphisms and the future of genetic epidemiology.

Authors: N J Schork; D Fallin; J S Lanchbury
Journal: Clin Genet Date: 2000-10 Impact factor: 4.438

3. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors: Daniel R Zerbino; Ewan Birney
Journal: Genome Res Date: 2008-03-18 Impact factor: 9.043

4. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18 Impact factor: 6.937

5. Genetic relationships of phage types and single nucleotide polymorphism typing of Salmonella enterica Serovar Typhimurium.

Authors: Stanley Pang; Sophie Octavia; Peter R Reeves; Qinning Wang; Gwendolyn L Gilbert; Vitali Sintchenko; Ruiting Lan
Journal: J Clin Microbiol Date: 2012-01-11 Impact factor: 5.948

6. Characterization of Foodborne Outbreaks of Salmonella enterica Serovar Enteritidis with Whole-Genome Sequencing Single Nucleotide Polymorphism-Based Analysis for Surveillance and Outbreak Detection.

Authors: Angela J Taylor; Victoria Lappi; William J Wolfgang; Pascal Lapierre; Michael J Palumbo; Carlota Medus; David Boxrud
Journal: J Clin Microbiol Date: 2015-08-12 Impact factor: 5.948

7. BIGSdb: Scalable analysis of bacterial genome variation at the population level.

Authors: Keith A Jolley; Martin C J Maiden
Journal: BMC Bioinformatics Date: 2010-12-10 Impact factor: 3.169

8. Roary: rapid large-scale prokaryote pan genome analysis.

Authors: Andrew J Page; Carla A Cummins; Martin Hunt; Vanessa K Wong; Sandra Reuter; Matthew T G Holden; Maria Fookes; Daniel Falush; Jacqueline A Keane; Julian Parkhill
Journal: Bioinformatics Date: 2015-07-20 Impact factor: 6.937

9. AGP: a multimethods web server for alignment-free genome phylogeny.

Authors: Jinkui Cheng; Fuliang Cao; Zhihua Liu
Journal: Mol Biol Evol Date: 2013-02-06 Impact factor: 16.240

10. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524
