PGAP: pan-genomes analysis pipeline.

Yongbing Zhao, Jiayan Wu, Junhui Yang, Shixiang Sun, Jingfa Xiao, Jun Yu.

Abstract

SUMMARY: With the rapid development of DNA sequencing technology, the growing volume of bacterial genome data enables biologists to extract evolutionary and genetic information about prokaryotic species from a pan-genome perspective. Efficient pipelines for pan-genome analysis are therefore much needed. We have developed a new pan-genome analysis pipeline (PGAP) that performs five analytic functions with a single command: cluster analysis of functional genes, pan-genome profile analysis, genetic variation analysis of functional genes, species evolution analysis and function enrichment analysis of gene clusters. PGAP's performance has been evaluated on 11 Streptococcus pyogenes strains.

AVAILABILITY: PGAP is written in Perl for the Linux platform, and the package is freely available from http://pgap.sf.net.

CONTACT: junyu@big.ac.cn; xiaojingfa@big.ac.cn

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2011 PMID: 22130594 PMCID: PMC3268234 DOI: 10.1093/bioinformatics/btr655

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

With the rapid development of DNA sequencing technology, many large-scale microbial genome projects are under way, such as the Ten Thousand Microbial Genomes Project and the NIH Human Microbiome Project (HMP) (Peterson et al., 2009). The accumulation of bacterial whole-genome sequences also gives biologists more opportunities to explore and test evolutionary hypotheses on a larger scale than before. In 2005, Tettelin and colleagues introduced the concept of the ‘pan-genome’ (Tettelin et al., 2005). Soon afterwards, pan-genome analysis was widely used to provide insight into the evolution of Streptococcus pneumoniae (Hiller et al.), Haemophilus influenzae (Hogg et al.), Escherichia coli (Rasko et al., 2008) and other species. Besides evolution, the pan-genome has been used to detect strain-specific virulence factors in pathogens such as Legionella pneumophila (D'Auria et al., 2010). It also helps to investigate the pathogens of epidemic diseases by scanning variable functional genes in core genomes (Bayjanov et al., 2010; Holt et al., 2008) and to develop vaccines against bacterial pathogens from a reverse vaccinology perspective (Serruto et al., 2009).
To make pan-genome analysis of a bacterial population as easy as possible, efficient tools are greatly needed. So far, the only tools for pan-genome analysis are Panseq (Laing et al., 2010) and PGAT (Brittnacher et al.). Panseq performs well at extracting the ‘core’ and ‘accessory’ regions among genomic sequences and detecting SNPs in the core regions. However, it cannot present the pan-genome profiles of given strains, trace evolutionary history from multiple types of data, or report the variation and function enrichment of functional genes. PGAT, a web-based database, integrates ortholog assignments, gene content queries, sequence polymorphisms and metabolic pathway information. However, it currently provides analytical results only for the limited set of species in its database and cannot analyze genome data supplied by users.
We have developed a new stand-alone program called the pan-genomes analysis pipeline (PGAP), which integrates multiple functional modules and can be used to study the evolutionary history of bacteria, to investigate pathogenic mechanisms, and to support the prevention and control of epidemics.

2 METHODS AND ALGORITHM

2.1 Test datasets

The accession numbers of the 11 S. pyogenes strains are NC_008022, NC_008024, NC_008023, NC_008021, NC_002737, NC_007297, NC_003485, NC_007296, NC_004070, NC_004606 and NC_006086. All genome data are available from the NCBI FTP site.

2.2 Program algorithm

After input checking and preprocessing, PGAP executes five analysis modules (Supplementary Fig. S1): cluster analysis of functional genes, pan-genome profile analysis, genetic variation analysis of functional genes, species evolution analysis and function enrichment analysis of gene clusters. Cluster analysis of functional genes is the basis of the whole program, because every other module depends on the orthologous clusters it outputs. Species evolution analysis depends on both the orthologous clusters and the results of the genetic variation analysis of functional genes (Supplementary Material).
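
The dependencies among the five modules can be written down as a small graph and sorted into a valid execution order. The sketch below is only an illustration of that ordering: the module labels are hypothetical names chosen here, and PGAP itself is a Perl program whose internal scheduling is not described in the text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical module labels (not PGAP identifiers).
# Each module maps to the set of modules whose output it needs.
DEPENDS_ON = {
    "cluster_analysis": set(),
    "pan_genome_profile": {"cluster_analysis"},
    "genetic_variation": {"cluster_analysis"},
    "species_evolution": {"cluster_analysis", "genetic_variation"},
    "function_enrichment": {"cluster_analysis"},
}


def run_order(depends_on):
    """Return one execution order in which every module follows its inputs."""
    return list(TopologicalSorter(depends_on).static_order())


order = run_order(DEPENDS_ON)
print(order)
```

Any order returned this way starts with the cluster analysis and places the genetic variation analysis before the species evolution analysis, which matches the dependencies stated above.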

3 RESULTS AND DISCUSSION

To evaluate the performance of PGAP, the genomes of 11 S. pyogenes strains were analyzed with both the GeneFamily (GF) and MultiParanoid (MP) methods under default parameter settings, except that the thread number was set to 2, which affects the running time but not the results. After the functional genes were clustered, the GF method detected 2889 clusters in total and the MP method detected 2743. As for core clusters, Lefebure and Stanhope (2007) detected 1376 core genes. PGAP detected 1366 core clusters with the MP method and 1332 with the GF method, which indicates that PGAP's results are consistent with theirs. Comparing the MP and GF methods, we find that the clusters shared by 2–11 strains are consistent (Supplementary Fig. S2), whereas the number of unique genes detected by the GF method is slightly higher than that detected by the MP method, which may be caused by the different algorithms of the two methods. The pan-genome profile analysis (Supplementary Fig. S3) shows that, for both methods, the number of core-genome clusters has almost converged by the time the number of strains reaches nine, whereas the number of pan-genome clusters is still increasing. We therefore infer that S. pyogenes has an open pan-genome, meaning that it may have a robust ability to acquire new genes. Indel or mutation events involve 2012 clusters in the GF result and 2203 clusters in the MP result. As for the dN/dS ratio, 583 clusters in the MP result and 576 clusters in the GF result are under less selection pressure (dN/dS > 1). The genetic variation analysis also allows variable clusters to be selected as markers for typing different strains.
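
The notions of core clusters and of a pan-genome profile can be made concrete with a toy example. The sketch below is a minimal illustration, not PGAP's implementation: the cluster table is invented, the function names are hypothetical, and it enumerates every strain combination, which is feasible only for small inputs. It labels each cluster as core (present in all strains), strain-specific (present in one) or dispensable (anything in between), then averages the core and pan cluster counts over all strain subsets of each size.

```python
from itertools import combinations

# Toy data invented for illustration: cluster ID -> strains that carry it.
CLUSTERS = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B", "C"},
    "c3": {"A", "B"},
    "c4": {"C"},
    "c5": {"B"},
}
STRAINS = ["A", "B", "C"]


def classify(clusters, strains):
    """Label each cluster as core, dispensable or strain-specific."""
    labels = {}
    for cluster_id, members in clusters.items():
        if len(members) == len(strains):
            labels[cluster_id] = "core"
        elif len(members) == 1:
            labels[cluster_id] = "strain-specific"
        else:
            labels[cluster_id] = "dispensable"
    return labels


def profile(clusters, strains):
    """Mean core and pan cluster counts over all strain subsets of each size."""
    rows = []
    for k in range(1, len(strains) + 1):
        core_sizes, pan_sizes = [], []
        for subset in combinations(strains, k):
            chosen = set(subset)
            # Core: clusters present in every chosen strain.
            core_sizes.append(sum(chosen <= m for m in clusters.values()))
            # Pan: clusters present in at least one chosen strain.
            pan_sizes.append(sum(bool(chosen & m) for m in clusters.values()))
        rows.append((k,
                     sum(core_sizes) / len(core_sizes),
                     sum(pan_sizes) / len(pan_sizes)))
    return rows


print(classify(CLUSTERS, STRAINS))
for k, core, pan in profile(CLUSTERS, STRAINS):
    print(k, round(core, 2), round(pan, 2))
```

In such a profile the core count can only fall and the pan count can only rise as strains are added. A pan count that keeps rising while the core count levels off is the pattern described above as an open pan-genome.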
Phylogenetic trees were constructed from the pan-genome profiles and from SNP information (Supplementary Fig. S4). Within one clustering method, the trees generated from different data or with different algorithms differ clearly, but for the same data and algorithm the results of the MP and GF methods are almost the same, though there are some slight differences. The function enrichment analysis of gene clusters (Supplementary Fig. S5) shows that, in the results of both methods, the whole set of clusters and the core clusters are rich in translation, ribosomal structure and biogenesis; transcription; replication, recombination and repair; cell wall/membrane/envelope biogenesis; and cell motility. Dispensable and strain-specific clusters are still rich in transcription; replication, recombination and repair; cell wall/membrane/envelope biogenesis; and cell motility, whereas their numbers of clusters for translation, ribosomal structure and biogenesis decrease sharply compared with the core clusters and the whole set. In addition, strain-specific clusters are rich in carbohydrate transport and metabolism, which may be related to the different niches the strains occupy. The strain-specific clusters also show which genes or clusters differ across the population, which may help to uncover the mechanisms behind drug resistance or sensitivity and pathogenicity or non-pathogenicity (Pallen and Wren, 2007). In conclusion, PGAP can cluster all genes, detect genetic variation in each gene cluster and construct phylogenetic trees with different methods and data. These results can be used to study species evolution and microbial typing in epidemics, and they also help to discover pathogenic mechanisms. We also recorded the running time of all five modules for both methods on an IBM System x3630 M3 (Supplementary Table S1).
The timings in Supplementary Table S1 show that the GF method is faster than the MP method in the cluster analysis of functional genes, but no obvious difference is found in the other four modules. Over the whole process, cluster analysis of functional genes and pan-genome profile analysis take more time than the other modules. Given the PGAP algorithm, the time cost of these two modules may increase markedly as the number of strains grows, but almost all tasks can be run on a personal computer. PGAP is a step forward for genome analysis pipelines because it integrates five analysis modules that are commonly used in genome research. Users can perform all five analysis tasks with a single command. One of our major goals, full automation of the pipeline's entire workflow, has thus been achieved. However, cluster analysis of functional genes is the foundation of the whole process, and identifying homologs and orthologs is a complex task in bioinformatics: no standard parameters suit all genomes, because evolutionary distances differ. To make the results accurate and reliable, PGAP offers two methods with different characteristics for the cluster analysis of functional genes, so that users can choose according to their own requirements. Although the programs that PGAP invokes have default parameters, we expose a series of important parameters so that users can customize the pipeline for their data. Pan-genome analysis is also a hot topic in comparative genomics of bacterial genomes (Hiller et al.; Lefebure and Stanhope, 2007; Tettelin et al., 2005). Although PGAP is not the first bioinformatics program to perform pan-genome analysis, it integrates multiple analysis modules, which will save users time and effort.
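
The earlier remark that running time may grow markedly with the number of strains can be illustrated with simple counting. The sketch below rests on two assumptions that the text does not state: that clustering cost scales with the number of strain pairs to be compared, and that an exhaustive pan-genome profile would visit every non-empty subset of strains. It says nothing about how PGAP actually bounds this work.

```python
from math import comb


def strain_pairs(n):
    """Number of unordered strain pairs among n strains."""
    return comb(n, 2)


def strain_subsets(n):
    """Number of non-empty strain subsets among n strains."""
    return 2 ** n - 1


for n in (5, 11, 20):
    print(n, strain_pairs(n), strain_subsets(n))
```

For the 11 strains used here this gives 55 pairs and 2047 subsets. At 20 strains the pair count is still modest (190), but the subset count exceeds one million, which is why exhaustive enumeration quickly becomes impractical as datasets grow.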
Finally, the modular organization of PGAP allows us to update it continually to keep pace with developments in genome research, such as new algorithms and methods for cluster analysis of functional genes and new techniques for mining genetic information from genomes. In the next version, we will integrate new homolog and ortholog clustering methods into PGAP, reduce the time cost of protein sequence clustering, and add whole-genome structure analysis and pathway analysis.

Funding: National Basic Research Program (973 Program) (No. 2010CB126604); Special Foundation Work Program (No. 2009FY120100), Ministry of Science and Technology of the People's Republic of China; National Science Foundation of China (No. 31071163).

Conflict of interest: none declared.

REFERENCES (14 in total)

1. Genome-based approaches to develop vaccines against bacterial pathogens.

Authors: Davide Serruto; Laura Serino; Vega Masignani; Mariagrazia Pizza
Journal: Vaccine Date: 2009-02-05 Impact factor: 3.641

2. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome".

Authors: Hervé Tettelin; Vega Masignani; Michael J Cieslewicz; Claudio Donati; Duccio Medini; Naomi L Ward; Samuel V Angiuoli; Jonathan Crabtree; Amanda L Jones; A Scott Durkin; Robert T Deboy; Tanja M Davidsen; Marirosa Mora; Maria Scarselli; Immaculada Margarit y Ros; Jeremy D Peterson; Christopher R Hauser; Jaideep P Sundaram; William C Nelson; Ramana Madupu; Lauren M Brinkac; Robert J Dodson; Mary J Rosovitz; Steven A Sullivan; Sean C Daugherty; Daniel H Haft; Jeremy Selengut; Michelle L Gwinn; Liwei Zhou; Nikhat Zafar; Hoda Khouri; Diana Radune; George Dimitrov; Kisha Watkins; Kevin J B O'Connor; Shannon Smith; Teresa R Utterback; Owen White; Craig E Rubens; Guido Grandi; Lawrence C Madoff; Dennis L Kasper; John L Telford; Michael R Wessels; Rino Rappuoli; Claire M Fraser
Journal: Proc Natl Acad Sci U S A Date: 2005-09-19 Impact factor: 11.205

3. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.

Authors: R D Fleischmann; M D Adams; O White; R A Clayton; E F Kirkness; A R Kerlavage; C J Bult; J F Tomb; B A Dougherty; J M Merrick
Journal: Science Date: 1995-07-28 Impact factor: 47.728

4. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi.

Authors: Kathryn E Holt; Julian Parkhill; Camila J Mazzoni; Philippe Roumagnac; François-Xavier Weill; Ian Goodhead; Richard Rance; Stephen Baker; Duncan J Maskell; John Wain; Christiane Dolecek; Mark Achtman; Gordon Dougan
Journal: Nat Genet Date: 2008-07-27 Impact factor: 38.330

5. The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates.

Authors: David A Rasko; M J Rosovitz; Garry S A Myers; Emmanuel F Mongodin; W Florian Fricke; Pawel Gajer; Jonathan Crabtree; Mohammed Sebaihia; Nicholas R Thomson; Roy Chaudhuri; Ian R Henderson; Vanessa Sperandio; Jacques Ravel
Journal: J Bacteriol Date: 2008-08-01 Impact factor: 3.490

6. Legionella pneumophila pangenome reveals strain-specific virulence factors.

Authors: Giuseppe D'Auria; Nuria Jiménez-Hernández; Francesc Peris-Bondia; Andrés Moya; Amparo Latorre
Journal: BMC Genomics Date: 2010-03-17 Impact factor: 3.969

7. PanCGHweb: a web tool for genotype calling in pangenome CGH data.

Authors: Jumamurat R Bayjanov; Roland J Siezen; Sacha A F T van Hijum
Journal: Bioinformatics Date: 2010-03-10 Impact factor: 6.937

8. The NIH Human Microbiome Project.

Authors: Jane Peterson; Susan Garges; Maria Giovanni; Pamela McInnes; Lu Wang; Jeffery A Schloss; Vivien Bonazzi; Jean E McEwen; Kris A Wetterstrand; Carolyn Deal; Carl C Baker; Valentina Di Francesco; T Kevin Howcroft; Robert W Karp; R Dwayne Lunsford; Christopher R Wellington; Tsegahiwot Belachew; Michael Wright; Christina Giblin; Hagit David; Melody Mills; Rachelle Salomon; Christopher Mullins; Beena Akolkar; Lisa Begg; Cindy Davis; Lindsey Grandison; Michael Humble; Jag Khalsa; A Roger Little; Hannah Peavy; Carol Pontzer; Matthew Portnoy; Michael H Sayre; Pamela Starke-Reed; Samir Zakhari; Jennifer Read; Bracie Watson; Mark Guyer
Journal: Genome Res Date: 2009-10-09 Impact factor: 9.043

9. Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions.

Authors: Chad Laing; Cody Buchanan; Eduardo N Taboada; Yongxiang Zhang; Andrew Kropinski; Andre Villegas; James E Thomas; Victor P J Gannon
Journal: BMC Bioinformatics Date: 2010-09-15 Impact factor: 3.169

10. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition.

Authors: Tristan Lefébure; Michael J Stanhope
Journal: Genome Biol Date: 2007 Impact factor: 13.583
