PGAT: a multistrain analysis resource for microbial genomes.

M J Brittnacher, C Fong, H S Hayden, M A Jacobs, Matthew Radey, L Rohmer.

Abstract

MOTIVATION: The Prokaryotic-genome Analysis Tool (PGAT) is a web-based database application for comparing gene content and sequence across multiple microbial genomes, facilitating the discovery of genetic differences that may explain observed phenotypes. PGAT supports database queries to identify genes that are present or absent in user-selected genomes, comparison of sequence polymorphisms in sets of orthologous genes, multigenome display of regions surrounding a query gene, comparison of the distribution of genes in metabolic pathways, and manual community annotation.
AVAILABILITY AND IMPLEMENTATION: The PGAT website may be accessed at http://nwrce.org/pgat. CONTACT: mbrittna@uw.edu.

Year: 2011 PMID: 21765097 PMCID: PMC3157930 DOI: 10.1093/bioinformatics/btr418

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Whole-genome sequence comparison of related bacteria is increasing in scale owing to second-generation sequencing technologies such as Illumina (http://www.illumina.com/) and 454 pyrosequencing (http://www.454.com) that can sequence more than a hundred bacterial genomes in a few months. The main challenge currently presented by those technologies is to compare a large number of draft sequences efficiently in order to elucidate the biological significance of the differences. Substantial progress has been made in accurately aligning whole-genome sequences to uncover genetic polymorphisms (Darling ). However, linking polymorphisms with functional differences still requires examination of their effect on the proteins encoded by these regions (e.g. non-synonymous substitutions, gene inactivation by frameshifts, etc.). The motivation for the development of the Prokaryotic-Genome Analysis Tool (PGAT) was the need for a data-mining tool by which draft genome sequences could be compared among themselves and with completed genomes to explore genetic differences that result in functional differences. The main features of PGAT are as follows: (i) implementation as a web-based database application to support data mining; (ii) ability to efficiently integrate large numbers of genomes including draft genome assemblies; (iii) homogenization of genome annotation across the genomes; and (iv) support for manual community annotation. PGAT integrates many features of current online resources such as the Integrated Microbial Genomes (IMG) system (Markowitz ), the Burkholderia Genome Database (Winsor ) and Neisseria Base (Kislyuk ). Its main difference is the homogenization of gene features across the genomes and the integrated functionality to compare gene content, single nucleotide polymorphisms (SNPs) in orthologous genes, and the resulting impact of SNPs and indels on the encoded proteins.
Currently, PGAT websites host Burkholderia pseudomallei—B.mallei, Francisella tularensis, Yersinia pestis and Salmonella enterica.

2 RESULTS

2.1 Ortholog assignment

In order to determine the presence or absence of genes and to detect sequence polymorphisms in their coding regions in a multigenome comparison, it is essential to accurately define orthologous genes for this set of genomes. There are many methods of determining orthologs [for a recent evaluation of popular methods, see Salichos and Rokas (2011)]. Ortholog prediction methods typically depend upon annotation that has been derived from single genome processing. Spurious results are possible where the particular genes that were called vary from genome to genome, a problem that is more acute in high GC content genomes. To homogenize annotation across a set of highly related genomes, the authors developed a method of ortholog assignment that removes the bias of individual genome annotation. Genes from an initial set of complete genomes are pooled and a single ‘reference’ gene is selected for each gene family determined by Blast (Altschul ) protein sequence alignment of this set on itself. The reference genes are then mapped, using protein Blast sequence alignment, into the set of all open reading frames (ORFs) in a six-frame translation of each genome sequence. A homogenized set of orthologous genes is thus identified across all genomes. Pseudogenes are also identified where reference gene alignments are split across two or more ORFs, or the ORF contains only part of a gene. We use the very conservative rule that ortholog sequence alignments must include >80% of the gene length and have sequence identity greater than 91–92%. The latter threshold is determined by statistical comparison with a reference set of orthologs. This method is only applicable to highly similar (~96% identity or higher) genome sequences, where the arbitrary choice of the reference gene has little impact on the results. The same method of aligning reference genes with all ORFs is applied to draft genomes to identify orthologs.
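The coverage and identity rule described above can be illustrated with a minimal Python sketch. It is not the PGAT implementation: the `Hit` record, the `classify` function and the default identity cut-off of 92% are assumptions made for illustration (the text gives 91–92%, set by statistical comparison), and the alignments are assumed to have already been produced by protein Blast against the six-frame ORFs of one genome.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a reference protein to an ORF (hypothetical record)."""
    ref_gene: str
    orf_id: str
    ref_length: int      # length of the reference protein, in residues
    aligned_length: int  # reference residues covered by this alignment
    identity: float      # percent identity over the aligned region


def classify(hits, min_coverage=0.80, min_identity=92.0):
    """Call each reference gene 'ortholog' or 'pseudogene' in one genome.

    A gene is an ortholog if a single ORF alignment covers more than
    min_coverage of the reference and exceeds min_identity. If only partial
    or split alignments pass the identity cut-off, it is called a pseudogene.
    Genes with no passing alignment are omitted (treated as absent).
    """
    by_gene = {}
    for hit in hits:
        if hit.identity > min_identity:
            by_gene.setdefault(hit.ref_gene, []).append(hit)

    calls = {}
    for gene, gene_hits in by_gene.items():
        full = any(h.aligned_length / h.ref_length > min_coverage for h in gene_hits)
        calls[gene] = "ortholog" if full else "pseudogene"
    return calls


# Toy alignments (invented values, for illustration only).
example_hits = [
    Hit("geneA", "orf1", 300, 290, 98.5),  # near full-length match
    Hit("geneB", "orf2", 400, 150, 97.0),  # split across two ORFs
    Hit("geneB", "orf3", 400, 200, 96.0),
    Hit("geneC", "orf4", 200, 195, 85.0),  # too divergent
]
example_calls = classify(example_hits)
```

In this toy input, `geneA` is called an ortholog, `geneB` a pseudogene, and `geneC` does not appear in the result because its identity falls below the cut-off.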
Gene start sites are homogenized across genomes by adopting the consensus site, i.e. the start site shared by most genomes. Functional annotation of orthologs is derived from previously annotated genomes. Novel genes, identified as Glimmer-predicted (Delcher ) coding regions that do not map back into any of the previously processed genomes, are added to the set of reference genes. The PGAT web interface facilitates manual annotation to correct errors introduced by these automated methods. This feature will also support the involvement of experts in the microbial research community in the ongoing improvement of the functional annotation, similar to what has been done for Pseudomonas research (Brinkman ; Winsor ).

2.2 Gene content queries

Lists of genes can be generated through user-defined queries that compare gene content between genomes. For example, selecting options for ‘present’ in all 22 Burkholderia pseudomallei genomes with both chromosomes available returns a list of 4983 core genes (i.e. genes present in every genome in the database). There is an option to ‘consider pseudogenes as present’ in order to include genes that may not be assembled properly in draft sequences. A query of all distinct genes returns 8568 genes in the ‘pan-genome’, a concept introduced by Tettelin et al. (2005) referring to all genes existing in at least one of the genomes available for the species. These numbers are consistent with the results of a recent study of B.pseudomallei genomes (Nandi ) based on 11 genomes. Loss of function through gene deletion or gain of function through gene acquisition, commonly used to explain differences in observed phenotypes, can also be explored in PGAT. For example, selecting ‘present’ for B.pseudomallei K96243 and 668, ‘absent’ for 1106a and 1710b, ‘ignore’ for the remainder and the ‘present in all’ option returns a list of 38 genes. Most of these genes occur in genomic islands in K96243 and 668 that are absent from the 1106a and 1710b strains. This organization in islands can be easily visualized through the ‘synteny map’, which displays a genomic region of 1 to 100 kb in length aligned around a selected gene for the genomes in which this gene is present. Lists and sequences of orthologous genes can also be generated and downloaded.
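The logic of such present/absent/ignore queries can be sketched with set operations. The example below is illustrative only and assumes a simple in-memory mapping from genome name to gene identifiers; the function name `query_genes` and the gene IDs are hypothetical, and PGAT itself answers these queries from its relational database.

```python
def query_genes(presence, present_in=(), absent_in=()):
    """Return genes present in every genome of `present_in` and absent from
    every genome of `absent_in`. Genomes not named in either are ignored.

    presence: dict mapping genome name -> set of gene IDs found in it.
    """
    if present_in:
        candidates = set.intersection(*(presence[g] for g in present_in))
    else:
        # With no 'present' constraint, start from all distinct genes.
        candidates = set().union(*presence.values())
    for genome in absent_in:
        candidates = candidates - presence[genome]
    return candidates


# Toy gene content (invented gene IDs; strain names taken from the text).
presence = {
    "K96243": {"g1", "g2", "g3", "g5"},
    "668": {"g1", "g2", "g3"},
    "1106a": {"g1", "g2", "g4"},
    "1710b": {"g1", "g4"},
}

core_genes = query_genes(presence, present_in=list(presence))  # in every genome
pan_genome = query_genes(presence)                             # in at least one
island_like = query_genes(
    presence, present_in=["K96243", "668"], absent_in=["1106a", "1710b"]
)
```

In this toy data set the core is `{"g1"}`, the pan-genome contains all five genes, and the third query returns `{"g3"}`, the only gene shared by K96243 and 668 and missing from both 1106a and 1710b.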

2.3 Sequence polymorphisms

Sequence polymorphisms (nucleotide substitutions, insertions or deletions) in gene sequences are useful for inferring phylogeny and possible loss/change of function by deleterious mutations. For each gene, a table of sequence polymorphisms, identified by multiple sequence alignment of orthologs using Muscle (Edgar, 2004), is displayed. The nucleotide and protein sequence alignment can also be generated from within each gene page. A table of all SNPs in genes common to the genomes (core genes) can be downloaded in order to derive phylogenetic relationships or to develop an overview of sequence variation.
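As a rough illustration of how a polymorphism table can be derived from a multiple sequence alignment, the following Python sketch scans the alignment column by column. It assumes the orthologs have already been aligned (for instance with Muscle) and are supplied as equal-length strings with `-` for gaps; the function name `polymorphic_sites` and the toy sequences are hypothetical and not taken from PGAT.

```python
def polymorphic_sites(alignment):
    """List the variable columns of a nucleotide alignment.

    alignment: dict mapping genome name -> aligned sequence (equal lengths,
    '-' marks a gap). Returns a list of (position, kind, column) tuples,
    where position is 1-based, kind is 'substitution' or 'indel', and
    column maps each genome to its character at that position.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")

    sites = []
    for i in range(lengths.pop()):
        column = {name: seq[i].upper() for name, seq in alignment.items()}
        if len(set(column.values())) > 1:
            kind = "indel" if "-" in column.values() else "substitution"
            sites.append((i + 1, kind, column))
    return sites


# Toy alignment of one gene in three genomes (invented sequences).
toy_alignment = {
    "genomeA": "ATGACCTA",
    "genomeB": "ATGGCCTA",
    "genomeC": "ATGACC-A",
}
toy_sites = polymorphic_sites(toy_alignment)
```

For the toy alignment, the sketch reports a substitution at position 4 and an indel at position 7. Classifying a substitution as synonymous or non-synonymous would additionally require translating the affected codons, which is omitted here.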

2.4 Metabolic pathways

The Pathways tab allows selection of a subset of genomes in which to compare the presence and absence of genes in various metabolic pathways. Expanding the metabolic pathway categories leads to tables of the numbers of genes represented in the pathway for each of the selected genomes. Genes that are functional in those pathways can be compared with the total number of genes in those pathways for the set of genomes in PGAT. The number of pseudogenes (if any) is shown in parentheses. KEGG (Kanehisa and Goto, 2000) pathway diagrams display functional genes and pseudogenes, along with a table of KO numbers and description.
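The per-genome pathway counts described above can be sketched as a simple tally. This is a hypothetical illustration: the function `pathway_table`, the status labels and the cell format (functional genes over the pathway total, with pseudogenes in parentheses) are assumptions chosen to mirror the description, not the exact layout of the PGAT tables.

```python
def pathway_table(pathway_genes, status_by_genome):
    """Summarize one pathway for each selected genome.

    pathway_genes: set of gene IDs assigned to the pathway across all genomes.
    status_by_genome: dict mapping genome name -> {gene ID: status}, where
        status is 'functional' or 'pseudogene'; missing genes are absent.
    Returns a dict mapping genome name -> text cell such as '3/4 (1)'.
    """
    total = len(pathway_genes)
    table = {}
    for genome, status in status_by_genome.items():
        functional = sum(1 for g in pathway_genes if status.get(g) == "functional")
        pseudo = sum(1 for g in pathway_genes if status.get(g) == "pseudogene")
        cell = f"{functional}/{total}"
        if pseudo:
            cell += f" ({pseudo})"  # pseudogene count shown in parentheses
        table[genome] = cell
    return table


# Toy pathway with four genes and two genomes (invented data).
toy_pathway = {"k1", "k2", "k3", "k4"}
toy_status = {
    "genomeA": {"k1": "functional", "k2": "functional", "k3": "functional", "k4": "functional"},
    "genomeB": {"k1": "functional", "k2": "pseudogene", "k3": "functional"},
}
toy_table = pathway_table(toy_pathway, toy_status)
```

Here `genomeA` carries all four pathway genes, while `genomeB` has two functional genes, one pseudogene and one gene missing.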

3 IMPLEMENTATION

The PGAT application has a relational database back end that runs on a PostgreSQL server (http://www.postgresql.org). The web interface, implemented using Perl CGI scripts, runs on an Apache web server (http://www.apache.org). A ‘demo tool’ and a tutorial are available online to introduce the user to many features of PGAT.

13 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Sequencing solution: use volunteer annotators organized via Internet.

Authors: F S Brinkman; R E Hancock; C K Stover
Journal: Nature Date: 2000-08-31 Impact factor: 49.962

3. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

4. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

5. Improved microbial gene identification with GLIMMER.

Authors: A L Delcher; D Harmon; S Kasif; O White; S L Salzberg
Journal: Nucleic Acids Res Date: 1999-12-01 Impact factor: 16.971

6. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome".

Authors: Hervé Tettelin; Vega Masignani; Michael J Cieslewicz; Claudio Donati; Duccio Medini; Naomi L Ward; Samuel V Angiuoli; Jonathan Crabtree; Amanda L Jones; A Scott Durkin; Robert T Deboy; Tanja M Davidsen; Marirosa Mora; Maria Scarselli; Immaculada Margarit y Ros; Jeremy D Peterson; Christopher R Hauser; Jaideep P Sundaram; William C Nelson; Ramana Madupu; Lauren M Brinkac; Robert J Dodson; Mary J Rosovitz; Steven A Sullivan; Sean C Daugherty; Daniel H Haft; Jeremy Selengut; Michelle L Gwinn; Liwei Zhou; Nikhat Zafar; Hoda Khouri; Diana Radune; George Dimitrov; Kisha Watkins; Kevin J B O'Connor; Shannon Smith; Teresa R Utterback; Owen White; Craig E Rubens; Guido Grandi; Lawrence C Madoff; Dennis L Kasper; John L Telford; Michael R Wessels; Rino Rappuoli; Claire M Fraser
Journal: Proc Natl Acad Sci U S A Date: 2005-09-19 Impact factor: 11.205

7. A genomic survey of positive selection in Burkholderia pseudomallei provides insights into the evolution of accidental virulence.

Authors: Tannistha Nandi; Catherine Ong; Arvind Pratap Singh; Justin Boddey; Timothy Atkins; Mitali Sarkar-Tyson; Angela E Essex-Lopresti; Hui Hoon Chua; Talima Pearson; Jason F Kreisberg; Christina Nilsson; Pramila Ariyaratne; Catherine Ronning; Liliana Losada; Yijun Ruan; Wing-Kin Sung; Donald Woods; Richard W Titball; Ifor Beacham; Ian Peak; Paul Keim; William C Nierman; Patrick Tan
Journal: PLoS Pathog Date: 2010-04-01 Impact factor: 6.823

8. Evaluating ortholog prediction algorithms in a yeast model clade.

Authors: Leonidas Salichos; Antonis Rokas
Journal: PLoS One Date: 2011-04-13 Impact factor: 3.240

9. The integrated microbial genomes system: an expanding comparative analysis resource.

Authors: Victor M Markowitz; I-Min A Chen; Krishna Palaniappan; Ken Chu; Ernest Szeto; Yuri Grechkin; Anna Ratner; Iain Anderson; Athanasios Lykidis; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2009-10-28 Impact factor: 16.971

10. Pseudomonas Genome Database: facilitating user-friendly, comprehensive comparisons of microbial genomes.

Authors: Geoffrey L Winsor; Thea Van Rossum; Raymond Lo; Bhavjinder Khaira; Matthew D Whiteside; Robert E W Hancock; Fiona S L Brinkman
Journal: Nucleic Acids Res Date: 2008-10-31 Impact factor: 16.971
