PanCGHweb: a web tool for genotype calling in pangenome CGH data.

Jumamurat R Bayjanov, Roland J Siezen, Sacha A F T van Hijum.

Abstract

A pangenome is the total set of genes present in strains of the same species. Pangenome microarrays allow the genomic content of bacterial strains to be determined more accurately than conventional comparative genome hybridization microarrays. PanCGHweb is the first tool that effectively calls genotypes based on pangenome microarray data. AVAILABILITY: The PanCGHweb web tool is accessible at http://bamics2.cmbi.ru.nl/websoftware/pancgh/.

Year: 2010; PMID: 20219865; PMCID: PMC2859125; DOI: 10.1093/bioinformatics/btq103

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Pangenome microarrays contain probes that target all known genes within related strains of the same species (Tettelin et al., 2005). Compared to conventional comparative genome hybridization (CGH) microarrays that target the gene content of a single species, they allow the genotype of a given bacterial strain to be determined more accurately (Bayjanov et al., 2009; Castellanos et al., 2009; Willenbrock et al., 2007). In pangenomes, orthologous genes can be defined as homologous genes derived by a strain divergence event from a single ancestral sequence. These orthologous genes (strain orthologs) share different levels of nucleotide sequence identity with paralogous genes (homologous genes derived by a duplication event from a single sequence) (Fitch, 1970). Effective genotyping can be achieved by grouping genes into ortholog groups (OGs) and subsequently genotyping at the level of OGs. Recently, we published an algorithm (PanCGH) that assigns OG presence/absence to each strain analyzed by pangenome microarrays (Bayjanov et al., 2009). Here, we describe a web tool, PanCGHweb, that uses this algorithm to genotype strains based on pangenome microarray data.

2 METHODS

2.1 Implementation

PanCGHweb is implemented in Python and R, and its wizard-like web interface is generated by the FG-web framework (S.A.F.T. van Hijum et al., unpublished data). The web interface has three major sections: (i) data upload; (ii) parameter settings; and (iii) display of the results (Fig. 1A). The web tool works with major web browsers such as Internet Explorer, Firefox, Safari and Opera.

Fig. 1.

The PanCGHweb web tool. (A) Process flow in PanCGHweb. (B) Histogram of presence/absence of OGs for a reference strain (in this example Lactococcus lactis IL1403). Horizontal axis: presence score of OGs. Vertical axis: number of OGs with a corresponding presence score. Black bars: frequency of presence scores of OGs that contain at least one gene from the reference strain. Gray bars: frequency of presence scores of OGs that do not contain a gene from the reference strain. (C) Phylogenetic tree of strains based on presence/absence of OGs in 39 L. lactis strains.

2.2 Input data

Open reading frame sequences for each reference bacterial strain and/or plasmid on which probes were designed should be provided by (i) selecting from the available, daily updated GenBank sequences and (ii) optionally uploading FASTA-formatted DNA sequences that are absent from the GenBank list. Normalized microarray hybridization data, in which replicated measurements are represented by a single value (e.g. by averaging), should also be provided as tab-delimited file(s). Probe sequences should be provided in FASTA format.
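As an illustration of the averaging mentioned above, the following minimal sketch collapses replicated probe measurements into a single value per probe. The two-column layout (probe identifier, log-scale intensity) and the function name average_replicates are assumptions made for this example; the exact column layout expected by PanCGHweb is not specified here.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_replicates(handle):
    """Collapse replicated probe measurements into one mean value per probe.

    Assumes a tab-delimited layout of (probe identifier, intensity);
    this layout is hypothetical and chosen only for illustration.
    """
    values = defaultdict(list)
    for probe_id, intensity in csv.reader(handle, delimiter="\t"):
        values[probe_id].append(float(intensity))
    return {probe_id: mean(v) for probe_id, v in values.items()}


# Toy input: probe_1 was measured twice, probe_2 once.
example = io.StringIO("probe_1\t6.0\nprobe_1\t7.0\nprobe_2\t4.5\n")
averaged = average_replicates(example)
```

Other summaries, such as the median, could be substituted for the mean without changing the structure of the sketch.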

2.3 Algorithm

The PanCGH algorithm calls presence/absence of OGs based on pangenome microarray data. PanCGHweb performs the following steps: (i) orthology grouping; (ii) alignment of probes to genes; and (iii) genotype calling. Step 1: Inparanoid (Remm et al., 2001) is used with its default settings (minimum bit score of 50 and confidence score of 0.25) for orthology prediction among genes of the selected reference genomes (GenBank files; see above). The run time of Inparanoid is reduced by a few orders of magnitude by adapting the software to use BLAT (Kent, 2002) for sequence alignments. Genes that are not part of the selected reference genomes can be grouped based on their homology, or each gene can form a separate group. Step 2: the microarray probes are aligned by BLAT to the individual gene members of each OG. Probes that could not be aligned to any gene and genes with no matching probes are reported. Step 3: using the PanCGH algorithm (Bayjanov et al., 2009), the fluorescence signal intensities of the probes associated with each gene are summarized to a gene score (the most frequently occurring signal intensity). The maximum of the gene scores of all gene members of an OG is used as the presence score for that OG. An OG is considered to be present if its presence score is above the threshold of 5.5 on a log scale. The steps involved in determining the optimal threshold value are described on the PanCGHweb web site.
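Step 3 can be sketched as follows. This is a simplified illustration, not the PanCGH implementation: the "most frequently occurring signal intensity" is approximated here by the centre of the most populated fixed-width bin, and the bin width of 0.5 as well as all function names are assumptions of this sketch.

```python
from collections import Counter

PRESENCE_THRESHOLD = 5.5  # default threshold (log scale) quoted in the text


def gene_score(probe_intensities, bin_width=0.5):
    """Approximate the most frequent intensity of a gene's probes.

    Intensities are grouped into fixed-width bins (an assumption of this
    sketch) and the centre of the fullest bin is returned. Ties are
    resolved in favour of the higher-intensity bin.
    """
    bins = Counter(int(x // bin_width) for x in probe_intensities)
    top_bin, _ = max(bins.items(), key=lambda kv: (kv[1], kv[0]))
    return (top_bin + 0.5) * bin_width


def presence_score(og_genes):
    """Presence score of an OG: the maximum gene score over its members."""
    return max(gene_score(probes) for probes in og_genes.values())


def call_og(og_genes, threshold=PRESENCE_THRESHOLD):
    """Call an OG present when its presence score is above the threshold."""
    return presence_score(og_genes) > threshold


# Toy OG with two member genes and their log-scale probe intensities.
og = {
    "gene_a": [6.1, 6.2, 6.3, 3.0],
    "gene_b": [2.1, 2.2, 2.4],
}
og_score = presence_score(og)
og_present = call_og(og)
```

Taking the maximum over member genes means that a strong signal for any one member is sufficient to call the whole OG present, which matches the description of the presence score above.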

2.4 Output of the algorithm

Results of PanCGHweb include: (i) a projection plot, which overlays presence/absence of OGs on the selected genomes; (ii) a histogram of the presence scores of OGs for any reference strain, which can be used to validate whether the default threshold of 5.5 is an optimal choice for presence/absence calling (Fig. 1B); (iii) receiver operating characteristic (ROC) curves using all possible presence/absence calling thresholds for all reference strains; (iv) two different phylogenetic trees of strains, one based on presence/absence values and the other based on presence scores, which enable the genomic diversity among all strains to be estimated (Fig. 1C); (v) a hierarchical tree based on the signal intensity values of all arrays; (vi) a box and whisker plot that shows the signal intensity distribution among all arrays; and (vii) orthology grouping information and presence/absence of genes in each strain. Additionally, the following tab-delimited files can be downloaded: the list of OGs, the alignment of probes to genes, presence/absence of OGs and presence scores of OGs.
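The threshold sweep behind the ROC curves in (iii) can be illustrated with the sketch below. It assumes that, for a reference strain, an OG is truly present when it contains at least one gene from that strain (the distinction drawn in Fig. 1B); how PanCGHweb itself derives the curves is described on its web site, and the function name roc_points is hypothetical.

```python
def roc_points(scores, truly_present):
    """Return (threshold, false positive rate, true positive rate) tuples.

    Every distinct presence score is tried as a calling threshold, and an
    OG is called present when its score is above that threshold. Assumes
    at least one truly present and one truly absent OG.
    """
    positives = sum(truly_present)
    negatives = len(truly_present) - positives
    points = []
    for threshold in sorted(set(scores)):
        called = [s > threshold for s in scores]
        tp = sum(c and t for c, t in zip(called, truly_present))
        fp = sum(c and not t for c, t in zip(called, truly_present))
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Toy presence scores for six OGs of a reference strain; the first three
# OGs contain a gene from that strain, the last three do not.
scores = [7.0, 6.5, 6.0, 3.0, 2.5, 5.8]
truth = [True, True, True, False, False, False]
points = roc_points(scores, truth)
```

In this toy example, a threshold of 5.8 separates the two groups perfectly (false positive rate 0, true positive rate 1), which is the kind of check the histogram and ROC outputs are intended to support.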

3 CONCLUSIONS

For genotyping, pangenome microarrays offer a cost-effective alternative to DNA sequencing and allow genomic content to be determined more accurately than with standard CGH techniques. We have developed a web tool for pangenome microarray analysis based on our PanCGH algorithm. It enables researchers to analyze these complex hybridization data in a facile and transparent way to understand genomic diversity among related strains.

REFERENCES

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Discovery of stable and variable differences in the Mycobacterium avium subsp. paratuberculosis type I, II, and III genomes by pan-genome microarray analysis.

Authors: Elena Castellanos; Alicia Aranaz; Katherine A Gould; Richard Linedale; Karen Stevenson; Julio Alvarez; Lucas Dominguez; Lucia de Juan; Jason Hinds; Tim J Bull
Journal: Appl Environ Microbiol Date: 2008-12-01 Impact factor: 4.792

3. Distinguishing homologous from analogous proteins.

Authors: W M Fitch
Journal: Syst Zool Date: 1970-06

4. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome".

Authors: Hervé Tettelin; Vega Masignani; Michael J Cieslewicz; Claudio Donati; Duccio Medini; Naomi L Ward; Samuel V Angiuoli; Jonathan Crabtree; Amanda L Jones; A Scott Durkin; Robert T Deboy; Tanja M Davidsen; Marirosa Mora; Maria Scarselli; Immaculada Margarit y Ros; Jeremy D Peterson; Christopher R Hauser; Jaideep P Sundaram; William C Nelson; Ramana Madupu; Lauren M Brinkac; Robert J Dodson; Mary J Rosovitz; Steven A Sullivan; Sean C Daugherty; Daniel H Haft; Jeremy Selengut; Michelle L Gwinn; Liwei Zhou; Nikhat Zafar; Hoda Khouri; Diana Radune; George Dimitrov; Kisha Watkins; Kevin J B O'Connor; Shannon Smith; Teresa R Utterback; Owen White; Craig E Rubens; Guido Grandi; Lawrence C Madoff; Dennis L Kasper; John L Telford; Michael R Wessels; Rino Rappuoli; Claire M Fraser
Journal: Proc Natl Acad Sci U S A Date: 2005-09-19 Impact factor: 11.205

5. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

Authors: M Remm; C E Storm; E L Sonnhammer
Journal: J Mol Biol Date: 2001-12-14 Impact factor: 5.469

6. PanCGH: a genotype-calling algorithm for pangenome CGH data.

Authors: Jumamurat R Bayjanov; Michiel Wels; Marjo Starrenburg; Johan E T van Hylckama Vlieg; Roland J Siezen; Douwe Molenaar
Journal: Bioinformatics Date: 2009-01-07 Impact factor: 6.937

7. Characterization of probiotic Escherichia coli isolates with a novel pan-genome microarray.

Authors: Hanni Willenbrock; Peter F Hallin; Trudy M Wassenaar; David W Ussery
Journal: Genome Biol Date: 2007 Impact factor: 13.583
