Berkeley Phylogenomics Group web servers: resources for structural phylogenomic analysis.

Jake Gunn Glanville, Dan Kirshner, Nandini Krishnamurthy, Kimmen Sjölander.

Abstract

Phylogenomic analysis addresses the limitations of function prediction based on annotation transfer, and has been shown to enable the highest accuracy in prediction of protein molecular function. The Berkeley Phylogenomics Group provides a series of web servers for phylogenomic analysis: classification of sequences to pre-computed families and subfamilies using the PhyloFacts Phylogenomic Encyclopedia, FlowerPower clustering of proteins sharing the same domain architecture, MUSCLE multiple sequence alignment, SATCHMO simultaneous alignment and tree construction, and SCI-PHY subfamily identification. The PhyloBuilder web server provides an integrated phylogenomic pipeline that starts with a user-supplied protein sequence and proceeds through homolog identification, multiple alignment, phylogenetic tree construction, subfamily identification and structure prediction. The Berkeley Phylogenomics Group resources are available at http://phylogenomics.berkeley.edu.

Year: 2007; PMID: 17488835; PMCID: PMC1933202; DOI: 10.1093/nar/gkm325

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

The standard protocol for gene function prediction involves homology-based annotation transfer (e.g. using the top BLAST hit); this approach is now known to be fraught with systematic errors (1–3). Biological processes such as gene duplication, mutation at critical residues, speciation and domain shuffling contribute to modifications of the original function that significantly complicate the process of functional annotation (1,4–6). Existing annotation errors can also be propagated by homology-based annotation transfer (7).

Phylogenomic inference of gene function has been shown to be the most robust and accurate method for functional annotation. It enables the function of a protein to be inferred in an evolutionary context, avoiding the pitfalls of approaches based on simple pairwise sequence comparison and substantially improving the accuracy of functional annotation (8–10). Phylogenomic analysis proceeds in stages, starting with homolog identification and multiple sequence alignment (MSA). The (masked) alignment is then used as input to phylogenetic tree construction. Examination of the tree topology enables biologists to discriminate between orthologs (with presumably conserved function) and paralogs (related by gene duplication, and potentially divergent in function), providing improved discrimination of specific function when a protein family has evolved multiple related but distinct functions (11,12). To increase the confidence in function prediction, the source of the annotations can be examined; the Gene Ontology resource includes evidence codes for annotations for this purpose (13).

The Berkeley Phylogenomics Group has developed a series of web servers for individual steps in a phylogenomic pipeline, and a single web server, PhyloBuilder, that performs all the steps, as shown in Figure 1. Each web server can be used individually or in combination for phylogenomic inference.

Figure 1.

Berkeley Phylogenomics Group web servers for the different steps of a phylogenomic pipeline. Top: Users can submit sequences for classification against the PhyloFacts Phylogenomic Encyclopedia of pre-computed families and subfamilies. Middle: The phylogenomic pipeline. Bottom: Web servers for specific tasks in the pipeline. Many of these servers cover more than one step in the process, e.g. the PhyloBuilder web server, which performs all the steps of the pipeline and outputs an MSA, subfamilies, domain/3D structure predictions and phylogenetic trees overlaid with annotations.

Each server includes Java applets for viewing the associated data; data can also be downloaded in standard formats. Users can bookmark a results page, or choose to receive results by email.

PHYLOFACTS PHYLOGENOMIC ENCYCLOPEDIA

PhyloFacts enables functional classification of user-submitted sequences to pre-computed families and subfamilies from across the Tree of Life (14). Hidden Markov models are provided for functional classification of novel sequences to families and subfamilies. PhyloFacts protein family ‘books’ include an MSA, phylogenetic trees, predicted structures and critical residues, experimental and annotation data, hidden Markov models, and links to other resources. Since the initial publication (14), the PhyloFacts resource has significantly increased in size, from ∼9000 families in May 2006 to >27 000 families in April 2007. Most of this growth expands our coverage of microbial gene families and gene families found in the human genome, including homologs in other species. New functionality added to PhyloFacts over the past year includes super-fast classification of user-submitted sequences to global homology groups (proteins sharing the same domain architecture) and a new protocol for functional sub-classification.

Usage: Users can submit DNA or protein sequences in FASTA format for classification to PhyloFacts families and subfamilies. PhyloFacts family books are selected for HMM scoring by a pre-processing BLAST search of the query sequences against the consensus sequence of each family in the resource; HMMs from families with BLAST E-values of 10 or better are scored against the query. An example output from PhyloFacts is shown in Figure 2. Clicking on ‘View Alignment’ displays the pairwise alignment between the submitted query and the consensus sequence, along with statistics about the alignment. Clicking in the ‘Search subfamilies’ box for families of interest, followed by clicking on the box at the bottom labeled ‘Search selected books for top-scoring subfamily HMMs against query’, initiates the subfamily HMM-based classification; logistic regression analysis is used to differentiate sequences that can be assigned to the top-scoring subfamily from those that represent novel subtypes. Users can examine PhyloFacts protein family books by following links in the ‘PhyloFacts book’ column in the table of results. Super-fast classification of query sequences to families with global similarity is provided (results would otherwise include local matches). Users can bookmark a results page, or choose to receive results by email. PhyloFacts is available at http://phylogenomics.berkeley.edu/phylofacts.
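The pre-processing step described above, in which only family books whose consensus sequence receives a BLAST E-value of 10 or better go on to HMM scoring, can be illustrated with a minimal Python sketch. This is not the PhyloFacts implementation: the function name, the dictionary layout and the family identifiers are hypothetical, and only the E-value cutoff of 10 comes from the text.

```python
from typing import Dict, List

# Cutoff stated in the text: families with BLAST E-values of 10 or better.
EVALUE_CUTOFF = 10.0


def select_books_for_hmm_scoring(blast_evalues: Dict[str, float],
                                 cutoff: float = EVALUE_CUTOFF) -> List[str]:
    """Return family identifiers whose consensus-sequence BLAST E-value is
    at or below the cutoff, ordered from best (lowest E-value) to worst."""
    passing = [family for family, evalue in blast_evalues.items()
               if evalue <= cutoff]
    return sorted(passing, key=lambda family: blast_evalues[family])


# Hypothetical family identifiers and E-values, for illustration only.
hits = {"book_A": 1e-30, "book_B": 4.2, "book_C": 57.0}
selected = select_books_for_hmm_scoring(hits)
```

In this toy input, `book_C` is excluded because its E-value exceeds the cutoff; the HMMs of the two remaining books would then be scored against the query.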

Figure 2.

Result of functional classification against PhyloFacts. The figure shows HMM scoring results for the UniProt sequence Q6BH13 from Debaryomyces hansenii. The search retrieves protein family books constructed using three different protocols: global homology, conserved region and domain. Subfamily classification is enabled by selecting books (clicking in boxes at left side of table, under ‘Search subfamilies’) followed by clicking the ‘Go’ button at bottom. See text for details.

FLOWERPOWER HOMOLOGY DETECTION

FlowerPower is an iterative homology-detection server akin to PSI-BLAST (15), but designed specifically for phylogenomic inference of function (16). FlowerPower is optimized for the retrieval of sequences sharing the same domain architecture; this prevents transfer of database annotation based on partial homology (i.e. local instead of global similarity). FlowerPower uses iterative subfamily hidden Markov model (HMM) searches against PSI-BLAST-identified homologs and alignment analysis to discriminate between partial and global homologies; this approach outperforms existing methods in gathering global homologs.

Usage: The input to FlowerPower is a protein sequence in FASTA format; default parameters search the UniProt (17) database for proteins sharing the same domain architecture. The ‘Advanced Settings’ page enables users to modify the PSI-BLAST parameters for the database searched, the number of iterations and the maximum number of hits returned. Parameters for the iterated search with subfamily HMMs can also be modified. Finally, users can choose between two homolog-selection modes: global (to both query and hit) and ‘glocal’ (global-local homology: retrieved sequences must align over a specific region, but can have additional structure). Results include the selected sequences, the raw FlowerPower alignment, a MUSCLE (18) re-alignment, and the results of the initial PSI-BLAST search. FlowerPower is available through http://phylogenomics.berkeley.edu/flowerpower/.
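The difference between the two homolog-selection modes can be illustrated with a small coverage-based filter in Python. This is only a sketch of the idea, not FlowerPower's actual criteria, which are not given here: the `PairwiseHit` record, the `passes_selection` function and the 0.7 coverage threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairwiseHit:
    """Hypothetical summary of one query-hit alignment."""
    query_len: int      # length of the query sequence
    hit_len: int        # length of the database hit
    query_aligned: int  # query residues inside the aligned region
    hit_aligned: int    # hit residues inside the aligned region


def passes_selection(hit: PairwiseHit, mode: str = "global",
                     min_coverage: float = 0.7) -> bool:
    """Apply an illustrative coverage test.

    global: the alignment must cover most of BOTH query and hit.
    glocal: the alignment must cover most of the query region only,
            so the hit may carry additional structure.
    The 0.7 threshold is an assumed value, not taken from the paper.
    """
    query_cov = hit.query_aligned / hit.query_len
    hit_cov = hit.hit_aligned / hit.hit_len
    if mode == "global":
        return query_cov >= min_coverage and hit_cov >= min_coverage
    if mode == "glocal":
        return query_cov >= min_coverage
    raise ValueError(f"unknown mode: {mode!r}")


# A hit that matches the whole query but is three times longer.
long_hit = PairwiseHit(query_len=200, hit_len=600,
                       query_aligned=190, hit_aligned=195)
```

Under these assumptions, `long_hit` is rejected in global mode, because most of the hit lies outside the alignment, but accepted in glocal mode.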

MULTIPLE SEQUENCE ALIGNMENT USING MUSCLE

The MUSCLE software produces high-accuracy multiple sequence alignments, with outstanding scores on benchmark dataset tests; it is also very fast, making it suitable for large-scale application (18). We employ MUSCLE in our internal pipeline for the PhyloFacts Phylogenomic Encyclopedia construction (14).

Usage: The input to MUSCLE is a set of protein sequences in FASTA format. Alignments can be viewed online or downloaded in Aligned FASTA format. MUSCLE is available at http://phylogenomics.berkeley.edu/muscle.

SATCHMO

SATCHMO (Simultaneous Alignment and Tree Construction using Hidden Markov mOdels) is a progressive method of multiple sequence alignment that uses agglomerative clustering to estimate a phylogenetic tree simultaneously with the alignment. SATCHMO uses Dirichlet mixture densities (19) to construct profiles, and profile–profile scoring and alignment (20–23) to determine the phylogenetic tree topology. Each node in the tree contains an MSA and a corresponding profile. As sequences diverge in evolution, small insertions, deletions and mutations result in changes in structure and function; SATCHMO is intended to model these changes in different lineages in a family. Profiles and alignments at internal nodes in the tree represent the sequences descending from that node and may be of different lengths. The alignment at the root of the tree is an estimate of the conserved core structure defining all family members; when highly divergent sequences are input to SATCHMO, this root alignment may be a small fraction of the average sequence length. Tree topologies produced using SATCHMO are consistent with expert-defined subtypes; alignment accuracy is also high (20).

Usage: The input to SATCHMO is a set of unaligned protein sequences in FASTA format. The SATCHMO root alignment can be viewed online using a Java applet or downloaded from the website. Special SATCHMO tree-alignment viewing software is available online (currently for PCs only), enabling the different alignments descending from each internal node of the tree to be examined separately. SATCHMO is available at http://phylogenomics.berkeley.edu/satchmo.

SCI-PHY AND SUBFAMILY HMM CONSTRUCTION

SCI-PHY (Subfamily Classification in PHYlogenomics) uses Bayesian and information-theoretic approaches to construct a hierarchical tree and cut the tree into subtrees to identify functional subfamilies (24). Subfamily hidden Markov models are constructed using Dirichlet mixture densities to derive a position- and subfamily-specific weighting scheme to share information across subfamilies; this has been shown to increase the separation between homologous and unrelated sequences and to provide high specificity of classification (25).

Usage: The input to SCI-PHY is an MSA in either Aligned FASTA or the UCSC A2M format. Outputs include the MSA divided into subfamilies, the SCI-PHY tree, and subfamily and family HMMs in both HMMER and UCSC SAM formats. The SCI-PHY tree can be downloaded or viewed online using the Java ATV applet (26). SCI-PHY is available at http://phylogenomics.berkeley.edu/SCI-PHY.

PHYLOBUILDER

PhyloBuilder is an automated computational pipeline for phylogenomic analysis, starting from an input protein sequence. PhyloBuilder is a modified version of the pipeline we use to populate the PhyloFacts Phylogenomic Encyclopedia with protein family books (14). The PhyloBuilder pipeline has multiple stages, as shown in Figure 1.

In stage 1, FlowerPower is used to retrieve global homologs for the user-supplied sequence. Program parameters for this stage are set by default to maximize the retrieval of proteins sharing a common domain architecture; alternative settings are provided to enable users to request the selection of glocal homologs (sequences sharing a common domain but which may have different overall folds). If fewer than three sequences matching user criteria are identified, the program skips stages 2 and 3 and jumps directly to stage 4.

In stage 2, the FlowerPower cluster is aligned using MUSCLE, followed by alignment masking in preparation for phylogenetic analysis (removing columns containing >70% gap characters).

In stage 3, the masked alignment is used as the basis for neighbor-joining tree construction using the PHYLIP software (27), and is submitted to the SCI-PHY software for subfamily identification.

In stage 4, Gene Ontology annotations and evidence codes (13), Enzyme Classification data, and other data are retrieved for sequences in the cluster and put into a spreadsheet, separated into SCI-PHY subfamilies. The species of origin, accession and definition lines are overlaid on the neighbor-joining tree, and can be viewed using the ATV tree-viewer/editor Java applet.

In stage 5, domain and 3D-structure predictions for the family as a whole are performed based on analysis of the consensus sequence for the family: PFAM domains (28) are predicted (using the PFAM gathering threshold), transmembrane domains and signal peptides are predicted using the Phobius server (29), and homologous 3D structures are identified using BLAST analysis against the Protein Data Bank (PDB) (30).

The phylogenetic trees produced by the PhyloBuilder web server can be used to identify orthologs manually; users can also download these trees and alignments for input to automated ortholog identification programs such as Orthostrapper (11) and RIO (12).

Usage: Users paste in (or upload) a protein sequence for analysis. Results are stored for ten days; users can request long-term storage of these results. PhyloBuilder program outputs include a multiple sequence alignment, phylogenetic tree, subfamily identification, predicted domain/3D structure, and experimental and annotation data (see Figure 3). PhyloBuilder is available at http://phylogenomics.berkeley.edu/phylobuilder.
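The alignment masking step of stage 2 (removing columns containing >70% gap characters) can be sketched in a few lines of Python. This is a minimal illustration rather than the PhyloBuilder code: the function name `mask_alignment` and the treatment of both `-` and `.` as gap characters are assumptions, and the input is taken to be a list of equal-length aligned sequences.

```python
from typing import List

# Characters treated as gaps (an assumption; '.' marks gaps in some formats).
GAP_CHARS = {"-", "."}


def mask_alignment(rows: List[str], max_gap_fraction: float = 0.7) -> List[str]:
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    rows holds the aligned sequences, all of the same length. A column is
    kept when at most 70% (by default) of its characters are gaps.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = []
    for col in range(width):
        gaps = sum(1 for row in rows if row[col] in GAP_CHARS)
        if gaps / len(rows) <= max_gap_fraction:
            kept_columns.append(col)

    return ["".join(row[col] for col in kept_columns) for row in rows]


# Toy alignment: the third column is all gaps and is removed.
toy = ["AC-G", "A--G", "A--T", "AK-T"]
masked = mask_alignment(toy)
```

The second column of the toy alignment is 50% gaps and is therefore retained; only the all-gap third column is dropped before tree construction.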

Figure 3.

Result of PhyloBuilder run for human Caspase-1. PhyloBuilder takes an input protein sequence and outputs a web page containing a cluster of homologous proteins, multiple sequence alignment, neighbor-joining tree, predicted subfamilies, PFAM domains, transmembrane domains and signal peptides, and retrieval of Gene Ontology (GO) and Enzyme Classification (EC) data. Top: Summary data include the number of homologs retrieved, taxonomic distribution, EC numbers and GO annotations and evidence codes. SCI-PHY subfamilies can be viewed by clicking on the link labeled ‘View details’ (see inset at top right). The multiple sequence alignment can be viewed using JalView or hypertext. A spreadsheet with annotations for all sequences is available under ‘Experimental and annotation data for sequences in family’. Middle: PFAM and transmembrane domain/signal peptide predictions are displayed. Neighbor-joining and SCI-PHY trees can be viewed using ATV. Homologous 3D structures can be viewed using JMOL; residues predicted to be critical using evolutionary conservation analysis are displayed on the structure. Catalytic Site Atlas data are included. Bottom: Various downloads are available, including a multiple sequence alignment for the family and individual subfamilies, a FASTA file for all PSI-BLAST hits, NJ tree and HMMs for the family and SCI-PHY subfamilies in HMMER and SAM formats. An ‘Edit book info’ button enables users to add descriptive labels to families and subfamilies (as shown in the inset at top right). See text for details.

FUTURE WORK

The PhyloFacts Phylogenomic Encyclopedia is under continuous expansion; we plan to continue our development of this resource to cover all protein families across the Tree of Life. The conservative parameterization of the homology clustering component of the PhyloBuilder server occasionally results in a somewhat restrictive set of homologs when global homology is enforced. We plan to explore PhyloBuilder parameter settings that retain selectivity while optimizing sensitivity, and to allow users to input a multiple sequence alignment constructed independently instead of depending on the FlowerPower clustering used in PhyloBuilder.

Computational efficiency remains a significant challenge in phylogenomic inference. Many of the steps in a phylogenomic pipeline are computationally intensive; this causes us to limit the size of inputs and the number of jobs submitted per day (see individual web server pages for guidelines). We plan to improve the computational efficiency of these servers and to increase the size of our compute cluster in order to overcome this limitation.

29 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Errors in genome annotation.

Authors: S E Brenner
Journal: Trends Genet Date: 1999-04 Impact factor: 11.639

3. Phylogenetic inference in protein superfamilies: analysis of SH2 domains.

Authors: K Sjölander
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1998

Review 4. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis.

Authors: J A Eisen
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

5. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology.

Authors: K Sjölander; K Karplus; M Brown; R Hughey; A Krogh; I S Mian; D Haussler
Journal: Comput Appl Biosci Date: 1996-08

Review 6. Predicting functions from protein sequences--where are the bottlenecks?

Authors: P Bork; E V Koonin
Journal: Nat Genet Date: 1998-04 Impact factor: 38.330

Review 7. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

8. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function.

Authors: Nandini Krishnamurthy; Duncan Brown; Kimmen Sjölander
Journal: BMC Evol Biol Date: 2007-02-08 Impact factor: 3.260

10. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169
