ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids.

Haim Ashkenazy, Elana Erez, Eric Martz, Tal Pupko, Nir Ben-Tal.

Abstract

It is informative to detect highly conserved positions in proteins and nucleic acid sequence/structure since they are often indicative of structural and/or functional importance. ConSurf (http://consurf.tau.ac.il) and ConSeq (http://conseq.tau.ac.il) are two well-established web servers for calculating the evolutionary conservation of amino acid positions in proteins using an empirical Bayesian inference, starting from protein structure and sequence, respectively. Here, we present the new version of the ConSurf web server that combines the two independent servers, providing an easier and more intuitive step-by-step interface, while offering the user more flexibility during the process. In addition, the new version of ConSurf calculates the evolutionary rates for nucleic acid sequences. The new version is freely available at: http://consurf.tau.ac.il/.

Year: 2010 PMID: 20478830 PMCID: PMC2896094 DOI: 10.1093/nar/gkq399

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The degree to which an amino (or nucleic) acid position is evolutionarily conserved is strongly dependent on its structural and functional importance. Thus, conservation analysis of positions among members of the same family can often reveal the importance of each position for the structure or function of the protein (or nucleic acid). ConSurf (1,2) and ConSeq (3) are web servers for calculating the evolutionary rate of each position of the protein and for identifying structurally and functionally important regions within proteins. The degree of conservation of each position is the inverse of the site’s evolutionary rate; rapidly evolving positions are variable while slowly evolving positions are conserved. In ConSurf, the evolutionary rate is estimated based on the evolutionary relatedness between the protein and its homologues, taking into account the similarity between amino acids as reflected in the substitution matrix (4,5). One of the advantages of ConSurf in comparison to other methods is the accurate computation of the evolutionary rate by using either an empirical Bayesian method or a maximum likelihood (ML) method (5). The differences between the two methods are explained in detail in reference (4). The strength of these methods is that they explicitly account for the stochastic process underlying the evolution of the analyzed sequences, and that they rely on the phylogeny of the sequences. Thus, they can correctly discriminate between conservation due to short evolutionary time and genuine sequence conservation. In addition, the Bayesian-based method provides reliability estimates for the site-specific conservation scores.
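To make the idea of a Bayesian rate estimate with a reliability measure more concrete, the following is a minimal sketch and not the Rate4Site implementation. It assumes the rate prior has been discretized into a few categories and that the likelihood of the alignment column under each category has already been computed from the tree; the function name `posterior_rate` and all numbers are hypothetical.

```python
import math

def posterior_rate(likelihoods, rates, prior):
    """Posterior mean and standard deviation of a site's evolutionary rate.

    likelihoods[k] is P(column | rate = rates[k]) and prior[k] is the prior
    weight of rate category k. All inputs here are illustrative assumptions.
    """
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("likelihoods must give a positive total probability")
    posterior = [j / total for j in joint]
    mean = sum(r * w for r, w in zip(rates, posterior))
    variance = sum(w * (r - mean) ** 2 for r, w in zip(rates, posterior))
    return mean, math.sqrt(variance)

# Three hypothetical rate categories with equal prior weight.
rates = [0.25, 1.0, 2.5]
prior = [1 / 3, 1 / 3, 1 / 3]

# A column best explained by a slow rate versus one best explained by a fast rate.
conserved_site = posterior_rate([0.9, 0.3, 0.01], rates, prior)
variable_site = posterior_rate([0.01, 0.3, 0.9], rates, prior)
```

The posterior mean plays the role of the site's rate estimate, and the spread of the posterior gives one possible measure of how reliable that estimate is.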

METHODS

A short description of the methodology is provided below. A more detailed description is available at http://consurf.tau.ac.il/, under ‘OVERVIEW’, ‘QUICK HELP’ and ‘FAQ’.

ConSurf protocol

A flowchart of the ConSurf web server is shown in Figure 1 and detailed below.

Figure 1.

A flowchart of ConSurf protocol.

(i) The sequence is extracted from the 3D structure (if given).
(ii) Homologous sequences are collected using a BLAST (or PSI-BLAST) (6,7) search against a selected database. The user may specify criteria for defining homologues. The user can also manually select the desired sequences from the BLAST results.
(iii) The sequences are clustered and highly similar sequences are removed using CD-HIT (8).
(iv) A multiple sequence alignment (MSA) of the homologous sequences is constructed using MAFFT, PRANK, T-COFFEE, MUSCLE or CLUSTALW.
(v) A phylogenetic tree is reconstructed based on the MSA, using the neighbor-joining algorithm as implemented in the Rate4Site program (4,5).
(vi) Position-specific conservation scores are computed using the empirical Bayesian or ML algorithms (4,5).
(vii) The continuous conservation scores are divided into a discrete scale of nine grades for visualization, from the most variable positions (grade 1) colored turquoise, through intermediately conserved positions (grade 5) colored white, to the most conserved positions (grade 9) colored maroon.
(viii) The conservation scores are projected onto the protein/nucleotide sequence and on the MSA.
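The discretization of continuous scores into nine grades can be sketched as follows. This is only an illustration: it assumes equal-width bins over the observed range of rates, which may differ from the binning the server actually uses, and the function name `conservation_grades` is hypothetical.

```python
def conservation_grades(rates, n_grades=9):
    """Map per-position evolutionary rates to grades 1 (variable) to 9 (conserved).

    Assumption: equal-width bins between the slowest and fastest observed rate.
    """
    lo, hi = min(rates), max(rates)
    if hi == lo:
        # No variation in rate: place every position in the middle grade.
        return [(n_grades + 1) // 2] * len(rates)
    width = (hi - lo) / n_grades
    grades = []
    for r in rates:
        # Bin 0 holds the slowest rates; the maximum is clamped into the last bin.
        bin_index = min(int((r - lo) / width), n_grades - 1)
        # A slow rate means a conserved position, so invert the bin order.
        grades.append(n_grades - bin_index)
    return grades

# Colors named in the text for the two extremes and the midpoint of the scale.
GRADE_COLORS = {1: "turquoise", 5: "white", 9: "maroon"}

example_grades = conservation_grades([0.0, 1.0, 4.4, 9.0])
```

Because conservation is the inverse of the evolutionary rate, the slowest position receives grade 9 and the fastest receives grade 1.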

Outputs

If a protein 3D structure is provided:
The nine-color conservation scores are projected onto the 3D structure of the query protein and the colored protein structure is shown by FirstGlance in Jmol (http://firstglance.jmol.org).
Scripts for visualizing the protein colored with ConSurf scores are generated for PyMol (http://www.pymol.org; 9), Chimera (10), Jmol (http://www.jmol.org/; 11) and RasMol (12).

For all cases, ConSurf creates the following outputs:
The sequence and MSA colored by ConSurf conservation scores.
A text file that summarizes for each position the normalized score calculated, the assigned color, the reliability estimation (for the Bayesian method) and the amino acids/nucleotides observed in the respective MSA column.
The sequences selected for the MSA and the MSA constructed (unless those files were uploaded by the user).
A file with the frequency of each amino acid/nucleotide observed in each column of the MSA.
The evolutionary tree, which was calculated by the server or uploaded by the user, is shown using an interactive Java applet written for that purpose.

For proteins in which the 3D structure was not provided by the user, an up-to-date version of the Protein Data Bank (13) is searched for relevant homologues. If a structure of at least one homologous protein is available, the user may map the conservation scores on the structure. This option should ease the procedure for non-expert users, who may be unfamiliar with the 3D structure of the homologue. This option can also be useful for analyzing proteins that share the same sequence but differ in their 3D structure (for example, two structures solved in different conformations or with different ligands).

As an example, we provide the main output of a ConSurf run for the N-terminal region of the GAL4 transcription factor in yeast (PDB ID: 3COQ, chains A and B) in complex with its DNA recognition site (Figure 2). The analysis revealed, as expected, that the functional regions of this protein are highly conserved. For example, all the cysteines that form the Zn(2)-C6 DNA binding domain (CYS11, CYS14, CYS21, CYS28, CYS31, CYS38; 14) were assigned the highest conservation scores. Likewise, PRO26, which is known to be central for DNA binding (15), is also highly conserved according to our analysis. In addition, other amino acid residues that are in contact with the DNA (i.e. GLN9, LYS17, LYS18, LYS20, ARG15, LYS23; 16) are relatively conserved.

Figure 2.

A ConSurf analysis for the GAL4 transcription factor and its DNA binding site. The 3D structure of the N-terminal region of the GAL4 transcription factor in yeast bound to the DNA is presented using a space-filled model. The amino acids and the nucleotides are colored by their conservation grades using the color-coding bar, with turquoise-through-maroon indicating variable-through-conserved. Positions for which the inferred conservation level was assigned with low confidence are marked with light yellow. The figure reveals that the functionally important regions on both the DNA and the protein are highly conserved. The run was carried out using PDB code 3COQ and the figure was generated using the PyMol (10) script output by ConSurf.

ConSurf was also applied to nucleic acid sequences from yeast, which are the known binding sites of GAL4 and their adjacent neighborhood (Figure 2). As anticipated, the analysis revealed that the consensus pattern CGG-N11-CCG typical of the GAL4 binding site is highly conserved. An extended full ConSurf analysis of this example is available in the ‘GALLERY’ section on the ConSurf web site.

NEW ADDITIONS AND IMPROVEMENTS IN ConSurf 2010

Analyzing nucleic acid sequences

Despite increasing interest in the non-coding fraction of transcriptomes, the number, the level of conservation, and functions, if any, of many non-protein-coding transcripts remain to be discovered. However, it has already been shown that many of the non-coding sequences are connected to regulatory processes. The new version of ConSurf offers estimations of the evolutionary rate for each position of nucleic acid sequences in the same manner used for amino acid residues. For that purpose, four evolutionary models were implemented in the Rate4Site program: (i) the Jukes and Cantor 69 model (JC69), which assumes equal base frequencies and equal substitution rates (17); (ii) the Tamura 92 model, which uses only one parameter that captures variation in G-C content (18); (iii) the HKY85 model, which distinguishes between transitions and transversions and allows unequal base frequencies (19); and (iv) the General Time Reversible (GTR) model, which is the most general time-reversible model. The GTR parameters consist of an equilibrium base frequency vector, giving the frequency at which each base occurs at each site, and the rate matrix (20). When enough data (i.e. sequences) are available, the GTR model is superior to the more simplified Tamura 92 model. However, the Tamura 92 model is recommended in cases in which the data are not sufficient for reliable estimation of the model parameters, and thus it is the default option for analyzing nucleic acid sequences in ConSurf.
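The relationship between the simpler models can be illustrated with a short sketch that builds their instantaneous rate matrices. It is not taken from Rate4Site: the function names are hypothetical, the matrices are scaled to one expected substitution per unit time, and the Tamura 92 variant here exposes a transition/transversion ratio `kappa` alongside the G-C content as an assumption of the sketch. GTR is omitted.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)

def hky85_rate_matrix(freqs, kappa):
    """HKY85 rate matrix with rows and columns ordered A, C, G, T."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = freqs[b] * (kappa if is_transition(a, b) else 1.0)
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    # Rescale so that the expected substitution rate at equilibrium is 1.
    mean_rate = -sum(freqs[a] * q[i][i] for i, a in enumerate(BASES))
    return [[x / mean_rate for x in row] for row in q]

def jc69_rate_matrix():
    """JC69 as a special case: equal base frequencies, equal rates."""
    return hky85_rate_matrix({b: 0.25 for b in BASES}, kappa=1.0)

def t92_rate_matrix(gc_content, kappa=1.0):
    """Tamura 92 as a special case: frequencies fixed by the G-C content."""
    freqs = {"G": gc_content / 2, "C": gc_content / 2,
             "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2}
    return hky85_rate_matrix(freqs, kappa)

jc = jc69_rate_matrix()
t92 = t92_rate_matrix(gc_content=0.6, kappa=2.0)
```

Written this way, JC69 and Tamura 92 are restricted forms of HKY85, which is consistent with the recommendation to prefer a model with fewer parameters when few sequences are available.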

Improved substitution matrix for protein sequences

The LG substitution matrix, which incorporates variability of evolutionary rates across sites in the matrix estimation, was shown to outperform other substitution matrices for proteins (21). The LG matrix was added to Rate4Site and is offered in the new version of ConSurf in addition to the previous substitution models: JTT (22), Dayhoff (23), WAG (24), mtREV (25) and cpREV (26).

Improved selection of homologous proteins

The accuracy of conservation scores is directly influenced by the amount and quality of sequence data available in the MSA and the relatedness between the homologous sequences themselves and the sequence of interest. For example, using homologous sequences with different functions might blur the signal. One of the important changes in the new version of ConSurf is the addition of a clear and intuitive interface that helps control which of the sequences are included in the analysis. These improvements include:

A variety of sequence databases. The server offers the user the option to search for relevant sequences in several automatically updated sequence databases including: (i) SWISS-PROT (default) (27); (ii) a filtered version of the UniProt database (28); (iii) UniProt (29); (iv) UniRef90, in which redundant sequences were removed at the level of 90% identity (30); and (v) the NCBI non-redundant (nr) database.

Manual selection of sequences for the analysis. After searching for homologous sequences, the user can manually select the relevant sequences to be included in the analysis using a simple form that provides all the relevant data for the sequences found and links to external web resources.

Removing redundant sequences. The user can specify the level of redundant sequences for removal. The sequences found are clustered by their level of identity using CD-HIT (8) and the cutoff specified by the user (default level is 95% identity). Only one sequence (the longest) from each cluster is used for the analysis.

Automatic removal of remote homologues. The user can control the level of sequence identity for which a hit sequence is still considered a homologue. Filtering according to the sequence identity between the sequences found and the sequence of interest enables the user to filter out sequences that share a significant alignment with the protein of interest but might have a different function or structure. The default level is set to 35% identity, which is the upper bound of the ‘twilight zone’ for protein structures (31).

Better alignments. The user can choose to align the sequences using one of the following leading alignment algorithms: MAFFT (32), T-COFFEE (EXPRESSO mode) (33), PRANK (34), MUSCLE (35) and CLUSTALW (36). The EXPRESSO mode of T-COFFEE uses structural information (if available) and structural alignment methods to construct a structure-based MSA. MAFFT and PRANK were shown to be among the leading sequence alignment algorithms (34,37). MAFFT-LINSi is much faster than PRANK and thus was chosen to be the default alignment algorithm in ConSurf.
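The two identity-based filters described above can be sketched as follows. This is a simplified illustration and not CD-HIT or the server's code: it assumes the hits are already aligned to the query (equal-length strings with `-` for gaps), and it uses a greedy longest-first clustering in the spirit of CD-HIT. The function names and the toy sequences are hypothetical; the 35% and 95% defaults are the ones stated in the text.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def select_homologues(query, hits, min_identity=35.0, max_redundancy=95.0):
    """Drop remote homologues, then keep the longest sequence of each cluster."""
    # Step 1: remove hits whose identity to the query is below the cutoff.
    kept = {name: seq for name, seq in hits.items()
            if percent_identity(query, seq) >= min_identity}
    # Step 2: visit sequences longest first; a sequence starts a new cluster
    # only if it is not too similar to any existing representative.
    ordered = sorted(kept, key=lambda n: len(kept[n].replace("-", "")), reverse=True)
    representatives = []
    for name in ordered:
        if all(percent_identity(kept[name], kept[rep]) < max_redundancy
               for rep in representatives):
            representatives.append(name)
    return representatives

query = "ACDEFGHIKL"
hits = {
    "h1": "ACDEFGHIKL",  # identical to the query
    "h2": "ACDEFGHIK-",  # shorter near-duplicate of h1
    "h3": "ACDEYGHWKL",  # 80% identical to the query
    "h4": "MNPQRSTVWY",  # unrelated sequence
}
selected = select_homologues(query, hits)
```

In this toy example `h4` falls below the homologue cutoff and `h2` is absorbed into the cluster represented by the longer `h1`.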

Improved user interface

In this new version of ConSurf, we put great emphasis on the user interface. ConSurf now presents an easier and more intuitive step-by-step interface, while still offering the user great flexibility during the process as described above. Each step is accompanied by built-in detailed help.

IMPLEMENTATION

The new version of the ConSurf web server runs on a Linux cluster of 2.6 GHz AMD Opteron processors, equipped with 4 GB RAM per quad-core node. The server runs with up-to-date versions of the supported MSA programs and regularly updated databases. Running time depends on the dataset size (number and length of sequences) and the server load. The ConSurf server is implemented in PHP and Perl using the support of BioPerl modules (38). Rate4Site is implemented in C++ (4). For proteins with an available 3D structure, the conservation scores are projected on the structure and visualized using version 1.44 of FirstGlance in Jmol.

CONCLUSIONS

ConSurf and ConSeq have an established reputation in the identification of functional regions in proteins using evolutionary information. In addition, these methods are a focal point that facilitates the development of more useful tools in our group and in other groups. For example, they are the basis for the development of the PatchFinder tool for the automatic detection of clusters of highly conserved amino acids (39), and the detection of DNA-binding proteins (40). Along with the massive growth of sequence and structure databases, we believe that this new version of the ConSurf server will be highly useful to a growing number of molecular biology researchers and allow them to perform complex analyses using sophisticated algorithms accurately, easily and comprehensively.

FUNDING

BLOOMNET ERA-PG; Israeli Science Foundation (878/09 to T.P.). Funding for open access charge: BLOOMNET ERA-PG.

Conflict of interest statement. None declared.

36 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Twilight zone of protein sequence alignments.

Authors: B Rost
Journal: Protein Eng Date: 1999-02

3. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach.

Authors: S Whelan; N Goldman
Journal: Mol Biol Evol Date: 2001-05 Impact factor: 16.240

4. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues.

Authors: Tal Pupko; Rachel E Bell; Itay Mayrose; Fabian Glaser; Nir Ben-Tal
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

5. ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information.

Authors: Fabian Glaser; Tal Pupko; Inbal Paz; Rachel E Bell; Dalit Bechor-Shental; Eric Martz; Nir Ben-Tal
Journal: Bioinformatics Date: 2003-01 Impact factor: 6.937

6. Protein database searches using compositionally adjusted substitution matrices.

Authors: Stephen F Altschul; John C Wootton; E Michael Gertz; Richa Agarwala; Aleksandr Morgulis; Alejandro A Schäffer; Yi-Kuo Yu
Journal: FEBS J Date: 2005-10 Impact factor: 5.542

7. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases.

Authors: K Tamura
Journal: Mol Biol Evol Date: 1992-07 Impact factor: 16.240

8. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures.

Authors: Meytal Landau; Itay Mayrose; Yossi Rosenberg; Fabian Glaser; Eric Martz; Tal Pupko; Nir Ben-Tal
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

9. The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures.

Authors: Ofir Goldenberg; Elana Erez; Guy Nimrod; Nir Ben-Tal
Journal: Nucleic Acids Res Date: 2008-10-29 Impact factor: 16.971

10. The universal protein resource (UniProt).

Authors:
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971
