
ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules.

Haim Ashkenazy¹, Shiran Abadi², Eric Martz³, Ofer Chay⁴, Itay Mayrose⁵, Tal Pupko⁶, Nir Ben-Tal⁷.

Abstract

The degree of evolutionary conservation of an amino acid in a protein or a nucleic acid in DNA/RNA reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and function of the macromolecule. The ConSurf web server (http://consurf.tau.ac.il), established over 15 years ago, analyses the evolutionary pattern of the amino/nucleic acids of the macromolecule to reveal regions that are important for structure and/or function. Starting from a query sequence or structure, the server automatically collects homologues, infers their multiple sequence alignment and reconstructs a phylogenetic tree that reflects their evolutionary relations. These data are then used, within a probabilistic framework, to estimate the evolutionary rates of each sequence position. Here we introduce several new features into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree that enables interactively rerunning ConSurf with the taxa of a sub-tree.

Substances:
Proteins
RNA
DNA

Year: 2016 PMID: 27166375 PMCID: PMC4987940 DOI: 10.1093/nar/gkw408

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

ConSurf is a widely used tool for revealing functional regions in macromolecules by analysing the evolutionary dynamics of amino/nucleic acid substitutions among homologous sequences (1–4). ConSurf estimates the evolutionary rates of the amino/nucleic acids and maps them onto the sequence and/or structure of the query macromolecule. Slowly evolving sites on the query surface are usually important for function; thus, ConSurf analysis can pinpoint critically important sites within the query macromolecule. This is particularly true when the structure of the query macromolecule is known, which makes it possible to differentiate between slowly evolving positions at the core, which are usually important for structural stability/folding (e.g. 5), and clusters of slowly evolving surface positions, which are important for function (6–14). In the absence of structure, the evolutionary data are presented on the query sequence together with site-specific predictions of the buried/exposed status of each position, i.e. ConSeq mode (15). The power of ConSurf, in comparison to other popular alternatives based on consensus and relative entropy approaches, is that the evolutionary rates are estimated based on the phylogenetic relationships among the homologues and the specific dynamics of the analysed sequences using advanced probabilistic evolutionary models (16,17). This statistically robust approach makes it easier to differentiate between apparent conservation due to short evolutionary time and genuine conservation reflecting the action of purifying selection. Notably, ConSurf also assigns confidence intervals around the calculated evolutionary rates, which estimate the credibility of the results.
The superiority of ConSurf's estimation of evolutionary conservation over entropy-based methods in accurate prediction of protein active sites, as well as in the identification of biologically active peptides, was previously demonstrated (18–20); an elaborate comparison of ConSurf to alternatives that do not explicitly account for the phylogenetic relations among sequences is provided in the OVERVIEW section of the ConSurf web server. The best established alternative to ConSurf for the detection of functional regions is the Evolutionary Trace method and variants thereof (6–8,21,22). These are also based on phylogenetic analysis, but lack the mathematical rigour of ConSurf, and do not provide any credibility interval around the inferred scores. Recently, the Golding lab introduced a rigorous model, based on a phylogenetic Gaussian process, that accounts for spatial correlation of substitution rates in different positions according to the protein tertiary structure (23,24). Unlike ConSurf, this sophisticated approach requires knowledge of the protein structure. Here we report the introduction of several new features into ConSurf, designed to improve the performance and the interface of the web server in the detection of functional regions in proteins and nucleic acids.

MATERIALS AND METHODS

In a typical ConSurf application, the query protein is first BLASTed (25) against the UNIREF-90 database (26). Redundant homologous sequences are then removed using the CD-HIT clustering method (27,28). The resulting sequences are next aligned using MAFFT (29) and the generated multiple sequence alignment (MSA) is used to reconstruct a phylogenetic tree. Given the tree and the MSA, the Rate4Site algorithm (16) is used to calculate position-specific evolutionary rates under an empirical Bayesian methodology (17). The rates are normalized and grouped into nine conservation grades, 1 through 9, where 1 includes the most rapidly evolving positions, 5 includes positions of intermediate rates, and 9 includes the most evolutionarily conserved positions. It is important to note that structural data are not used up to this point, and the rates are estimated based on sequence data alone. Finally, the conservation grades are mapped onto the query sequence and/or structure using the ConSurf colour-code, with cyan-through-purple corresponding to variable (grade 1)-through-conserved (grade 9) positions. The analysis is conducted only if there are at least five homologous proteins; otherwise the degree of uncertainty is too high.

The protocol for the selection of homologous sequences was shaped as massive amounts of sequence data became available (2,3). A balance should be maintained between the number of sequences used for analysis and their evolutionary or functional relationship to the query molecule. Thus, we try to adjust the default parameters used for homologue collection to maintain this balance. This includes (i) using CS-BLAST, suggested to be more sensitive and accurate in searching for remote homologues than the commonly used BLAST algorithm (25); (ii) for proteins, considering only sequences sharing at least 35% sequence identity with the query sequence, which was suggested to be the upper boundary of the ‘twilight zone’ for protein structures (30); (iii) using the MAFFT-LINSi procedure, suggested to be one of the most accurate MSA methodologies (31), to align the homologous sequences.

Many alternatives to this typical outline are provided in the server. For example, ConSurf is also applicable to nucleotide sequences, it can be used with an external pre-built MSA, and users can control many details of the default algorithmic flow described above. The ConSurf methodology and these advanced options are described in detail in the ‘OVERVIEW’, ‘QUICK HELP’ and ‘FAQ’ sections of the ConSurf website (http://consurf.tau.ac.il).
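Two steps of the workflow described above, the homologue filter and the grouping of rates into nine grades, can be illustrated with a minimal Python sketch. The 35% identity cutoff and the minimum of five homologues come from the text. The function names and the equal-width binning are assumptions made for illustration only, since the text does not specify how ConSurf draws the grade boundaries.

```python
from typing import List, Tuple

MIN_IDENTITY = 35.0   # from the text: at least 35% identity to the query (proteins)
MIN_HOMOLOGUES = 5    # from the text: at least five homologues are required


def filter_homologues(hits: List[Tuple[str, float]]) -> List[str]:
    """Keep the IDs of hits whose percent identity meets the cutoff."""
    kept = [name for name, identity in hits if identity >= MIN_IDENTITY]
    if len(kept) < MIN_HOMOLOGUES:
        raise ValueError("fewer than five homologues: uncertainty too high")
    return kept


def rates_to_grades(rates: List[float], n_grades: int = 9) -> List[int]:
    """Map per-site rates to grades 1..n_grades (slowest rate gets the top grade).

    ASSUMPTION: equal-width bins over the observed range of rates; this is a
    stand-in for whatever binning the server actually applies.
    """
    lo, hi = min(rates), max(rates)
    if hi == lo:
        # All sites evolve at the same rate: assign the middle grade.
        return [(n_grades + 1) // 2] * len(rates)
    width = (hi - lo) / n_grades
    grades = []
    for rate in rates:
        bin_index = min(int((rate - lo) / width), n_grades - 1)  # 0 = slowest
        grades.append(n_grades - bin_index)  # slowest sites get grade 9
    return grades
```

With this sketch, the slowest site in an alignment is assigned grade 9 and the fastest grade 1, matching the direction of the scale described above.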

RECENT ADDITIONS AND IMPROVEMENTS

Selecting the evolutionary model that best fits the data

In the previous ConSurf version, the user was allowed to select one of several evolutionary models which differ from each other in their biological assumptions and in the number of free parameters. For nucleotide sequences the following models have been implemented: the Jukes and Cantor model (JC69), which assumes equal base frequencies and equal substitution rates (32); the Tamura 92 model that uses only one parameter, which captures variation in G-C content (33); the HKY85 model, which distinguishes between transitions and transversions and allows for unequal base frequencies (34); and the General Time Reversible model, which includes free parameters for each transition type and base frequency (35). For protein sequences several models were implemented: LG (36), JTT (37), Dayhoff (38), WAG (39), mtREV for mitochondrial proteins (40) and cpREV for chloroplast proteins (41). Different models can result in different estimations of the phylogeny and evolutionary rate (42,43). ConSurf now allows automatic selection of the model that best fits the analysed sequences, as determined by the Akaike information criterion (AIC) (44–46). Users who prefer this option need to select it in the menu.
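The AIC-based choice can be sketched in a few lines of Python. The criterion itself is standard: AIC = 2k − 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood, and the model with the lowest AIC is preferred. The log-likelihoods and parameter counts in the usage note are made-up illustrative values, and the function names are hypothetical; this is not the server's implementation.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_free_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_free_params - 2.0 * log_likelihood


def best_model(fits: Dict[str, Tuple[float, int]]) -> str:
    """Return the name of the model with the lowest AIC.

    `fits` maps a model name to (maximized log-likelihood, free parameters).
    """
    return min(fits, key=lambda name: aic(*fits[name]))
```

For example, with hypothetical fits `{"JC69": (-1050.0, 0), "HKY85": (-1020.0, 4), "GTR": (-1018.5, 8)}`, the AIC values are 2100, 2048 and 2053 respectively, so `best_model` returns `"HKY85"`: the small likelihood gain of the richer model does not justify its extra parameters.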

Predicting RNA secondary structures

For RNA sequence queries, ConSurf now offers the possibility to predict the secondary structure. Structures are predicted using the RNAfold program of the Vienna package (47,48), and the structure with the lowest free energy is selected. The ConSurf conservation grades are mapped onto the predicted secondary structure. Correlating the evolutionary data with the structural model offers the means to quickly detect functional regions within the RNA query. To exemplify this feature, we analyse the well-studied Phe-tRNA molecule (Figure 1A). The calculation is based on RFAM homologous sequences (49) of the Phe-tRNA molecule (RFAM RF00005 family) clustered by CD-HIT to the level of 80% sequence identity and aligned using MAFFT. The results show that some bases in the TΨC and D loops are assigned particularly high conservation grades. Some of these positions are known to be of structural and functional importance (50,51). Figure 1B shows the conservation grades on the 3D structure of the Phe-tRNA molecule (PDB ID: 1EHZ chain A), further emphasizing the importance of the evolutionarily conserved positions.
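To make the mapping of grades onto a predicted secondary structure concrete, the following Python sketch parses a structure given in dot-bracket notation (the format RNAfold reports) and pairs each base with its conservation grade and base-pairing partner. The function names and the input values in the usage note are hypothetical, and pseudoknot brackets are not handled; it is only an illustration of the idea, not the server's code.

```python
from typing import Dict, List, Optional, Tuple


def parse_dot_bracket(structure: str) -> Dict[int, int]:
    """Return a symmetric map of paired positions (0-based) from dot-bracket."""
    stack: List[int] = []
    pairs: Dict[int, int] = {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
        elif symbol != ".":
            raise ValueError("unsupported symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def annotate(sequence: str, structure: str,
             grades: List[int]) -> List[Tuple[int, str, int, Optional[int]]]:
    """List (1-based position, base, grade, 1-based partner or None) per site."""
    if not (len(sequence) == len(structure) == len(grades)):
        raise ValueError("sequence, structure and grades must have equal length")
    pairs = parse_dot_bracket(structure)
    return [
        (i + 1, base, grades[i], pairs[i] + 1 if i in pairs else None)
        for i, base in enumerate(sequence)
    ]
```

Calling `annotate("GCAAGC", "((..))", [9, 8, 2, 1, 8, 9])` yields one record per base, so that highly conserved unpaired bases (for instance in loops) can be picked out directly.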

Figure 1.

ConSurf analysis of yeast Phe-tRNA. (A) Secondary structure prediction of the molecule coloured by conservation using the colour-code bar. (B) Same analysis using the X-ray crystal structure of the molecule (PDB ID: 1EHZ, chain A).

Predicting a template-based structure for protein sequences

In the previous version, when only the sequence (rather than the structure) of the query protein was provided as input (i.e. ‘ConSeq mode’), ConSurf searched the PDB (52,53) using BLAST (54) to suggest probable homologues of known structure. If the search was productive, the conservation grades were mapped on each of the homologous structures. In the new version, we go one step further and use HHPred (55) and MODELLER (56) to produce a homology-model of the query. Briefly, HHPred uses a hidden Markov model to search for potential templates of known 3D structure in the PDB (57). The MODELLER algorithm (56) is then used to predict a 3D model for the query sequence. The ConSurf conservation grades are subsequently mapped on the predicted model. In addition, the homology model is used to predict the solvent accessibilities of the amino acids. To this end, we use the relative solvent accessible surface areas of the amino acids, calculated using NACCESS (58) and the predicted structure. The derivation of solvent accessibility from the 3D model is expected to be more accurate compared to the buried/exposed prediction, made solely using the protein sequence (59). The latter option is still offered in cases where a template is not available.
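As a rough illustration of how relative solvent accessibility can be combined with conservation grades, the Python sketch below labels each residue as buried or exposed and lists the exposed, highly conserved positions. The 20% cutoff, the minimum grade of 8 and the function names are hypothetical choices made for this example; the text does not state the thresholds the server uses.

```python
from typing import Dict, List

# ASSUMPTION: illustrative cutoff (percent relative accessibility); not from the text.
EXPOSED_CUTOFF_PERCENT = 20.0


def classify_burial(relative_sasa: Dict[int, float],
                    cutoff: float = EXPOSED_CUTOFF_PERCENT) -> Dict[int, str]:
    """Label each position 'e' (exposed) or 'b' (buried) by relative accessibility."""
    return {
        position: ("e" if value >= cutoff else "b")
        for position, value in relative_sasa.items()
    }


def conserved_exposed(grades: Dict[int, int], burial: Dict[int, str],
                      min_grade: int = 8) -> List[int]:
    """Positions that are exposed and have a grade >= min_grade (hypothetical cutoff)."""
    return sorted(
        position for position, grade in grades.items()
        if grade >= min_grade and burial.get(position) == "e"
    )
```

Exposed positions with high grades are candidates for functional sites, whereas conserved buried positions are more likely to matter for structural stability, in line with the distinction drawn in the Introduction.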

Refining ConSurf results using a subset of sequences

Occasionally, purifying selective forces may be strong in one part of the phylogeny yet relaxed (or different) in the remaining parts, indicative of gain or loss of function in some taxonomic clades or in protein subfamilies. In a typical ConSurf analysis, the whole set of homologues (either user-supplied or automatically collected using the default setting) is analysed as a single group, masking this important functional signal. The new ConSurf version provides the means to refine an initial ConSurf analysis by allowing users to select a subtree containing a fraction of the homologous sequences and conduct a follow-up analysis of these selected sequences. To this end, in the ConSurf Results page, the MSA and tree are now visualized using the WASABI platform (60). Users can thus choose any internal node on the phylogenetic tree and open a WASABI menu using a right mouse click (see an example in Supplementary Figure S1). Selecting the option ‘run ConSurf on subtree’ will issue a new window with a follow-up ConSurf run for the selected sequences of the subtree.
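The logic behind a subtree rerun can be sketched in Python as follows: collect the leaves under the chosen internal node and restrict the alignment to those sequences before repeating the analysis. The nested-dictionary tree representation, the node names and the function names are assumptions made for illustration; they do not reflect the data structures used by WASABI or the ConSurf server.

```python
from typing import Dict, List, Optional

Tree = Dict[str, object]  # hypothetical node: {"name": str, "children": [Tree, ...]}


def find_node(node: Tree, name: str) -> Optional[Tree]:
    """Depth-first search for the node carrying the given name."""
    if node.get("name") == name:
        return node
    for child in node.get("children", []):
        found = find_node(child, name)
        if found is not None:
            return found
    return None


def leaf_names(node: Tree) -> List[str]:
    """Names of all leaves (sequences) below a node, in left-to-right order."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    names: List[str] = []
    for child in children:
        names.extend(leaf_names(child))
    return names


def subset_msa(msa: Dict[str, str], tree: Tree, node_name: str) -> Dict[str, str]:
    """Keep only the aligned sequences that belong to the selected subtree."""
    node = find_node(tree, node_name)
    if node is None:
        raise KeyError(node_name)
    keep = set(leaf_names(node))
    return {name: seq for name, seq in msa.items() if name in keep}
```

The reduced alignment returned by `subset_msa` would then serve as the input for the follow-up run; note that after removing sequences, gap-only columns may remain and the rates need to be re-estimated on the subtree.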

Improved visualizations

This new version of ConSurf offers three major visualization improvements:

Accounting for protein assembly. Many proteins function together as complexes, or biological units (5). Therefore, accounting for the full assembly can shed further light on the importance of residues located at the interfaces between the subunits. The new version of ConSurf automatically offers to map the calculated evolutionary conservation grades of the amino acids not only onto a single chain taken from the asymmetric unit of the crystal, which is what is often deposited in the PDB, but also onto all copies of the chain in the biological assembly as predicted by PISA (http://www.ebi.ac.uk/pdbe/prot_int/pistart.html) (61). Figure 2A demonstrates this new feature using the 3D structure of the β subunit of DNA polymerase III from Escherichia coli. The protein functions as a homo-dimer, and as anticipated, most of the residues at the inter-subunit interfaces (62) are highly evolutionarily conserved (Leu108, Lys74, Ile272, Leu273, Glu300, Glu304). The full ConSurf analysis of this structure is available for interactive exploration under the ‘GALLERY’ section of the web server.

Figure 2.

ConSurf analysis of the β subunit of DNA polymerase III from Escherichia coli (PDB ID: 2POL). The interfaces between the two subunits of the homodimer (on both sides of the dotted line) are highly conserved, as well as the internal face of the ring, which interacts with the DNA. (A) Molecule coloured by the traditional ConSurf scale. (B) Molecule coloured by the new colour-blind friendly scale.

Supporting non-Java-based visualization. To enable interactive visualization of 3D molecular structures on devices with no Java installed, or for which Java is not available (e.g. smart phones), a new version of FirstGlance in Jmol (http://bioinformatics.org/firstglance/fgij/) was implemented with JSmol (63). Briefly, JSmol uses HTML5 and JavaScript to implement the functionality offered by the Jmol application, which is implemented in Java (http://www.jmol.org/). It allows 3D visualization in modern web browsers (e.g. Chrome, Edge) which, in an attempt to avoid security threats, no longer support Java applets (such as Jmol) running from the browser.

New colour-blind friendly palette. In addition to the traditional cyan-through-purple palette corresponding to variable (grade 1)-through-conserved (grade 9) scores, the new version also offers a more colour-blind friendly green-through-purple palette (see example in Figure 2B).

CONCLUSIONS AND PROSPECTS

We presented improvements to the ConSurf method and web server for the detection of functional regions in protein and nucleotide sequences. The ConSurf calculation is conducted using sequence data, but the results are particularly enlightening when viewed on the 3D structure of the macromolecule, or model thereof. The main changes compared to the previous version of ConSurf are summarized in Table 1.

Table 1.

The main recent improvements in ConSurf

Feature	ConSurf—2010	ConSurf—2016
Selecting evolutionary model	Only according to user selection	New option for automatic selection of the model showing the best fit to MSA
RNA secondary structure	Not available	Predicting RNA secondary structure using Vienna package and projecting ConSurf grades on the structure
Projecting scores on identical chains and protein assemblies	Scores projected only on single chain, and do not support protein assemblies	Projecting scores on all identical chains and the most probable assembly downloaded from PISA
FirstGlance in Jmol	Version 1.44 supporting only Jmol viewer	Version 2.42 supporting JSmol which is Java free viewer (see additional features and improvements of the new version at: http://bioinformatics.org/firstglance/fgij/versions.htm)
Phylogenetic tree viewer	Only the tree is shown using a Java applet	The MSA is shown together with the phylogenetic tree using the WASABI platform (Java free)
Rerun ConSurf using sequences from sub-tree	Not available	Interactive selection of sub-tree sequences using WASABI, and rerun ConSurf with these sequences
Structural information for protein query sequences (no PDB provided)	Suggesting highly similar homologous sequences to the protein query sequence and projecting ConSurf scores on them	In addition to the suggested homologues, template-based structure prediction is performed using HHPred and MODELLER
Solvent accessibility information when the protein's PDB structure is not available	Predicted from sequence information only	When possible, extracted using NACCESS from the 3D structure modelled using HHPred and MODELLER

Our understanding of the physicochemical interactions underlying the selective forces responsible for evolutionary-rate differences among sites is only partial. Quantitatively, it was estimated that only 60% of the data can be explained (64). Nevertheless, exploiting evolutionary rate differences is useful in various biological studies, including structure analysis (e.g. 65,66) and prediction (e.g. 67), interpretation (68) and design of mutations (69), identification of natural peptides (20), systems and genome-wide studies (70) and studies of the last common ancestor (71). Until a few years ago, the bottleneck of the analysis was to obtain a sufficient number of homologous sequences of the query and to design an algorithm that makes the best use of these homologues to estimate the evolutionary rates. While efforts to improve the rate estimates are ongoing (23,24,72,73), with the flood of sequence data, the main challenge now is to cope with thousands of homologous sequences by clustering the data and finding a large and diverse set of true homologues that faithfully represents the diversity. We plan to implement such a clustering method in ConSurf in the very near future.

64 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.

Authors: Michael Remmert; Andreas Biegert; Andreas Hauser; Johannes Söding
Journal: Nat Methods Date: 2011-12-25 Impact factor: 28.547

3. Wasabi: An Integrated Platform for Evolutionary Sequence Analysis and Data Visualization.

Authors: Andres Veidenberg; Alan Medlar; Ari Löytynoja
Journal: Mol Biol Evol Date: 2015-12-03 Impact factor: 16.240

4. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases.

Authors: K Tamura
Journal: Mol Biol Evol Date: 1992-07 Impact factor: 16.240

5. Sequence context-specific profiles for homology searching.

Authors: A Biegert; J Söding
Journal: Proc Natl Acad Sci U S A Date: 2009-02-20 Impact factor: 11.205

6. Evolutionarily conserved Galphabetagamma binding surfaces support a model of the G protein-receptor complex.

Authors: O Lichtarge; H R Bourne; F E Cohen
Journal: Proc Natl Acad Sci U S A Date: 1996-07-23 Impact factor: 11.205

7. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures.

Authors: Meytal Landau; Itay Mayrose; Yossi Rosenberg; Fabian Glaser; Eric Martz; Tal Pupko; Nir Ben-Tal
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. Rfam 12.0: updates to the RNA families database.

Authors: Eric P Nawrocki; Sarah W Burge; Alex Bateman; Jennifer Daub; Ruth Y Eberhardt; Sean R Eddy; Evan W Floden; Paul P Gardner; Thomas A Jones; John Tate; Robert D Finn
Journal: Nucleic Acids Res Date: 2014-11-11 Impact factor: 19.160

9. Phylogenetic Gaussian process model for the inference of functionally important regions in protein tertiary structures.

Authors: Yi-Fei Huang; G Brian Golding
Journal: PLoS Comput Biol Date: 2014-01-16 Impact factor: 4.475

10. The bacterial dicarboxylate transporter VcINDY uses a two-domain elevator-type mechanism.

Authors: Christopher Mulligan; Cristina Fenollar-Ferrer; Gabriel A Fitzgerald; Ariela Vergara-Jaque; Desirée Kaufmann; Yan Li; Lucy R Forrest; Joseph A Mindell
Journal: Nat Struct Mol Biol Date: 2016-02-01 Impact factor: 15.369
