
Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach.

Adi Stern, Adi Doron-Faigenboim, Elana Erez, Eric Martz, Eran Bacharach, Tal Pupko.

Abstract

Biologically significant sites in a protein may be identified by contrasting the rates of synonymous (K(s)) and non-synonymous (K(a)) substitutions. This enables the inference of site-specific positive Darwinian selection and purifying selection. We present here Selecton version 2.2 (http://selecton.bioinfo.tau.ac.il), a web server which automatically calculates the ratio between K(a) and K(s) (omega) at each site of the protein. This ratio is graphically displayed on each site using a color-coding scheme, indicating either positive selection, purifying selection or lack of selection. Selecton implements an assembly of different evolutionary models, which allow for statistical testing of the hypothesis that a protein has undergone positive selection. Specifically, the recently developed mechanistic-empirical model is introduced, which takes into account the physicochemical properties of amino acids. Advanced options were introduced to allow maximal fine tuning of the server to the user's specific needs, including calculation of statistical support of the omega values, an advanced graphic display of the protein's 3-dimensional structure, use of different genetic codes and inputting of a pre-built phylogenetic tree. Selecton version 2.2 is an effective, user-friendly and freely available web server which implements up-to-date methods for computing site-specific selection forces, and the visualization of these forces on the protein's sequence and structure.

Substances:
Amino Acids
Proteins

Year: 2007 PMID: 17586822 PMCID: PMC1933148 DOI: 10.1093/nar/gkm382

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Current protein sequences have been shaped by a prolonged and extensive evolutionary process. Thus, studying a protein's evolutionary history may contribute to the inference of the protein's properties. Evolutionarily conserved sites may be indicative of active sites or of protein–protein interaction domains, while highly variable sites may represent sites subjected to positive Darwinian selection (1,2). Such positively selected sites may be interpreted as being a consequence of molecular adaptation, which confers an evolutionary advantage to the organism. Detecting the level of selection operating on a given protein is enabled by computing ω, the ratio between non-synonymous (K(a)) and synonymous (K(s)) substitutions. Sites showing ω values significantly higher than one are indicative of positive Darwinian selection, while sites showing ω values significantly lower than one are indicative of purifying selection (2,3). Selecton was first introduced in 2005 (4), and implemented an evolutionary codon model (5) which enabled calculating ω at each codon site using a maximum-likelihood (ML) approach. With the advent of sequenced genomes, the comparison of DNA sequences in order to infer meaningful information has become a basic procedure for many researchers. The study of the selection forces operating on a protein, and specifically the attempt to identify positive selection in proteins, has been on the rise, and with this rise the number of users of Selecton has increased dramatically. This has motivated us to upgrade the server to include the most recent state-of-the-art methods for calculating positive selection. We have thus implemented five evolutionary models (6–8) that enable studying the selection forces operating on a protein. Specifically, these models enable testing whether positive selection has operated on the gene under study. This is achieved by comparing a null model assuming no positive selection with a model which allows positive selection. 
We note that positive selection is not a common phenomenon and is not apparent in most examined datasets (9,10). In these cases, Selecton can be used to accurately infer sites undergoing purifying selection. Both positive and purifying selection are calculated by inferring ω at each codon site. All calculations explicitly take into account the phylogenetic relations among the sequences and the underlying stochastic process of evolution. The value of ω at each site is then translated to a discrete color scale, and projected onto one of the homologous sequences specified by the user. If a 3-dimensional (3D) structure of the protein is available, the scores will also be projected onto the Van-der-Waals surface of the protein. Various other web servers are available for analyzing the evolutionary forces operating on a gene, including servers which perform analyses only on an amino-acid alignment (e.g. 11–14), and servers which perform analyses of positive selection in codon-based alignments (15–18). For instance, the SNAP server (15) is based on counting methods which estimate the number of synonymous and non-synonymous substitutions between all pairs of sequences (19,20), and the Data-Monkey server (16) is based on a combination of model hypothesis testing and codon ancestral sequence reconstruction. The advantage of Selecton over these alternative web servers is that it combines the implementation of state-of-the-art methods for detecting positive selection with a highly user-friendly interface which requires no previous expert knowledge. This enables the vast community of molecular biology researchers to study positive and purifying selection operating on a protein. Furthermore, expert users may use the advanced options to fine-tune the server to their exact requirements. We provide here a brief review of Selecton, emphasizing the novel features added in version 2.2 of the server. 
We further present the results of the server when applied to the TRIM5α protein, a viral host-defense factor that has recently reached the spotlight of positive selection studies (3,21). We delineate how the use of the Selecton server enables the effortless detection of the primate species-specific retroviral restriction domain of TRIM5α found previously (21,22).

SELECTON VERSION 2 MAJOR INNOVATIONS

Evolutionary models implemented

Several different evolutionary models were implemented in Selecton version 2.2, each emphasizing different biological phenomena:

M7 and M8: These two models were first introduced by Yang et al. (6). In brief, the M8 model assumes that ω values come from a mixture of a discrete beta distribution, and an additional category ωs ⩾ 1 which allows for positive selection. M8 is the default model for Selecton runs. The M7 model, nested within M8, does not include the additional ωs category. Since the beta distribution is defined only on the interval [0,1], it follows that M7 does not allow for positive selection in the protein.

M8a: This model is a variation on the M8 model. In M8a, the additional category ωs is set to 1. Thus, this model allows only for purifying and neutral selection (7).

M5: This model assumes a gamma distribution over ω (6).

MEC: This model (8) is the only model here which takes into account the differences between amino-acid replacement rates. In brief, this model expands a 20 by 20 amino-acid replacement rate matrix [such as the commonly used JTT matrix (23)] into a 61 by 61 sense-codon rate matrix. Hence, when the non-synonymous rate K(a) is inferred, the different replacement probabilities between amino acids with distinct properties are taken into account. For instance, all other models in Selecton assume that the evolutionary rates of leucine (UUG) being replaced by either tryptophan (UGG) or phenylalanine (UUU) are equal, since both require one transversion. However, according to the JTT matrix the latter is five times more likely than the former. Thus, under the MEC model, a position with radical replacements will obtain a higher K(a) value than a position with more moderate replacements. It should be noted that in the MEC model, ω values are not directly equivalent to the K(a)/K(s) ratios. 
ω values are used to calculate these ratios, which are later color-coded onto the results. For a more elaborate explanation, refer to the Selecton FAQ section (http://selecton.bioinfo.tau.ac.il/faq.html). Comparison of these models allows for statistical testing of the hypothesis that there is positive selection operating on the protein (H1), by contrasting this hypothesis to a null model (H0). Part of the Selecton output is the likelihood of each model, allowing for comparison using either a likelihood ratio test (LRT) if the models are nested, or by comparing the second order Akaike Information Criterion (AICC) (24) scores if they are not. In brief, an LRT consists of comparing twice the log-likelihood difference between the two models to a χ² distribution. An alternative approach is to compare the AICC scores, defined by −2 · log L + 2p · (N/(N − p − 1)), where L represents the likelihood of the model given the data, p represents the number of free parameters and N represents the sequence length. The lower the AICC score, the better the fit of the model to the data, and hence the model is considered more justified. We recommend performing one of the following comparisons: (i) M8 against M8a (nested models with one degree of freedom), which is the default comparison performed by Selecton; (ii) M8 against M7 (nested models with two degrees of freedom); (iii) MEC against M8a (non-nested models: five free parameters and four free parameters, respectively). For a comparison of the advantages and disadvantages of each model, please refer to the Selecton FAQ section (http://selecton.bioinfo.tau.ac.il/faq.html). To enable easier use of Selecton, at the end of a run with a model which allows for positive selection, the user may click on a button labeled ‘Test Statistical Significance’. This will run Selecton with the appropriate null model, perform the likelihood comparison (LRT or AICC) and output the significance level of this comparison. 
It should be noted that in the first two nested comparisons, nesting is achieved by fixing one of the parameters on the boundary of the parameter space (6,25), requiring caution when comparing the models with an LRT. Nevertheless, it has been previously shown that using a χ² distribution with one degree of freedom to approximate the distribution of the LRT when comparing M8 versus M8a leads to a conservative approach (25). This approach was adopted by Selecton.
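
The two comparison procedures described above can be illustrated with a minimal Python sketch. It is not Selecton's code: the function names are hypothetical, and the χ² tail probability is given in closed form only for one and two degrees of freedom, which covers the M8 versus M8a and M8 versus M7 comparisons.

```python
import math

def lrt_statistic(loglik_null, loglik_alt):
    """Twice the log-likelihood difference between two nested models."""
    return 2.0 * (loglik_alt - loglik_null)

def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable; closed forms for df = 1 and df = 2 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or df = 2")

def aicc(loglik, n_params, n_sites):
    """Second-order AIC: -2 log L + 2p * N / (N - p - 1)."""
    return -2.0 * loglik + 2.0 * n_params * n_sites / (n_sites - n_params - 1)

if __name__ == "__main__":
    # Made-up log-likelihoods, for illustration only.
    stat = lrt_statistic(loglik_null=-1005.0, loglik_alt=-1002.0)
    print("LRT statistic:", stat)
    print("p-value (df = 1):", chi2_upper_tail(stat, df=1))
    print("AICC:", aicc(loglik=-1002.0, n_params=5, n_sites=300))
```

For non-nested models such as MEC and M8a, only the `aicc` scores would be compared, with the lower score indicating the better-supported model.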

Parameter estimation

Parameters common to all models used in Selecton are the codon equilibrium frequencies πi, the transition/transversion ratio κ and the phylogenetic tree branch lengths. The πi are calculated as in (6,26) using the products of the observed nucleotide frequencies (also known as F3X4). κ and the branch lengths are all ML estimates. We use an expectation maximization approach (27) to solve the problem of multivariate optimization in the case of branch lengths [a similar approach is described in detail in (28,29)].
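
As an illustration of the F3X4 idea, the following Python sketch estimates codon frequencies as products of position-specific nucleotide frequencies, normalized over the 61 sense codons. It assumes in-frame aligned coding sequences written in the DNA alphabet and the standard genetic code; the function name is hypothetical and this is not taken from the Selecton source.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def f3x4(sequences):
    """Codon frequencies from nucleotide frequencies at the three codon positions."""
    # Count nucleotides separately for codon positions 1, 2 and 3.
    counts = [dict.fromkeys("ACGT", 0) for _ in range(3)]
    for seq in sequences:
        for i, nt in enumerate(seq.upper()):
            if nt in "ACGT":  # gaps and ambiguity codes are skipped
                counts[i % 3][nt] += 1
    freqs = []
    for position_counts in counts:
        total = sum(position_counts.values())
        freqs.append({nt: position_counts[nt] / total for nt in "ACGT"})
    # Multiply the three position-specific frequencies for each sense codon.
    raw = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon not in STOP_CODONS:
            raw[codon] = freqs[0][codon[0]] * freqs[1][codon[1]] * freqs[2][codon[2]]
    # Renormalize so that the sense-codon frequencies sum to one.
    norm = sum(raw.values())
    return {codon: value / norm for codon, value in raw.items()}

if __name__ == "__main__":
    pi = f3x4(["ATGGCCAAATTT", "ATGGCAAAGTTC"])  # toy alignment
    print(len(pi), "sense codons; pi[ATG] =", round(pi["ATG"], 4))
```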

An empirical Bayesian method for calculating ω values

The heart of the Selecton server is the calculation of the ω values at each codon position. In the previous version of the server, the ML method (4) was implemented as the sole method of calculation. Recently, we have shown that an empirical Bayesian method can significantly improve the accuracy of inference of conservation scores (30). While the ML method was found to have a relatively high level of false positives, the Bayesian method showed an improved specificity that reduced the level of false positives. The empirical Bayesian method is particularly superior to the ML method when the number of homologous sequences analyzed is small (30). Thus, an empirical Bayesian method of calculating ω values was implemented in the server (6). Following Yang (31), the distributions are approximated using eight discrete categories (the user may define a different number if desired) and the ω values are computed by calculating the expectation of the posterior ω distribution. It should be noted that although more reliable than the ML method, the empirical Bayesian method also suffers from inaccuracy in small data sets, mostly due to sampling errors in the estimation of parameters (such as the distribution shape parameters). Recently two alternatives have been proposed: the full Bayesian estimation (32) and the hierarchical Bayesian estimation (33). However, both alternatives are much more computationally intensive, and hence were not implemented in Selecton.
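
The posterior-expectation step can be sketched as follows. This is a minimal illustration under stated assumptions: the per-category site likelihoods would in practice come from the phylogenetic likelihood computation, whereas here they are made-up numbers, and the function names are hypothetical.

```python
def posterior_over_categories(prior, site_likelihoods):
    """Posterior probability of each ω category at one site (Bayes' rule)."""
    joint = [p * lik for p, lik in zip(prior, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def posterior_mean_omega(omegas, posterior):
    """Expectation of ω under the posterior distribution at one site."""
    return sum(w * p for w, p in zip(omegas, posterior))

if __name__ == "__main__":
    # Eight discrete categories, as in the default setting; all values are illustrative.
    omegas = [0.05, 0.15, 0.3, 0.45, 0.6, 0.8, 0.95, 2.5]
    prior = [0.125] * 8
    site_likelihoods = [1e-9, 2e-9, 4e-9, 6e-9, 8e-9, 9e-9, 1e-8, 3e-8]
    posterior = posterior_over_categories(prior, site_likelihoods)
    print("posterior mean ω:", posterior_mean_omega(omegas, posterior))
```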

Reliability of the ω inferences

The reliability of the ω estimates depends on several factors, including the number of gaps in the MSA, the number of homologous sequences used and their divergence. Thus, an essential improvement implemented in version 2.2 of Selecton is the inclusion of a measure of confidence of the inference. A confidence interval around each ω estimate is defined by the 5th and 95th percentiles of the posterior distribution inferred for each position (see FAQ section of Selecton, http://selecton.bioinfo.tau.ac.il/faq.html). For positions with an inferred ω > 1, if the lower bound of the confidence interval is >1, the inference of positive selection at this position is considered reliable. Selecton furthermore outputs the distribution of the posterior probabilities of ω at each site of the protein.
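
A minimal Python sketch of this reliability criterion is given below. How the server reads percentiles off a discrete posterior is not specified here, so the rule used (the smallest ω whose cumulative posterior probability reaches the requested level) is an assumption, as are the function names.

```python
def posterior_interval(omegas, posterior, lower=0.05, upper=0.95):
    """5th and 95th percentiles of a discrete posterior over ω categories."""
    pairs = sorted(zip(omegas, posterior))

    def quantile(level):
        cumulative = 0.0
        for omega, prob in pairs:
            cumulative += prob
            if cumulative >= level - 1e-12:  # tolerance for rounding error
                return omega
        return pairs[-1][0]

    return quantile(lower), quantile(upper)

def is_reliable_positive(omega_estimate, interval):
    """Positive selection is considered reliable if ω > 1 and the lower bound is > 1."""
    return omega_estimate > 1.0 and interval[0] > 1.0

if __name__ == "__main__":
    omegas = [0.5, 1.2, 2.0, 3.0]          # illustrative categories
    posterior = [0.02, 0.28, 0.40, 0.30]   # illustrative posterior at one site
    estimate = sum(w * p for w, p in zip(omegas, posterior))
    interval = posterior_interval(omegas, posterior)
    print(interval, is_reliable_positive(estimate, interval))
```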

Visualization of results

The new version of Selecton enables the projection of the ω inferences also onto the primary sequence of the protein. Selecton uses a seven-color scale for representing the different types of selection. Shades of yellow (colors 1 and 2) indicate ω > 1, with dark-yellow standing for sites where reliable positive selection was inferred, and light-yellow standing for positive selection that is not statistically significant. Shades of white through magenta (colors 3 through 7) indicate various levels of ω ⩽ 1. A powerful new 3D visualization tool, FirstGlance in Jmol (FGiJ, http://firstglance.jmol.org), was also implemented in Selecton. FGiJ displays the ω values on the 3D structure of the protein. Its power lies in its easy handling, and it requires no installation. It allows preparing presentations with snapshots of Selecton results, saving a PDB file containing the Selecton color-coding scheme, and easily manipulating different properties of the colored 3D molecule. FGiJ works in all popular web browsers and computer platforms.
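
The mapping from ω to the seven-color scale can be sketched in Python as follows. The split between colors 1 and 2 follows the description above; the boundaries between colors 3 through 7 are not given in the text, so the equal-width bins on [0,1] used here, with color 7 taken as the strongest purifying selection, are purely hypothetical.

```python
def selection_color(omega, ci_lower):
    """Return a color index 1..7 for a site, given ω and the lower interval bound."""
    if omega > 1.0:
        # Color 1: reliable positive selection; color 2: not statistically significant.
        return 1 if ci_lower > 1.0 else 2
    # Hypothetical binning: five equal-width bins on [0, 1], color 3 nearest to ω = 1.
    bin_index = min(int((1.0 - omega) / 0.2), 4)
    return 3 + bin_index

if __name__ == "__main__":
    for omega, lower in [(2.5, 1.3), (1.4, 0.7), (0.5, 0.2), (0.05, 0.0)]:
        print(omega, lower, "-> color", selection_color(omega, lower))
```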

Genetic code

Eleven different genetic codes were implemented, including four different nuclear code variants and seven different mitochondrial code variants. This allows the analysis of genes from organisms and organelles which use non-standard genetic codes.
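
The effect of the chosen genetic code can be seen in a short Python sketch that translates the same sequence under the standard code and under the vertebrate mitochondrial code. Only these two codes are shown; treating them as representative of the server's options is an assumption, and the names below are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD_AA = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}
# Vertebrate mitochondrial code: four codons are reassigned relative to the standard code.
VERTEBRATE_MITO = {**STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}

def translate(seq, code):
    """Translate an in-frame coding sequence; '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(code[seq[i:i + 3]] for i in range(0, usable, 3))

def sense_codon_count(code):
    return sum(1 for aa in code.values() if aa != "*")

if __name__ == "__main__":
    cds = "ATGATATGAAGA"
    print(translate(cds, STANDARD), translate(cds, VERTEBRATE_MITO))
    print(sense_codon_count(STANDARD), sense_codon_count(VERTEBRATE_MITO))
```

Because the set of sense codons differs between codes, the dimension of the codon rate matrix changes with the chosen code.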

Phylogenetic tree

By default, Selecton runs are carried out using phylogenetic trees that the server computes using the neighbor-joining algorithm (34). As input for the neighbor-joining algorithm, pairwise distances are computed applying the ML criterion under a codon model (35) which assumes no selection (ω = 1 for all sites and κ = 2). We note that in general, more accurate tree topologies lead to better estimates of parameters (36,37). However, it was previously shown that the detection of positive selection is in general robust to tree topology inaccuracies (6,38,39). Hence, the strategy we adopted in Selecton was to avoid the computationally intense search for the precise tree topology, yet to allow the user to provide a pre-computed phylogenetic tree as an additional input. This new feature enables users to supply a more accurate tree, if available. Additionally, users can supply the tree topology and have the server optimize its branch lengths.

Precision level

The precision level of the computations is defined by setting the cutoff (ε), which defines when two likelihood values have converged. Selecton allows the user to choose between three levels of precision, which also directly affect the speed of calculation: low (ε = 1), intermediate (ε = 0.1) and high (ε = 0.01). The default level of precision for Selecton runs is intermediate.
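
The role of the cutoff ε can be illustrated with a toy Python sketch in which an iterative optimizer stops once the log-likelihood improves by less than ε between iterations. The quadratic "log-likelihood" and the gradient-ascent update are stand-ins chosen for simplicity and are not the server's optimization procedure; all names are hypothetical.

```python
def toy_loglik(x):
    """Toy concave log-likelihood with its maximum at x = 2."""
    return -100.0 * (x - 2.0) ** 2

def toy_gradient(x):
    return -200.0 * (x - 2.0)

def maximize(epsilon, x=0.0, rate=0.001, max_iter=1000):
    """Gradient ascent that stops when the improvement falls below epsilon."""
    current = toy_loglik(x)
    for iteration in range(1, max_iter + 1):
        x = x + rate * toy_gradient(x)
        updated = toy_loglik(x)
        converged = updated - current < epsilon
        current = updated
        if converged:
            return x, current, iteration
    return x, current, max_iter

if __name__ == "__main__":
    for label, eps in [("low", 1.0), ("intermediate", 0.1), ("high", 0.01)]:
        x, loglik, iterations = maximize(eps)
        print(label, "precision:", iterations, "iterations, log-likelihood", loglik)
```

A smaller ε requires more iterations but ends closer to the optimum, which is the trade-off between precision and speed described above.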

BIOLOGICAL EXAMPLE

We illustrate the power of Selecton to detect site-specific selection forces by analyzing the evolution of the TRIM5α protein, a protein that has recently been shown to have undergone extensive positive selection during the course of primate evolution (21,22). Furthermore, positively selected regions were found to correlate with the species-specificity determinants of the protein. Here, we wish to exemplify the ease with which Selecton enables detecting the species-specific viral restriction domains of TRIM5α.

Study of TRIM5α

TRIM5 is a member of the large tripartite motif family in primate genomes, characterized by having RING finger, B-box and coiled-coil domains, as well as an additional SPRY domain found in the α isoform (40). TRIM5α was found to account for HIV-1 resistance observed in rhesus cells (41,42). It is not yet known how TRIM5α mediates viral restriction, although a shorter, alternate transcript of the TRIM5 gene has been shown to be a ubiquitin ligase (43). TRIM5α restriction probably acts on the viral capsid (44), although direct physical interaction between TRIM5α and the capsid proteins has not yet been demonstrated. TRIM5α variants from humans, rhesus monkeys and African green monkeys (AGM) display different but overlapping restriction specificities, which all have the following common property: each TRIM5α is unable to restrict retroviruses isolated from the same species, yet is able to restrict most retroviruses from other species (41). This indicates that TRIM5α is an important natural barrier to cross-species retrovirus transmission. This type of interaction between a host protein and a parasite protein leads to genetic conflict between the two proteins. Such a conflict may lead to rapid fixation of mutations that alter amino acids at the protein–protein interface, which is the hallmark of positive selection (6). Thus, it has been hypothesized that TRIM5α is in an antagonistic conflict with the retroviral capsid proteins. Sawyer et al. (21) analyzed the selection forces acting on TRIM5α and identified a patch of positively selected residues in the SPRY domain. This patch was identified as the species-specific determinant, which is sufficient and necessary for HIV restriction in rhesus monkey cells. Substitution of this patch from the human TRIM5α with the rhesus patch, and vice versa, conferred or abolished HIV-1 restriction, respectively (21). 
In fact, the region determining the species-specificity of the HIV-1 restriction was eventually mapped to two alternative positions in the rhesus SPRY domains (21). A single arginine to proline replacement at residue 332 of the human TRIM5α, or conversely the exchange of the six residues at positions 335–340 for the eight residues of the rhesus sequence, conferred the human TRIM5α an enhanced ability to restrict HIV-1 (22). To test the use of Selecton, 20 primate TRIM5α sequences (21) were used as input for the Selecton server. The server was run with the MEC model (log-likelihood = –6716; AICC score = 13 442) and compared with the M8a null model (log-likelihood = –6779; AICC score = 13 564). Since the AICC score of the MEC model is lower, we assume that the MEC model, which allows for positive selection, indeed fits the TRIM5α data better than a model which does not. The results of the MEC analysis were projected by the server onto the primary sequence of the human TRIM5α (Figure 1). The full results of the run are available in the Gallery section of Selecton (http://selecton.bioinfo.tau.ac.il/gallery.html). The results show an abundance of yellow-colored sites, indicating that TRIM5α has undergone extensive positive selection. Specifically, the two determinants conferring HIV-1 species-specific restriction showed exceptionally high levels of positive selection (Figure 1; positions boxed in black), indicating that these sites have undergone excess amino-acid fixations during the course of primate evolution. In fact, the entire SPRY domain (sites 281–493) displays extensive positive selection, as opposed to the RING finger domain (sites 15–59), the B-box domain (sites 90–132) and the coiled-coil domains (sites 130–241), which display mostly purifying selection with some dispersed positively selected sites.

Figure 1.

Selecton results for TRIM5α run on 20 primate sequences (21) with the MEC model (8). Positive selection is colored in shades of yellow, and purifying selection is colored in shades of magenta. The two species-specific restriction determinants are indicated in boxes. Replacement of these positions with their rhesus equivalent positions leads to a reversal of restriction characteristics. Both determinants show a significantly high level of positive selection.

CONCLUSIONS

We describe here Selecton version 2.2, a web-based bioinformatics tool for the identification of site-specific positive selection and purifying selection in a protein. The minimal input for the server consists of a file of homologous coding sequences. The server performs a codon-based alignment of the sequences, calculates ω values at each site, translates these ratios into selection scores and projects them onto the primary sequence or tertiary structure of the protein, allowing visual identification of blocks or patches of sites with similar ω values. Advanced options of the server include choosing the method of calculation, inputting a phylogenetic tree of the homologous sequences and choosing from amongst a number of evolutionary models implemented in the server. To demonstrate the effectiveness of Selecton, the server was run on a dataset of homologous TRIM5α primate sequences. Selecton correctly identified the species-specific restriction determinants of the protein. Thus, this analysis emphasizes the power of Selecton to accurately identify sites undergoing positive selection, and to present these results in a clear and user-friendly way.

40 in total

1. Codon-substitution models for heterogeneous selection pressure at amino acid sites.

Authors: Z Yang; R Nielsen; N Goldman; A M Pedersen
Journal: Genetics Date: 2000-05 Impact factor: 4.562

2. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution.

Authors: M Anisimova; J P Bielawski; Z Yang
Journal: Mol Biol Evol Date: 2001-08 Impact factor: 16.240

3. DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family.

Authors: Xun Gu; Kent Vander Velden
Journal: Bioinformatics Date: 2002-03 Impact factor: 6.937

4. A structural EM algorithm for phylogenetic inference.

Authors: Nir Friedman; Matan Ninio; Itsik Pe'er; Tal Pupko
Journal: J Comput Biol Date: 2002 Impact factor: 1.479

5. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues.

Authors: Tal Pupko; Rachel E Bell; Itay Mayrose; Fabian Glaser; Nir Ben-Tal
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

6. A codon-based model of nucleotide substitution for protein-coding DNA sequences.

Authors: N Goldman; Z Yang
Journal: Mol Biol Evol Date: 1994-09 Impact factor: 16.240

7. Restriction of multiple divergent retroviruses by Lv1 and Ref1.

Authors: Theodora Hatziioannou; Simone Cowan; Stephen P Goff; Paul D Bieniasz; Greg J Towers
Journal: EMBO J Date: 2003-02-03 Impact factor: 11.598

8. Pervasive adaptive evolution in mammalian fertilization proteins.

Authors: Willie J Swanson; Rasmus Nielsen; Qiaofeng Yang
Journal: Mol Biol Evol Date: 2003-01 Impact factor: 16.240

9. The cytoplasmic body component TRIM5alpha restricts HIV-1 infection in Old World monkeys.

Authors: Matthew Stremlau; Christopher M Owens; Michel J Perron; Michael Kiessling; Patrick Autissier; Joseph Sodroski
Journal: Nature Date: 2004-02-26 Impact factor: 49.962

10. BTBD1 and BTBD2 colocalize to cytoplasmic bodies with the RBCC/tripartite motif protein, TRIM5delta.

Authors: Lixin Xu; Lihong Yang; Prasun K Moitra; Keiko Hashimoto; Prasad Rallabhandi; Sunil Kaul; Germana Meroni; Jane P Jensen; Allan M Weissman; Peter D'Arpa
Journal: Exp Cell Res Date: 2003-08-01 Impact factor: 3.905
