Literature DB >> 20551039

Large-scale analysis of orthologs and paralogs under covarion-like and constant-but-different models of amino acid evolution.

Romain A Studer¹, Marc Robinson-Rechavi.

Abstract

Functional divergence between homologous proteins is expected to affect amino acid sequences in two main ways, which can be considered as proxies of biochemical divergence: a "covarion-like" pattern of correlated changes in evolutionary rates, and switches in conserved residues ("conserved but different"). Although these patterns have been used in case studies, a large-scale analysis is needed to estimate their frequency and distribution. We use a phylogenomic framework of animal genes to answer three questions: 1) What is the prevalence of such patterns? 2) Can we link such patterns at the amino acid level with selection inferred at the codon level? 3) Are patterns different between paralogs and orthologs? We find that covarion-like patterns are more frequently detected than "constant but different," but that only the latter are correlated with signal for positive selection. Finally, there is no obvious difference in patterns between orthologs and paralogs.

Entities: Chemical Disease Species

Mesh：

Substances：
Amino Acids

Year: 2010 PMID： 20551039 PMCID： PMC2955734 DOI： 10.1093/molbev/msq149

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

Gene function changes during evolution, including changes in biochemical function that are expected to be reflected in the amino acid sequence. Most evolutionary models assume that duplication plays a major role in the evolution of such changes (Ohno 1970; Semon and Wolfe 2007; Conant and Wolfe 2008). We have previously shown that positive selection affects vertebrate protein-coding genes relatively frequently, but that there is no increase in its prevalence after duplication (Studer et al. 2008). A general trend of higher divergence after duplication is in fact not so well supported as expected and needs more investigation at all levels of divergence (Studer and Robinson-Rechavi 2009b). Positive selection is expected to be correlated with functional changes, although a direct link remains to be established (Eyre-Walker 2006). In this work, we explore the divergence between homologous proteins, this time directly in the amino acid sequences. We used two different measures, considered as proxies of biochemical divergence: the “covarion-like” pattern and the “conserved-but-different” pattern (Anisimova and Liberles 2007; Liberles 2007; Studer and Robinson-Rechavi 2009a). Covarion-like evolution means that several sites experience acceleration or deceleration in their evolutionary rate in a correlated way (i.e., in the same evolutionary period), presumably due to variation in selective pressure. Conserved but different refers to a pattern of change from one amino acid to another in a specific evolutionary period for a site that is conserved the rest of the time. It is generally assumed that both these patterns are linked to functional change and driven by positive selection. It is unclear whether this would be the same type of selective events detected by codon models because these amino acid patterns do not necessarily imply a high rate of nonsynonymous to synonymous change. We answer three questions: 1) What is the prevalence of such patterns? 2) Can we link such patterns at the amino acid level with selection inferred at the codon level? 3) Are patterns different between paralogs and orthologs?

Materials and Methods

Data

We used gene sequences and trees from the database HomolEns release 4 (Penel et al. 2009), which is based on Ensembl release 49 (March 2008; Hubbard et al. 2009) March 2008 being the date of release 49. Genes are organized in families, which include precalculated alignments and phylogenies. The main advantage of the Homolens system is the FamFetch query system and the TreePattern function (Dufayard et al. 2005). We can scan for specific topologies among the 23,155 family trees. We selected four different branches of the evolution of animals (fig. 1): 1) the whole-genome duplication at the basis of teleost fishes (3R) (549 subfamilies extracted), 2) the split between teleost fishes and tetrapods (4,024 subfamilies), 3) the whole-genome duplications at the base of the vertebrates (1R/2R) (1,014 subfamilies), and 4) the split between Protostomia (limited to insects in our data set) and Deuterostomia (vertebrates and the two chordates Ciona intestinalis and C. savignyi in our data set) (1,234 subfamilies). This represents a total of 6,821 branches to test. These events were chosen because they are important in the evolution of animals, and they contain at least four sequenced genomes on each side of the branch, which appears from pilot studies to be a minimum requirement to be able to detect shifts in a significant manner.

Tree topologies studied. Boxes represent speciation events and circles represent duplication events. The branches tested are the 3R genome duplication at the basis of teleost fishes (A), the speciation between fishes and tetrapods (B), the 1R/2R genome duplications at the origin of vertebrates (C), and the speciation between insects and chordates (D). For the families thus recovered, we removed species with 2× genome coverage (mostly coming from the Mammalian Genome Project of the the National Institute of Health). The restricted alignments were refined with MUSCLE (Edgar 2004). Computations were then done on the new alignment after removing all columns with at least one gap and extracting the well-aligned part using GBLOCKS (Castresana 2000). Phylogenetic subtrees were extracted from the global trees, and branch lengths reestimated with PhyML release 3.0 (Guindon and Gascuel 2003). For the manipulations of sequences and trees, we combined scripts in Python, BioPython (Cock et al. 2009), Jalview (Waterhouse et al. 2009), and the R library APE (Paradis et al. 2004). Our data set includes ten species of tetrapods: the frog Xenopus tropicalis (Hellsten et al. 2010), the chicken Gallus gallus (International Chicken Genome Sequencing Consortium 2004), and the nine mammals: Monodelphis domestica (Mikkelsen et al. 2007), Bos taurus (Elsik et al. 2009), Canis familiaris (Lindblad-Toh et al. 2005), Equus caballus (Wade et al. 2009), Mus musculus (Waterston et al. 2002), Rattus norvegicus (Rat Genome Sequencing Project Consortium 2004), Pongo pygmaeus abelii (unpublished data), Macaca mulatta (Gibbs et al. 2007), Pan troglodytes (The Chimpanzee Sequencing and Analysis Consortium 2005), and Homo sapiens (International Human Genome Sequencing Consortium 2001, 2004); five species of teleost fishes: Danio rerio (unpublished), Gasterosteus aculeatus (unpublished), Oryzias latipes (Kasahara et al. 2007), Tetraodon nigroviridis (Jaillon et al. 2004), and Takifugu rubripes (Aparicio et al. 2002); the two Ciona: C. intestinalis (Dehal et al. 2002) and C. savignyi (Hill et al. 2008); and four species of insects: Aedes aegypti (Nene et al. 2007), Anopheles gambiae (Holt et al. 2002), Apis melifera (Honeybee Genome Sequencing Consortium 2006), and Drosophila melanogaster (Celniker et al. 2002).

Detection of Shifts in Evolutionary Rate (Covarion-Like)

A shift in evolutionary rate can be observed when a particular amino acid is constrained in one part of the phylogenetic tree (subtree) and is relaxed in the other subtree. This has been called “heterotachy” (Lopez et al. 2002; Philippe et al. 2003), “Type I of functional divergence” (Gu 1999, 2001), or “rate-shifting sites” (Abhiman and Sonnhammer 2005). A particular case is the “concomitantly variable codons” (covarions) process (Fitch 1971; Miyamoto and Fitch 1995; Pupko and Galtier 2002). This pattern of evolution postulates that a subset of sites shifting at the same time are more likely to be structurally linked in the protein (Pupko and Galtier 2002). For simplicity, we will use the term covarions for the rest of the paper. Although it should be noted that we do not study structural linkage, we only report cases where many sites have shifted at the same time (same branch of the tree). To detect such covarions, we used Checkcov (Galtier N, personal communication), an implementation of the algorithm described by Pupko and Galtier (2002). Checkcov performs statistical tests for each residue and can be implemented in an automatic pipeline. The method performs a likelihood ratio test between a null model with only one evolutionary rate per site [rate-among-sites (RAS) model] against an alternative model with two evolutionary rates, one for each subtree. It manages repetition test inside an alignment by using a binomial distribution B(n, P). Briefly, Checkcov assumes that if a covarion process has affected an alignment of length n, we should detect significantly more that 1% of sites at P value <0.01. We used the binomial test from R (binom.test(i, n, P)) to compute the corresponding P value. We also used Procov to validate the results of Checkcov (Wang et al. 2009). Procov is a general method to detect covarions anywhere in a tree, whereas Checkcov focuses only on a single branch in the tree. To detect whether an alignment includes significant evidence of covarion-like evolution, Procov performs a likelihood ratio test between the RAS model and the general covarion model. For each site, Procov provides the log-likelihood value under the RAS model (1 degree of freedom) and the log-likelihood value under the COV model (3 degrees of freedom). To choose the cutoff to assign a site under a COV model, we used the Akaike information criterion. The difference in log likelihood should be higher than 2 (3 degrees of freedom for COV − 1 degree of freedom for RAS) to be significant. This is an approximation relative to the parametric bootstrap method used in the original paper of Procov (Wang et al. 2009); the parametric bootstrap necessitates a new computation for each data set, resulting in a slightly different cutoff in each case (1.62 for the data set of Wang et al.).

Detection of Change of Conservation Pattern (Constant but Different)

A change in conservation pattern can be seen when a residue is constrained in one subtree for a given property (e.g., a specific amino acid or hydrophobicity) and constrained in the other subtree for a different property (e.g., a different amino acid or polarity). This has been called “Type II of functional divergence” (Gu 2006), “conservation-shifting sites” (Abhiman and Sonnhammer 2005), or “constant but different” (CBD) (Gribaldo et al. 2003). We will use the term CBD for the rest of the paper. We focus only on radical changes in physicochemical properties, such as an acidic to a basic amino acid or a polar to a hydrophobic amino acid. We use Burst after Duplication with Ancestral Sequence Prediction (BADASP) (Edwards and Shields 2005) (release 1.3), which implements the Burst after Duplication (BAD) algorithm (Caffrey et al. 2000). BAD computes the observed differences between two subtrees by the comparison between ancestral conservation and recent conservation. The cutoff value and the minimum number of sites per family were chosen after simulations and after comparison with a positive selection data set (see Results).

Simulations

We performed simulations to test the accuracy and the power of the tests we use. For each subfamily from Homolens, we simulated five alignments of amino acid sequences. We specified the sequence length after removing gaps from the original alignment. We simulated the first 90% of the alignment under a nearly neutral process and the last 10% under a covarion process. The nearly neutral process was based on a RAS model with four categories and the alpha parameter estimated from the real data, using Evolver from PAML (Yang 2007), with a fixed tree topology. The covarion process was simulated by applying for each column a randomly chosen amino acid constraint in one subtree (100% conservation) and random amino acids in the other subtree (no conservation). Using Checkcov, we expect to find 0–1% of covarion sites in the first 90% of the simulated alignments and up to 100% of covarion sites in the last 10%. We also used simulations derived from the 767 nonduplication trees of our positive selection data set of vertebrates (Studer et al. 2008). This was used to test the CBD under the neutral hypothesis and to define the minimal percentage of sites expected. These simulated multiple alignments followed a RAS model using Evolver.

Statistical Analysis

Over- and underrepresentation in gene ontology (GO) terms was estimated using the TopGO library (Alexa et al. 2006) from Bioconductor (Gentleman et al. 2004). We used false discovery rate to correct for repetition test for enrichment. All other statistical analyses were performed using R (R Development Core Team 2007) and the QVALUE package for test repetition (Storey and Tibshirani 2003).

Results

Comparison with Data of Positive Selection

The data from our previous study of positive selection consists in 767 families of singleton genes of vertebrates (Studer et al. 2008). The analysis has been performed by CodeML from the PAML package (Yang 2007). The model used was the branch-site model (Zhang et al. 2005), able to predict episodic positive selection, and identify amino acids position by providing a Bayes empirical Bayes (BEB) score (Yang et al. 2005). First, we focused on sites. The BEB scores per site were retrieved for the test on the branch between tetrapods and teleost fishes. We assigned a value of 0 for sites with a BEB < 50% and used the real BEB value otherwise. We then associated to these sites the corresponding chi-square value from Checkcov (COV) and the BAD score from BADASP. We have data for a total of 330,067 sites. We found a weak positive correlation between BEB and COV (Pearson’s r = 0.07, P < 2.2 × 10−16), a higher positive correlation between BEB and BAD (Pearson’s r = 0.26, P < 2.2 × 10−16), and, as expected, a weak negative correlation between COV and BAD (Pearson’s r = −0.13, P < 2.2 × 10−16). To verify that results are not influenced by potential saturation of synonymous changes, we repeated the analysis excluding branches with dS > 1. All trends are similar (BEB–COV r = 0.04, BEB–BAD r = 0.19, and COV–BAD r = −0.16) consistent with the lack of saturation reported in Studer et al. (2008). In a second step, we made four different classes of sites (BEB < 50%, 50% ≤ BEB < 95%, 95% ≤ BEB < 99%, and BEB ≥ 99%), and we plotted the distribution of covarion and CBD scores among these classes. These distributions show that the correlation between covarions and positive selection is not biologically significant (fig. 2A) because sites have the same distribution of covarion scores whatever their posterior probability of positive selection. On the other hand, there is a clear shift in CBD scores between the different classes of positive selection (fig. 2B). Of note, these results are consistent with the detection of a “bona fide” signal by the test for positive selection.

Correlation of sites under positive selection with covarions and CBDs. (A) The histogram presents the values of the chi-square score per site for covarions. The curves are the chi-square values for different BEB intervals (posterior probability for a site to be under positive selection): black for BEB < 50%, blue for BEB < 95%, green for 99% < BEB, and red for BEB ≥ 99%. (B) The histogram represents the values of BAD scores per site for CBD. The curves are the BAD values for different BEB intervals: black for BEB < 50%, blue for BEB < 95%, green for 99% < BEB, and red for BEB ≥ 99%. The dashed curve represents BAD scores of nearly neutral simulated sequences. The dashed line at 3.5 is the 99th percentile of BAD scores for BEB < 50%. Because BAD scores are not associated to a statistical test but are associated to positive selection, we used BEB values to identify a cutoff for BAD scores. We identified the 99th percentile for the distribution of sites in the class of BEB < 50%. We found a BAD score value of 3.50, which will be defined as our cutoff to detect CBD. This is consistent with the cutoff value of 3+ recommended by the authors (BADASP manual, p. 23). It is also consistent with the distribution of BAD scores for sequences simulated under nearly neutral evolution (dashed curve in fig. 2). To fix the expected proportion of CBD sites (BAD score ≥3.5) in an alignment under the null hypothesis, we used these simulated alignments and selected the 99th percentile of the distribution (fig. 3). We obtain a limit of 3.9% (rounded to 4% for further analyses). These cutoff values (BAD ≥ 3.5; sites >4%) provide us with a stringent test of CBD for further analyses.

Percentage of CBDs per families under RAS model and real data. Histograms of proportion of CBD sites with BAD score >3.5 per subtree in data simulated under a neutral model of RAS (A) and in real data (B). The dashed line at 3.9% is the 99th percentile under the neutral model.

Covarion-Like Sites

We first performed tests on simulated data (table 1). For fish specific duplicates (3R), we recovered 75% of covarion sites. This power is not very high, but the accuracy is more than 99.6%. In other branches, we recovered 94–99% of covarion sites without losing accuracy, which is always above 99.2%.

Table 1.

Evaluation of the Accuracy and Power of the Test for Covarions.

Event Studied	Number of Familiesa	Average Number of Taxa	Average Sequences Length	Accuracy under Neutral Evolution	Power under Covarion Process
Event Studied	Number of Familiesa	Average Number of Taxa	Average Sequences Length	(=1 − Percentage Covarion) (%)	(=Percentage Covarion) (%)
Duplication 3R	2,745	9.1	390.7	99.6	75.2
Speciation tetrapods–fishes	20,120	16.2	353.8	99.4	94.5
Duplication 2R	5,070	31.3	229.5	99.4	99.2
Speciation chordates–insects	6,170	28.0	254.9	99.2	94.5

Five simulations per set of parameters derived from one family of real data.

Evaluation of the Accuracy and Power of the Test for Covarions. Five simulations per set of parameters derived from one family of real data. On real data, various percentages of families under covarion processes were found, above the 1% threshold (table 2). In the most recent event tested, the 3R duplication, only two families (0.4%) are significant, and we suspect misalignment for one case (ENSDARP00000012602; Dup3R/HBG059441-2 in our Web supdata: http://bioinfo.unil.ch/supdata/). The 3R is the event with the lowest number of sequences per family (nine in average), and this could provide less power, as seen in simulations. For the speciation between tetrapods and fishes, there are twice as many genes per family (16 on average), and we find 33 families (0.8%). This result is coherent with a preliminary study (Studer and Robinson-Rechavi 2009a), where we found a similar trend using another tool (ShiftFinder; http://sites.univ-provence.fr/evol/phylogenomics-lab/PageWeb/SHIFT-FINDER.htm) to detect shifts in evolutionary rates.

Table 2.

Results of the Detection of Covarions.

Event Studied	Number of Families	Average Number of Sequences	Average Number of Sites	Families without Significanta Signal for Covarions		Families with Significanta Signal for Covarions
Event Studied	Number of Families	Average Number of Sequences	Average Number of Sites	Number of Families (%)	Mean Branch Lengthb	Number of Families (%)	Percentage of Covarions Sites (%)	Mean Branch Length
Duplication 3R	549	9.1	391.8	547 (99.6)	0.134	2 (0.4)	3.2	0.201
Speciation tetrapods–fishes	4,024	16.2	355.2	3991 (99.2)	0.247	33 (0.8)	3.5	0.445
Duplication 2R	1,014	31.3	231.0	928 (91.5)	0.290	86 (8.5)	4.8	0.522
Speciation chordates–insects	1,234	28.0	256.2	1161 (94.1)	0.426	73 (5.9)	4.4	0.667

P value = 1% and Q value threshold at 10%.

In amino acid substitutions.

Results of the Detection of Covarions. P value = 1% and Q value threshold at 10%. In amino acid substitutions. The highest number of significant families (8.5%) is found for the genes associated to the 2R duplication at the base of vertebrates. This is also the data set with the most sequences per family, 31 on average. The oldest event tested is the speciation event between insects and chordates, where we found 5.9% of families with an average of 28 sequences per family. A different approach to the detection of covarion-like sites is implemented in Procov (Wang et al. 2009). Procov detects shifts that occurred on any branch of the tree and expectedly detects more shifts than the targeted approach of Checkcov. In our data set, Checkcov identifies 3% of significant families, whereas Procov identifies 66% of significant families. Among the 194 families identified by Checkcov, Procov also detected 97% of them. Of the 17,787 sites detected by Checkcov at the 1% threshold, 95% present some signal in Procov (ln(L_cov) − ln(L_RAS) > 0). Moreover, there is a strong correlation of Checkcov and Procov scores, when comparing sites that are significant under Checkcov (Spearman’s ρ = 0.41, P < 2.2 × 10−16). Finally, using very conservative thresholds (ten for Checkcov and two for Procov), more than 50% of the 2,961 sites from Checkcov are recovered by Procov, whereas detecting exact covarion sites is a difficult problem.

CBD Sites

We found no significant CBD sites in 3R duplicates and very few in other events (table 3): tetrapode–teleost speciation (0.5%), 2R duplication (0.4%), and chordate–insect speciation (1.4%). This is probably due to our stringent cutoff and control for false discovery rate. Interestingly, the average of CBD sites in significant families is quite high (≈10%), supporting the relevance of those shifts that we do detect.

Table 3.

Results of the Detection of CBD.

Event Studied	Number of Families	Average Number of Sequences	Average Number of Sites	Families without Significanta Signal for CBD		Families with significanta signal for CBD
Event Studied	Number of Families	Average Number of Sequences	Average Number of Sites	Number of Families (%)	Mean Branch Lengthb	Number of Families (%)	Percentage of BADASP Sites (%)	Mean Branch Length
Duplication 3R	549	9.1	391.8	549 (100)	0.134	0 (0)	NA	NA
Speciation tetrapods–fishes	4,024	16.2	355.2	4003 (99.5)	0.245	21 (0.5)	9.7	0.900
Duplication 2R	1,014	31.3	231.0	1010 (99.6)	0.305	4 (0.4)	12.7	1.592
Speciation chordates–insects	1,234	28.0	256.2	1217 (98.6)	0.430	17 (1.4)	10.3	1.149

Cutoff at 4%, based on simulation data, and Q value threshold at 10%.

In amino acid substitutions.

Results of the Detection of CBD. Cutoff at 4%, based on simulation data, and Q value threshold at 10%. In amino acid substitutions.

Global Trends

A potential confounding effect in the analysis of 2R duplicate genes is the evolutionary rate of fish genes, which evolve faster than tetrapods (Brunet et al. 2006; Steinke et al. 2006). We recomputed the 2R families after removing all fish sequences. The first observation, as expected, is a global increase of the mean length in the tested branch, from 0.257 amino acid substitutions per site to 0.439, because we measure the branch between the 2R event and the amphibian–amniote speciation, instead of the branch between the 2R event and the earlier teleost–tetrapode speciation. More interesting, there is a decrease in the number of families with significant covarion patterns (from 8.5% to 4.4%) and inversely a large increase in the number of families significant for CBD (from 0.4% to 16.8%). The fast-evolving fish genes probably increased the global heterogeneity in amino acid alignments, decreasing the signal for paralog-specific conservation of amino acids (CBD). GO enrichment shows that our data set is biased toward slow-evolving genes, as in Studer et al. (2008), for similar reasons of stringent data collection. Significant results in covarions revealed only a few categories, typical of fast-evolving proteins; no GO categories were enriched in CBD genes. With these results, we can only say that it seems that covarions and CBDs can be found among most categories of genes. Finally, we evaluated possible confounding factors by computing a linear model (analysis of variance) testing the effect of different parameters describing gene families (tables 4 and 5). The number of sequences and the branch length are correlated with the percentage of significant sites for both tests, whereas the number of sites analyzed has no impact. It should be noted that the branch length explains up to half of the variance in the CBD analyses. However, at least 72% of the variance is explained by none of these parameters for covarions and at least 44% of the variance for CBD sites. Presumably, most of this remaining variance is due to shifts in function and selection.

Table 4.

Effect of Potential Confounding Factors on the Detection of Covarions.

Variable	Duplication 3R		Speciation Tetrapods–Fishes		Duplication 2R		Speciation Vertebrates–Insects
Variable	Variance Explained (%)	P Valuea	Variance Explained (%)	P Valuea	Variance Explained (%)	P Valuea	Variance Explained (%)	P Valuea
Number of genes	2	3.8 × 10⁻⁰⁴	2	2.2 × 10⁻¹⁶	2	1.5 × 10⁻⁰⁷	3	5.5 × 10⁻¹²
Number of sites	0	1.5 × 10⁻⁰¹	0	2.1 × 10⁻⁰¹	0	8.8 × 10⁻⁰¹	0	7.2 × 10⁻⁰¹
Branch length separating subtrees	2	2.0 × 10⁻⁰⁴	12	2.2 × 10⁻¹⁶	16	2.2 × 10⁻¹⁶	6	2.2 × 10⁻¹⁶
Number of branches in subtree alpha	2	1.1 × 10⁻⁰⁴	0	8.6 × 10⁻⁰⁴	0	8.3 × 10⁻⁰¹	0	2.9 × 10⁻⁰¹
Sum of branch lengths in subtree alpha	8	2.9 × 10⁻¹³	3	2.2 × 10⁻¹⁶	1	5.6 × 10⁻⁰⁵	1	1.7 × 10⁻⁰³
Number of branches in subtree betab	NA	NA	0	6.1 × 10⁻⁰¹	0	3.7 × 10⁻⁰¹	NA	NA
Sum of branch lengths in subtree beta	10	2.2 × 10⁻¹⁶	5	2.2 × 10⁻¹⁶	1	8.5 × 10⁻⁰⁴	1	3.3 × 10⁻⁰⁴
Median_diffc	2	2.8 × 10⁻⁰⁵	4	2.2 × 10⁻¹⁶	0	1.3 × 10⁻⁰²	5	2.2 × 10⁻¹⁶
Residuals	74		72		79		84

NOTE.—NA, non-available.

Italic values indicate significant after a Bonferroni correction (α = 0.05/4 = 0.0125).

The values NA were removed from the analysis of variance because the number of branches in the subtree beta is identical to the number of branches in the subtree alpha.

Median_diff represents the difference between the medians of all branch lengths for both trees.

Table 5.

Effect of Potential Confounding Factors on the Detection of CBD Sites.

Variable	Duplication 3R		Speciation Tetrapods–Fishes		Duplication 2R		Speciation Vertebrates–Insects
Variable	Variance Explained (%)	P Valuea	Variance Explained (%)	P Valuea	Variance Explained (%)	P Valuea	Variance Explained (%)	P Valuea
Number of genes	3	2.2 × 10⁻⁰⁹	4	2.2 × 10⁻¹⁶	1	1.5 × 10⁻⁰⁶	4	2.2 × 10⁻¹⁶
Number of sites	0	5.3 × 10⁻⁰¹	0	2.1 × 10⁻⁰²	0	5.2 × 10⁻⁰¹	0	4.6 × 10⁻⁰¹
Branch length separating subtrees	51	2.2 × 10⁻¹⁶	52	2.2 × 10⁻¹⁶	39	2.2 × 10⁻¹⁶	39	2.2 × 10⁻¹⁶
Number of branches in subtree alpha	0	9.6 × 10⁻⁰¹	0	8.0 × 10⁻⁰¹	0	5.7 × 10⁻⁰¹	0	2.4 × 10⁻⁰¹
Sum of branch lengths in subtree alpha	0	2.3 × 10⁻⁰¹	0	6.6 × 10⁻⁰²	0	2.6 × 10⁻⁰²	0	5.7 × 10⁻⁰¹
Number of branchesin subtree betab	NA	NA	0	6.8 × 10⁻⁰¹	0	2.1 × 10⁻⁰¹	NA	NA
Sum of branch lengths in subtree beta	1	1.8 × 10⁻⁰⁴	0	9.0 × 10⁻⁰²	0	2.4 × 10⁻⁰¹	0	9.1 × 10⁻⁰⁴
Median_diffc	0	5.2 × 10⁻⁰¹	1	6.9 × 10⁻¹⁶	1	5.2 × 10⁻⁰⁴	2	8.1 × 10⁻¹³
Residuals	44		44		59		55

NOTE.—NA, non-available.

Italic values indicates significant after a Bonferroni correction (α = 0.05/4 = 0.0125).

The values NA were removed from the ANOVA because the number of branches in the subtree beta is identical to the number of branches in the subtree alpha.

Median_diff represents the difference between the medians of all branch lengths for both trees.

Effect of Potential Confounding Factors on the Detection of Covarions. NOTE.—NA, non-available. Italic values indicate significant after a Bonferroni correction (α = 0.05/4 = 0.0125). The values NA were removed from the analysis of variance because the number of branches in the subtree beta is identical to the number of branches in the subtree alpha. Median_diff represents the difference between the medians of all branch lengths for both trees. Effect of Potential Confounding Factors on the Detection of CBD Sites. NOTE.—NA, non-available. Italic values indicates significant after a Bonferroni correction (α = 0.05/4 = 0.0125). The values NA were removed from the ANOVA because the number of branches in the subtree beta is identical to the number of branches in the subtree alpha. Median_diff represents the difference between the medians of all branch lengths for both trees.

Discussion

Covarion-Like and CBD patterns as Proxies of Functional Divergence

We found a variable proportion of protein families to be significantly under covarion process, depending of the branch tested, from 0.8% to 8.5% (table 2), and very few families with a CBD pattern. The proportion of 2% found by Gruenheit et al. (2008) in their balanced data set between two monophyletic groups is within the same range. This suggests that covarion-like patterns may reflect relatively frequent small refinements in function or even compensatory mutations without change in function. Importantly, most cases of covarion that we detect with Checkcov are also detected by the more recent Procov method (Wang et al. 2009). The rarity of CBD is consistent with more radical functional changes. It is also possible that CBD sites are more difficult to discriminate from other slow-evolving sites, resulting in rare detection under stringent criteria. Interestingly, the proportion of sites found when a branch is significant for a family is quite high for both patterns, and highest for CBD, which is consistent with the detection of radical functional shifts in protein function in at least some cases. Of note, removing fast-evolving fish genes increased our capacity to detect CBD patterns between tetrapode paralogs. Thus, we should be careful to take into account evolutionary rate differences between species in the study of paralog divergence. We have tried to apply strict cutoffs for both covarion and CBD, but it is difficult to estimate the true level of changes in amino acid evolution that will affect function in a biologically meaningful way (Levasseur et al. 2007). Case studies present both examples of the evolution of a new function (Braasch et al. 2006) and of more subtle optimization of the original function (Christin et al. 2008). In general, minor sequence changes can affect structural properties that are directly linked to the biochemical function (Tokuriki and Tawfik 2009). At the extreme, the bacterial melamine deaminase shares 98% of identity with the atrazine chlorohydrolase (Seffernick et al. 2001). A correlation of covarions as detected in sequences with structural and functional divergence has been found in several case studies. This is the case for bacterial and eukaryotic elongation factors, which differ in function (Gaucher et al. 2001), and for which covarions were confirmed by crystal structure (Gaucher, Das, et al. 2002). It is also the case in caspases (Wang and Gu 2001). In vertebrate hemoglobin α and β, CBD sites seem to be a more reliable predictor of function divergence than covarions (Gribaldo et al. 2003). Other case studies have highlighted the importance of studying covarion or CBD patterns in correlation with protein function (Gaucher, Gu, et al. 2002; Philippe et al. 2003; Liberles 2007). A difficulty in comparing results from various studies is the use of different methods, with no common standard, and notably differences in the treatment of radical versus conservative changes (Liberles 2007). Both covarions and CBD sites can result from either gain or loss of function. A covarion pattern may indicate a site, which was kept functionally constrained in one subtree but lost this constraint in the other, or the recruitment of a site (newly constrained) for a novel function. And a CBD pattern may indicate a site conserved in the ancestor which changed by positive selection to get a new function, or a partition of the ancestral protein into different functions, especially through escape from adaptive conflict (Conant and Wolfe 2008). The covarion model has been recently extended to detect shifts in any branch in a phylogenetic tree (Penn et al. 2008). However, this method needs dozens of sequences, which is not yet applicable to the comparison of completely sequenced animal genomes. It would be interesting in the future to develop methods to analyze such patterns between three subtrees: the two lineages of interest and an outgroup, which will help to discriminate between gain and loss of function (Studer and Robinson-Rechavi 2009b).

Can We Link These Shifts at the Amino Acid Level with Positive Selection Inferred at the Codon Level?

Although the models of amino acid changes are not able to discriminate between a relaxation of purifying selection and positive selection, this is the main focus of codon models (Anisimova and Liberles 2007). But an advantage of amino acid patterns is the potential for comparisons between sequences for which synonymous substitutions are saturated. Thus, a correlation between amino acid shifts and the detection of positive selection would be interesting. We found only a weak positive correlation between scores of positive selection and of covarions. Among sites identified as under positive selection on a specific branch, there are more sites under purifying selection on background branches (type K2a of CodeML; Yang 2007) (∼6.5% in Studer et al. 2008), than sites under neutral evolution on background branches (type K2b) (∼1% in Studer et al. 2008). The type K2a of codon model with its slow evolutionary rate except on one branch is closest to a CBD pattern. The covarion pattern may be more similar to the type K2b but is not quite the same. CodeML K2b sites are expected to be variable in all background branches, whereas covarion sites are only variable in one subtree. In fact, the covarion pattern is most similar to the clade model of CodeML (Bielawski and Yang 2004), which has not been used in any large scan to our knowledge. In any case, it appears that by detecting CBD sites, we do detect a signal of positive selection similar to that inferred by the branch-site model. This opens the possibility to scan for such shifts in very ancient events, where synonymous substitutions are clearly saturated.

Is the Incidence of Covarion-Like and CBD Patterns Different between Paralogs and Orthologs?

In a study of vertebrate hemoglobin α and β, metrics of protein divergence were compared in their capacity to distinguish paralog divergence from ortholog divergence (Gribaldo et al. 2003). But on a larger data set, there is no obvious excess of either covarion-like or CBD sites on duplication branches relative to speciation branches (tables 2 and 3). The main explanatory variables for test results are branch length and the number of genes analyzed (tables 4 and 5). If anything, there seems to be a slight deficit of CBD sites after duplication, but this could be due to other factors, such as the evolutionary rate acceleration in teleost fishes. This is consistent with a previous report of similar levels of amino acid variability in orthologs and paralogs on a smaller sample (Conant et al. 2007). It has been suggested that the divergence of paralogs owes more to expression patterns than to biochemical function (Wapinski et al. 2007), although a stronger divergence of expression between paralogs than between orthologs remains to be established (Studer and Robinson-Rechavi 2009b). At the protein level, in any case, it appears that functional shifts could be frequent not only between paralogs but also between orthologs and that we should be careful when applying orthology to define the function of new genes (Lynch 2009; Studer and Robinson-Rechavi 2009b).

Conclusions

In our study, we found that the CBD pattern correlates well with sites predicted by a branch-site model. This pattern could be used as a proxy of positive selection in very ancient evolutionary events, avoiding the classical problem of saturation of synonymous sites. Positive selection is expected to be involved in strong changes that affect the protein structure and function, which could be causing the CBD pattern. Most evolutionary models predict a burst of functional change after duplication (Force et al. 1999; Conant and Wolfe 2008). We find that changes at the amino acid level, while not infrequent, affect both orthologs and paralogs similarly. Thus, it seems that the same pattern of functional divergence affects both paralogs and orthologs genes, and this might be, rather than an exception, a general trend (Studer and Robinson-Rechavi 2009b). If there are any differences in gene evolution after duplication, it is possible that they affect expression patterns rather than protein biochemical function as reflected in sequences (Wapinski et al. 2007).

74 in total

1. The crystal structure of eEF1A refines the functional predictions of an evolutionary analysis of rate changes among elongation factors.

Authors: Eric A Gaucher; Ujjwal K Das; Michael M Miyamoto; Steven A Benner
Journal: Mol Biol Evol Date: 2002-04 Impact factor: 16.240

2. The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins.

Authors: Paramvir Dehal; Yutaka Satou; Robert K Campbell; Jarrod Chapman; Bernard Degnan; Anthony De Tomaso; Brad Davidson; Anna Di Gregorio; Maarten Gelpke; David M Goodstein; Naoe Harafuji; Kenneth E M Hastings; Isaac Ho; Kohji Hotta; Wayne Huang; Takeshi Kawashima; Patrick Lemaire; Diego Martinez; Ian A Meinertzhagen; Simona Necula; Masaru Nonaka; Nik Putnam; Sam Rash; Hidetoshi Saiga; Masanobu Satake; Astrid Terry; Lixy Yamada; Hong-Gang Wang; Satoko Awazu; Kaoru Azumi; Jeffrey Boore; Margherita Branno; Stephen Chin-Bow; Rosaria DeSantis; Sharon Doyle; Pilar Francino; David N Keys; Shinobu Haga; Hiroko Hayashi; Kyosuke Hino; Kaoru S Imai; Kazuo Inaba; Shungo Kano; Kenji Kobayashi; Mari Kobayashi; Byung-In Lee; Kazuhiro W Makabe; Chitra Manohar; Giorgio Matassi; Monica Medina; Yasuaki Mochizuki; Steve Mount; Tomomi Morishita; Sachiko Miura; Akie Nakayama; Satoko Nishizaka; Hisayo Nomoto; Fumiko Ohta; Kazuko Oishi; Isidore Rigoutsos; Masako Sano; Akane Sasaki; Yasunori Sasakura; Eiichi Shoguchi; Tadasu Shin-i; Antoinetta Spagnuolo; Didier Stainier; Miho M Suzuki; Olivier Tassy; Naohito Takatori; Miki Tokuoka; Kasumi Yagi; Fumiko Yoshizaki; Shuichi Wada; Cindy Zhang; P Douglas Hyatt; Frank Larimer; Chris Detter; Norman Doggett; Tijana Glavina; Trevor Hawkins; Paul Richardson; Susan Lucas; Yuji Kohara; Michael Levine; Nori Satoh; Daniel S Rokhsar
Journal: Science Date: 2002-12-13 Impact factor: 47.728

3. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

Review 4. Predicting functional divergence in protein evolution by site-specific rate shifts.

Authors: Eric A Gaucher; Xun Gu; Michael M Miyamoto; Steven A Benner
Journal: Trends Biochem Sci Date: 2002-06 Impact factor: 13.807

5. A covarion-based method for detecting molecular adaptation: application to the evolution of primate mitochondrial genomes.

Authors: Tal Pupko; Nicolas Galtier
Journal: Proc Biol Sci Date: 2002-07-07 Impact factor: 5.349

Review 6. Heterotachy and functional shift in protein evolution.

Authors: Hervé Philippe; Didier Casane; Simonetta Gribaldo; Philippe Lopez; Julien Meunier
Journal: IUBMB Life Date: 2003 Apr-May Impact factor: 3.885

7. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

8. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes.

Authors: Samuel Aparicio; Jarrod Chapman; Elia Stupka; Nik Putnam; Jer-Ming Chia; Paramvir Dehal; Alan Christoffels; Sam Rash; Shawn Hoon; Arian Smit; Maarten D Sollewijn Gelpke; Jared Roach; Tania Oh; Isaac Y Ho; Marie Wong; Chris Detter; Frans Verhoef; Paul Predki; Alice Tay; Susan Lucas; Paul Richardson; Sarah F Smith; Melody S Clark; Yvonne J K Edwards; Norman Doggett; Andrey Zharkikh; Sean V Tavtigian; Dmitry Pruss; Mary Barnstead; Cheryl Evans; Holly Baden; Justin Powell; Gustavo Glusman; Lee Rowen; Leroy Hood; Y H Tan; Greg Elgar; Trevor Hawkins; Byrappa Venkatesh; Daniel Rokhsar; Sydney Brenner
Journal: Science Date: 2002-07-25 Impact factor: 47.728

9. The genome sequence of the malaria mosquito Anopheles gambiae.

Authors: Robert A Holt; G Mani Subramanian; Aaron Halpern; Granger G Sutton; Rosane Charlab; Deborah R Nusskern; Patrick Wincker; Andrew G Clark; José M C Ribeiro; Ron Wides; Steven L Salzberg; Brendan Loftus; Mark Yandell; William H Majoros; Douglas B Rusch; Zhongwu Lai; Cheryl L Kraft; Josep F Abril; Veronique Anthouard; Peter Arensburger; Peter W Atkinson; Holly Baden; Veronique de Berardinis; Danita Baldwin; Vladimir Benes; Jim Biedler; Claudia Blass; Randall Bolanos; Didier Boscus; Mary Barnstead; Shuang Cai; Angela Center; Kabir Chaturverdi; George K Christophides; Mathew A Chrystal; Michele Clamp; Anibal Cravchik; Val Curwen; Ali Dana; Art Delcher; Ian Dew; Cheryl A Evans; Michael Flanigan; Anne Grundschober-Freimoser; Lisa Friedli; Zhiping Gu; Ping Guan; Roderic Guigo; Maureen E Hillenmeyer; Susanne L Hladun; James R Hogan; Young S Hong; Jeffrey Hoover; Olivier Jaillon; Zhaoxi Ke; Chinnappa Kodira; Elena Kokoza; Anastasios Koutsos; Ivica Letunic; Alex Levitsky; Yong Liang; Jhy-Jhu Lin; Neil F Lobo; John R Lopez; Joel A Malek; Tina C McIntosh; Stephan Meister; Jason Miller; Clark Mobarry; Emmanuel Mongin; Sean D Murphy; David A O'Brochta; Cynthia Pfannkoch; Rong Qi; Megan A Regier; Karin Remington; Hongguang Shao; Maria V Sharakhova; Cynthia D Sitter; Jyoti Shetty; Thomas J Smith; Renee Strong; Jingtao Sun; Dana Thomasova; Lucas Q Ton; Pantelis Topalis; Zhijian Tu; Maria F Unger; Brian Walenz; Aihui Wang; Jian Wang; Mei Wang; Xuelan Wang; Kerry J Woodford; Jennifer R Wortman; Martin Wu; Alison Yao; Evgeny M Zdobnov; Hongyu Zhang; Qi Zhao; Shaying Zhao; Shiaoping C Zhu; Igor Zhimulev; Mario Coluzzi; Alessandra della Torre; Charles W Roth; Christos Louis; Francis Kalush; Richard J Mural; Eugene W Myers; Mark D Adams; Hamilton O Smith; Samuel Broder; Malcolm J Gardner; Claire M Fraser; Ewan Birney; Peer Bork; Paul T Brey; J Craig Venter; Jean Weissenbach; Fotis C Kafatos; Frank H Collins; Stephen L Hoffman
Journal: Science Date: 2002-10-04 Impact factor: 47.728

10. Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence.

Authors: Susan E Celniker; David A Wheeler; Brent Kronmiller; Joseph W Carlson; Aaron Halpern; Sandeep Patel; Mark Adams; Mark Champe; Shannon P Dugan; Erwin Frise; Ann Hodgson; Reed A George; Roger A Hoskins; Todd Laverty; Donna M Muzny; Catherine R Nelson; Joanne M Pacleb; Soo Park; Barret D Pfeiffer; Stephen Richards; Erica J Sodergren; Robert Svirskas; Paul E Tabor; Kenneth Wan; Mark Stapleton; Granger G Sutton; Craig Venter; George Weinstock; Steven E Scherer; Eugene W Myers; Richard A Gibbs; Gerald M Rubin
Journal: Genome Biol Date: 2002-12-23 Impact factor: 13.583

9 in total

1. A Phylogenetic Rate Parameter Indicates Different Sequence Divergence Patterns in Orthologs and Paralogs.

Authors: Joseph B Ahrens; Ashley I Teufel; Jessica Siltberg-Liberles
Journal: J Mol Evol Date: 2020-10-29 Impact factor: 2.395

2. The evolution of novelty in conserved gene families.

Authors: Gabriel V Markov; Ralf J Sommer
Journal: Int J Evol Biol Date: 2012-06-19

3. Multidimensional scaling reveals the main evolutionary pathways of class A G-protein-coupled receptors.

Authors: Julien Pelé; Hervé Abdi; Matthieu Moreau; David Thybert; Marie Chabbert
Journal: PLoS One Date: 2011-04-22 Impact factor: 3.240

4. Understanding the functional difference between growth arrest-specific protein 6 and protein S: an evolutionary approach.

Authors: Romain A Studer; Fred R Opperdoes; Gerry A F Nicolaes; André B Mulder; René Mulder
Journal: Open Biol Date: 2014-10 Impact factor: 6.411