Literature DB >> 22042333

Significant selective constraint at 4-fold degenerate sites in the avian genome and its consequence for detection of positive selection.

Axel Künstner1, Benoit Nabholz, Hans Ellegren.   

Abstract

A major conclusion from comparative genomics is that many sequences that do not code for proteins are conserved beyond neutral expectations, indicating that they evolve under the influence of purifying selection and are likely to have functional roles. Due to the degeneracy of the genetic code, synonymous sites within protein-coding genes have previously been seen as "silent" with respect to function and thereby invisible to selection. However, there are indications that synonymous sites of vertebrate genomes are also subject to selection and this is not necessarily because of potential codon bias. We used divergence in ancestral repeats as a neutral reference to estimate the constraint on 4-fold degenerate sites of avian genes in a whole-genome approach. In the pairwise comparison of chicken and zebra finch, constraint was estimated at 24-32%. Based on three-species alignments of chicken, turkey, and zebra finch, lineage-specific estimates of constraint were 43%, 29%, and 24%, respectively. The finding of significant constraint at 4-fold degenerate sites from data on interspecific divergence was replicated in an analysis of intraspecific diversity in the chicken genome. These observations corroborate recent data from mammalian genomes and call for a reappraisal of the use of synonymous substitution rates as neutral standards in molecular evolutionary analysis, for example, in the use of the well-known d(N)/d(S) ratio and in inferences on positive selection. We show by simulations that the rate of false positives in the detection of positively selected genes and sites increases several-fold at the levels of constraint at 4-fold degenerate sites found in this study.

Entities:  

Mesh:

Substances:

Year:  2011        PMID: 22042333      PMCID: PMC3242499          DOI: 10.1093/gbe/evr112

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Introduction

Before detailed studies of genetic variation at the DNA and protein levels were possible, a common view held that most mutations in the genetic material have an effect on fitness (Dobzhansky 1970). As a consequence, they were thought to either relatively quickly reach fixation by positive selection or become removed from the population by negative selection. This view was challenged in the 1960s by the observation of significant within-species polymorphism (Hopkinson et al. 1963; Spencer et al. 1964; Lewontin and Hubby 1966), indicating that some of the variation in the genome might be more or less neutral with respect to fitness. Sparked in part by such data and armed with a mathematical approach based on diffusion equations to derive theoretical arguments, this soon led Kimura to develop the neutral theory of molecular evolution (Kimura 1968), a model positing that genetic drift of neutral alleles is an important driving force in evolution. In parallel, it became increasingly clear that a large fraction of the genome appears nonfunctional and is thereby potentially shielded from selection. DNA was found to consist of much else than genes, which were found to consist of exons and introns, and cracking the genetic code revealed that some positions within exons were “silent” with respect to which amino acid is encoded. The historical perspective briefly sketched out above is of relevance for the development of our current view on genome composition and molecular evolution. To some extent, the shift in focus of the 1960s and 1970s, from natural selection being thought to have a prevailing role to acknowledging that neutral processes affect parts of the genome, is corroborated and substantiated by recent genomic data. This is particularly so, given Ohta's extension of Kimura's model to the nearly neutral theory of molecular evolution (Ohta 1973) and the many clear examples that slightly deleterious mutations will effectively behave as neutral at low effective population sizes (Wright and Andolfatto 2008; Ellegren 2009). However, in parallel, there has also been an opposite trend. One of the most important conclusions from comparative genomics is that many regions of the genome previously considered nonfunctional show evidence of sequence conservation beyond neutral expectations (reviewed by, e.g., Dermitzakis et al. 2005). This suggests that several other sequence categories than those directly coding for proteins are functional and subject to selection. The identification of the numerous ultra-conserved elements found in intergenic regions previously seen as genomic “deserts” in vertebrate genomes (Bejerano et al. 2004; Katzman et al. 2007) are an example of a transition in the perception of the genome and how it evolves. Due to the degeneracy of the genetic code, synonymous substitutions are candidates to represent neutral changes. Accordingly, 4-fold degenerate sites have traditionally been seen as essentially free of selective constraint (Eyre-Walker and Keightley 1999; Nachman and Crowell 2000), at least in mammals where effective population sizes are often low and where mutations with a small effect on fitness should be expected to behave as neutral. The 4-fold degenerate sites have therefore been used as a neutral reference both in studies of constraint at nonsynonymous sites and in noncoding sequences. However, there is evidence that at least some silent sites are constrained also in mammals (Chamary et al. 2006). Importantly, two recent genome-wide studies have reported significant levels of constraint at 4-fold degenerate sites of mammalian genes. Eőry et al. (2010) obtained estimates of 22–27% in hominids and 11–12% in murids. Pollard et al. (2010) used alignments of 36 mammalian genomes to estimate a mean constraint in the mammalian phylogeny of 25%. In both these studies, divergence at ancestral repeats (ARs) was used as a neutral reference against which divergence at synonymous sites was contrasted. To widen the knowledge of selection at synonymous sites in vertebrates, we here estimate constraint at 4-fold degenerate sites of avian genes. We use data from three available bird genomes (chicken, turkey, and zebra finch) and find high levels of constraint: 24–43%. Selection on synonymous sites thus seems to be a ubiquitous feature of vertebrate genomes.

Material and Methods

Sequence Data and Divergence Estimates

Protein-coding sequences for 1:1 orthologs of chicken and zebra finch and for 1:1:1 orthologous of chicken, turkey, and zebra finch were downloaded from BioMart (Ensembl 61) (http://biomart.org) and, in both cases, aligned using MAFFT 6.716b (Katoh et al. 2009). Pairwise chicken-zebra finch alignments of intronic sequences and three species EPO alignments of intronic sequences, including information about intron coordinates, were downloaded from Ensembl (Version 61). Due to alternative splicing, some introns are (partly) annotated as exonic sequences and intronic sites annotated as exons were removed. Repeats were masked using RepeatMasker 3.2.9 (Smit et al. 2010), and all alignments were then cleaned using Gblocks 0.91b (Castresana 2000) with a minimal block length of 30 bases and a maximum number of eight contiguous nonconserved positions. Chicken annotations (Ensembl 61) were used to identify intronic transposable elements for the chickenzebra finch comparison, and we defined AR as elements present in orthologous positions of chicken and zebra finch introns. For the three-species comparison of chicken, turkey, and zebra finch, we defined AR as elements present in orthologous positions of introns of all three species, again using chicken repeat annotations. We masked all CpG dinucleotides by excluding sites preceded by cytosine (C) or followed by guanine (G) (CpG-prone sites, see Keightley and Gaffney 2003); all divergence estimates reported herein are thus non-CpG divergences. Divergences were estimated using a general time-reversible model with a gamma distribution of variable rate among sites (REV model of baseml from the PAML package version 4.4b; Yang 2007). For two-species alignments, we estimated divergence in the different sequence categories (0-fold and 4-fold degenerate sites, ARs, and introns) for each gene separately and then obtained the genome-wide mean based on these estimates. For the three-species alignments where fewer genes were available, we concatenated all gene sequences and obtained a genome-wide divergence estimate from the concatenated data. However, in order to be able to test for a correlation between divergence and gene expression parameters in chicken, we also estimated divergence at 4-fold degenerate sites for each gene separately in the three-species alignments. To reduce the risk of incorrectly inferred orthology and to avoid saturation problems, genes with a total divergence d > 1.8 were excluded in the pairwise comparison, and genes exceeding a divergence of 0.9 in at least one branch were excluded from further analysis in the three-species comparison. Pairwise estimates of d and d for each gene between chicken and zebra finch were taken from Nam et al. (2010). Chicken polymorphisms derived from genome resequencing of pools of unrelated individuals were obtained from (Rubin et al. 2010). In the absence of available allele frequency estimates for these data, we used the density of single nucleotide polymorphisms (SNPs; number of SNPs per base pair) as a measure of polymorphism level.

Estimates of Selective Constraint

Selective constraint was estimated using an approach introduced by Kondrashov and Crow (1993) and extended by Eyre-Walker and Keightley (1999). For the pairwise alignments with divergence estimates obtained for individual genes, we used the formula:where N is the number of genes analyzed, o is the observed divergence at tested sequence category for gene i (in most cases 4-fold degenerate sites) and e is the expected divergence obtained from the divergence estimated for AR of gene i. For three-species alignments, we used the genome-wide divergence estimates obtained from concatenated sequences to directly estimate constraint as 1 − o/e. Note that we only included genes for which data on both divergence at 4-fold degenerate sites and at one or more intronic AR were available. This selection was applied to exclude the possibility of differences in rate and pattern of nucleotide substitution between regions with and without intronic AR affecting the estimates of constraint. Weighted estimates of constraint were obtained by a method similar to the approach of Halligan and Keightley (2006). For estimating the weighting factor, we divided the number of alignable non-CpG AR sites by the number of all non-CpG AR sites. Weighted constraints were estimated dividing the original constraint estimates by the weighting factor.

Simulations of Tests for Positive Selection

To evaluate the impact of constraint on tests for positive selection, we simulated sequences using a branch-site model of evolution using Evolver from the PAML package (Yang 2007). The data were generated using the tree shown in Supplementary figure 1 and with two types of selection schemes. In the first, we simulated two classes of sites, one evolving under w1 = 0.2 in background branches (black lineages in Supplementary figure 1) and w2 = 1.0 in foreground branch (gray lineage) and the other with a constant w = 0.2 in all the lineages. In the second scheme, the first class had w1 = 0.1 in background branches and w2 = 2.0 in the foreground branch, whereas the second class had a constant w = 0.1 in all the lineages. In both schemes, we allocated 20% of the sites to the first class and 80% to the second. To mimic constraint acting on synonymous sites, we ignored a fraction (0.25, 0.35, or 0.45) of the synonymous substitutions simulated. This was done by comparing the simulated sequences with the ancestral sequences—provided by Evolver—generated during simulation. Each time a synonymous substitution was identified, it was replaced by the ancestral state with a probability equal to the constraint. In the case of two substitutions per codon, we considered all the mutational paths between the two codons, except those leading to a stop codon. For simplification, we considered only one substitution randomly in case of three substitutions per codon and did not consider multiple substitutions at the same site. The process was repeated for internal lineages in order to apply the same constraint throughout the phylogenetic tree. For each simulation scheme, we simulated 200 data sets of 1,000 codons with and without constraint acting on synonymous sites and applied the branch-site likelihood test of positive selection (Zhang et al. 2005). We repeated simulations with three different codon frequency spectra obtained for chickenturkeyzebra finch orthologous alignments chosen to represent GC-poor, GC-average, and GC-rich genes.

Gene Expression and Codon Usage

Median-subtracted arcsinh expression data from 20 different chicken tissues were taken from Chan et al. (2009). Gene-wise expression breadth (τ) was estimated following Yanai et al. (2005):where N is the number of tissues, x, the expression intensity for gene x in tissue i, and xmax, the maximum expression intensity of gene x across all tissues. For analyses of gene expression level, we used xmax. As a measure for codon bias, we calculated the Codon Adaptation Index (CAI; Sharp and Li 1987) obtained for chicken protein-coding sequences using CODONW (http://codonw.sourceforge.net).

Statistical Analyses

Statistical analyses were performed using R version 2.13.1 (R Development Core Team 2008). If not stated otherwise, nonparametric bootstrapping (1,000 iterations on concatenated sequences) was used to estimate confidence intervals for divergence and level of selective constraint.

Results

Estimates of Constraint at 4-Fold Degenerate Sites in the Chicken–Zebra Finch Comparison

We identified 13,245 chickenzebra finch 1:1 orthologs and among these 3,772 genes had at least one intronic AR element. After alignment, filtering, and removal of CpG-prone sites, the data set could broadly be defined as composed of 2.97 million 0-fold degenerate sites, 0.34 million 4-fold degenerate sites, 7.94 million bp of intronic AR, and 58.19 million bp of non-AR intron sequence. Divergence in these three categories of sequence classes was lowest among 0-fold degenerate sites and highest in AR (fig. 1). Divergence at 4-fold degenerate sites and at 0-fold sites was significantly different from AR divergence (Mann–Whitney U test, P < 0.001 for both comparisons). If we use AR as a neutral reference to estimate constraint in the other sequence categories, 0-fold degenerate sites show an 86.7% (±0.6%) and 4-fold degenerate sites a 24.2% (±1.6%) constraint. Intronic divergence was similar to the AR divergence.
F

Estimated sequence divergence of ARs, introns, 4-fold degenerated sites, and 0-fold sites in the chicken–zebra finch comparison estimated gene by gene. ***Denotes significantly lower divergence in comparison to ARs (P < 0.001).

Estimated sequence divergence of ARs, introns, 4-fold degenerated sites, and 0-fold sites in the chickenzebra finch comparison estimated gene by gene. ***Denotes significantly lower divergence in comparison to ARs (P < 0.001). Compared with the noncoding regions, the high degree of sequence conservation within the coding sequence and the ability to project DNA sequences onto alignments of protein sequences greatly reduce the frequency of gaps and the need for filtering of those regions difficult to align. However, filtering of regions that are difficult to align within a presumably neutral sequence may lead to the divergence being underestimated because the rapidly evolving sites can be excluded from the analysis. One way to handle this potential problem is to weight estimated constraint by the proportion of sequence removed in the neutral reference. Using a procedure similar to Halligan and Keightley (2006) (for details, see Material and Methods), we obtained a weighted estimate of constraint of 31.5% (±2.0%) for the 4-fold degenerate sites. Previous work in several different organisms (Lercher et al. 2004; Webster et al. 2004), including birds (Axelsson et al. 2005), has demonstrated regional consistency in nucleotide substitution rates, suggested to reflect regional mutation rate variation. This can also be seen in our data with a significant correlation in divergence between the neighboring genes, both for divergence at 4-fold degenerate sites (Breusch–Godfrey serial correlation LM test, u = 26.8599, degree of freedom [df] = 1, P < 0.001) and for divergence in ARs (u = 5.9362, df = 1, P = 0.015). However, we found no evidence for the clustering of genes in the genome with similar levels of constraint at 4-fold degenerate sites (u = 0.193, df = 1, P = 0.66) nor did the constraint at 0-fold sites correlate with the genomic location (u = 0.0148, df = 1, P = 0.90). Thus, selection does not display regional variation over the long evolutionary time scale analyzed (whereas such effects are expected over short time scales, due to the background selection and selective sweeps).

Lineage-Specific Estimates of Constraint in the Chicken–Turkey–Zebra Finch Comparison

Using three-species alignments of available avian genomes (chicken, turkey, and zebra finch) allows for lineage-specific estimates of divergence and also for polarizing substitutions onto lineages by parsimony principles. In addition, the evolutionary distance between the two galliforms, chicken and turkey, is considerably shorter than that between chicken and zebra finch, which should facilitate alignment and make divergence estimates more accurate. We identified 9,531 1:1:1 orthologs among the three species, and for 1,667 of these, there was at least one intronic AR present. Table 1 reports the estimated divergences of AR and 4-fold degenerate sites in the chicken, turkey, and zebra finch lineages. Note that in the unrooted tree of the three species, the lineage from the split between chicken and turkey to zebra finch (“the zebra finch lineage”) includes the basal galliform branch, the short Galloanserae and Neoaves internal branches, the basal passeriform branch and the terminal branch leading to zebra finch (Supplementary figure 2). The trends were similar to those obtained for the chickenzebra finch comparison with AR evolving more rapidly than 4-fold degenerate sites (table 1). The estimated constraint for 4-fold degenerate sites was 43.1% (±1.6%), 28.8% (±1.8%), and 24.2% (±1.4%) in the chicken, turkey, and zebra finch lineages, respectively. Weighting the estimated constraints to take regions difficult to align in AR into account increases the estimates to 49.9% (±1.9%), 33.4% (±2.1%), and 28.0% (±1.6%), respectively. These estimates are generally higher than what we obtained in the more distant chickenzebra finch pairwise comparison.
Table 1

Lineage-Specific Divergence (Mean ± Standard Deviation) of Different Sequences Categories Estimated from Concatenated Three-Species Alignments of Chicken, Turkey, and Zebra Finch

4-Fold SitesAR
Chicken0.028 (±0.001)0.049 (±0.001)
Turkey0.038 (±0.001)0.054 (±0.001)
Zebra finch0.302 (±0.005)0.399 (±0.004)

Note.—The lineages are from an unrooted tree of the three species.

Lineage-Specific Divergence (Mean ± Standard Deviation) of Different Sequences Categories Estimated from Concatenated Three-Species Alignments of Chicken, Turkey, and Zebra Finch Note.—The lineages are from an unrooted tree of the three species.

Chicken Polymorphisms

An analysis of within-species polymorphism essentially circumvents potential problems related to the varying confidence by which different sequence categories can be aligned in distant evolutionary comparisons. It also more directly pinpoints the selective constraints of sequence evolution currently in place. Based on genomic resequencing of pools of chicken population samples, Rubin et al. (2010) gathered genome-wide data on the location of 7 million of SNPs. We calculated the mean density of SNPs (i.e., the number of segregating sites divided by the length of sequence analyzed) in different sequence categories and found that 0-fold degenerate sites had the lowest incidence (0.0039 ± 0.0001), followed by 4-fold degenerate sites (0.0058 ± 0.0001) and AR (0.0080 ± 0.0002). The density of SNPs in AR was significantly higher than in the other categories (two-sided Mann–Whitney U test, P < 0.001 in all cases). The level of polymorphism at 4-fold degenerate sites in chicken was 28% lower than in AR.

Effect of Constraint at 4-Fold Degenerate Sites on d/d and Tests for Positive Selection

Constraint at 4-fold degenerate sites has implications for the interpretation of selection from the d/d ratio and for tests of positive selection. To illustrate this, we compared estimates of d/d and d/dAR for the 3,772 genes in the chickenzebra finch comparison (dAR being divergence at ARs). On average, d/d estimates (mean 0.133 ± 0.142) were about 20% higher than d/dAR estimates (0.109 ± 0.157; two-tailed Mann–Whitney U test, P < 0.001). At the level of individual genes, dAR was higher than d for 72% of the genes. As expected from the relatively low number of available sites per gene and the associated stochastic influence on divergence estimates, the difference between dAR and d was largest for genes with high estimates of dAR (fig. 2).
F

Gene-by-gene differences between divergence estimates of AR and synonymous sites. The dashed horizontal line marks where estimates of d and dAR are equal. Values below the line are genes where divergence at AR is estimated higher than divergence at synonymous sites (and vice versa for values above the line). The red line denotes the lowess curve.

Gene-by-gene differences between divergence estimates of AR and synonymous sites. The dashed horizontal line marks where estimates of d and dAR are equal. Values below the line are genes where divergence at AR is estimated higher than divergence at synonymous sites (and vice versa for values above the line). The red line denotes the lowess curve. It is conceivable that selective constraint at 4-fold degenerate sites can affect inferences on positive selection. Specifically, if 4-fold degenerate sites are constrained, the proportion of nonsynonymous sites interpreted to evolve more rapidly than the presumed neutral reference should be elevated, potentially leading to an increased rate of false positives in the detection of positively selected sites and genes. To investigate this, we simulated sequence evolution under different levels of constraint at 4-fold degenerate sites and applied a standard maximum likelihood test of positive selection (PAML, branch-site model). The constraint levels (25%, 35%, and 45%) were chosen to reflect the range of estimates obtained in this study. For simulations without positive selection (simulations 1–3 in table 2), the proportion of genes detected as positively selected (i.e., false positives) increased from 3–9% without constraint to 14–33% with constraint. Moreover, the frequency of positively selected sites increased from 0.4–0.6% to 0.6–7.9%. Simulations letting a fraction of sites evolve under positive selection (see Material and Methods) found similar effects. For example, at a level of 35% constraint at 4-fold degenerate sites, the proportion of genes detected as positively selected increased from 47–54% to 76–90% and the frequency of positively selected sites increased from 4.2–7.4% to 8.1–8.8%.
Table 2

Simulation Results for the Proportion of Significant Likelihood Ratio Tests (LRT) for Positive Selected Genes and for the Number of Positively Selected Sites with Constraint (Denoted by “Constr.”) and without Constraint (Denoted by “No con.”)

SimulationGC-ContentPositive Selection SimulatedProportions of Significant LRT
Mean Number of Positively Evolving Sites
25% Constraint
35% Constraint
45% Constraint
25% Constraint
35% Constraint
45% Constraint
No Con.Constr.No Con.Constr.No Con.Constr.No Con.Constr.No Con.Constr.No Con.Constr.
1LowNo0.090.140.040.160.050.335.56533579.5
2AverageNo0.040.140.050.170.060.366525.54.574
3HighNo0.080.150.090.220.030.32666136.577
4LowYes0.530.710.470.860.510.9656.57159.885.853.390.1
5AverageYes0.490.730.490.760.450.94747774817486
6HighYes0.530.740.540.90.520.9639.57042.58850.596

Note.—Results were obtained from simulating 200 data sets of 1,000 codons with and without constraint acting on synonymous sites and applied the branch-site likelihood test of positive selection as implemented in PAML. LRT, Likelihood Ratio Tests.

Simulation Results for the Proportion of Significant Likelihood Ratio Tests (LRT) for Positive Selected Genes and for the Number of Positively Selected Sites with Constraint (Denoted by “Constr.”) and without Constraint (Denoted by “No con.”) Note.—Results were obtained from simulating 200 data sets of 1,000 codons with and without constraint acting on synonymous sites and applied the branch-site likelihood test of positive selection as implemented in PAML. LRT, Likelihood Ratio Tests. As expected, the rate of false positives increased with increasing constraint. In simulations without positive selection, the proportion of genes identified as positively selected was on average 2.2 times higher at 25% constraint on 4-fold degenerate sites than when such sites were unconstrained. For levels of 35% and 45% constraint, this increased to 3.5 and 7.8 times higher, respectively. For identification of positive sites, the frequency was 1.0, 4.6, and 14.7 times higher at 25%, 35%, and 45% constraint, respectively. Under the more realistic scenario of some sites evolving under positive selection, the proportion of genes identified as positively selected was 1.4, 1.7, and 1.9 times higher at 25%, 35%, and 45% constraint, respectively. The corresponding numbers for positively selected sites were 1.4, 1.5, and 1.6 times the higher frequencies.

Divergence and 4-Fold Degenerate Sites and Gene Expression

In order to search for possible causative correlates of constraint at 4-fold degenerate sites, we considered variables related to gene expression: expression level, expression breadth, and codon usage. Microarray gene expression data is available from chicken so we focused on divergence in the chicken lineage using the three-species data set. However, divergence at 4-fold degenerate sites was not correlated to any of the variables tested (gene expression level: Pearson r = 0.0008, P = 0.98; breadth of gene expression (τ): r = −0.0262, P = 0.48; CAI: r = 0.0210, P = 0.57).

Discussion

The main conclusion from this study is that 4-fold degenerate sites of avian genes evolve under significant constraint, at least when constraint is estimated using divergence in ARs as a neutral reference. This conclusion is supported both by data on intraspecific polymorphism and interspecific divergence. In the relatively distant comparison of chicken and zebra finch, which are estimated to have diverged 90 mya (van Tuinen and Hedges 2001), we obtained constraint estimates of 24–32%. The split between chicken (super order Galloanserae) and zebra finch (Neoaves) lineages represents the most basal divergence within Neognathae (Supplementary figure 2), a group that contains >99% of all extant bird species. The deep split of the lineages investigated could suggest that the estimated constraint constitutes a representative average for birds. However, it should be noted that contemporary birds are classified in some 25 orders, most of which originated around or soon after the K/T boundary (i.e., there are probably only short internal branches, in particular within Neoaves; Hackett et al. 2008; Pacheco et al. 2011). The lineages that we sampled thus only constitute a minor part of evolution among modern birds. When using three-species alignments of chicken, turkey, and zebra finch, we found constraint for 4-fold degenerate sites to be the highest in the chicken lineage (43%) followed by the turkey (29%) and the zebra finch lineages (24%). This rank order might be seen as surprising if one considers that selection against slightly deleterious mutations should be more efficient in large populations, that is, the level of constraint should correlate positively with the effective population size, N (Ohta 1973). Although passeriforms (zebra finch) are typically small and short-lived birds, galliforms (chicken and turkey) are large and long-lived, and it is clear that N of natural populations of passeriforms are typically larger than that of galliforms. As a consequence, selection should be more efficient and constraint higher in the former than in the latter. However, as mentioned above, the zebra finch lineage in the unrooted tree of chicken, turkey, and zebra finch includes several internal branches, including basal Galliformes, so the large population size of passeriforms is unlikely to be representative for the entire zebra finch lineage in the unrooted tree. Moreover, just as functional elements within noncoding DNA turn over during evolution (Smith et al. 2004; Siepel et al. 2005; Pheasant and Mattick 2007), with the consequence of the amount of shared functional sequence decreasing with increasing genetic distance (Meader et al. 2010), it is conceivable that the functional importance of individual synonymous sites also changes. We may thus expect estimates of selective constraint to be lower for more distant comparisons. These caveats suggest that the lower estimate of constraint at 4-fold degenerate sites in the zebra finch than in the chicken and turkey lineages should not be taken too far. Yet, it could be noted that Eőry et al. (2010) found constraint at 4-fold degenerate sites to be lower in murids than in hominids, despite the much large effective population sizes of the former than of the latter. Although comparative genomic studies are powerful in detecting purifying selection from sequence conservation in particular regions or at particular sites, they cannot reveal the underlying functional role of these sequences. However, several selective processes may explain why synonymous sites are constrained (Chamary et al. 2006), including selection for mRNA stability, translational efficiency, and splicing regulation (Rocha 2004; Chamary and Hurst 2005; Parmley and Hurst 2007; Drummond and Wilke 2008). Moreover, there are increasing number of examples where mutations at synonymous sites cause human disease, demonstrating the critical role of such sites (e.g., Brest et al. 2011). We failed to detect a significant correlation between the divergence at 4-fold degenerate sites and either breadth or level of gene expression. Moreover, we did not find a correlation between the divergence and codon usage. There is little evidence for codon usage bias in birds, and the codon adaptation index was not correlated with gene expression level (r = 0.0462, P = 0.21), as would have been expected under the selection for preferred codons in highly expressed genes (Hershberg and Petrov 2008). As selection for codon usage is typically weak (Duret 2002), it may very well be that the relatively low effective population sizes (N) of birds means that N for codon usage is in the neutral range. Clearly, any inference of selective constraint in a particular sequence category is only relative to a presumed neutral reference. Do ARs represent the “ideal” neutral reference? A possible confounding factor is the homogenizing effects on sequence evolution of events of nonallelic gene conversion. Such events are known to occur among transposable elements although their incidence is highest among young repeat elements with high sequence similarity (Aleshin and Zhi 2010). It may thus not be an issue for ARs present in distantly related genomes, like those of chicken and zebra finch. Another factor would be positive or negative selection at individual elements that have attained functional roles, for example, because of exonization (Schwartz et al. 2009; Shen et al. 2011). There is increasing awareness of the importance of Alu elements as gene regulators during human evolution, however, for birds there has been no such documentation. Indeed, avian genomes are low in repeat numbers (Hillier et al. 2004), and there is no prominent occurrence of short interspersed elements (SINE) (like mammalian Alu elements). A third aspect relates to the fact that transposable elements tend to be hypermethylated as a host defense mechanism against transcription and further spread of repeat elements (Yoder et al. 1997). This will clearly affect the substitution rate at CpG sites due to the strong tendency for cytosines at CpG sites to be methylated and replaced by thymine upon spontaneous deamination (e.g., Holliday and Grigg 1993). Divergence at AR-CpG sites should thus be higher than at many other CpG sites even though they evolve neutrally. However, methylation status does not appear to affect the substitution rate at non-CpG sites (Mugal and Ellegren 2011), which is the rate we studied herein. Overall, conclusions reached by recent studies of mammalian genomes suggest that AR currently represent the most appropriate neutral reference for molecular evolutionary analyses (Thomas et al. 2003; Lunter et al. 2006; Eőry et al. 2010; Pollard et al. 2010). The rationale for studying the strength of selection at nonsynonymous sites by taking the d/d ratio, rather than just d, is that scaling d by d will take variation in nonsynonymous divergence due to variation in the underlying mutation rate (supposedly manifested in d) into account (Hurst 2002). However, a consequence of constrained evolution at 4-fold degenerate sites is that the d/d metric will not be a proper measure of the rate of protein evolution. For example, the common inference of neutral evolution when d/d = 1 will not be valid. This will clearly need further investigation, not least because the degree of constraint at synonymous sites may very well vary among genes and so also the effect on d/d. Related to this, it will be important to address the correlation in constraint between d and d. These two parameters are obviously correlated due to mutation rate variation (Wolf et al. 2009); however, the question is if purifying selection adds to the correlation. Moreover, variation in the level of constraint at 4-fold degenerate sites due to variation in N or life history will have implications to attempts to evaluate the effects of the same parameters on the rate of protein evolution via d/d (cf. Wright and Andolfatto 2008; Ellegren 2009). One important consequence of constraint at synonymous sites is that it may impede on the identification of positively selected genes and sites. Based on simulations of sequence evolution and inferences of positive selection using PAML, we found a significant increase in the frequency of false positives when constraint at synonymous sites was introduced. For example, at 35% constraint and at a GC-content close to the genomic average, the frequency of falsely identified positively selected genes, as well as falsely identified positively selected sites, increased by a factor of 3–5 under a scenario of no positive selection. Under the same constraint and GC-content and with positive selection introduced, the incidence of false positives increased by a factor of 1.1–1.5. We therefore foresee the need for improved maximum likelihood protocols for detection of positive selection that take into account deviations from neutrality in the sequence category used as a neutral reference. Divergence at synonymous sites is often used as a measure of genetic distance between species. Another consequence of the observation that 4-fold degenerate sites evolve under constraint is that this divergence will not represent an unbiased distance metric, and this is particularly so if the level of constraint varies among lineages. For example, in a previous study, we estimated genome-wide mean d in the chickenzebra finch comparison at 0.42, an estimate that would be recognized as similar to that typically seen among eutherian orders. If we assume that the level of constraint at 4-fold degenerate sites is 30%, then a more unbiased (neutrality-based) distance would be 0.42/0.7 = 0.60.

Supplementary Material

Supplementary figures 1 and 2 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).
  53 in total

1.  Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.

Authors:  J Castresana
Journal:  Mol Biol Evol       Date:  2000-04       Impact factor: 16.240

2.  Estimate of the mutation rate per nucleotide in humans.

Authors:  M W Nachman; S L Crowell
Journal:  Genetics       Date:  2000-09       Impact factor: 4.562

Review 3.  Evolution of synonymous codon usage in metazoans.

Authors:  Laurent Duret
Journal:  Curr Opin Genet Dev       Date:  2002-12       Impact factor: 5.578

4.  Gene expression, synteny, and local similarity in human noncoding mutation rates.

Authors:  Matthew T Webster; Nick G C Smith; Martin J Lercher; Hans Ellegren
Journal:  Mol Biol Evol       Date:  2004-06-02       Impact factor: 16.240

5.  Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level.

Authors:  Jianzhi Zhang; Rasmus Nielsen; Ziheng Yang
Journal:  Mol Biol Evol       Date:  2005-08-17       Impact factor: 16.240

6.  A selection model of molecular evolution incorporating the effective population size.

Authors:  Hans Ellegren
Journal:  Evolution       Date:  2009-02       Impact factor: 3.694

7.  The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications.

Authors:  P M Sharp; W H Li
Journal:  Nucleic Acids Res       Date:  1987-02-11       Impact factor: 16.971

8.  A molecular approach to estimating the human deleterious mutation rate.

Authors:  A S Kondrashov; J F Crow
Journal:  Hum Mutat       Date:  1993       Impact factor: 4.878

Review 9.  Cytosine methylation and the ecology of intragenomic parasites.

Authors:  J A Yoder; C P Walsh; T H Bestor
Journal:  Trends Genet       Date:  1997-08       Impact factor: 11.639

10.  Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution.

Authors:  D Allan Drummond; Claus O Wilke
Journal:  Cell       Date:  2008-07-25       Impact factor: 41.582

View more
  18 in total

1.  Environmental adaptation contributes to gene polymorphism across the Arabidopsis thaliana genome.

Authors:  Cheng-Ruei Lee; Thomas Mitchell-Olds
Journal:  Mol Biol Evol       Date:  2012-07-13       Impact factor: 16.240

2.  Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs.

Authors:  Richard E Green; Edward L Braun; Joel Armstrong; Dent Earl; Ngan Nguyen; Glenn Hickey; Michael W Vandewege; John A St John; Salvador Capella-Gutiérrez; Todd A Castoe; Colin Kern; Matthew K Fujita; Juan C Opazo; Jerzy Jurka; Kenji K Kojima; Juan Caballero; Robert M Hubley; Arian F Smit; Roy N Platt; Christine A Lavoie; Meganathan P Ramakodi; John W Finger; Alexander Suh; Sally R Isberg; Lee Miles; Amanda Y Chong; Weerachai Jaratlerdsiri; Jaime Gongora; Christopher Moran; Andrés Iriarte; John McCormack; Shane C Burgess; Scott V Edwards; Eric Lyons; Christina Williams; Matthew Breen; Jason T Howard; Cathy R Gresham; Daniel G Peterson; Jürgen Schmitz; David D Pollock; David Haussler; Eric W Triplett; Guojie Zhang; Naoki Irie; Erich D Jarvis; Christopher A Brochu; Carl J Schmidt; Fiona M McCarthy; Brant C Faircloth; Federico G Hoffmann; Travis C Glenn; Toni Gabaldón; Benedict Paten; David A Ray
Journal:  Science       Date:  2014-12-11       Impact factor: 47.728

3.  Variation in base-substitution mutation in experimental and natural lineages of Caenorhabditis nematodes.

Authors:  Dee R Denver; Larry J Wilhelm; Dana K Howe; Kristin Gafner; Peter C Dolan; Charles F Baer
Journal:  Genome Biol Evol       Date:  2012-03-20       Impact factor: 3.416

4.  Complete mitochondrial DNA sequences of the threadfin cichlid (Petrochromis trewavasae) and the blunthead cichlid (Tropheus moorii) and patterns of mitochondrial genome evolution in cichlid fishes.

Authors:  Christoph Fischer; Stephan Koblmüller; Christian Gülly; Christian Schlötterer; Christian Sturmbauer; Gerhard G Thallinger
Journal:  PLoS One       Date:  2013-06-24       Impact factor: 3.240

5.  Slow but not low: genomic comparisons reveal slower evolutionary rate and higher dN/dS in conifers compared to angiosperms.

Authors:  Emmanuel Buschiazzo; Carol Ritland; Jörg Bohlmann; Kermit Ritland
Journal:  BMC Evol Biol       Date:  2012-01-20       Impact factor: 3.260

6.  Highly variable recombinational landscape modulates efficacy of natural selection in birds.

Authors:  Toni I Gossmann; Anna W Santure; Ben C Sheldon; Jon Slate; Kai Zeng
Journal:  Genome Biol Evol       Date:  2014-08       Impact factor: 3.416

7.  Kr/Kc but not dN/dS correlates positively with body mass in birds, raising implications for inferring lineage-specific selection.

Authors:  Claudia C Weber; Benoit Nabholz; Jonathan Romiguier; Hans Ellegren
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

8.  Genomic identification and characterization of the pseudoautosomal region in highly differentiated avian sex chromosomes.

Authors:  Linnéa Smeds; Takeshi Kawakami; Reto Burri; Paulina Bolivar; Arild Husby; Anna Qvarnström; Severin Uebbing; Hans Ellegren
Journal:  Nat Commun       Date:  2014-11-07       Impact factor: 14.919

9.  Strong purifying selection at synonymous sites in D. melanogaster.

Authors:  David S Lawrie; Philipp W Messer; Ruth Hershberg; Dmitri A Petrov
Journal:  PLoS Genet       Date:  2013-05-30       Impact factor: 5.917

10.  Evidence for GC-biased gene conversion as a driver of between-lineage differences in avian base composition.

Authors:  Claudia C Weber; Bastien Boussau; Jonathan Romiguier; Erich D Jarvis; Hans Ellegren
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.