
Detecting past positive selection through ongoing negative selection.

Georgii A Bazykin, Alexey S Kondrashov.

Abstract

Detecting positive selection is a challenging task. We propose a method for detecting past positive selection through ongoing negative selection, based on comparison of the parameters of intraspecies polymorphism at functionally important and selectively neutral sites where a nucleotide substitution of the same kind occurred recently. Reduced occurrence of recently replaced ancestral alleles at functionally important sites indicates that negative selection currently acts against these alleles and, therefore, that their replacements were driven by positive selection. Application of this method to the Drosophila melanogaster lineage shows that the fraction of adaptive amino acid replacements remained approximately 0.5 for a long time. In the Homo sapiens lineage, however, this fraction drops from approximately 0.5 before the Ponginae-Homininae divergence to approximately 0 after it. The proposed method is based on essentially the same data as the McDonald-Kreitman test but is free from some of its limitations, which may open new opportunities, especially when many genotypes within a species are known.

Year: 2011 PMID: 21859804 PMCID: PMC3184776 DOI: 10.1093/gbe/evr086

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

After the Beneficial Allele Gets Fixed, Positive Selection Turns into Negative Selection

At any given moment of time, positive selection, which favors currently uncommon derived alleles, affects only a small fraction of sites in the genome and, thus, is much rarer than negative selection, which favors common ancestral alleles (Kimura 1983). A variety of methods are used to detect positive selection, both past (McDonald and Kreitman 1991; Yang and Bielawski 2000; Smith and Eyre-Walker 2002; Bazykin et al. 2004; Eyre-Walker 2006; Huelsenbeck et al. 2006) and ongoing (Nielsen et al. 2007; Novembre and Di Rienzo 2009; Grossman et al. 2010), but none of these methods is perfect. Here, we propose a method for detecting past positive selection through ongoing negative selection. After a positive selection-driven allele replacement is over, positive selection transforms into negative selection (fig. 1). Thus, at a site where an allele replacement occurred recently, ongoing negative selection against the ancestral allele (which is incessantly recreated by mutation) indicates, as long as the fitness landscape remains invariant, that this replacement has been driven by positive selection. Past allele replacements can be revealed by comparison of the species in which negative selection is studied to other species; here, we use maximum parsimony to infer them (fig. 1 and supplementary fig. S1, Supplementary Material online), but other methods, for example, maximum likelihood-based or Bayesian, can be used as well. In turn, ongoing negative selection at functionally important sites where allele replacements occurred during some time interval in the past can be detected using the polymorphism data. Specifically, we compare the prevalence of ancestral alleles at these sites and at supposedly selectively neutral sites where the allele replacements of the same type (e.g., for the case of single nucleotide substitutions, corresponding to the same pair of ancestral and derived nucleotides) also occurred during the same time interval. This last requirement is necessary to control for the difference in the mutation rates across the genome; in particular, sites that underwent recent allele replacements are likely to have a locally elevated mutation rate (Asthana et al. 2007; Bazykin et al. 2007; Hodgkinson et al. 2009).

Fig. 1. Test for positive selection based on polymorphism at sites of ancestral divergence. (a) Change in the mode of selection as a result of an allele replacement. A fitness landscape that initially causes positive selection in favor of a rare allele “A” (left) causes negative selection against a rare allele “a” after the “a”→“A” allele replacement is accomplished (right). Fitnesses are shown by vertical bars and allele frequencies are shown by pie charts. (b) Approach to measuring past positive selection. Extant species (dots) were used to infer allele replacements (“a”→“A”) that occurred at different segments (1–5; in the example shown, 3) of the ancestral lineage (see also supplementary fig. S1, Supplementary Material online). At sites of such replacements, the species for which polymorphism data are available (shown as a triangle) was used to assess the frequency of the ancestral variant (“a”). Such frequencies at nonsynonymous and synonymous sites were compared using the measure β. (c–e) Results of the test for positive selection for substitutions which occurred in the lineage of the Drosophila melanogaster nuclear genome (c), Homo sapiens nuclear genome (d), or H. sapiens mitochondrial genome (e). The considered phylogeny is shown together with the times of the beginning and end of each segment of the lineage, measured in units of Ds from present. The species for which polymorphism data have been analyzed is shown as a triangle, with the number of available haploid genotypes N shown next to the species name. For each of the five considered segments, presented are the values of β together with 95% confidence intervals as horizontal bars, and the proportion of bootstrap replicates with β > 0 as pie charts.

Using Data on Current Negative Selection to Reveal Past Positive Selection

Let us compare functionally important nonsynonymous sites of protein-coding genome regions with synonymous sites, which will be assumed to be selectively neutral. In the McDonald–Kreitman (MK) test, the proportion of positive selection-driven nonsynonymous replacements is estimated, under the assumption that a nonsynonymous mutation can be strongly advantageous, strongly deleterious, or neutral, as α = 1 − DsPn/(DnPs), where Dn and Ds are the numbers of nonsynonymous and synonymous substitutions and Pn and Ps are the numbers of polymorphic nonsynonymous and synonymous sites within the same sample (McDonald and Kreitman 1991; Smith and Eyre-Walker 2002; Eyre-Walker 2006). In the test proposed here, this proportion is estimated, assuming that a nonsynonymous replacement can be either strongly advantageous or neutral, as β = 1 − pn/ps, where pn (ps) is the proportion of sites, among those nonsynonymous (synonymous) sites at which a nucleotide replacement occurred in the past, which currently carry both the derived and the ancestral nucleotides (fig. 1). Note that while α is dependent on all polymorphic sites in the analyzed set of sites, β only takes into account polymorphism at sites at which a nucleotide replacement has previously occurred in a particular segment of the considered lineage. Estimating positive selection through α and through β is based on the same fact that a neutral nonsynonymous mutation contributes as much to polymorphism as a synonymous mutation and a strongly deleterious nonsynonymous mutation contributes nothing (McDonald and Kreitman 1991; Smith and Eyre-Walker 2002; Eyre-Walker 2006).
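The two estimators defined above can be written down directly. The following is a minimal sketch, not the authors' code: the function names and the example counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mk_alpha(dn, ds, pn, ps):
    """MK estimate: alpha = 1 - (Ds * Pn) / (Dn * Ps).

    dn, ds: numbers of nonsynonymous and synonymous substitutions.
    pn, ps: numbers of polymorphic nonsynonymous and synonymous sites.
    """
    if dn == 0 or ps == 0:
        raise ValueError("Dn and Ps must be nonzero")
    return 1.0 - (ds * pn) / (dn * ps)


def beta(nonsyn_sites, nonsyn_with_ancestral, syn_sites, syn_with_ancestral):
    """Proposed estimate: beta = 1 - pn / ps.

    Only sites of past replacements are counted. pn (ps) is the fraction of
    such nonsynonymous (synonymous) sites that still carry the ancestral
    nucleotide alongside the derived one.
    """
    pn = nonsyn_with_ancestral / nonsyn_sites
    ps = syn_with_ancestral / syn_sites
    if ps == 0:
        raise ValueError("ps must be nonzero")
    return 1.0 - pn / ps


# Hypothetical counts, for illustration only.
print(mk_alpha(dn=200, ds=400, pn=50, ps=200))
print(beta(nonsyn_sites=1000, nonsyn_with_ancestral=20,
           syn_sites=2000, syn_with_ancestral=80))
```

With these made-up counts, both estimators equal 0.5: half as much nonsynonymous polymorphism is observed as would be expected if all nonsynonymous replacements were neutral.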

Numerical Simulations

In order to investigate the proposed test and to compare it with the MK test, we analyzed the results of simulated molecular evolution. The data on divergence and polymorphism were generated by evolving a Wright–Fisher population along the phylogenetic tree corresponding to the actual phylogenetic tree of a clade within genus Drosophila (for details, see Materials and Methods). The genome was assumed to consist of many unlinked diallelic synonymous and nonsynonymous sites. All synonymous sites were assumed to be neutral. A nonsynonymous site evolved under one of the three modes of selection: neutrality, constant selection always favoring one of the alleles, or switching selection. In the latter case, the absolute value of the selection coefficient remained constant, but its sign switched at random moments of time, which led to episodes of positive selection favoring a previously inferior low-frequency allele. We combined nonsynonymous sites under these three selection modes in different proportions and studied the behavior of α and β. Both tests performed well in detecting the fraction of positively selected substitutions when switching selection sites were combined with neutral sites (fig. 2), with sites of very weak constant selection (fig. 2), or with sites of strong constant selection (fig. 2). α is sensitive to slightly deleterious alleles segregating within a population, so that admixture of constant selection sites with small selection coefficients leads to negative values of α; this can be remedied by excluding low-frequency polymorphisms (fig. 2). When positive selection is present, the same effect leads to underestimation of the fraction of positively selected substitutions by α (fig. 2). Both β and α with excluded low-frequency polymorphism give a reasonably good approximation of the fraction of the positively selected substitutions under all these scenarios (fig. 2).
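The effect of excluding low-frequency polymorphisms on α can be illustrated with a small sketch. It assumes that each polymorphic site is summarized by its minor allele count in the sample; the function name and the counts below are hypothetical. The default threshold of 6 follows the cutoff described in Materials and Methods (minor allele in fewer than 6 of 37 genotypes).

```python
def alpha_with_cutoff(dn, ds, nonsyn_minor_counts, syn_minor_counts,
                      min_minor_count=6):
    """MK alpha computed after dropping low-frequency polymorphisms.

    nonsyn_minor_counts / syn_minor_counts: minor allele counts, one per
    polymorphic site. Sites with a minor allele count below min_minor_count
    are excluded before Pn and Ps are counted.
    """
    pn = sum(1 for c in nonsyn_minor_counts if c >= min_minor_count)
    ps = sum(1 for c in syn_minor_counts if c >= min_minor_count)
    if dn == 0 or ps == 0:
        raise ValueError("Dn and Ps must be nonzero")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical sample: many nonsynonymous singletons (slightly deleterious
# alleles) inflate Pn when no cutoff is applied.
nonsyn = [1] * 30 + [10] * 20
syn = [1] * 20 + [10] * 80

print(alpha_with_cutoff(100, 200, nonsyn, syn, min_minor_count=1))
print(alpha_with_cutoff(100, 200, nonsyn, syn, min_minor_count=6))
```

In this made-up example the estimate rises from 0 without a cutoff to 0.5 with it, which is the direction of the bias described above.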

Fig. 2. Performance of the MK test and the proposed test on mixtures of sites with different modes of selection. Orange solid line, the actual fraction of substitutions coming from the sites under switching selection; green triangles, α; yellow triangles, α with low-frequency (<15%) polymorphisms excluded; blue squares, β. (a–c) Horizontal axis: the fraction of switching sites (s = ±10⁻³); the remaining sites are neutral (a) or are under very weak (b, s = 10⁻⁵) or strong (c, s = 10⁻³) constant selection. (d) Horizontal axis: the fraction of sites under weak constant selection (s = 10⁻⁴); the remaining sites are neutral. (e–g) The fraction of switching sites (s = ±10⁻³) is 10% (e), 5% (f), or 3% (g); horizontal axis: the fraction of sites under weak constant selection (s = 10⁻⁴); the remaining sites are neutral.

Positive Selection in Fruit Fly and Human Revealed by the Proposed Test

Supplementary table S1 (Supplementary Material online) presents data on nonsynonymous and synonymous allele replacements in the lineages of Drosophila melanogaster and Homo sapiens, each split into five segments, and on nonsynonymous and synonymous polymorphisms due to the presence of ancestral alleles at sites where these replacements occurred. So far, these are the only two species with extensive genome-level data on intraspecies variation available. We can see that our test implies that in the D. melanogaster lineage, the fraction of nonsynonymous replacements that were driven by positive selection remained steady at approximately 0.5, in agreement with the estimates obtained by the MK test (Eyre-Walker 2006) (fig. 1). Inferring substitutions by comparing with a single reference sequence may be erroneous when the reference sequence carries a low-frequency allele. In our approach, such errors are unlikely, because in each analysis, both the ancestral and the derived allele are observed in multiple sequences (supplementary fig. S1, Supplementary Material online). In rare instances, however, mistaking a polymorphic variant for a fixed one can lead to misidentification of the exact segment where the substitution has occurred. To assess this effect, we repeated the analysis for segment 5 of the Drosophila phylogeny using only the sites at which the nucleotide of D. sechellia coincided with the nucleotide of a sister species D. simulans. The results were similar (supplementary table S1, Supplementary Material online). In contrast to Drosophila, in the human lineage, the fraction of positively selected substitutions declined, after Ponginae–Homininae divergence around 20 Ma (Ruff 2003), from approximately 0.5 to 0 (fig. 1). This decline was probably caused by the increased body size (Ruff 2003) and the associated decrease of the effective population size (Popadin et al. 2007) and the efficiency of positive selection in the course of hominid evolution. 
Also, a decline of Ne may bias β downward, due to fixations of slightly deleterious alleles that were previously kept rare by more efficient selection, because after such a fixation, slightly beneficial ancestral alleles will segregate at elevated frequencies, compared with that after a selectively neutral fixation (Charlesworth and Eyre-Walker 2007). Thus, β = 0 in the hominid segments of the human lineage does not mean that there were no adaptive nonsynonymous replacements at that time. The MK test also does not reveal a large role of positive selection in the evolution of proteins in hominids (Eyre-Walker 2006). Finally, our analysis does not reveal any positive selection within human mitochondrially encoded proteins, except early in the evolution of Hominidae (fig. 1). Partially, this may be due to a small number of sites where allele replacements occurred recently. Also, because of the large number of known human mitochondrial genotypes, characterizing their variation by the fraction of polymorphic sites leads to a loss of information. However, distributions of ancestral allele frequencies at polymorphic sites of recent nonsynonymous and synonymous allele replacements are also very similar (fig. 3), arguing against a major role of positive selection in the evolution of mitochondrially encoded proteins in the human lineage during the last approximately 20 Myr.

Fig. 3. Distributions of the number of genotypes carrying the ancestral alleles at polymorphic sites. Only sites carrying the derived allele in >50% of the genotypes, and the ancestral allele in some of the genotypes, were taken into account. (a) Data for 6,726 human mitochondrial genotypes, for replacements that occurred in any of the five segments (Mann–Whitney U test for frequency of ancestral nonsynonymous vs. synonymous allele, n = 159, P = 0.053). (b) and (c) Data for 19 human nuclear genotypes, for replacements that occurred in segments 1–3 (b; n = 1184, P = 0.355) and 5 (c; n = 1870, P = 0.108); the excess of high-frequency polymorphism in this case is due to unfinished replacements. (d) Data for 37 Drosophila melanogaster genotypes, for replacements that occurred in segments 1–4 (n = 1892, P = 1.03 × 10⁻⁸). Nonsynonymous sites, purple; synonymous sites, blue. For the complete data set, see supplementary figures S2–S4 (Supplementary Material online).

Distributions of frequencies of ancestral alleles are also similar at polymorphic sites of recent nonsynonymous and synonymous allele replacements in human nuclear genes (fig. 3), which is consistent with the assumption that nonsynonymous replacements were either strongly advantageous or neutral. In contrast, in D. melanogaster there is a substantial excess of singletons at polymorphic nonsynonymous sites (fig. 3). This suggests that some of the allele replacements were driven by positive selection so weak (Watterson 1975) that the ancestral allele is present in the sample of 37 genotypes, albeit at a low frequency. Allele frequencies are not taken into account by β, which thus underestimates the role of weak positive selection. More detailed analysis which takes into account the distribution of allele frequencies, similar to that recently proposed for the MK test (Charlesworth and Eyre-Walker 2008; Eyre-Walker and Keightley 2009), is needed to address this problem.
Our test can also underestimate the role of positive selection in nucleotide substitutions confined to the terminal segment of the lineage (segment 5 in fig. 1) because a substantial fraction of them have not yet been accomplished, leading to an excess of ancestral polymorphisms (compare fig. 3).

Comparison of the Proposed Test to the MK Test

The MK test and the test proposed here are both based on data on interspecies divergence and intraspecies polymorphism and apparently produce consistent estimates. Still, there are a number of substantial differences between the two tests. First, the proposed test does not compare rates of evolution at nonsynonymous versus synonymous sites; instead, it uses past synonymous replacements only to avoid a bias due to cryptic variation in mutation rates (Asthana et al. 2007; Bazykin et al. 2007; Hodgkinson et al. 2009). Second, the proposed test only deals with nonsynonymous sites that underwent a recent allele replacement, thus avoiding altogether possible complications due to mildly deleterious alleles that segregate at sites that remained under constant negative selection for a long time and, thus, have not undergone allele replacements. In contrast, in the MK test, such alleles may lead either to underestimation (Charlesworth and Eyre-Walker 2008) or overestimation (Eyre-Walker 2002) of the role of positive selection, depending on the peculiarities of population demography. Third, the proposed approach can estimate not only the fraction of adaptive allele replacements, but also the strength of past positive selection which drove them, through the strength of ongoing negative selection (Keightley and Eyre-Walker 2007). This possibility will become especially important with the increase in the amount of genome-level data on intraspecies genetic variation, which will greatly increase the power of quantifying negative selection. More generally, all existing tests for positive selection have serious problems. Positive selection at a site does not act constantly; instead, under realistic assumptions, it probably changes into negative selection immediately after an allele replacement is accomplished and stays so for a while (e.g., Kryazhimskiy and Plotkin 2008; Mustonen and Lässig 2009). 
The implications of this for the Dn/Ds test have been discussed (Kryazhimskiy et al. 2008). The MK test makes the same assumption implicitly. Indeed, it assumes that the fraction of sites under negative selection is independent of the action of positive selection (Smith and Eyre-Walker 2002); in reality, polymorphism is probably reduced at sites of past positive selection. The test we propose uses this exact feature of past positive selection to detect it.

Materials and Methods

Numerical Simulation

A Wright–Fisher population of constant size N = Ne = 10⁵ was evolved under a free recombination model, with two possible alleles at each locus and a mutation rate of μ = 10⁻⁷ between them. In each generation, each locus was characterized by the selection coefficient s and (for the switching selection mode) the waiting time for this coefficient to switch sign. At each generation, a switch occurred with the same probability; we used a waiting time of 2 × 10⁸ generations, so that switches were rare. After an initial burn-in of 10⁷ generations, evolution proceeded for another 7 × 10⁷ generations, with cladogenesis events occurring according to the prescribed phylogeny shown in figure 1. After each cladogenesis event, both derived species inherited the ancestral allele frequency and selection coefficient. A total of 1.4 × 10⁶ loci were simulated: 10⁵ for each of the four constant selection coefficients used and 10⁶ for switching selection. The results in figure 2 were obtained using the simulated equivalent of segment 5 of the D. melanogaster lineage in figure 1. A site was categorized as “switched” if it had undergone a switch of the selection coefficient at this segment. Polymorphism was assessed in the terminal species (corresponding to D. melanogaster) by drawing 37 random alleles from the population. For the MK test with the low-frequency cutoff, all polymorphisms with the minor allele occurring in fewer than 6 of these 37 individuals were excluded. To make α and β more comparable, we used polarized MK, so that only substitutions in the simulated equivalent of the D. melanogaster lineage since its divergence from D. simulans were considered; D. erecta was used to infer the ancestral state. Similarly, β was calculated at sites of substitutions in the D. melanogaster lineage since its divergence from D. simulans, which were inferred using D. erecta as an outgroup.
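The per-locus dynamics described here can be sketched for a single diallelic locus. This is a simplified illustration under stated assumptions, not the simulation used in the paper: it models one haploid population with no phylogeny, assumes relative fitnesses 1 + s and 1 for the two alleles, applies selection before mutation, and is meant to be run with parameters scaled far below the values given above. The function name and arguments are hypothetical.

```python
import random


def wright_fisher_locus(n, mu, s, switch_prob, generations, seed=0):
    """Simulate one diallelic locus in a haploid Wright-Fisher population.

    n: population size; mu: symmetric per-generation mutation rate;
    s: selection coefficient of allele "A" (relative fitness 1 + s vs 1);
    switch_prob: per-generation probability that s changes sign, i.e. the
    reciprocal of the expected waiting time between switches.
    Returns the trajectory of the frequency of "A" and the switch count.
    """
    rng = random.Random(seed)
    count = 0  # copies of allele "A"; the population starts fixed for "a"
    trajectory = []
    switches = 0
    for _ in range(generations):
        if rng.random() < switch_prob:
            s = -s
            switches += 1
        x = count / n
        # Selection changes the expected frequency of "A".
        x_sel = x * (1 + s) / (x * (1 + s) + (1 - x))
        # Symmetric mutation between the two alleles.
        x_mut = x_sel * (1 - mu) + (1 - x_sel) * mu
        # Binomial sampling of the next generation (genetic drift).
        count = sum(1 for _ in range(n) if rng.random() < x_mut)
        trajectory.append(count / n)
    return trajectory, switches


# Scaled-down parameters for a quick run (assumed values, not the paper's).
freqs, n_switches = wright_fisher_locus(
    n=200, mu=1e-3, s=0.05, switch_prob=1e-3, generations=2000, seed=1)
print(freqs[-1], n_switches)
```

Polymorphism in a sample could then be assessed by drawing a fixed number of alleles at the final frequency, as the text describes for the 37 sampled alleles.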

Data

Multiple alignments of genome assemblies of six insect species to D. melanogaster (dm3) were obtained from the UCSC Genome Bioinformatics Site (Kuhn et al. 2009) (http://genome.ucsc.edu). Complete genotypes of 37 inbred strains of D. melanogaster (Jordan et al. 2007) were obtained from the Drosophila Population Genomics Project website (http://www.dpgp.org/). The set of FlyBase (Tweedie et al. 2009) canonical splicing variants (BDGP release 5) was used to map D. melanogaster protein-coding genes onto the alignment. Multiple alignment of each coding region was then obtained by joining the aligned segments corresponding to exons of FlyBase canonical genes in D. melanogaster. Codons masked by RepeatMasker, not aligned, or containing gaps or non-ACGT characters, as well as codons within six nucleotides of any of such codons, were excluded from the analysis. Multiple alignments of genome assemblies of six vertebrate species to H. sapiens (hg18) were obtained from the UCSC Genome Bioinformatics Site (Kuhn et al. 2009) (http://genome.ucsc.edu). Data on variation of human nuclear genotypes were obtained from nine diploid human genotypes downloaded from the Galaxy bioinformatics platform (Taylor et al. 2007; Schuster et al. 2010) (http://usegalaxy.org) and the reference human genome, resulting in 19 haploid genotypes. The following individual diploid genotypes were used: KB1 (454 method) (Schuster et al. 2010), ABT (SOLiD method) (Schuster et al. 2010), NA18507 (Bentley et al. 2008), NA19240 (Drmanac et al. 2010), Craig Venter (Levy et al. 2007), NA12891, NA12892, Chinese individual (Wang et al. 2008), and Korean individual (Ahn et al. 2009). The canonical splicing variants of UCSC hg18 Known Genes (Hsu et al. 2006) were used to map H. sapiens protein-coding genes onto the alignment. Multiple alignment of each coding region was then obtained by joining the aligned segments corresponding to exons of knownGene canonical genes in H. sapiens. 
The sequences of 6,726 complete human mitochondrial genotypes were obtained from GenBank (Benson et al. 2009) by using “Homo sapiens [orgn] and complete genome” as a query with the Limits option set to mitochondrial DNA in the Entrez retrieval system (Baxevanis 2008). The sequences of six non-H. sapiens primate species were obtained from GenBank (Benson et al. 2009). All sequences were aligned to the revised Cambridge sequence (Andrews et al. 1999) using ClustalW (Thompson et al. 1994), and coding sequences for 12 protein-coding genes (excluding ND6, which is encoded on a different strand than the rest of the genes and has biased nucleotide composition; Yang et al. 1998) were extracted from the alignments. Lengths of internal segments of insect and primate mitochondrial phylogenetic trees were taken from references Heger and Ponting (2007) and Krause et al. (2010). Lengths of internal segments of vertebrate phylogenetic tree were taken from UCSC Genome Bioinformatics Site (Kuhn et al. 2009). All lengths are in the units of the inferred per site number of synonymous substitutions (Ds).

Analysis

Codon sites with gaps or missing data in any of the six non-D. melanogaster (or non-H. sapiens) species, in any of the 37 D. melanogaster genotypes, in any of the 19 H. sapiens nuclear genotypes, or in more than 1,726 of 6,726 H. sapiens mitochondrial genotypes, were excluded from analysis. At each codon site, nucleotide sites were classified as “nonsynonymous” (“synonymous”) only when each of the four nucleotides at this site corresponded to a different (the same) amino acid in all the seven species (supplementary fig. S1, Supplementary Material online). An allele replacement was assigned to 1 of the 5 segments of the ancestral lineage of a polymorphic species using maximum parsimony (fig. 1 and supplementary fig. S1, Supplementary Material online). Nucleotide sites at which parsimony implied more than one allele replacement, or at which multiple timings of the allele replacement were equally parsimonious, were excluded from further analysis. Polymorphism was measured at sites of past allele replacement. We included in analysis only those sites where the frequency of the derived allele (“A” in fig. 1 and supplementary fig. S1, Supplementary Material online) was above 50%; such sites constituted the vast majority of all sites for allele replacements in segments 1–4. By contrast, when only the species whose variation is studied carries the derived allele, the distribution of allele frequencies contains a much higher fraction of intermediate values (segment 5 vs. segments 1–4 in fig. 1). Among sites with allele replacements that occurred within a particular segment, each of the 12 possible pairs of ancestral and derived nucleotides was considered separately. This was done to control for different rates of different mutations. For each pair of nucleotides involved in a past nonsynonymous (synonymous) allele replacement, we estimated, in the species used to study variation, the fraction of nonsynonymous (synonymous) sites with the ancestral variant present. 
These fractions were then averaged across the 12 possible nucleotide pairs, separately for nonsynonymous and synonymous sites, to produce the values in supplementary table S1 (Supplementary Material online) and in figure 1 and to calculate β. Among polymorphic sites with the ancestral variant present in the species used to study variation, the distributions of frequencies of the ancestral nonsynonymous versus synonymous variants were compared using the Mann–Whitney U test. The confidence intervals for β were obtained by bootstrapping the data. In each of the 1000 bootstrap replicates, we randomly selected with replacement the same number of sites from the observed set of sites with an allele replacement and calculated β for this resampled distribution.
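The averaging across nucleotide pairs and the bootstrap can be sketched as follows. This is a minimal illustration rather than the authors' pipeline. It assumes that each site of a past replacement is recorded as a tuple (kind, pair, ancestral_present), that nucleotide pairs with no sites are simply skipped when averaging, and that the confidence interval is a percentile interval; none of these details is specified in the text, and all names are hypothetical.

```python
import math
import random
from collections import defaultdict


def beta_from_sites(sites):
    """beta = 1 - pn/ps with fractions averaged across nucleotide pairs.

    sites: iterable of (kind, pair, ancestral_present), where kind is "N"
    (nonsynonymous) or "S" (synonymous), pair is (ancestral, derived), and
    ancestral_present says whether the ancestral variant still segregates.
    Returns NaN when beta cannot be computed for this set of sites.
    """
    counts = {"N": defaultdict(lambda: [0, 0]), "S": defaultdict(lambda: [0, 0])}
    for kind, pair, ancestral_present in sites:
        entry = counts[kind][pair]
        entry[0] += 1                       # sites with this replacement
        entry[1] += int(ancestral_present)  # of these, still polymorphic

    def mean_fraction(kind):
        fractions = [k / n for n, k in counts[kind].values()]
        return sum(fractions) / len(fractions) if fractions else math.nan

    pn, ps = mean_fraction("N"), mean_fraction("S")
    if math.isnan(pn) or math.isnan(ps) or ps == 0:
        return math.nan
    return 1.0 - pn / ps


def bootstrap_beta(sites, replicates=1000, seed=0):
    """Percentile 95% interval for beta and the share of replicates with beta > 0."""
    rng = random.Random(seed)
    sites = list(sites)
    values = []
    for _ in range(replicates):
        # Resample the same number of sites with replacement.
        resampled = [rng.choice(sites) for _ in sites]
        b = beta_from_sites(resampled)
        if not math.isnan(b):
            values.append(b)
    if not values:
        raise ValueError("beta undefined in every replicate")
    values.sort()
    lower = values[int(0.025 * len(values))]
    upper = values[min(len(values) - 1, int(0.975 * len(values)))]
    positive = sum(v > 0 for v in values) / len(values)
    return lower, upper, positive


# Hypothetical data: two nucleotide pairs, nonsynonymous sites half as
# polymorphic as synonymous ones.
demo = ([("N", ("A", "G"), i < 2) for i in range(100)]
        + [("S", ("A", "G"), i < 4) for i in range(100)]
        + [("N", ("C", "T"), i < 1) for i in range(50)]
        + [("S", ("C", "T"), i < 2) for i in range(50)])
print(beta_from_sites(demo))
print(bootstrap_beta(demo, replicates=200, seed=1))
```

Treating each nucleotide pair separately before averaging is what keeps differences in mutation rates between pairs from leaking into the comparison of nonsynonymous and synonymous sites.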

Supplementary Material

Supplementary figures S1–S4 and table S1 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

41 in total

1. On the number of segregating sites in genetical models without recombination.

Authors: G A Watterson
Journal: Theor Popul Biol Date: 1975-04 Impact factor: 1.570

2. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences.

Authors: John P Huelsenbeck; Sonia Jain; Simon W D Frost; Sergei L Kosakovsky Pond
Journal: Proc Natl Acad Sci U S A Date: 2006-04-10 Impact factor: 11.205

3. Accumulation of slightly deleterious mutations in mitochondrial protein-coding genes of large versus small mammals.

Authors: Konstantin Popadin; Leonard V Polishchuk; Leila Mamirova; Dmitry Knorre; Konstantin Gunbin
Journal: Proc Natl Acad Sci U S A Date: 2007-08-06 Impact factor: 11.205

4. Adaptive protein evolution in Drosophila.

Authors: Nick G C Smith; Adam Eyre-Walker
Journal: Nature Date: 2002-02-28 Impact factor: 49.962

5. Adaptive protein evolution at the Adh locus in Drosophila.

Authors: J H McDonald; M Kreitman
Journal: Nature Date: 1991-06-20 Impact factor: 49.962

6. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies.

Authors: Peter D Keightley; Adam Eyre-Walker
Journal: Genetics Date: 2007-12 Impact factor: 4.562

7. The genomic rate of adaptive evolution. (Review)

Authors: Adam Eyre-Walker
Journal: Trends Ecol Evol Date: 2006-07-03 Impact factor: 17.712

8. The diploid genome sequence of an individual human.

Authors: Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029

9. FlyBase: enhancing Drosophila Gene Ontology annotations.

Authors: Susan Tweedie; Michael Ashburner; Kathleen Falls; Paul Leyland; Peter McQuilton; Steven Marygold; Gillian Millburn; David Osumi-Sutherland; Andrew Schroeder; Ruth Seal; Haiyan Zhang
Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971