Abstract
A major focus of modern population genetics involves using polymorphism data to identify regions impacted by recent positive selection (so-called genomic scans). Recently, methodology has been proposed not to identify individual loci, but rather to quantify genomic recurrent hitchhiking (RHH) parameters using this same type of polymorphism data. I here examine to what extent genomic scans for adaptively important loci may be informed by recently estimated RHH parameters (and vice versa). I find that published results are largely incompatible with one another, with approximately an order of magnitude more sweeps being empirically identified than would be predicted under RHH estimates. Results demonstrate that making this connection between SHH and RHH models is crucial for a more complete and accurate characterization of adaptive evolution.
Keywords: genetic hitchhiking; genomic scans; recurrent selection; selective sweeps
Year: 2009 PMID: 20333201 PMCID: PMC2817426 DOI: 10.1093/gbe/evp031
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Details of Recurrent Sweeps under Four Selection Coefficients and Four Levels of Reduction for a 1-Mb Region
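| s | 2Ns | Size of sweep (bp) | Transit time (4N generations) | 5% reduction: time between sweeps (4N generations) | 5% reduction: no. of sweeps | 5% reduction: fraction of markers swept | 20% reduction: time between sweeps (4N generations) | 20% reduction: no. of sweeps | 20% reduction: fraction of markers swept | 60% reduction: time between sweeps (4N generations) | 60% reduction: no. of sweeps | 60% reduction: fraction of markers swept | 90% reduction: time between sweeps (4N generations) | 90% reduction: no. of sweeps | 90% reduction: fraction of markers swept |
| 1 × 10^−1 | 2 × 10^5 | 20,000 | 3.6 × 10^−5 | 1.1 | ∼0 | ∼0 | 2.7 × 10^−1 | ∼0 | ∼0 | 1.0 × 10^−1 | ∼1 | 0.02 | 8.3 × 10^−3 | ∼12 | 0.24 |
| 1 × 10^−2 | 2 × 10^4 | 2,000 | 3.6 × 10^−4 | 1.1 × 10^−1 | ∼1 | 0.002 | 2.7 × 10^−2 | ∼4 | 0.007 | 1.0 × 10^−2 | ∼10 | 0.02 | 8.3 × 10^−4 | ∼120 | 0.24 |
| 1 × 10^−3 | 2 × 10^3 | 200 | 3.6 × 10^−3 | 1.1 × 10^−2 | ∼8 | 0.002 | 2.7 × 10^−3 | ∼36 | 0.007 | 1.0 × 10^−3 | ∼100 | 0.02 | 8.3 × 10^−5 | ∼1,200 | 0.24 |
| 1 × 10^−4 | 2 × 10^2 | 20 | 3.6 × 10^−2 | 1.1 × 10^−3 | ∼80 | 0.002 | 2.7 × 10^−4 | ∼360 | 0.007 | 1.0 × 10^−4 | ∼1,000 | 0.02 | 8.3 × 10^−6 | ∼12,000 | 0.24 |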
The size of the region impacted by a given sweep, calculated as 0.01s/r base pairs (Kaplan et al. 1989), with r = 5 × 10^−8 per site per generation (Charlesworth 1996; Andolfatto and Przeworski 2001).
The expected transit time of a beneficial mutation, calculated as −log(ξ)/(2γ), in units of 4N generations, where ξ = 1/(2N), γ = 2Ns, and N = 10^6.
The expected time between beneficial fixations occurring within the region, calculated as 1/(MΛ), in units of 4N generations, where Λ is the expected number of sweeps per recombination unit in the last 4N generations and M is the size of the region (1 Mb).
The expected number of sweeps within the 1-Mb region that are recent enough to be detectable using polymorphism-based statistics, calculated as the average number of sweeps occurring within the last 0.1 × 4N generations (Przeworski 2002); importantly, only a fraction of this number may be identifiable, as power has been shown to rarely exceed 50% for commonly used summary statistics (Przeworski 2002; Jensen, Thornton, and Aquadro 2008).
The fraction of randomly placed markers across the 1-Mb region under consideration that would fall within swept regions, calculated as the proportion of the total region impacted by a recent sweep (e.g., if eight sweeps, each affecting ∼200 bp, are expected across the 1-Mb region, then the probability that an individual marker falls in a swept region is 1,600/1,000,000).
Estimated values for a 5% total reduction in variation due to RHH, calculated using the expression of Wiehe and Stephan (1993), where θ is the scaled population mutation rate (=0.01), r is the unscaled recombination rate in Morgans per base pair per generation (=5 × 10^−8), κ is a constant (=0.075), γ = 2Ns (where s is the selection coefficient), N is the effective population size (=10^6), and λ is the rate of adaptive substitutions per site per generation. For this level of reduction, sλ = 9.0 × 10^−15.
Estimated values for a 20% total reduction in variation (sλ = 4.1 × 10^−14).
Estimated values for a 60% total reduction in variation (sλ = 2.5 × 10^−13).
Estimated values for a 90% total reduction in variation (sλ = 3.0 × 10^−12).
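The per-sweep quantities defined in the footnotes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "log" is taken to be the natural logarithm (which reproduces the tabulated transit times), and the time between sweeps is read from the table rather than derived from the Wiehe and Stephan (1993) expression.

```python
import math

R = 5e-8        # recombination rate per site per generation (from the footnotes)
N = 1e6         # effective population size (from the footnotes)
M = 1_000_000   # region size in bp (1 Mb)
WINDOW = 0.1    # detectability window, in units of 4N generations


def sweep_size_bp(s, r=R):
    """Region affected by a single sweep: 0.01 * s / r base pairs."""
    return 0.01 * s / r


def transit_time(s, n=N):
    """Expected transit time, -log(xi) / (2 * gamma), in units of 4N generations.

    Assumption: natural logarithm, with xi = 1/(2N) and gamma = 2Ns.
    """
    xi = 1.0 / (2.0 * n)
    gamma = 2.0 * n * s
    return -math.log(xi) / (2.0 * gamma)


def recent_sweeps(time_between, window=WINDOW):
    """Expected number of sweeps in the window, given the time between
    fixations in the region (both in units of 4N generations)."""
    return window / time_between


def fraction_swept(n_sweeps, size_bp, region_bp=M):
    """Proportion of the region covered by recent sweeps."""
    return n_sweeps * size_bp / region_bp


for s in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"s={s:g}  size={sweep_size_bp(s):,.0f} bp  "
          f"transit={transit_time(s):.1e} (4N generations)")

# Worked example from the footnote: eight sweeps of ~200 bp in 1 Mb.
print(fraction_swept(8, 200))
```

The loop should reproduce the sweep-size column (20,000, 2,000, 200, and 20 bp), and the last line corresponds to the 1,600/1,000,000 example given in the footnote.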
Empirical Genomic Scan Results Compared with Expectations under Estimated RHH Models for Drosophila
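| Region size | No. of markers | Fraction swept (reported) | Expected fraction swept (∼20% reduction) | Expected fraction swept (∼50% reduction) | Expected no. of sweeps | Expected no. of sweeps | Expected no. of sweeps |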
| 256 kb | 26 | 0.12 | 0.007 | 0.017 | ∼1 | ∼68 | ∼900 |
| 850 kb | 28 | 0.07 | 0.007 | 0.017 | ∼3 | ∼225 | ∼3,060 |
| 17 Mb | 105 | 0.12 | 0.007 | 0.017 | ∼61 | ∼4,620 | ∼61,200 |
Total length of the region spanned by the scan.
The number of scanned markers used in the study.
The fraction of scanned markers proposed by the authors to be linked to selective sweeps.
The expected fraction of markers that would fall in swept regions, for a ∼20% estimated reduction in variability (Andolfatto 2007; Macpherson et al. 2007).
The expected fraction of markers that would fall in swept regions, for a ∼50% estimated reduction in variability (Jensen, Thornton, and Andolfatto 2008; Li and Stephan 2006).
The expected number of sweeps that would fall in the sequenced regions within the last 0.1 × 4N generations, for parameters estimated by Macpherson et al. (2007) for D. simulans.
Only a fraction of this expected number may be identifiable, owing to the imperfect power of existing test statistics; see figure 2 (Przeworski 2002; Jensen, Thornton, and Aquadro 2008).
The expected number of sweeps that would fall in the sequenced regions within the last 0.1 × 4N generations, for parameters estimated by Jensen, Thornton, and Andolfatto (2008) for D. melanogaster.
The expected number of sweeps that would fall in the sequenced regions within the last 0.1 × 4N generations, for parameters estimated by Li and Stephan (2006) for D. melanogaster.
The expected number of sweeps that would fall in the sequenced regions within the last 0.1 × 4N generations, for parameters estimated by Andolfatto (2007) for D. melanogaster.
From Bauer DuMont and Aquadro (2005) and Jensen et al. (2007), for an X-linked region of D. melanogaster.
From Harr et al. (2002) for an X-linked region of D. melanogaster.
From Glinka et al. (2003) for an X-linked region of D. melanogaster.
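The gap between the reported and expected fractions in the table above can be made concrete with a short calculation. The sketch below is illustrative only: the dictionary keys and the helper name `fold_excess` are hypothetical, and the inputs are simply the fraction columns as they appear in the table.

```python
# Fraction of markers reported as swept in each genomic scan (from the table).
reported = {"256 kb": 0.12, "850 kb": 0.07, "17 Mb": 0.12}

# Expected fraction of markers in swept regions under RHH estimates (from the table).
expected = {"~20% reduction": 0.007, "~50% reduction": 0.017}


def fold_excess(reported_fraction, expected_fraction):
    """How many times larger the reported fraction is than the RHH expectation."""
    return reported_fraction / expected_fraction


for region, rep in reported.items():
    for scenario, exp in expected.items():
        print(f"{region:>7}  {scenario}: {fold_excess(rep, exp):.1f}-fold")
```

Under these inputs the ratios range from roughly 4-fold to roughly 17-fold, which is in line with the abstract's statement that approximately an order of magnitude more sweeps are identified empirically than RHH estimates predict.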
A simulated comparison of the impact of demography on the identification of selected loci in genomic scans. The demographic model is the out-of-Africa bottleneck estimated for D. melanogaster (Thornton and Andolfatto 2006). For each point, 1,000 data sets of 100 unlinked loci (each locus 1 kb in size) were simulated, in which some fraction of the loci have experienced a recent selective sweep (value given on the x axis). For example, a value of 0.05 corresponds to a model in which 5 of the 100 loci in each simulated data set have experienced a recent selective fixation. The selection coefficient is fixed at s = 0.01, and the age of the sweep is drawn from a uniform distribution on (0, 0.1), in units of 4N generations, for each selected locus. The statistic used is the composite likelihood ratio test of Kim and Stephan (2002). The dotted line indicates the scenario in which selected loci are perfectly identifiable. The gray line gives the performance of the statistic under common usage, in which the null model is equilibrium neutrality. As shown, this implementation of hitchhiking mapping carries a very high false-positive rate. The black line gives the performance when the null is the true underlying demographic model. Although this greatly reduces the false-positive rate, the imperfect power of the test statistic means that only roughly half of the selected loci are identified.
A comparison of RHH- and SHH-based results. As shown, RHH- and SHH-based analyses suggest dramatically different patterns, with the latter detecting a far greater number of swept loci than would be predicted under RHH estimation, thereby suggesting a greater reduction in genomic variation due to selection. The vertical dotted line indicates the point at which the common genomic scan assumptions would be met (i.e., the 5% tail of markers are swept). Assuming that recently selected loci will indeed enrich the tails of genomic distributions, this demonstrates that under RHH-based estimation the 5% tail would primarily contain false positives. Conversely, if SHH-based estimates are correct, the majority of positively selected loci would be missed using this cutoff. Points are taken from the four RHH- and three SHH-based studies presented in table 2.
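The 5% tail argument in this caption can be illustrated with simple bounds. The sketch below is not the paper's calculation: it assumes the best case in which every truly swept marker ranks inside the tail, and the function names are hypothetical. The input fractions are the values listed in table 2.

```python
TAIL = 0.05  # the conventional 5% tail used as a genomic scan cutoff


def max_true_share_of_tail(swept_fraction, tail=TAIL):
    """Upper bound on the share of tail markers that are truly swept,
    assuming every swept marker lands in the tail (best case)."""
    return min(swept_fraction, tail) / tail


def max_swept_captured(swept_fraction, tail=TAIL):
    """Upper bound on the share of truly swept markers the tail can contain."""
    return min(swept_fraction, tail) / swept_fraction


# RHH-based expected fractions: most of the tail must be false positives.
for f in (0.007, 0.017):
    print(f"swept={f}: at most {max_true_share_of_tail(f):.0%} of the tail is truly swept")

# SHH-based reported fractions: the tail is too small to hold all swept loci.
for f in (0.07, 0.12):
    print(f"swept={f}: at most {max_swept_captured(f):.0%} of swept loci fit in the tail")
```

Because these are best-case bounds, imperfect enrichment of swept loci in the tail would make both the false-positive share and the missed share larger than the printed values suggest.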