Jason Bertram1,2. 1. Environmental Resilience Institute, Indiana University, Bloomington, Indiana, United States of America. 2. Department of Biology, Indiana University, Bloomington, Indiana, United States of America.
Abstract
Resolving the role of natural selection is a basic objective of evolutionary biology. It is generally difficult to detect the influence of selection because ubiquitous non-selective stochastic change in allele frequencies (genetic drift) degrades evidence of selection. As a result, selection scans typically only identify genomic regions that have undergone episodes of intense selection. Yet it seems likely such episodes are the exception; the norm is more likely to involve subtle, concurrent selective changes at a large number of loci. We develop a new theoretical approach that uncovers a previously undocumented genome-wide signature of selection in the collective divergence of allele frequencies over time. Applying our approach to temporally resolved allele frequency measurements from laboratory and wild Drosophila populations, we quantify the selective contribution to allele frequency divergence and find that selection has substantial effects on much of the genome. We further quantify the magnitude of the total selection coefficient (a measure of the combined effects of direct and linked selection) at a typical polymorphic locus, and find this to be large (of order 1%) even though most mutations are not directly under selection. We find that selective allele frequency divergence is substantially elevated at intermediate allele frequencies, which we argue is most parsimoniously explained by positive, not negative, selection. Thus, in these populations most mutations are far from evolving neutrally in the short term (tens of generations), including mutations with neutral fitness effects, and the result cannot be explained simply as an ongoing purging of deleterious mutations.
One of the central problems of evolutionary biology is to delineate the role of natural selection in shaping genetic variation. Most genetic variation consists of neutral mutations which, though having no appreciable effects on fitness, are not free from the influence of selection. When selection acts on non-neutral mutations, neutral mutations that share similar genetic backgrounds can be dragged along for the ride, a process called linked selection [1]. The extent to which linked selection influences neutral variation is a major point of contention [2, 3], one with practical implications because putatively neutral mutations are widely used to infer population demographic history [4] and as a baseline for detecting selection [2, 5]. There is also ongoing debate about the particular modes of selection responsible for shaping genetic variation. Negative selection purging the influx of deleterious mutations is probably prevalent [6, 7], but positive selection on rarer advantageous mutations is crucial for adaptive evolution and likely also has a hand in shaping neutral variation [8].
Until recently, the bulk of the evidence entering the above debates rested on patterns of genetic variation measured at single snapshots in time. The interpretation of such evidence is complicated because the prospective signatures of selection are accumulated over an uncertain history during which other confounding processes (e.g. population demography) also shape genetic diversity [5, 9, 10]. Crucially, single-snapshot data cannot reveal what the process of selection is doing at any point in time, i.e. selectively changing allele frequencies.
A more direct approach is to analyze allele frequency data gathered from the same population at multiple points in time [10]. Evolve and resequence (E&R) experiments [11-14] and studies on wild populations [15, 16] have identified allele frequency changes associated with rapid phenotypic adaptation. 
However, determining the full nature and extent of selective allele frequency change has been difficult. Numerous methods exist for inferring selection coefficients from allele frequency time series [10, 17–24], but these are only reliable for selection that is strong relative to the intensity of random, non-selective allele frequency change (random genetic drift) and allele frequency measurement error (e.g. due to population sampling or limited sequencing read depth).
This is a major limitation that likely precludes detection of most of the influence of selection. Fitness-relevant traits are often complex (influenced by a large number of genes) and harbor ample genetic variation. Selection on such traits will thus often cause modest allele frequency shifts distributed across many loci rather than be concentrated at a small number of strongly selected loci [25-27]. Moreover, even if some genomic regions harbor strongly selected alleles, much of the associated linked selection could be undetectably weak. Thus, resolving the short-term (∼ tens of generations) influence of selection across the genome remains an important challenge [28].
Here we present a new approach to analyze the genome-wide influence of selection using time-resolved allele frequency data. Our approach capitalizes on a distinctive pattern of among-locus temporal allele frequency divergence that to our knowledge has not previously been described. In contrast with single-locus approaches, this allele frequency divergence is a collective pattern incorporating alleles across the genome. We therefore lose the ability to identify particular loci under selection; in return we are able to detect polygenic selective processes that are not detectable with single-locus approaches.
Traditionally the allele frequency variance in a cohort of neutral alleles with initial frequency p is assumed to have the binomial form
$$\mathrm{Var}(\Delta p \mid p) = C\,p(1-p), \tag{1}$$
where Δp denotes the change in allele frequency after t generations, and the variance coefficient C is frequency independent [29, Chap. 3]. The allele frequency divergence in Eq (1) is largely a consequence of random genetic drift. However, selection can also cause neutral allele frequencies to diverge. The influential effective population size literature has derived (frequency-independent) expressions for C in a wide variety of circumstances [30]. Crucially, a large body of work has attempted to subsume the effects of selection on neutral alleles into the frequency-independent value of C, including both the effects of unlinked fitness variation [31, 32], and some manifestations of linked selection [6, 33, 34]. The effective population size literature thus views Eq (1) as a broadly applicable model of neutral allele divergence, simply requiring a tuning of C to capture the effects of selection on neutral alleles, at least to a first approximation [3, 30].
Here we show, on the contrary, that linked selection causes among-locus neutral allele frequency variance to deviate from the binomial form of Eq (1), such that the variance coefficient C is frequency-dependent. We use this frequency-dependence to detect the presence of selection, analyze its influence on allele frequencies over time and estimate the typical magnitude of total selection coefficients (capturing both direct and linked selection) across the genome. Applying our approach to E&R and wild Drosophila single nucleotide polymorphism (SNP) data we find evidence of strong linked selection affecting most SNPs (although we cannot rule out migratory fluxes in the wild population). We argue that the specific form of frequency-dependence we find implies a substantial role for positive selection.
Results
Neutral evolution implies binomial allele frequency variance
The binomial variance of Eq (1) classically arises in the neutral Wright-Fisher model, which assumes random sampling of gametes each generation; then $C = 1 - (1 - 1/2N)^t$, where N is the (diploid) population size. In its basic form the neutral Wright-Fisher model entails a number of biological simplifications including random mating, constant N, non-overlapping generations, and the absence of fitness differences between individuals. Many of these assumptions can be relaxed without affecting the binomial form of Eq (1), at least approximately for large N and over long timescales [30, 35]. Similarly, much of the justification for Wright-Fisher as a biologically valid description of genetic drift is derived from its equivalence to a broader class of drift models in the limit of large N and slow allele frequency change (the diffusion limit [36]). Here we are interested in shorter time scales (≤ tens of generations), and want our approach to be applicable to small laboratory populations (< $10^3$ individuals). We therefore evaluate the validity of Eq (1) more generally.
An enormous variety of purely neutral genetic drift models have binomial variance [37]. This includes the Cannings model, which represents neutrality using a general exchangeability assumption that allows for arbitrary offspring number distributions [38]. Binomial variance thus accommodates fundamental deviations from Wright-Fisher such as “sweepstakes” reproduction in high-fecundity organisms [39]. However, due to the presence of fitness variation in adapting populations, neutral mutations do not evolve according to “pure drift” of the sort studied in ref. [37], even if unlinked from alleles under selection [31]. In particular, the Cannings model is not applicable because exchangeability precludes fitness variation between individuals.
We show that binomial variance applies quite generally for neutral alleles unlinked from selected loci (A in S1 Text). 
In short, we use a generalized exchangeability argument to show that binomial variance holds in the presence of fitness variation provided that the neutral alleles under consideration are in linkage equilibrium with alleles under selection. Intuitively, linkage equilibrium ensures that the distribution of genetic backgrounds is exchangeable between alternate neutral alleles, even though individual genetic backgrounds are not exchangeable.
Non-binomial variance (equivalently, frequency-dependent C) thus signifies a violation of generalized exchangeability. The obvious way for this to occur is for allele frequency change to have a nonzero bias; this could be due to linked selection, migration, mutation bias or gene drive. Additionally, deviations from binomial variance can occur if the population is structured into genetically differentiated demes (B in S1 Text). Below we check for binomial variance empirically and discuss our findings in relation to these factors leading to non-binomial variance, focusing mostly on selection for reasons that will become apparent.
Note that while our exchangeability argument yields Eq (1) with finite variance for finite N, in the diffusion limit infinite variance is possible in the Cannings model [37]. None of our results depend on N → ∞ limiting behavior, so we do not discuss this possibility further.
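The frequency independence of C under pure drift can be checked numerically. The following is a minimal sketch, assuming a basic neutral Wright-Fisher model with independent loci and using the standard result $C = 1 - (1 - 1/2N)^t$; the function name, parameter values and random seed are illustrative choices, not part of the analysis described here.

```python
import numpy as np

def wf_variance_coefficient(p0, N, t, n_loci, rng):
    """Estimate C = Var(dp | p0) / (p0 (1 - p0)) under neutral Wright-Fisher drift."""
    p = np.full(n_loci, p0, dtype=float)
    for _ in range(t):
        # Each generation, 2N gametes are sampled binomially at every locus.
        p = rng.binomial(2 * N, p) / (2 * N)
    return np.var(p - p0) / (p0 * (1 - p0))

# Illustrative (hypothetical) parameter values.
rng = np.random.default_rng(1)
N, t, n_loci = 500, 10, 200_000
C_theory = 1 - (1 - 1 / (2 * N)) ** t
C_mid = wf_variance_coefficient(0.5, N, t, n_loci, rng)
C_high = wf_variance_coefficient(0.9, N, t, n_loci, rng)
print(f"theory {C_theory:.5f}  p=0.5 {C_mid:.5f}  p=0.9 {C_high:.5f}")
```

Both starting frequencies should give estimates close to the theoretical value, which is the binomial-variance baseline against which deviations are measured.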
Selection creates non-binomial allele frequency variance
We now analyze the effects of selection on allele frequency divergence, demonstrating that deviations from binomial variance will often result.
The expected frequency change after one generation due to selection on an allele starting at frequency p is given by
$$E[\Delta_1 p \mid p, s] = s\,p(1-p), \tag{2}$$
where $s = (w_1 - w_2)/\bar{w}$ is the selection coefficient, $w_1$ is the mean fitness of the focal allele, $w_2$ is the mean fitness of all other alleles at the same locus, and $\bar{w}$ is population mean fitness. Here s is the “total” selection coefficient that captures the net effect of selection at linked loci and the focal locus [40, 41].
Selection generates among-locus divergence of allele frequencies when its strength or sign varies among alleles in a cohort. To quantify this effect, we apply the law of total variance to $\Delta_1 p$ where s is allowed to vary between loci:
$$\mathrm{Var}(\Delta_1 p \mid p) = E[\mathrm{Var}(\Delta_1 p \mid p, s)] + \mathrm{Var}(E[\Delta_1 p \mid p, s]). \tag{3}$$
The second term in Eq (3) represents selective divergence, i.e. the deterministic allele frequency divergence created by among-locus variation in s. Using Eq (2), it can be written as $\sigma^2(s|p)\,[p(1-p)]^2$, where $\sigma^2(s|p)$ is the (possibly frequency-dependent) variance in total selection coefficients among loci with initial frequency p. The presence of the $[p(1-p)]^2$ factor will tend to cause intermediate frequency alleles to have elevated variance relative to the binomial case (Fig 1). Thus, while it is possible for the allele frequency variance created by selection to be binomial, in general it is not. Beyond the tendency for elevated variance at intermediate frequencies, the exact shape and magnitude of the deviation is determined by $\sigma^2(s|p)$.
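The two-term decomposition above can be illustrated with a single generation of selection followed by drift. This is a rough sketch under stated assumptions: total selection coefficients are drawn from a zero-mean normal distribution with a frequency-independent standard deviation `sigma_s`, and drift is Wright-Fisher sampling. All names and parameter values are hypothetical. Under these assumptions the variance coefficient is approximately $1/2N + \sigma^2(s)\,p(1-p)$, so the difference between p = 0.5 and p = 0.9 should be close to $\sigma^2(s)\,(0.25 - 0.09)$.

```python
import numpy as np

def one_generation_C(p, N, sigma_s, n_loci, rng):
    """Variance coefficient Var(dp | p) / (p (1 - p)) after one generation."""
    # Assumption: total selection coefficients vary among loci as N(0, sigma_s^2).
    s = rng.normal(0.0, sigma_s, n_loci)
    # Expected frequency after selection: p + s p (1 - p).
    p_sel = np.clip(p + s * p * (1 - p), 0.0, 1.0)
    # Drift: binomial sampling of 2N gametes around the post-selection frequency.
    p_next = rng.binomial(2 * N, p_sel) / (2 * N)
    return np.var(p_next - p) / (p * (1 - p))

rng = np.random.default_rng(2)
N, sigma_s, n_loci = 1000, 0.05, 400_000   # illustrative values only
C_mid = one_generation_C(0.5, N, sigma_s, n_loci, rng)
C_high = one_generation_C(0.9, N, sigma_s, n_loci, rng)
excess = C_mid - C_high
predicted = sigma_s**2 * (0.5 * 0.5 - 0.9 * 0.1)
print(f"excess {excess:.2e}  predicted {predicted:.2e}")
```

The drift contribution cancels in the difference, leaving only the frequency-dependent selective term.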
Fig 1
(A) The total selection coefficient measures the overall effect of selection on an allele including any associations with other sites under selection. (B) When alleles at different sites have different total selection coefficients, selection generates allele frequency divergence. (C) Compared to the binomial variance (proportional to p(1 − p)) created by random genetic drift, the selective variance tends to be more elevated at intermediate frequencies (proportional to $[p(1-p)]^2$) because the magnitude of selective allele frequency change is proportional to p(1 − p).
More generally, allele frequencies are measured t > 1 generations apart, during which time the selective divergence accumulates. The temporal structure of selection is then important. The total allele frequency change after t generations is the sum over the intervening t generations, $\Delta p = \sum_{i=0}^{t-1} \delta p_i$, where $\delta p_i = p_{i+1} - p_i$ and $p_i$ is the frequency in generation i (i = 0, 1, …, t − 1 counting from the preceding measurement). From Eq (2) we have $E[\delta p_i] = s_i\,p_i(1 - p_i)$, where $s_i$ is the total selection coefficient in generation i. Assuming that the total selection coefficients and total allele frequency change over t generations are small ($|\Delta p| \ll 1$ and $|s_i| \ll 1$), the expected allele frequency change is approximately $p(1-p)\sum_{i=0}^{t-1} s_i$ (dropping terms of order $s^2$). The selective divergence is then given by
$$\mathrm{Var}\big(E[\Delta p \mid p, s_0, \ldots, s_{t-1}]\big) \approx [p(1-p)]^2 \left[\sum_{i=0}^{t-1} \mathrm{Var}(s_i \mid p) + \sum_{i \neq j} \mathrm{Cov}(s_i, s_j \mid p)\right]. \tag{4}$$
The two sums on the right represent, respectively, the divergence contribution from fitness variation within intervening generations, and the divergence contribution from temporal consistency in fitness variation across intervening generations.
Sustained selection manifests as positive among-locus temporal covariances $\mathrm{Cov}(s_i, s_j) > 0$. If total selection coefficients were perfectly constant with time these positive covariances would create rapid selective divergence with the allele frequency variance in Eq (4) growing quadratically over time (because there are t(t − 1) covariance terms in Eq (4); for further details see C in S1 Text). 
However, for neutral alleles (the bulk of segregating variants), the temporal covariance between $s_i$ and $s_j$ is expected to decay exponentially with increasing time separation |j − i| due to recombination. In the two-locus case where the neutral allele is hitchhiking with one selected allele, linkage disequilibrium (and thus covariance) decays as $\sim(1 - r)^{|j-i|}$, where r is the recombination rate between the two alleles [1]. The multilocus case similarly involves exponential decay averaged over all linked sites under selection [42]. Nevertheless, even if recombination destroys linkage disequilibrium so rapidly that only consecutive generations (|i − j| = 1) covary, there are still t − 1 such pairs contributing to Eq (4). Thus, among-locus temporal autocovariances $\mathrm{Cov}(s_i, s_j)$ can make a substantial contribution to the overall selective divergence.
Alternatively, even if selection fluctuates in such a way that total selection coefficients are temporally uncorrelated, $\mathrm{Cov}(s_i, s_j) = 0$, the within-generation selective divergence can still create non-binomial frequency dependence. The variance resulting from this effect accumulates at a slower linear rate with time (because there are t variance terms in Eq (4); C in S1 Text), giving a selective random walk [43]. Selection that changes in a more predictable manner could in principle generate no overall divergence at all: if selection reverses direction concurrently at many loci, negative covariances can be created in Eq (4), shrinking the overall divergence.
In addition to the selective divergence described above, selection has another effect in Eq (3): it perturbs the drift contribution to divergence $E[\mathrm{Var}(\Delta p \mid p, s_0, \ldots, s_{t-1})]$. This effect occurs when a mean selective bias in the cohort displaces allele frequencies and thus perturbs the effects of drift (regardless of whether there is among-locus variation in total selection coefficients). 
We show that the selective perturbation to the drift variance has a form that involves a frequency-independent constant c of order 1 (D in S1 Text). This result assumes that the cohort does not start close to fixation, and is also insensitive to population dynamic specifics if many generations separate measurements (t ∼ 10 in the data we analyze). For a generational measurement interval (t = 1) this result also holds in canonical models (i.e. Wright-Fisher and Moran), but in general it is possible that the exact form of the selective drift perturbation depends on population specifics. In the following analysis the exact expression for the selective drift perturbation will not be important; we only use its scaling, which implies that its effects are negligibly small in the populations of interest here (Methods).
Combining variance contributions we have
$$C(p) = \frac{\mathrm{Var}(\Delta p \mid p)}{p(1-p)} = D + (\text{selective drift perturbation}) + p(1-p)\,\mathrm{Var}\!\left(\sum_{i=0}^{t-1} s_i \,\middle|\, p\right), \tag{5}$$
where D is the frequency-independent variance coefficient in the absence of selection. The variance coefficient C(p) is thus partitioned respectively into a frequency-independent genetic drift component, a frequency-dependent selective drift perturbation, and a frequency-dependent selective divergence.
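The contrast between linear and quadratic accumulation of selective divergence can be made concrete by summing a temporal covariance matrix of total selection coefficients. The sketch below assumes the selective divergence has the form $[p(1-p)]^2\,[\sum_i \mathrm{Var}(s_i) + \sum_{i \neq j} \mathrm{Cov}(s_i, s_j)]$ and, as a simplifying assumption, a geometric covariance decay $\mathrm{Cov}(s_i, s_j) = \sigma^2 \rho^{|i-j|}$ (with ρ playing the role of 1 − r in the two-locus case). Function names and parameter values are hypothetical.

```python
import numpy as np

def temporal_covariance(t, sigma2, rho):
    """Covariance matrix of s_0..s_{t-1} with Cov(s_i, s_j) = sigma2 * rho**|i - j|."""
    lags = np.abs(np.subtract.outer(np.arange(t), np.arange(t)))
    return sigma2 * np.power(float(rho), lags)

def selective_divergence(p, cov_s):
    """Return the (within-generation, between-generation) divergence terms."""
    weight = (p * (1 - p)) ** 2
    within = weight * np.trace(cov_s)                    # t variance terms
    between = weight * (cov_s.sum() - np.trace(cov_s))   # t(t - 1) covariance terms
    return within, between

# Illustrative values: sigma2 = 1e-4 corresponds to |s| of order 1%.
for rho in (0.0, 0.5, 1.0):
    totals = [sum(selective_divergence(0.5, temporal_covariance(t, 1e-4, rho)))
              for t in (5, 10, 20)]
    print(rho, ["%.2e" % x for x in totals])
```

With ρ = 0 the total doubles when t doubles (a selective random walk); with ρ = 1 it quadruples, reflecting perfectly sustained selection.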
The deviation from binomial allele frequency variance described in the previous section depends crucially on the among-locus total selection coefficient variance $\sigma^2(s|p)$. This quantity is challenging to analyze because it is determined by the structure of linkage disequilibrium. We thus performed forward-time population genetic simulations using SLiM [44] to supplement our theoretical results (see Methods for simulation details). For simplicity, we focus on three archetypal scenarios in an unstructured, demographically stable population closed to migration: a continual influx of deleterious mutations, no non-neutral mutations (the control case), and a continual influx of unconditionally beneficial mutations. For short we call the first and last of these “negative selection” and “positive selection” respectively. In all three cases we maintain a steady influx of neutral mutations; these constitute the bulk of segregating mutations and therefore dominate the behavior of $\sigma^2(s|p)$. Intuitively we expect that the frequency dependence of $\sigma^2(s|p)$ could be quite different in the negative versus positive selection scenarios, because unconditionally deleterious mutations strong enough to cause detectable allele frequency divergence rarely reach intermediate frequencies, whereas beneficial mutations routinely do so.
To check for binomial frequency variance, we use allele frequencies from two timepoints t = 10 generations apart (chosen for compatibility with the empirical data we consider below) to calculate C = Var(Δp|p)/p(1 − p) for alleles starting at intermediate 0.5 < p < 0.55 and high 0.9 < p* < 0.95 major allele frequencies. We then calculate the “excess variance” C(p) − C(p*). We also calculate total selection coefficients for all segregating mutations to investigate how the selective divergence term in Eq (5) behaves as a function of p. 
To make the magnitude of the latter easier to interpret, we show total selection coefficient variance on a per-generation scale $\sigma^2(\bar{s}|p)$, where $\bar{s} = \frac{1}{t}\sum_{i=0}^{t-1} s_i$ is the time-averaged total selection coefficient.
According to the theory in the preceding section, selective divergence tends to create positive excess variance due to the p(1 − p) factor in the last term in Eq (5). Our positive selection simulations confirm this prediction, consistently creating positive excess variance (Fig 2A and 2C). On the other hand, there is no consistent deviation from binomial variance in the negative selection simulations: $\sigma^2(\bar{s}|p)$ increases with major allele frequency so rapidly that the overall selective divergence term in Eq (5) is independent of frequency (Fig 2A and 2B). While these simulations are obviously simplified, the concentration of selective divergence at low/high frequencies is a general feature of the purging of new deleterious mutations. Thus, selection does generate elevated variance at intermediate frequencies as predicted theoretically, but not just any form of selection: it is important that selection be “positive” in the sense of not only eliminating rare variants.
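The excess variance statistic described above can be computed directly from two arrays of allele frequencies. The following is a minimal sketch, not the code used for the simulations: it assumes frequencies are already polarized to the major allele at the first timepoint, normalizes by p(1 − p) evaluated at the bin's mean frequency (an assumption about how binning is handled), and uses half-open bins so that p = 0.5 is included. The function names and the four-locus toy data are hypothetical.

```python
import numpy as np

def variance_coefficient(p_start, p_end, lo, hi):
    """C = Var(dp | p) / (p (1 - p)) for loci with starting frequency in [lo, hi)."""
    p_start = np.asarray(p_start, dtype=float)
    p_end = np.asarray(p_end, dtype=float)
    in_bin = (p_start >= lo) & (p_start < hi)
    dp = p_end[in_bin] - p_start[in_bin]
    p_bin = p_start[in_bin].mean()   # assumption: normalize at the bin's mean frequency
    return dp.var() / (p_bin * (1 - p_bin))

def excess_variance(p_start, p_end, mid_bin=(0.5, 0.55), ref_bin=(0.9, 0.95)):
    """Excess variance C(p) - C(p*) between an intermediate and a reference bin."""
    return (variance_coefficient(p_start, p_end, *mid_bin)
            - variance_coefficient(p_start, p_end, *ref_bin))

# Toy data: two loci per bin, purely for illustration.
p_start = np.array([0.50, 0.50, 0.92, 0.92])
p_end = np.array([0.60, 0.40, 0.94, 0.90])
print(excess_variance(p_start, p_end))
```

A positive value indicates more divergence at intermediate frequencies than the binomial form would predict from the reference bin.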
Fig 2
(A) Forward-time population genetic simulations consistently show elevated excess variance under positive selection only. Excess variance defined as C(p) − C(p*) with major allele frequencies 0.5 < p < 0.55 and 0.9 < p* < 0.95 and t = 10 generations. (B) Under strong negative selection (deleterious mutation rate U = 1/genome/generation, mutation selection coefficient s = −0.05), total selection coefficients are substantial at all frequencies but much stronger for high major allele frequencies resulting in a frequency-independent overall selective divergence like the neutral case. (C) In contrast, the selective divergence shows clear frequency dependence under positive selection, thus producing excess variance at intermediate frequencies. Population size N = 1000; 100 replicates per parameter combination. Stars indicate which panel A simulations are shown in panels B and C respectively.
Intermediate frequency alleles have elevated variance in Drosophila
We next investigated whether binomial allele frequency variance is observed empirically. In two fruit fly (D. simulans) E&R experiments [11, 12], we observe systematically elevated variance coefficients C at intermediate frequencies (Fig 3). We rule out measurement error as driving this pattern, because the major sources of pooled sequencing error (population sampling, read sampling, unequal individual contributions to pooled DNA) also create binomial variance rather than a systematic frequency-dependent bias (E in S1 Text; [45, 46]). We also rule out migration, since these E&R populations are closed. Moreover, as will be discussed in the next section, systematically elevated variance cannot be explained by a few large effect loci, implying that a substantial fraction of SNPs across the genome are involved in the observed pattern. Hence we also rule out mutation bias and gene drive as the main drivers of elevated variance at intermediate frequencies, since these processes do not have the requisite scale. Finally, population structure tends to create a variance deficit at intermediate frequencies (B in S1 Text); thus, even if some population structure is present in these closed E&R populations, it would tend to eliminate the observed elevation of variance, not explain it. We deduce that the pattern observed in Fig 3 is due to selection, consistent with the theoretical prediction that selective divergence tends to cause elevated variance at intermediate frequencies.
Fig 3
Intermediate frequency SNPs in E&R D. simulans populations (A [11]; B [12]) have systematically elevated variance coefficients C(p) = Var(Δp|p)/p(1 − p) relative to higher frequency SNPs after one round of evolution and resequencing (t ≈ 10 in A; t ≈ 15 in B), inconsistent with the binomial expectation for neutrally evolving alleles, Eq (1).
C(p) is calculated in 2.5% major allele frequency bins using all SNPs in the genome (circles). Vertical lines show 95% block bootstrap confidence intervals (1Mb blocks). We subtract the constant min C(p) from C(p) in each replicate to prevent differences in the overall magnitude of C(p) between replicates from obscuring p dependence within each replicate.
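A block bootstrap of the kind mentioned in the caption can be sketched as follows. This is an illustrative percentile bootstrap, not necessarily the exact procedure used for the figure: SNPs in one frequency bin are grouped by a genomic block label (for example, chromosome plus position divided by 1 Mb), whole blocks are resampled with replacement, and C is recomputed each time. The function name, arguments and synthetic data are assumptions.

```python
import numpy as np

def block_bootstrap_ci(dp, p0, block_id, n_boot=1000, alpha=0.05, rng=None):
    """Percentile CI for C = Var(dp) / (p0 (1 - p0)), resampling whole blocks."""
    rng = np.random.default_rng() if rng is None else rng
    dp = np.asarray(dp, dtype=float)
    block_id = np.asarray(block_id)
    # Group frequency changes by genomic block so linked SNPs stay together.
    blocks = [dp[block_id == b] for b in np.unique(block_id)]
    stats = np.empty(n_boot)
    for k in range(n_boot):
        picks = rng.integers(len(blocks), size=len(blocks))
        resampled = np.concatenate([blocks[i] for i in picks])
        stats[k] = resampled.var() / (p0 * (1 - p0))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Synthetic example: 50 blocks of 200 SNPs, all in the p = 0.5 bin.
rng = np.random.default_rng(3)
block_id = np.repeat(np.arange(50), 200)
dp = rng.normal(0.0, 0.03, block_id.size)
C_hat = dp.var() / 0.25
ci_low, ci_high = block_bootstrap_ci(dp, 0.5, block_id, n_boot=500, rng=rng)
print(C_hat, ci_low, ci_high)
```

Resampling blocks rather than individual SNPs is intended to respect correlations among linked sites; the synthetic data here are independent, so the interval is narrower than it would be for real genomes.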
Similar results are found in a wild D. melanogaster population [15] (S1 Fig), although this population is not closed and elevated variance could also be attributed to migration. The effect of migration on allele frequency divergence can be understood analogously to selection (Eq (3)) as introducing a migration divergence term $\mathrm{Var}(m(p^* - p) \mid p) = m^2\,\mathrm{Var}(p^* - p \mid p)$, where m is the proportion of individuals in the focal population replaced by migrants from the source population each generation, and p* denotes source population frequencies. The migration divergence thus depends on the structure of differentiation between focal and source populations. The a priori expectation is for Var(p* − p|p) to be greatest at high p (the opposite of the observed pattern), where the largest differences p* − p are possible (analogous to the mathematical constraints on $F_{ST}$ [47]). However, since we do not know the structure of population differentiation (or even what the source population might be), we remain agnostic about the influence of migration in the ref. [15] population.
Next we explored the behavior over time of the elevated variance shown in Fig 3 by following its accumulation within a frequency cohort for two studies in which allele frequencies were measured more than twice [11, 15]. Similar to our simulations, at each measured timepoint we quantified the excess variance using the difference C(p) − C(p*), where p is the initial frequency of the cohort and p* > p is a reference frequency. 
In practice we choose p = 0.5 to maximize the contrast with the reference frequency, while p* ∼ 0.8–0.9 is chosen to be large enough that there is a meaningful contrast with p = 0.5 but safely displaced from the p = 1 boundary where allele frequency variances are not measured reliably (see sharp increases in Fig 3 as p → 1).
We find that excess variance accumulates over the course of the entire Barghi et al. [11] E&R experiment (Fig 4A shows one replicate, other replicates are similar; S2 Fig), implying a sustained, polygenic divergence in allele frequencies. This pattern is consistent with the positive Δp temporal autocovariances documented in [28]. Sustained divergence is what we expect to occur from selection in a novel but constant laboratory environment.
Fig 4
Excess allele frequency variance (a measure of deviation from neutrality defined as C(p) − C(p*)) accumulates over time in a D. simulans E&R experiment (A; [11]), but remains relatively flat in a wild D. melanogaster population (B; S = Spring, F = Fall, LF = Late Fall; 09 = 2009 etc.; [15]).
The excess variance is calculated for intermediate frequency alleles falling within a major allele frequency bin at p = 0.5. In (A), p* = 0.9 and bin width is 2.5%. In (B), p* = 0.8 and bin width is 5%. Vertical lines show 95% block bootstrap confidence intervals (1Mb blocks).
By contrast, excess variance in the wild D. melanogaster population [15] does not accumulate continually over time, with fluctuations evident in each cohort (Fig 4B). Fluctuations imply a concurrent reversal in the direction of non-neutral allele frequency change across many loci such that non-neutral divergence is partly lost to a subsequent coordinated non-neutral convergence. Bearing in mind that migration may contribute to this pattern, the fluctuations shown in Fig 4B are compatible with temporally fluctuating selection affecting a large proportion of the genome, as proposed by ref. [15]. However, while ref. [15] attributed temporal fluctuations in selection to periodic seasonal change, we do not see a clear annual periodicity in the accumulation of variance. A similar lack of annual periodicity is found in allele frequency temporal autocovariances [28]. These results suggest a more complex selective (or migratory) regime of which seasonal fluctuations are only a part.
Linked selection strongly perturbs SNP frequencies in Drosophila
In the previous section we argued that selection is most likely responsible for elevated allele frequency divergence at intermediate frequencies in three Drosophila studies (with the possible exception of the ref. [15] study because of migration). We next used the theory developed above to estimate the typical magnitude of total selection coefficients associated with elevated divergence (we also apply our analysis to ref. [15] supposing that selection was responsible).
We measure the typical intensity of selection using the among-locus standard deviation σ(s|p). This quantity determines the selective divergence in Eq (5), and has the convenient property of measuring the absolute magnitude of s regardless of sign. Intuitively, σ(s|p) measures the intensity of a collective “polygenic” adaptive response shared across many loci. If a fraction f of loci have s = 0, then σ(s|p) depends on f as well as on σnn(s|p), the standard deviation in s among non-neutral loci. Thus, a substantial fraction of the alleles in a cohort must have nonzero s (f appreciably smaller than 1) for there to be a discernible σ(s|p) signal.
We estimate σ(s|p) from measured allele frequency divergence using Eq (5). Since we only have measurements separated by t generations, we actually estimate the corresponding quantity for the selection coefficient time-averaged over those t generations. To estimate it from Eq (5), we need to eliminate the non-selective divergence contributions of genetic drift D and measurement error (which was not included in Eq (5)). In Methods we show that these latter contributions are cancelled out in the excess variance C(p) − C(p*), avoiding the complication of independently estimating them. However, some selective divergence is also cancelled out in the difference C(p) − C(p*), so that this approach only obtains a lower bound (Eq (6)).
In all three Drosophila studies, we find the above lower bound to be of order 10⁻⁴ (Fig 5), implying that total selection coefficients with magnitudes of order 10⁻² are commonplace in the populations considered here.
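To make the dependence of σ(s|p) on the neutral fraction f concrete, the following is the general variance identity for a two-component mixture, offered as a sketch rather than as the expression used in the paper. The symbol μ_nn (the mean of s among non-neutral loci) is introduced here for illustration and does not appear in the source.

```latex
% mu_nn: mean of s among non-neutral loci (symbol assumed for this sketch).
\begin{align}
E[s \mid p]   &= (1-f)\,\mu_{nn}, \\
E[s^2 \mid p] &= (1-f)\left(\sigma_{nn}^2(s \mid p) + \mu_{nn}^2\right), \\
\sigma^2(s \mid p) &= E[s^2 \mid p] - E[s \mid p]^2
  = (1-f)\,\sigma_{nn}^2(s \mid p) + f(1-f)\,\mu_{nn}^2 .
\end{align}
```

In the special case μ_nn = 0 this reduces to σ(s|p) = √(1 − f) σnn(s|p), so the signal shrinks toward zero as f approaches 1, in line with the statement that a substantial fraction of alleles must have nonzero s.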
Fig 5
Total selection coefficients show substantial among-locus variance in Drosophila.
(A-C) Lower bound estimates of the among-locus variance in time-averaged total selection coefficients, calculated from Eq (6) (circles; vertical lines show 95% block bootstrap confidence intervals), are of order 10⁻⁴, which implies typical s values of ∼1%. Following the original studies [11, 12, 15], we assume t = 10 (A), t = 15 (B) and t = 10 (C; for both summer and winter).
Discussion
Several lines of evidence support the view that selection strongly influences genetic variation in Drosophila [8, 12, 28, 48]. Our results independently show that even over a short time interval (tens of generations), most intermediate frequency SNPs are influenced by selection: total selection coefficients (which include linked selection) of |s| ∼ 1% are the norm among intermediate frequency SNPs, despite most of these SNPs having no effect on fitness. Since our method relies on contrasting behavior at different frequencies, the effect of selection on extreme frequency alleles is used as a reference and is therefore not directly inferred. We expect the effects of selection to be even greater at extreme frequencies where most deleterious mutations are segregating and recent neutral mutations are most tightly linked to selected backgrounds.
The power of our approach stems from aggregating allele frequency behavior over many loci, thereby leveraging the sheer number of variants measured with whole-genome sequencing to discern a selective signal. Heuristically, the sampling error in the lower bound estimate (6) is proportional to 1/√L, where L is the number of independent loci used to estimate C(p). With enough sequenced variants (L ∼ 10⁵), selection coefficients of order |s| ∼ 1% should be detectable over a single generation even when allele frequency noise is of comparable magnitude (i.e. read depth and population size ∼10²; see Methods). Intuitively, variants across the genome experience a detectable non-neutral shift as a collective even though the underlying allele frequency changes may be indistinguishable from drift at individual loci.
Our approach is a departure from the widespread use of frequency-independent C for neutral mutations [30]. The variance coefficient C can be expressed in terms of the “variance effective population size” N as C = 1 − (1 − 1/(2N))ᵗ. Thus, selection makes N frequency-dependent for neutral mutations over short timescales (i.e. before an appreciable fraction of the alleles in a cohort fix). The origin of this non-binomial allele frequency variance is variation in the selective background of alleles at different loci.
Selection does not need to be consistent over time to have this effect: stochastically fluctuating selection with no temporal consistency can also generate non-binomial allele frequency variance. However, temporally consistent selection generates divergence more rapidly, and temporal covariances can be responsible for most of the selective divergence (Results). Moreover, allele frequency changes Δp are correlated over time in the systems analyzed here [28]. Thus, it seems likely that temporally consistent selection is at least partly responsible for the patterns documented here.
Note, however, that in contrast to ref. [28], the temporal covariances relevant to allele frequency divergence in Eq (4) are between total selection coefficients, not Δp. For among-locus temporal covariances in total selection coefficients to be non-zero it is necessary for those coefficients to vary among loci, whereas Δp covariances quantify any temporal consistency in allele frequency change [42]. Thus, Δp temporal covariances can theoretically be present without any selective divergence, and vice versa. In practice, the temporal autocovariances in Δp must be calculated across three measurement steps, e.g. Cov(pₜ − p₀, p₂ₜ − pₜ). These cross-measurement covariances do not contribute to the divergence observed at t generations, and are only a subset of the covariances contributing to the divergence observed at 2t generations (Eq (4)). Therefore, the patterns of variance accumulation documented here are related but not equivalent to the patterns documented in ref. [28]. Temporal autocovariances in Δp predominantly capture the extent to which the genome-wide influence of selection has a temporally enduring pattern across measurements. Allele frequency divergence captures the cumulative genome-wide influence of both temporally stable and fluctuating selection between two measurements. The relative contribution from temporal covariances in total selection coefficients depends on the intensity of selective fluctuations as well as the persistence time of linkage disequilibrium (Results), and would require generational allele frequency measurements to quantify.
We found that the frequency structure of allele frequency divergence is informative about the underlying structure of direct selection (Fig 2). Elevated divergence of intermediate frequency alleles is difficult to explain if only negative selection on unconditionally deleterious mutations is occurring. Although selection against an influx of deleterious mutations can generate transient sweep-like behavior for neutral mutations that originate on genetic backgrounds with above-average fitness, this scenario still entails overwhelmingly more influence on allele frequency dynamics at low/high frequencies compared to intermediate frequencies [49]. More broadly, it may be possible to make more detailed inferences about the structure of direct selection by moving beyond allele frequency variances and analyzing the entire distribution of allele frequency change Δp.
Quantifying the bounds on how much selection is possible, and how much selection actually occurs in natural populations, is a long-running controversy [50, 51]. The strong total selection coefficients (|s| ∼ 1%) we find must predominantly reflect linked selection on neutral SNPs. This implies a substantial risk of overestimating the amount of direct selection when, as is commonly done, selection coefficients are inferred at individual loci and then attributed to direct selection. This “excess significance” is a well known difficulty in E&R experiments [12, 52], and similar challenges have arisen in wild populations [15]. Our results indicate that improving the sensitivity of single-locus selection coefficient inferences, or better controlling for multiple comparisons, will likely not resolve this issue. Our total selection coefficient estimates are also substantially larger than direct selection coefficients of individual alleles estimated from diversity patterns in Drosophila [8]. This is consistent with a linkage-centered view of neutral mutation evolution in which the selective background of most neutral mutations contains multiple alleles under selection such that allele frequency behavior is governed by the fitness variation within local “linkage blocks” [53] or larger haplotypes [11].
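The link made earlier in this Discussion between the variance coefficient C and the variance effective population size N can be illustrated numerically. The sketch below assumes the standard t-generation drift relation C = 1 − (1 − 1/(2N))^t; the function names and the example C values are hypothetical and are not taken from the data.

```python
def variance_coefficient_from_N(N, t):
    """C accumulated over t generations of drift with effective size N."""
    return 1.0 - (1.0 - 1.0 / (2.0 * N)) ** t

def effective_size_from_C(C, t):
    """Invert the relation above: the N implied by an observed C over t generations."""
    return 1.0 / (2.0 * (1.0 - (1.0 - C) ** (1.0 / t)))

# Hypothetical values: a larger C at intermediate frequency than at the
# reference frequency translates into a smaller apparent N there.
n_intermediate = effective_size_from_C(0.012, t=10)
n_reference = effective_size_from_C(0.010, t=10)
```

Read this way, a frequency-dependent C is equivalent to a frequency-dependent N, which is the sense in which selection cannot be absorbed into a single effective population size over short timescales.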
Methods
Simulations
We used SLiM [44] to simulate a closed population with N = 10³ individuals, a 100 Mb diploid genome, a recombination rate of 10⁻⁸/base pair/generation, and a neutral mutation rate of 10⁻⁸/base pair/generation. Non-neutral mutations were introduced at rate U/chromosome/generation, where in each simulation non-neutral mutations were assumed to have the same fixed selection coefficient. Four background selection regimes (each combination of U = 1, 0.1 and s = −0.05, −0.01), one neutral regime (U = 0), and four positive selection regimes (each combination of U = 0.1, 0.01 and s = 0.01, 0.02) were evaluated (Fig 2). In each regime, 100 replicates were simulated with complete genotypes recorded at generations 10⁴ and 10⁴ + 10, mimicking the t = 10 generation interval in the empirical studies after a burn-in period of 10N = 10⁴ generations. Total selection coefficients in Fig 2B and 2C were computed using Eq (2) from genotype data at generation 10⁴.
Data processing
SNP frequency data were obtained from the open access resources published in [15] (wild D. melanogaster, 1 replicate, ∼5 × 10⁵ SNPs, 7 timepoints), [11] (D. simulans E&R, 10 replicates, ∼5 × 10⁶ SNPs, 7 timepoints) and [12] (D. simulans E&R, 3 replicates, ∼3 × 10⁵ SNPs, 2 timepoints). We performed no additional SNP filtering. For the [15] data, only SNPs tagged as “used” were included.
Block bootstrap confidence intervals
We use bootstrapping to estimate the variability of the quantities plotted in Figs 3–5. These quantities are calculated as an average over loci, where nearby loci are unlikely to be statistically independent due to linkage. To account for the non-independence of individual loci when bootstrap sampling, 95% confidence intervals are calculated using a block bootstrap procedure [28]. Each chromosome is partitioned into 1 megabase windows (∼120 total windows). Bootstrap sampling is then applied to these windows. The plotted vertical lines span the 2.5% and 97.5% block bootstrap percentiles.
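The block bootstrap described above can be sketched as follows. This is a minimal illustration rather than the procedure's actual implementation: the `(chromosome, position, value)` input layout, the function name and the default of 1000 bootstrap replicates are assumptions, while the 1 Mb block size and the 2.5% and 97.5% percentiles follow the text.

```python
import random
import statistics
from collections import defaultdict

def block_bootstrap_ci(loci, statistic, block_size=1_000_000, n_boot=1000, seed=0):
    """95% percentile interval for statistic(values), resampling genomic blocks.

    loci: iterable of (chromosome, position, value) tuples (hypothetical layout).
    statistic: function mapping a list of per-locus values to a number.
    """
    # Group loci into fixed-size windows on each chromosome.
    blocks = defaultdict(list)
    for chrom, pos, value in loci:
        blocks[(chrom, pos // block_size)].append(value)
    block_list = list(blocks.values())

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Draw whole blocks with replacement so linked loci stay together.
        sample = [v for block in rng.choices(block_list, k=len(block_list))
                  for v in block]
        estimates.append(statistic(sample))

    # With n=40 the first and last cut points are the 2.5% and 97.5% percentiles.
    cuts = statistics.quantiles(estimates, n=40)
    return cuts[0], cuts[-1]
```

Resampling windows rather than individual SNPs keeps the within-window correlation structure intact in each replicate, which is why the resulting interval is wider, and more honest, than one based on resampling loci independently.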
Estimation of the selection coefficient variance
To derive Eq (6) we show that the reference value C(p*) satisfies an inequality, Eq (7). The first line of Eq (7) is Eq (5) evaluated at the reference frequency p* with an additional measurement error term M included. M is frequency-independent because measurement error is binomial (E in S1 Text; [45, 46]). Eq (7) implies that the reference value C(p*) is an upper bound on the drift and measurement components of C(p) for all p. Taking the difference C(p) − C(p*) then eliminates D and M.
To derive Eq (7) we first drop the selective drift perturbation because it is negligibly small compared to the selective divergence in the populations considered here: C (and therefore D) is of order 10⁻², E[s|p] is at most of order 10⁻², and t ∼ 10; hence the perturbation is small. By comparison, the selective divergence is of order 10⁻². Second, we have p*(1 − p*)σ²(st|p*) > 0; subtracting this term gives the inequality.
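The logic of this argument can be written out as a short chain of inequalities. This is a sketch under stated assumptions, not a reproduction of Eqs (5) to (7): it assumes that, once the selective drift perturbation is dropped, the selective divergence enters C(p) as p(1 − p)σ²(s_t|p) (the form the text gives at p*), and it writes the source's "st" as s_t.

```latex
% Assumes C(p) = drift D + measurement error M + selective divergence,
% with the selective drift perturbation already dropped.
\begin{align}
C(p)   &\approx D + M + p(1-p)\,\sigma^2(s_t \mid p), \\
C(p^*) &\approx D + M + p^*(1-p^*)\,\sigma^2(s_t \mid p^*) \;\geq\; D + M, \\
C(p) - C(p^*) &\leq p(1-p)\,\sigma^2(s_t \mid p)
\quad\Longrightarrow\quad
\sigma^2(s_t \mid p) \;\geq\; \frac{C(p) - C(p^*)}{p(1-p)} .
\end{align}
```

The bound is conservative because whatever selective divergence is present at the reference frequency p* is subtracted along with D and M.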
Estimation limits
Our analysis relies on detecting differences in C(p) between cohorts with different values of p. The ability to detect such differences is determined by the sampling error in C(p) arising due to the calculation of Var(Δp|p) from a finite number of loci. To estimate this sampling error, we assume that Δp is approximately normally distributed, in which case the sample variance in Var(Δp|p) is 2Var(Δp|p)²/(L − 1) ≈ 2Var(Δp|p)²/L, where L ≫ 1 is the number of independent loci used to estimate Var(Δp|p). The standard error in C(p) = Var(Δp|p)/p(1 − p) is thus given by √(2/L) C(p). This defines the scale of statistically detectable differences in C(p) − C(p*), which in turn determines the statistically detectable lower bound estimate on σ²(s|p) (6). For example, to detect σ²(s|p) ∼ 10⁻⁴ at p = 0.5 (i.e. a typical selection coefficient of σ(s|p) ∼ 1%) after one generation of evolution with C₁ ∼ 10⁻² (i.e. a population sample of ∼100 individuals, an average read depth of ∼100 and fairly strong genetic drift D₁ ∼ 10⁻²), we need at least L ∼ 10⁵ independent SNPs.
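The worked example above can be checked with a few lines of Python. The sketch assumes a simple detection criterion, namely that the standard error √(2/L) C(p) should not exceed the selective signal p(1 − p)σ²(s|p) after one generation; that criterion and the function names are assumptions made for the example.

```python
import math

def standard_error_C(C, L):
    """Standard error of C(p) estimated from L independent loci (normal dp)."""
    return math.sqrt(2.0 / L) * C

def loci_needed(C, sigma2_s, p):
    """Smallest L for which the standard error of C(p) does not exceed
    the one-generation selective signal p (1 - p) sigma^2(s | p)."""
    signal = p * (1.0 - p) * sigma2_s
    return math.ceil(2.0 * (C / signal) ** 2)

# Values from the example in the text: C of order 1e-2, sigma^2 of order 1e-4, p = 0.5.
L_required = loci_needed(C=1e-2, sigma2_s=1e-4, p=0.5)
```

With these inputs the required number of independent loci comes out at a few times 10⁵, consistent with the order-of-magnitude statement in the text.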
Supplemental text.
This file contains supplemental text sections A-E. (PDF)
Frequency dependence of C.
Same as Fig 3 but for the Bergland et al. data. Each curve represents a different seasonal iterate, e.g. summer 2009 to fall 2009. (TIF)
All Barghi et al. replicates.
Same as Fig 4A but including all 10 replicates from Barghi et al. (TIF)
7 Sep 2021
Dear Dr Bertram,
Thank you very much for submitting your Research Article entitled 'Allele frequency divergence reveals ubiquitous influence of positive selection in Drosophila' to PLOS Genetics.
The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.
Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.
We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.
Yours sincerely,
Alex Buerkle, Associate Editor, PLOS Genetics
Bret Payseur, Section Editor: Evolution, PLOS Genetics
This manuscript has been reviewed by two referees, both of whom find the theoretical, methodological, and empirical results to be interesting.
The reviews offer a number of suggestions for improvement of the manuscript, including encouragement to move material from the Supplement/Appendix into the main paper and to develop one or more visualizations to communicate the core result. They also raise some questions that are important for clarity and evaluation of the manuscript.
Reviewer's Responses to Questions
Comments to the Authors:
Reviewer #1: Bertram develops a theoretical partitioning of allele frequency change to distinguish genetic drift from natural selection from time series data, which here is SNP allele frequencies scored over time within an evolving population. The signal is that selection elevates the magnitude of change at intermediate allele frequencies relative to drift. He applies the approach to allele frequency measurements from laboratory and wild Drosophila populations and finds that selection has genome-wide effects. This paper is worthy of publication because it illustrates an important way that linked selection / genetic draft cannot be subsumed into an “effective population size effect.” The applications to Drosophila demonstrate the genomic extent of this process. I have two major suggestions and several minor comments.
I. The paper has the structure of a short-format paper (e.g. Nature or PNAS). PLOS Genetics allows a more rigorous treatment of a topic within the main body of the paper. For this reason, I think Appendix S3 should be moved into the main paper. The equations of S3 provide the reader with a far more explicit understanding of the parts of eq 4. It is important to see the sources of variability (and covariation) in selection coefficients, both among loci and through time, as determinants of outcomes. It also sets up an important contrast for the application to Drosophila data.
Temporally fluctuating selection (of a specific flavor) is implicated for the Bergland et al field studies but not for the laboratory Evolve-and-Resequence studies. These differences are discussed (e.g. lines 207 onward, lines 307-325) but could be explained more clearly by reference to previous in-text equations.
II. The semantic classifications of selection are inconsistent and confusing. For example, Lines 12-14 set up selection as positive or negative. Directional selection is always positive on one allele and negative on the other. Here, the idea seems to involve the initial allele frequency of the favored allele. If so, that should be defined explicitly. At other points in the manuscript, the author contrasts positive selection to “purifying selection.” The latter term is typically associated with mutation-selection balance. Purifying selection is where selection acts to eliminate variation. It's usually considered as the alternative to balancing selection (where selection acts to maintain polymorphism), but that may be the meaning here.
Perhaps for terminology reasons, I also struggled to follow the section of Lines 253-272. The point seems to be that “selection acting continuously to purge deleterious alleles” does not explain the results. Are the hypothesized mutations unconditionally deleterious? If so, they would never have made it into the studies in the first place. Mutations that are rare at time zero and then stay rare would be filtered during variant calling and thus never make it into the analyzed datasets. If the mutations were at intermediate frequency at the initiation of the study (time = 0), they would not likely have been unconditionally deleterious. One allele may have become deleterious in the lab environment in which the population was maintained if it was different (or even just more constant) than the ancestral environment.
However, I am not sure how this scenario is distinct from “positive selection.”
Other comments:
Lines 27-29: “Numerous methods exist for inferring selection coefficients from allele frequency time series [10, 17–24], but are only reliable for selection that is strong relative to the intensity of random, non-selective allele frequency change (random genetic drift).” For samples from natural populations (millions of individuals), the noise from finite sampling (limited numbers of individuals taken plus limited sequencing coverage) is likely to be a more important problem than genetic drift in terms of the power of outlier tests.
Line 370: Does “eliminate” mean ignore here? I cannot tell if this is a mathematical operation or the term has simply been dropped.
Reviewer #2: This MS develops a framework for testing temporal fluctuations in allele frequency due to selection. This framework uses the variance in allele frequencies, conditional on starting frequency, and shows using theory, simulation, and real data that strong temporally variable selection occurs over short time periods. Overall the MS is well written and I think that the results are relevant to a broad audience.
I think that the MS could be restructured a bit to make it more approachable to the non-theoretician. I still struggle to understand why the excess variance is frequency dependent. Can the author put together a conceptual figure that will help motivate an intuitive understanding of the results? I think that this would be crucial for this paper to be accessible to the PLOS Genetics readership.
I also think that the paper could be re-ordered. First theory, then simulations, then data. Having the simulations at the end feels like an afterthought. However, for people who might struggle with the maths, the simulation results are very striking and might help the reader trust the importance of the results better.
Minor comments
Line 143: "that is not important to our results". Do you mean "that is important to our results"?
If you mean what you wrote, why should I care to read something that is not important?
Line 185-186: "Similar results are found in a wild D. melanogaster population (not shown in Fig 1)." It is a little funny to read this sentence and then not see the results. Why didn't you include this dataset in Fig 1?
Fig 4. What do the 1e-2, 1e-4, and stars refer to?
Have all data underlying the figures and results presented in the manuscript been provided?
Reviewer #1: None
Reviewer #2: Yes
Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No
21 Sep 2021
Submitted filename: response_to_reviews.pdf
22 Sep 2021
Dear Dr Bertram,
We are pleased to inform you that your manuscript entitled "Allele frequency divergence reveals ubiquitous influence of positive selection in Drosophila" has been editorially accepted for publication in PLOS Genetics. Congratulations!
Yours sincerely,
Alex Buerkle, Associate Editor, PLOS Genetics
Bret Payseur, Section Editor: Evolution, PLOS Genetics
Comments from the reviewers (if applicable):
I appreciate the author's clear and direct responses to the suggestions for the improvement of the manuscript. I have reviewed these and believe they fully address all of the points that arose in the reviews.
27 Sep 2021
PGENETICS-D-21-00936R1
Allele frequency divergence reveals ubiquitous influence of positive selection in Drosophila
Dear Dr Bertram,
We are pleased to inform you that your manuscript entitled "Allele frequency divergence reveals ubiquitous influence of positive selection in Drosophila" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.
With kind regards,
Anita Estes, PLOS Genetics
On behalf of: The PLOS Genetics Team