Literature DB >> 30968124

Robust Estimation of Recent Effective Population Size from Number of Independent Origins in Soft Sweeps.

Abstract

Estimating recent effective population size is of great importance in characterizing and predicting the evolution of natural populations. Methods based on nucleotide diversity may underestimate current day effective population sizes due to historical bottlenecks, whereas methods that reconstruct demographic history typically only detect long-term variations. However, soft selective sweeps, which leave a fingerprint of mutational history by recurrent mutations on independent haplotype backgrounds, holds promise of an estimate more representative of recent population history. Here, we present a simple and robust method of estimation based only on knowledge of the number of independent recurrent origins and the current frequency of the beneficial allele in a population sample, independent of the strength of selection and age of the mutation. Using a forward-time theoretical framework, we show the mean number of origins is a function of θ=2Nμ and current allele frequency, through a simple equation, and the distribution is approximately Poisson. This estimate is robust to whether mutants preexisted before selection arose and is equally accurate for diploid populations with incomplete dominance. For fast (e.g., seasonal) demographic changes compared with time scale for fixation of the mutant allele, and for moderate peak-to-trough ratios, we show our constant population size estimate can be used to bound the maximum and minimum population size. Applied to the Vgsc gene of Anopheles gambiae, we estimate an effective population size of roughly 6×107, and including seasonal demographic oscillations, a minimum effective population size >3×107, and a maximum <6×109, suggesting a mean ∼109.

Entities: Disease Gene Species

Keywords: zzm321990 Anopheleszzm321990 ; demographic oscillations; effective population size; recurrent mutation; soft sweeps

Mesh：

Year: 2019 PMID： 30968124 PMCID： PMC6736332 DOI： 10.1093/molbev/msz081

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

Studying the differences in sequences between individuals in a population has the potential to give new insight into evolutionary processes: the evolutionary forces of selection, mutation, migration, and drift can leave a signature in the pattern and frequency of polymorphisms in time and space, which population genetic theory can be used to infer (Bollback et al. 2008; Gutenkunst et al. 2009; Liu and Fu 2015; Zanini et al. 2015; Khatri 2016; Petkova et al. 2016; Feder et al. 2017). A key parameter to estimate for any evolving population is the effective population size (Fisher 1930; Wright 1931), as it determines the underlying nature of the evolutionary dynamics and the relative importance of genetic drift versus selection for evolving traits. In particular, having an accurate estimate of recent effective population size has impact on our ability to predict the outcomes of evolution, as the current population size controls the mutational input through the parameter and the fate of rare variants in a population via the population scaled strength of selection 2Ns (Kimura 1962). However, there is not a single well-defined measure of effective population size and different estimates will depend on the particular evolutionary pressures on the trait or genomic region under consideration, as well as on previous population histories (Charlesworth 2009). A common method to estimate effective population size is from the nucleotide diversity π of neutral regions of a genome, where for , we expect (Charlesworth 2009). This relation represents a balance between mutations introducing variation at rate μ and drift removing variation at rate . However, nucleotide diversity will tend to be dominated by population bottlenecks, and so be insensitive to recent population expansions (Karasov et al. 2010), and there is a need for methods to estimate effective population sizes which are more representative of current day census size. Methods based on linkage disequilibrium tend to be limited to small population sizes (Waples and Do 2010). On the other hand, although there are a number of methods that attempt to directly infer demographic history (Pybus et al. 2000; Gutenkunst et al. 2009; Browning and Browning, 2015; Liu and Fu 2015), these methods are either complex and computationally intensive or only able to detect long-term changes in population size. There are currently no methods that simply and robustly allow estimation of very recent effective population sizes. A recent popular paradigm to study variation in populations is “soft sweeps,” where for sufficiently large population sizes (), multiple copies of the same mutation, distinguished by their haplotype background, coexist in the population. This provides a direct genetic fingerprint on the rate at which mutations enter a population θ, which without their distinguishing haplotype backgrounds would be hidden. Precise information about θ is effectively hidden when mutations arise infrequently per generation (), since in this weak successive mutations regime, a single dominant haplotype fixes in a population before other haplotypes have a chance to establish; these are termed “hard sweeps,” as each subsequent sweep erases any previous information, giving a weak bound that . In a series of seminal articles by Pennings and Hermisson (Hermisson and Pennings 2005; Pennings and Hermisson 2006a, 2006b), much of the basic theory of soft sweeps was developed within a coalescence framework. In particular, the mean number and the distribution of independent origins in a neutral population sample were found to be given by Ewens’ sampling framework (Ewens 2010). Recently, using this approach, Anderson et al. (2017) estimated the effective population size of the malaria parasite; such estimates of N from soft sweeps should be representative of the effective size over the time period of the sweep (Karasov et al. 2010) and more representative of current day census size. However, estimating the maximum likelihood effective population size requires using Ewens’ formula (Ewens 2010) for the probability of observing a certain number of distinct alleles in a sample of only neutral alleles; although exact it is not very practical for large sample sizes, as it requires evaluating the Stirling number of the first kind, a combinatorial factor that has not been implemented in most programming languages. In addition, when the mutant allele has not yet gone to fixation, we need to account for the fact that samples will contain both wild type and mutant alleles; this requires the extra complication of having to convolve Ewens’ formula with a binomial distribution for the probability of observing a given number of mutants in a sample given the frequency of the mutant. Finally, Pennings et al. (2014) estimated an effective population size of HIV from soft sweeps of larger than estimates from nucleotide diversity; however, they used a specific theory for the case of a single amino acid change given by two different nucleotide mutations, so that the two codons give a maximum of two detectable independent origins. In this article, we present a simple semideterministic forward-time approach, based on a nonhomogeneous Poisson establishment rate of independent mutants, which thereafter grow deterministically (Messer and Neher 2012). We show that this gives very accurate estimates of the number of independent origins as a function of the time since selection sets in. In the haploid case, we show explicitly the likelihood function is independent of the selection coefficient and only dependent on the frequency of the mutant allele and so does not require estimation of the selection coefficient or the age of the allele. This approach has the advantage of being simple to implement, as the likelihood function is a nonhomogeneous Poisson process, and is particularly appealing as the results can be understood in intuitive terms in a forward-time framework. Further, we show the method is robust to whether or not the mutation was preexisting in the population and is equally accurate for diploid populations with incomplete dominance (). Finally, we apply our method to recent data from the Vgsc locus from the Anopheles gambiae 1000 genomes (1000Ag) project () to find an estimate of effective population size almost 2 orders of magnitude greater than is estimated by analyzing nucleotide diversity. Moreover, to account for the marked seasonal population dynamics of this species, we show that it is possible to calculate a bound for the maximum and minimum effective population sizes, based on an estimate of effective population size using the constant population size method.

Theory

We calculate the likelihood of the number of origins with two assumptions: 1) we assume a nonhomogeneous (time-dependent) Poisson process such that mutant alleles establish with rate , where x(t) is the frequency of all mutant alleles in the population; 2) after establishment of the kth mutant allele, its frequency increases deterministically. The mean number of origins at time T is then determined by calculating the average number of establishment events in a time window 0 to t, where t is the latest possible time of establishment, such that it can grow deterministically to a critical frequency to be sampled from the population at some time T.

Deterministic Growth

We assume that the overall mutant population grows according to the following differential equation: where the first term is the change in frequency due to frequency independent selection (assuming ) and second is the change in frequency due to mutations arising from the wild-type population at mutation rate μ. This has the following closed-form solution: which in its form is where where and x0 is the initial frequency of the total mutant population. As in this deterministic framework, the mutant allele only asymptotically reaches fixation as , we identify as the characteristic or typical time to fixation, which is the inflexion point of the function and roughly the point at which the mutant has reached a frequency of for ; the actual time to fixation with discrete populations and drift will always be of the same order of magnitude as . Here, we assume that the initial frequency of the mutant is zero and so using the identity (), We see that the typical time to fixation has a logarithmic dependence on the mutation rate and can increase without bound for small mutation rates since we must wait for mutations to arise before selection can act to increase its frequency. Note that our approach here is in contrast to (Karasov et al. 2010; Messer and Neher 2012; Wilson et al. 2014) who typically assume an expression for the mutant frequency which ignores initial conditions and de novo mutation, which as we see can cause a large effect on the time to fixation; in our case, this is important as we require the mutant to have zero initial frequency, when the selection pressure arises.

Stochastic Establishment and Likelihood of Number of Origins

We assume mutant alleles arise by de novo mutation at a time-varying (nonhomogeneous) rate proportional to the number of wild-type individuals . De novo mutants must reach a critical frequency at which point more mutant individuals are added by selection compared with the change in number due to drift (Desai and Fisher 2007). The probability that a de novo mutant, starting at frequency , grows by drift to size , is just the inverse of the size of this neutral subpopulation, . The rate of establishment of mutants is then We make the assumption that establishments occur randomly and independently and so the underlying probability distribution for the number of establishments up to time , the time of establishment of last mutant to possibly be sampled at a latter time T, is given by a nonhomogeneous Poisson process: where is the number of independent origins at time T, and where the mean is given by the integral of the rate α up to time : The time of the last establishment is straightforward to calculate as shown next.

Calculating tK

The time for the last possible establishment, t of the Kth mutant, in order to be sampled with high probability at time T, is calculated by using a deterministic approximation for the change in frequency of the Kth mutant. In an experiment, and in simulation, individuals of a population are sampled with a sample size n; in simulation this is done using multinomial sampling with the allele frequencies determined from simulation. Here, for simplicity, we assume that when a mutant allele frequency is above then the mutant will be found in a sample of size n. With a deterministic time-course of the Kth mutant, there is a one-to-one correspondence between its frequency at time T, and the time of establishment t, given that its frequency must be . To calculate , we use the fact that in the deterministic limit the ratio of the frequency of any mutant allele is fixed with respect to the overall mutant population, that is, ; this is true whenever the growth function of each mutant is of the same form , which can be proved by showing . In this case, once a mutant arises in the population, we assume no more mutations can create the mutant from wild type and that there are no back mutations, so the growth of each mutant follows: while the growth of the total number of mutants is given by equation (1); however, once the overall mutant population has established the effect of mutations will be weak compared with selection, as long as , and so to a good approximation, the total mutant population also follows the same form as equation (9). It is then simple to show that the frequency of the Kth mutant is just a scaling of the frequency of the total mutant population x(t): where we have used the fact that at the establishment time t we know that the frequency of the mutant must be , and that . We then solve , for t to give where we have again used the identity () to arrive at this expression.

Simple Expression for Mean Number of Origins

The mean number of origins is calculated by inserting equation (11) into equation (8) and then after some algebra we find: which we see only has dependence on the selection coefficient s through the frequency of the total mutant frequency x(T) at time T. This is consistent with the results in Pennings and Hermisson (2006a), where in the coalescence framework they find the probability of a soft sweep in a sample size of 2, at fixation, is independent of the frequency sample path of the mutant allele and weakly bounded by selection through the fixation time. This result suggests that larger sample sizes n increase the number of independent origins we should expect to observe. In practice, we can replace x(T) by the frequency of the mutant in the sample with little error, since it has a weak logarithmic dependence in equation (12), so that is the number of mutants in the sample. Making the standard replacement , we arrive at a pleasingly simple expression for the mean number of origins in the sample: which is only a function of θ and the number of mutants n. As shown in the Supplementary Material online, the theory can be extended to the diploid case, where we find an expression for the mean number of origins as a function of the dominance coefficient h (assuming incomplete dominance ) and the selection coefficient s, as well as N and μ. In this case it is not clear whether the mean number of origins, and hence the Poisson distribution, is independent of the selection parameters s and h, as the resulting expression is complex. However, as we will see, the haploid expression, with in equation (13), is as accurate in the estimation of the effective population size as using the diploid expression, which suggests the dependence on s and h are weak. In addition, as shown by Pennings and Hermisson (2006a), the probability of a soft sweep has a weak dependence in diploid populations, which would also suggest a weak dependence on s for the number of origins.

Simulations

Methods

We simulate the population genetics of multiple recurrent mutations at a single locus using an infinite-alleles Wright–Fisher process. Simulations start assuming a fixed wild type, so that the mutant frequency ; each subsequent mutation that arises is given its own “allelic” identity to represent it arising on a different haplotype background, and once it enters the population the same allele cannot be produced by mutation from the wild type or any other allele. As is commonly assumed for an infinite-alleles process, we assume in addition there are no back mutations to the wild type. Each mutant allele has the same selective advantage s relative to the wild type. For population sizes up to , we use multinomial sampling of alleles with fixed population size N to calculate the stochastic change in frequency between generations due to selection and drift. This is replaced by the equivalent multivariate Gaussian distribution with covariance matrix for population sizes larger than 106. Correspondence between the two methods was checked for simulations at smaller population sizes (not shown). In both cases, mutations are treated separately and introduced with a nonhomogeneous Poisson process, where the mean number of new mutant alleles in generation t + 1 is given by , where x(t) is the frequency of all mutants in generation t; each of these new mutant alleles arise in the population with frequency (or in the diploid case). At various time points T, we sample the vector of frequencies of all independent mutants , using multinomial sampling with K + 1 categories (including the wild type, which has frequency ), and sample size n. This produces a sample vector , where is the number of the kth mutant sampled. The number of origins is then the number of different mutants that are nonzero in the sample.

Results

In figure 1 is plotted the time series of the frequency of each recurrent mutation from the Wright–Fisher simulations for and s = 0.05 and two different mutation rates, corresponding to (A) and (B). We see that at the larger mutation rate there are correspondingly many more mutants in the population, and that the rate of production of mutants is proportional to the frequency of the wild type, signified by the lack of new mutants once the total mutant population has fixed. The red curve is a plot of equation (3), the deterministic solution for the total mutant population over time, and we see that it matches well the time-course found in the simulations, particularly for , where stochastic effects of the de novo generation of mutants becomes negligible. The frequency of each of the recurrent mutants follows the same scaling as the total frequency of all mutants, as assumed in the Theory section, and once the mutant population fixes, each of the recurrent mutants plateaus and stops changing in frequency (up to small relative fluctuations), which is as predicted by equation (9). In other words, in the deterministic limit there is a “crowding-out” effect, characteristic of logistic growth, where the growth of a mutant is limited by all other mutants in the population.

. 1.

Time series of the frequency of each independent origin of the same recurrent mutant (range of different colors). (A) , and s = 0.05, (B) same as (A), but with . Solid black line is the sum of all mutant frequencies (), dashed black line the frequency of the wild type (), and the solid red line is the deterministic time-course given by equation (3). In each plot, the highlighted mutant in the thick magenta solid line shows an example of a mutant establishing at the frequency , at time t, and then reaching the critical sampling frequency at a time T. If T is the time of sampling, then this would be the last possible mutant that could contribute to a sample, and the time between 0 and t would be the window over which mutants can be generated that could contribute to a sample at time T. The distribution of the number of origins at time T is just the distribution of the number of establishments in this time window; this is the basis of the semideterministic theoretical calculation of the number of origins described above. Figure 2 shows the results for the mean number of origins calculated from simulation (squares), compared the semideterministic theory presented in this article (thick lines) and Pennings and Hermisson’s calculation (thin lines) based on Ewens’ sampling theory (Pennings and Hermisson 2006a; Ewens 2010). We see in general that the time-course of reflects the time-course of the frequency of the total mutant population, with a sigmoidal variation, where for the largest selection coefficients we see a plateau reached in <500 generations. Both the semideterministic theory and Ewens’ theory predict that the plateau of is independent of the selection coefficient, since is roughly given by time window over which mutants can be generated, which approximately scales as , multiplied by the rate of establishment of mutants, which scales like , cancelling the s dependence. This is seen more clearly in figure 3 which is the number of origins plotted for over a longer timescale for various values of and s; we see that the semideterministic theory and the simulations show the plateau is indeed independent of the selection coefficient and only dependent on . We see that the simulations agree with this prediction for the larger population sizes, but for , the number of origins decreases for long times; this is due to drift removing very low frequency variants at the smaller population size, whereas at the larger population sizes drift acts more slowly, such that the change is insignificant on the timescale of the simulation. Finally, we see that the time-course of the mean number of origins before the plateau is different for each population size, where for the smaller selection coefficients the mean number of origins arises more slowly for larger population sizes. This is related to the deterministic time-course of the mutant frequency which, given the initial condition that the mutant frequency is zero, has a strong dependence on the mutation rate as shown by equation (5). The simulations are performed for fixed , and so a larger population size means a smaller mutation rate and so increases more slowly.

. 2.

. 3.

Average number of origins for population size of on linear-log scale, for and showing that plateau number of origins is independent of s. The filled symbols show the simulation results and standard error bars for the parameter combinations shown in the legend. The solid thick lines are the predictions for the same parameter combination of the semideterministic theory described in this article (Methods).

Average number of origins for population sizes of , and . The filled symbols show the simulation results and standard error bars for the parameter combinations shown in the legend; for and , the simulations used multinomial sampling of the Wright–Fisher drift process with 50 and 10 replicates, respectively, for each parameter combination, whereas for , the multinomial sampling is replaced by the multivariate Gaussian distribution approximation of the drift process (see the Methods section above), where 100 replicates are used in this plot. The solid thick lines are the predictions for the same parameter combination of the semideterministic theory described in this article (Methods), whereas the thin lines represent the prediction of Pennings and Hermisson (2006a), based on Ewens’ sampling theory (Ewens 2010). Average number of origins for population size of on linear-log scale, for and showing that plateau number of origins is independent of s. The filled symbols show the simulation results and standard error bars for the parameter combinations shown in the legend. The solid thick lines are the predictions for the same parameter combination of the semideterministic theory described in this article (Methods). We also examine the distribution of the number of origins in figure 4 from Wright–Fisher simulations (1,000 replicates) at a population size , selection coefficient s = 0.05, and mutation rates . The theory presented in this article describes the distribution very well for all times up to and including fixation. On the other hand, Ewens’ sampling framework predicts in a sample of n neutral alleles that the distribution of the number of distinct mutant alleles η is where is the unsigned Stirling number of the first kind, which is a combinatorial factor which arises in the expansion of the rising factorial . However, if the mutant allele has not fixed then the probability distribution of η mutants alleles is the convolution of equation (14) with a binomial distribution that in a sample of size n we see n mutant alleles given a frequency x(t) of the mutant population. This convolution has no known closed-form solution and for large sample sizes is computationally intensive. In figure 4, the dotted lines are a plot of Ewens’ theory equation (14) without this convolution and n replaced in equation (14) by (calculated in Mathematica [Wolfram Research, Inc. 2018]) and as expected it does poorly when the mutant has not yet fixed and is quite accurate at later times when the mutant is near or at fixation. When the mutant allele is at fixation, the semideterministic likelihood of this article and that from Ewens’ formula are closely matched (fig. 7).

. 4.

Distribution of the number of origins for simulations with various mutation rates for and s = 0.05 (open circles) compared with theory in this article equation (12) and (7) (solid lines) and Ewens’ sampling formula (dotted lines), both with n = 1,000. For the mutation rates , the corresponding typical fixation time (eq. 5) is generations.

. 7.

Likelihood (normalized) of the number of origins as function of effective population size given an observed number η = 10 and samples size n = 1,530 chromosomes, corresponding to that found for the Ag1000 project () for the Vgsc resistance locus. As shown in the legend, the semideterministic theory in this article, assuming a current day frequency of x = 0.78 (as observed) is compared with assuming x = 1 and the Ewens’ sampling theory equation (14), which only has applicability for x = 1. The 95% confidence intervals (gray dotted lines) and maximum likelihood effective population size (red dotted line) are shown for the semideterministic likelihood function with x = 0.78.

Parameter Estimation

Haploid

As described above, the semideterministic theory calculates the likelihood function for the number of observed independent origins, and we find it is only a function of , the frequency of the mutant population at the time of sampling x(T) and the sample size . Typically, the mutation rate will have been independently determined, and so we can determine a maximum likelihood estimate of N given knowledge of the n and x(T), which can be estimated from the sample. In figure 5 is the -error of this estimation process using 100 replicate Wright–Fisher simulations, with sample size n = 1,000, where the true N is known. We see that for mutant frequencies x > 0.1, the error of our estimate is always less than a factor of , which means the effective population size is accurately determined to much less than an order of magnitude. Moreover, the accuracy increases for increasing , where it is < for .

. 5.

-error in estimating the true effective population size, for (A) haploid populations with , (B) diploid populations with , for various selection coefficients, mutation rates, and dominance coefficients (diploid only) from Wright–Fisher simulations (100 replicates for each parameter combination). (A) We use equations (12) and (7) to determine the maximum likelihood estimate. (B) For the diploid population, we use the same Poisson likelihood function, but with mean given by equations (13) and (14) in the Supplementary Information, where we assume perfect knowledge of T (squares) and also compare to the case where we have a systematic error in our knowledge of T, where the true time is instead T (circles), and we see the estimates are unchanged. In addition, for the diploid population we use the haploid likelihood function (eqs. 13 and 7) with to estimate N (plus signs) and find again excellent agreement.

Diploid

We can also accurately estimate the effective population size from diploid simulations. As described in the Supplementary Material online, we extend the semideterministic theory to the diploid case with incomplete dominance () by using the exact implicit solution t(x) for how the frequency x of the mutant allele changes over time to calculate time of establishment of the last mutant to be sampled at some later time T. This is then used to calculate the likelihood function , where we assume a known mutation rate. We are still left with having to jointly estimate N, s, and h in the diploid case. However, we expect that the dependence on h and s will be weak (Pennings and Hermisson 2006a), although it is not straightforward to show this explicitly, as in the haploid case, where there is no dependence on s, even before fixation. To show this, we use the implicit relation (eq. 2, Supplementary Material online) to numerically estimate that gives t(x) = T, where we assume perfect knowledge of the dominance coefficient h. We see in figure 5 that the estimate of the effective population size from diploid simulations has a similar accuracy as the haploid simulations and is robust to knowledge of the exact time selection sets in T; the error is taken up in the estimate of s (not shown). We also use the haploid semideterministic theory to estimate the effective population size, using in equation (13) to account for double the number of chromosomes, shown by plus signs in figure 5 again we see that the estimate of N is identical using the haploid method for a given set of parameters, s, h, and μ. Both the robustness of estimates to the exact knowledge of T and that the haploid theory gives identical estimates indicates that the direct dependence on s and h is very weak or nonexistent, at least for weak absolute selection (Pennings and Hermisson 2006a).

Haploid with Preexisting Mutations

Finally, we examine the effect that preexisting mutations have on our estimate of the effective population size. We run simulations such that for times the mutant allele has a negative selection coefficient , where generations and , s = 0.05 and . The mean number of origins is plotted in figure 6, for the various values of s as well as for the case of no preexisting mutations (black hexagram symbols); we see that as the mutant allele becomes increasingly neutral before positive selection sets in, the number of origins is larger, except for long times where the plateau of is approximately independent of s. This suggests the overall effect of preexisting mutations is to cause a time advance on the number of origins. This again would suggest that the estimate of effective population size should be robust to preexisting mutations, which we see to be the case in figure 6, where the error in estimating N using equation (13) for the mean of the Poisson likelihood function is roughly independent of s and very similar to assuming no preexisting mutations (black hexagrams).

. 6.

Mean number of origins for haploid simulations with preexisting mutations (A), where the black hexagram symbols represent simulations without preexisting simulations, and (B) -error in maximum likelihood estimate of the true effective population size from Wright–Fisher simulations with various values of the deleterious selection coefficient s (100 replicates for each parameter combination).

Application to Data from Ag1000 Project

Recently published data from the Ag1000 project have extensive population level sampling of the genomes of mosquitoes across sub-Saharan Africa (). The gene for the voltage-gated sodium channel (Vgsc) is known to have at least two single nucleotide mutations in the same codon that confer resistance to insecticides, () and (), and phylogenetic analysis of this gene reveal ten haplotype clusters (fig. 4 in ]) with a current mutant frequency of determined directly from the data. If we assume either mutation is required for resistance, this gives a mutation rate of , assuming a base-pair mutation rate of , which is based on a recent accurate estimate from Drosophila (Keightley et al. 2014), as the mutation rate has not been directly measured for A. gambiae. Applying the haploid algorithm to this data, using in equation (13) (accounting for the factor of 2 between chromosomes and individuals), and n = 1,193 (given a sample size of n = 1,530 chromosomes from 765 mosquitoes), gives an estimate of , which corresponds to an effective population size , where the values in brackets are the 95% confidence intervals (2 units from max likelihood), as shown in the plot of the likelihood function in figure 7. This estimate is almost 2 orders of magnitude greater than that of from a nucleotide diversity . In the same article, the authors use the more sophisticated “stairway” plot (Liu and Fu 2015) and (Gutenkunst et al. 2009) method to estimate population history and find most recent effective population sizes of order , which is roughly six times less than our estimate. Likelihood (normalized) of the number of origins as function of effective population size given an observed number η = 10 and samples size n = 1,530 chromosomes, corresponding to that found for the Ag1000 project () for the Vgsc resistance locus. As shown in the legend, the semideterministic theory in this article, assuming a current day frequency of x = 0.78 (as observed) is compared with assuming x = 1 and the Ewens’ sampling theory equation (14), which only has applicability for x = 1. The 95% confidence intervals (gray dotted lines) and maximum likelihood effective population size (red dotted line) are shown for the semideterministic likelihood function with x = 0.78. Note that we can also apply the method to each resistance mutant separately and , which have frequencies of and , and five independent origins each, which assuming a single base-pair mutation rate of for each of these, gives the following estimates of effective population size , and , respectively, where the values in brackets, are again the 95% confidence intervals. We see the estimates based on each single nucleotide polymorphism are consistent with the estimate above based on both single nucleotide polymorphisms, but, as expected, with larger confidence intervals. However, it is known that in many sub-Saharan regions mosquitoes undergo seasonal demographic changes, where the population size changes between wet and dry seasons by up to a peak-to-trough factor of (Minakawa et al. 2002; Mabaso et al. 2007; Bomblies et al. 2009; Walker et al. 2013), where Nmax and Nmin are the maximum and minimum of the population size. To check the impact of demographic changes on our population size estimates, we ran simulations for a mutant with s = 0.05, with an oscillating population size , with a period of generations, which is ∼1 year and much shorter than the expected time to fixation of the mutant of ∼300 generations (eq. 5). The simulations were performed with various peak-to-trough ratios and with two constraints: 1) that the geometric mean and 2) that the harmonic mean . Simulations with constrained arithmetic mean were also performed but are not shown. Overall, we see in figure 8 that constraining the harmonic mean of the maximum and minimum population size for a given gives fewer origins than simulations with a constant population size, and more origins than simulations that constrain the geometric mean, the exception being for where the number is slightly larger, but roughly equal, to the constant population size case. This means we can broadly say that using the constant population size theory to estimate will give a relatively tight lower bound on the true harmonic mean (with weaker lower bounds on the geometric mean, and arithmetic mean as discussed below). Given the simple relation between the harmonic mean and maximum and minimum population sizes we can derive expressions for a lower bound on Nmin and Nmax given an estimate of and : This is true for any value of . On the other hand, from figure 8, we can see for that the number of origins due to the harmonic constraint is approximately one half the origins assuming a constant effective population size, so as equation (13) is almost linear in θ, with only a weak logarithmic nonlinearity, the true harmonic mean can be estimated as (simulations with harmonic mean constrained to confirm this—not shown). The field data (Minakawa et al. 2002; Mabaso et al. 2007; Bomblies et al. 2009; Walker et al. 2013) suggest , which means . We can then upper bound the maximum and minimum effective population sizes as and , which is only true specifically for .

. 8.

The mean number of origins from Wright–Fisher simulations (1,000 replicates) for oscillating population size with period generations, selection coefficient s = 0.05, , and with the geometric mean (green) and harmonic mean (purple) of Nmax and Nmin constrained to , for different peak-to-trough ratios. Black squares represent constant population size simulations. Altogether, this gives the following bounds on Nmax and Nmin: and , using the estimate above of . Using these bounds, we can then put a bound on the arithmetic mean , as . Note that this result needs a little care in interpretation, since as seen in figure 8 for the harmonic constraint gives slightly greater independent origins, however, it is with near equality and within the errors of these estimates. Simulations that constrain the arithmetic mean of the maximum and minimum population sizes show that the number of origins monotonically decreases with increasing , but are significantly less than even the constrained geometric mean case (not shown). This means our estimate will be less than the arithmetic mean for all , but as with the geometric mean, the equivalent to equation (15) would provide a much weaker lower bound on Nmax and Nmin.

Discussion

Estimating the recent effective population size is of paramount importance to understanding and predicting the evolutionary dynamics of natural populations. As has been previously suggested (Karasov et al. 2010), methods that estimate effective population size based on nucleotide diversity are likely to give estimates which are much smaller than the current day census size, as such metrics are dominated by historical population bottlenecks. Although methods based on linkage disequilibrium can detect recent effective population sizes, they tend to be limited to small populations (Waples and Do 2010). In addition, methods that estimate demographic histories tend to be computationally complicated and with limited range of applicability, such as only detecting long-term variations (Pybus et al. 2000; Gutenkunst et al. 2009; Liu and Fu 2015) or limited to small population sizes (Browning and Browning 2015). However, a genomic region undergoing current selection should leave a signature which represents an effective population size more representative of the census size during the sweep (Karasov et al. 2010). When the mutational input into a population is large , we expect a signature of a selective sweep will be a large diversity of haplotype backgrounds, due to multiple and recurrent independent instances of the same mutation that is under positive selection; such a sweep has been termed a soft sweep as multiple rather than a single haplotype dominate the sweep (Hermisson and Pennings 2005). Although Pennings and Hermisson seminal work (Pennings and Hermisson 2006a, 2006b) laid out much of our understanding of soft sweeps within a coalescence framework, many quantities like the likelihood of the number of origins, particularly when the mutant population has not yet fixed, are not straightforward to calculate numerically. In this article, we have presented a simple semideterministic haploid forward-time theory of the number of independent origins of a recurrent mutation. We show that the distribution of the number of origins is very closely approximated by a Poisson distribution with a mean number of origins that has an exact and simple closed-form solution for the haploid case, which is independent of the selection coefficient and the age of the allele, and only depends on , the sample size and the current day mutant frequency. We show it works robustly for diploid populations with incomplete dominance, and whether or not mutations are preexisting in the population before the selection pressure arose. Our forward-time semideterministic theory also provides an intuitive insight into the dynamics of soft sweeps, where it is clear there is a demarcation between the stochastic and deterministic stages for each haplotype contributing to a soft sweep. New origins are generated by recurrent mutation, and these must establish by growing to a frequency where deterministic selection outweighs drift; thereafter growth is approximately deterministic of each independent mutant. The deterministic part of the theory shows that at sufficiently large population sizes the growth of each recurrent mutant is just a scaling of the overall mutant population and grows logistically, where other mutants “crowd-out” the growth of a particular mutant; once the wild type is extinct new mutants cannot arise, and growth of each recurrent mutant is zero, so this structure is effectively frozen, which is confirmed by simulation up to small fluctuations due to drift. Including drift in this picture means that this frozen structure is only temporary as drift will take of order N generations to act. This is seen in the simulations at even a moderate population size of , where drift can act on the small frequency variants causing a decrease in independent origins for long times; however, for very large populations there is a stable plateau as predicted by the theory. This suggests that Ewens’ sampling theory and the calculation in this article will not be valid for small populations after fixation of the mutant, since the supply of mutants has been switched off; therefore the semideterministic approach in this article will be limited to times at or before fixation for small population sizes. The framework of this semideterministic theory also makes clear why selection should have little effect on the plateau number of origins, as the rate of establishment is proportional to the s, whereas the time window over which new origins can be generated is proportional to the lifetime of the wild type, which scales as , giving a number of origins that is independent of s. In addition, our result for the mean number of origins shows further that it is only dependent on the selection coefficient through the frequency of the mutant population, and in particular on the ratio of the number of mutants in the sample to the number of new mutants that enter every generation (). Surprisingly, as found by Pennings and Hermisson (2006a), the number of origins does not depend on the exact sample path (history of the population frequency) of the mutant; here we see further that the number of origins only depends on the frequency of the mutant at a given time. Finally, we estimated the effective population size of A. gambiae and Anopheles coluzzii to be using data from the 1000Ag project (), which is roughly 2 orders of magnitude larger than estimated using the same underlying data from nucleotide diversity and much closer to what is likely to be the census population size in recent history. This supports simple calculations of Karasov et al. (2010), which suggested values of effective population size derived from nucleotide diversity are too small to explain adaptation of resistance alleles or the occurrence of multiple resistance haplotypes for the Ace gene in Drosophila melanogaster. Here, we have provided a very simple and robust method to quantify this effect. The demographic history of Anopheles has also been estimated from the 1000Ag project data () using the “stairway” plot (Liu and Fu 2015) and (Gutenkunst et al. 2009) methods, giving a recent population size of roughly , greater than the nucleotide diversity estimate, but smaller than our estimate. A possible reason for this discrepancy is that these methods tend to detect long-term demographic changes, so that the difference could represent recent population growth in the past 100 years. However, there are reasons to be uncertain about these estimates; the estimates in are based on applying each of these methods to data from each geographic region, whereas the estimate here is based on data from all geographic regions in the Ag1000 data. In the completely panmictic case, the estimate in each region should agree with the estimate based on pooling the data, but as discussed below if there is spatial structure then the relation between the two estimates would not be straightforward. There is also good reason to suggest there may have been a reduction in effective population size due to action of insecticides (Athrey et al. 2012; O’Loughlin et al. 2016). More generally, a recent population expansion would further lead to our method underestimating the current day population size. In an expanding population, there would be a maximum in the number of wild-type individuals that produce independent origins at around the time , since at very early times the overall population size is small and at longer times than the wild type is near extinction. Therefore, if selection is particularly strong and occurs far in the past compared with the current age of the allele, this would be a very large underestimate, as the number of observed origins would be controlled by a time when the census was very small. For the Vgsc gene of A. gambiae given that the current day mutant frequency , and generations (assuming that insecticides were introduced about 80 years ago), we can use equation (3) to numerically find the best fit selection coefficient as with the constraint that . Then, using equation (5), we calculate generations; this is the recent past, which suggests our estimate should not be too great an underestimate. On the other hand, if there has been a recent decline in population numbers then this would have an opposite effect, where our method would overestimate the effective population size due the overall number of origins being dominated for times , when the population was larger in the past. Again, with our estimate of being in the recent past suggests the error will be small. Additionally, as discussed in Pennings et al. (2014), if there are preexisting mutations then the population size estimate would be influenced by the size before insecticides were introduced. It is also known that the Anopheles populations undergo seasonal demographic fluctuations with peak-to-trough population sizes of order 10–100 (Minakawa et al. 2002; Mabaso et al. 2007; Bomblies et al. 2009; Walker et al. 2013). To investigate the effect of such fluctuations on our population size estimates, we performed simulations of oscillating population sizes over time for peak-to-trough factors <1,000. These results showed that a constant population size estimate will tend to underestimate the harmonic mean of the maximum and minimum of the population size for large peak-to-trough ratios . In addition, the simulations show for , the average number of origins is approximately one half. Together, this allowed quantification of bounds on the maximum and minimum population size giving and , assuming a peak-to-trough ratio of , as suggested by the field data (Minakawa et al. 2002; Mabaso et al. 2007; Bomblies et al. 2009; Walker et al. 2013). This then suggests a mean (arithmetic) population size bounded as . One might ask if seasonal oscillating demographics could alone explain the discrepancy between the N estimated from nucleotide diversity and our larger estimate here. Our results in figure 8 suggest that for increasing peak-to-trough ratio, we would expect to underestimate the harmonic, geometric and arithmetic mean (arithmetic mean not shown), and so given our constant population size estimate using equation (13) of , we in fact would expect the discrepancy with respect to the nucleotide diversity estimate to be even larger and this cannot in itself explain the discrepancy. However, this is comparing to the estimate of N from π assuming a constant/fixed population size. For an oscillating population, the nucleotide diversity will be controlled by the harmonic average of the effective population size over a cycle (Wright 1938; Charlesworth 2009), which can be shown to be given by the geometric mean of the maximum and minimum of the sinusoidal demographic variation (i.e., ). This means for different values of peak-to-trough factors with the same geometric mean , the nucleotide diversity should be unchanged, on the other hand the simulations in figure 8 show that we should observe fewer origins for increasing ; this is inconsistent with observations, as fewer origins corresponds to an underestimate of the geometric mean, which is the effective population size estimated by nucleotide diversity. In other words, oscillating demographics with an unchanging mean would lead to the nucleotide diversity estimate of N to be greater than the value estimated from number of origins assuming a nonoscillating and fixed population size. We observe the opposite, which suggests there is another mechanism by which nucleotide diversity has been suppressed, such as historical and sustained bottlenecks. The results of these oscillating demographic simulations are in contrast to those of Wilson et al. (2014), which showed that the probability of a soft sweep in a sample size of 2 only depends on the cycle-averaged harmonic mean, when demographic oscillations are fast. As mentioned above the cycle-averaged harmonic mean is just the geometric mean of the maximum and minimum population sizes; however, our results show different peak-to-trough ratios give significantly different numbers of independent origins for the same geometric mean. This suggests the probability of a soft sweep in a sample size of 2 is a weak measure of the diversity of haplotypes compared with the number of independent origins. Our estimation also makes the assumption that the populations are well mixed or panmictic and constant over time, which clearly requires testing regarding the Ag1000 data, which consists of the sequences of individuals collected over the wide spatial region of sub-Saharan Africa. As discussed by Ralph and Coop (2010), we would expect our results to be accurate in the limit of strong long-range or nonlocal dispersal, which mimics the panmictic approximation; on the other hand, if local migration is strong, spatial structure of the populations would tend to give a larger number of origins compared with the panmictic case, which would suggest our method would overestimate the effective population size needed to explain an observed number of origins. In other words, it is possible that spatial structure could account partially or wholly for the large number of origins observed in natural populations of A. gambiae and Anopheles coluzzii. Further theory and simulations will be needed to test this hypothesis.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

31 in total

1. An integrated framework for the inference of viral population history from reconstructed genealogies.

Authors: O G Pybus; A Rambaut; P H Harvey
Journal: Genetics Date: 2000-07 Impact factor: 4.562

2. On the probability of fixation of mutant genes in a population.

Authors: M KIMURA
Journal: Genetics Date: 1962-06 Impact factor: 4.562

3. Parallel adaptation: one or many waves of advance of an advantageous allele?

Authors: Peter Ralph; Graham Coop
Journal: Genetics Date: 2010-07-26 Impact factor: 4.562

4. Environmental predictors of the seasonality of malaria transmission in Africa: the challenge.

Authors: Musawenkosi L H Mabaso; Marlies Craig; Amanda Ross; Thomas Smith
Journal: Am J Trop Med Hyg Date: 2007-01 Impact factor: 2.345

5. Soft sweeps II--molecular population genetics of adaptation from recurrent mutation or migration.

Authors: Pleuni S Pennings; Joachim Hermisson
Journal: Mol Biol Evol Date: 2006-03-06 Impact factor: 16.240

6. Temporal and micro-spatial heterogeneity in the distribution of Anopheles vectors of malaria along the Kenyan coast.

Authors: Martin Walker; Peter Winskill; María-Gloria Basáñez; Joseph M Mwangangi; Charles Mbogo; John C Beier; Janet T Midega
Journal: Parasit Vectors Date: 2013-10-28 Impact factor: 3.876

7. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data.

Authors: Ryan N Gutenkunst; Ryan D Hernandez; Scott H Williamson; Carlos D Bustamante
Journal: PLoS Genet Date: 2009-10-23 Impact factor: 5.917

8. Loss and recovery of genetic diversity in adapting populations of HIV.

Authors: Pleuni S Pennings; Sergey Kryazhimskiy; John Wakeley
Journal: PLoS Genet Date: 2014-01-23 Impact factor: 5.917

9. Population genomics of intrapatient HIV-1 evolution.

Authors: Fabio Zanini; Johanna Brodin; Lina Thebo; Christa Lanz; Göran Bratt; Jan Albert; Richard A Neher
Journal: Elife Date: 2015-12-11 Impact factor: 8.140

10. Quantifying evolutionary dynamics from variant-frequency time series.

Authors: Bhavin S Khatri
Journal: Sci Rep Date: 2016-09-12 Impact factor: 4.379

4 in total

1. The genome of pest Rhynchophorus ferrugineus reveals gene families important at the plant-beetle interface.

Authors: Khaled Michel Hazzouri; Naganeeswaran Sudalaimuthuasari; Biduth Kundu; David Nelson; Mohammad Ali Al-Deeb; Alain Le Mansour; Johnston J Spencer; Claude Desplan; Khaled M A Amiri
Journal: Commun Biol Date: 2020-06-24

2. The clarifying role of time series data in the population genetics of HIV.

Authors: Alison F Feder; Pleuni S Pennings; Dmitri A Petrov
Journal: PLoS Genet Date: 2021-01-14 Impact factor: 5.917

3. A theory of resistance to multiplexed gene drive demonstrates the significant role of weakly deleterious natural genetic variation.

Authors: Bhavin S Khatri; Austin Burt
Journal: Proc Natl Acad Sci U S A Date: 2022-08-01 Impact factor: 12.779

4. A large effective population size for established within-host influenza virus infection.

Authors: Casper K Lumby; Lei Zhao; Judith Breuer; Christopher Jr Illingworth
Journal: Elife Date: 2020-08-10 Impact factor: 8.140

4 in total