Feng Gao1, Alon Keinan1. 1. Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853 ak735@cornell.edu fg237@cornell.edu.
Abstract
The site frequency spectrum (SFS) and other genetic summary statistics are at the heart of many population genetic studies. Previous studies have shown that human populations have undergone a recent epoch of fast growth in effective population size. These studies assumed that growth is exponential, and the ensuing models leave an excess amount of extremely rare variants. This suggests that human populations might have experienced a recent growth with speed faster than exponential. Recent studies have introduced a generalized growth model where the growth speed can be faster or slower than exponential. However, only simulation approaches were available for obtaining summary statistics under such generalized models. In this study, we provide expressions to accurately and efficiently evaluate the SFS and other summary statistics under generalized models, which we further implement in a publicly available software. Investigating the power to infer deviation of growth from being exponential, we observed that adequate sample sizes facilitate accurate inference; e.g., a sample of 3000 individuals with the amount of data expected from exome sequencing allows observing and accurately estimating growth with speed deviating by ≥10% from that of exponential. Applying our inference framework to data from the NHLBI Exome Sequencing Project, we found that a model with a generalized growth epoch fits the observed SFS significantly better than the equivalent model with exponential growth (P-value [Formula: see text]). The estimated growth speed significantly deviates from exponential (P-value [Formula: see text]), with the best-fit estimate being of growth speed 12% faster than exponential.
The site frequency spectrum (SFS) and other genetic summary statistics are at the heart of many population genetic studies. Previous studies have shown that human populations have undergone a recent epoch of fast growth in effective population size. These studies assumed that growth is exponential, and the ensuing models leave an excess amount of extremely rare variants. This suggests that human populations might have experienced a recent growth with speed faster than exponential. Recent studies have introduced a generalized growth model where the growth speed can be faster or slower than exponential. However, only simulation approaches were available for obtaining summary statistics under such generalized models. In this study, we provide expressions to accurately and efficiently evaluate the SFS and other summary statistics under generalized models, which we further implement in a publicly available software. Investigating the power to infer deviation of growth from being exponential, we observed that adequate sample sizes facilitate accurate inference; e.g., a sample of 3000 individuals with the amount of data expected from exome sequencing allows observing and accurately estimating growth with speed deviating by ≥10% from that of exponential. Applying our inference framework to data from the NHLBI Exome Sequencing Project, we found that a model with a generalized growth epoch fits the observed SFS significantly better than the equivalent model with exponential growth (P-value [Formula: see text]). The estimated growth speed significantly deviates from exponential (P-value [Formula: see text]), with the best-fit estimate being of growth speed 12% faster than exponential.
SUMMARY statistics of genetic variation play a vital role in population genetic studies, especially inference of demographic history. In particular, the site frequency spectrum (SFS) is a vital summary statistic of genetic data and is widely utilized by many demographic inference methods applied to humans and other organisms (Marth ; Gutenkunst ; Excoffier ; Bhaskar ; Liu and Fu 2015). Some other demographic inference methods are based on the sequential Markov coalescent and utilize the most recent common ancestor () and linkage disequilibrium patterns (Li and Durbin 2011; Harris and Nielsen 2013; MacLeod ; Sheehan ; Schiffels and Durbin 2014). As another example, several studies used the average pairwise difference between chromosomes (Hammer ; Gottipati ; Arbiza ) and the SFS (Keinan ) to study the relative effective population sizes between the human X chromosome and the autosomes. The wide application of such genetic summary statistics stresses the need for their fast and accurate computation under any model of demographic history, instead of their estimations via simulations or approximations (e.g., Hudson 2002; Gutenkunst ).Several recent demographic inference studies showed evidence that human populations have undergone a recent epoch of fast growth in effective population size (Gutenkunst ; Coventry ; Gravel ; Nelson ; Tennessen ; Gazave ). However, the above studies assumed that the growth is exponential. The observation of a huge amount of extremely rare, previously unknown variants in several sequencing studies with large sample sizes (Nelson ; Tennessen ; Fu ) and the recent explosive growth in census population size suggests that the human population might have experienced a recent super-expononential growth, i.e., growth with speed faster than exponential (Coventry ; Keinan and Clark 2012; Reppell , 2014). Hence, recent studies presented a new generalized growth model that extends the previous exponential growth model by allowing the growth speed to be exponential or faster/slower than exponential (Reppell , 2014). Modeling the recent growth by this richer family of models holds the promise of a better fit to human genetic data and can also be applicable to other organisms that experienced growth. However, only simulation approaches are currently available for evaluating such a generalized growth demographic model (Reppell ), which makes inference of demographic history computational intractable.In this study, we first provide a set of explicit expressions for the computation of five summary statistics under a model of any number of epochs of generalized growth or decline: (1) the time to the most recent common ancestor (); (2) the total number of segregating sites (S); (3) the SFS; (4) the average pairwise difference between chromosomes per site (π); and (5) the burden of private mutations (α), a summary statistic that has been recently introduced as sensitive to recent growth (Keinan and Clark 2012; Gao and Keinan 2014). We also introduce a new software package, Efficient computation of Generalized models’ Genetic summary Statistics (EGGS), which implements these expressions and facilitates fast and accurate generation of these summary statistics. We show that the numerically computed summary statistics match well with simulation results and facilitate computation that is orders of magnitude faster than simulations. By performing demographic inference on the SFS generated from simulated sequences, we then explore how many samples are needed for recovering parameters of a recent generalized growth epoch. Finally, we apply the software to investigate the nature of the recent growth in humans by inferring demographic models using the SFS of synonymous variants of 4300 European individuals from the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (Tennessen ; Fu ).
Materials and Methods
Generalized demographic models
A demographic model describes the changes of effective population size N against time T. We consider time, measured in generations, as starting from 0 at present and increasing backward in time. Furthermore, we consider the families of demographic models that are constituted by any number of epochs of generalized growth or decline, along the lines of Bhaskar and Song (2014). More formally, there exists a minimal positive integer L such that the demographic history of a population can be split into a model with epochs that are split by L ordered different time points (), with the kth epoch starting from and lasting through (thus the last epoch starts at time and continues into indefinite past, ). Such a history is considered as a generalized model if the population size in each epoch can be described by the following differential equation regarding time T (Reppell , 2014),where Each epoch can hence capture a variety of changing patterns in effective population size. Specifically, if , this epoch is of constant population size. When , controls the growth or decline speed of this epoch: (1) if , the epoch is of exponential growth () or decline () with rate (2) if , the epoch is of faster-than-exponential (super-exponential) growth () or decline (); (3) if , the epoch is of slower-than-exponential (sub-exponential) growth () or decline (). Linear growth or decline is also a special case of generalized models when An illustration of a generalized model with five epochs is provided in Figure 1, with more detailed explanation and illustrations in Supporting Information, File S1 and Figure S1.
Figure 1
Illustration of an example of a generalized demographic model as introduced in the first section of Materials and Methods. This model consists of five epochs (starting from the present on the right): (1) faster-than-exponential () growth (forward in time) from to between and (2) linear decline (a special case of generalized decline when ) from to between and (3) exponential growth (a special case of generalized growth when ) from to between and (4) slower-than-exponential () decline from to between and and (5) constant population size (a special case of generalized growth when ) at starting from , which lasts indefinitely backward in time (). The ending population size of the previous epoch is not necessarily the beginning population size of the next epoch (e.g., , ), corresponding to an instantaneous population size change at that time.
Illustration of an example of a generalized demographic model as introduced in the first section of Materials and Methods. This model consists of five epochs (starting from the present on the right): (1) faster-than-exponential () growth (forward in time) from to between and (2) linear decline (a special case of generalized decline when ) from to between and (3) exponential growth (a special case of generalized growth when ) from to between and (4) slower-than-exponential () decline from to between and and (5) constant population size (a special case of generalized growth when ) at starting from , which lasts indefinitely backward in time (). The ending population size of the previous epoch is not necessarily the beginning population size of the next epoch (e.g., , ), corresponding to an instantaneous population size change at that time.The solution to Equation 1 is (Reppell , 2014), where is the initial population size of the kth epoch. Each epoch k is defined by four parameters: the starting population size , the ending population size , the duration of the epoch , and the growth speed parameter . The growth rate parameter is an immediate function of these parameters, , and hence does not need to be provided as an independent variable in defining the changes in effective population size during an epoch. Note that , the starting population size of the epoch, is not necessarily the same as , the ending population size of the kth epoch. Specifically, if , there is an instantaneous change in population size at time .
Explicit expressions for summary statistics of demographic models under arbitrary population size functions
In this section, we briefly summarize the main results from previous studies that are used to evaluate the expected value of the summary statistics. Under Kingman’s standard coalescent (Kingman 1982a,b), given a demographic model , the expected time to the most recent common ancestor can be calculated by (Polanski and Kimmel 2003), where the superscript p is the number of chromosomes (i.e., twice the sample size for diploids), is the expected time to the first coalescent event when there are j chromosomes at present, and are constants (Tavare 1984; Takahata and Nei 1985; Polanski ) provided in File S1. Without loss of generality, we consider the case of diploid individuals, where there are chromosomes at any generation T, and use the notation . Then is expressed by the equationwhere .The expected full normalized SFS can be computed by the following set of equations (Polanski ),where is the length of branches in the genealogy that have i descendants () and is the total length of all branches in the coalescent tree. The quantities and are constants (Polanski ), which we provide in File S1.Naturally, the expected number of segregating sites is given bywhere is the mutation rate per site per generation and L is the length of the locus under consideration. The average pairwise difference between chromosomes per site can be calculated byThe expected burden of private mutations α at a diploid sample size of , defined as the proportion of heterozygous sites in a new diploid individual that are homozygous in the previous individuals, can be computed by (Gao and Keinan 2014), where is Kronecker delta function.The detailed description of the five summary statistics mentioned above is included in File S1.
Evaluation of the expected time to the first coalescent event under generalized models
The core of evaluating the summary statistics lies in finding feasible and numerically stable functions for calculating , the expected time to the first coalescent event when there are j chromosomes at present. Previous studies give explicit expressions of for a demographic model constructed by exponential and constant-size epochs (Polanski ; Bhaskar ). In this study, we give a comprehensive set of formulas for under generalized models introduced above. Define then , where is the total number of epochs. The quantity can be computed by the following set of equations:The expressions of function are given in File S1. The function , where is the confluent hypergeometric function of the second kind (Gradshteĭn ). The function , where is the confluent hypergeometric function of the first kind (Gradshteĭn ). The exponential growth or decline then becomes a special case of when ,where is the exponential integral (Gradshteĭn ), which has been shown by previous studies (Polanski ; Bhaskar ). We could not find feasible and numerically stable closed-form formulas for when the population size decreases forward in time in a manner that is not linear or exponential (i.e., and ). In these scenarios, we used Gauss–Legendre quadrature (Kahaner ) for efficient numerical evaluation of relevant functions (see File S1 for detailed description).If or ,If or ,If ,
Software implementation
The above expressions are implemented in a software package, EGGS. The source code and compiled programs for Linux and Mac OS platforms are publicly available from our Web site (http://keinanlab.cb.bscb.cornell.edu). Source code was written in C++, with no external libraries needed for compilation. Additional information of implementation is included in File S1 and in the manual that accompanies the software online.
Demographic models assumed in this study
The demographic models used in this study are based on the inferred European history presented by Gazave (Figure 2, in black), which contains two bottlenecks (Keinan ) and a recent exponential growth epoch. Specifically, the Gazave model inferred that the European population had a constant effective population size of 10,000 (diploid) individuals before 4720 generations ago and went through the ancient bottleneck between 4720 and 4620 generations ago with a population size of 189. The population size then recovered to 10,000 diploids until 720 generations ago, at which time the recent bottleneck started with a size of 549. At 620 generations ago, the population size recovered to 5633 individuals. The recent growth epoch started 140.8 generations ago and led to a population size of 654,000 at present. The parameters of the original recent growth epoch were varied to incorporate generalized growth effects.
Figure 2
Comparison of four summary statistics estimated by FTEC simulation and computed by EGGS. (A) Demonstration of the demographic models considered for evaluating the accuracy of our calculations as implemented in EGGS (first section of Results). This two-bottleneck model has the same population size and time throughout history as in the inferred European history in Gazave , with the exception that we varied the growth speed parameter of the recent growth epoch to be (sub-exponential, blue), (exponential as in Gazave , black), and (super-exponential, red). The y-axis shows effective population size of diploid individuals on log scale. (B–E) The comparison of the first 15 entries of the SFS (B), the total number of segregating sites (S) across all 200,000 loci (each 1000 bp long) (C), the expected pairwise difference between chromosomes per base pair (D), and the burden of private mutations (α) as the percentage of heterozygous variants in one individual that are monomorphic in the rest of the sample of 999 individuals (E) computed numerically in EGGS (dark-colored bars) and simulated by FTEC (light-colored bars) for the demographic models shown in A: blue, black, red, with a sample size of 1000 individuals (2000 chromosomes). The y-axis in B is on a log scale.
Comparison of four summary statistics estimated by FTEC simulation and computed by EGGS. (A) Demonstration of the demographic models considered for evaluating the accuracy of our calculations as implemented in EGGS (first section of Results). This two-bottleneck model has the same population size and time throughout history as in the inferred European history in Gazave , with the exception that we varied the growth speed parameter of the recent growth epoch to be (sub-exponential, blue), (exponential as in Gazave , black), and (super-exponential, red). The y-axis shows effective population size of diploid individuals on log scale. (B–E) The comparison of the first 15 entries of the SFS (B), the total number of segregating sites (S) across all 200,000 loci (each 1000 bp long) (C), the expected pairwise difference between chromosomes per base pair (D), and the burden of private mutations (α) as the percentage of heterozygous variants in one individual that are monomorphic in the rest of the sample of 999 individuals (E) computed numerically in EGGS (dark-colored bars) and simulated by FTEC (light-colored bars) for the demographic models shown in A: blue, black, red, with a sample size of 1000 individuals (2000 chromosomes). The y-axis in B is on a log scale.In addition to using the model mentioned above, we also applied an alternative model of ancient European history for inference. The model was first presented in Gravel and later used in Tennessen . This model inferred that the European population had an ancient effective population size of 7300 diploid individuals until 6167 generations ago, when the population size expanded to 14,474 individuals. The first bottleneck took place 2125 generations ago, with the population size reducing to 1861 individuals. This first bottleneck lasted until 958 generations ago, at which time a second bottleneck took place with a decreased population size of 1032. We assumed 24 years per generation (Scally and Durbin 2012) to translate the year-based time presented in the original model. For compatibility with the Gazave model, we considered that the population size had an instantaneous recovery after the second bottleneck lasted for 100 generations, instead of gradual recovery (Gazave ). Figure S8 shows the schematic representation of the Gravel model.
Demographic inference framework based on the site frequency spectrum
Demographic inference in this study was based on the observed allele frequency counts from the simulated or real data set. To determine the fitness of a model to the observed data, we calculated the composite log likelihood bywhere C is a vector of the observed folded allele frequency counts and is the computed folded SFS under demographic model . More detailed description can be found in File S1.To search for the maximum-likelihood point over the parameter space, we applied the ECM (Expectation/Conditional Maximization) method (Meng and Rubin 1993), which was previously used in the demographic inference study by Excoffier . One hundred ECM cycles were performed for each run of inference. We obtained 95% confidence intervals of parameter estimates via block bootstrapping of the data 200 times. Specifically, if the original data contained l loci, we randomly chose l loci from the original data with replacement in each bootstrap (see File S1 for details).
Processing of NHLBI Exome Sequencing Project data for demographic history inference
The NHLBI Exome Sequencing Project (ESP) data (Tennessen ; Fu ) contain deep sequencing of 4300 individuals of European ancestry. An important feature of these data is the high level of sequencing coverage, which allows the capture of very rare variants accurately. These variants constitute the part of the SFS that is most enriched for information on recent population growth (Keinan and Clark 2012; Tennessen ; Gao and Keinan 2014). To reduce the effect of selection as much as possible while keeping a sufficient amount of data, we chose to use the SFS calculated from synonymous single-nucleotide variants (SNVs) only, as previously performed by Tennessen . To further improve the quality of the data, we filtered SNVs with average read depth ≤20 or with successful genotype counts <7740 (90%) and subsampled the remaining 233,134 SNVs to 7740 alleles, which is equivalent to 3870 diploid individuals (File S1).
Data availability
The NHLBI Exome Sequencing Project (ESP) data used in this study is publicly available at http://evs.gs.washington.edu/EVS/.
Results
Comparison with simulated results by FTEC
To validate that the expressions provided in Materials and Methods can correctly compute the summary statistics under generalized growth models, we compared the summary statistics calculated by our software EGGS to those simulated by the software FTEC (a coalescent simulator for modeling faster than exponential growth by Reppell ) under the demographic models shown in Figure 2A. This model is the inferred European history in Gazave , except that we varied the growth speed parameter b (Equation 1), which corresponds to 1 in the original model (exponential growth), to also be 0.5 (corresponding to sub-exponential growth) and 1.5 (corresponding to super-exponential growth). The sample size is fixed at 1000 diploid individuals (2000 chromosomes). For FTEC simulation, we used a mutation rate of per base pair per generation (e.g., Kong ) and simulated 200,000 independent loci, each of 1000 bp.The comparison of the SFS, S (across all 200,000 loci), π, and α numerically computed by EGGS to those simulated by FTEC is shown in Figure 2, B–E. For each demographic model illustrated in Figure 2A, the values for all summary statistics from the numerical computation by EGGS are practically identical to those from the simulation results by FTEC. However, our software EGGS exhibits a huge speed improvement over FTEC. For each model considered in Figure 2A, EGGS takes <1 sec to generate the results, while it takes ∼5 hr for FTEC to simulate the sequences, due to the large number of independent loci required for accurate estimation (performed in the Ubuntu system with an Intel Xeon CPU at 2.67 GHz). For instance, when 2000 independent loci are simulated, which still takes ∼3 min, the summary statistics deviate considerably from the accurate results (Figure S2 and Table S1). Furthermore, our software works well over a wide range of values of the growth parameter b, even when (corresponding to linear growth or decline) or (Figure S3), conditions that are not handled by FTEC. We note, however, that as a simulation program FTEC provides the full sequences as output and can have a wider range of applications than facilitated by the SFS and other summary statistics that EGGS calculates.
Evaluating inference of generalized growth based on the site frequency spectrum
We next set out to test the accuracy (as a function of sample size) of inferring parameters in models with generalized growth from the SFS. Bhaskar and Song (2014) showed that in theory, an underlying generalized growth demographic model can be uniquely identified by the ideal, perfect expected SFS with a very small sample size generated from that model (34 haploid sequences for the models shown in Figure 2A). However, the SFS is estimated in practice from a limited amount of data from each individual (even in the case of whole-genome sequencing) and, as a result, the estimated SFS will fluctuate around the expected values, which limits its accuracy for inference (Terhorst and Song 2015). We aim to test such inference in practice and determine the power of generalized growth detection and the sample size needed for accurately recovering the growth speed parameter as well as other parameters of the demographic model. For it to be comparable with many practical applications, we considered sequence length that is about equivalent to that obtained from whole-exome sequencing (File S1).We performed inference on the SFS calculated from simulated sequences generated by FTEC. We simulated a demographic model with the same initial epochs as the model illustrated in Figure 2A. Starting 620 generations ago, the simulated model includes a constant population size of 10,000 until 200 generations ago, when the population starts a generalized growth epoch until the present. The generalized growth epoch starts with a population size of 10,000 that grows to an extant effective population size of 1 million individuals, with the growth speed parameter b taking each of the following values: 0.4, 0.7, 0.9, 1.0, 1.1, 1.3, and 1.6. We chose these values to represent a range of super-exponential and sub-exponential growth, with emphasis on values around the exponential rate () to test the detection power of generalized growth when the growth speed deviates slightly from exponential. We varied the sample size (number of diploid individuals sampled at present) to be 1000, 2000, 3000, 5000, and 10,000 (File S1). The first 15 entries of the site frequency spectra for these simulated scenarios are shown in Figure S4. From each set of simulations, we then inferred four parameters of the recent growth epoch, which can uniquely determine the epoch: (1) the growth speed parameter b; (2) the initial population size before growth, (3) the ending population size after growth, and (4) the onset time of growth T, which is equivalent to the growth duration since the simulated epoch ends at present.As sample size increases, the accuracy of the point estimates generally improves and the confidence interval narrows (Figure 3). Specifically, when the SFS of only 1000 diploids is used for inference, the inference performs poorly for all parameters, exhibiting large confidence intervals (Figure 3). However, the confidence interval always includes the true simulated value. A sample size of 2000 already exhibits acceptable performance except when the growth speed becomes large ( and 1.6). Larger sample sizes of 5000 and 10,000 are sufficient for inferring all parameters with very tight confidence intervals. For such sample sizes, the inference even significantly distinguishes between growth speeds ( and ) that are close to exponential () from that of an exponential, thereby concluding that a sub-exponential (0.9) or super-exponential (1.1) growth has taken place. These observations suggest that a sample size of at least 3000 diploid individuals might be needed for inferring the parameters associated with the simulated recent generalized growth epoch, which is motivated by previous models of European demographic history. It remains to be explored how accurate the estimates are, and how their accuracy improves with sample size, across a more diverse set of models.
Figure 3
Inference results on simulated data with a recent generalized growth epoch. The model parameters are as follows: Growth starts 200 generations before the present from an effective population size of 10,000 and ends with an effective population size of 1 million at present. The growth speed parameter b takes the following values in different simulations: 0.4, 0.7, 0.9, 1.0, 1.1, 1.3, and 1.6. Inference of these four parameters is based on the SFS estimated from a sample of individuals of one of five sizes (black, 1000; red, 2000; blue, 3000; brown, 5000; and green, 10,000). The point estimates with 95% confidence interval for these parameters are grouped by the growth speed parameter b (x-axis). The thick, dashed lines show the true values of the simulated model. The results are shown in the following order: (A) the inferred growth speed parameter, (B) the inferred population size before growth, (C) the inferred population size after growth, and (D) the inferred growth start time. The y-axis in C is on a log scale.
Inference results on simulated data with a recent generalized growth epoch. The model parameters are as follows: Growth starts 200 generations before the present from an effective population size of 10,000 and ends with an effective population size of 1 million at present. The growth speed parameter b takes the following values in different simulations: 0.4, 0.7, 0.9, 1.0, 1.1, 1.3, and 1.6. Inference of these four parameters is based on the SFS estimated from a sample of individuals of one of five sizes (black, 1000; red, 2000; blue, 3000; brown, 5000; and green, 10,000). The point estimates with 95% confidence interval for these parameters are grouped by the growth speed parameter b (x-axis). The thick, dashed lines show the true values of the simulated model. The results are shown in the following order: (A) the inferred growth speed parameter, (B) the inferred population size before growth, (C) the inferred population size after growth, and (D) the inferred growth start time. The y-axis in C is on a log scale.
European demographic history inference
We next performed demographic inference on NHLBI ESP data (Tennessen ; Fu ). We applied our inference framework to these data while considering and comparing two models. Both models assume the ancient epochs before 620 generations ago to be the same as those in the Gazave model illustrated in Figure 2A. We inferred the parameters only for the most recent epoch, which is of generalized growth in one model while limited to exponential growth in the other. The parameters for inference are as follows: for both models, (1) population size before growth (); (2) population size after growth (); and (3) growth onset time (T), which is equivalent to the duration of growth; and only for the generalized growth model (4) the growth speed parameter (b), which is fixed at for the exponential growth model. The point estimates and 95% confidence intervals are shown in Table 1 and the best-fit demographic models are illustrated in Figure 4, A and B (see also Figure S5, Figure S6 and Figure S7).
Table 1
Demographic inference results using ESP data for a model with a recent epoch of exponential growth and a model with a recent epoch of generalized growth
Ancient history
Growth model
Nf(104)
Ni(106)
T
b
Gazave model
Exponential
1.31 (1.26–1.36)
1.04 (1.00–1.07)
198 (195–202)
NA
Generalized
1.24 (1.18–1.30)
1.26 (1.16–1.37)
213 (206–220)
1.12 (1.07–1.15)
Gravel model
Exponential
0.89 (0.86–0.93)
0.85 (0.82–0.88)
186 (182–190)
NA
Generalized
0.78 (0.74–0.83)
1.33 (1.22–1.46)
218 (211–228)
1.22 (1.18–1.26)
Shown are point estimates and 95% confident intervals (in parentheses) for the following parameters of the inferred recent growth epoch when the ancient history was assumed to be the same as that in the Gazave model and the Gravel model: population size before growth (); population size after growth (); time growth started in generations (T); and the growth speed parameter (b), which is fixed at in the exponential growth case.
Figure 4
Demographic inference results based on ESP data. (A) Illustration of the effective population size (y-axis, on a log scale) over time for the best-fit models inferred based on ESP data, assuming the ancient history is the same as that in Gazave . Two models are shown: one restricted to recent growth being exponential (black) and one with a generalized recent growth epoch (red). Before 620 generations ago, the model was not inferred and all parameters were set to be the same as those shown in Figure 2A. Solid lines show the effective population size over time for each of the inferred models, with dashed lines indicating estimated parameter values on the x-axis or the y-axis. Only the most recent 1000 generations are shown to emphasize the difference between the two models. (B) A zoom-in to the most recent 240 generations of the inferred models in A to emphasize the acceleration pattern of the generalized growth model, with the y-axis on a linear scale. (C-D) Similar to A-B, except that the best-fit models presented are based on the assumption that the ancient history before 858 generations ago is fixed to that in Gravel (see Figure S8).
Shown are point estimates and 95% confident intervals (in parentheses) for the following parameters of the inferred recent growth epoch when the ancient history was assumed to be the same as that in the Gazave model and the Gravel model: population size before growth (); population size after growth (); time growth started in generations (T); and the growth speed parameter (b), which is fixed at in the exponential growth case.Demographic inference results based on ESP data. (A) Illustration of the effective population size (y-axis, on a log scale) over time for the best-fit models inferred based on ESP data, assuming the ancient history is the same as that in Gazave . Two models are shown: one restricted to recent growth being exponential (black) and one with a generalized recent growth epoch (red). Before 620 generations ago, the model was not inferred and all parameters were set to be the same as those shown in Figure 2A. Solid lines show the effective population size over time for each of the inferred models, with dashed lines indicating estimated parameter values on the x-axis or the y-axis. Only the most recent 1000 generations are shown to emphasize the difference between the two models. (B) A zoom-in to the most recent 240 generations of the inferred models in A to emphasize the acceleration pattern of the generalized growth model, with the y-axis on a linear scale. (C-D) Similar to A-B, except that the best-fit models presented are based on the assumption that the ancient history before 858 generations ago is fixed to that in Gravel (see Figure S8).Although the Gazave model assumed a different ancient history before the recent growth epoch from that assumed in Tennessen , using ESP data and assuming exponential growth, the inferred growth epoch is generally consistent with that obtained in the latter study (Figure 4, A and B, and Table 1). Our study infers that recent growth started 198 (95% C.I.: 195–202) generations ago with an effective population size of ∼13,100 (12,600–13,600) and continued at a rate of 2.2% (2.15–2.26%) per generation (Table 1), while Tennessen estimated that recent growth had an initial population size of ∼9500 individuals, a duration of 204 generations, and a growth rate of 2.0% per generation.The inferred generalized growth model fits the data significantly better than that with exponential growth (P-value by likelihood-ratio test with 1 d.f.). It estimates that growth started 213 (206–220) generations ago from an effective population size of 12,400 (11,800–13,000), both values consistent with those estimated in the exponential growth model. The extant effective population size following growth is estimated to be 1.26 (1.16–1.37) million. The inferred growth speed parameter (1.07–1.15) is significantly larger than the exponential speed of (P-value , using a one-tailed z-test), which is the main difference between the two models. implies a growth rate acceleration pattern (File S1) that is super-exponential at 12% faster than exponential through the epoch (Figure 4): the super-exponential growth is relatively slow around the onset time, and it keeps accelerating as time approaches the present.To test the sensitivity of the model to the assumption of ancient European history, we considered an alternate model of ancient history. We fixed the history before 858 generation ago to be that inferred by Gravel for Europeans (Materials and Methods). We repeated inference of the same parameters, using the same ESP data. As above, the inferred parameters for exponential growth are similar to those obtained in Tennessen that were based on the model of Gravel (Table 1). However, the SFS from this model fits the data worse than that from the exponential model based on the ancient history of the Gazave model (P-value from goodness-of-fit test between the exponential Gravel model and ESP data; P-value = 0.97 for the corresponding exponential Gazave model; see File S1 and Table S3). By applying a generalized growth epoch to the Gravel model, the inferred parameters are generally in line with those from the generalized model based on Gazave , although some differences exist (Table 1), indicating that the assumption of ancient history can affect the inference of recent growth to some extent. More importantly, the generalized Gravel model fits the data almost equally well as the generalized Gazave model, which is significantly better than the exponential model (P-value by likelihood-ratio test; also see Table S3). As with the generalized Gazave model, the inferred growth speed parameter from the generalized Gravel model, (1.18–1.26), is also significantly larger than the exponential speed (P-value , using a one-tailed z-test; Figure 4, C and D).Motivated by these results, we considered a third model with two recent exponential growth epochs, which still assumes the ancient epochs before 620 generations ago to be the same as those in the Gazave model illustrated in Figure 2A. Five parameters were inferred (Table S2), with the first phase of growth estimated to start 219 (95–334) generations ago with a population size of 12,200 (11,700–13,200). This phase of growth lasts until 135 (25–157) generations ago and leads to a population size of 47,100 (30,200–540,900). The population size after the recent phase of growth is 1.12 (1.07–2.09) million. This model provides a significantly better fit than the model with a single exponential growth (P-value by likelihood-ratio test with 2 d.f.), but is a worse model than the generalized growth model (based on the Bayesian information criterion, ). However, this model exhibits some of the same accelerating patterns as in the generalized growth model, ascertained by the growth rate of the most recent exponential epoch being 2.4% (2.3–5.2%), larger than that of the first exponential epoch, 1.6% (1.3–2.1%). This acceleration pattern shown in both the generalized model and the model with two exponential epochs is consistent with evidence of growth in European census population size that has greatly accelerated in the modern era (Keinan and Clark 2012).
Discussion
In this study, we provide mathematical derivation and a software that can efficiently compute the expected values of five genetic data summary statistics given a generalized demographic model by evaluating the derived explicit expressions. These summary statistics include the time to the most recent common ancestor (), the total number of segregating sites (S), the SFS, the average pairwise difference between chromosomes per site (π), and the burden of private mutations (α). The fast and accurate generation of these summary statistics under generalized models can provide a useful tool in the studies of human demographic inference. For instance, in addition to inference based on the SFS as in the present study, a recent study by Chen presented an inference framework based on the total number of segregating sites. The results in this study can be easily incorporated into that framework. Furthermore, the source code of the software is freely available to allow extensions to compute other summary statistics of interest (for example, the joint SFS of samples from multiple populations under generalized models, by extending the work of Wakeley and Hey 1997 and Chen 2012). Such extensions can facilitate a variety of population genetic studies in humans and other organisms beyond the inference of demographic history.It is also possible that other families of growth models may fit the pattern of human population size history. For instance, Eldon considered the algebraic-growth model in the form of . In reality, however, not all demographic models have numerically stable closed-form expressions for the expected time to the first coalescent event (). In these cases, fast and accurate numerical integration methods, such as the Gauss–Legendre quadrature used in this work, can be applied to evaluate . This technique holds the promise of efficiently generating the expected value of population genetic summary statistics under arbitrary population size functions.Bhaskar pointed out that as sample size increases, the assumptions of standard Kingman’s coalescent are violated as multi-merger and simultaneous-merger events can become nonnegligible. Such events can distort the genealogies and potentially cause the values of summary statistics to be different from those under Kingman’s coalescent (Bhaskar ). To explore such discrepancies, we compared the SFS from Kingman’s coalescent and the discrete-time Wright–Fisher (DTWF) model (Bhaskar ) under the inferred demographic history in the generalized Gazave model with a sample size of 3870 diploids (File S1). We observed that the SFS from the DTWF model and Kingman’s coalescent are very similar (File S1 and Figure S9), which means that multi-merger and simultaneous-merger events should not have a significant effect on the inference carried out in this study. However, it remains valuable to systematically study the effect of multi-merger and simultaneous-merger events in the context of generalized growth, especially as sample size increases.By applying inference of generalized growth based on the SFS generated from the synonymous variants of 4300 individuals of the NHLBI ESP data set (Tennessen ; Fu ), we found that the generalized growth model shows a better fit to the observed data than the exponential growth model that has been used by almost all previous demographic modeling studies (P-value ). We also found that the European population experienced a recent growth in population size with speed modestly faster than exponential (, P-value for difference from ). This result is consistent with previous speculations that the human population might have undergone a recent accelerated growth epoch based on the observation of very rare, previously unknown variants in several sequencing studies with large sample sizes (Nelson ; Tennessen ; Fu ). It is also in line with the super-exponential growth in census population size during that time (Keinan and Clark 2012). In future studies, it will be valuable to incorporate gradient-based optimization techniques for the fast inference of demographic models containing generalized growth epochs, e.g., by extending the work of Bhaskar . Such an improvement will enable simultaneous inference of recent growth and more ancient epochs.To minimize the impact of natural selection on our demographic inference, we considered only synonymous SNVs for demographic modeling, as in the original study of Tennessen . However, it is still a potential limitation that the data are affected by negative and background selection. Hence, it remains valuable to validate the result of super-exponential growth by conducting inference on SFS calculated from more neutral genomic regions (Gazave ) or by modeling the effect of selection. One promising possibility is extracting genomic regions that are less subject to selection from whole-genome sequences in the UK10K project (The UK10K Consortium ). More generally, with the increasing availability of high-quality whole-genome sequencing data with large sample sizes for humans and other species, more refined and realistic demographic histories can be estimated with generalized models.
Authors: Jacob A Tennessen; Abigail W Bigham; Timothy D O'Connor; Wenqing Fu; Eimear E Kenny; Simon Gravel; Sean McGee; Ron Do; Xiaoming Liu; Goo Jun; Hyun Min Kang; Daniel Jordan; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; Goncalo Abecasis; David Altshuler; Deborah A Nickerson; Eric Boerwinkle; Shamil Sunyaev; Carlos D Bustamante; Michael J Bamshad; Joshua M Akey Journal: Science Date: 2012-05-17 Impact factor: 47.728
Authors: Wenqing Fu; Timothy D O'Connor; Goo Jun; Hyun Min Kang; Goncalo Abecasis; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; David Altshuler; Jay Shendure; Deborah A Nickerson; Michael J Bamshad; Joshua M Akey Journal: Nature Date: 2012-11-28 Impact factor: 49.962
Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo Journal: Nature Date: 2015-09-14 Impact factor: 49.962
Authors: Daniel Taliun; Daniel N Harris; Michael D Kessler; Jedidiah Carlson; Zachary A Szpiech; Raul Torres; Sarah A Gagliano Taliun; André Corvelo; Stephanie M Gogarten; Hyun Min Kang; Achilleas N Pitsillides; Jonathon LeFaive; Seung-Been Lee; Xiaowen Tian; Brian L Browning; Sayantan Das; Anne-Katrin Emde; Wayne E Clarke; Douglas P Loesch; Amol C Shetty; Thomas W Blackwell; Albert V Smith; Quenna Wong; Xiaoming Liu; Matthew P Conomos; Dean M Bobo; François Aguet; Christine Albert; Alvaro Alonso; Kristin G Ardlie; Dan E Arking; Stella Aslibekyan; Paul L Auer; John Barnard; R Graham Barr; Lucas Barwick; Lewis C Becker; Rebecca L Beer; Emelia J Benjamin; Lawrence F Bielak; John Blangero; Michael Boehnke; Donald W Bowden; Jennifer A Brody; Esteban G Burchard; Brian E Cade; James F Casella; Brandon Chalazan; Daniel I Chasman; Yii-Der Ida Chen; Michael H Cho; Seung Hoan Choi; Mina K Chung; Clary B Clish; Adolfo Correa; Joanne E Curran; Brian Custer; Dawood Darbar; Michelle Daya; Mariza de Andrade; Dawn L DeMeo; Susan K Dutcher; Patrick T Ellinor; Leslie S Emery; Celeste Eng; Diane Fatkin; Tasha Fingerlin; Lukas Forer; Myriam Fornage; Nora Franceschini; Christian Fuchsberger; Stephanie M Fullerton; Soren Germer; Mark T Gladwin; Daniel J Gottlieb; Xiuqing Guo; Michael E Hall; Jiang He; Nancy L Heard-Costa; Susan R Heckbert; Marguerite R Irvin; Jill M Johnsen; Andrew D Johnson; Robert Kaplan; Sharon L R Kardia; Tanika Kelly; Shannon Kelly; Eimear E Kenny; Douglas P Kiel; Robert Klemmer; Barbara A Konkle; Charles Kooperberg; Anna Köttgen; Leslie A Lange; Jessica Lasky-Su; Daniel Levy; Xihong Lin; Keng-Han Lin; Chunyu Liu; Ruth J F Loos; Lori Garman; Robert Gerszten; Steven A Lubitz; Kathryn L Lunetta; Angel C Y Mak; Ani Manichaikul; Alisa K Manning; Rasika A Mathias; David D McManus; Stephen T McGarvey; James B Meigs; Deborah A Meyers; Julie L Mikulla; Mollie A Minear; Braxton D Mitchell; Sanghamitra Mohanty; May E Montasser; Courtney Montgomery; Alanna C Morrison; Joanne M Murabito; Andrea Natale; Pradeep Natarajan; Sarah C Nelson; Kari E North; Jeffrey R O'Connell; Nicholette D Palmer; Nathan Pankratz; Gina M Peloso; Patricia A Peyser; Jacob Pleiness; Wendy S Post; Bruce M Psaty; D C Rao; Susan Redline; Alexander P Reiner; Dan Roden; Jerome I Rotter; Ingo Ruczinski; Chloé Sarnowski; Sebastian Schoenherr; David A Schwartz; Jeong-Sun Seo; Sudha Seshadri; Vivien A Sheehan; Wayne H Sheu; M Benjamin Shoemaker; Nicholas L Smith; Jennifer A Smith; Nona Sotoodehnia; Adrienne M Stilp; Weihong Tang; Kent D Taylor; Marilyn Telen; Timothy A Thornton; Russell P Tracy; David J Van Den Berg; Ramachandran S Vasan; Karine A Viaud-Martinez; Scott Vrieze; Daniel E Weeks; Bruce S Weir; Scott T Weiss; Lu-Chen Weng; Cristen J Willer; Yingze Zhang; Xutong Zhao; Donna K Arnett; Allison E Ashley-Koch; Kathleen C Barnes; Eric Boerwinkle; Stacey Gabriel; Richard Gibbs; Kenneth M Rice; Stephen S Rich; Edwin K Silverman; Pankaj Qasba; Weiniu Gan; George J Papanicolaou; Deborah A Nickerson; Sharon R Browning; Michael C Zody; Sebastian Zöllner; James G Wilson; L Adrienne Cupples; Cathy C Laurie; Cashell E Jaquish; Ryan D Hernandez; Timothy D O'Connor; Gonçalo R Abecasis Journal: Nature Date: 2021-02-10 Impact factor: 69.504