Literature DB >> 33769498

Estimating divergence times from DNA sequences.

Per Sjödin¹, James McKenna¹, Mattias Jakobsson^1,2.

Abstract

The patterns of genetic variation within and among individuals and populations can be used to make inferences about the evolutionary forces that generated those patterns. Numerous population genetic approaches have been developed in order to infer evolutionary history. Here, we present the "Two-Two (TT)" and the "Two-Two-outgroup (TTo)" methods; two closely related approaches for estimating divergence time based in coalescent theory. They rely on sequence data from two haploid genomes (or a single diploid individual) from each of two populations. Under a simple population-divergence model, we derive the probabilities of the possible sample configurations. These probabilities form a set of equations that can be solved to obtain estimates of the model parameters, including population split times, directly from the sequence data. This transparent and computationally efficient approach to infer population divergence time makes it possible to estimate time scaled in generations (assuming a mutation rate), and not as a compound parameter of genetic drift. Using simulations under a range of demographic scenarios, we show that the method is relatively robust to migration and that the TTo method can alleviate biases that can appear from drastic ancestral population size changes. We illustrate the utility of the approaches with some examples, including estimating split times for pairs of human populations as well as providing further evidence for the complex relationship among Neandertals and Denisovans and their ancestors.

Entities: Chemical Disease Gene Species

Keywords: divergence time; effective population size; human evolution; population divergence

Year: 2021 PMID： 33769498 PMCID： PMC8049563 DOI： 10.1093/genetics/iyab008

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

Background

Many population genetic inference approaches compare levels of genetic variation within and across genomes, individuals and/or populations in order to uncover their evolutionary history. A multitude of demographic inference methods have been developed in order to capitalize on the wealth of information that comes with the availability of full genomes from multiple individuals (see Schraiber and Akey 2015, for a review). The sheer scale and complexity of whole-genome data sets poses its own challenge for making inference of population demographic parameters. A common approach for inference has been to compare the observed data, often summarized in some statistic, to simulated data that can be generated under a range of population-genetic models. Building on this idea and combined with a rejection algorithm, Approximate Bayesian computation (ABC; Tavaré ; Beaumont ; Cornuet ; Pudlo ) has proven to be one useful tool for both model choice and parameter estimation. However, the problem of choosing which models to test is not trivial for most inference approaches, including ABC, as the set of models to choose from is very large. In parallel, there have been recent developments in methods that use haplotype information. A challenge for these approaches is how to model the dependence of genealogies along a sequence. One solution has been to approximate the full ancestral recombination graph, a method used in the pairwise sequentially Markovian coalescent (PSMC) (Li and Durbin 2011) and similar approaches (Schiffels and Durbin 2014; Terhorst ; Kelleher ; Speidel ; Wang ). Another strategy has been to rely on relatively short genetic fragments located sufficiently far away from each other to be able to assume linkage equilibrium between loci, combined with absolute linkage (absence of recombination) within each locus (e.g. Gronau ). Both these approaches typically lead to set-ups that cannot be solved analytically and often rely on computationally heavy, advanced statistical methods in order to estimate parameters (but see Gattepaille ; Lohse ). A related strategy is to assume independence among sites, using a composite likelihood framework (Gutenkunst ; Excoffier ). From this assumption, the observed variables (i.e. frequency spectra) do not depend on the full distribution of genealogical branch lengths, they are functions only of the expected branch lengths (Griffiths and Tavaré 1998; Chen 2012). This observation greatly simplifies the probability computations. To the extent that closed-form solutions can be obtained, the assumption of independence between sites also leads to inference tools that are easier to integrate with other methods, and can provide useful insights into underlying processes (Beichman ; Terhorst ). Conversely, a disadvantage of assuming independence between sites is that only information concerning the expected values can be obtained, rather than the full distributions of stochastic variables. For small samples, an alternative is to derive closed-form expressions for the probability of observing particular configurations of variants in simple divergence models, including the isolation-with-migration model (Wakeley and Hey 1997; Wilkinson-Herbots 2008; Chen 2012). Lohse , 2016) showed that more generally, the probability of observing a particular variant configuration can be obtained from a generating function of genealogical branches. Assuming independence among sequence-blocks, Lohse , 2016) outlined an approach for computing the likelihood under various demographic models and sampling schemes. Regardless of whether independence between sites is assumed or not, all of these methods can be useful for inferring the time of divergence between two populations. Examples of simple and direct methods used to estimate population divergence times include: Gutenkunst ,Wakeley (2009),Green ,Schlebusch , and Theunert and Slatkin (2018). These methods build on the principle of genetic drift accumulating as a function of effective population size and number of generations. Following a population backwards in time, and using that the accumulated drift at generation t is: where N(i) is the (effective) number of individuals at generation i, the divergence time is then the number of generations required to generate the estimated drift. Such estimates are typically not dependent on knowing the mutation rate but some assumptions regarding N(i) is required, either by assuming a fixed effective population size or depending on an estimated function of N(i). Alternatively, one can base the divergence time estimate on an assumed mutation rate (e.g.Wakeley and Hey 1997; Chen 2012; Pickrell ). By assuming independence among sites, in a two-population divergence model (without migration), the probability of observed sample configurations (summarized as the full site frequency spectrum (SFS), including invariable sites) can be derived analytically. Using a likelihood framework, we can then estimate parameters of interest in the divergence model. Here, we present two simple approaches based on picking two gene copies from each of two populations: the “Two-Two” (TT) method, which was briefly introduced in Schlebusch and the “Two-Two-outgroup” (TTo) method. These are sufficiently simple to allow for analytical solutions giving closed formulas for the estimates of the model parameters based on the counts of different sample configurations. Specifically, assuming a mutation rate and generation time, we can estimate population divergence time separately from genetic drift since the model is parametrized with both a drift parameter and a time parameter.

Observed data

For the purpose of investigating the demographic relationship between two populations denoted population 1 and population 2, assume that two gene copies have been sampled from each population. For bi-allelic sites, assume that the ancestral (denoted “0”) and the derived variant (denoted “1”) is known. The number of derived alleles in a sample from population 1 combined with the number of derived alleles in as sample from population 2 is referred to as the joint frequency spectra (e.g.Chen 2012). In our set-up, the sample size from both populations is two so that the number of derived is either 0, 1, or 2, and there are 9 possible sample configurations, which are presented in Table 1. The observed number of sites with sample configuration will be denoted by and the total number of investigated sites by m.

Table 1

Notation for the number of sites with 0, 1, or 2 derived variants in the sample from population 1 and the sample from population 2

	0 in population2	1 in population2	2 in population2
0 in population 1	O0,0	O0,1	O0,2
1 in population1	O1,0	O1,1	O1,2
2 in population 1	O2,0	O2,1	O2,2

Notation for the number of sites with 0, 1, or 2 derived variants in the sample from population 1 and the sample from population 2

Theory

We study a general population divergence model where the population-branch leading to population 1 and the population-branch leading to population 2 merge (backwards in time) to become the ancestral population. The model makes no assumptions regarding population size and/or population structure changes in the daughter populations. The model assumes no migration between the two daughter populations and that these merge into a panmictic ancestral population. We use the following notation, with time measured in number of generations: t 1, time to split for population 1; t 2, time to split for population 2; a 1, probability of two lineages in population 1 not coalescing before t1; a 2, probability of two lineages in population 2 not coalescing before t2, n 1, expected time to coalescent in population 1 given coalescingbefore t1; and n 2, expected time to coalescent in population 2 given coalescing before t2. In addition to the drift parameters α1 and α2, the parameters ν1 and ν2 are needed because two branches with the same time-length and the same drift can have different distributions of coalescence times. To illustrate, a linearly growing population that starts with size N and ends with size 2N will have the same drift as a shrinking population that starts with size 2N and ends with size N but they will not have the same distribution of coalescent times within that interval. These parameters also cover cases when the daughter populations are not panmictic. A similar parametrization can be found, for instance, in Rogers and Bohlender (2015). The composite likelihood assumption of independence between sites implies that the probability of a mutation on a specific branch in a genealogy is the expected length of that branch (given a demographic model) multiplied by the mutation rate. We denote the mutation rate per site and generation by μ, assume independence between sites, and an infinite sites model. We define the following events for the two sampled lineages for each population: With lineages surviving to enter the ancestral population (depending on whether coalescence events have occurred in the daughter populations), we define A to be the number of derived variants in a sample of size k drawn at the split time in the ancestral population and write . To illustrate how the probabilities of the sample configurations are derived, we can take an example conditional on no coalescent event in population 1 and a coalescent event in population 2 (the event ). There are then three lineages entering the ancestral population. These lineages constitute a sample of size 3 from the ancestral population. Sample configuration will then be observed with probability (the ancestral variant has to be assigned to the lineage entering population 2; an event with probability 2/3) plus the probability that a mutation occurs on either lineage entering population 1 during the time interval t1. The probability that a mutation hits a branch of length t1 is , and the probability that this happens and that the derived variant already exists in the ancestral population can be ignored as it requires two mutational events at the same site. Thus, conditional on . The same reasoning can be applied to derive the conditional probabilities for all seven (polymorphic) sample configurations and these are shown in Table 2.

Table 2

Conditional probabilities

	H1∧H2	H1∧¬H2	¬H1∧H2	¬H1∧¬H2
O1,0	2μν1	2μν1	2a313+2μt1	a412+2μt1
O0,1	2μν2	2a313+2μt2	2μν2	a412+2μt2
O2,0	a212+μ(t1−ν1)	a313+μ(t1−ν1)	a323	a426
O0,2	a212+μ(t2−ν2)	a323	a313+μ(t2−ν2)	a426
O1,1	0	0	0	2a423
O2,1	0	2a323	0	a432
O1,2	0	0	2a323	a432

Conditional probabilities Since a subsample of size n randomly drawn from a larger sample of size n + k has the same distribution as a sample of size n drawn directly from the population, we can reduce the number of parameters by replacing all a with i < 4 using a-terms with i = 4 as follows: These equations together with Table 2 allow us to derive the probabilities for the different sample configurations. For instance: where . Using the same strategy for the derivation of the other six probabilities, we obtain the probabilities for all seven sample configurations in Table 2. Writing for brevity, these are: Furthermore, if we assume a (indefinitely) panmictic ancestral population (Figure 1A), we define: to be the number of generations a coalescent process that starts with four lineages at the (most recent) base of the ancestral population spends with i lineages, so that the time to the most recent common ancestor (Tmrca) is . Then (see Appendix):

Figure 1

Different assumptions for population divergence models (A) panmictic ancestral population, and (B) constant ancestral population.

Different assumptions for population divergence models (A) panmictic ancestral population, and (B) constant ancestral population. Writing , and replacing the b with their respective expression in terms of τ, the probabilities for the different sample configurations can be expressed as: These eight equations point to two challenges: (1) it is not possible to completely separate τ4 from divergence times due to its co-occurrence with and , ii) disregarding τ4, it is still an underdetermined set of equations with eight parameters but only seven equations/degrees of freedom (). It can be tempting to reduce the number of parameters by setting t1 = t2, but because [from equations (1) above] specifying t1 = t2 would add additional dependence between the equations. Although this will decrease the number of parameters, it also decreases the number of independent equations. Furthermore, allowing for separate divergence times along the two branches is a valuable asset; not only does it allow the framework to be applicable for temporally structured samples, but separate estimates for each branch can be useful more generally. In fact, it turns out that the divergence time estimate based on the population branch represented by a modern-day individual alleviates the potential issue of residual ancient DNA-specific properties (DNA degradation, sequencing errors, and mapping errors) that could impact divergence time estimates (see below). In contrast, for contemporaneous samples, divergence time estimates should be the same along the two branches (assuming neutrality and the same mutation rate and generation time along the two branches). The challenges noted above can be dealt with either by assuming a constant ancestral population size (the“TT”-method) or by using an outgroup to increase the number of equations (the“TTo”-method).

Assuming a constant ancestral population size (“TT”)

Assuming a constant ancestral population size N reduces the number of parameters in the model (Figure 1B), so that and (with ) and . Then the probabilities in equations (2) simplify as: with We set and solve for the parameters (and note that this guarantees that they are also maximum likelihood estimates (MLEs; Doob 1934; Wald 1949) to obtain: These equations are referred to as the ’TT’-method in Schlebusch where, in order to get the divergence time in years we used G = 30 and in where G is the length of a generation. Note also that if sequencing errors or DNA degradation mainly result in additional singletons, then errors in the sample from population 1 only affects and thus only and ( occurs exclusively in the equations for estimating T1 and V1).

Adding an outgroup (“TTo”)

The equations (1) are useful for data (including SNP-genotype data) where the derived variant at each site has been ascertained in a population that branched off prior to the investigated population split. Such data will ensure that derived variants in the studied sample will be older than the split so that there are no new mutations occurring in the branches. In such a case, , and can all be set to 0 in the equations (1) above, resulting in a new set of equations (see ) that can be solved for the α’s to get: where * indicates that these are the corresponding parameters and sample configuration counts conditional on ascertainment in an outgroup. With this ascertainment procedure it is important that the population used to ascertain the SNPs represents a true outgroup to our studied populations and that the populations satisfy an assumption of bifurcating topology (or “tree-ness”). To validate such an assumption we can set up tests of tree-ness, since if , then the test statistics: should be 0 (see Appendix, where it is also shown that Y2 is closely related to the D-statistic, Green ). The estimates and of α1 and α2 together with the equations in (2) can furthermore be used to obtain estimates of: with Note the two alternative estimates (one using and one using ) for τ3 and τ2 and we take the average of these in estimates below. Based on the obtained estimates of τ2 and τ3, we can attempt to approximate τ4 as a combination of τ2and . In a constant population, and or For this reason we propose to approximate as where x is the estimated ratio of . This leads to to get We refer to this approach to estimate divergence time as:TTo” (as in“TT outgroup”).

Picking two gene copies from population 1 and one gene copy from population 2

The method so far described can be seen as an expansion of the simpler case of picking two gene copies from one population, and only one gene copy from the other population. This simpler set-up can be useful, for instance, when dealing with low-coverage genome data (e.g. ancient DNA sequence data). With this simpler approach, divergence time estimation needs an outgroup (only assuming a constant population size is not sufficient to solve the equations in this case). This 2 plus 1 approach does, however, provide reliable estimates of branch specific genetic drift (under often reasonable demographic assumptions, see Appendix and Wakeley 2009; Schlebusch ; Skoglund ).

Simulations and comparison to GPhoCS

The model underlying the TT method assumes a panmictic ancestral population of constant size prior to the split, and no gene-flow between populations after the split. Although common to many coalescent-based approaches, such assumptions are rarely realistic for natural populations, and it is increasingly evident that mis-specification of an overly simplistic model may lead to substantially biased parameter estimates (Gronau ; Mazet ; Orozco 2016). Here, we investigate the robustness of the TT-method parameter estimation against violation of the basic model assumptions [equations in (3)]. We compare its performance under these conditions against an alternative method for parameter inference, GPhoCS (Gronau ). The analytical TT method and the Bayesian inference method GPhoCS are located to some degree at opposite ends of the statistical inference spectrum; instead of relying on independent single bi-allelic sites, GPhoCS assumes complete linkage between individual sites at a genetic locus (typically 10 kb), but independence between these loci. It should be noted that GPhoCS is capable of estimating parameters under more complicated demographic models than the simple split model we study here. In particular, GPhoCS allows users to specify migration rates and define migration bands between populations, such that it does not share the TT method assumption of no gene-flow occurring between populations after the population split. For this reason, the effect of migration on parameter estimation was investigated only for the TT method. The software MS (Hudson 2002) was used to generate polymorphic datasets using a standard coalescent algorithm under a variety of demographic scenarios. The effects of changes in ancestral population size (Figure 2A) and migration between branches since the split (Figure 2B) were investigated. In each model, the ancestral population size, N, was fixed at 34,000, corresponding to 17,000 diploid individuals. This value is in line with recent estimates of African ancestral effective population size ∼1 million years ago (Li and Durbin 2011; Schiffels and Durbin 2014; Schlebusch ). MS scales time by , and simulations were constructed with true split times of 10,000 and 1500 generations. Assuming a generation time of 30 years this equates to split times of 300,000 and 45,000 years, respectively. These were chosen to keep simulations relevant to the findings of previous work where the deepest split among human groups was estimated at >260,000 years (Schlebusch ), together with more recent divergence events.

Figure 2

The two general demographic models used to simulate data for testing robustness of TT method, with (A) changes in ancestral population size, and (B) variation in proportion and timing of migration between branches since a population split. MS generates samples assuming , where N1 is the diploid population size of population 1, and μ is the neutral mutation rate. Mutation rates vary across the human genome and estimates vary depending on the method used (Scally and Durbin 2012). Li and Durbin (2011) calculated a human neutral mutation rate of per generation (assuming 25 years per generation), whilst recent consensus suggests a lower rate of per base pair per generation is more accurate (Moorjani ). The latter is the mutation rate used across all simulations. Results were filtered such that only those simulations resulting in all sample configurations represented by sites were used in subsequent analyses. The Bayesian inference method GPhoCS is based on likelihood estimation and in order to allow adequate convergence of parameter estimates, a burn-in period of 100,000 iterations was used when applied to MS simulated data.

The effect of varying ancestral population size

In simulating the demographic scenario shown in Figure 2A, populations 1 and 2 are constant backwards in time, but not (necessarily) equal in size; each population size is independently drawn from a uniform distribution between 170 and 1,700,000 diploid individuals. A total of 1000 of such demographies were generated. Populations 1 and 2 merge at 10,000 generations to form a single ancestral population of initial size individuals for generations. The ancestral population then changes to , [drawn uniformly from (1.7 × 102, 1.7 × 106)] for τ generations, before returning to N. We investigate the impact of that change in ancestral population size ( for τ generations) on TT estimates of population divergence time () and ancestral population size, , for generations and generations (Supplementary Figure S1). Supplementary Figure S1 shows that increasing ancestral population size for τ generations can have the effect of inflating divergence time estimates for the TT method. This behavior is expected; if we imagine an expansion of the ancestral population to infinite size for τ generations, no coalescence events would occur during that time, and divergence time estimates would be upwardly biased by τ generations. This bias however seems to be relatively minor compared with that arising from severe bottlenecks. For instance, when the true t = 10,000, a bottleneck in the ancestral population of 1500 individuals lasting for 100 generations results in generations. However, the same severity of bottleneck lasting for 500 generations will result in a greater underestimate of generations. Similarly, N is underestimated when severe bottlenecks occur, though these estimates seem to be more robust than estimates of divergence time (Supplementary Figure S2). When studying more recent splits, (1500 generations), we observe that severe bottlenecks have the potential to result in nonsensical negative split time estimates (Supplementary Figure S3). Figure 3 shows a comparison between TT method and GPhoCS estimates of population divergence time (t) and ancestral population size (N) in cases where the duration of the bottleneck (τ) is fixed at 1000 generations and true split time is 10,000 generations. Results suggest that both methods react similarly to violations of the assumption of a change in ancestral population size; each being particularly susceptible to bias when severe bottlenecks have occurred. GPhoCS performs somewhat better than the TT method, with severe bottlenecks resulting in less of an underestimate of population divergence time.

Figure 3

A comparison of the effect of ancestral population size changes on TT method and GPhoCS parameter estimates. The time between population divergence and change in ancestral population size () is (A, B) 0, (C, D) 2000, and (E, F) 10,000 generations. In all cases, the duration of change in ancestral population size (τ) is 1000 generations and true split time is 10,000 generations. An interesting effect appears when a bottleneck of sufficient severity occurs, whereby both methods’ estimates begin to rebound towards the true split time of 10,000 generations. Again this behavior is expected as all lineages will coalesce in a bottleneck of sufficient severity prior to a population divergence event. In this case the bottleneck itself will act as the constant ancestral population size, and as long as it occurs in close proximity to the split, divergence time estimates are not affected much. For the same reason, when a severe bottleneck occurs a long time prior to the split, both methods produce a (slight) overestimate of the true divergence time (Figure 3E).

The effect of migration between branches

In simulations based on the demographic scenario shown in Figure 2B, a pulse of admixture occurs δ generations ago, with proportion of one daughter population made up of migrants from the other daughter population. Thus we examine the effect of increasing proportion of migration occurring at various times between present and the split time. All populations are kept fixed and constant at 17,000 diploid individuals (). Figures 4 and 5 show the effect of increasing proportion of migrants on TT divergence time estimates when true split time is 10,000 generations. Divergence times are reliably estimated when the proportion of migrants is below 0.1, and as expected, even at higher proportions the bias decreases the nearer the admixture event is to split time. Note that under this set-up, the TT method returns generations even when the proportion of migrants (γ) is 1 and admixture time (δ) is 0. We would expect in this case a , but it seems that this scenario (where all populations are equal in size) is equivalent to a violation of the assumption of a constant ancestral population. As described previously, this has the effect of biasing upwards, and shows that in cases of high proportion of recent admixture, differences between the size of the daughter populations and the ancestral population also has the potential to result in biased . Estimates of ancestral population size on the other hand, are only very slightly affected (Supplementary Figures S4 and S5). Very similar results are observed when a more recent true split time of 1500 generations is studied (Supplementary Figures S6 and S7).

Figure 4

The effect of varying admixture time (δ) on TT split time estimates (), when proportion of admixture (γ) is (A) 0, (B) 0.01, (C) 0.05, and (D) 0.1, and true split time is 10,000 generations.

Figure 5

The effect of varying admixture time (δ) on TT split time estimates (), when proportion of admixture (γ) is (A) 0.25, (B) 0.5, (C) 0.75, and (D) 1, and true split time is 10,000 generations.

The effect of varying admixture time (δ) on TT split time estimates (), when proportion of admixture (γ) is (A) 0, (B) 0.01, (C) 0.05, and (D) 0.1, and true split time is 10,000 generations. The effect of varying admixture time (δ) on TT split time estimates (), when proportion of admixture (γ) is (A) 0.25, (B) 0.5, (C) 0.75, and (D) 1, and true split time is 10,000 generations. Although simulations have shown the TT method to be relatively robust to violations of its assumptions in general, it is evident that extensive, recent gene flow between daughter populations or strong, prolonged bottlenecks in the ancestral population has the potential to introduce bias. If however we obtain external estimates of α1 and α2 through the outgroup ascertainment procedure outlined above (TTo), we can obtain estimates of divergence time that are much less dependent on assumptions concerning the ancestral population. Figure 6 shows a comparison of TT and TTo method results in scenarios of increasing duration of ancestral population size change. The true values of α1 and α2 have been used in equations in (7) to obtain estimates of B1, B2, τ2 and τ3 that in turn have been used to approximate τ4 and divergence times following equation (8). These results show that by using external estimates of drift, there is the potential to considerably reduce bias in divergence time estimates when severe bottlenecks have occurred in the ancestral population. Furthermore, Figure 6 also compares estimates of N, from that of the TT method to one based on the TTo estimate of τ4 (), which is found to be much more sensitive to ancestral bottlenecks.

Figure 6

A comparison of TT method estimates of divergence time and ancestral population size with (TTo), and without (TT) using external estimates of drift. The duration of alternative ancestral population size (τ) is (A, B) 100, (C, D) 500, and (E, F) 1000 generations. In all cases, the change in ancestral population size occurs immediately prior to the split (=0) and true split time is 10,000 generations.

Application to data

The TT method requires good quality sequence data, typically high-coverage genome sequence data since diploid genotype calls are utilized, including invariable sites and singletons (in the sample of four chromosomes). Alternatively, instead of one high-coverage diploid genome, several low-coverage genomes from the same population data can be combined to produce (sufficiently many) sites with two gene copies. It is also important to be careful when filtering the genome for reliable regions as this can cause an artificial bias of the mutation rate. The formulas are sufficiently simple to allow for asymptotic confidence intervals based on MLE theory, and one can imagine thinning the genome data to make sites independent of each other (to overcome potential dependence via linkage). However, we chose a more conservative approach of estimating confidence; the weighted block jackknife procedure (Busing ), which should be more robust to large-scale “outlier” regions driving the signal. We conducted pairwise comparisons among the 11 HGDP individuals and the Denisovan genome and the Altai Neandertal genome from (Meyer ) and (Prüfer ) and estimate population divergence times (see Appendix for a description of data cleaning and calling of ancestral states). The 11 individuals from the Human Genome Diversity Project (HGDP) include 5 individuals from Africa: one Khoe-San (“San”), one rainforest hunter-gatherer (“Mbuti”), two West-African (“Mandenka” and “Yoruba”) and one East-African (“Dinka”). The other individuals were two Europeans (“French” and “Sardinian”), two East-Asians (“Han” and “Dai”), one individual from Oceania (“Papuan”) and one individual representing a South-American indigenous population (“Karitiana”). In addition, we used the high-coverage ancient southern African hunter-gatherer genome (“Balito Bay A”; Schlebusch ) as an outgroup for some divergence estimates (see below).

Split model parameter estimates

We refer to split time estimates in years in the TT-method as and under the TTo method as . These are obtained by setting G = 30 and in equation (4) and applying this to the estimates of T1 and T2 in equations (3) and (8), respectively. Comparisons are grouped according to the population split they represent. For instance, the comparison between French and San is referred to as the “Khoe-San split.”

Divergence estimates according to the TT-method

Assuming a constant ancestral population size, N, it is possible to estimate N, α1, α2, ν1, ν2 as well as t1, t2 without relying on ascertainment procedures. Estimates of α, N and ν are shown in Supplementary Figures S8–10, respectively. From Supplementary Figure S10 it is apparent that ν is often poorly estimated and the uncertainty of the estimate appears to be closely linked to the amount of branch-specific genetic drift (Supplementary Figure S11). A closer look at any of the formulas for reveals that the impact of ν on the probabilities disappears as αapproaches 1 (no drift). Estimates of the ancestral population size remain remarkably constant at around , regardless of choice of individuals (Supplementary Figure S8). Values of are shown in Figure 7. To summarize, estimates of the different split times are (in descending order):

Figure 7

Split time estimates assuming a constant ancestral population and a mutation rate of and a generation time of 30 years. Corresponding to the branch specific divergence time, there are two estimates each comparison.

the split between Neanderthal and Denisovans kya; the split between archaic humans and modern humans kya; the deepest split among modern human population (between Khoe-San and other human populations) kya (see Schlebusch ) for the consequence of using the ancient southern African Balito Bay A genome); the split between Mbuti and other modern humans (excluding Khoe-San populations) kya the split between West- and East-Africans kya; the split between East-Africans and non-African kya; and splits between non-African < 0 ya. Split time estimates assuming a constant ancestral population and a mutation rate of and a generation time of 30 years. Corresponding to the branch specific divergence time, there are two estimates each comparison. Here, the range for the split between archaic and modern humans takes into account the fact that the archaic genomes are older than 40 ky. There are two obvious odd sets of estimates among these: the negative times for non-Africans, and the deep time between Denisovans and Neandertals contrasted to the younger time between Denisovans/Neandertals and modern humans (note that we assume a constant ancestral population size here). We discuss each of these split time estimates below, but first we revisit the utility of ascertaining variants in an outgroup.

Divergence estimates according to the TTo method

By comparing two individuals using only those sites where the derived variant was present in an outgroup, it is possible to: (1) test whether the outgroup represents a true outgroup, and (2) obtain estimates of α1 and α2 that do not rely on assumptions concerning the ancestral population. We utilized the Mbuti, Balito Bay A, or Neandertal/Denisovan as outgroups. The estimates of α conditional on the derived variant being present in an outgroup are shown in Supplementary Figure S15. These three options were variably suitable as outgroups depending on the comparison being made. For instance, when comparing an individual from outside Africa to an African individual, Neandertal/Denisovan would not be true outgroups given the archaic admixture shared among non-African individuals (Green ). This was also visible in the tests based on equations (5) and (6) above (as well as the D-test, see Supplementary Figure S12). A likely consequence of the documented additional Denisovan ancestry in Papuan (Meyer ) is that no comparison involving Papuan passed the outgroup tests. Perhaps more surprising, any comparison involving Mbuti failed the tests when Balito Bay A was used as the outgroup. Moreover, both Mbuti and Balito Bay A were expected to be true outgroups for the comparison of Neandertal vs Denisovan, but the test; however, pointed to them not being true outgroups. Comparisons between estimates of θ (assuming a constant ancestral population size) to estimates of τ2 and using different outgroups for ascertainment are shown in Supplementary Figures S16–18. Since there is presently no suitable outgroup for comparisons between a modern human and one of the two archaic humans—this would require a genome from an archaic human that split off before the Neandertal/Denisovan branch—it was not possible to estimate τ2 and for such comparisons. When reliable outgroup ascertained estimates of α1 and α2 can be obtained, we estimate τ2, τ3, B1 and B2 using equations (7) that are used in equation (8) to obtain an estimate of . This in turn gives that are shown in Supplementary Figures S16–21. For the majority of comparisons, such an approach does not yield different estimates compared with assuming a constant ancestral population size. The major exceptions to this are those comparisons involving non-Africans that show positive and realistic divergence time estimates using the ascertainment scheme (Figure 8).

Figure 8

Different estimates of split times using outgroup ascertainment assuming a mutation rate of and a generation time of 30 years. Comparisons with Papuans not included as no such comparison passed the outgroup-tests. Three estimates are shown: estimates where outgroup ascertainment is performed in Mbuti (+), in Balito Bay A (○) and in Neandertal/Denisovan (×). Transparent gray represents SD and for comparisons that failed the outgroup tests.

Divergence times outside Africa

The divergence time estimates for non-African populations under a constant model () are nonsensical, (negative values). This is likely a consequence of the severe out-of-Africa bottleneck that leads to being much smaller than , which then violates the assumption of a constant N ( in a constant population with N chromosomes). Estimates based on the three outgroup ascertainment schemes () give more reasonable values as shown in Figure 8. Here, the split times estimates are: 50–75 kya between Europeans and Asians/Americans; kya between Sardinians and French; 25–30 kya between Dai and Karitiana; 25–30 kya between Dai and Han; and < 25 kya between Han and Karitiana. These estimates are generally consistent with the prevailing view of the demographic history outside Africa. For instance, the deep split between Sardinians and French may reflect previous findings that while Sardinians trace their ancestry mostly to the early Neolithic farmers, the French are more admixed with European hunter–gatherers and components of the Yamnaya expansion (Skoglund ; Allentoft ; Günther ; Haak ). To interpret the split times between Han, Dai, and Karitiana, it should be noted that Karitiana is best modeled as a combination of three source populations (an ancient Siberian Eurasian source, a north East Asian source and an Australasian source), where the north East Asian contribution is substantially greater than the other two sources combined (Raghavan ; Skoglund ). The fact that the Karitiana show a more recent divergence with Han than with Dai likely reflects north East Asians contributing substantially to Native Americans, and that the Dai has a south East Asian component (closer to an Australasians; Raghavan ; Skoglund ). This admixture pattern results in shallower divergence between Karitiana and Han, and deeper divergence between Karitiana and Dai and between Han and Dai (see e.g.Figure 2 in Raghavan ).

Western vs Eastern Africa and timing of the out-of-Africa event

Assuming a constant ancestral population size, we estimate the split between non-Africans and East-Africans (Dinka) to between 66 and 82 kya. The split between Mandenka and Yoruba is estimated to 100 kya while the split between Western Africans and Eastern Africans (Dinka and non-Africans) is estimated to between 96 and 117 kya. The estimates based on the three outgroup ascertainment schemes () are generally older. Although the demographic history of Western and Eastern Africa appears to be particularly complex (Pickrell ; Gurdasani ; Triska ; Busby ; Hollfelder ), and the SD estimates suggest one should not over-interpret the values, there are a few interesting tendencies among these estimates (Figure 9). First, estimated split times are consistently lower among Yoruba, Mandenka, and Dinka than between any of these populations and a non-African population; likely an effect of gene flow among the three African populations (Gurdasani ; Busby et al. 2016; Schlebusch and Jakobsson 2018). Second, estimates between Yoruba and Dinka (or non-Africans) are deeper than split time estimates between Mandenka and Dinka (or non-Africans). This is consistent with some observations suggesting that Mandenka have a greater east African/European ancestry component compared with Yoruba (Gurdasani ; Patin ; Schlebusch and Jakobsson 2018). Although Mandenka is more distant geographically from Dinka than Yoruba, there is evidence that historical trading routes along the Sahel belt may have resulted in more gene-flow between East Africans and Mandenka (than with Yoruba; Triska ; Černý ). Third, there is a tendency for split estimates between East Asians (Dai or Han) and West Africans (Yoruba, Mandenka, or Dinka) to be deeper than split estimates between Europeans (French or Sardinian) and West Africans. This observation, combined with gene-flow between east and west Africa, is consistent with previous suggestions of migration into East Africa from a European or Middle Eastern source (Llorente ).

Figure 9

Different estimates of split times using outgroup ascertainment assuming a mutation rate of and a generation time of 30 years. Three estimates are shown: estimates where outgroup ascertainment is performed in Mbuti (+), in Balito Bay A (○) and in Neandertal/Denisovan (×). Transparent gray represents SD and for comparisons that failed the outgroup tests.

Deepest splits among modern human populations

The split between Khoe-San and other modern human populations are estimated to around 250 kya using the TT-method (Schlebusch ). It was further demonstrated that all modern-day Khoe-San groups and individuals, including the HGDP San individuals investigated here, were affected by Eurasian/east African admixture, which in turn impacts estimates of the deepest divergence of modern humans (different methods are differently sensitive to admixture; Schlebusch ). This observation became evident in comparisons with an ancient southern African individual (the Balito Bay A boy who lived some 2000 years ago), closely related to modern-day Khoe-San individuals, but without the Eurasian/east African admixture that post-dated the life-time of the Balito Bay A boy. Population divergence time estimates based on the ancient Balito Bay A boy predates the estimates based on modern-day Khoe-San individuals, and give an estimate that is unaffected by the migration and admixture in the last 2000 years (Schlebusch ). Above we showed the effect of assuming a constant ancestral population size, and how violations of this assumption by a bottleneck in the ancestral population can bias divergence time estimates. However, we find no evidence for such a bottleneck in the common ancestral population to all modern humans, and hence. In general, we find very little difference between the TT and the TTo estimates (Supplementary Figure S19). In fact, there is a tendency for to be lower than for the Mbuti-split, providing additional support that the Khoe-San split is the deepest split among modern human populations (Schlebusch ).

Archaic split times

The split between modern humans and both Neandertal and Denisovan is estimated to between 510 amd 707 kya, which is in line with previous such estimates (e.g.Prüfer ). In fact, restricting the analysis to split times on the non-ancient branch (to alleviate issues with fossil dating and potential excess ancient DNA damage) and only to comparisons between Africans and the two archaic humans, gives a range of estimates from 609 to 681 kya (Figure 10). Unfortunately, there is (presently) no suitable outgroup for comparisons between modern humans and archaic humans in order to utilize the outgroup ascertainment approach.

Figure 10

Split time estimates between the five African individuals and the two archaic humans assuming a constant ancestral population. Only estimates based on the non-ancient branch are shown. A mutation rate of and a generation time of 30 years is assumed. The estimated split between Neandertal and Denisovan is around 970 kya; more than 250 ky older than the split between modern humans and archaic humans. This is likely an artifact of violating the model assumptions; the existence of a more complex demography is indicated by our finding that neither Mbuti nor Balito Bay A were found to be true outgroups to the Neandertal–Denisovan comparison according to our outgroup test and the D-test (Supplementary Figures S14 and S13). Some studies hypothesize that the demographic relationship between Neandertals and Denisovans was governed by meta-population dynamics (Rogers ). Others suggest complicating demographic factors such as admixture between Denisovans and Homo erectus (Prüfer ) or admixture from the modern human branch into the Altai Neandertal (Kuhlwilm ).

Conclusion

We present a simple approach to estimate parameters under a comparatively general split model. In particular, no assumptions are needed concerning the population size processes/changes in the daughter populations (i.e. more recent than the split). The underlying model does not include gene-flow between daughter population; however, we can show that moderate violation of this assumption has little impact on the population divergence-time estimates. Assuming a constant ancestral population size, this approach provides an unbiased estimate of divergence time. However, when the ancestral population is not constant, and particularly in the case of severe bottlenecks, divergence time estimates can be biased. Indeed, simulations comparing the TT-method to GPhoCS—an alternative, fundamentally different approach to demographic inference, has shown that the two methods are sensitive to violations of the same assumptions. The reason for this can be intuitively understood in terms of the tMRCA in the ancestral population; most of this time is spent with two lineages and the duration of this is utilized by both methods to estimate the size of the ancestral population. Since, by assumption, the ancestral size is constant, if the time to the first coalescent event in the ancestral population is shorter than expected, (for instance, due to a bottleneck shortly before the divergence), then both methods underestimate the true population divergence time. When such severe bottlenecks have occurred, we have shown that it is possible to reduce much of this bias through the outgroup ascertainment procedure implemented in the TTo method. Applying the TT-method to a sample of 11 genomes from the HGDP panel together with the Neandertal and Denisovan genomes, we provide further information on the details of the various splits within the sample and corroborate many previously estimated population divergence times. Finally, accumulating evidence suggests that human evolution is highly reticulated, and perhaps not well approximated by the sort of bifurcating tree-models studied here (Schlebusch and Jakobsson 2018; Henn ; Scerri ; Stringer 2016). Nonetheless, the framework presented here is still a useful tool: different population genetic methods vary in their assumptions and sensitivities to model violations, thus it is important when investigating the complex demographies underlying the evolution of humans to have access to a variety of different methods. The TT-method is relatively robust to model violations and provides a simple and transparent analytic framework that can be compared with, and potentially even integrated with, other, more computationally demanding methods (Beichman ; Terhorst ; Wang ).

Data availability

Genome sequence data from the following publications was extracted and reprocessed in order to avoid mapping and filtering biases: Prüfer and Schlebusch . Scripts used in simulations and plotting of results, together with open source code for running the TT method is freely available in Python at github.com/jammc313/TT-method. Supplementary material is available at figshare: https://doi.org/10.25386/genetics.13415774.

Table A1

Number of derived in the two samples

	0 in population 2	1 in population 2
0 in population 1	O0,0	O0,1
1 in population 1	O1,0	O1,1
2 in population 1	O2,0	O2,1

Table A2

Conditional probabilities

	H	¬H
O1,0	2μν	23a31+2μt1
O0,1	12a21+μt2	13a31+μt2
O2,0	12a21+μ(t1−ν)	13a32
O1,1	0	23a32

55 in total

1. Approximate Bayesian computation in population genetics.

Authors: Mark A Beaumont; Wenyang Zhang; David J Balding
Journal: Genetics Date: 2002-12 Impact factor: 4.562

2. The joint allele frequency spectrum of multiple populations: a coalescent theory approach.

Authors: Hua Chen
Journal: Theor Popul Biol Date: 2011-12-02 Impact factor: 1.570

3. On the importance of being structured: instantaneous coalescence rates and human evolution--lessons for ancestral population size inference?

Authors: O Mazet; W Rodríguez; S Grusea; S Boitard; L Chikhi
Journal: Heredity (Edinb) Date: 2015-12-09 Impact factor: 3.821

4. Inferring coalescence times from DNA sequence data.

Authors: S Tavaré; D J Balding; R C Griffiths; P Donnelly
Journal: Genetics Date: 1997-02 Impact factor: 4.562

5. Ancient genomes link early farmers from Atapuerca in Spain to modern-day Basques.

Authors: Torsten Günther; Cristina Valdiosera; Helena Malmström; Irene Ureña; Ricardo Rodriguez-Varela; Óddny Osk Sverrisdóttir; Evangelia A Daskalaki; Pontus Skoglund; Thijessen Naidoo; Emma M Svensson; José María Bermúdez de Castro; Eudald Carbonell; Michael Dunn; Jan Storå; Eneko Iriarte; Juan Luis Arsuaga; José-Miguel Carretero; Anders Götherström; Mattias Jakobsson
Journal: Proc Natl Acad Sci U S A Date: 2015-09-08 Impact factor: 11.205

6. Robust and scalable inference of population history from hundreds of unphased whole genomes.

Authors: Jonathan Terhorst; John A Kamm; Yun S Song
Journal: Nat Genet Date: 2016-12-26 Impact factor: 38.330

Review 7. Revising the human mutation rate: implications for understanding human evolution.

Authors: Aylwyn Scally; Richard Durbin
Journal: Nat Rev Genet Date: 2012-09-11 Impact factor: 53.242

8. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history.

Authors: Carina M Schlebusch; Pontus Skoglund; Per Sjödin; Lucie M Gattepaille; Dena Hernandez; Flora Jay; Sen Li; Michael De Jongh; Andrew Singleton; Michael G B Blum; Himla Soodyall; Mattias Jakobsson
Journal: Science Date: 2012-09-20 Impact factor: 47.728

9. Bayesian inference of ancient human demography from individual genome sequences.

Authors: Ilan Gronau; Melissa J Hubisz; Brad Gulko; Charles G Danko; Adam Siepel
Journal: Nat Genet Date: 2011-09-18 Impact factor: 38.330

10. A method for genome-wide genealogy estimation for thousands of samples.

Authors: Leo Speidel; Marie Forest; Sinan Shi; Simon R Myers
Journal: Nat Genet Date: 2019-09-02 Impact factor: 38.330

3 in total

1. Warthog Genomes Resolve an Evolutionary Conundrum and Reveal Introgression of Disease Resistance Genes.

Authors: Genís Garcia-Erill; Christian H F Jørgensen; Vincent B Muwanika; Xi Wang; Malthe S Rasmussen; Yvonne A de Jong; Philippe Gaubert; Ayodeji Olayemi; Jordi Salmona; Thomas M Butynski; Laura D Bertola; Hans R Siegismund; Anders Albrechtsen; Rasmus Heller
Journal: Mol Biol Evol Date: 2022-07-02 Impact factor: 8.800

2. Multiple migrations to the Philippines during the last 50,000 years.

Authors: Maximilian Larena; Federico Sanchez-Quinto; Per Sjödin; James McKenna; Carlo Ebeo; Rebecca Reyes; Ophelia Casel; Jin-Yuan Huang; Kim Pullupul Hagada; Dennis Guilay; Jennelyn Reyes; Fatima Pir Allian; Virgilio Mori; Lahaina Sue Azarcon; Alma Manera; Celito Terando; Lucio Jamero; Gauden Sireg; Renefe Manginsay-Tremedal; Maria Shiela Labos; Richard Dian Vilar; Acram Latiph; Rodelio Linsahay Saway; Erwin Marte; Pablito Magbanua; Amor Morales; Ismael Java; Rudy Reveche; Becky Barrios; Erlinda Burton; Jesus Christopher Salon; Ma Junaliah Tuazon Kels; Adrian Albano; Rose Beatrix Cruz-Angeles; Edison Molanida; Lena Granehäll; Mário Vicente; Hanna Edlund; Jun-Hun Loo; Jean Trejaut; Simon Y W Ho; Lawrence Reid; Helena Malmström; Carina Schlebusch; Kurt Lambeck; Phillip Endicott; Mattias Jakobsson
Journal: Proc Natl Acad Sci U S A Date: 2021-03-30 Impact factor: 11.205

3. Philippine Ayta possess the highest level of Denisovan ancestry in the world.

Authors: Maximilian Larena; James McKenna; Federico Sanchez-Quinto; Carolina Bernhardsson; Carlo Ebeo; Rebecca Reyes; Ophelia Casel; Jin-Yuan Huang; Kim Pullupul Hagada; Dennis Guilay; Jennelyn Reyes; Fatima Pir Allian; Virgilio Mori; Lahaina Sue Azarcon; Alma Manera; Celito Terando; Lucio Jamero; Gauden Sireg; Renefe Manginsay-Tremedal; Maria Shiela Labos; Richard Dian Vilar; Acram Latiph; Rodelio Linsahay Saway; Erwin Marte; Pablito Magbanua; Amor Morales; Ismael Java; Rudy Reveche; Becky Barrios; Erlinda Burton; Jesus Christopher Salon; Ma Junaliah Tuazon Kels; Adrian Albano; Rose Beatrix Cruz-Angeles; Edison Molanida; Lena Granehäll; Mário Vicente; Hanna Edlund; Jun-Hun Loo; Jean Trejaut; Simon Y W Ho; Lawrence Reid; Kurt Lambeck; Helena Malmström; Carina Schlebusch; Phillip Endicott; Mattias Jakobsson
Journal: Curr Biol Date: 2021-08-12 Impact factor: 10.834

3 in total