Literature DB >> 35294015

Properties of 2-locus genealogies and linkage disequilibrium in temporally structured samples.

Arjun Biddanda¹, Matthias Steinrücken^1,2, John Novembre^1,2.

Abstract

Archeogenetics has been revolutionary, revealing insights into demographic history and recent positive selection. However, most studies to date have ignored the nonrandom association of genetic variants at different loci (i.e. linkage disequilibrium). This may be in part because basic properties of linkage disequilibrium in samples from different times are still not well understood. Here, we derive several results for summary statistics of haplotypic variation under a model with time-stratified sampling: (1) The correlation between the number of pairwise differences observed between time-staggered samples (πΔt) in models with and without strict population continuity; (2) The product of the linkage disequilibrium coefficient, D, between ancient and modern samples, which is a measure of haplotypic similarity between modern and ancient samples; and (3) The expected switch rate in the Li and Stephens haplotype copying model. The latter has implications for genotype imputation and phasing in ancient samples with modern reference panels. Overall, these results provide a characterization of how haplotype patterns are affected by sample age, recombination rates, and population sizes. We expect these results will help guide the interpretation and analysis of haplotype data from ancient and modern samples.

Entities: Chemical

Keywords: ancient DNA; linkage disequilibrium; population genetics

Mesh：

Year: 2022 PMID： 35294015 PMCID： PMC9245597 DOI： 10.1093/genetics/iyac038

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.402

Introduction

Multilocus properties of genetic variation have been useful for studying evolutionary processes and maximizing the information extracted from population genetic data. Patterns of multilocus variation are shaped by mutation and recombination events, generating novel combinations of alleles on chromosomes (i.e. haplotypes). The nonrandom association of alleles between 2 (or more) loci is known as linkage disequilibrium (LD; Lewontin and Kojima 1960; Hill and Robertson 1968; Slatkin 2008). Common measures of LD include the covariance and correlation in allelic state at 2 loci on the same haplotype within a sample (D and r2, respectively; Hill and Robertson 1968; Slatkin 2008). The decay of LD as a function of the distance between genetic variants plays an important role in dating evolutionary events (e.g. Moorjani ), determining the accuracy of complex trait prediction (e.g. Vilhjálmsson ), and moderating the power to map trait-associated loci (e.g. Wray 2005; Spencer ). One approach for modeling variation at multiple loci has been through the use of coalescent theory (Kingman 1982; Hudson 1985). The coalescent process at multiple loci can involve both recombination (splitting events) and coalescence (joining events) of ancestral lineages, which means that there can be a different number of lineages ancestral to a sample at each locus at a given point in time (Hudson 1985; Simonsen and Churchill 1997; Durrett 2008). Based on a 2-locus coalescent model, Hudson (2001) developed a composite likelihood approach to estimate fine-scale recombination rates in early sequencing datasets. This initial approach paved the way for subsequent methods to estimate fine-scale recombination rates in humans, accommodating increasing model complexity (McVean ; Auton and McVean 2007; Kamm ; Spence and Song 2019). Also using a 2-locus coalescent model, McVean (2002) was able to express metrics of LD in terms of properties of coalescent times. As the impact of changing demographic history on coalescent times is relatively straightforward, this advance enabled a more intuitive understanding of the impact of demographic history and sampling design on expected patterns of LD in data (McVean 2002; Wakeley and Lessard 2003). A second major modeling framework for LD has been via “haplotype copying” models, such as the Li and Stephens’ model (Li and Stephens 2003; Song 2016). Haplotype copying models provide a computationally efficient approximation for the likelihood of observed haplotype data generated with recombination (Fearnhead and Donnelly 2001). As a result, they have become a backbone of many analyses of population-genomic data, such as genotype imputation (e.g. Howie ), haplotype phasing (e.g. Loh ), and local ancestry inference (Price ; Lawson ). In an increasing number of settings, samples are not all from the same time point. This is exemplified by the growing study of archeogenetics, also known as ancient DNA (aDNA) studies (reviewed in Slatkin and Racimo 2016; Llamas ; Skoglund and Mathieson 2018). Archeogenetic studies of humans have been able to reliably obtain genetic data from samples up to 45,000 years before present, although the majority of samples are from the past ∼15,000 years (Skoglund and Mathieson 2018). For single-locus data, genealogical models have been developed to quantify the impact of ancient samples on population genetic statistics, such as the expected site-frequency spectrum, the number of variants private to an ancient sample, and F (Rodrigo and Felsenstein 1999; Forsberg ; Ortega-Del Vecchyo and Slatkin 2018). In contrast, the impact of time-separation on patterns of LD has not been fully explored. Here, we characterize patterns of haplotype variation in temporally stratified samples using a genealogical perspective. Analogous approaches for time-stratified samples in a coalescent framework have generally not been developed for the case of 2 or more recombining loci. One exception is the approach of Dialdestoro that uses importance sampling over the space of latent ancestral recombination graphs when calculating the likelihood of observed sequence data for haplotypes at multiple time-points. Our work here contrasts to that of Dialdestoro in that we obtain analytic solutions for 2-locus scenarios and for the haplotype copying model. The work presented here is complementary to previous work by Terhorst who modeled how allele frequencies change for multiple loci using a Gaussian approximation to the Wright–Fisher model, though here we approach the problem from a coalescent perspective. We primarily consider statistics based on 2 haplotypes as a starting point for representing the impact of time-stratified sampling across multiple loci. However, we also explore the statistic , whose properties can be understood as an expectation over 4 haplotypic states. We focus on these simplified scenarios as they are analytically tractable, while still providing insight on expected patterns in data (Hudson 1985; McVean 2002). We first show how time-stratified sampling affects the joint properties of genealogies at 2 loci, demonstrating that the time gap between a pair of samples has an impact on the rate of decay in the correlation of genealogical statistics and corresponding patterns of variation with recombination distance. We also analyze the behavior of fitting the haplotype copying model with samples of different ages, in particular when the test haplotype is from a time-point in the past compared with a modern haplotype panel. Overall, our results show the effect of time-stratified sampling on expected patterns of haplotypic variation, and their implications for the further development of population genetic methods.

Methods

Coalescent simulations and calculation of pairwise-differences

We used (Kelleher ) to perform all coalescent simulations used throughout the article. For simulations of 2 loci, we used a customized recombination map to reflect 2 nonrecombining loci of a given size separated by a specified absolute recombination rate. For the simulations of haplotypes, we use the default simulation method and a uniform recombination map (default per-basepair per-generation). To calculate a pairwise-coalescent effective N to compare our constant-population-size theory for 2 loci with simulations under varying demographic history, we took a Monte-Carlo approach using 104 coalescent simulations to compute the mean marginal pairwise coalescent time from simulations and compute as .

Monte-Carlo simulation of correlation in pairwise differences

To verify our comparisons of the theoretical prediction of with data, we simulated 2 loci as described above with a mutation rate (approximately equivalent to a 1-kb window with human scale parameters) for 100 log-spaced points from . When estimating , we conducted 100,000 independent simulations and estimated the Pearson correlation using the pearsonr function in the scipy package (Virtanen ). The standard error of the correlation was calculated using the asymptotic formula: . For estimating the correlation in pairwise differences, we simulated 20 replicates of 20 Mb haplotypes and calculated a Monte-Carlo estimator of the mean correlation in segregating sites at different recombination distances. The estimation proceeds as follows: (1) we split the chromosome into nonoverlapping windows of length L basepairs (default: 1 kb); (2) for each of 5,000 Monte-Carlo samples we choose a window S and define a paired window a recombination distance r from it (randomly choosing the direction to search); (3) compute the empirical Pearson correlation coefficient of the number of pairwise differences across the 5,000 paired windows. Standard errors were computed using the asymptotic formula above, using the 20 replicate chromosomes. For estimation with the real whole-genome sequencing data, we use 30 log-spaced bins over the range , where r is in Morgans to calculate Monte-Carlo estimates of the correlation in pairwise differences. Unless otherwise specified in the text, error bars reflect 2 standard errors from the mean. When translating from years to generations for comparison of models to our theoretical predictions, we use a generation time of 30 years per generation from Fenner (2005).

Monte-Carlo estimation of joint LD

To estimate the product of LD across timepoints (Equation 7), we used Monte-Carlo simulations of 500 modern and ancient haplotypes in a model of constant population size of . We conducted 10 replicate simulations of 1 megabase haplotypes with the mutation rate and recombination rate set to per basepair per generation. We applied a filter of the minor allele frequency pooled across timepoints at when calculating the joint LD coefficient. We additionally bin by genetic distance using the automatic histogram binning in scipy (Virtanen ). For very low values of ρ, there are too few mutations co-occurring at such short distances in our simulations so we set a lower-bound of ρ = 1 when plotting Fig. 5.

Fig. 5.

Joint product of LD between samples separated by t generations across different population-scaled recombination rates ρ (see Methods). Dots represent results from simulation and solid lines are theoretical predictions from Equation (7).

Analysis of ancient whole-genome sequencing data

For our analysis of whole-genome aDNA data, we compared single nucleotide variants observed in the LBK and Ust-Ishim samples (Lazaridis ; Fu ). Variants were called using and were subsequently filtered using the same criterion as in de Barros Damgaard . To account for not having resolved haplotypes in the ancient samples, we scale the observed differences by the probability that they would be observed in a haplotype randomly sampled from the diploid genome (e.g. 0.5 if heterozygote in ancient sample, 1 if opposing homozygote in the ancient sample). For modern samples, we used haplotypes from the 1000 Genomes Project Phase 3 Dataset (Auton ). We computed the correlation in pairwise differences in nonoverlapping 1-kb windows and applied a mappability mask to account for varying coverage in the modern sample by normalizing (Auton ). Standard errors were estimated using a nonparametric bootstrap across 22 autosomes. To compare 2 empirical curves of , we apply a 2-sided Binomial sign test to test the proportion of recombination distance bins for which 1 ancient sample has a higher correlation and test against the null hypothesis that the proportion is 0.5.

Parameter estimation in the haplotype copying model

We implemented a version of the haplotype copying model proposed by Lawson that accounts for the genetic map distances between subsequent single-nucleotide polymorphisms. The Hidden Markov Model (HMM) is defined as follows. The transition probabilities between hidden states, X, where X represents the haplotype in the panel that the test haplotype copies off of at site l: where g is the genetic distance between markers l−1 and l (in Morgan), K is the size of the haplotype reference panel, and λ is the “jump rate” or rate at which the model transitions between the haplotype copying states. The emission probabilities can be similarly characterized, using a parameter ϵ that represents the probability of a copying error: where h is the allelic state of the query haplotype at site l. We use 2D numerical optimization from (Virtanen ) to jointly estimate the maximum-likelihood estimates and . Unless specifically stated, we use the joint parameter estimates in our results for both simulated and empirical data. For profile maximum-likelihood estimates of , we use Brent optimization within the range [] with a fixed . We estimate standard errors for and using a finite-difference approximation to the second derivative of the joint log-likelihood surface. All simulations under the haplotype copying model were conducted using chromosomes of 40 megabases, and recombination and mutation rates of per basepair per generation. Every modern panel consisted of K = 100 haplotypes (unless otherwise specified). We also ascertained to variants with a minor allele frequency in the modern panel.

Analysis of male X-chromosomes in 1,240K human aDNA dataset

The human aDNA data that we used for our analysis of the haplotype copying model (see Online Resources) are typed at a set of 1,233,013 sites across the genome and downloaded from the David Reich Laboratory’s website. Genotypes are drawn using psuedohaploid sampling based on the available reads at these sites. We filtered the data based on the following criteria for our analysis while restricting to the X chromosome: (1) Must be a male sample; (2) Samples must not have a significant amount of modern DNA contamination (e.g. “PASS” contamination checks); and (3) Samples must have ≥8,000 nonmissing variants across the X chromosome. Following this filter, the median autosomal coverage for the remaining samples is , and an average of 1.29 sites per 25 kb on the X-chromosome. Following these filters, we have a total set of 798 samples for which we estimated the maximum-likelihood jump rate under the haplotype copying model. To minimize confounding via spatial variables, we chose a centroid location (48°N latitude, 6°E longitude) and only retained samples within 1,500 km of this centroid. Following this filtering step, there are 344 samples that are used for the main figures (Fig. 7).

Fig. 7.

Comparison of estimated haplotype copying jump rates between real data and simulations. a) Estimate of the jump rate in ancient male X-chromosomes within 1,500 km of central Europe. b) Maximum likelihood estimates of the haplotype copying jump rate using simulated X-chromosomes under the model of Tennessen . c) Estimated jump rates using simulated data under the model of Browning and Browning (2015).

We performed estimation of the haplotype copying jump rate across all of the 798 originally filtered samples using 3 different haplotype reference panels (49 CEU haplotypes [“CEU”]; 240 EUR haplotypes [“EUR”]; 1,233 haplotypes [“FULLKG”]) for the X-chromsome from the 1000 Genomes Phase 3 dataset (Auton ). In all cases, we used the sex-averaged recombination map for the X-chromosome from Kong . For linear modeling of the jump-rate as a function of the sample age, we used the OLS function of statsmodels package (Seabold and Perktold 2010). When comparing the real data against simulations under the demographic models inferred by Tennessen and Browning and Browning (2015), we use n = 49 modern day CEU haplotypes and sampled haplotypes at ages corresponding to the real data using a generation time of 30 years per generation (Fenner 2005). We additionally scaled each demographic model by 3/4 to reflect the reduced effective size of the X-chromosome.

Results

Two-locus genealogical properties

To model 2 haplotypes at 2 loci with time-stratified sampling, we adapted a previously developed continuous time Markov process for modeling ancestral lineages at 2 loci (Hudson 1983, 1990; Simonsen and Churchill 1997). The states in the model are triplets [e.g. (2, 0, 0)] that depict the number of lineages ancestral to both loci, locus 1, or locus 2, respectively. Coalescence and recombination events eventually lead to an absorbing state where both haplotypes have coalesced at both loci [the state (1, 0, 0), Fig. A1]. Analytical results for joint moments in the coalescent times in this model have been previously obtained for the case where samples are taken at the present (Hudson 1983; Simonsen and Churchill 1997; Durrett 2008, Chapter 3).

Fig. A1.

Markov chain model for the ancestral process at 2 loci from Simonsen and Churchill (1997). In all settings for 2 modern haplotypes, we assume that we start from the state in the middle (state “0”) in all applications, which means that all sampled haplotypes are coupled. The parameter η represents the coalescent rate and the parameter ρ represents the recombination rate (measured in coalescent units). Figure adapted from Hobolth and Jensen (2014).

a) Schematic of genealogies at 2 loci separated by a population-scaled recombination distance ρ (). The parameter t represents the sampling time of the haplotype (measured in coalescent units, i.e. scaled by ). The random variables T and T are the additional time to coalescence at locus A & B, after t. b) The probability of the modern haplotype being “uncoupled” at the time of ancient sampling as a function of t and ρ. In this setting, “uncoupled” means that the ancestral lineages at locus A and B are not on the same haplotype, enhancing the probability of different T and T occurring at each locus. Here, to analyze the case of time-stratified sampling, we assume that one of the haplotypes has been sampled at time t in the past (in coalescent units) and the other at the present. With this time gap in sampling, there are 2 natural phases in the ancestral process: (1) the time between the present and when the ancient haplotype is sampled (), and thus only the lineage of the modern haplotype can evolve at each locus, and (2) the time when the lineages of both haplotypes (modern and ancient) are evolving through the full state space of the ancestral process (). For this 2-phase ancestral process, we derived expressions for the covariance between the T at 2 loci (A and B), as well as the total branch lengths (L, L) separated by a population-scaled recombination distance, , where r is the per-generation probability of recombination. The derivation proceeds by recognizing that a key aspect of the 2-phase process is the effect of recombination during the first phase, when only the modern lineage is evolving backwards in time (, see Appendix A). During this phase the process has only 2 states, “uncoupled” and “coupled.” By “uncoupled,” we mean that the ancestral lineages are evolving independently at each locus, whereas “coupled” means that they are evolving as a joint ancestral lineage. The starting state for the second phase of the ancestral process (when ) is either that the modern haplotype’s ancestral lineages are coupled at both loci or uncoupled from one another. We obtain the time-dependent probability of being in the uncoupled state by exponentiating the 2 × 2 rate matrix Q for the reduced state-space of the ancestral process during , where . By doing so and taking different limits, we find: Figure 1b shows for either large time-separation (t) or large population-scaled-recombination rates (ρ), it becomes more likely that the modern haplotype is in the uncoupled state by the time the process encounters the ancient haplotype. Since the remaining dynamics are the same as the 2-locus ancestral process with 2 contemporaneously sampled haplotypes, we thereafter leverage known results for the 2-locus ancestral process (Simonsen and Churchill 1997; McVean 2002; Durrett 2008, Chapter 3). In the next 2 sections, we take this modeling approach to derive the expectations of observable quantities from time-staggered haplotype data.

Fig. 1.

Correlation in pairwise differences

The number of pairwise differences between 2 haplotypes at each of 2 loci is an observable summary of genetic variation at linked loci in time-sampled sequence data. To investigate the properties of the joint distribution on pairwise differences at 2 loci (locus A and B), we continue to assume a model with recombination occurring at a rate ρ between them and no recombination occurring within each. For each locus, as is typical in coalescent models, we assume an infinite-sites model with mutations arising on each lineage as a Poisson process with rate , where , μ is the per-basepair per-generation mutation rate, L is the size of the locus (in basepairs), and N is the effective population size. Following the approach described in the preceding section, we derive the correlation of pairwise differences for the case with time-stratified sampling (see Appendix A). In particular, we use the fact that the correlation in the number of pairwise differences at locus A and B can be expressed in terms of the correlation in the total branch length between the loci (Wakeley and Lessard 2003; Hobolth ). We find the correlation in pairwise differences between 2 loci to be: where is the correlation in total branch length at locus A and locus B. In Appendix A (building on previous results from Hudson 1983; Simonsen and Churchill 1997; Durrett 2008, Chapter 3), we derive its exact form and several limiting values to be: As the equations show, the correlation in pairwise differences is affected by the age of the ancient sample t in 2 ways. The first effect is due to the factor in Equation (4) that decreases as t increases and is not dependent on ρ, which can be seen in Fig. 2 by the decrease for t = 10,000 against t = 0 for very small ρ. We note that the difference between t = 10,000 and t = 0 in Fig. 2 is more pronounced than between 1,000 and 0, because t in Equation (4) is on the coalescent scale. The second effect occurs in how t affects (Fig. 2). For values of , the correlation decays linearly with t and with for ρ. The decay decreases more rapidly as when and as t gets large (the third case in Equation 5). This is because of the additional time (t) that the recombination process has to break apart the shared genealogical history at each locus.

Fig. 2.

Theoretical (solid lines) and simulated correlation between pairwise differences in a constant-size demography () at different sample ages (in generations). Comparison of theoretical prediction of with data from 2-locus coalescent simulations with (see Methods). Solid blue and orange lines are the theoretical predictions for from Equation (4).

The impact of nonequilibrium demographic history on the correlation in pairwise differences

To explore the effects of varying population size through time, we simulated haplotype data under models of constant size, instantaneous growth, and trajectories inferred from previous studies of human populations that include both bottlenecks and growth (Tennessen ; Browning and Browning 2015; Fig. 3). Motivated by how most human aDNA data are from approximately the last 15,000 years, we investigated the correlations on a timescale of 500 generations.

Fig. 3.

The impact of varying demographic history on the correlation in pairwise differences at 2 loci. For all simulations, the recombination rate between the loci was set to per generation (∼10 kb, assuming 1 cM per 1 Mb). Simulated scenarios include: a) constant population size, b) inferred models of population growth, and c) models of instantaneous population growth. Each timepoint had 50,000 replicate simulations. In models with constant population size, larger population sizes lead to smaller inter-locus correlations (lower LD). In all our simulations , so on the time-scale of 500 generations, the correlation in branch length decreases linearly as expected with sampling age (Equation 4, Supplementary Fig. 2A). Across all population sizes, we observe significantly negative relationships between sample age (on the coalescent scale) and the correlation in branch length akin to what we predict in Equation (4) (for linear regression of , we find for ). The negative effect of t on the correlation in total branch length in turn decreases the correlation in pairwise differences (Fig. 3a). When simulating under the population size trajectories from Tennessen or from the Browning and Browning (2015), “UK10K IBDNe model” in reference to the original dataset, the correlations are smaller than the UK10K IBDNe model, which includes a larger population size in the last few generations but an overall N (estimated using Watterson’s estimator, see Methods) that is smaller than the Tennessen model (; Fig. 3b). In a linear model, the correlation in pairwise differences decreases with age under the UK10K IBDNe model [, 95% CI = (−0.51, −0.31)] and not in the Tennessen model [, 95% CI: (−0.03, 0.12)]. For the case of step-wise population growth (Fig. 3c), we make 3 observations. First, the decrease in the correlation in pairwise differences is no longer approximately linear with time but decays nonlinearly, with the rate of decay decreasing with sample age. Second, the correlation in pairwise differences is highest at short time-scales for the most recent growth event, and at long-timescales for the most ancient growth event. This can be interpreted again as a result of the very low N in this setting such that the factor scaling the correlation in pairwise differences (Equation 4) dominates the behavior after generations (when the correlation in branch length is similar across all settings). Third, the correlation in the branch length is substantially higher () when compared with the previously inferred demographies (Supplementary Fig. 2). The step-wise growth scenario is interesting in that due to the large, recent increase in population size, we expect roughly star-like genealogies with coalescent times concentrated around the start of the growth event (Slatkin 1996; Rosenberg and Hirsh 2003). In this scenario, we find the correlation between loci in the branch lengths is increased greatly (Supplementary Figs. 2C and 8) which contributes to elevating the . At the same time, as θ is decreased relative to other scenarios (due to lower N), we do not see as drastic an increase in the correlation between pairwise differences as in the branch length (Equation 4). Intuitively, as N, the correlation in total branch length between loci increases as the coalescent rate increases if the recombination rate is held fixed; lowering N also decreases θ, which increases the correlation in pairwise differences between loci. Finally, we also investigated the correlation in pairwise differences in a 2-population model of divergence without gene flow. We assume the modern and ancient haplotype are each sampled from different populations. In this scenario, both the ancient and modern haplotypes can become uncoupled prior to any possibility of inter-haplotype coalescence lowering the expected correlation in pairwise diversity (Appendix A). In this model, we find the correlation in number of pairwise differences decreases as a function of the sum of the divergence time and the sampling time (; Supplementary Fig. 1).

Correlation of pairwise differences in time-staggered whole-genome sequencing data

Next, we explored the correlation of pairwise differences in modern and ancient human whole-genome sequencing data with 2 high-coverage samples from 2 different ages. We restricted to analyzing high-quality whole-genome sequencing data to avoid ascertainment biases and to more accurately estimate pairwise differences (see Methods; Fig. 4).

Fig. 4.

Comparison of the correlation in pairwise differences between LBK, Ust-Ishim, and a modern CEU control individual. Points represent the estimate of the pairwise correlation between randomly chosen pairs of loci (see Methods). When computing the theoretical curves, we used and a mutation rate per basepair-per-generation. The first sample we chose is an ∼7,000-year-old sample from modern-day Germany associated with the Linear Ban Keramic (LBK) culture and labeled variously in previous studies as the Stuttgart LBK sample or simply the LBK sample (Lazaridis ). The second sample is ∼45,000 years old and from Western Siberia, labeled Ust-Ishim (Fu ). These samples have an order of magnitude difference in the sampling time-scale (thousands vs tens-of-thousands years). To investigate the correspondence of our theory with empirical data, we compared the correlation in pairwise differences across our 2 empirical samples to the theoretical predictions from Equation (4). We find that for recombination rates < Morgans, the scale and rate of decay of the empirical curves are consistent with the theoretical predictions (Fig. 4). However, there is a larger deviation between the empirical results and theoretical predictions at longer recombination distances (), where in observed data there is an excess of correlation in pairwise differences (Fig. 4). The extended decay of that we see in real data is not present in data simulated under the model of (Tennessen ; Supplementary Fig. 4A) or under a constant-sized demography (Supplementary Fig. 4B), suggesting that the extended decay is not attributable to demographic history alone and warrants further study.

LD with time-stratified sampling

To directly relate the joint genealogical properties described above to patterns of LD, we investigated the normalized expected product of LD (D) between the ancient and modern samples: where is the frequency of the derived allele at the first locus at time t and is a classic measure of LD in the sample of individuals from time t (Lewontin and Kojima 1960). Using the genealogical identity coefficients from McVean (2002), we derive the ratio of the expectations of the product of LD between time-points. Motivated by arguments put forth by McVean (2002) and Ragsdale and Gravel (2019) that express statistics of LD by taking the ratio of expectations (i.e. ), we take the ratio of expectations of in Equation (6) to derive a time-stratified analog of . Similar to , we stress that our statistic is not directly equivalent to —is an approximation that can become poor for loci at low-frequencies McVean (2002). In Appendix B, we derive an expression for the joint product of LD across both timepoints (): when t = 0, Equation (7) reduces to the expression for , as shown in McVean (2002). Both simulations and our theoretical predictions show that larger time-separation between samples qualitatively decreases the joint product of LD (Fig. 5). Joint product of LD between samples separated by t generations across different population-scaled recombination rates ρ (see Methods). Dots represent results from simulation and solid lines are theoretical predictions from Equation (7).

The impact of time-stratified sampling in haplotype copying models

We next consider the scenario where one would be interested in modeling an ancient haplotype as a mosaic of modern haplotypes, as might arise when trying to phase or impute aDNA genotypes using a reference panel of modern haplotypes and the popular Li and Stephens haplotype copying model (Li and Stephens 2003; Song 2016). We specifically use a modified model where the recombination map positions are known a priori (see Methods;Lawson ). We focus on the maximum-likelihood estimate of the haplotype copying jump rate () for a given test haplotype as it copies off the reference panel. We view partly as a summary statistic reflecting the length scale of copying tracts and as an indicator of the expected accuracy of imputation (Stephens and Scheet 2005; Jewett ). The time-separation between the ancient haplotype and modern sample provides an opportunity for recombination events to occur among the modern reference haplotypes before the ancient lineage is able to coalesce with any individuals from the modern panel (Equation 3; Fig. 1). Thus, we expect higher jump rates as the sample age t increases. We also expect coalescence within the modern panel will contribute to higher jump rates with increasing t by effectively reducing the panel size moving farther back in time. Using the first time coalescence between the ancient target and a member of the modern panel, we observe a saturation effect when increasing the modern panel size (Appendix C and Supplementary Fig. 5). The time until the first coalescent event involving the ancient sample is equal to the length of the external branch in the local genealogy that leads to the ancient sample, and affects the rate of recombination events that can induce switch events in the copying model. The time to the first coalescent involving the ancient sample and the modern panel decreases as a function of the reference panel size, K. However, as the age of the sample increases, the number of lineages extant to the reference sample becomes smaller, making the time to first coalescent event more similar across modern reference panel sizes. Using simulations with populations of constant size, we find that the realized copying jump rate indeed increases with age, and does so monotonically as a function of the age of the test haplotype under a model of constant population size (Fig. 6a). The simple monotonic relationship can break down in nonequilibrium demographic models. For instance, in demographic models including recent population growth for European populations, we find that there is an initial decrease in from the present to ∼150 generations ago before a more rapid increase moving back into the past (Fig. 6b;Tennessen ; Browning and Browning 2015). A similar result is observed more dramatically in simulations of instantaneous growth, with a common feature being a decreasing relationship between and sample age up to the time of onset of instantaneous growth, reflective of the effect of a strong conditioning on the coalescent time (Fig. 6c andSupplementary Fig. 8).

Fig. 6.

Estimation of haplotype copying jump-rate against sample age for different models of population demographic history (top row). a) constant population size, b) previously inferred models of recent population growth, and c) models of instantaneous population growth. The inferred parameters should be interpreted in terms of the average jumps per Morgan.

Haplotype copying jump-rates in human aDNA data

To compare our simulation experiments on the dependence of the jump-rate with sampling time to empirical data, we applied our jump rate estimation to a collection of 1,159 ancient human samples (see Methods). To avoid potential errors introduced by statistical phasing, we analyzed only haploid carriers of the X chromosome by taking samples labeled as male in both the ancient data and the modern reference panel (1000 Genomes Project data; Auton ). Thus, the analysis used 47,094 bi-allelic SNPs observed on the X chromosome. To avoid the potential effects of population structure confounding the impact of time-stratified sampling and to maximize the sample size, we focus primarily on Europe as it is the region with the highest density of aDNA samples, and we used n = 49 CEU male X chromosomes to define the modern reference panel (see Supplementary Fig. 6 for experiments with alternate panels). Based on copying jump rates estimated across 344 ancient male X-chromosome samples from across Europe (see Methods for a description of the dataset), we find that the estimated jump rate decreases as a function of sample age (Fig. 7a). Accounting for spatial variables (Latitude, Longitude, and Latitude × Longitude) in a linear model (see Methods and Supplementary Fig. 6), we find the effect of sample age on the estimated copying jump rate is negative (; 95% CI = (−0.63, −0.46)]. Filtering for the highest 25% coverage individuals did not change the result (Supplementary Fig. 9). The inferred haplotype copying error rate (ϵ) also decreases with age, suggesting the observed decrease in λ is not an artifact of the inference procedure (Supplementary Fig. 10). Comparison of estimated haplotype copying jump rates between real data and simulations. a) Estimate of the jump rate in ancient male X-chromosomes within 1,500 km of central Europe. b) Maximum likelihood estimates of the haplotype copying jump rate using simulated X-chromosomes under the model of Tennessen . c) Estimated jump rates using simulated data under the model of Browning and Browning (2015). This decrease is contrary to our idealized simulations with constant population size (Fig. 6a) and in agreement with the simulations involving some aspect of recent growth (Fig. 6, b and c). To make the comparison more exact, we replicate simulations of Tennessen and Browning and Browning (2015) with the exact temporal sampling structure of the real 344 samples and using a sex-averaged recombination map for the X chromosome (Kong ). With these simulations, we are able to replicate an initial decrease in the jump-rate as a function of sampling time (Fig. 7, b and c). However, the simulations do not capture the duration of the decrease in jump-rate with sample age, which we find to be generations in the real data.

Discussion

In this article, we have developed theory to understand the effects of serial sampling on patterns of haplotype variation in the context of 2 models, the 2-locus coalescent model and the haplotype copying model. Both of these models are used to describe patterns of LD in population genetic data, and share several features with one another. Both models capture the relationship between recombination distance and the breakdown of LD, but the 2-locus genealogies consider patterns only at 2-loci whereas the haplotype copying model considers a multilocus perspective. It should also be noted that the 2-locus genealogical model explicitly considers the time of coalescent and recombination events, whereas the haplotype copying model, in the form used here, does not consider the timing of particular events. However, in spite of their differences, they both have wide relevance in that they provide theoretical results for the expected patterns of linked variation, underlying standard approaches to analyze modern haplotype data. In the 2-locus coalecenscent, we find that with larger time-separation between samples, the correlation in branch length at 2 loci decreases by an amount proportional to the probability of uncoupling of a sampled modern haplotype over t units of time (Equation 4). In constant-size populations and small values of , the decrease is linear in time. As t increases the decay of correlation in branch lengths to occur with order ) vs ). Intuitively, the additional marginal branch length on which a recombination event can occur ( vs 2 in expectation) is disrupting between-locus correlation. Demographic history also shapes the correlation in branch length between loci, with increasing as N decreases due to a decrease in the variance in coalescent times (Supplementary Fig. 2). For larger values of t there is an additional decrease in the correlation of pairwise differences between loci, , that arises from the impact of mutations (the denominator of Equation 4). For small values of t ( coalescent units) the correlation of branch length essentially determines the behavior of the correlation in observable number of differences between 2 loci. The expected joint LD coefficient between data sampled at different times decreases across all recombination scales in the simulations and the theoretical derivations (Fig. 5). However, it is important to note that our simulations here represent an idealized scenario with a large number of ancient haplotypes (n = 500) and no genotyping error. Therefore, it will be of further interest to determine if statistics such as the joint LD coefficient may be informative for demographic inference, while accounting for potential error modes from realistic data sources. Our analysis of the haplotype copying rate revealed interesting impacts of demographic history. In constant-size models, the inferred copying rate increased with the sample age as one might expect due to recombination events; however, in cases of strong recent population growth (Tennessen ; Reppell ; Browning and Browning 2015) the inferred copying rate decreases initially with age and then increases. To understand this, consider how the haplotype copying jump-rate, , is inversely related to the expected branch-length shared between an ancient haplotype and a member of the modern panel, because recombination events that occur on these branches can initiate copying-switch events (Li and Stephens 2003; Paul ; Steinrücken ). In cases with rapid population growth, there are initially limited numbers of coalescent events, followed by a high rate when the population is small, looking backwards in time. Samples that are sampled sequentially closer to the onset of growth have shorter branch length on which potential switch events occur, producing the initial negative relationship. For samples that are sampled more ancestrally than the onset of population growth, we find that the jump rate increases as the coalescent time are no longer affected by the onset of growth (Fig. 6c). Our empirical analysis of aDNA data from western Eurasia supported a negative relationship between the haplotype copying rate and sample age. In contrast with the demographic models simulated, the empirical data show an extended decrease in the jump rate, reaching over ∼400 generations. Similar discrepancies arise when comparing the correlation in pairwise differences in empirical data (Fig. 4). We consider 2 potential explanations for the discrepancy between simulations and observations: unmodeled aspects of population demographic history not captured by existing models used for simulation or aDNA data artifacts. Throughout our experiments for both the haplotype copying rate and correlation in pairwise differences, we found that demographic models capturing more detail of recent Eurasian history did not adequately predict either statistic. However, there may still be potential unmodeled aspects of relevance to our statistics here. For example, the duration of the decrease in the estimated copying rate could be due to smaller local population sizes in the more distant past than is reflected in the models. This is particularly relevant given the time-scale of ∼400 generations (∼12,000 years) as this extends into the Mesolithic and Paleolithic eras during which populations were likely small in overall size and deeply structured (Premo and Hublin 2009; Haak ; Skoglund and Mathieson 2018). If ancestral population structure existed in this period, it may have biased inferred effective population size upwards in models that were fit under the assumption of a single panmictic population (Li and Durbin 2011, Supplementary Section 1.6). We also recognize that due to population turnover, the proportion of ancestry directly ancestral to the modern reference panel may fluctuate as a function of time due to population turnover, leading to temporal patterns in the jump rate. Regarding the aDNA data, in our empirical analysis, we do not find any significant effects of coverage on the qualitative result that the jump-rate decreases as a function of time (Supplementary Fig. 9B). If error rates increase with sample age it would seem to run counter to the observed result, causing elevated jump rate estimates as one goes further back in time; however, this is not what we observe in our joint estimation (Supplementary Fig. 10). Some complex form of reference bias increasing with age and interacting with the haplotype copying model may be plausible. Overall, the result suggests there may be interesting insights to be gained by more detailed empirical analyses of haplotypic patterns in aDNA. Many methods have been developed in the context of haplotype copying models, from imputation and phasing (e.g. Howie 2009), estimation of recombination rates (e.g. Li and Stephens 2003), to fine-scale ancestry estimation (e.g. Lawson ). Our theoretical results leave important considerations for each of these application domains with serially sampled data. For imputation and phasing, the increase in the copying jump rate as a function of time under constant population sizes implies that LD will be lower in relation to the first coalescent time with a member of the modern panel, and will lower the copying accuracy at longer genetic distances (Appendix C; Jewett ). For samples that are sufficiently old, there is a diminishing benefit for generating larger modern reference panels (Appendix C), which primarily results in improvements in imputation and phasing for modern samples due to recent relatedness (Jewett ; McCarthy ). Our exploration of the impact of population demography (particularly population growth) and our empirical analysis of the male X chromosome paints a more optimistic picture for the analysis of human aDNA using the haplotype copying model. We find that there is a substantial attenuation of the increase in the haplotype copying jump-rate () under scenarios of recent growth, and even potential decreases in the case of instant population growth (Fig. 6). Together with our empirical result of the jump rate decreasing as a function of time across male X chromosomes in ancient European samples (Fig. 7), the results support the idea that we may be able to impute common variants relatively accurately in human populations that have undergone recent rapid growth. Indeed, the empirical accuracy of imputation is relatively high for samples within the past ∼6,000 years (Gamba ; Martiniano ). In addition to the “reference-based” phasing we have explored in this work, methods that iteratively sample haplotypes from the input genotypes have advantages for phasing aDNA when modern reference panels lack the haplotype and allelic diversity present in ancient samples (e.g. Rubinacci ). We leave this comparison of phasing and imputation accuracy from exclusively reference-based models with the addition of iterative haplotype sampling for future work, though we expect some of the insights gained here will help this exploration. As caveats, our theoretical results here do not account for some important features of aDNA data. Specifically, we have not attempted to model genotyping error and low-coverage data, both common in the analysis of aDNA (e.g. Dabney ). Our results on pair-wise loci could be extended to directly model the effects of errors at one or both loci. Methods using haplotype copying HMMs with emission probabilities directly modeling low-coverage sequencing data (e.g. Rubinacci ) are more applicable to account for this sparsity in aDNA analysis. Another caveat is that due to the wide temporal range and the absolute number of samples available (Olalde and Posth 2020), our empirical analyses focused on samples from western Eurasia. As aDNA technology improves and sampling becomes less centered on western Eurasia, it will be interesting to reanalyze the relationship between the jump-rate and sample age across multiple regions with varied demographic histories. With the abundance of aDNA data being generated across a wide array of organisms, statistical and theoretical advances will need to similarly account for this new dimension in the data. Here, we have highlighted the impact of time-stratified sampling for 2 related models, the 2-locus coalescent with recombination and the haplotype copying model. We expect that our theoretical treatment of these models will serve to inform advances in statistical population genetic methods that account for serially sampled data to maximize their utility for inference.

Data availability

All results in this article can be reproduced directly from repositories specified in the Online Resources. Supplemental material is available at GENETICS online. Click here for additional data file.

61 in total

1. Two-locus sampling distributions and their application.

Authors: R R Hudson
Journal: Genetics Date: 2001-12 Impact factor: 4.562

2. Estimating recombination rates from population genetic data.

Authors: P Fearnhead; P Donnelly
Journal: Genetics Date: 2001-11 Impact factor: 4.562

3. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation.

Authors: Matthew Stephens; Paul Scheet
Journal: Am J Hum Genet Date: 2005-01-31 Impact factor: 11.025

4. Recombination rate estimation in the presence of hotspots.

Authors: Adam Auton; Gil McVean
Journal: Genome Res Date: 2007-07-10 Impact factor: 9.043

5. The impact of accelerating faster than exponential population growth on genetic variation.

Authors: Mark Reppell; Michael Boehnke; Sebastian Zöllner
Journal: Genetics Date: 2013-12-30 Impact factor: 4.562

6. Efficient phasing and imputation of low-coverage sequencing data using large reference panels.

Authors: Simone Rubinacci; Diogo M Ribeiro; Robin J Hofmeister; Olivier Delaneau
Journal: Nat Genet Date: 2021-01-07 Impact factor: 38.330

7. Two-Locus Likelihoods Under Variable Population Size and Fine-Scale Recombination Rate Estimation.

Authors: John A Kamm; Jeffrey P Spence; Jeffrey Chan; Yun S Song
Journal: Genetics Date: 2016-05-10 Impact factor: 4.562

8. Genome sequence of a 45,000-year-old modern human from western Siberia.

Authors: Qiaomei Fu; Heng Li; Priya Moorjani; Flora Jay; Sergey M Slepchenko; Aleksei A Bondarev; Philip L F Johnson; Ayinuer Aximu-Petri; Kay Prüfer; Cesare de Filippo; Matthias Meyer; Nicolas Zwyns; Domingo C Salazar-García; Yaroslav V Kuzmin; Susan G Keates; Pavel A Kosintsev; Dmitry I Razhev; Michael P Richards; Nikolai V Peristov; Michael Lachmann; Katerina Douka; Thomas F G Higham; Montgomery Slatkin; Jean-Jacques Hublin; David Reich; Janet Kelso; T Bence Viola; Svante Pääbo
Journal: Nature Date: 2014-10-23 Impact factor: 49.962

9. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores.

Authors: Bjarni J Vilhjálmsson; Jian Yang; Hilary K Finucane; Alexander Gusev; Sara Lindström; Stephan Ripke; Giulio Genovese; Po-Ru Loh; Gaurav Bhatia; Ron Do; Tristan Hayeck; Hong-Hee Won; Sekar Kathiresan; Michele Pato; Carlos Pato; Rulla Tamimi; Eli Stahl; Noah Zaitlen; Bogdan Pasaniuc; Gillian Belbin; Eimear E Kenny; Mikkel H Schierup; Philip De Jager; Nikolaos A Patsopoulos; Steve McCarroll; Mark Daly; Shaun Purcell; Daniel Chasman; Benjamin Neale; Michael Goddard; Peter M Visscher; Peter Kraft; Nick Patterson; Alkes L Price
Journal: Am J Hum Genet Date: 2015-10-01 Impact factor: 11.025

10. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations.

Authors: Jeffrey P Spence; Yun S Song
Journal: Sci Adv Date: 2019-10-23 Impact factor: 14.136