Arjun Biddanda1, Matthias Steinrücken1,2, John Novembre1,2. 1. Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA. 2. Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA.
Abstract
Archeogenetics has been revolutionary, revealing insights into demographic history and recent positive selection. However, most studies to date have ignored the nonrandom association of genetic variants at different loci (i.e. linkage disequilibrium). This may be in part because basic properties of linkage disequilibrium in samples from different times are still not well understood. Here, we derive several results for summary statistics of haplotypic variation under a model with time-stratified sampling: (1) The correlation between the number of pairwise differences observed between time-staggered samples (πΔt) in models with and without strict population continuity; (2) The product of the linkage disequilibrium coefficient, D, between ancient and modern samples, which is a measure of haplotypic similarity between modern and ancient samples; and (3) The expected switch rate in the Li and Stephens haplotype copying model. The latter has implications for genotype imputation and phasing in ancient samples with modern reference panels. Overall, these results provide a characterization of how haplotype patterns are affected by sample age, recombination rates, and population sizes. We expect these results will help guide the interpretation and analysis of haplotype data from ancient and modern samples.
Archeogenetics has been revolutionary, revealing insights into demographic history and recent positive selection. However, most studies to date have ignored the nonrandom association of genetic variants at different loci (i.e. linkage disequilibrium). This may be in part because basic properties of linkage disequilibrium in samples from different times are still not well understood. Here, we derive several results for summary statistics of haplotypic variation under a model with time-stratified sampling: (1) The correlation between the number of pairwise differences observed between time-staggered samples (πΔt) in models with and without strict population continuity; (2) The product of the linkage disequilibrium coefficient, D, between ancient and modern samples, which is a measure of haplotypic similarity between modern and ancient samples; and (3) The expected switch rate in the Li and Stephens haplotype copying model. The latter has implications for genotype imputation and phasing in ancient samples with modern reference panels. Overall, these results provide a characterization of how haplotype patterns are affected by sample age, recombination rates, and population sizes. We expect these results will help guide the interpretation and analysis of haplotype data from ancient and modern samples.
Multilocus properties of genetic variation have been useful for studying evolutionary
processes and maximizing the information extracted from population genetic data. Patterns of
multilocus variation are shaped by mutation and recombination events, generating novel
combinations of alleles on chromosomes (i.e. haplotypes). The nonrandom association of
alleles between 2 (or more) loci is known as linkage disequilibrium (LD; Lewontin and Kojima 1960; Hill and Robertson 1968; Slatkin 2008). Common measures of LD include the covariance and correlation in
allelic state at 2 loci on the same haplotype within a sample (D and
r2, respectively; Hill
and Robertson 1968; Slatkin 2008).
The decay of LD as a function of the distance between genetic variants plays an important
role in dating evolutionary events (e.g. Moorjani
), determining the accuracy of complex trait
prediction (e.g. Vilhjálmsson ), and moderating the power to map trait-associated loci (e.g.
Wray 2005; Spencer ).One approach for modeling variation at multiple loci has been through the use of coalescent
theory (Kingman 1982; Hudson 1985). The coalescent process at multiple loci can involve
both recombination (splitting events) and coalescence (joining events) of ancestral
lineages, which means that there can be a different number of lineages ancestral to a sample
at each locus at a given point in time (Hudson
1985; Simonsen and Churchill 1997;
Durrett 2008). Based on a 2-locus coalescent
model, Hudson (2001) developed a composite
likelihood approach to estimate fine-scale recombination rates in early sequencing datasets.
This initial approach paved the way for subsequent methods to estimate fine-scale
recombination rates in humans, accommodating increasing model complexity (McVean ; Auton and McVean 2007; Kamm ; Spence and Song 2019). Also using a 2-locus coalescent model,
McVean (2002) was able to express metrics
of LD in terms of properties of coalescent times. As the impact of changing demographic
history on coalescent times is relatively straightforward, this advance enabled a more
intuitive understanding of the impact of demographic history and sampling design on expected
patterns of LD in data (McVean 2002; Wakeley and Lessard 2003).A second major modeling framework for LD has been via “haplotype copying” models, such as
the Li and Stephens’ model (Li and Stephens
2003; Song 2016). Haplotype copying
models provide a computationally efficient approximation for the likelihood of observed
haplotype data generated with recombination (Fearnhead and Donnelly 2001). As a result, they have become a backbone of many
analyses of population-genomic data, such as genotype imputation (e.g. Howie ), haplotype phasing (e.g.
Loh ), and
local ancestry inference (Price ; Lawson ).In an increasing number of settings, samples are not all from the same time point. This is
exemplified by the growing study of archeogenetics, also known as ancient DNA (aDNA) studies
(reviewed in Slatkin and Racimo 2016; Llamas ; Skoglund and Mathieson 2018). Archeogenetic
studies of humans have been able to reliably obtain genetic data from samples up to
45,000 years before present, although the majority of samples are from the past
∼15,000 years (Skoglund and Mathieson
2018).For single-locus data, genealogical models have been developed to quantify the impact of
ancient samples on population genetic statistics, such as the expected site-frequency
spectrum, the number of variants private to an ancient sample, and
F (Rodrigo and
Felsenstein 1999; Forsberg ; Ortega-Del Vecchyo and
Slatkin 2018). In contrast, the impact of time-separation on patterns of LD has not
been fully explored.Here, we characterize patterns of haplotype variation in temporally stratified samples
using a genealogical perspective. Analogous approaches for time-stratified samples in a
coalescent framework have generally not been developed for the case of 2 or more recombining
loci. One exception is the approach of Dialdestoro
that uses importance sampling over the space of
latent ancestral recombination graphs when calculating the likelihood of observed sequence
data for haplotypes at multiple time-points. Our work here contrasts to that of Dialdestoro in that
we obtain analytic solutions for 2-locus scenarios and for the haplotype copying model. The
work presented here is complementary to previous work by Terhorst who modeled how allele
frequencies change for multiple loci using a Gaussian approximation to the Wright–Fisher
model, though here we approach the problem from a coalescent perspective.We primarily consider statistics based on 2 haplotypes as a starting point for representing
the impact of time-stratified sampling across multiple loci. However, we also explore the
statistic , whose properties can be understood as an expectation over 4
haplotypic states. We focus on these simplified scenarios as they are analytically
tractable, while still providing insight on expected patterns in data (Hudson 1985; McVean
2002). We first show how time-stratified sampling affects the joint properties of
genealogies at 2 loci, demonstrating that the time gap between a pair of samples has an
impact on the rate of decay in the correlation of genealogical statistics and corresponding
patterns of variation with recombination distance. We also analyze the behavior of fitting
the haplotype copying model with samples of different ages, in particular when the test
haplotype is from a time-point in the past compared with a modern haplotype panel. Overall,
our results show the effect of time-stratified sampling on expected patterns of haplotypic
variation, and their implications for the further development of population genetic
methods.
Methods
Coalescent simulations and calculation of pairwise-differences
We used (Kelleher ) to perform all
coalescent simulations used throughout the article. For simulations of 2 loci, we used a
customized recombination map to reflect 2 nonrecombining loci of a given size separated by
a specified absolute recombination rate. For the simulations of haplotypes, we use the
default simulation method and a uniform recombination map (default per-basepair per-generation). To calculate a
pairwise-coalescent effective N to compare our
constant-population-size theory for 2 loci with simulations under varying demographic
history, we took a Monte-Carlo approach using 104 coalescent simulations to
compute the mean marginal pairwise coalescent time from simulations and compute as .
Monte-Carlo simulation of correlation in pairwise differences
To verify our comparisons of the theoretical prediction of with data, we simulated 2 loci as described above with a
mutation rate (approximately equivalent to a 1-kb window with human
scale parameters) for 100 log-spaced points from . When estimating , we conducted 100,000 independent simulations and
estimated the Pearson correlation using the pearsonr function in the scipy package
(Virtanen ). The standard error of the correlation was calculated using the asymptotic
formula: .For estimating the correlation in pairwise differences, we simulated 20 replicates of
20 Mb haplotypes and calculated a Monte-Carlo estimator of the mean correlation in
segregating sites at different recombination distances. The estimation proceeds as
follows: (1) we split the chromosome into nonoverlapping windows of length
L basepairs (default: 1 kb); (2) for each of 5,000 Monte-Carlo
samples we choose a window S and define a paired window a
recombination distance r from it (randomly choosing the direction to
search); (3) compute the empirical Pearson correlation coefficient of the number of
pairwise differences across the 5,000 paired windows. Standard errors were
computed using the asymptotic formula above, using the 20 replicate chromosomes. For
estimation with the real whole-genome sequencing data, we use 30 log-spaced bins over
the range , where r is in Morgans to calculate
Monte-Carlo estimates of the correlation in pairwise differences. Unless otherwise
specified in the text, error bars reflect 2 standard errors from the mean. When
translating from years to generations for comparison of models to our theoretical
predictions, we use a generation time of 30 years per generation from Fenner (2005).
Monte-Carlo estimation of joint LD
To estimate the product of LD across timepoints (Equation 7), we used Monte-Carlo simulations of 500 modern and ancient
haplotypes in a model of constant population size of . We conducted 10 replicate simulations of 1 megabase
haplotypes with the mutation rate and recombination rate set to per basepair per generation. We applied a filter of the
minor allele frequency pooled across timepoints at when calculating the joint LD coefficient. We
additionally bin by genetic distance using the automatic histogram binning in
scipy (Virtanen ). For very low values of ρ, there are too
few mutations co-occurring at such short distances in our simulations so we set a
lower-bound of ρ = 1 when plotting Fig. 5.
Fig. 5.
Joint product of LD between samples separated by t
generations across different population-scaled recombination rates ρ
(see Methods). Dots represent results from simulation and solid lines
are theoretical predictions from Equation
(7).
Analysis of ancient whole-genome sequencing data
For our analysis of whole-genome aDNA data, we compared single nucleotide variants
observed in the LBK and Ust-Ishim samples (Lazaridis ; Fu ). Variants were
called using and were
subsequently filtered using the same criterion as in de Barros Damgaard .To account for not having resolved haplotypes in the ancient samples, we scale the
observed differences by the probability that they would be observed in a haplotype
randomly sampled from the diploid genome (e.g. 0.5 if heterozygote in ancient sample, 1 if
opposing homozygote in the ancient sample). For modern samples, we used haplotypes from
the 1000 Genomes Project Phase 3 Dataset (Auton
).We computed the correlation in pairwise differences in nonoverlapping 1-kb windows and
applied a mappability mask to account for varying coverage in the modern sample by
normalizing (Auton ). Standard errors were estimated using a nonparametric bootstrap across 22
autosomes. To compare 2 empirical curves of , we apply a 2-sided Binomial sign test to test the
proportion of recombination distance bins for which 1 ancient sample has a higher
correlation and test against the null hypothesis that the proportion is 0.5.
Parameter estimation in the haplotype copying model
We implemented a version of the haplotype copying model proposed by Lawson that accounts for the
genetic map distances between subsequent single-nucleotide polymorphisms. The Hidden
Markov Model (HMM) is defined as follows. The transition probabilities between hidden
states, X, where X represents
the haplotype in the panel that the test haplotype copies off of at site
l: where g is the genetic
distance between markers l−1 and l (in
Morgan), K is the size of the haplotype reference panel, and
λ is the “jump rate” or rate at which the model transitions between the
haplotype copying states.The emission probabilities can be similarly characterized, using a parameter
ϵ that represents the probability of a copying error: where h is the allelic state of
the query haplotype at site l.We use 2D numerical optimization from
(Virtanen ) to jointly estimate the
maximum-likelihood estimates and . Unless specifically stated, we use the joint parameter
estimates in our results for both simulated and empirical data. For profile
maximum-likelihood estimates of , we use Brent optimization within the range
[] with a fixed . We estimate standard errors for and using a finite-difference approximation to the second
derivative of the joint log-likelihood surface.All simulations under the haplotype copying model were conducted using chromosomes of 40
megabases, and recombination and mutation rates of per basepair per generation. Every modern panel consisted
of K = 100 haplotypes (unless otherwise specified). We
also ascertained to variants with a minor allele frequency in the modern panel.
Analysis of male X-chromosomes in 1,240K human aDNA dataset
The human aDNA data that we used for our analysis of the haplotype copying model (see
Online Resources) are typed at a set of 1,233,013 sites across the
genome and downloaded from the David Reich Laboratory’s website. Genotypes are drawn using
psuedohaploid sampling based on the available reads at these sites. We filtered the data
based on the following criteria for our analysis while restricting to the X chromosome:
(1) Must be a male sample; (2) Samples must not have a significant amount of modern DNA
contamination (e.g. “PASS” contamination checks); and (3) Samples must have ≥8,000
nonmissing variants across the X chromosome. Following this filter, the median autosomal
coverage for the remaining samples is , and an average of 1.29 sites per 25 kb on the
X-chromosome.Following these filters, we have a total set of 798 samples for which we estimated the
maximum-likelihood jump rate under the haplotype copying model. To minimize confounding
via spatial variables, we chose a centroid location (48°N latitude, 6°E longitude) and
only retained samples within 1,500 km of this centroid. Following this filtering step,
there are 344 samples that are used for the main figures (Fig. 7).
Fig. 7.
Comparison of estimated haplotype copying jump rates between real data and
simulations. a) Estimate of the jump rate in ancient male X-chromosomes within 1,500
km of central Europe. b) Maximum likelihood estimates of the haplotype copying jump
rate using simulated X-chromosomes under the model of Tennessen . c) Estimated
jump rates using simulated data under the model of Browning and Browning (2015).
We performed estimation of the haplotype copying jump rate across all of the 798
originally filtered samples using 3 different haplotype reference panels (49 CEU
haplotypes [“CEU”]; 240 EUR haplotypes [“EUR”]; 1,233 haplotypes [“FULLKG”]) for the
X-chromsome from the 1000 Genomes Phase 3 dataset (Auton ). In all cases, we used the sex-averaged
recombination map for the X-chromosome from Kong
. For linear modeling of the jump-rate as a
function of the sample age, we used the OLS function of
statsmodels package (Seabold
and Perktold 2010). When comparing the real data against simulations under the
demographic models inferred by Tennessen
and Browning and Browning (2015), we use n = 49
modern day CEU haplotypes and sampled haplotypes at ages corresponding to the real data
using a generation time of 30 years per generation (Fenner 2005). We additionally scaled each demographic model by
3/4 to reflect the reduced effective size of the X-chromosome.
Results
Two-locus genealogical properties
To model 2 haplotypes at 2 loci with time-stratified sampling, we adapted a previously
developed continuous time Markov process for modeling ancestral lineages at 2 loci (Hudson 1983, 1990; Simonsen and
Churchill 1997). The states in the model are triplets [e.g. (2, 0, 0)] that
depict the number of lineages ancestral to both loci, locus 1, or locus 2, respectively.
Coalescence and recombination events eventually lead to an absorbing state where
both haplotypes have coalesced at both loci [the state
(1, 0, 0), Fig. A1]. Analytical results for
joint moments in the coalescent times in this model have been previously obtained for the
case where samples are taken at the present (Hudson 1983; Simonsen and Churchill
1997; Durrett 2008, Chapter 3).
Fig. A1.
Markov chain model for the ancestral process at 2 loci from Simonsen and Churchill (1997). In all settings for 2
modern haplotypes, we assume that we start from the state in the middle (state “0”)
in all applications, which means that all sampled haplotypes are coupled. The
parameter η represents the coalescent rate and the parameter
ρ represents the recombination rate (measured in coalescent
units). Figure adapted from Hobolth and
Jensen (2014).
a) Schematic of genealogies at 2 loci separated by a population-scaled recombination
distance ρ (). The parameter t
represents the sampling time of the haplotype (measured in coalescent units, i.e.
scaled by ). The random variables T
and T are the additional time to coalescence at locus
A & B, after t.
b) The probability of the modern haplotype being “uncoupled” at the time of ancient
sampling as a function of t and ρ. In
this setting, “uncoupled” means that the ancestral lineages at locus
A and B are not on the same haplotype, enhancing
the probability of different T and
T occurring at each locus.Here, to analyze the case of time-stratified sampling, we assume that one of the
haplotypes has been sampled at time t in the past (in
coalescent units) and the other at the present. With this time gap in sampling, there are
2 natural phases in the ancestral process: (1) the time between the present and when the
ancient haplotype is sampled (), and thus only the lineage of the modern haplotype can
evolve at each locus, and (2) the time when the lineages of both haplotypes (modern and
ancient) are evolving through the full state space of the ancestral process
().For this 2-phase ancestral process, we derived expressions for the covariance between the
T at 2 loci (A and
B), as well as the total branch lengths
(L, L) separated by a
population-scaled recombination distance, , where r is the per-generation probability
of recombination.The derivation proceeds by recognizing that a key aspect of the 2-phase process is the
effect of recombination during the first phase, when only the modern lineage is evolving
backwards in time (, see Appendix A). During this phase the process has only 2
states, “uncoupled” and “coupled.” By “uncoupled,” we mean that the ancestral lineages are
evolving independently at each locus, whereas “coupled” means that they are evolving as a
joint ancestral lineage. The starting state for the second phase of the ancestral process
(when ) is either that the modern haplotype’s ancestral lineages
are coupled at both loci or uncoupled from one another. We obtain the time-dependent
probability of being in the uncoupled state by exponentiating the 2 × 2 rate matrix
Q for the reduced state-space of the ancestral process during
, where . By doing so and taking different limits, we find:Figure 1b shows for either large
time-separation (t) or large population-scaled-recombination
rates (ρ), it becomes more likely that the modern haplotype is in the
uncoupled state by the time the process encounters the ancient haplotype. Since the
remaining dynamics are the same as the 2-locus ancestral process with 2 contemporaneously
sampled haplotypes, we thereafter leverage known results for the 2-locus ancestral process
(Simonsen and Churchill 1997; McVean 2002; Durrett 2008, Chapter 3). In the next 2 sections, we take this
modeling approach to derive the expectations of observable quantities from time-staggered
haplotype data.
Fig. 1.
a) Schematic of genealogies at 2 loci separated by a population-scaled recombination
distance ρ (). The parameter t
represents the sampling time of the haplotype (measured in coalescent units, i.e.
scaled by ). The random variables T
and T are the additional time to coalescence at locus
A & B, after t.
b) The probability of the modern haplotype being “uncoupled” at the time of ancient
sampling as a function of t and ρ. In
this setting, “uncoupled” means that the ancestral lineages at locus
A and B are not on the same haplotype, enhancing
the probability of different T and
T occurring at each locus.
Correlation in pairwise differences
The number of pairwise differences between 2 haplotypes at each of 2 loci is an
observable summary of genetic variation at linked loci in time-sampled sequence data. To
investigate the properties of the joint distribution on pairwise differences at 2 loci
(locus A and B), we continue to assume a model with
recombination occurring at a rate ρ between them and no recombination
occurring within each. For each locus, as is typical in coalescent models, we assume an
infinite-sites model with mutations arising on each lineage as a Poisson process with rate
, where , μ is the per-basepair per-generation
mutation rate, L is the size of the locus (in basepairs), and
N is the effective population size.Following the approach described in the preceding section, we derive the correlation of
pairwise differences for the case with time-stratified sampling (see Appendix A). In
particular, we use the fact that the correlation in the number of pairwise differences at
locus A and B can be expressed in terms of the
correlation in the total branch length between the loci (Wakeley and Lessard 2003; Hobolth ). We find the correlation
in pairwise differences between 2 loci to be: where is the correlation in total branch length at locus
A and locus B. In Appendix A (building on previous
results from Hudson 1983; Simonsen and Churchill 1997; Durrett 2008, Chapter 3), we derive its exact
form and several limiting values to be:As the equations show, the correlation in pairwise differences is affected by the age of
the ancient sample t in 2 ways. The first effect is due to
the factor in Equation (4) that decreases as
t increases and is not dependent on ρ,
which can be seen in Fig. 2 by the decrease
for t = 10,000 against t = 0 for
very small ρ. We note that the difference between
t = 10,000 and t = 0 in Fig. 2 is more pronounced than between 1,000 and
0, because t in Equation (4) is on the coalescent scale. The second effect occurs in how
t affects (Fig. 2). For
values of , the correlation decays linearly with
t and with for ρ. The decay decreases more rapidly as
when and as t gets large (the third
case in Equation 5). This is because of the
additional time (t) that the recombination process has to
break apart the shared genealogical history at each locus.
Fig. 2.
Theoretical (solid lines) and simulated correlation between pairwise differences in a
constant-size demography () at different sample ages (in generations). Comparison
of theoretical prediction of with data from 2-locus coalescent simulations with
(see Methods). Solid blue and orange
lines are the theoretical predictions for from Equation
(4).
Theoretical (solid lines) and simulated correlation between pairwise differences in a
constant-size demography () at different sample ages (in generations). Comparison
of theoretical prediction of with data from 2-locus coalescent simulations with
(see Methods). Solid blue and orange
lines are the theoretical predictions for from Equation
(4).
The impact of nonequilibrium demographic history on the correlation in pairwise
differences
To explore the effects of varying population size through time, we simulated haplotype
data under models of constant size, instantaneous growth, and trajectories inferred from
previous studies of human populations that include both bottlenecks and growth (Tennessen ; Browning and Browning 2015; Fig. 3). Motivated by how most human aDNA data
are from approximately the last 15,000 years, we investigated the correlations on a
timescale of 500 generations.
Fig. 3.
The impact of varying demographic history on the correlation in pairwise
differences at 2 loci. For all simulations, the recombination rate between the loci
was set to per generation (∼10 kb, assuming 1 cM per 1 Mb).
Simulated scenarios include: a) constant population size, b) inferred models of
population growth, and c) models of instantaneous population growth. Each timepoint
had 50,000 replicate simulations.
The impact of varying demographic history on the correlation in pairwise
differences at 2 loci. For all simulations, the recombination rate between the loci
was set to per generation (∼10 kb, assuming 1 cM per 1 Mb).
Simulated scenarios include: a) constant population size, b) inferred models of
population growth, and c) models of instantaneous population growth. Each timepoint
had 50,000 replicate simulations.In models with constant population size, larger population sizes lead to smaller
inter-locus correlations (lower LD). In all our simulations , so on the time-scale of 500 generations, the correlation
in branch length decreases linearly as expected with sampling age (Equation 4, Supplementary Fig. 2A). Across all
population sizes, we observe significantly negative relationships between sample age (on
the coalescent scale) and the correlation in branch length akin to what we predict in
Equation (4) (for linear regression of
, we find for ). The negative effect of t
on the correlation in total branch length in turn decreases the correlation in pairwise
differences (Fig. 3a).When simulating under the population size trajectories from Tennessen or from the Browning and Browning (2015), “UK10K IBDNe
model” in reference to the original dataset, the correlations are smaller than the UK10K
IBDNe model, which includes a larger population size in the last few generations but an
overall N (estimated using Watterson’s estimator, see
Methods) that is smaller than the Tennessen model (; Fig. 3b).
In a linear model, the correlation in pairwise differences decreases with age under the
UK10K IBDNe model [, 95% CI = (−0.51, −0.31)] and not in the Tennessen
model [, 95% CI: (−0.03, 0.12)].For the case of step-wise population growth (Fig. 3c), we make 3 observations. First, the decrease in the correlation in
pairwise differences is no longer approximately linear with time but decays nonlinearly,
with the rate of decay decreasing with sample age. Second, the correlation in pairwise
differences is highest at short time-scales for the most recent growth event, and at
long-timescales for the most ancient growth event. This can be interpreted again as a
result of the very low N in this setting such that the
factor scaling the correlation in pairwise differences (Equation 4) dominates the behavior after generations (when the correlation in branch length is
similar across all settings). Third, the correlation in the branch length is
substantially higher () when compared with the previously inferred demographies
(Supplementary Fig. 2).The step-wise growth scenario is interesting in that due to the large, recent increase
in population size, we expect roughly star-like genealogies with coalescent times
concentrated around the start of the growth event (Slatkin 1996; Rosenberg and Hirsh 2003). In this scenario, we find the correlation between
loci in the branch lengths is increased greatly (Supplementary Figs. 2C and 8) which contributes to elevating the
. At the same time, as θ is decreased
relative to other scenarios (due to lower N), we do not see
as drastic an increase in the correlation between pairwise differences as in the branch
length (Equation 4). Intuitively, as
N, the correlation in total branch length
between loci increases as the coalescent rate increases if the
recombination rate is held fixed; lowering N also decreases
θ, which increases the correlation in pairwise differences between
loci.Finally, we also investigated the correlation in pairwise differences in a 2-population
model of divergence without gene flow. We assume the modern and ancient haplotype are
each sampled from different populations. In this scenario, both the
ancient and modern haplotypes can become uncoupled prior to any possibility of
inter-haplotype coalescence lowering the expected correlation in pairwise diversity
(Appendix A). In this model, we find the correlation in number of pairwise differences
decreases as a function of the sum of the divergence time and the sampling time
(; Supplementary Fig. 1).
Correlation of pairwise differences in time-staggered whole-genome sequencing
data
Next, we explored the correlation of pairwise differences in modern and ancient human
whole-genome sequencing data with 2 high-coverage samples from 2 different ages. We
restricted to analyzing high-quality whole-genome sequencing data to avoid ascertainment
biases and to more accurately estimate pairwise differences (see Methods;
Fig. 4).
Fig. 4.
Comparison of the correlation in pairwise differences between LBK, Ust-Ishim, and a
modern CEU control individual. Points represent the estimate of the pairwise
correlation between randomly chosen pairs of loci (see Methods). When
computing the theoretical curves, we used and a mutation rate per basepair-per-generation.
Comparison of the correlation in pairwise differences between LBK, Ust-Ishim, and a
modern CEU control individual. Points represent the estimate of the pairwise
correlation between randomly chosen pairs of loci (see Methods). When
computing the theoretical curves, we used and a mutation rate per basepair-per-generation.The first sample we chose is an ∼7,000-year-old sample from modern-day Germany associated
with the Linear Ban Keramic (LBK) culture and labeled variously in previous studies as the
Stuttgart LBK sample or simply the LBK sample (Lazaridis ). The
second sample is ∼45,000 years old and from Western Siberia, labeled
Ust-Ishim (Fu ). These samples have an order of magnitude difference in the
sampling time-scale (thousands vs tens-of-thousands years).To investigate the correspondence of our theory with empirical data, we compared the
correlation in pairwise differences across our 2 empirical samples to the theoretical
predictions from Equation (4). We find that
for recombination rates < Morgans, the scale and rate of decay of the empirical
curves are consistent with the theoretical predictions (Fig. 4). However, there is a larger deviation between the
empirical results and theoretical predictions at longer recombination distances
(), where in observed data there is an excess of correlation
in pairwise differences (Fig. 4). The
extended decay of that we see in real data is not present in data simulated
under the model of (Tennessen ; Supplementary Fig. 4A) or under a constant-sized demography (Supplementary Fig. 4B), suggesting
that the extended decay is not attributable to demographic history alone and warrants
further study.
LD with time-stratified sampling
To directly relate the joint genealogical properties described above to patterns of LD,
we investigated the normalized expected product of LD (D) between the
ancient and modern samples: where is the frequency of the derived allele at the first locus
at time t and is a classic measure of LD in the sample of individuals
from time t (Lewontin and Kojima
1960). Using the genealogical identity coefficients from McVean (2002), we derive the ratio of the expectations of the
product of LD between time-points. Motivated by arguments put forth by McVean (2002) and Ragsdale and Gravel (2019) that express statistics of LD by
taking the ratio of expectations (i.e. ), we take the ratio of expectations of in Equation
(6) to derive a time-stratified analog of . Similar to , we stress that our statistic is not directly equivalent to —is an approximation that can become poor for loci at
low-frequencies McVean (2002). In Appendix
B, we derive an expression for the joint product of LD across both timepoints
():when t = 0, Equation (7) reduces to the expression for , as shown in McVean
(2002). Both simulations and our theoretical predictions show that larger
time-separation between samples qualitatively decreases the joint product of LD (Fig. 5).Joint product of LD between samples separated by t
generations across different population-scaled recombination rates ρ
(see Methods). Dots represent results from simulation and solid lines
are theoretical predictions from Equation
(7).
The impact of time-stratified sampling in haplotype copying models
We next consider the scenario where one would be interested in modeling an ancient
haplotype as a mosaic of modern haplotypes, as might arise when trying to phase or impute
aDNA genotypes using a reference panel of modern haplotypes and the popular Li and
Stephens haplotype copying model (Li and Stephens
2003; Song 2016). We specifically
use a modified model where the recombination map positions are known a
priori (see Methods;Lawson ). We focus on the maximum-likelihood
estimate of the haplotype copying jump rate () for a given test haplotype as it copies off the reference
panel. We view partly as a summary statistic reflecting the length scale
of copying tracts and as an indicator of the expected accuracy of imputation (Stephens and Scheet 2005; Jewett ).The time-separation between the ancient haplotype and modern sample provides an
opportunity for recombination events to occur among the modern reference haplotypes before
the ancient lineage is able to coalesce with any individuals from the modern panel (Equation 3; Fig. 1). Thus, we expect higher jump rates as the sample age
t increases. We also expect coalescence
within the modern panel will contribute to higher jump rates with
increasing t by effectively reducing the panel size moving
farther back in time.Using the first time coalescence between the ancient target and a member of the modern
panel, we observe a saturation effect when increasing the modern panel size (Appendix C and Supplementary Fig.
5). The time until the first coalescent event involving the ancient sample is equal
to the length of the external branch in the local genealogy that leads to the ancient
sample, and affects the rate of recombination events that can induce switch events in the
copying model. The time to the first coalescent involving the ancient sample and the
modern panel decreases as a function of the reference panel size, K.
However, as the age of the sample increases, the number of lineages extant to the
reference sample becomes smaller, making the time to first coalescent event more similar
across modern reference panel sizes.Using simulations with populations of constant size, we find that the realized copying
jump rate indeed increases with age, and does so monotonically as a function of the age of
the test haplotype under a model of constant population size (Fig. 6a). The simple monotonic relationship can break down in
nonequilibrium demographic models. For instance, in demographic models including recent
population growth for European populations, we find that there is an initial decrease in
from the present to ∼150 generations ago before a more
rapid increase moving back into the past (Fig. 6b;Tennessen ; Browning and Browning
2015). A similar result is observed more dramatically in simulations of
instantaneous growth, with a common feature being a decreasing relationship between
and sample age up to the time of onset of instantaneous
growth, reflective of the effect of a strong conditioning on the coalescent time (Fig. 6c andSupplementary Fig. 8).
Fig. 6.
Estimation of haplotype copying jump-rate against sample age for different models of
population demographic history (top row). a) constant population size, b) previously
inferred models of recent population growth, and c) models of instantaneous population
growth. The inferred parameters should be interpreted in terms of the average jumps
per Morgan.
Estimation of haplotype copying jump-rate against sample age for different models of
population demographic history (top row). a) constant population size, b) previously
inferred models of recent population growth, and c) models of instantaneous population
growth. The inferred parameters should be interpreted in terms of the average jumps
per Morgan.
Haplotype copying jump-rates in human aDNA data
To compare our simulation experiments on the dependence of the jump-rate with sampling
time to empirical data, we applied our jump rate estimation to a collection of 1,159
ancient human samples (see Methods). To avoid potential errors
introduced by statistical phasing, we analyzed only haploid carriers of the X chromosome
by taking samples labeled as male in both the ancient data and the modern reference
panel (1000 Genomes Project data; Auton
). Thus, the analysis used 47,094 bi-allelic SNPs
observed on the X chromosome. To avoid the potential effects of population structure
confounding the impact of time-stratified sampling and to maximize the sample size, we
focus primarily on Europe as it is the region with the highest density of aDNA samples,
and we used n = 49 CEU male X chromosomes to define
the modern reference panel (see Supplementary Fig. 6 for experiments with alternate panels).Based on copying jump rates estimated across 344 ancient male X-chromosome samples from
across Europe (see Methods for a description of the dataset), we find
that the estimated jump rate decreases as a function of sample age
(Fig. 7a). Accounting for spatial
variables (Latitude, Longitude, and Latitude × Longitude) in a linear model (see
Methods and Supplementary Fig. 6), we find the effect of sample age on the estimated
copying jump rate is negative (; 95% CI = (−0.63, −0.46)]. Filtering for the highest 25%
coverage individuals did not change the result (Supplementary Fig. 9). The inferred haplotype copying error rate
(ϵ) also decreases with age, suggesting the observed decrease in
λ is not an artifact of the inference procedure (Supplementary Fig. 10).Comparison of estimated haplotype copying jump rates between real data and
simulations. a) Estimate of the jump rate in ancient male X-chromosomes within 1,500
km of central Europe. b) Maximum likelihood estimates of the haplotype copying jump
rate using simulated X-chromosomes under the model of Tennessen . c) Estimated
jump rates using simulated data under the model of Browning and Browning (2015).This decrease is contrary to our idealized simulations with constant population size
(Fig. 6a) and in agreement with the
simulations involving some aspect of recent growth (Fig. 6, b and c). To make the comparison more exact, we
replicate simulations of Tennessen and Browning and
Browning (2015) with the exact temporal sampling structure of the real 344
samples and using a sex-averaged recombination map for the X chromosome (Kong ). With
these simulations, we are able to replicate an initial decrease in the jump-rate as a
function of sampling time (Fig. 7, b and
c). However, the simulations do not capture the duration of the decrease in
jump-rate with sample age, which we find to be generations in the real data.
Discussion
In this article, we have developed theory to understand the effects of serial sampling on
patterns of haplotype variation in the context of 2 models, the 2-locus coalescent model and
the haplotype copying model. Both of these models are used to describe patterns of LD in
population genetic data, and share several features with one another. Both models capture
the relationship between recombination distance and the breakdown of LD, but the 2-locus
genealogies consider patterns only at 2-loci whereas the haplotype copying model considers a
multilocus perspective. It should also be noted that the 2-locus genealogical model
explicitly considers the time of coalescent and recombination events, whereas the haplotype
copying model, in the form used here, does not consider the timing of particular events.
However, in spite of their differences, they both have wide relevance in that they provide
theoretical results for the expected patterns of linked variation, underlying standard
approaches to analyze modern haplotype data.In the 2-locus coalecenscent, we find that with larger time-separation between samples, the
correlation in branch length at 2 loci decreases by an amount proportional to the
probability of uncoupling of a sampled modern haplotype over t
units of time (Equation 4). In constant-size
populations and small values of , the decrease is linear in time. As
t increases the decay of correlation in branch lengths to
occur with order ) vs ). Intuitively, the additional marginal branch length on which
a recombination event can occur ( vs 2 in expectation) is disrupting between-locus correlation.
Demographic history also shapes the correlation in branch length between loci, with
increasing as N decreases due to
a decrease in the variance in coalescent times (Supplementary Fig. 2). For larger values of
t there is an additional decrease in the correlation of
pairwise differences between loci, , that arises from the impact of mutations (the denominator of
Equation 4). For small values of
t ( coalescent units) the correlation of branch length
essentially determines the behavior of the correlation in observable number of differences
between 2 loci.The expected joint LD coefficient between data sampled at different times decreases across
all recombination scales in the simulations and the theoretical derivations (Fig. 5). However, it is important to note that our
simulations here represent an idealized scenario with a large number of ancient haplotypes
(n = 500) and no genotyping error. Therefore, it will be
of further interest to determine if statistics such as the joint LD coefficient may be
informative for demographic inference, while accounting for potential error modes from
realistic data sources.Our analysis of the haplotype copying rate revealed interesting impacts of demographic history. In
constant-size models, the inferred copying rate increased with the sample age as one might
expect due to recombination events; however, in cases of strong recent population growth
(Tennessen ;
Reppell ; Browning and Browning 2015) the inferred copying
rate decreases initially with age and then increases. To understand this, consider how the
haplotype copying jump-rate, , is inversely related to the expected branch-length shared
between an ancient haplotype and a member of the modern panel, because recombination events
that occur on these branches can initiate copying-switch events (Li and Stephens 2003; Paul ; Steinrücken ). In cases with rapid population growth,
there are initially limited numbers of coalescent events, followed by a high rate when the
population is small, looking backwards in time. Samples that are sampled sequentially closer
to the onset of growth have shorter branch length on which potential switch events occur,
producing the initial negative relationship. For samples that are sampled more
ancestrally than the onset of population growth, we find that the jump rate
increases as the coalescent time are no longer affected by the onset of growth (Fig. 6c).Our empirical analysis of aDNA data from western Eurasia supported a negative relationship
between the haplotype copying rate and sample age. In contrast with the demographic models
simulated, the empirical data show an extended decrease in the jump rate, reaching over ∼400
generations. Similar discrepancies arise when comparing the correlation in pairwise
differences in empirical data (Fig. 4). We
consider 2 potential explanations for the discrepancy between simulations and observations:
unmodeled aspects of population demographic history not captured by existing models used for
simulation or aDNA data artifacts. Throughout our experiments for both the haplotype copying
rate and correlation in pairwise differences, we found that demographic models capturing
more detail of recent Eurasian history did not adequately predict either statistic. However,
there may still be potential unmodeled aspects of relevance to our statistics here. For
example, the duration of the decrease in the estimated copying rate could be due to smaller
local population sizes in the more distant past than is reflected in the models. This is
particularly relevant given the time-scale of ∼400 generations (∼12,000 years) as this
extends into the Mesolithic and Paleolithic eras during which populations were likely small
in overall size and deeply structured (Premo and
Hublin 2009; Haak ; Skoglund and Mathieson 2018).
If ancestral population structure existed in this period, it may have biased inferred
effective population size upwards in models that were fit under the assumption of a single
panmictic population (Li and Durbin 2011,
Supplementary Section 1.6). We
also recognize that due to population turnover, the proportion of ancestry directly
ancestral to the modern reference panel may fluctuate as a function of time due to
population turnover, leading to temporal patterns in the jump rate. Regarding the aDNA data,
in our empirical analysis, we do not find any significant effects of coverage on the
qualitative result that the jump-rate decreases as a function of time (Supplementary Fig. 9B). If error rates
increase with sample age it would seem to run counter to the observed result, causing
elevated jump rate estimates as one goes further back in time; however, this is not what we
observe in our joint estimation (Supplementary Fig. 10). Some complex form of reference bias increasing with age
and interacting with the haplotype copying model may be plausible. Overall, the result
suggests there may be interesting insights to be gained by more detailed empirical analyses
of haplotypic patterns in aDNA.Many methods have been developed in the context of haplotype copying models, from
imputation and phasing (e.g. Howie 2009),
estimation of recombination rates (e.g. Li and
Stephens 2003), to fine-scale ancestry estimation (e.g. Lawson ). Our theoretical results
leave important considerations for each of these application domains with serially sampled
data. For imputation and phasing, the increase in the copying jump rate as a function of
time under constant population sizes implies that LD will be lower in relation to the first
coalescent time with a member of the modern panel, and will lower the copying accuracy at
longer genetic distances (Appendix C; Jewett
). For samples that are sufficiently old, there is a
diminishing benefit for generating larger modern reference panels (Appendix C), which
primarily results in improvements in imputation and phasing for modern samples due to recent
relatedness (Jewett ; McCarthy ).Our exploration of the impact of population demography (particularly population growth) and
our empirical analysis of the male X chromosome paints a more optimistic picture for the
analysis of human aDNA using the haplotype copying model. We find that there is a
substantial attenuation of the increase in the haplotype copying jump-rate () under scenarios of recent growth, and even potential
decreases in the case of instant population growth (Fig. 6). Together with our empirical result of the jump rate decreasing as a
function of time across male X chromosomes in ancient European samples (Fig. 7), the results support the idea that we may be able to
impute common variants relatively accurately in human populations that have undergone recent
rapid growth. Indeed, the empirical accuracy of imputation is relatively high for samples
within the past ∼6,000 years (Gamba ; Martiniano ). In addition to the “reference-based” phasing we have explored
in this work, methods that iteratively sample haplotypes from the input genotypes have
advantages for phasing aDNA when modern reference panels lack the haplotype and allelic
diversity present in ancient samples (e.g. Rubinacci
). We leave this comparison of phasing and imputation
accuracy from exclusively reference-based models with the addition of iterative haplotype
sampling for future work, though we expect some of the insights gained here will help this
exploration.As caveats, our theoretical results here do not account for some important features of aDNA
data. Specifically, we have not attempted to model genotyping error and low-coverage data,
both common in the analysis of aDNA (e.g. Dabney
). Our results on pair-wise loci could be extended to
directly model the effects of errors at one or both loci. Methods using haplotype copying
HMMs with emission probabilities directly modeling low-coverage sequencing data (e.g. Rubinacci ) are more
applicable to account for this sparsity in aDNA analysis. Another caveat is that due to the
wide temporal range and the absolute number of samples available (Olalde and Posth 2020), our empirical analyses focused on samples
from western Eurasia. As aDNA technology improves and sampling becomes less centered on
western Eurasia, it will be interesting to reanalyze the relationship between the jump-rate
and sample age across multiple regions with varied demographic histories.With the abundance of aDNA data being generated across a wide array of organisms,
statistical and theoretical advances will need to similarly account for this new dimension
in the data. Here, we have highlighted the impact of time-stratified sampling for 2 related
models, the 2-locus coalescent with recombination and the haplotype copying model. We expect
that our theoretical treatment of these models will serve to inform advances in statistical
population genetic methods that account for serially sampled data to maximize their utility
for inference.
Data availability
All results in this article can be reproduced directly from repositories specified in the
Online Resources.Supplemental material is available
at GENETICS online.Click here for additional data file.
Authors: Qiaomei Fu; Heng Li; Priya Moorjani; Flora Jay; Sergey M Slepchenko; Aleksei A Bondarev; Philip L F Johnson; Ayinuer Aximu-Petri; Kay Prüfer; Cesare de Filippo; Matthias Meyer; Nicolas Zwyns; Domingo C Salazar-García; Yaroslav V Kuzmin; Susan G Keates; Pavel A Kosintsev; Dmitry I Razhev; Michael P Richards; Nikolai V Peristov; Michael Lachmann; Katerina Douka; Thomas F G Higham; Montgomery Slatkin; Jean-Jacques Hublin; David Reich; Janet Kelso; T Bence Viola; Svante Pääbo Journal: Nature Date: 2014-10-23 Impact factor: 49.962
Authors: Bjarni J Vilhjálmsson; Jian Yang; Hilary K Finucane; Alexander Gusev; Sara Lindström; Stephan Ripke; Giulio Genovese; Po-Ru Loh; Gaurav Bhatia; Ron Do; Tristan Hayeck; Hong-Hee Won; Sekar Kathiresan; Michele Pato; Carlos Pato; Rulla Tamimi; Eli Stahl; Noah Zaitlen; Bogdan Pasaniuc; Gillian Belbin; Eimear E Kenny; Mikkel H Schierup; Philip De Jager; Nikolaos A Patsopoulos; Steve McCarroll; Mark Daly; Shaun Purcell; Daniel Chasman; Benjamin Neale; Michael Goddard; Peter M Visscher; Peter Kraft; Nick Patterson; Alkes L Price Journal: Am J Hum Genet Date: 2015-10-01 Impact factor: 11.025