Simulation studies have demonstrated that a variety of patterns in worldwide genetic variation are compatible with the trends predicted by a serial founder model, in which populations expand outward from an initial source via a process in which new populations contain only subsets of the genetic diversity present in their parental populations. Here, we provide analytical results for key quantities under the serial founder model, deriving distributions of coalescence times for pairs of lineages sampled either from the same population or from different populations. We use these distributions to obtain expectations for coalescence times and for homozygosity and heterozygosity values. A predicted approximate linear decline in expected heterozygosity with increasing distance from the source population reproduces a pattern that has been observed both in human genetic data and in simulations. Our formulas predict that populations close to the source location have lower between-population gene identity than populations far from the source, also mirroring results obtained from data and simulations. We show that different models that produce similar declining patterns in heterozygosity generate quite distinct patterns in coalescence-time distributions and gene identity measures, thereby providing a basis for distinguishing these models. We interpret the theoretical results in relation to their implications for human population genetics.
EQUILIBRIUM population structure models, which assume that the rules specifying the evolution of alleles within and among populations do not change with time, have achieved much success in describing genetic variation. Although equilibrium models are convenient for obtaining analytical results that can be used to test hypotheses and predict patterns of genetic variation, nonequilibrium models often provide more realistic representations of patterns that occur in real populations. Nonequilibrium models assume that the rules specifying the evolution of alleles change as a function of time. In nonequilibrium models, however, with some exceptions (e.g., Takahata ; Wakeley 1996a,b,c; Jesus ; Efromovich and Kubatko 2008), analytical formulas have been relatively scarce because model complexity can make them difficult to obtain.
Recently, a nonequilibrium structured population model, the “serial founder model,” has been proposed for describing the colonization of the world by modern humans (Ramachandran ). The colonization process in this model starts with a single source population. The source population sends a subset of its individuals to migrate outward and found a new population. This newly founded population has a small size at its founding and subsequently expands to a larger size. After the expansion, it then sends out migrants to form the next population. The founding process is iterated until K populations have been founded. The appeal of this model is that using both forward (Ramachandran ; Deshpande ) and backward (coalescent) simulations (DeGiorgio ; Hunley ), it has been successful in describing observed patterns of human genetic variation, such as the decline in expected heterozygosity observed with increasing geographic distance from a putative African source location.
In addition to the initial serial founder model of Ramachandran , a variety of models that contain the geographic expansions and bottlenecks characteristic of the serial founder model have recently been studied (Austerlitz ; Le Corre and Kremer 1998; Edmonds ; Ray ; Klopfstein ; Liu ; Excoffier and Ray 2008; Hallatschek and Nelson 2008; DeGiorgio ; Deshpande ; Hunley ). Among formulations with a one-dimensional geographic structure, some models (e.g., Austerlitz ; Deshpande ) allow migration after the initial founding of populations and assume that once a population is founded, it logistically grows to its carrying capacity. When carrying capacity is reached, or shortly thereafter, migrants exit the population to found the next population. Other models (e.g., DeGiorgio ) do not permit migration after populations are founded and assume that population growth is instantaneous. In these models, after a population is founded, it experiences a small size for some length of time before instantaneously expanding to a larger size. For the former class of models, Austerlitz presented recursions to generate the distribution of coalescence times for pairs of lineages sampled either from the same population or from different populations. These equations can then be used to calculate geographic patterns in summary statistics such as gene diversity and FST. For the latter class, DeGiorgio and Hunley approached similar problems using simulations. The relative simplicity of the population growth and migration assumptions in this latter group of models, however, potentially permits explicit formulas, rather than recursions or simulations, to be investigated.
Here, generalizing the coalescent-based version of the serial founder model as formulated by DeGiorgio , we provide an analytical distribution of the coalescence time for a pair of lineages at a randomly selected locus, along with corresponding expected coalescence times, expected homozygosity values, and FST values. In this nonequilibrium model, we show that the decrease in expected heterozygosity and the corresponding increase in homozygosity with increasing distance from the source population can be predicted analytically. We then provide analytical results for the expected identity for two alleles drawn randomly from a given pair of populations, and we find that the qualitative patterns produced by the formulas closely match those observed from human genetic data and the simulations of Hunley . Furthermore, we discuss how our results can be used to obtain analytical formulas for summary statistics for an archaic serial founder model, for the nested-regions model of Hunley , and for the instantaneous divergence model of DeGiorgio . Our new analytical formulas on within-population gene diversity, between-population gene identity, and pairwise FST motivate an analysis of empirical trends in these summary statistics in worldwide human genetic data. Because a serial founder process is largely consistent with worldwide patterns of human genetic variation, the analytical results presented here are useful both for generating and for testing hypotheses about human origins.
Serial Founder Model
In this section, we begin by formally defining the serial founder model. This model was used in a simulation of DeGiorgio , and here, we provide a more complete generalization. We obtain the probability density of coalescence times for two lineages sampled under the model. Utilizing this density, we obtain mth moments of coalescence times, mth moments of homozygosities, and FST values between pairs of populations.
Model
We formulate the serial founder model in a coalescent setting. A diagram of the model appears in Figure 1A. Our generic formulation contains a sequence of bottlenecks in which bottleneck sizes, population sizes, bottleneck lengths, and the times for the population founding events are allowed to vary. The model considers $K$ extant populations, denoted $E_1, E_2, \ldots, E_K$. For $i < j$, the founding of extant population $E_i$ took place at least as far back in time as that of extant population $E_j$. The model has $2K$ ancestral populations, denoted $A_0, A_1, \ldots, A_{2K-1}$. For $i < j$, the founding of ancestral population $A_j$ took place at least as far back in time as that of ancestral population $A_i$. $N_i$ denotes the size of ancestral population $A_i$, $i = 0, 1, \ldots, 2K-1$. Note that for $i = 1, 2, \ldots, K$, the size of extant population $E_i$ is equal to $N_{2(K-i)}$, which also is the size of ancestral population $A_{2(K-i)}$. Time is measured in generations, and the present has time $\tau_0 = 0$.
Figure 1
Serial founder model. (A) Serial founder model with $K$ extant and $2K$ ancestral populations. At time $\tau_{2K-1}$, ancestral population $A_{2K-1}$ expands to a larger size to form ancestral population $A_{2(K-1)}$. Next, at time $\tau_{2(K-1)}$, ancestral population $A_{2(K-1)}$ splits to form extant population $E_1$ and newly founded ancestral population $A_{2(K-1)-1}$. At time $\tau_{2(K-1)-1}$, population $A_{2(K-1)-1}$ expands to a larger size to form ancestral population $A_{2(K-2)}$. In general, at time $\tau_{2(K-i)}$, ancestral population $A_{2(K-i)}$ splits into extant population $E_i$ and newly founded ancestral population $A_{2(K-i)-1}$. At time $\tau_{2(K-i)-1}$, ancestral population $A_{2(K-i)-1}$ expands to a larger size to form ancestral population $A_{2[K-(i+1)]}$. (B) Scenario in which lineages are sampled from populations $E_i$ and $E_j$, $i \le j$ ($i < j$ is shown here). Regions in which coalescence can occur are shaded.
Forward in time, ancestral population $A_{2K-1}$ expands to a larger size at time $\tau_{2K-1}$ to create ancestral population $A_{2(K-1)}$, the population directly ancestral to the source population $E_1$. At time $\tau_{2(K-1)}$, ancestral population $A_{2(K-1)}$ splits into extant population $E_1$ and ancestral population $A_{2(K-1)-1}$, a newly founded population during the time in which it experiences a small size prior to expansion. At time $\tau_{2(K-1)-1}$, ancestral population $A_{2(K-1)-1}$ expands to a larger size to form ancestral population $A_{2(K-2)}$. At time $\tau_{2(K-2)}$, ancestral population $A_{2(K-2)}$ splits to form extant population $E_2$ and ancestral population $A_{2(K-2)-1}$, the next founded population during its bottleneck phase. This process is iterated until extant population K has been founded. In general, at time $\tau_{2(K-i)}$, $i = 1, 2, \ldots, K-1$, ancestral population $A_{2(K-i)}$ splits into extant population $E_i$ and a newly founded ancestral population $A_{2(K-i)-1}$. At time $\tau_{2(K-i)-1}$, $i = 0, 1, \ldots, K-1$, ancestral population $A_{2(K-i)-1}$ expands to a larger size to form ancestral population $A_{2[K-(i+1)]}$. Note that by construction, extant population $E_K$ and ancestral population $A_0$ are the same population.
We note that several past studies (e.g., Austerlitz ; Ramachandran ; Liu ; Deshpande ) utilized formulations of the serial founder model that involved logistic growth of newly founded populations, migration between neighboring populations after their initial founding, or both of these model features. In contrast, for the purpose of obtaining analytical results, our model has a mathematically simpler formulation that involves an instantaneous expansion of a newly founded population to a larger size and that does not permit migration between neighbors after founding events.
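The indexing in the model definition can be easy to lose track of, so the following Python sketch stores the event times and population sizes and exposes the index arithmetic. It is only an illustration: the class name `SerialFounderModel`, its method names, and the example values are hypothetical and do not come from the original study. The sketch assumes that the times are nondecreasing into the past and that the first time is zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SerialFounderModel:
    """Hypothetical container: taus[h] = tau_h and sizes[h] = N_h."""

    taus: List[float]   # tau_0, ..., tau_{2K-1}, in generations before present
    sizes: List[float]  # N_0, ..., N_{2K-1}, sizes of A_0, ..., A_{2K-1}

    def __post_init__(self) -> None:
        if len(self.taus) != len(self.sizes) or len(self.taus) % 2 or not self.taus:
            raise ValueError("need 2K times and 2K sizes")
        if self.taus[0] != 0:
            raise ValueError("the present is tau_0 = 0")
        if any(later < earlier for earlier, later in zip(self.taus, self.taus[1:])):
            raise ValueError("times must be nondecreasing into the past")

    @property
    def K(self) -> int:
        """Number of extant populations."""
        return len(self.taus) // 2

    def split_index(self, i: int) -> int:
        """Index 2(K - i): E_i splits from A_{2(K-i)} at time tau_{2(K-i)}."""
        if not 1 <= i <= self.K:
            raise ValueError("extant populations are numbered 1..K")
        return 2 * (self.K - i)

    def extant_size(self, i: int) -> float:
        """Size of E_i, which equals N_{2(K-i)}."""
        return self.sizes[self.split_index(i)]

    def first_shared_index(self, i: int, j: int) -> int:
        """Index of the most recent ancestral population holding both lineages."""
        return self.split_index(min(i, j))


# Example with K = 2; the values are arbitrary and for illustration only.
model = SerialFounderModel(taus=[0, 50, 100, 150], sizes=[1000, 100, 2000, 100])
```

In this example, `model.first_shared_index(1, 2)` returns 2, reflecting that lineages from $E_1$ and $E_2$ first share a population in $A_2$, at times more ancient than $\tau_2$.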
Coalescence times
In this section, we derive the probability density of coalescence times for a pair of lineages sampled under the serial founder model. We begin by deriving the probability density function $f_{ij}(t)$ for the coalescence time of a pair of lineages, one randomly sampled from extant population $E_i$ and the other from extant population $E_j$ (where $j$ is not necessarily distinct from $i$). This function is defined piecewise over the space of possible coalescence times $t \in [0, \infty)$. Using our formula for $f_{ij}(t)$, we derive $m$th moments of coalescence times, from which we obtain mean pairwise coalescence times. We use the result from coalescent theory that coalescence times are exponentially distributed with a rate that is inversely proportional to the population size (Kingman 1982; Hudson 1983; Tajima 1983). Also, we use the result that the number of mutations along a genealogical branch is Poisson distributed, and because we restrict our attention to neutral loci, we separate the mutation process from the genealogical process (Tavaré 1984; Hudson 1990).
Let $T_{ij}$ be a random variable that denotes the coalescence time for a pair of lineages, one from extant population $E_i$ and the other from extant population $E_j$, with $i \le j$. If $i < j$, then the two lineages cannot coalesce until they are in the same ancestral population (i.e., more ancient than $\tau_{2(K-i)}$). Suppose the two lineages are in the same population during time interval $[\tau_h, \tau_{h+1})$, where $h \ge 2(K-i)$. With population sizes measured in diploid individuals and time in generations, a pair of lineages in $A_h$ coalesces at rate $1/(2N_h)$. The probability density for coalescence at time $t \in [\tau_h, \tau_{h+1})$ is the product of the probability that the lineages do not coalesce in the more recent time intervals, $\prod_{\ell=2(K-i)}^{h-1} e^{-(\tau_{\ell+1}-\tau_\ell)/(2N_\ell)}$, and $\frac{1}{2N_h} e^{-(t-\tau_h)/(2N_h)}$, the probability density for coalescence at time $t$ conditional on failure to coalesce by time $\tau_h$.
If $i = j$, then the two lineages can also coalesce in the interval $[\tau_0, \tau_{2(K-i)})$. Suppose the two lineages exist in the same population during time interval $[\tau_0, \tau_{2(K-i)})$. The probability density for coalescence at time $t \in [\tau_0, \tau_{2(K-i)})$ in extant population $E_i$ is $\frac{1}{2N_{2(K-i)}} e^{-(t-\tau_0)/(2N_{2(K-i)})}$. The probability that the lineages do not coalesce in time interval $[\tau_0, \tau_{2(K-i)})$ is $e^{-(\tau_{2(K-i)}-\tau_0)/(2N_{2(K-i)})}$ (we write $\tau_0$ for notational consistency, but recall $\tau_0 = 0$).
For $i \le j$ and $h \in \{2(K-i), 2(K-i)+1, \ldots, 2K-1\}$, denote the probability that a coalescence has not occurred by time $\tau_h$ for two lineages, one from $E_i$ and one from $E_j$, by
$$P_{ij,h} = \left[e^{-(\tau_{2(K-i)}-\tau_0)/(2N_{2(K-i)})}\right]^{\delta_{ij}} \prod_{\ell=2(K-i)}^{h-1} e^{-(\tau_{\ell+1}-\tau_\ell)/(2N_\ell)},$$
where $\delta_{ij}$ is the Kronecker delta. We then arrive at the density function for the time to coalescence of a pair of lineages sampled from extant populations $E_i$ and $E_j$, $i \le j$,
$$f_{ij}(t) = \begin{cases} \delta_{ij}\,\dfrac{1}{2N_{2(K-i)}}\, e^{-(t-\tau_0)/(2N_{2(K-i)})}, & \tau_0 \le t < \tau_{2(K-i)}, \\[2ex] P_{ij,h}\,\dfrac{1}{2N_h}\, e^{-(t-\tau_h)/(2N_h)}, & \tau_h \le t < \tau_{h+1},\ h = 2(K-i), \ldots, 2K-1, \end{cases} \tag{1}$$
where $\tau_{2K} = \infty$. This density for the pairwise coalescence time consists of a collection of shifted exponential distributions, each defined on a different interval.
Equipped with the density in Equation 1, we next derive $m$th moments for the distribution of coalescence times. We are interested primarily in the mean, but the derivation for arbitrary $m$ is no more difficult than that for $m = 1$. Using the result (Gradshteyn and Ryzhik 2007, p. 106) that
$$\int x^n e^{ax}\,dx = e^{ax} \sum_{k=0}^{n} \frac{(-1)^k\, k! \binom{n}{k}}{a^{k+1}}\, x^{n-k}, \tag{2}$$
we obtain
$$E[T_{ij}^m] = \delta_{ij} \sum_{k=0}^{m} \frac{m!}{(m-k)!}\,(2N_{2(K-i)})^k \left[\tau_0^{m-k} - \tau_{2(K-i)}^{m-k}\, e^{-(\tau_{2(K-i)}-\tau_0)/(2N_{2(K-i)})}\right] + \sum_{h=2(K-i)}^{2K-1} P_{ij,h} \sum_{k=0}^{m} \frac{m!}{(m-k)!}\,(2N_h)^k \left[\tau_h^{m-k} - \tau_{h+1}^{m-k}\, e^{-(\tau_{h+1}-\tau_h)/(2N_h)}\right], \tag{3}$$
where the term involving $\tau_{2K} = \infty$ is zero and $\tau_0^0$ is interpreted as 1. Setting $m = 1$, the expected coalescence time is
$$E[T_{ij}] = \delta_{ij}\left[(\tau_0 + 2N_{2(K-i)}) - (\tau_{2(K-i)} + 2N_{2(K-i)})\, e^{-(\tau_{2(K-i)}-\tau_0)/(2N_{2(K-i)})}\right] + \sum_{h=2(K-i)}^{2K-1} P_{ij,h}\left[(\tau_h + 2N_h) - (\tau_{h+1} + 2N_h)\, e^{-(\tau_{h+1}-\tau_h)/(2N_h)}\right]. \tag{4}$$
Using the density in Equation 1, we can investigate how the initial divergence time and the severity of bottlenecks influence the distribution of coalescence times. Figure 2B displays density plots for coalescence times in the serial founder model in Figure 2A. Analytical density functions closely match the histograms generated in $10^7$ coalescent simulations using MS (Hudson 2002), following the simulation method of DeGiorgio . Figure 2B shows that multiple modes appear in the distributions of pairwise coalescence times, as a result of the increased coalescence rate during bottlenecks. Coalescence-time distributions for pairs of lineages from different populations are shifted by the divergence time of the two populations, so that coalescence times for pairs of lineages from distinct populations tend to exceed those of pairs from the same population.
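The piecewise density and the mean coalescence time described above can be sketched numerically. The Python code below is a minimal illustration rather than the authors' implementation. It assumes that population sizes count diploid individuals, so that a pair of lineages in a population of size $N_h$ coalesces at rate $1/(2N_h)$ per generation. The function names are hypothetical, and the size of the oldest population in the Figure 2A example ($N_7 = 10000$) is an assumption, because the caption does not state it.

```python
import math
from typing import Sequence


def survival(taus: Sequence[float], sizes: Sequence[float], i: int, j: int, h: int) -> float:
    """P(no coalescence by tau_h) for lineages from E_i and E_j, with i <= j."""
    K = len(taus) // 2
    s = 2 * (K - i)  # index of the first shared ancestral population
    exponent = 0.0
    if i == j:  # both lineages share E_i during [tau_0, tau_s)
        exponent += (taus[s] - taus[0]) / (2.0 * sizes[s])
    for l in range(s, h):  # shared ancestral populations A_s, ..., A_{h-1}
        exponent += (taus[l + 1] - taus[l]) / (2.0 * sizes[l])
    return math.exp(-exponent)


def density(t: float, taus, sizes, i: int, j: int) -> float:
    """Piecewise density of the pairwise coalescence time at t >= 0 (i <= j)."""
    K = len(taus) // 2
    s = 2 * (K - i)
    if t < taus[s]:  # the lineages are not yet in a shared ancestral population
        if i != j:
            return 0.0
        rate = 1.0 / (2.0 * sizes[s])
        return rate * math.exp(-rate * (t - taus[0]))
    h = max(l for l in range(s, 2 * K) if taus[l] <= t)  # interval containing t
    rate = 1.0 / (2.0 * sizes[h])
    return survival(taus, sizes, i, j, h) * rate * math.exp(-rate * (t - taus[h]))


def _segment_mean(start: float, end: float, size: float) -> float:
    """Integral of t * (1/2N) * exp(-(t - start)/(2N)) over [start, end)."""
    if math.isinf(end):
        return start + 2.0 * size
    decay = math.exp(-(end - start) / (2.0 * size))
    return (start + 2.0 * size) - (end + 2.0 * size) * decay


def mean_coalescence_time(taus, sizes, i: int, j: int) -> float:
    """Closed-form mean of the pairwise coalescence time (i <= j)."""
    K = len(taus) // 2
    s = 2 * (K - i)
    total = _segment_mean(taus[0], taus[s], sizes[s]) if i == j else 0.0
    for h in range(s, 2 * K):
        end = taus[h + 1] if h + 1 < 2 * K else math.inf
        total += survival(taus, sizes, i, j, h) * _segment_mean(taus[h], end, sizes[h])
    return total


# Parameters of Figure 2A; N_7 = 10000 is an assumption (not given in the caption).
TAUS = [0, 5000, 10000, 15000, 20000, 25000, 30000, 30000]
SIZES = [10000, 1000, 10000, 1000, 10000, 1000, 10000, 10000]

print(mean_coalescence_time(TAUS, SIZES, 4, 4))  # two lineages from population 4
print(mean_coalescence_time(TAUS, SIZES, 1, 4))  # one lineage each from 1 and 4
```

A useful sanity check under these assumptions is the constant-size case, in which the mean reduces to the familiar $2N$ generations regardless of how the timeline is partitioned.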
Figure 2
Distributions of coalescence times in the serial founder model. (A) Serial founder model with four extant populations. Thick population sizes represent 10000 diploid individuals and thin population sizes represent 1000 diploid individuals. The times of founding events and population expansions are $\tau_0 = 0$, $\tau_h = \tau_{h-1} + 5000$ for $h = 1, 2, \ldots, 6$, and $\tau_6 = \tau_7 = 30000$ generations. (B) Probability density of coalescence times. Each subplot is the probability density of coalescence times for a pair of lineages sampled from the pair of populations listed in the row and column. (C) Probability density of coalescence times for a pair of lineages sampled from population 4, with identical parameter values to part A except that the bottlenecks (thin populations) have 5000 diploid individuals instead of 1000 diploid individuals. The figure can be compared with the plot for two lineages from population 4 in part B. Histograms are based on $10^7$ coalescent simulations using MS (Hudson 2002), and the red lines represent the analytical densities obtained from Equation 1.
We can consider the effect of bottleneck size by examining the coalescence-time distribution for a pair of lineages in two scenarios that are identical except that one has a smaller bottleneck size. In Figure 2B, considering a pair of lineages from population 4, with bottleneck size 1000 individuals, most of the coalescence-time distribution accumulates early because of the strong bottleneck during the time interval $[\tau_1, \tau_2) = [5000, 10000)$. Much of the remainder of the distribution accumulates during the next strong bottleneck, in the interval $[\tau_3, \tau_4) = [15000, 20000)$.
When the bottleneck size in Figure 2A is increased from 1000 to 5000, the coalescence rate within bottlenecks decreases. Because of this decrease, lineages are more likely to persist farther into the past without coalescing. Thus, Figure 2C shows that decreasing the severity of the bottleneck by increasing the bottleneck population size reduces the probability that the lineages coalesce during the most recent bottleneck. A fourth mode of the coalescence-time distribution then becomes visible during the bottleneck in the interval $[\tau_5, \tau_6) = [25000, 30000)$.
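The effect of bottleneck size on where the coalescence-time distribution accumulates can be illustrated by computing the probability mass in each interval for two lineages from population 4. The sketch below is not taken from the original study: it assumes a pairwise coalescence rate of $1/(2N)$ per generation with $N$ in diploid individuals, it assumes that the oldest population has 10000 individuals, and the function names are hypothetical.

```python
import math


def interval_masses(taus, sizes):
    """Probability that two lineages from E_K coalesce in each [tau_h, tau_{h+1}).

    Lineages from the most recently founded population E_K = A_0 share every
    ancestral population A_0, A_1, ..., so all intervals contribute.
    """
    masses = []
    surviving = 1.0  # probability of no coalescence before the current interval
    for h, (start, size) in enumerate(zip(taus, sizes)):
        end = taus[h + 1] if h + 1 < len(taus) else math.inf
        p_coal = 1.0 - math.exp(-(end - start) / (2.0 * size))
        masses.append(surviving * p_coal)
        surviving *= 1.0 - p_coal
    return masses


# Timeline of Figure 2A: events every 5000 generations, tau_6 = tau_7 = 30000.
TAUS = [0, 5000, 10000, 15000, 20000, 25000, 30000, 30000]


def sizes_with_bottleneck(n_b, n=10000):
    """Odd-indexed A_1, A_3, A_5 are bottlenecks; A_7 = n is an assumption."""
    return [n, n_b, n, n_b, n, n_b, n, n]


strong = interval_masses(TAUS, sizes_with_bottleneck(1000))
weak = interval_masses(TAUS, sizes_with_bottleneck(5000))
for h in (1, 3, 5):
    print(f"[tau_{h}, tau_{h + 1}): N_b=1000 -> {strong[h]:.4f}, N_b=5000 -> {weak[h]:.4f}")
```

Under these assumptions, about 71% of the probability mass falls in $[5000, 10000)$ when bottlenecks have 1000 individuals, compared with about 31% when they have 5000, and the mass in $[25000, 30000)$ is correspondingly larger in the weaker-bottleneck scenario.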
Pairwise homozygosity and heterozygosity
Two commonly used summary statistics are expected homozygosity (gene identity) and expected heterozygosity (gene diversity). Let $J_{ij}$ be a random variable that denotes the homozygosity for a pair of lineages, one randomly sampled from extant population $E_i$ and the other from extant population $E_j$ (where $j$ is not necessarily distinct from $i$). Further, let $H_{ij} = 1 - J_{ij}$ be a random variable that denotes the heterozygosity for a pair of lineages, one randomly sampled from $E_i$ and the other from $E_j$. We define homozygosity as the probability that two alleles sampled at a locus are identical by descent (the definition of locus used here is flexible and can range from a single site to a haplotype). Assuming an infinite alleles mutation model and a coalescence time of $T$ generations, if mutations are Poisson distributed, then homozygosity, or the probability that no mutation occurs on either of the two lineages over an interval of length $T$, is $e^{-2\mu T}$, where $\mu$ is the per-generation mutation rate (Wakeley 2009, p. 107). We can therefore find $m$th moments of homozygosity as
$$E[J_{ij}^m] = E\left[e^{-2m\mu T_{ij}}\right] = \int_0^\infty e^{-2m\mu t} f_{ij}(t)\,dt, \tag{5}$$
where $T_{ij}$ is the pairwise coalescence time and $f_{ij}(t)$ is its density in Equation 1. By the binomial theorem, the $m$th moment of heterozygosity is $E[H_{ij}^m] = \sum_{k=0}^{m} \binom{m}{k} (-1)^k E[J_{ij}^k]$. Setting $m = 1$ in Equation 5, we obtain the expected homozygosity and heterozygosity for two lineages, one sampled from population $E_i$ and the other from $E_j$,
$$E[J_{ij}] = \int_0^\infty e^{-2\mu t} f_{ij}(t)\,dt \tag{6}$$
and
$$E[H_{ij}] = 1 - E[J_{ij}]. \tag{7}$$
Using the model in Figure 2A, Figure 3 plots the expected heterozygosity of two lineages sampled from population 4 as a function of both bottleneck population size and bottleneck length. When the bottleneck has length zero, bottlenecks do not increase genetic drift and hence the expected heterozygosity reaches its maximum. Increasing the bottleneck length causes a monotonic decrease in expected heterozygosity. Decreasing the population size of the bottlenecks further decreases the heterozygosity. The smallest expected heterozygosity shown is reached with the combination of the smallest bottleneck population size (100 diploid individuals) and the largest bottleneck length (5000 generations).
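Because the coalescence-time density is piecewise exponential, the moments of homozygosity can be evaluated interval by interval. The Python sketch below is an illustration under stated assumptions, not the authors' code. It assumes a pairwise coalescence rate of $1/(2N)$ per generation with $N$ in diploid individuals, in which case the contribution of an interval $[a, b)$ with size $N$, conditional on reaching it, is $\left[e^{-2m\mu a} - e^{-2m\mu b - (b-a)/(2N)}\right]/(1 + 4m\mu N)$. The function names are hypothetical, and $N_7 = 10000$ in the Figure 2A example is an assumption.

```python
import math


def expected_homozygosity(taus, sizes, i, j, mu, m=1):
    """E[J_ij^m] = E[exp(-2 m mu T_ij)] for lineages from E_i and E_j (i <= j)."""
    K = len(taus) // 2
    s = 2 * (K - i)  # index of the first shared ancestral population
    # Intervals in which the pair can coalesce: (start, end, size in diploids).
    segments = [(taus[0], taus[s], sizes[s])] if i == j else []
    for h in range(s, 2 * K):
        end = taus[h + 1] if h + 1 < 2 * K else math.inf
        segments.append((taus[h], end, sizes[h]))
    total, surviving = 0.0, 1.0
    for start, end, size in segments:
        if math.isinf(end):
            no_coal, tail = 0.0, 0.0  # coalescence is certain in the last interval
        else:
            no_coal = math.exp(-(end - start) / (2.0 * size))
            tail = math.exp(-2.0 * m * mu * end) * no_coal
        head = math.exp(-2.0 * m * mu * start)
        total += surviving * (head - tail) / (1.0 + 4.0 * m * mu * size)
        surviving *= no_coal
    return total


def heterozygosity_moment(taus, sizes, i, j, mu, m=1):
    """E[H_ij^m] from the binomial expansion of (1 - J_ij)^m."""
    total = 0.0
    for k in range(m + 1):
        j_moment = 1.0 if k == 0 else expected_homozygosity(taus, sizes, i, j, mu, k)
        total += math.comb(m, k) * (-1) ** k * j_moment
    return total


# Figure 2A parameters; N_7 = 10000 is an assumption (not given in the caption).
TAUS = [0, 5000, 10000, 15000, 20000, 25000, 30000, 30000]
SIZES = [10000, 1000, 10000, 1000, 10000, 1000, 10000, 10000]
MU = 2.5e-5  # per-generation mutation rate used for Figure 3

for pop in (1, 2, 3, 4):
    print(pop, heterozygosity_moment(TAUS, SIZES, pop, pop, MU))
```

In the constant-size case, this calculation reduces to the classical $E[J] = 1/(1 + 4N\mu)$, which provides a quick check of the interval-by-interval bookkeeping.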
Figure 3
Expected heterozygosity for a pair of lineages sampled from population 4 of Figure 2A (Equation 7), as a function of population size for bottlenecks and bottleneck length measured in generations. A per-generation mutation rate of $\mu = 2.5 \times 10^{-5}$ is assumed.
Pairwise FST
Our computation of expected coalescence times in Equation 4 provides a basis for obtaining the commonly used measure of genetic differentiation, pairwise FST between populations. Using the results of Slatkin (1991) on FST at small mutation rates, we can write $F_{ST} = (\bar{t} - \bar{t}_0)/\bar{t}$, where $\bar{t}_0$ is the mean coalescence time of two lineages randomly drawn from the same population and $\bar{t}$ is the mean coalescence time of two lineages randomly drawn from any two populations (same or different). By using the expected coalescence times in our serial founder model (Equation 4), we can define these times for pairwise comparisons of populations $E_i$ and $E_j$ ($i < j$) as $\bar{t}_0 = (E[T_{ii}] + E[T_{jj}])/2$, $\bar{t}_1 = E[T_{ij}]$ (the mean pairwise coalescence time for two lineages from different populations), and $\bar{t} = (\bar{t}_0 + \bar{t}_1)/2$. Therefore,
$$F_{ST} = \frac{\bar{t}_1 - \bar{t}_0}{\bar{t}_1 + \bar{t}_0} = \frac{2E[T_{ij}] - E[T_{ii}] - E[T_{jj}]}{2E[T_{ij}] + E[T_{ii}] + E[T_{jj}]}, \tag{8}$$
where the expected coalescence times $E[T_{ii}]$, $E[T_{jj}]$, and $E[T_{ij}]$ are given by Equation 4.
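The pairwise FST calculation can be sketched in a few lines once expected coalescence times are available. The code below is a minimal illustration, assuming a pairwise coalescence rate of $1/(2N)$ per generation with $N$ in diploid individuals. The function names are hypothetical, and the example is a simple two-population split with constant size, chosen because its answer is known: $E[T_{ii}] = 2N$ and $E[T_{12}] = \tau + 2N$, so that FST equals $\tau/(\tau + 4N)$.

```python
import math


def mean_coalescence_time(taus, sizes, i, j):
    """Expected pairwise coalescence time for lineages from E_i and E_j."""
    if i > j:
        i, j = j, i
    K = len(taus) // 2
    s = 2 * (K - i)  # index of the first shared ancestral population
    # Intervals in which the pair can coalesce: (start, end, size in diploids).
    segments = [(taus[0], taus[s], sizes[s])] if i == j else []
    for h in range(s, 2 * K):
        end = taus[h + 1] if h + 1 < 2 * K else math.inf
        segments.append((taus[h], end, sizes[h]))
    total, surviving = 0.0, 1.0
    for start, end, size in segments:
        if math.isinf(end):  # last interval: coalescence is certain
            total += surviving * (start + 2.0 * size)
            break
        no_coal = math.exp(-(end - start) / (2.0 * size))
        total += surviving * ((start + 2.0 * size) - (end + 2.0 * size) * no_coal)
        surviving *= no_coal
    return total


def pairwise_fst(taus, sizes, i, j):
    """FST = (t_bar - t_bar_0) / t_bar for a pair of populations."""
    t_ii = mean_coalescence_time(taus, sizes, i, i)
    t_jj = mean_coalescence_time(taus, sizes, j, j)
    t_0 = (t_ii + t_jj) / 2.0                         # within-population mean
    t_1 = mean_coalescence_time(taus, sizes, i, j)    # between-population mean
    t_bar = (t_0 + t_1) / 2.0
    return (t_bar - t_0) / t_bar


# Two populations of constant size 1000 that split 100 generations ago.
ISO_TAUS = [0, 50, 100, 100]
ISO_SIZES = [1000, 1000, 1000, 1000]
print(pairwise_fst(ISO_TAUS, ISO_SIZES, 1, 2))  # compare with 100 / (100 + 4000)
```

Writing the statistic in terms of $\bar{t}_0$ and $\bar{t}$ keeps the code close to the verbal definition, at the cost of one extra line compared with the simplified ratio.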
Patterns Observed in Human Population Data
In this section we describe a worldwide human population-genetic data set and patterns in summary statistics calculated from the data set. The summary statistics we investigate are within-population gene diversity, between-population gene identity, and pairwise FST. Analytical formulas for these summary statistics under the serial founder model are obtained in Equations 6–8. We compare patterns in these summary statistics observed in data to those predicted by specific models of human evolutionary history. Through these comparisons, we discuss which models of human history are compatible with patterns of genetic variation observed in present-day human populations. Note that only one of the three summary statistics that we study (gene diversity) was discussed by DeGiorgio .
We analyzed data from the Human Genome Diversity Panel (HGDP) (Cann ; Cavalli-Sforza 2005), using 783 autosomal microsatellite loci in 1048 individuals sampled from 53 worldwide populations (Ramachandran ; Rosenberg ). For a given population, gene diversity was calculated using DeGiorgio and Rosenberg’s (2009) Equation 10, averaged across loci; the values were taken from Figure 7C of DeGiorgio and Rosenberg (2009). For distinct populations A and B, between-population gene identity was calculated as the average across the 783 loci,
$$\frac{1}{783} \sum_{\ell=1}^{783} \sum_{i=1}^{I_\ell} \hat{p}_{A\ell i}\, \hat{p}_{B\ell i},$$
where $\hat{p}_{A\ell i}$ and $\hat{p}_{B\ell i}$ are the sample frequencies of the $i$th distinct allele at locus $\ell$ in populations A and B, respectively, and $I_\ell$ is the number of distinct alleles in the pair of populations at locus $\ell$ (Nei 1987). Pairwise FST was calculated using Weir’s (1996) Equation 5.3.
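The between-population gene identity described above is a sum of products of allele frequencies, averaged over loci. The Python sketch below illustrates the calculation on toy data. The function name and the input layout (one list of allele frequencies per locus, with alleles in the same order in both populations) are assumptions made for this example and do not describe the format of the HGDP data.

```python
from typing import Sequence


def between_population_gene_identity(
    freqs_a: Sequence[Sequence[float]],
    freqs_b: Sequence[Sequence[float]],
) -> float:
    """Average over loci of sum_i p_A[l][i] * p_B[l][i].

    freqs_a[l][i] is the sample frequency of the i-th distinct allele at
    locus l in population A; freqs_b uses the same locus and allele order.
    An allele absent from one population has frequency 0 there.
    """
    if len(freqs_a) != len(freqs_b) or not freqs_a:
        raise ValueError("both populations need the same nonempty set of loci")
    total = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        if len(p_a) != len(p_b):
            raise ValueError("allele lists must align within each locus")
        total += sum(x * y for x, y in zip(p_a, p_b))
    return total / len(freqs_a)


# Toy example with two loci (three and two distinct alleles, respectively).
pop_a = [[0.5, 0.5, 0.0], [1.0, 0.0]]
pop_b = [[0.25, 0.25, 0.5], [0.5, 0.5]]
identity = between_population_gene_identity(pop_a, pop_b)
print(identity)
```

In the toy example, the per-locus identities are 0.25 and 0.5, so the average is 0.375.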
Figure 7
Patterns of genetic variation in a nested regions model. The values of the model parameters are the same as those in Figure 6, with the exception that the bottleneck lasts 16 generations instead of 2 generations during the founding of modern populations 15, 29, 43, 57, 71, and 85. (A) Gene diversity of the populations as a function of distance from the source population, where distance is measured in units of populations. (B) Between-population gene identity for pairs of populations. (C) Pairwise FST for pairs of populations.
Figure 4 displays patterns observed for the three summary statistics in the HGDP data set. Figure 4A shows an approximate linear decline of gene diversity with increasing geographic distance from a putative East African location of modern human origins. Figure 4B shows a heat map of gene identity between all pairs of populations, illustrating that pairs closer to Africa generally have lower between-population gene identity than pairs farther from Africa. Figure 4C displays a heat map of pairwise FST between populations. FST is lower for pairs of populations that are close geographically than for pairs of populations that are geographically distant. Additionally, FST values between populations in the Americas are generally larger than FST values between pairs of non-American populations. In Figure 4, a slight jump in the values of summary statistics is visible at the boundaries of geographic regions. That is, separate values of gene diversity computed within populations from the same geographic region, and gene identity and FST values for pairs of populations from the same region, tend to be more similar to each other than to corresponding values involving populations from different regions.
Figure 4
Patterns of within- and between-population summary statistics observed in human population-genetic data. Plots are based on 783 microsatellite loci from 53 worldwide populations in the HGDP data set (Ramachandran ; Rosenberg ). (A) Gene diversity as a function of distance from East Africa (redrawn from DeGiorgio ). Each point represents a particular population. (B) Between-population gene identity. Columns and rows each represent populations, and an entry in the matrix represents the gene identity for the population pair represented by the row and column. (C) Pairwise FST calculated from the same populations as in B.
We can now compare the three patterns in summary statistics observed from the HGDP data set with patterns predicted by models of human evolutionary history. We consider several special cases of our general serial founder model that are chosen on the basis of previous investigations of human evolution. These cases include a modern serial founder model (Ramachandran ; DeGiorgio ; Deshpande ), a nested regions model in which bottlenecks between continental regions are more severe than those within continental regions (Hunley ), an instantaneous divergence model in which all populations diverged at the same time in the past (DeGiorgio ), and an archaic serial founder model in which the founding process started distantly in the past (DeGiorgio ). Using Equations 6–8, we now examine the patterns in gene diversity, between-population gene identity, and pairwise FST generated by these four special cases of the general serial founder model. We consider the extent to which each model can reproduce the patterns observed in worldwide human genetic data in the three statistics.
Modern Serial Founder Model
Motivation and model
A modern serial founder model (Figure 5A) is a special case of our general formulation (Figure 1). To obtain the DeGiorgio serial founder model with $K$ populations, suppose that the bottleneck length is $L_b$ generations and that the time between the end of a bottleneck and the founding of a new population is $L$ generations. In other words, suppose $\tau_{2h+1} - \tau_{2h} = L$ for $h = 0, 1, \ldots, K-2$ and $\tau_{2h} - \tau_{2h-1} = L_b$ for $h = 1, 2, \ldots, K-1$. Let $\tau_0 = 0$. Modern population 1 founds modern population 2 at time $\tau_{2(K-1)} = \tau_{2K-1} = \tau_D$. Each bottleneck has size $N_b$ diploid individuals, and all other populations have size $N$. For the exact serial founder model studied by DeGiorgio , we set $K = 100$, $L_b = 2$, $L = 19$, $\tau_D = 2079$, $N = 10000$, and $N_b = 250$. These values were chosen to represent reasonable values for human populations: $\tau_D$ was chosen to lie within an estimated interval of time for the out-of-Africa migration (e.g., Relethford 2008), $N$ was chosen as a commonly used value to represent the present-day effective size of human populations (e.g., Takahata ), $N_b$ was chosen to represent a size typical for small isolated hunter–gatherer populations (Cavalli-Sforza 2004), $L_b$ was chosen to represent a process in which individuals migrate in the first generation and finalize the settlement of a population in the second generation, and $L$ was chosen such that founding events were distributed uniformly over $\tau_D = 2079$ generations, since $(K-1)(L + L_b) = 99 \times 21 = 2079$. Utilizing this parameterization and a per-generation mutation rate of $\mu = 2.5 \times 10^{-5}$, we examine whether the modern serial founder model can reproduce observed patterns of human genetic variation.
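The parameterization above can be turned into an explicit timeline, from which gene diversity along the chain of populations follows. The Python sketch below is an illustration under stated assumptions, not the authors' code. It assumes a pairwise coalescence rate of $1/(2N)$ per generation with sizes in diploid individuals, and it assumes that the oldest ancestral population $A_{2K-1}$ has size $N$ rather than $N_b$. The function names are hypothetical.

```python
import math


def modern_serial_founder(K=100, L_b=2, L=19, N=10000, N_b=250):
    """Return (taus, sizes) for the modern serial founder parameterization."""
    taus = [0]
    for h in range(1, 2 * K - 1):  # tau_1, ..., tau_{2K-2}
        # Odd h closes a large-size phase of length L; even h closes a bottleneck.
        taus.append(taus[-1] + (L if h % 2 == 1 else L_b))
    taus.append(taus[-1])  # tau_{2K-1} = tau_{2(K-1)} = tau_D
    sizes = [N_b if h % 2 == 1 else N for h in range(2 * K)]
    sizes[-1] = N  # assumption: the ancestral population A_{2K-1} has size N
    return taus, sizes


def gene_diversity(taus, sizes, i, mu=2.5e-5):
    """Expected heterozygosity for two lineages sampled from E_i."""
    K = len(taus) // 2
    s = 2 * (K - i)
    # Intervals in which the pair can coalesce: (start, end, size in diploids).
    segments = [(taus[0], taus[s], sizes[s])]
    for h in range(s, 2 * K):
        end = taus[h + 1] if h + 1 < 2 * K else math.inf
        segments.append((taus[h], end, sizes[h]))
    homozygosity, surviving = 0.0, 1.0
    for start, end, size in segments:
        if math.isinf(end):
            no_coal, tail = 0.0, 0.0  # coalescence is certain in the last interval
        else:
            no_coal = math.exp(-(end - start) / (2.0 * size))
            tail = math.exp(-2.0 * mu * end) * no_coal
        head = math.exp(-2.0 * mu * start)
        homozygosity += surviving * (head - tail) / (1.0 + 4.0 * mu * size)
        surviving *= no_coal
    return 1.0 - homozygosity


taus, sizes = modern_serial_founder()
diversity = [gene_diversity(taus, sizes, i) for i in range(1, 101)]
print(taus[198], diversity[0], diversity[49], diversity[99])
```

Under these assumptions, the source population has constant size $N$, so its expected heterozygosity is $4N\mu/(1 + 4N\mu) = 0.5$, and each additional founding event lowers the value for populations farther along the chain.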
Figure 5
Models to which the general serial founder model reduces. (A) Modern serial founder model. (B) Nested regions model with R regions. (C) Instantaneous divergence model. (D) Archaic serial founder model. The models in A and C are exactly the models discussed by DeGiorgio .
Patterns generated by the model
Figure 6 displays patterns of genetic variation generated by the modern serial founder model. As was observed previously in simulations (Ramachandran ; DeGiorgio ; Deshpande ), the modern serial founder model reproduces the approximate linear decline in gene diversity with distance from the source population (Figure 6A). Figure 6B displays a heat map of pairwise gene identity values between pairs of modern populations. The heat map shows that populations close to the source population have smaller between-population gene identities than populations far from the source, as is observed in human population data (Figure 4B). Figure 6C displays a heat map of FST values between pairs of modern populations, demonstrating that pairs of populations that are geographically distant tend to have larger FST than pairs of populations that are geographically close. The model largely recovers the pattern observed in human data (Figure 4C); however, it also predicts small FST between pairs that are far from the source population, a pattern that is not observed for human populations distant from Africa.
Figure 6
Patterns of genetic variation in a modern serial founder model. The values of the model parameters are indicated in the section Modern Serial Founder Model. (A) Gene diversity of the populations as a function of distance from the source population, where distance is measured in units of populations. (B) Between-population gene identity for pairs of populations. (C) Pairwise FST for pairs of populations.
The pattern of decrease in gene diversity with increasing distance from a source population is due to the decrease in pairwise coalescence time within populations caused by a cumulative increase in genetic drift with increasing distance from the source. Pairs of lineages from distinct populations distant from the source have the potential to coalesce more recently than do pairs of lineages close to the source, thereby explaining the increased gene identity for pairs of populations distant from the source. However, FST between populations that are geographically distant from the source is smaller than FST between populations that are close to the source, as the effect of reduced between-population coalescence times in decreasing FST for populations distant from the source outweighs the effect of their reduced within-population coalescence times in increasing FST.
Our results show that the modern serial founder model largely recovers the patterns observed from human genetic data (Figure 4). Two exceptions are that it does not predict either a peculiar pattern of small gene identities observed between Oceanian and non-Oceanian populations (Figure 4B) or the large FST values observed in the Americas (Figure 4C).
Nested Regions Model
One aspect of the trends in genetic diversity that was not captured by our parameterization of the modern serial founder model above is the larger difference in diversity observed between populations from different continental regions than between populations from the same continental region (Figure 4A). This observation motivates the nested regions model (Figure 5B) simulated by Hunley , in which the set of populations is distributed across several “regions” separated by barriers to migration. Examples of such regions include different continents, areas separated by mountain ranges, or islands within an archipelago. Because crossing between regions is more difficult than migration within a region, significant genetic drift might occur during the expansion into a new region. The nested regions model incorporates this increase in genetic drift during the geographic expansion through increased bottleneck severity between regions relative to bottleneck severity within regions.
We incorporate severe bottlenecks into the modern serial founder model (Figure 5A) by increasing the bottleneck lengths to Lb = 16 generations instead of Lb = 2 during the founding of modern populations 15, 29, 43, 57, 71, and 85. Hence, the length of time between the end of any of these bottlenecks and the founding of the next population is L = 5 generations instead of L = 19, so that the time between founding events is still Lb + L = 21 generations. These severe bottlenecks subdivide the set of K = 100 modern populations into R = 7 regions.
Figure 7 depicts patterns of genetic variation generated by the nested regions model. As was observed in simulations of Hunley , the nested regions model reproduces the approximate linear decline in gene diversity with distance from the source population, with small discontinuities in genetic diversity at region boundaries (Figure 7A). Similarly, as was observed in the simulations of Hunley , the nested regions model reproduces the patterns of between-population gene identity observed from human data, with pairs of populations far from the source displaying larger gene identity than pairs close to the source (Figure 7B). Also, in the nested regions model, pairs of populations that are geographically distant tend to have larger FST than pairs of populations that are geographically close (Figure 7C). The nested regions model predicts regional boundaries in the gene identity and FST heat maps (Figure 7, B and C) that partly reproduce the block structure in the human population data (Figure 4, B and C). However, as in the modern serial founder model, the nested regions model predicts small FST between pairs that are far from the source population, a pattern that is not observed for populations in the Americas (contrast Figure 4C and Figure 7C).
Figure 7
Patterns of genetic variation in a nested regions model. The values of the model parameters are the same as those in Figure 6, with the exception that the bottleneck lasts 16 generations instead of 2 generations during the founding of modern populations 15, 29, 43, 57, 71, and 85. (A) Gene diversity of the populations as a function of distance from the source population, where distance is measured in units of populations. (B) Between-population gene identity for pairs of populations. (C) Pairwise FST for pairs of populations.
As was seen with the modern serial founder model above, the nested regions model recovers most of the patterns observed in human population-genetic data (Figure 4). Because of the increased bottleneck severity between regions, unlike the modern serial founder model, the nested regions model also reproduces the larger differences in values of the three summary statistics observed between regions compared to values observed within regions (Figure 4).
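The bottleneck schedule of the nested regions model described in this section can be sketched as follows. The names `SEVERE`, `founding_schedule`, and `regions` are hypothetical, and the rule that a region starts at each severely bottlenecked population is an assumption read off the description above. The sketch checks that lengthening six bottlenecks leaves the total time unchanged and splits the populations into seven regions.

```python
# Populations whose founding involves a severe (16-generation) bottleneck.
SEVERE = {15, 29, 43, 57, 71, 85}


def founding_schedule(K=100, period=21, Lb_mild=2, Lb_severe=16, severe=SEVERE):
    """Map each founded population (2..K) to its (Lb, L) pair.

    The spacing between founding events stays fixed at `period`, so a longer
    bottleneck shortens the interval L that follows it.
    """
    schedule = {}
    for pop in range(2, K + 1):
        Lb = Lb_severe if pop in severe else Lb_mild
        schedule[pop] = (Lb, period - Lb)
    return schedule


def regions(K=100, severe=SEVERE):
    """Group populations 1..K into regions, starting a new region at each
    severely bottlenecked population (an assumed reading of the model)."""
    out, current = [], []
    for pop in range(1, K + 1):
        if pop in severe and current:
            out.append(current)
            current = []
        current.append(pop)
    out.append(current)
    return out


schedule = founding_schedule()
total_time = sum(Lb + L for Lb, L in schedule.values())
region_list = regions()
print(total_time, len(region_list))
```

Under these assumptions the first six regions each contain 14 populations and the last contains 16.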
Instantaneous Divergence Model
DeGiorgio found that another model, the instantaneous divergence model, was capable of generating patterns that were compatible with observed patterns of within-population gene diversity, linkage disequilibrium, and the ancestral allele frequency spectrum. Because we investigated only within-population summary statistics, however, it was not examined whether the gene identity and FST patterns observed in Figure 4, B and C, could also be generated by the instantaneous divergence model.
The instantaneous divergence model (Figure 5C) is a model in which all populations diverge at the same time in the past and populations that are farther from the source population have a smaller population size than those that are closer to the source. The motivation for this model is that populations that have traveled a greater distance from a source population will likely have lost alleles through genetic drift. The instantaneous divergence model allows for this increased drift for populations that are located far from the source population by assigning such populations a smaller size. An increase in genetic drift causes a decrease in gene diversity due to the random loss of alleles, as also occurs in bottlenecks. DeGiorgio found that when the size of population i in the instantaneous divergence model was set so that the elapsed coalescent time was the same as in modern population i in the modern serial founder model, the approximate linear trend in gene diversity with distance from the source population was virtually indistinguishable from that of the modern serial founder model.
Suppose a modern serial founder model is parameterized as in Figure 6A. We obtain the instantaneous divergence model of DeGiorgio by setting the divergence time of all K populations to τD, the ancestral diploid population size to N, and the diploid size of population i to the value N_i given by Equation 9, for i = 1, 2, …, K, where τD, N, Nb, L, and Lb are the parameters in the modern serial founder model in the section Modern Serial Founder Model (DeGiorgio ). The value of N_i is chosen so that τD/N_i is the total duration in coalescent units of population i. To obtain the exact instantaneous divergence model described by DeGiorgio , we set τD = 2079, N = 10000, Nb = 250, L = 19, and Lb = 2. These values are the same values used for the modern serial founder model in Figure 6A. Using Equation 9 for the size of population i allows population i to experience the same level of genetic drift as modern population i in the modern serial founder model.
Figure 8 depicts patterns of genetic variation generated by the instantaneous divergence model. As was observed in the simulations of DeGiorgio , this model reproduces the approximate linear decline in gene diversity with increasing distance from the source population (Figure 8A). In contrast, between-population gene identity and pairwise FST yield patterns that are quite different from those observed in human data (contrast Figure 8, B and C, with Figure 4, B and C). All off-diagonal entries of Figure 8B have identical small gene identities. Also, pairs of populations that are close to the source population have smaller FST than pairs that are far from the source (Figure 8C).
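The matching condition that determines the size of population i in this section can be written out explicitly. What follows is a reconstruction from the verbal description, not a quotation of Equation 9. It assumes that the ancestry of modern population i in the modern serial founder model passes through i − 1 bottlenecks of length Lb and size Nb and spends the remaining τD − (i − 1)Lb generations at size N. Equating the elapsed coalescent time in the two models then gives

```latex
\frac{\tau_D}{N_i} \;=\; \frac{(i-1)\,L_b}{N_b} \;+\; \frac{\tau_D - (i-1)\,L_b}{N}
\qquad\Longrightarrow\qquad
N_i \;=\; \frac{\tau_D}{\dfrac{(i-1)\,L_b}{N_b} + \dfrac{\tau_D - (i-1)\,L_b}{N}},
\qquad i = 1, 2, \ldots, K .
```

Under this form, N_1 = N, and with τD = 2079, N = 10000, Nb = 250, and Lb = 2 the most distant population has N_100 = 2079/0.9801, or roughly 2121 diploid individuals.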
Figure 8
Patterns of genetic variation in the instantaneous divergence model. The values of the model parameters are indicated in the section Instantaneous Divergence Model. (A) Gene diversity of the populations as a function of distance from the source population, where distance is measured in units of populations. (B) Between-population gene identity for pairs of populations. (C) Pairwise FST for pairs of populations.
The approximate linear decline in gene diversity produced by the instantaneous divergence model (Figure 8A) is caused by the loss of alleles and consequent decrease in heterozygosity due to increased genetic drift within populations that are far from the source population (DeGiorgio ). However, the fact that all off-diagonal entries of Figure 8B are identical indicates that no correlation exists with geography for between-population gene identity under the instantaneous divergence model. This lack of correlation causes the pattern of pairwise FST values to be driven completely by the sizes of population pairs. Hence, population pairs far from the source location, which have smaller population sizes, and therefore smaller within-population coalescence times, have higher FST values.
Because the approximate linear decline in gene diversity (Figure 8A) generated by the instantaneous divergence model matches the pattern observed from human genetic data (Figure 4A), we can conclude that the pattern of within-population gene diversity observed from human data reflects the cumulative increase in genetic drift with increasing distance from Africa (DeGiorgio ). However, the patterns of between-population summary statistics generated by the instantaneous divergence model (Figure 8, B and C) do not match the patterns observed from data (Figure 4, B and C). Thus, a model that incorporates only a cumulative increase in genetic drift with increasing distance from a source is not sufficient to predict observed patterns of between-population genetic diversity.
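The reasoning about the instantaneous divergence model can be illustrated numerically. The sketch below is not the article's derivation, and it rests on several assumptions: a continuous-time coalescent approximation, the infinite-alleles mutation model (so that gene identity is the expectation of exp(−2μT) over the coalescence time T), a Nei-type pairwise FST computed from gene identities, and population sizes set by matching elapsed coalescent time to i − 1 bottlenecks. All function names are hypothetical.

```python
import math


def size_matched(i, tau_D=2079, N=10_000, Nb=250, Lb=2):
    """Assumed diploid size of population i: constant size giving the same
    coalescent duration as i - 1 bottlenecks of length Lb and size Nb."""
    in_bottleneck = (i - 1) * Lb
    return tau_D / (in_bottleneck / Nb + (tau_D - in_bottleneck) / N)


def identity_ancestral(N=10_000, mu=2.5e-5):
    """Gene identity of two lineages in a constant population of diploid size N."""
    return 1.0 / (1.0 + 4.0 * N * mu)


def identity_between(tau_D=2079, N=10_000, mu=2.5e-5):
    """Lineages from different populations cannot coalesce before tau_D,
    so the value does not depend on which two populations are chosen."""
    return math.exp(-2.0 * mu * tau_D) * identity_ancestral(N, mu)


def identity_within(i, tau_D=2079, N=10_000, mu=2.5e-5):
    """Gene identity of two lineages sampled from population i."""
    Ni = size_matched(i, tau_D=tau_D, N=N)
    coal = 1.0 / (2.0 * Ni)               # pairwise coalescence rate per generation
    rate = coal + 2.0 * mu                # coalescence or mutation on either lineage
    escape = math.exp(-rate * tau_D)      # neither event before tau_D
    recent = coal / rate * (1.0 - escape) # coalescence first, within the population
    return recent + escape * identity_ancestral(N, mu)


def fst_pair(i, j):
    """Nei-type FST for two equally weighted populations (assumed definition)."""
    Ji, Jj, Jb = identity_within(i), identity_within(j), identity_between()
    H_S = 1.0 - (Ji + Jj) / 2.0
    H_T = 1.0 - (Ji + Jj + 2.0 * Jb) / 4.0
    return (H_T - H_S) / H_T


print(identity_between(), identity_within(1), identity_within(100))
print(fst_pair(1, 2), fst_pair(99, 100))
```

In this sketch the between-population identity is a single number shared by every pair, while within-population identity grows as the assumed size shrinks, so FST increases with distance from the source, in line with the description of Figure 8, B and C.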
Archaic Serial Founder Model
The serial founder model was motivated as a model to explain how modern humans expanded out of Africa and colonized the world. Our general serial founder model, however, does not place restrictions on the time of the first founding event. Therefore, our general model reduces to an archaic serial founder model (Figure 5D) when the time to the first founding event occurs distantly in the past. The archaic serial founder model, although it has an identical mathematical form to the modern serial founder model, is conceptually different in the sense that it is motivated by hypotheses regarding expansions of ancient hominids out of Africa, whereas the modern serial founder model is motivated by hypotheses of recent expansion of anatomically modern humans out of Africa. The effect of increasing the time of the first founding event can be investigated in the serial founder model while holding all other parameters in the model constant.
In this section, we discuss how the patterns for within-population gene diversity, between-population gene identity, and pairwise FST change as the serial founding process is pushed farther into the past. To obtain an archaic serial founder model, we assume that except for divergence time τD, all parameters are the same as in the modern serial founder model considered in Figure 6. We consider divergence times of τD = 5000, 7500, 10000, 16000, and 40000 generations ago. Divergence times τD = 16000 and τD = 40000 are of particular interest because, assuming a generation time of 25 years, they approximate estimates of the divergence of modern humans with Neanderthal (400 KYA; Green ; Noonan ) and Homo erectus (1 MYA; Takahata 1993) populations, respectively.
For τD = 5000, relative to the modern serial founder model in which τD = 2079, a decrease occurs in the magnitude of the slope of the decline of gene diversity with increasing distance from the source population (Figure 9A). 
The increased gene identity and decreased FST between populations that are far from the source population relative to between populations that are close to the source, although still observable, are less distinct with the increased divergence time. Further increasing the divergence time to τD = 7500 (Figure 9B) and τD = 10000 (Figure 9C) leads to a progressive decrease in the differences among populations in values of the three summary statistics. For a serial founder model with a divergence time of τD = 16000, at a putative time of the Neanderthal divergence, differences in values among populations for each of the three summary statistics are small (Figure 9D). For the H. erectus serial founder model with τD = 40000, differences in values among populations for each of the three summary statistics are nearly negligible, displaying almost no trend (Figure 9E).
Figure 9
Patterns of genetic variation in an archaic serial founder model, as a function of varying divergence time τD. The values of the model parameters for A–E are the same as in Figure 6A, with the exception that the divergence time τD, measured in generations, varies across the plots. The first column is gene diversity of the populations as a function of distance from the source population, where distance is measured in units of populations. The second column is between-population gene identity for pairs of populations. The third column is pairwise FST for pairs of populations. (A) τD = 5000. (B) τD = 7500. (C) τD = 10000. (D) τD = 16000. (E) τD = 40000.
As τD increases, the differences among populations in values of gene diversity, between-population gene identity, and FST decrease. These smaller differences result from the smaller degree of influence that ancient bottlenecks have on genetic diversity in comparison with recent bottlenecks of identical severity. This lack of influence of ancient bottlenecks on present-day gene diversity is reflected most strongly in the small difference in gene diversity between population 1 and population 100 in the H. erectus serial founder model (Figure 9E). Furthermore, with greater τD, the difference between the divergence time for two populations sampled close to the source and for two populations sampled far from the source is small relative to τD. This small difference in divergence times causes between-population summary statistics such as gene identity and FST to have little correlation with geography (i.e., most off-diagonal entries have similar values) at large divergence times (Figure 9E).
These results imply that the patterns in gene diversity, gene identity, and FST observed from empirical data cannot be predicted solely by an archaic serial founder process using our parameterization; specifically, the observed patterns are not consistent with a serial founder process that occurs too far back in the past. 
Pushing back the time of the first founding event while holding all other parameters constant decreases the ability of the serial founder model to generate the patterns observed in Figure 4.
Discussion
In this article, we have derived pairwise coalescence-time distributions for a serial founder model. Under the model, we have provided analytical formulas for expected coalescence times, expected homozygosity, and pairwise FST. In addition, we have analytically described the trend of decreasing gene diversity with increasing distance from the source population, and the patterns observed in between-population gene identity and pairwise FST. Using coalescence-time densities in various special cases, we have found that the modern serial founder model and the nested regions model are consistent with geographic patterns of within- and between-population genetic diversity observed in human data. Our work demonstrates the utility of using theoretical computations on between-population summary statistics in conjunction with similar computations on within-population statistics to predict geographic patterns in genetic data.
One pattern that was not predicted by any of our models was the large FST observed in the Americas. Whereas the modern serial founder and the nested regions models predict small FST between populations far from the source, FST values in the Americas are large. It is possible that the models provide a poor fit to the pattern of evolution in the Americas after the initial founding of the Native American population, as they also are inconsistent with the large differences in gene diversity among populations in the Americas. During the initial migration into the Americas, small individual populations may have experienced highly variable levels of genetic drift as they spread over a large unoccupied region (e.g., Wang ; Goebel ; Meltzer 2009). Such a migration process could have given rise to highly variable levels of genetic diversity across the region, as well as a somewhat irregular pattern in FST. 
If we were to modify our model to incorporate this variability along with stronger bottlenecks or smaller population sizes within the Americas relative to those in non-American populations, then we might be able to produce patterns that agree with the observed data. Indeed, Hunley found that model parameters can be chosen to enable patterns of within- and between-population genetic diversity to closely match those empirically observed in the Americas.
Another pattern that was not predicted by any of our models is the small between-population gene identity observed between pairs of populations, one from Oceania and the other not from Oceania (Figure 4B). This pattern could potentially be explained either by an ancient divergence of the Oceanian populations from the non-Oceanian populations through a separate migration out of Africa to Oceania (e.g., Derricourt 2005; Bulbeck 2007; Field ; Szpiech ; Kayser 2010) or by admixture of the populations in Oceania with an archaic human population (e.g., Reich ). A separate founding process could have generated low levels of within-population gene diversity for the Oceanian populations while simultaneously producing the low levels of between-population gene identity between Oceanian and non-Oceanian populations. Alternatively, because the increase in between-population coalescence times that would be caused by ancient admixture would result in a decrease in between-population gene identity, such admixture could potentially explain the disagreement of the data with our model predictions. Separate migrations or ancient admixture could potentially be incorporated into a more general version of our model to investigate the plausibility of these scenarios.
By increasing the time of the first founding event, we have determined that the archaic serial founder model is not able to reproduce patterns of gene diversity, between-population gene identity, and pairwise FST observed in human genetic data. 
However, limited archaic admixture coupled with a modern serial founder model might not be incompatible with the patterns we have examined. Signatures of archaic admixture might exist in modern human population-genetic data (e.g., Garrigan and Hammer 2006; Plagnol and Wall 2006; Green ; Reich ) and as discussed above, such admixture could potentially explain anomalous observations in Oceania. However, this admixture, if it indeed occurred, must have been insufficient to generate a large signature in most of the patterns that we have studied.
Although the patterns of gene diversity produced by the serial founder and the instantaneous divergence models are virtually indistinguishable (DeGiorgio ), we have shown that these models can be differentiated using between-population gene identity and pairwise FST. Ultimately, this potential for differentiation traces to distinctive distributions of pairwise coalescence times. In the instantaneous divergence model, each population has a constant size up until time τD and consequently, the coalescent process simply follows an exponential distribution until time τD and then another exponential distribution with a different rate after time τD. In contrast, in the serial founder model, the rate of coalescence inside a bottleneck is elevated compared to outside the bottleneck. This increased rate of coalescence causes lineages to merge within a narrow time interval. Because the serial founder model incorporates multiple bottlenecks, the distribution of coalescence times is multimodal.
Recently, many studies have found that two-dimensional spatial maps generated from principal components analysis (PCA) applied to human genetic data closely match maps of geographic sampling locations of populations (e.g., Lao ; Novembre ; Price ; Bryc ; Wang ; Xing ). 
McVean (2010) demonstrated a close link between pairwise coalescence times and PCA, in which sampled lineages can be projected onto principal components through expected coalescence times for pairs of lineages. The coalescence-time distributions provided in this article can potentially be used to interpret PCA maps, so that PCA maps themselves might be used as summary statistics for testing evolutionary models.
Estimated coalescence-time distributions might also be utilized more formally for maximum-likelihood estimation of parameters such as bottleneck lengths, bottleneck sizes, and divergence times (e.g., Thomson ; Takahata ; Tang ; Rannala and Yang 2003; Tishkoff and Verrelli 2003; Garrigan and Hammer 2006; Fagundes ; Blum and Jakobsson 2011). Further, these distributions might also be useful for hypothesis testing; because many of the models in this article are nested, likelihood-ratio tests can be performed. For extending our work to perform maximum-likelihood inference, it will be desirable to extend the computations to permit the sampling of multiple lineages in each population. Such an extension could potentially build upon the work of Marth , who derived the coalescence-time distribution for a sample of n lineages in a single population with multiple bottlenecks.
An additional feature of structured population models that would be desirable to incorporate is migration between populations after their initial founding. In the archaic serial founder model, some level of migration between neighboring populations might enable the model to make predictions that more closely match observations from human genetic data. For the modern serial founder model, simulations have shown that small to moderate levels of migration have relatively little impact on observed patterns of genetic diversity (DeGiorgio ). 
In any case, inclusion of migration would enable us to examine considerably more complex versions of the models that we have investigated.
Finally, one important quantity that we did not explore is linkage disequilibrium (LD). In simulations, we previously studied whether the spatial distribution of LD observed in worldwide human populations is consistent with a serial founder model (DeGiorgio ). We found that the serial founder model can indeed predict the observed spatial distribution of LD. Moreover, we found that LD patterns can be useful in distinguishing the patterns predicted by different evolutionary models. Therefore, incorporation of LD into our theoretical models would provide a distinct type of statistic that would further enhance model identifiability. For example, because excess long-range LD is a signature of ancient admixture (e.g., Plagnol and Wall 2006), incorporation of LD statistics would be useful for assessing whether models that include archaic admixture provide a better fit to observed human genetic variation than models that do not consider admixture. Because LD is such a valuable quantity, it would be informative to examine patterns of LD produced by the various models by incorporating recombination into the theory.
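The contrast drawn in the Discussion between a smooth exponential coalescence-time distribution and a multimodal one can be illustrated with a discrete-generation calculation. The sketch below is not the analytical result of this article. It assumes a Wright–Fisher approximation in which two lineages sampled from the most distant modern population coalesce in a given generation with probability 1/(2 × size), and it assumes that each period of L + Lb generations, counted backward from the present, consists of L generations at size N followed by Lb generations at size Nb. The function name is hypothetical.

```python
def pair_coalescence_pmf(K=100, L=19, Lb=2, N=10_000, Nb=250):
    """Return (pmf, tail) for the pairwise coalescence time T of two lineages
    from population K, where pmf[t - 1] = P(T = t) for t = 1..tau_D and
    tail = P(T > tau_D). Discrete-generation approximation (assumed)."""
    period = L + Lb
    tau_D = (K - 1) * period
    pmf = []
    survive = 1.0  # probability of no coalescence in earlier generations
    for t in range(1, tau_D + 1):
        phase = (t - 1) % period
        size = N if phase < L else Nb  # last Lb generations of each period: bottleneck
        p = 1.0 / (2 * size)
        pmf.append(survive * p)
        survive *= 1.0 - p
    return pmf, survive


pmf, tail = pair_coalescence_pmf()
# Generation 19 lies just outside the most recent bottleneck, generation 20 inside it.
print(pmf[18], pmf[19], tail)
```

Under these assumptions the probability mass jumps by a factor of about N/Nb = 40 on entering each bottleneck and falls back afterward, producing one narrow peak per bottleneck rather than a single exponential decay.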