
Inference of human population history from individual whole-genome sequences.

Abstract

The history of human population size is important for understanding human evolution. Various studies have found evidence for a founder event (bottleneck) in East Asian and European populations, associated with the human dispersal out-of-Africa event around 60 thousand years (kyr) ago. However, these studies have had to assume simplified demographic models with few parameters, and they do not provide a precise date for the start and stop times of the bottleneck. Here, with fewer assumptions on population size changes, we present a more detailed history of human population sizes between approximately ten thousand and a million years ago, using the pairwise sequentially Markovian coalescent model applied to the complete diploid genome sequences of a Chinese male (YH), a Korean male (SJK), three European individuals (J. C. Venter, NA12891 and NA12878 (ref. 9)) and two Yoruba males (NA18507 (ref. 10) and NA19239). We infer that European and Chinese populations had very similar population-size histories before 10-20 kyr ago. Both populations experienced a severe bottleneck 10-60 kyr ago, whereas African populations experienced a milder bottleneck from which they recovered earlier. All three populations have an elevated effective population size between 60 and 250 kyr ago, possibly due to population substructure. We also infer that the differentiation of genetically modern humans may have started as early as 100-120 kyr ago, but considerable genetic exchanges may still have occurred until 20-40 kyr ago.


Year: 2011 PMID: 21753753 PMCID: PMC3154645 DOI: 10.1038/nature10231

Source DB: PubMed Journal: Nature ISSN: 0028-0836 Impact factor: 49.962

The distribution of the time since the most recent common ancestor (TMRCA) between two alleles in an individual provides information about the history of population size change over time. Existing methods for reconstructing the detailed TMRCA distribution have analyzed large samples of individuals at non-recombining loci like mitochondrial DNA[13]. However, the statistical resolution of inferences from any one locus is poor, and power fades rapidly moving back in time as there are few independent lineages probing deep time depths (in humans, no information at all is available from mitochondrial DNA beyond about 200kya when all humans share a common maternal ancestor[11]). In contrast, a diploid genome sequence harbors hundreds of thousands of independent loci, each with its own TMRCA between the two alleles an individual carries. In principle it should be possible to reconstruct the TMRCA distribution across the autosomes and chromosome X by studying how the local density of heterozygous sites changes across the genome, reflecting segments of constant TMRCA separated by historical recombination events. To explore whether we could take advantage of this idea to learn about the detailed TMRCA distribution from a diploid whole genome sequence, we proposed the PSMC model, which is a specialization of the sequentially Markovian coalescent model[14] to the case of two chromosomes (Figure 1a). The free parameters of this model include the scaled mutation and recombination rates, and piecewise constant ancestral population sizes (Methods). We scaled results to real time assuming 25 years per generation and a neutral mutation rate of 2.5×10⁻⁸ per generation[15]. The consequences of uncertainty in the two scaling parameters will be discussed later in the text.
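As a rough illustration of the scaling step described above, the following sketch converts a scaled time and a per-site θ into years and a reference population size. It assumes the usual coalescent conventions (θ = 4·N0·μ per site, time measured in units of 2·N0 generations), which the text here does not spell out. The function names and the example θ value are hypothetical.

```python
# Scaling constants quoted in the text.
MU_PER_GEN = 2.5e-8   # neutral mutation rate per site per generation
GEN_YEARS = 25.0      # years per generation


def per_year_rate(mu=MU_PER_GEN, gen_years=GEN_YEARS):
    """Mutation rate per site per year."""
    return mu / gen_years


def reference_size(theta_per_site, mu=MU_PER_GEN):
    """Reference size N0, ASSUMING theta = 4 * N0 * mu per site."""
    return theta_per_site / (4.0 * mu)


def scaled_time_to_years(t_scaled, theta_per_site,
                         mu=MU_PER_GEN, gen_years=GEN_YEARS):
    """Convert a time in units of 2*N0 generations (ASSUMED) into years."""
    n0 = reference_size(theta_per_site, mu)
    return t_scaled * 2.0 * n0 * gen_years


if __name__ == "__main__":
    theta = 1.0e-3  # hypothetical per-site theta, for illustration only
    print(per_year_rate())                   # about 1e-9 per site per year
    print(reference_size(theta))             # about 10,000
    print(scaled_time_to_years(0.1, theta))  # about 50,000 years
```

Because both time and size are inferred only up to these constants, a different choice of generation time or mutation rate rescales the whole curve without changing its shape.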

Figure 1

Illustration of the PSMC model and its application to simulated data

(a) The PSMC infers the local time to the most recent common ancestor (TMRCA) based on the local density of heterozygotes, using a Hidden Markov Model, where the observation is a diploid sequence, the hidden states are discretized TMRCA and the transitions represent ancestral recombination events. (b) We used the ms software to simulate the TMRCA relating the two alleles of an individual across a 200kb region (the thick red line), and inferred the local TMRCA at each locus using the PSMC (the heat map). The inference usually includes the correct time, with the greatest errors at transition points.
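The hidden Markov structure in panel (a) can be illustrated with a toy posterior decoder. This is a simplified sketch, not the PSMC implementation: the emission probability of a heterozygous bin is taken as 1 − exp(−θt), and the transition matrix is a placeholder (stay with fixed probability, otherwise jump uniformly) rather than the coalescent-with-recombination transition used by the actual model. All names and parameter values are hypothetical.

```python
import math


def posterior_tmrca(obs, times, theta, switch_prob):
    """Forward-backward posterior over discretized TMRCA states.

    obs: list of 0/1 per bin (1 = heterozygous bin).
    times: TMRCA of each hidden state (at least two states, all > 0).
    theta: scaled mutation rate per bin.
    switch_prob: toy probability of changing state between adjacent bins.
    """
    k = len(times)
    # emit[s][o]: probability of observation o given state s.
    emit = [[math.exp(-theta * t), 1.0 - math.exp(-theta * t)] for t in times]

    def trans(i, j):
        # Placeholder transition, NOT the PSMC transition probabilities.
        return 1.0 - switch_prob if i == j else switch_prob / (k - 1)

    def normalize(v):
        z = sum(v)
        return [x / z for x in v]

    # Forward pass, rescaled at each step to avoid underflow.
    fwd = [normalize([emit[s][obs[0]] / k for s in range(k)])]
    for o in obs[1:]:
        prev = fwd[-1]
        fwd.append(normalize([
            emit[j][o] * sum(prev[i] * trans(i, j) for i in range(k))
            for j in range(k)
        ]))

    # Backward pass, rescaled the same way.
    bwd = [[1.0] * k for _ in obs]
    for n in range(len(obs) - 2, -1, -1):
        bwd[n] = normalize([
            sum(trans(i, j) * emit[j][obs[n + 1]] * bwd[n + 1][j]
                for j in range(k))
            for i in range(k)
        ])

    # Posterior at each bin is proportional to forward * backward.
    return [normalize([f * b for f, b in zip(fr, br)])
            for fr, br in zip(fwd, bwd)]


if __name__ == "__main__":
    # 50 homozygous bins followed by a heterozygote-rich stretch.
    obs = [0] * 50 + [1, 0, 1, 1, 0, 1] * 10
    post = posterior_tmrca(obs, times=[0.1, 2.0], theta=0.5, switch_prob=0.01)
    print(post[10], post[80])
```

In this toy input the recent-TMRCA state dominates the homozygous stretch and the old-TMRCA state dominates the heterozygote-rich stretch, with the least certainty near the change point, which mirrors the behavior described for panel (b).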

To validate our model, we simulated a hundred 30Mbp sequences with a sharp out-of-Africa bottleneck followed by a population expansion, and inferred population size history with PSMC (Figure 2a). PSMC is able to recover the parameters used in the simulation, and the variance of the estimate is small between 20kya and 3Mya. More recently than 20kya or more anciently than 3Mya, few recombination events are left in the present sequence, which reduces the power of PSMC; the estimated effective population size (N) in these time intervals is therefore less accurate and has large variance. To test the robustness of the model, we introduced variable mutation rates and recombination hotspots in the simulation (Supplementary Text). The inference is still close to the true history (Figure 2b). A uniform rate of SNP ascertainment errors does not change our qualitative results either (Figure S2). On the other hand, the simulations also reveal a limitation of PSMC in recovering sudden changes in effective population size. For example, the instantaneous reduction from 12,000 to 1,200 at 100kya in the simulation was spread over the preceding 50,000 years in the PSMC reconstruction.

Figure 2

PSMC estimate on simulated data

(a) PSMC estimate on data simulated by msHOT. The blue curve is the population size history used in simulation; the red curve is the PSMC estimate on the originally simulated sequence; the 100 thin green curves are the PSMC estimates on 100 sequences randomly resampled from the original sequence. (b) PSMC estimate on data with variable mutation rate or with hotspots.

We applied the PSMC model to real data from recently published genome sequences (Table 1). Figure 3a reveals that all populations are very similar in estimated N history between 150 and 1500kya. YRI differentiates from non-African populations around 100–120kya (at 110kya, N=15,313±559 for YRI and N=12,829±485 for non-African populations). This evidence of early population differentiation is potentially consistent with the archaeological evidence of anatomically modern humans found in the Near East around 100kya[12]. European and East Asian populations are nearly identical in estimated N beyond 11kya. From a peak of 13,500 at 150kya, their N dropped by a factor of 10 to 1,200 between 40 and 20kya, before a sharp increase whose precise magnitude we do not have power to measure. We also observe a less marked bottleneck in YRI, from a peak of 16,100 around 100–150kya to 5,700 at 50kya, recovering earlier[16] than the out-of-Africa populations with increases back to 8,700 by 20kya, coinciding with the Last Glacial Maximum. All populations show increased N between 60 and 200kya, about the time of origin of anatomically modern humans[17]. An alternative to an increase in actual population size during this time would be that there was population structure involving separation and admixture[11,16] (Figure S5).

Table 1

Properties of the input sequences.

Label	Description	Coverage	Called bases (bp)	Heterozygotes (bp)	Heterozygosity (×0.001)
YRI1.A¹⁰	NA18507 autosomes	40X	2.14 G	2.17 M	1.013
YRI2.A⁹	NA19239 autosomes	29X	2.11 G	2.21 M	1.051
EUR1.A⁸	Venter autosomes	9X	2.13 G	1.23 M	0.578
EUR2.A⁹	NA12891 autosomes	38X	2.11 G	1.67 M	0.791
KOR.A⁷	SJK autosomes	20X	2.13 G	1.47 M	0.690
CHN.A⁶	YH autosomes	30X	2.19 G	1.52 M	0.694
YRI3.X⁹	NA19240 X chromosome	38X	106 M	71.6 k	0.673
EUR3.X⁹	NA12878 X chromosome	35X	110 M	48.0 k	0.436
KOR-CHN.X	SJK-YH combined X chr.	-	102 M	39.7 k	0.390
YRI1-EUR1.X	NA18507-Venter combined X chr.	-	83 M	55.6 k	0.670
YRI1-KOR.X	NA18507-KOR combined X chr.	-	100 M	66.9 k	0.669
YRI1-CHN.X	NA18507-YH combined X chr.	-	106 M	69.5 k	0.657

Coverage equals the average number of reads covering HapMap3 loci. A base is said to be called if it passes all filters described (Methods). The relatively lower coverage for EUR1.A leads to higher sampling bias at heterozygotes, which leads to underestimated heterozygosity but can be corrected by adjusting the neutral mutation rate in scaling (Supplementary Section S1.2).
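The heterozygosity column of Table 1 is the number of heterozygotes divided by the number of called bases, expressed per 1,000 bases. The short sketch below recomputes it for EUR1.A and shows one plausible reading of "adjusting the neutral mutation rate in scaling": if a fraction of heterozygotes is missed, the mutation rate used for scaling is reduced by the same fraction. That interpretation, the function names, and the use of the 29% figure from the Figure 3 caption are assumptions made for illustration.

```python
def heterozygosity_per_kb(n_het, n_called):
    """Heterozygous sites per 1,000 called bases (units of x0.001)."""
    return 1000.0 * n_het / n_called


def adjusted_mutation_rate(mu, missing_fraction):
    """ASSUMED correction: scale mu by the fraction of heterozygotes detected."""
    return mu * (1.0 - missing_fraction)


if __name__ == "__main__":
    # EUR1.A row of Table 1: 1.23 M heterozygotes in 2.13 G called bases.
    h = heterozygosity_per_kb(1.23e6, 2.13e9)
    print(round(h, 3))  # close to the tabulated 0.578 (inputs are rounded)

    # 29% of heterozygotes assumed missing in EUR1.A (Figure 3 caption).
    print(adjusted_mutation_rate(2.5e-8, 0.29))  # about 1.78e-8
    print(round(h / (1.0 - 0.29), 3))            # implied full heterozygosity
```

Under this reading, the implied full heterozygosity for EUR1.A comes out near 0.81 per kb, similar to the 0.791 tabulated for the higher-coverage EUR2.A sample.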

Figure 3

PSMC estimate on real data

(a) The population sizes inferred from autosomes of six individuals. 5%, 10% and 29% of heterozygotes are assumed to be missing in CHN.A, KOR.A and EUR1.A, respectively. (b) The population sizes inferred from male-combined X chromosomes and the simulated African-Asian combined sequences from the best-fit model by Schaffner et al. Sizes inferred from X chromosome data are scaled by 4/3; the neutral mutation rate on X, which is used in time scaling, is estimated with the ratio of male-to-female mutation rate α equal to 2 (Methods).
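The two X-chromosome adjustments in panel (b) can be written out explicitly. The sketch below assumes the standard expectation that an X chromosome spends two thirds of its time in females and one third in males, while an autosome spends half in each, so that the X-to-autosome mutation rate ratio is 2(2+α)/(3(1+α)). The Methods are not reproduced here, so this formula and the function names are assumptions rather than a statement of the authors' exact procedure.

```python
def x_to_autosome_mutation_ratio(alpha):
    """Ratio mu_X / mu_A for male-to-female mutation rate ratio alpha.

    ASSUMES: X is in females 2/3 of the time, autosomes 1/2 of the time.
    """
    return 2.0 * (2.0 + alpha) / (3.0 * (1.0 + alpha))


def x_size_to_autosomal_scale(n_x):
    """Rescale an X-chromosome effective size by 4/3, as in the caption."""
    return n_x * 4.0 / 3.0


if __name__ == "__main__":
    ratio = x_to_autosome_mutation_ratio(2.0)
    print(ratio)            # 8/9, about 0.889
    print(1.0e-9 * ratio)   # per-year X rate if the autosomal rate is 1e-9
    print(x_size_to_autosomal_scale(7500.0))  # 10000.0
```

With α = 2 the X chromosome is expected to mutate at about 89% of the autosomal rate. The 4/3 factor reflects there being three X chromosomes for every four autosomes in a population with equal numbers of males and females.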

We also see in all populations an increase in estimated N beyond 1Mya, with a sharp increase beyond 3Mya. Although it is tempting to read into this the transition from the previously estimated larger N at the time of the split from chimpanzee[18], our method may also be subject to artifacts in this region due to regions of balancing selection or clustered false heterozygotes related to segmental duplications (Figure S3). Analyzing a European female X chromosome (EUR3.X) yields a history similar to that from autosomes scaled by 0.75, as expected for the X chromosome (Figure 3b). We do not observe a more severe bottleneck on the X chromosome[19]. To investigate the relationship between African and non-African populations, we combined X chromosomes from a YRI male and a non-African male to construct a pseudo-diploid genome. From Figure 3b, we can see that although African and non-African populations might have started to differentiate as early as 100–120kya, they largely remain as one population until approximately 60–80kya, the time point when the YRI1-EUR1.X curve clearly leaves EUR3.X. This supports the recent analysis of the relationship between the Neandertal genome and that of modern humans[20], which concluded that West Africans and non-Africans descended from a homogeneous ancestral population in the last 100,000 years with subsequent minor admixture out of Africa from Neandertals, rather than an alternative explanation of ancient (>300,000 year old) sub-structure separating West African and non-African populations. From Figure 3b, we also notice, surprisingly, that there is a low N between African and non-African populations until approximately 20kya, suggesting substantial genetic exchanges between these populations long after the initial separation. Complete separation would correspond to very large or effectively infinite N, as seen below 20kya.

To explore whether the inferred recent gene flow is a modeling artifact, we simulated complete divergence at 60kya according to the Schaffner et al. model[21], and saw increased rather than reduced N in the period 20–60kya (brown line in Figure 3b). To explore further, we extracted from YRI1-KOR.X segments that PSMC indicates coalesce more recently than 50kya. These comprise 220 segments covering 31.2Mbp (>20% of the X chromosome). We observe 1,363 base-pair differences in 20.7Mbp of callable sequence in these segments, corresponding to an average divergence time of 37.4kya. In contrast, if we apply the same process to the simulated data from the Schaffner et al. model, the apparently recently diverged segments cover only 0.4% of the simulated chromosome. The human-macaque divergence in the 220 segments was only 4% lower than the chromosome average, so regional variability in mutation rates cannot explain these results. In summary, the existence of long segments of low divergence between YRI1 and KOR suggests substantial genetic exchange between West African and non-African populations up until 20–40kya, and is not consistent with a simple separation approximately 60kya. The evidence for continued gene flow between Africans and non-Africans prior to the separation of Europeans and East Asians (Supplementary Section S4.2) is more recent than the archaeologically documented time of the out-of-Africa dispersal since there are fossils in both Europe and Australasia that date to >40 kya[22]. An important caveat to this result is uncertainty of the per-year mutation rate 1.0×10⁻⁹ (=2.5×10⁻⁸/25). While this mutation rate agrees well with the rates estimated between primates averaged over millions of years (Supplementary Section S3.1), generation intervals as long as 29 years over the last few thousand years[23] and present mutation rates lower than 2.5×10⁻⁸ per generation[9] are possible in principle, and these could make our recent date estimates somewhat older, although it is difficult to imagine that these inaccuracies could make a date of final gene flow as old as 60kya consistent with the data. Our analyses also cannot exclude the possibility that the divergence time inferred from X chromosomes may not be representative due to sex-biased demographic processes[19], highlighting the importance of repeating this analysis on autosomal data once haploid whole genome sequences become available[24]. Intriguingly, a recent study using an orthogonal type of data (analysis of allele frequencies) also inferred that gene flow between Africans and non-Africans continued until strikingly recently, in the case of that study, until 17–26kya[25]. An important goal for future work is to determine whether these recent dates reflect real history, and if so to obtain more detail about the timing and scale of the events involved.

In this paper we have introduced a novel method to infer the history of effective population size from genome wide diploid sequence data. It is relatively straightforward to apply, with less potential ascertainment bias in comparison to existing methods that use selective genotyping data or the resequencing data from a few loci. Furthermore, our method is computationally tractable and typically uses much more primary sequence data than the existing methods, which allows us to estimate population size at each time going back in history, rather than assume a parametric structure of times, divergences and size changes.

The results described above concerning the timing and depth of the out-of-Africa bottleneck are broadly consistent with previous studies though our results are more detailed (Supplementary Section S4.2). The hypothesis that there was significant ongoing genetic exchange throughout the bottleneck is surprising in light of current views about human migrations; however, it is not inconsistent with the archaeological literature, and should motivate further research. There is the potential to extend this type of SMC-HMM approach to data from multiple individuals, which would access more recent times, but this will require inference over a substantially more complex hidden state space of trees on the haplotypes, with each Markov path representing an ancestral recombination graph[14]. In addition, beyond humans, there is the potential to apply the method to investigate the population size history of other species for which a single diploid genome sequence has been obtained (Supplementary Section S2.2).
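The average divergence time quoted above for the low-divergence YRI1-KOR.X segments (1,363 differences in 20.7Mbp of callable sequence) can be approximately reproduced with a back-of-the-envelope calculation. The sketch assumes that the expected pairwise divergence is 2·μ·T and that the per-year rate on the X chromosome is 8/9 of the autosomal 1.0×10⁻⁹ (the value implied by α = 2 under a standard sex-averaging argument). Both are assumptions, and the function name is hypothetical.

```python
AUTOSOMAL_RATE_PER_YEAR = 1.0e-9   # 2.5e-8 per generation / 25 years
X_TO_AUTOSOME_RATIO = 8.0 / 9.0    # ASSUMED ratio for alpha = 2


def divergence_time_years(n_diff, n_sites,
                          mu_per_year=AUTOSOMAL_RATE_PER_YEAR * X_TO_AUTOSOME_RATIO):
    """Pairwise TMRCA estimate, ASSUMING divergence d = 2 * mu * T."""
    d = n_diff / n_sites
    return d / (2.0 * mu_per_year)


if __name__ == "__main__":
    t = divergence_time_years(1363, 20.7e6)
    print(round(t / 1000.0, 1), "kyr")  # roughly 37 kyr
```

This gives roughly 37 kyr, close to the reported 37.4kya; the small gap presumably reflects details of the authors' calculation that are not given here. Since the estimate scales inversely with the mutation rate, a lower per-year rate pushes the date older, which is the caveat raised in the text.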

Methods Summary

Illumina short reads were obtained from the Short Read Archive and capillary reads from TraceDB. Reads were aligned to the human reference genome with BWA[26]. The consensus sequences were called by SAMtools[27] and then divided into non-overlapping 100bp bins, with a bin scored as heterozygous if it contains a heterozygote and as homozygous otherwise. The resultant bin sequences were taken as the input of the PSMC estimate. Coalescent simulation was done by ms[28] and cosi[21]. The simulated sequences were binned in the same way. The free parameters in the discrete PSMC-HMM model are the scaled mutation rate, recombination rate and piecewise constant population sizes. The time interval each size parameter spans was manually chosen. The expectation-maximization iteration started from a constant-sized population history. The expectation step was done analytically; Powell’s direction set method was used for the maximization step. Parameter values stabilized by the 20th iteration, and these were taken as the final estimate. All parameters are scaled to a constant that is further determined under the assumption of a neutral mutation rate of 2.5×10⁻⁸.
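The binning step described above can be sketched as follows. The sketch assumes the diploid consensus is a string in which heterozygous calls appear as two-base IUPAC ambiguity codes and uncalled positions as "N". The output symbols ("1", "0", ".") and the rule that a bin with no called bases is marked missing are illustrative choices, not the input format of the actual software.

```python
HET_CODES = set("RYMKSW")  # IUPAC codes for two-base heterozygous calls


def bin_consensus(seq, bin_size=100):
    """Score non-overlapping bins of a diploid consensus sequence.

    Returns one symbol per bin:
      "1" if the bin contains at least one heterozygous call,
      "." if every position in the bin is uncalled (ASSUMED rule),
      "0" otherwise (homozygous bin).
    """
    labels = []
    for start in range(0, len(seq), bin_size):
        chunk = seq[start:start + bin_size].upper()
        if any(base in HET_CODES for base in chunk):
            labels.append("1")
        elif all(base == "N" for base in chunk):
            labels.append(".")
        else:
            labels.append("0")
    return "".join(labels)


if __name__ == "__main__":
    # Three toy bins: homozygous, one heterozygote (R = A/G), uncalled.
    seq = "A" * 100 + "A" * 50 + "R" + "A" * 49 + "N" * 100
    print(bin_consensus(seq))  # 01.
```

Collapsing each 100bp window to a single symbol shortens the observation sequence a hundredfold, at the cost of counting a bin with several heterozygotes the same as a bin with one.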

30 in total

1. Linkage disequilibrium in the human genome.

Authors: D E Reich; M Cargill; S Bolk; J Ireland; P C Sabeti; D J Richter; T Lavery; R Kouyoumjian; S F Farhadian; R Ward; E S Lander
Journal: Nature Date: 2001-05-10 Impact factor: 49.962

2. Estimate of the mutation rate per nucleotide in humans.

Authors: M W Nachman; S L Crowell
Journal: Genetics Date: 2000-09 Impact factor: 4.562

3. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

4. Calibrating a coalescent simulation of human genome sequence variation.

Authors: Stephen F Schaffner; Catherine Foo; Stacey Gabriel; David Reich; Mark J Daly; David Altshuler
Journal: Genome Res Date: 2005-11 Impact factor: 9.043

5. A draft sequence of the Neandertal genome.

Authors: Johannes Krause; Adrian W Briggs; Tomislav Maricic; Udo Stenzel; Martin Kircher; Nick Patterson; Richard E Green; Heng Li; Weiwei Zhai; Markus Hsi-Yang Fritz; Nancy F Hansen; Eric Y Durand; Anna-Sapfo Malaspinas; Jeffrey D Jensen; Tomas Marques-Bonet; Can Alkan; Kay Prüfer; Matthias Meyer; Hernán A Burbano; Jeffrey M Good; Rigo Schultz; Ayinuer Aximu-Petri; Anne Butthof; Barbara Höber; Barbara Höffner; Madlen Siegemund; Antje Weihmann; Chad Nusbaum; Eric S Lander; Carsten Russ; Nathaniel Novod; Jason Affourtit; Michael Egholm; Christine Verna; Pavao Rudan; Dejana Brajkovic; Željko Kucan; Ivan Gušic; Vladimir B Doronichev; Liubov V Golovanova; Carles Lalueza-Fox; Marco de la Rasilla; Javier Fortea; Antonio Rosas; Ralf W Schmitz; Philip L F Johnson; Evan E Eichler; Daniel Falush; Ewan Birney; James C Mullikin; Montgomery Slatkin; Rasmus Nielsen; Janet Kelso; Michael Lachmann; David Reich; Svante Pääbo
Journal: Science Date: 2010-05-07 Impact factor: 47.728

6. A map of human genome variation from population-scale sequencing.

Authors: Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal: Nature Date: 2010-10-28 Impact factor: 49.962

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

8. Haplotype-resolved genome sequencing of a Gujarati Indian individual.

Authors: Jacob O Kitzman; Alexandra P Mackenzie; Andrew Adey; Joseph B Hiatt; Rupali P Patwardhan; Peter H Sudmant; Sarah B Ng; Can Alkan; Ruolan Qiu; Evan E Eichler; Jay Shendure
Journal: Nat Biotechnol Date: 2010-12-19 Impact factor: 54.908

9. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data.

Authors: Ryan N Gutenkunst; Ryan D Hernandez; Scott H Williamson; Carlos D Bustamante
Journal: PLoS Genet Date: 2009-10-23 Impact factor: 5.917

10. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937
