Literature DB >> 26832887

Inferring the Dynamics of Effective Population Size Using Autosomal Genomes.

Zheng Hou¹, Yin Luo², Zhisheng Wang³, Hong-Xiang Zheng¹, Yi Wang¹, Hang Zhou⁴, Leqin Wu⁵, Li Jin^1,4.

Abstract

Next-generation sequencing technology has provided a great opportunity for inferring human demographic history by investigating changes in the effective population size (Ne). In this report, we introduce a strategy for estimating Ne dynamics, allowing the exploration of large multi-locus SNP datasets. We applied this strategy to the Phase 1 Han Chinese samples from the 1000 Genomes Project. The Han Chinese population has undergone a continuous expansion since 25,000 years ago, at first slowly from about 7,300 to 9,800 (at the end of the last glacial maximum about 15,000 YBP), then more quickly to about 46,000 (at the beginning of the Neolithic about 8,000 YBP), and then even more quickly to reach a population size of about 140,000 (recently).

Entities: Chemical Disease Species

Mesh：

Year: 2016 PMID： 26832887 PMCID： PMC4735516 DOI： 10.1038/srep20079

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.379

The dynamics of human population size provide important information for understanding the processes underlying human evolutionary history. Important events in the course of human evolution, such as the development of technological innovations and climatic changes, often led to changes in human population size, and in turn have left footprints on extant genetic polymorphism12. The next-generation whole-genome sequencing technology has provided a great opportunity for interrogating the human population demography34 through investigations of changes in effective population size (N). A number of methods for estimating N from genetic data have been developed since Kingman introduced the coalescent theory5, and recent developments have made it possible to study changes in N using large-scale sequencing data34678. Gronau et al.3 and Heled et al.6 estimate inter-population variation in N, which provides no information on the dynamics within a population. These two methods and as well as Li et al.4’s method also have limitations in estimating the dynamics of N when dealing with whole-genome sequencing data of medium to large numbers of sampled individuals, due to either the complexity of the algorithm itself4 or the substantial amount of computational calculations required36 which is very difficult to handle using current computational resources. On the other hand, inferring the dynamics of N from the site frequency spectrum78 requires the assumption of independence of sites, which wastes the linkage information in the data. East Asia, being the crossroads of human migrations and human activities, is one of the most important regions for studying both the evolution and the genetic diversity of human populations9. In particular, the details of the population demographic history in this region since the Last Glacial Maximum (LGM) have scarcely been investigated. Furthermore, present research is still confined to mtDNA, Y-chromosome, or only a few autosomal loci with conflicting results10111213, especially with regard to the start time and the extent of expansions. In this report, we develop a strategy for estimating the changes in N, allowing the exploration of whole-genome sequences, we call ENUMS (Estimation of N Using Multiple Segments). This strategy includes three steps: (1) identification of haplotype blocks (hereafter referred to as blocks), following the definition of Wang et al.14, (2) estimation of N dynamics through time for each block (hereafter referred to as block N dynamics) using the Bayesian Skyline Plot (BSP) method15, and (3) estimation of population N changes over time by taking the weighted average value of N of all blocks at each time point, which minimizes the Euclidean distance to all block N dynamics (see Methods for details). Based on the coalescence theory and by employing the standard Markov Chain Monte Carlo (MCMC) procedure, the Bayesian Skyline Plot (BSP) method15 can co-estimate the evolutionary rate, substitution model parameters, phylogeny and ancestral population dynamics within a single analysis directly by sampling DNA sequences. In order to reduce the noise associated with short coalescent intervals, the method allows multiple coalescent intervals to be grouped, assuming the population sizes in successive coalescent intervals are correlated. The population sizes in these grouped intervals are allowed to change linearly or remain constant. The resulting estimation of N dynamics over time is gained from the posterior sampling of the MCMC procedure, including credibility intervals that represent both phylogenetic and coalescent uncertainty. By applying the BSP method directly to each block and then integrating the information from all blocks, our ENUMS strategy not only has inherited the advantages of the BSP method, but is also able to circumvent the limitations of dealing with large samples of whole-genome sequences. We applied this strategy to the samples of Han Chinese autosomal genomes in Phase 1 taken from the 1000 Genomes Project (1KGP)16 to investigate how N changed in three periods separated by two important events (i.e., the end of the LGM and introduction of agriculture). The three periods are: (1) the LGM period (25,000 YBP –15,000 YBP); (2) from the end of the LGM period to the end of the Paleolithic era (15,000 YBP – 8,000 YBP) and (3) from the beginning of the Neolithic era to recent (8,000 YBP – recent).

Results

Genome-wide autosomal SNP (Single Nucleotide Polymorphism) sites of the 197 Han Chinese individuals (CHB & CHS) from the low coverage data set (the Phase 1 data) in the 1KPG16 were used in this study. To circumvent the unspecified influence of crossovers on the BSP method15 which was used as part of the estimation algorithm in the following analysis, we partitioned the sequence data of each chromosome into blocks free of traces of crossover events by employing the four-gamete test (FGT) algorithm14 on all of the individuals under study. Given the knowledge of poor sequencing quality in the 1KGP data set, we selected 844 haplotype blocks of higher quality, 5,516,675 bp in length totally, from all the autosomal blocks (see Supplementary Results & Supplementary Table 1 for details). 332 of these blocks overlapped with at least one gene while 512 did not. Although this strategy may cause biased results since only part of each chromosome were used, it uses at least some of the linkage information by assuming no recombination within each block for inference of N (also see Methods & Discussion for details). The BSP method15 was applied to individual blocks to estimate block N dynamics. Blocks with the age of the most recent common ancestor (MRCA) younger than 25,000 years were removed, leaving 801 blocks (hereafter referred to as Best Set), 315 of which overlap with at least one gene. To further simplify the description of the dynamics of each block N, we denoted a vector of 26 elements corresponding to N values that were taken at a series of time points from present to 25,000 YBP with an interval of 1,000 years (also see Methods & Supplementary Figure 2). To characterize the dynamics of the N of all blocks studied, we estimated the population N by calculating the weighted average values over the 801 blocks for every element of the aforementioned vectors (see Methods for details). The result revealed a continuous expansion from 25,000 YBP to recent, resulting in an approximately 18-fold increase in size. The initial N was ~7,300, while the recent N is ~140,000 (Fig. 1). We investigated the change of N for three time periods: (1) from 25,000 YBP to the end of the LGM (15,000 YBP), (2) from 15,000 YBP to the beginning of Neolithic (8,000 YBP), and (3) from 8,000 YBP to recent. The population increased by only 33% throughout the LGM period, with slight fluctuations. It reached ~46,000 by the end of the Paleolithic era (8,000 YBP), and then expanded to ~140,000 in size from 8,000 YBP to recent (i.e., since the invention of the agriculture1718). When taking every block as equally weighted, a highly similar pattern of estimated population N dynamics is obtained (see Supplementary Figure 3 for details).

Figure 1

Changes of N in the Han Chinese population since 25,000 YBP.

Blue triangles, brown circles and purple squares show the optimal estimation of N in Stages 1, 2 and 3 respectively.

Among all of the 25 blocks that might have undergone recent positive selection detected by the modified CMS1920 test (see Supplementary Methods & Supplementary Figure 4 for details), seven of them overlapped with at least one gene (see Supplementary Table 2 for details) while two of them have been reported in East Asian populations (see Supplementary Table 3 for details). After removing all these 25 blocks from the Best Set, the remaining blocks show highly similar trends of population N dynamics (also see Fig. 2). We also partitioned the Best Set into two subsets according to whether the whole or part of the segment overlapped with any of the genes. Both subsets yielded highly similar N trends (Fig. 3).

Figure 2

Impact of recent positive selection on the trend of estimated population N dynamics.

Light blue squares represent the trend of population N estimated by all blocks in the Best Set. Brown stars represent the trend of population N estimated by the blocks in the Best Set excluding the blocks might have undergone recent positive selection.

Figure 3

The trend of estimated population N after partitioning the Best Set into two subsets, according to whether the whole or part of the block overlaps with any of the currently known genes.

Blue triangles represent the blocks that do not overlap with any genes while brown stars represent the blocks that overlap or partially overlap with one or more genes. Light squares represent the trend of population N estimated by all blocks in the Best Set.

Although our interest was in the Han Chinese N dynamics from present to 25,000 YBP, we also investigated the trend of N changes from 25,000 YBP to 300,000 YBP (see Supplementary Figure 5 for details). The results suggest that N increased very slowly from 50,000 YBP to 25,000 YBP (less than 3-fold) while remaining nearly the same from 300,000 YBP to 50,000 YBP.

Discussion

In this report, we developed a strategy for estimating the changes in N from the recent to the past, allowing for the exploration of large multi-locus SNP datasets, and even whole-genome sequences. The results of the application of this new strategy on the recently released Phase 1 Han Chinese samples from the 1KGP16 suggest the following: a slight population expansion in East Asia during the LGM; an increase of N continuing during the post-LGM period and a population expansion escalation with the agriculture in the recent millennia. The two sampling populations (CHB and CHS) used by this study, although collected from the extant Chinese population, constitute an effective representation of East Asia102122. Xu et al.21 showed that both CHB and CHS are highly admixed. Furthermore, Zheng et al.10 showed that CHB and CHS have experienced similar population size changes since 25,000 YBP based on the analyses of mitochondrial genomes. Therefore, the current analyses based on CHB and CHS capture the overall picture of the demographic dynamics of human populations in East Asia. To explore how the differentiation between CHB and CHS would influence the estimation of the population N dynamics, we calculated the fixation index (F)2324 value of each block in the Best Set and the range of the resulted F values is 0 to 4.5%. We then divided the Best Set into two sub-sets according to the whole-genome average F value between CHB and CHS (0.12%)16. 289 blocks are with an F value more the 0.12% while 512 are not. Both sub-sets reveal highly similar patterns of population N dynamics (Fig. 4), which indicates that the influence of the population structure is very slight.

Figure 4

The trend of estimated population N after partitioning the Best Set into two subsets by the whole-genome average F value between CHB and CHS (0.12%).

Purple stars represent the blocks with F values not more than 0.12% (512 blocks in all) while green crosses represent the blocks with F values more than 0.12% (289 blocks in all). Light circles represent the trend of population N estimated by all blocks in the Best Set.

Constructing coalescent trees with recombination over a whole chromosome is an NP-hard problem when the sample size reaches tens of hundreds, therefore, it is reasonable to choose segments free of traces of crossover events in order to accomplish the estimation. However, even the FGT method, the most sensitive one for detecting currently known crossover events142526, may fail to find all crossover events (also see Supplementary Results). Thus, how residual crossover events, although very rare (also see Supplementary Results), would influence our results is worth further investigation. We excluded any blocks shorter than 5,000 bp mainly because genealogical information may be insufficient in extremely short blocks (data not shown), but also to avoid a substantial amount of computational calculation. Generally, a 6,000 bp block consumes nearly 156 hours using one core of an XEON E5650 CPU and 10GB for hard disk of a 600,000,000-step MCMC iteration calculation and 20 GB of memory for dealing with the results generated by the MCMC iteration calculation (also see Methods for details). As a result, ~132,000 core hours of CPU and ~6TB of hard disk were consumed for all the segments in the Best Set for just one set of parameters. Meanwhile, there has not been any evidence supporting that inference is biased without short blocks15. The properties of the BSP method were well studied using simulated data under different patterns of demography1527, supporting the applicability of the method. Since the estimation of population expansion based on the ENUMS strategy is solely dependent on the BSP method (also see Methods), these simulation studies could reflect the properties and accuracy of ENUMS. Furthermore, the information on the uncertainty of the estimation of N, as provided by the BSP, has been taken into consideration during the third step in ENUMS. The times estimated by the BSP method15 are in the unit of mutations per site. To rescale them into the unit of years, we need to know the mutation rate per site per year. There are two approaches to estimate the mutation rate28 and we considered the one calculating pairwise substitution rates between closely related species as more suitable for our study (see Supplementary Discussion for details). So, we set the mutation rate as 2.5 × 10−8 per site per generation. However, if we choose the one counting mutation rates that occur between generations in present-day individuals, assuming the mutation rate to be 1.25 × 10−8 per site per generation, the results still support that the Han Chinese population has experienced a continuous expansion since 25,000 years ago (also see Supplementary Figure 6). Therefore, our results revealed autosomal expansion at least 12,000 years earlier than mtDNA expansion10 and 19,000 years earlier than Y-chromosomal expansion13 (see Supplementary Results for details). The time interval by which N values are selected for each block (also see Methods) might also have an impact on the trend of estimated population N. Nonetheless, nearly the same results could be obtained after transforming the time interval from 1000 years to 500 years (Fig. 5). In addition, the quality of the dataset used in this study was not quite adequate due to the low-coverage sequencing strategy used, and thereby there may be a bias in the estimation of coalescent times as well as effective population sizes. The SNPs with ≤1% frequency were insufficient due to the poor sequencing quality16, especially with the inference of recent population dynamics of N, since the signature of a recent population expansion will mainly be found in the singletons. A better estimation would be expected when high-quality data is available.

Figure 5

Impact of time intervals by which N values are selected for each block on the trend of estimated population N dynamics.

Blue stars represent the trend of population N with a time interval of 1,000 years. Brown squares represent the trend of population N with a time interval of 500 years.

How recent positive selection might influence the previous estimates was investigated in two ways. Firstly, we partitioned the Best Set into two subsets according to whether the whole or part of the segment overlapped with any of the genes. Both subsets yielded highly similar N trends (Fig. 3), which also demonstrates that the amount of blocks does not influence the estimation of the population N when it is not too small. Secondly, we detected recent positive selection in each block in the Best Set by the modified CMS1920 test (see Supplementary Methods) and 25 blocks appear to be subject to recent positive selection in total. Nearly the same trend of population N dynamics has been obtained after removing these 25 blocks from the Best Set (Fig. 2). Both the above observations suggest that recent positive selection has very limited effects on the N estimation in this study.

Methods

Materials

Genome-wide autosomal sequencing data of the 197 Han Chinese individuals (CHB & CHS) obtained from the low coverage data set in the 1KGP16 (ftp://ftp.1000genomes.ebi.ac.uk:/vol1/ftp/phase1/analysis_results/shapeit2_phased_haplotypes) were used. The OMNI16 dataset was used as a reference to assess the sequencing quality of each individual genome (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20111117_omni_genotypes_and_intensities/Omni25_genotypes_2123_samples.b37.vcf.gz).

Framework of ENUMS

We have developed a strategy, denoted as ENUMS, for estimating the changes of N from the recent to the past within a population, allowing the exploration of large multi-locus SNP datasets, including whole-genome sequences. It functions by identifying haplotype blocks, estimating block N dynamics through time of each block and estimating population N changes over time by integrating information from all block N dynamics. Based on the BSP method, this strategy circumvents the limitations of large sample sizes and minimizes the Euclidean distance while integrating information from all block N dynamics.

Estimation of block N dynamics

Generally, the dynamics of each block N are estimated by the BSP method15 in the program BEAST (version 1.7.0)29 (also see Supplementary Methods for details). No recombination is assumed within each block (also see Discussion & Supplementary Results for details). Isolation and random mating are assumed in the Han Chinese population. Each MCMC sample1530 used in the BSP method15 is based on a run of 600,000,000 generations, sampled every 2,000 generations, with the first 50,000,000 generations discarded as burn-in. Specifically, the steps we took in the MCMC process were nearly 10 times longer than the ones used in other studies1031. In order to plot the dynamics of block N with respect to time, a strict clock and a neutral mutation rate of 2.5 × 10−8 per generation per site432 and 25 years per generation are also assumed. For the convenience of analyzing all blocks together, N values were selected at a series of time points from present to 25,000 YBP with an interval of 1,000 years to describe the dynamics of each block N, which can be denoted by a vector , where j is an integer denoting the index of the block and T is a vector denoting the index of the time points selected, i.e., T = (0, 1000, …, 25000)′.

Estimation of population N dynamics using all blocks under survey

Because estimation of N changes from any block in the Best Set may be biased by the influence of many factors such as genetic drift, positive selection, we take the vector that can minimize function (1) as the estimation of the population N changes, i.e., the vector describing the population N dynamics is the one whose sum of squared Euclidean distance to all block N vectors in the Best Set is minimal (see function (2)). where M is the total number of blocks, and is the vector indicating the weight of at each time point. Each element of is defined as the inverse of the approximation of the standard deviation of the estimated values of N at the corresponding time point in the main result of this study. Due to the convexity of this problem, is the optimal solution of (1) if and only if Thus the solution of (2) is i.e., the estimation of the population N.

Additional Information

How to cite this article: Hou, Z. et al. Inferring the Dynamics of Effective Population Size Using Autosomal Genomes. Sci. Rep. 6, 20079; doi: 10.1038/srep20079 (2016).

28 in total

1. Estimate of the mutation rate per nucleotide in humans.

Authors: M W Nachman; S L Crowell
Journal: Genetics Date: 2000-09 Impact factor: 4.562

2. Haploview: analysis and visualization of LD and haplotype maps.

Authors: J C Barrett; B Fry; J Maller; M J Daly
Journal: Bioinformatics Date: 2004-08-05 Impact factor: 6.937

3. Bayesian coalescent inference of past population dynamics from molecular sequences.

Authors: A J Drummond; A Rambaut; B Shapiro; O G Pybus
Journal: Mol Biol Evol Date: 2005-02-09 Impact factor: 16.240

4. Genetic studies of human diversity in East Asia.

Authors: Feng Zhang; Bing Su; Ya-ping Zhang; Li Jin
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2007-06-29 Impact factor: 6.237

5. Identifying recent adaptations in large-scale genomic data.

Authors: Sharon R Grossman; Kristian G Andersen; Ilya Shlyakhter; Shervin Tabrizi; Sarah Winnicki; Angela Yen; Daniel J Park; Dustin Griesemer; Elinor K Karlsson; Sunny H Wong; Moran Cabili; Richard A Adegbola; Rameshwar N K Bamezai; Adrian V S Hill; Fredrik O Vannberg; John L Rinn; Eric S Lander; Stephen F Schaffner; Pardis C Sabeti
Journal: Cell Date: 2013-02-14 Impact factor: 41.582

6. Male demography in East Asia: a north-south contrast in human population expansion times.

Authors: Yali Xue; Tatiana Zerjal; Weidong Bao; Suling Zhu; Qunfang Shu; Jiujin Xu; Ruofu Du; Songbin Fu; Pu Li; Matthew E Hurles; Huanming Yang; Chris Tyler-Smith
Journal: Genetics Date: 2006-02-19 Impact factor: 4.562

7. mtDNA variation predicts population size in humans and reveals a major Southern Asian chapter in human prehistory.

Authors: Quentin D Atkinson; Russell D Gray; Alexei J Drummond
Journal: Mol Biol Evol Date: 2007-12-18 Impact factor: 16.240

8. Robust demographic inference from genomic and SNP data.

Authors: Laurent Excoffier; Isabelle Dupanloup; Emilia Huerta-Sánchez; Vitor C Sousa; Matthieu Foll
Journal: PLoS Genet Date: 2013-10-24 Impact factor: 5.917

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. Y chromosomes of 40% Chinese descend from three Neolithic super-grandfathers.

Authors: Shi Yan; Chuan-Chao Wang; Hong-Xiang Zheng; Wei Wang; Zhen-Dong Qin; Lan-Hai Wei; Yi Wang; Xue-Dong Pan; Wen-Qing Fu; Yun-Gang He; Li-Jun Xiong; Wen-Fei Jin; Shi-Lin Li; Yu An; Hui Li; Li Jin
Journal: PLoS One Date: 2014-08-29 Impact factor: 3.240

1 in total

1. Polymorphisms of the CYR61 gene in patients with acute myeloid leukemia in a Han Chinese population.

Authors: Chang-Chun Niu; Ya-Fang Wan; Cheng Yang; Tian Li; Pu Liao
Journal: Medicine (Baltimore) Date: 2018-08 Impact factor: 1.817

1 in total