Andreas Wollstein, Oscar Lao.
Abstract
Detecting and quantifying the population substructure present in a sample of individuals is of major interest in genetic epidemiology, population genetics, and forensics, among other fields. To date, several algorithms have been proposed for estimating the amount of genetic ancestry within an individual. In the present review, we introduce the methods most widely used in population genetics for detecting individual genetic ancestry. We further show, by means of simulations, how popular algorithms for detecting individual ancestry perform in various controlled demographic scenarios. Finally, we provide some hints on how to interpret the results from these algorithms.
Keywords: ADMIXTURE; Global ancestry; Human genetic variability; Individual ancestry; MDS; PCA; Population substructure; SNPs; fastSTRUCTURE; sNMF
Year: 2015 PMID: 25937887 PMCID: PMC4416275 DOI: 10.1186/s13323-015-0019-x
Source DB: PubMed Journal: Investig Genet ISSN: 2041-2223
Algorithms commonly applied to SNP data for quantifying individual population substructure in humans
| Approach | Method | Software |
|---|---|---|
| Model-free | Principal component analysis | EIGENSOFTᵃ |
| | Principal components and Moran’s I | adegenet (R software) |
| | Multidimensional scaling | PLINKᵃ |
| | Principal coordinates | PCO-MC |
| | Spectral graph theory | GemTools |
| | Spectral graph theory | SpectralGem |
| | Laplacian eigenfunction | LAPSTRUCT |
| | Genetic algorithm coupled to AMOVA | GAGA |
| Model-based | Log-likelihood HWE | ADMIXTURE |
| | Log-likelihood HWE | FRAPPE |
| | Bayesian HWE | STRUCTURE |
| | Bayesian HWE | fastSTRUCTURE |
| | Nonnegative matrix factorization | sNMF |
| | Bayesian | BAPS |
| | Chromopainting and Bayesian classifier | fineSTRUCTURE |
| | Log-likelihood genotypic/haplotypic gradients | LOCO-LD |
| | Log-likelihood allelic gradients | SPA |
| | ADMIXTURE and linear regression | GPS |
| | Bayesian clustering with spatial information | TESS |

ᵃ We provide one of the possible implementations present in the literature.
Figure 1. Basic admixture models commonly used in population genetics. Each rectangle represents a population. Both models consider one initial ancestral population (gray) that splits into two new populations t_split generations ago. Each of the new populations evolves without exchanging migrants for a period of time, during which genetic differentiation between them can take place, as indicated by the different colors. (A) Continuous gene flow (CGF) model. From time point t_split onwards, the blue population contributes 4Nm migrant chromosomes to the red population, replacing the same number of chromosomes in that population. (B) Hybrid (HI) model. At t_admixture, there is a single admixture event, and a new hybrid population is created from a fraction m of migrant chromosomes from the blue population and a fraction 1 - m from the red population. After this event, each population continues to evolve independently. Adapted from [20].
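The single admixture event of the HI model can be illustrated with a short sketch. It assumes that each founder chromosome of the hybrid population is drawn from the blue population with probability m and from the red population otherwise, so that the expected hybrid allele frequency is the m-weighted average of the parental frequencies. The function names and the list-of-alleles representation are hypothetical and chosen only for illustration.

```python
import random


def hybrid_allele_frequency(p_blue, p_red, m):
    """Expected allele frequency in the hybrid population right after admixture.

    Assumes a fraction m of chromosomes from blue and 1 - m from red.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1]")
    return m * p_blue + (1.0 - m) * p_red


def sample_hybrid_founders(blue, red, m, n, rng):
    """Draw n founder alleles: from blue with probability m, otherwise from red."""
    return [rng.choice(blue) if rng.random() < m else rng.choice(red)
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    blue = [1] * 80 + [0] * 20   # allele frequency 0.8 in the blue population
    red = [1] * 20 + [0] * 80    # allele frequency 0.2 in the red population
    founders = sample_hybrid_founders(blue, red, m=0.25, n=10_000, rng=rng)
    print("expected:", hybrid_allele_frequency(0.8, 0.2, 0.25))
    print("observed:", sum(founders) / len(founders))
```

The observed frequency fluctuates around the expected value because of sampling. Drift after the admixture event is not modeled here.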
Default parameters used in the two-population models, with and without migration
| Parameter | Symbol | Default value |
|---|---|---|
| Sample size population 1 | n1 | 100 |
| Sample size population 2 | n2 | 100 |
| Number of independent SNPs | nsnps | 5,000 |
| Mutation rate (length)ᵃ | theta | 2 |
| Effective population sizeᵇ | N1, N2 | 10,000 |
| Divergence time | T1 | 2,000 |
| Constant migration rate | 4Nm | 0 |

ᵃ The scaled mutation rate theta = 2*Ne*mu = 2 describes a region of about 2 kb assuming a mutation rate of 2.5 × 10^-8.
ᵇ The effective population size corresponds broadly to that of Africa.
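The scaling behind these defaults can be checked with a few lines of arithmetic. The sketch assumes the convention documented for ms, in which theta = 4*N0*mu with N0 the diploid population size and mu the mutation rate of the whole locus, and in which time is measured in units of 4*N0 generations. The footnote's factor of 2 is consistent with this if Ne is counted in chromosomes rather than diploid individuals, which is an assumption here. The function names are illustrative only.

```python
def scaled_mutation_rate(n_diploid, mu_per_bp, length_bp):
    """theta = 4 * N0 * mu, where mu is the mutation rate of the whole locus."""
    return 4 * n_diploid * mu_per_bp * length_bp


def scaled_time(generations, n_diploid):
    """Convert generations into units of 4 * N0 generations."""
    return generations / (4 * n_diploid)


# Default values from the table: N = 10,000, 2 kb region, mu = 2.5e-8 per base.
theta = scaled_mutation_rate(10_000, 2.5e-8, 2_000)
# Default divergence time of 2,000 generations in scaled units.
t_split = scaled_time(2_000, 10_000)

print("theta:", round(theta, 6))
print("scaled divergence time:", t_split)
```

Under these assumptions the defaults give theta = 2 and a scaled divergence time of 0.05.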
Results from the two-population model simulations
| Parameter | | | |
|---|---|---|---|
| Sampling depth, n1, n2 | |||
| 8 | 99.92 | 100 | 39.56 |
| 10 | 99.83 | 100 | 34.03 |
| 20 | 99.87 | 100 | 100 |
| 40 | 99.81 | 100 | 100 |
| 100 | 99.74 | 100 | 100 |
| Uneven sampling, n1 | |||
| 8 | 98.94 | 99.45 | 98.59 |
| 10 | 99.43 | 99.78 | 99.32 |
| 20 | 99.61 | 100 | 92.21 |
| 40 | 99.67 | 100 | 100 |
| 100 | 99.74 | 100 | 100 |
| Sequencing depth, nsnps | |||
| 10 | 3.13 | 0.65 | 18.51 |
| 50 | 66.56 | 75.54 | 74.42 |
| 100 | 85.33 | 92.95 | 91.89 |
| 500 | 96.78 | 99.87 | 99.93 |
| 1,000 | 98.62 | 99.99 | 100 |
| 5,000 | 99.74 | 100 | 100 |
| Population size, theta | |||
| 1 | 99.73 | 100 | 100 |
| 2 | 99.74 | 100 | 100 |
| 5 | 99.74 | 100 | 100 |
| 10 | 99.72 | 100 | 100 |
| Effective population size, N2 | |||
| 100 | 99.98 | 100 | 100 |
| 2,500 | 99.94 | 100 | 100 |
| 7,500 | 99.82 | 100 | 100 |
| 10,000 | 99.74 | 100 | 100 |
| Divergence time | |||
| 0.000075 | 0.54 | 0.38 | 0.01 |
| 0.00025 | 0.24 | 0.03 | 0 |
| 0.00125 | 6.19 | 0.03 | 0.24 |
| 0.0025 | 69.36 | 95.28 | 0.53 |
| 0.0125 | 98.36 | 100 | 100 |
| 0.05 | 99.74 | 100 | 100 |
| Constant migration rate, 4Nm | |||
| 0.1 | 99.77 | 100 | 100 |
| 1 | 99.78 | 100 | 100 |
| 5 | 99.56 | 100 | 100 |
| 10 | 99.15 | 99.99 | 100 |
| 50 | 93.95 | 99.98 | 33.3 |
| 100 | 41.61 | 94.06 | 0.56 |
Using ms [75], we simulated two populations that split t generations ago and evolved independently thereafter. See Table 1 for default parameters. Each simulation comprises 1,000 independent regions of 2 kb, from which one SNP per region is sampled at random. Each parameter set was replicated ten times. For each algorithm, the estimated ancestry proportions from the different runs were aligned with CLUMPP [44] to the expected ancestry matrix, which denotes the true population labels. Starting from the default values, the demographic parameters were then varied one at a time to illustrate their impact on the estimates. We report the coefficient of determination, expressed as a percentage, which can be understood as the share of the true outcome that the estimates recover.
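How such a percentage might be computed is sketched below. The legend does not state which definition of the coefficient of determination was used, so the sketch assumes the squared Pearson correlation between the true and estimated ancestry proportions. It also assumes that cluster labels have already been aligned (the step performed with CLUMPP above). The function name is hypothetical.

```python
from statistics import mean


def r_squared_percent(true_q, est_q):
    """Squared Pearson correlation between true and estimated ancestry, in percent.

    Assumes both lists refer to the same individuals and the same cluster.
    """
    if len(true_q) != len(est_q) or len(true_q) < 2:
        raise ValueError("need two equally long lists with at least two values")
    mean_true, mean_est = mean(true_q), mean(est_q)
    sxy = sum((t - mean_true) * (e - mean_est) for t, e in zip(true_q, est_q))
    sxx = sum((t - mean_true) ** 2 for t in true_q)
    syy = sum((e - mean_est) ** 2 for e in est_q)
    if sxx == 0 or syy == 0:
        # No variation in one of the vectors: the correlation is undefined.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return 100.0 * sxy * sxy / (sxx * syy)


if __name__ == "__main__":
    true_q = [1.0, 1.0, 0.0, 0.0]   # two individuals from each population
    est_q = [0.9, 0.8, 0.1, 0.2]    # ancestry estimated by some algorithm
    print(round(r_squared_percent(true_q, est_q), 2))
```

An algorithm that assigns every individual the same ancestry proportion yields 0 under this convention, which is one way values close to zero can arise when the populations cannot be separated.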
Results from admixture simulations under the HI model, based on HapMap III data, with varying parameters
| Parameter | | | |
|---|---|---|---|
| Sample size | |||
| 8 | 98.2 | 99 | 86.66 |
| 10 | 99.5 | 99.52 | 98.69 |
| 20 | 99.74 | 99.82 | 99.71 |
| 40 | 99.85 | 99.9 | 99.86 |
| 50 | 99.87 | 99.93 | 99.9 |
| 100 | 99.91 | 99.95 | 99.95 |
| nsnps | |||
| 5 | 4.56 | 15.38 | 19.44 |
| 10 | 15.92 | 47.37 | 46.2 |
| 50 | 80.62 | 86.31 | 86.89 |
| 100 | 89.67 | 93.04 | 93.33 |
| 500 | 98.46 | 99.07 | 99.11 |
| 1,000 | 99.19 | 99.54 | 99.56 |
| 5,000 | 99.84 | 99.92 | 99.91 |
| 10,000 | 99.91 | 99.95 | 99.95 |
| Nbreaks | |||
| 5 | 88.82 | 88.37 | 87.46 |
| 10 | 94.38 | 94.86 | 94.43 |
| 50 | 98.74 | 98.87 | 98.8 |
| 100 | 99.33 | 99.41 | 99.38 |
| 500 | 99.81 | 99.85 | 99.84 |
| 1,000 | 99.86 | 99.91 | 99.9 |
| 5,000 | 99.91 | 99.94 | 99.94 |
| 10,000 | 99.91 | 99.95 | 99.95 |
| alpha | |||
| 0.01 | 99.94 | 99.99 | 99.99 |
| 0.03 | 99.93 | 99.97 | 99.95 |
| 0.07 | 99.93 | 99.97 | 99.92 |
| 0.1 | 99.93 | 99.97 | 99.91 |
| 0.3 | 99.91 | 99.96 | 99.95 |
| 0.5 | 99.91 | 99.95 | 99.95 |
The admixed population was generated from the African (YRI) and European (CEU) populations from HapMap III. A sample from an admixed population is known to consist of a mosaic of chromosomal regions, or blocks, from the ancestral populations. With increasing time since the admixture event, recombination breaks these regions up into smaller pieces; this is captured by the number of break points (Nbreaks). Each individual of the synthetically admixed population was assembled by randomly sampling blocks from the two source populations according to the defined admixture proportions (alpha). Finally, a subsample (nsnps) of uniformly distributed sites was chosen. Sites were chosen to be more than 1 Mb apart to ensure linkage equilibrium.
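The block-sampling idea can be sketched as follows. This is a simplified illustration, not the procedure used to produce the table: it assumes one haplotype per source population, break points placed uniformly at random between sites, and alpha interpreted as the probability that a block is copied from the first source. All names are hypothetical.

```python
import random


def mosaic_chromosome(hap_a, hap_b, n_breaks, alpha, rng):
    """Build one admixed haplotype as a mosaic of blocks from two sources.

    n_breaks break points split the sites into n_breaks + 1 blocks. Each block
    is copied from hap_a with probability alpha, otherwise from hap_b.
    Returns the mosaic and the per-site source labels ("A" or "B").
    """
    n_sites = len(hap_a)
    if len(hap_b) != n_sites:
        raise ValueError("source haplotypes must have the same length")
    if not 0 <= n_breaks < n_sites:
        raise ValueError("n_breaks must be between 0 and n_sites - 1")
    # Break points lie between sites, so valid positions are 1 .. n_sites - 1.
    breaks = sorted(rng.sample(range(1, n_sites), n_breaks))
    bounds = [0] + breaks + [n_sites]
    mosaic, labels = [], []
    for start, end in zip(bounds, bounds[1:]):
        from_a = rng.random() < alpha
        source = hap_a if from_a else hap_b
        mosaic.extend(source[start:end])
        labels.extend(["A" if from_a else "B"] * (end - start))
    return mosaic, labels


if __name__ == "__main__":
    rng = random.Random(7)
    hap_a = [1] * 1000   # toy haplotype standing in for the first source
    hap_b = [0] * 1000   # toy haplotype standing in for the second source
    mosaic, labels = mosaic_chromosome(hap_a, hap_b, 50, 0.3, rng)
    print("true ancestry from A:", labels.count("A") / len(labels))
```

With few break points the realized ancestry of a single chromosome can deviate strongly from alpha, because only a handful of blocks are drawn.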
Figure 2. Estimated proportions of ancestry from the continuous gene flow (CGF) model (see main text). See Table 2 for default parameters. (A) Results for varying divergence time while keeping the migration rate constant at 4Nm = 50. (B) Results for varying migration rate while keeping the divergence time constant at T = 10. Error bars denote the standard deviation of the estimated ancestry proportion per population. Simulations were produced using the following ms command [75]: ms 200 5000 -t 2 -I 2 100 100 -em 1 2 2000 -n 2 1 -ej 2 1.
Figure 3. Migration and inbreeding using the two-population model (see legend of Figure 2 for the ms command). Inbreeding was simulated by reducing the number of heterozygous genotypes in proportion to the given F value (see main text for details).
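One simple way to reduce heterozygosity in proportion to F is sketched below. The exact procedure is described in the main text and is not reproduced here, so this is an assumption: each heterozygous genotype is replaced by one of the two homozygotes with probability F, which lowers the expected heterozygosity by a factor of 1 - F while leaving the expected allele frequency unchanged. Genotypes are coded as 0, 1, or 2 copies of an allele, and the function name is hypothetical.

```python
import random


def apply_inbreeding(genotypes, f, rng):
    """Replace each heterozygote (1) by a homozygote (0 or 2) with probability f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    result = []
    for g in genotypes:
        if g == 1 and rng.random() < f:
            # Pick either homozygote with equal probability so that the
            # expected allele count of this genotype stays at 1.
            result.append(rng.choice((0, 2)))
        else:
            result.append(g)
    return result


if __name__ == "__main__":
    rng = random.Random(3)
    genotypes = [0] * 250 + [1] * 500 + [2] * 250   # Hardy-Weinberg at p = 0.5
    inbred = apply_inbreeding(genotypes, f=0.2, rng=rng)
    print("heterozygotes before:", genotypes.count(1))
    print("heterozygotes after:", inbred.count(1))
```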
Figure 4. Net run times for fastSTRUCTURE, sNMF, and ADMIXTURE. Mean time to termination of each program over ten independent replications. We simulated 100 chromosomes from two populations with an effective population size of 10,000 and Ne*m = 20 using ms [75] (see legend of Figure 2 for command details). Given the complexity of the programs, the run time can be expected to scale with the number of SNPs used.
Figure 5. Error in the estimated individual admixture proportions from the simulated admixed population (HI model). We used an extended version of the backward demographic simulator described in [76] that includes recombination and different types of mating and allows for ancestry painting [14]. Of all the parameters defined in this model [19], we varied the time of split of the ancestral populations, which ranged between 50 and 2,000 generations among simulations. Each simulation generated 75 (25 per population) full human genomes with 22 diploid chromosomes (l) of the following sizes: 13.65, 13.15, 11.20, 10.65, 10.20, 9.65, 9.35, 8.50, 8.40, 8.95, 7.95, 8.65, 6.35, 5.80, 6.30, 6.75, 6.50, 5.95, 5.40, 5.40, 3.10, and 3.65 Mb [77]. The mutation rate was set to 2.5 × 10^-8 [78] and the recombination rate to 1.8 × 10^-8. PLINK was used to exclude SNPs with a minor allele frequency below 0.05 and SNPs in LD (default PLINK --indep 50 5 2). The effective population sizes of the parental and hybrid populations were set to 5,000 diploid individuals; the admixture event took place ten generations ago, and each parental population contributed equally to the admixed population. In this way, we minimized the putative effect of genetic drift on the admixture proportions of the hybrid population. Furthermore, to include the effect of biased sample sizes, we repeated all analyses with 1:1 (A) and 1:5 (B) parental population size ratios. Four different algorithms were considered: sNMF, ADMIXTURE, fastSTRUCTURE, and MDS. For MDS, the ancestry proportion of each individual from the admixed population was estimated as its relative position in the first dimension with respect to the mean estimated coordinates of the parental populations.
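The MDS-based estimate described at the end of the legend amounts to a linear interpolation between the two parental centroids on the first dimension. A minimal sketch follows, assuming the first-dimension coordinates are already available (for example from a PLINK MDS run). Clipping the result to the range 0 to 1 is an added assumption, and the function name is hypothetical.

```python
from statistics import mean


def mds_ancestry(admixed_coords, parent1_coords, parent2_coords):
    """Ancestry from parental population 1, from first-dimension MDS coordinates.

    Each admixed individual is placed on the line between the two parental
    centroids: 1.0 at the centroid of parent 1, 0.0 at the centroid of parent 2.
    """
    c1, c2 = mean(parent1_coords), mean(parent2_coords)
    if c1 == c2:
        raise ValueError("parental centroids coincide; ancestry is undefined")
    proportions = []
    for x in admixed_coords:
        q = (x - c2) / (c1 - c2)
        # Individuals beyond a centroid are clipped (assumption of this sketch).
        proportions.append(min(1.0, max(0.0, q)))
    return proportions


if __name__ == "__main__":
    parent1 = [-2.0, 0.0]          # centroid at -1.0
    parent2 = [0.0, 2.0]           # centroid at 1.0
    admixed = [0.0, -0.5, 1.5]
    print(mds_ancestry(admixed, parent1, parent2))
```

Because only the first dimension is used, this estimate is meaningful only when that dimension separates the two parental populations.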