Maria Kabisch, Ute Hamann, Justo Lorenzo Bermejo.
Abstract
BACKGROUND: Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques depends strongly on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and makes it possible to account explicitly for population demographic factors. To test the new method, study and reference haplotypes were simulated, and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation.
Keywords: Coalescent theory; Genotype imputation; Imputation accuracy; Linkage disequilibrium; Population growth; Population structure
Year: 2017 PMID: 29041903 PMCID: PMC5646149 DOI: 10.1186/s12864-017-4208-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Overview of conducted simulations and proposed algorithm for coalescent-based genotype imputation
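The template-selection step named in the abstract (reference haplotypes that first coalesce with a study haplotype serve as imputation templates) can be illustrated with a minimal Python sketch. It assumes that a gene tree has already been inferred and is available as a child-to-parent mapping. The function names, the toy tree, the use of `None` for unmeasured sites, and the majority-vote rule for several templates are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def build_children(parent):
    """Invert a child -> parent mapping into parent -> list of children."""
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    return children


def leaves_under(node, children):
    """Return the names of all leaves in the subtree rooted at `node`."""
    stack, leaves = [node], []
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(current)
    return leaves


def first_coalescing_references(study_leaf, parent, reference_names):
    """Walk from a study leaf towards the root and return the reference
    leaves below the first ancestor that contains any reference leaf."""
    children = build_children(parent)
    node = study_leaf
    while node in parent:
        node = parent[node]
        hits = sorted(leaf for leaf in leaves_under(node, children)
                      if leaf in reference_names)
        if hits:
            return hits
    return []


def impute_from_templates(study_hap, template_haps):
    """Fill unmeasured sites (None) with the most common template allele.
    Assumption: ties are broken towards the smaller allele code."""
    imputed = list(study_hap)
    for site, allele in enumerate(study_hap):
        if allele is None and template_haps:
            counts = Counter(hap[site] for hap in template_haps)
            best = max(counts.values())
            imputed[site] = min(a for a, c in counts.items() if c == best)
    return imputed


# Hypothetical toy tree: s1 coalesces first with r1, then r2, then r3.
PARENT = {"s1": "n1", "r1": "n1", "n1": "n2", "r2": "n2",
          "n2": "root", "r3": "root"}
REFERENCE = {"r1": [0, 1, 1, 0], "r2": [0, 0, 1, 1], "r3": [1, 0, 0, 0]}
STUDY = [0, None, 1, None]  # sites 1 and 3 were not measured

templates = first_coalescing_references("s1", PARENT, set(REFERENCE))
imputed = impute_from_templates(STUDY, [REFERENCE[t] for t in templates])
```

In this toy tree the study haplotype joins `r1` before any other reference haplotype, so only `r1` is used as a template. Measured alleles are never overwritten.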
Brief background information about the programs and methods used in the present study
| Program/Method | Background information | Application to the present study | Reference |
|---|---|---|---|
| BATWING | The program reads multi-locus haplotype data and uses a Markov chain Monte Carlo method based on coalescent theory to generate approximate random samples of the underlying gene genealogy. BATWING allows specification of the population growth and structure models with their corresponding prior distributions. | Estimation of gene genealogies underlying haplotypes under the basic coalescent and also considering population growth and structure | [ |
| Genetree | The program constructs gene trees describing the history of a sample of DNA sequences and calculates maximum likelihood estimates of the time to the most recent common ancestor and mutation, migration and growth rates, also in substructured populations. | Exclusion of incompatible sites by pairwise four-gamete tests | [ |
| IMPUTE2 | Computer program for phasing observed genotypes and imputing missing genotypes. Phasing and imputation are iterated alternately in a Markov chain Monte Carlo framework which accounts for phase uncertainty. | Used as gold standard for genotype imputation assuming no recombination and also considering regional recombination rates | [ |
| msms | Extension of Hudson’s coalescent simulator ms, which also allows selection to be studied. Since selection was not considered in the present study, our haplotypes were simulated using standard coalescent methods: genealogies were generated by tracing randomly sampled alleles backwards in time. | Haplotype simulation under the basic coalescent and also considering population growth and structure | [ |
| SHAPEIT2 | Fast and accurate method for phasing from genotype or sequencing data. | Phasing of real genotype data | [ |
| SumTrees | The program constructs a summary tree based on tree samples provided by the user. Supported methods for summary tree construction include the Maximum Clade Credibility Topology, and the majority-rule clade consensus. | Combination of gene genealogies estimated by BATWING into a majority-rule consensus tree | [ |
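
The pairwise four-gamete test listed for Genetree in the table above can be sketched in a few lines of Python. Without recombination and recurrent mutation, two biallelic sites cannot show all four allele combinations (00, 01, 10, 11) in a sample, so a pair showing all four is incompatible with a single gene tree. The sketch assumes haplotypes coded as lists of 0/1 alleles. The greedy rule for choosing which site to exclude is an assumption made for illustration, since the source does not state how sites were selected for exclusion.

```python
from collections import Counter
from itertools import combinations


def four_gamete_compatible(haplotypes, i, j):
    """True if sites i and j show fewer than four distinct allele pairs.
    Assumes biallelic sites coded 0/1."""
    gametes = {(hap[i], hap[j]) for hap in haplotypes}
    return len(gametes) < 4


def incompatible_pairs(haplotypes, sites=None):
    """List all site pairs that fail the four-gamete test."""
    if sites is None:
        sites = range(len(haplotypes[0]))
    return [(i, j) for i, j in combinations(sites, 2)
            if not four_gamete_compatible(haplotypes, i, j)]


def exclude_incompatible_sites(haplotypes):
    """Greedy exclusion (illustrative assumption): repeatedly drop the
    site involved in the most incompatible pairs until none remain."""
    kept = list(range(len(haplotypes[0])))
    while True:
        bad = incompatible_pairs(haplotypes, kept)
        if not bad:
            return kept
        counts = Counter(site for pair in bad for site in pair)
        worst = max(kept, key=lambda site: counts[site])
        kept.remove(worst)


# Hypothetical toy sample: site 1 conflicts with sites 0 and 2.
HAPLOTYPES = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
bad_pairs = incompatible_pairs(HAPLOTYPES)
kept_sites = exclude_incompatible_sites(HAPLOTYPES)
```

Here sites 0 and 2 are perfectly correlated and compatible with each other, while site 1 produces all four gametes with both of them, so dropping site 1 alone leaves a compatible set.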
Distribution of simulated variants according to allele frequency, and dependence of imputation accuracy on the sizes of study population and reference panel. Different total numbers of haplotypes were simulated under the basic coalescent (Nsim). Genotypes were imputed based on the basic coalescent and with IMPUTE2 without recombination. Mean genotype concordance rates with the corresponding 95% confidence intervals (CIs) are shown for all variants and stratified by minor allele frequency (MAF)
| Nsim | MAF | Variants: N (%) | Basic coalescent: Mean | Basic coalescent: (95% CI) | IMPUTE2: Mean | IMPUTE2: (95% CI) |
|---|---|---|---|---|---|---|
| 1000 | all | 96 (100.0) | 0.95 | (0.94,0.96) | 0.77 | (0.74,0.81) |
| | ≤0.01 | 30 (31.3) | 0.97 | (0.96,0.98) | 0.90 | (0.88,0.92) |
| | >0.01, ≤0.05 | 24 (25.0) | 0.94 | (0.93,0.95) | 0.85 | (0.83,0.87) |
| | >0.05 | 42 (43.8) | 0.93 | (0.92,0.95) | 0.62 | (0.58,0.66) |
| 800 | all | 88 (100.0) | 0.91 | (0.89,0.94) | 0.77 | (0.72,0.82) |
| | ≤0.01 | 27 (30.7) | 0.99 | (0.98,0.99) | 0.99 | (0.98,1.00) |
| | >0.01, ≤0.05 | 25 (28.4) | 0.95 | (0.93,0.97) | 0.91 | (0.89,0.93) |
| | >0.05 | 36 (40.9) | 0.83 | (0.78,0.89) | 0.52 | (0.47,0.58) |
| 600 | all | 90 (100.0) | 0.97 | (0.96,0.99) | 0.66 | (0.61,0.71) |
| | ≤0.01 | 15 (16.7) | 0.99 | (0.98,1.00) | 0.98 | (0.97,0.99) |
| | >0.01, ≤0.05 | 14 (15.6) | 0.93 | (0.91,0.95) | 0.92 | (0.89,0.94) |
| | >0.05 | 61 (67.8) | 0.98 | (0.96,1.00) | 0.52 | (0.49,0.55) |
| 400 | all | 87 (100.0) | 0.92 | (0.90,0.95) | 0.77 | (0.73,0.81) |
| | ≤0.01 | 19 (21.8) | 0.98 | (0.97,0.99) | 0.97 | (0.96,0.98) |
| | >0.01, ≤0.05 | 28 (32.2) | 0.95 | (0.93,0.96) | 0.88 | (0.83,0.92) |
| | >0.05 | 40 (46.0) | 0.88 | (0.82,0.94) | 0.59 | (0.56,0.62) |
| 200 | all | 89 (100.0) | 0.89 | (0.86,0.92) | 0.76 | (0.72,0.79) |
| | ≤0.01 | 10 (11.2) | 0.97 | (0.96,1.00) | 0.97 | (0.96,0.99) |
| | >0.01, ≤0.05 | 28 (31.5) | 0.91 | (0.89,0.94) | 0.90 | (0.88,0.93) |
| | >0.05 | 51 (57.3) | 0.85 | (0.81,0.90) | 0.64 | (0.61,0.67) |
a Results averaged over ten simulation replicates and ten iterations with independent selection of measured variant sites
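The accuracy summaries in the table above (mean genotype concordance with a 95% confidence interval, stratified by MAF) can be approximated with a short Python sketch. The MAF bins follow the table. The per-variant concordance definition, the normal-approximation interval (mean ± 1.96 standard errors), and all function names are assumptions for illustration, since the source does not state how the intervals were computed.

```python
from math import sqrt
from statistics import mean, stdev


def concordance(true_genotypes, imputed_genotypes):
    """Fraction of positions where the imputed genotype equals the truth."""
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


def maf_bin(maf):
    """Assign a minor allele frequency to the strata used in the tables."""
    if maf <= 0.01:
        return "≤0.01"
    if maf <= 0.05:
        return ">0.01, ≤0.05"
    return ">0.05"


def mean_ci(values, z=1.96):
    """Mean and normal-approximation 95% CI (assumed method)."""
    m = mean(values)
    if len(values) < 2:
        return m, (m, m)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, (m - half_width, m + half_width)


def summarise_by_maf(variants):
    """`variants` is a list of (maf, concordance) pairs, one per variant.
    Returns {stratum: (mean, (lower, upper))}, including an 'all' stratum."""
    strata = {"all": [rate for _, rate in variants]}
    for maf, rate in variants:
        strata.setdefault(maf_bin(maf), []).append(rate)
    return {name: mean_ci(rates) for name, rates in strata.items()}


# Hypothetical per-variant results: (MAF, concordance rate).
EXAMPLE = [(0.005, 1.0), (0.008, 0.98), (0.03, 0.9), (0.2, 0.7), (0.4, 0.8)]
summary = summarise_by_maf(EXAMPLE)
```

A stratum with a single variant gets a zero-width interval here. In practice the averaging would run over replicates and iterations, as the table footnote describes.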
Fig. 2a-d Accuracy of imputation relying on the coalescent (black) and with IMPUTE2 (gray) represented by mean concordance rates with the corresponding 95% confidence intervals (shown as error bars). a 1000 haplotypes were simulated under the basic coalescent. Genotypes were imputed based on the basic coalescent and with IMPUTE2 without recombination. Results are presented for all variants and stratified by minor allele frequency. b Different total numbers of haplotypes were simulated under the basic coalescent. Genotypes were imputed based on the basic coalescent and with IMPUTE2 without recombination. c Haplotypes were simulated under the basic coalescent extended to accommodate different exponential growth rates (α). Genotypes were imputed based on the coalescent with growth, and with IMPUTE2 without recombination. d Haplotypes were simulated under the coalescent extended to consider different numbers of subpopulations (β). Genotypes were imputed based on the coalescent with population structure and with IMPUTE2 without recombination
Distribution of simulated variants according to allele frequency, and dependence of imputation accuracy on population growth and structure. 1000 haplotypes were simulated under the coalescent with different exponential growth rates (α) and numbers of subpopulations (β). Genotypes were imputed based on the basic coalescent, the coalescent incorporating population growth and/or structure and with IMPUTE2 without recombination. Mean genotype concordance rates with the corresponding 95% confidence intervals (CIs) are presented. Results are shown for all variants and stratified by minor allele frequency (MAF)
| α | β | MAF | Variants: N (%) | Basic coalescent: Mean | Basic coalescent: (95% CI) | Coalescent with growth and/or structure: Mean | Coalescent with growth and/or structure: (95% CI) | IMPUTE2: Mean | IMPUTE2: (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| 1.25 | 1 | all | 87 (100.0) | 0.95 | (0.94,0.96) | 0.95 | (0.93,0.96) | 0.77 | (0.72,0.82) |
| | | ≤0.01 | 30 (34.5) | 0.99 | (0.98,1.00) | 0.99 | (0.98,1.00) | 0.99 | (0.98,1.00) |
| | | >0.01, ≤0.05 | 20 (23.0) | 0.90 | (0.88,0.92) | 0.90 | (0.88,0.92) | 0.90 | (0.88,0.92) |
| | | >0.05 | 37 (42.5) | 0.94 | (0.91,0.97) | 0.94 | (0.91,0.97) | 0.51 | (0.46,0.57) |
| 2.50 | 1 | all | 86 (100.0) | 0.94 | (0.92,0.96) | 0.94 | (0.92,0.95) | 0.76 | (0.71,0.81) |
| | | ≤0.01 | 25 (29.1) | 0.98 | (0.97,0.99) | 0.98 | (0.97,0.99) | 0.98 | (0.97,0.99) |
| | | >0.01, ≤0.05 | 23 (26.7) | 0.90 | (0.88,0.92) | 0.90 | (0.88,0.92) | 0.89 | (0.87,0.91) |
| | | >0.05 | 38 (44.2) | 0.93 | (0.89,0.96) | 0.93 | (0.89,0.96) | 0.50 | (0.45,0.55) |
| 3.75 | 1 | all | 88 (100.0) | 0.94 | (0.93,0.96) | 0.94 | (0.93,0.96) | 0.76 | (0.70,0.81) |
| | | ≤0.01 | 27 (31.0) | 0.98 | (0.97,0.99) | 0.98 | (0.97,0.99) | 0.98 | (0.98,0.99) |
| | | >0.01, ≤0.05 | 24 (27.3) | 0.91 | (0.89,0.92) | 0.91 | (0.89,0.92) | 0.89 | (0.88,0.91) |
| | | >0.05 | 37 (42.0) | 0.93 | (0.90,0.96) | 0.93 | (0.90,0.96) | 0.50 | (0.45,0.56) |
| 5.00 | 1 | all | 88 (100.0) | 0.94 | (0.93,0.96) | 0.94 | (0.93,0.96) | 0.76 | (0.71,0.81) |
| | | ≤0.01 | 27 (30.7) | 0.99 | (0.98,1.00) | 0.99 | (0.98,1.00) | 0.99 | (0.98,1.00) |
| | | >0.01, ≤0.05 | 24 (27.3) | 0.92 | (0.90,0.93) | 0.92 | (0.90,0.93) | 0.90 | (0.88,0.92) |
| | | >0.05 | 37 (42.0) | 0.93 | (0.90,0.96) | 0.93 | (0.90,0.96) | 0.50 | (0.45,0.56) |
| 0.00 | 2 | all | 86 (100.0) | 0.95 | (0.93,0.97) | 0.95 | (0.93,0.97) | 0.85 | (0.81,0.89) |
| | | ≤0.01 | 38 (44.2) | 0.96 | (0.93,1.00) | 0.96 | (0.93,1.00) | 0.98 | (0.97,0.99) |
| | | >0.01, ≤0.05 | 32 (37.2) | 0.94 | (0.92,0.95) | 0.94 | (0.92,0.95) | 0.89 | (0.88,0.91) |
| | | >0.05 | 16 (18.6) | 0.93 | (0.86,0.99) | 0.92 | (0.86,0.99) | 0.44 | (0.41,0.48) |
| 0.00 | 3 | all | 83 (100.0) | 0.94 | (0.91,0.97) | 0.94 | (0.91,0.97) | 0.87 | (0.82,0.91) |
| | | ≤0.01 | 47 (56.6) | 0.97 | (0.94,1.00) | 0.97 | (0.94,1.00) | 0.99 | (0.98,1.00) |
| | | >0.01, ≤0.05 | 20 (24.1) | 0.92 | (0.90,0.94) | 0.92 | (0.90,0.94) | 0.91 | (0.89,0.94) |
| | | >0.05 | 16 (19.3) | 0.86 | (0.77,0.95) | 0.86 | (0.77,0.95) | 0.47 | (0.40,0.53) |
| 0.00 | 4 | all | 86 (100.0) | 0.93 | (0.90,0.96) | 0.93 | (0.90,0.96) | 0.92 | (0.88,0.95) |
| | | ≤0.01 | 56 (65.1) | 0.97 | (0.94,1.00) | 0.97 | (0.94,1.00) | 0.99 | (0.98,1.00) |
| | | >0.01, ≤0.05 | 18 (20.9) | 0.95 | (0.93,0.97) | 0.95 | (0.93,0.97) | 0.91 | (0.89,0.94) |
| | | >0.05 | 12 (14.0) | 0.67 | (0.60,0.75) | 0.68 | (0.60,0.75) | 0.59 | (0.48,0.70) |
| 0.00 | 5 | all | 86 (100.0) | 0.94 | (0.91,0.96) | 0.94 | (0.91,0.96) | 0.94 | (0.92,0.96) |
| | | ≤0.01 | 48 (55.8) | 0.97 | (0.93,1.00) | 0.97 | (0.93,1.00) | 0.99 | (0.98,1.00) |
| | | >0.01, ≤0.05 | 30 (34.9) | 0.93 | (0.92,0.94) | 0.93 | (0.92,0.94) | 0.93 | (0.91,0.94) |
| | | >0.05 | 8 (9.3) | 0.76 | (0.63,0.89) | 0.76 | (0.63,0.89) | 0.67 | (0.58,0.76) |
| 5.00 | 5 | all | 85 (100.0) | 0.93 | (0.91,0.96) | 0.93 | (0.90,0.96) | 0.94 | (0.92,0.96) |
| | | ≤0.01 | 47 (55.3) | 0.97 | (0.93,1.00) | 0.97 | (0.93,1.00) | 0.99 | (0.98,1.00) |
| | | >0.01, ≤0.05 | 28 (32.9) | 0.93 | (0.92,0.94) | 0.93 | (0.92,0.94) | 0.93 | (0.92,0.94) |
| | | >0.05 | 10 (11.8) | 0.76 | (0.67,0.85) | 0.76 | (0.67,0.85) | 0.72 | (0.65,0.79) |
Distribution of simulated variants according to allele frequency, and accuracy of variants imputed based on the basic coalescent, with IMPUTE2 without recombination, and with IMPUTE2 with recombination. Real genotypes were retrieved from the 1000 Genomes Project (1000 GP), treating the CEU, AFR and AMR subpopulations as hypothetical study populations. The remaining 1000 GP individuals formed the external reference panel for genotype imputation. Mean concordance rates with the corresponding 95% confidence intervals (CIs) are presented. Results are shown for all variants and also stratified by minor allele frequency (MAF)
| Study | MAF | Variants: N (%) | Basic coalescent: Mean | Basic coalescent: (95% CI) | IMPUTE2 without recombination: Mean | IMPUTE2 without recombination: (95% CI) | Variants: N (%) | IMPUTE2 with recombination: Mean | IMPUTE2 with recombination: (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| CEU | all | 41 (100.0) | 0.92 | (0.89,0.95) | 0.90 | (0.86,0.93) | 42 (100.0) | 0.93 | (0.89,0.96) |
| | ≤0.01 | 11 (26.8) | 0.98 | (0.97,0.99) | 0.98 | (0.97,0.99) | 12 (28.6) | 0.98 | (0.97,0.99) |
| | >0.01, ≤0.05 | 21 (51.2) | 0.92 | (0.91,0.93) | 0.91 | (0.91,0.92) | 19 (45.2) | 0.99 | (0.97,1.00) |
| | >0.05 | 9 (22.0) | 0.84 | (0.70,0.99) | 0.75 | (0.62,0.88) | 11 (26.2) | 0.76 | (0.68,0.85) |
| AFR | all | 123 (100.0) | 0.93 | (0.92,0.95) | 0.93 | (0.91,0.95) | 121 (100.0) | 0.96 | (0.94,0.97) |
| | ≤0.01 | 35 (28.5) | 0.99 | (0.98,1.00) | 0.99 | (0.98,1.00) | 37 (30.6) | 0.99 | (0.98,1.00) |
| | >0.01, ≤0.05 | 58 (47.2) | 0.97 | (0.96,0.98) | 0.97 | (0.96,0.97) | 56 (46.3) | 0.96 | (0.95,0.97) |
| | >0.05 | 30 (24.4) | 0.81 | (0.76,0.85) | 0.79 | (0.74,0.83) | 28 (23.1) | 0.90 | (0.85,0.95) |
| AMR | all | 66 (100.0) | 0.92 | (0.90,0.95) | 0.92 | (0.89,0.94) | 67 (100.0) | 0.96 | (0.94,0.98) |
| | ≤0.01 | 36 (54.5) | 0.98 | (0.97,0.99) | 0.98 | (0.97,0.99) | 36 (53.7) | 0.99 | (0.98,1.00) |
| | >0.01, ≤0.05 | 5 (7.6) | 0.91 | (0.88,0.94) | 0.90 | (0.86,0.95) | 7 (10.4) | 0.93 | (0.90,0.96) |
| | >0.05 | 25 (37.9) | 0.84 | (0.79,0.89) | 0.82 | (0.76,0.87) | 24 (35.8) | 0.93 | (0.87,0.98) |
CEU…Utah residents with Northern and Western European ancestry; AFR…African populations including African Ancestry in Southwest US (ASW), Luhya in Webuye, Kenya (LWK) and Yoruba in Ibadan, Nigeria (YRI); AMR…American populations including Colombian in Medellin, Colombia (CLM), Mexican Ancestry in Los Angeles, California (MXL) and Puerto Rican in Puerto Rico (PUR)