Ethan M Jewett, Matthew Zawistowski, Noah A Rosenberg, Sebastian Zöllner.
Abstract
The potential for imputed genotypes to enhance an analysis of genetic data depends largely on the accuracy of imputation, which in turn depends on properties of the reference panel of template haplotypes used to perform the imputation. To provide a basis for exploring how properties of the reference panel affect imputation accuracy theoretically rather than with computationally intensive imputation experiments, we introduce a coalescent model that considers imputation accuracy in terms of population-genetic parameters. Our model allows us to investigate sampling designs in the frequently occurring scenario in which imputation targets and templates are sampled from different populations. In particular, we derive expressions for expected imputation accuracy as a function of reference panel size and divergence time between the reference and target populations. We find that a modestly sized "internal" reference panel from the same population as a target haplotype yields, on average, greater imputation accuracy than a larger "external" panel from a different population, even if the divergence time between the two populations is small. The improvement in accuracy for the internal panel increases with increasing divergence time between the target and reference populations. Thus, in humans, our model predicts that imputation accuracy can be improved by generating small population-specific custom reference panels to augment existing collections such as those of the HapMap or 1000 Genomes Projects. Our approach can be extended to understand additional factors that affect imputation accuracy in complex population-genetic settings, and the results can ultimately facilitate improvements in imputation study designs.
Year: 2012 PMID: 22595242 PMCID: PMC3416004 DOI: 10.1534/genetics.111.137984
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Figure 1. Two-population coalescent model for imputation reference panel selection. (A) Two populations, labeled 1 and 2, of sizes N1 and N2 diploid individuals, diverge from an ancestral population of size NA at time tD. A single haplotype T, for which genotypes at untyped markers are to be imputed, is sampled from population 1. We consider two possible reference panels for imputing T: an internal reference panel of n1 haplotypes sampled from population 1 and an external reference panel of n2 haplotypes sampled from population 2. If T first coalesces with a type 1 lineage (blue), then the internal panel is optimal for imputing T (event C1). The external panel is optimal (event C2) if T first coalesces with a lineage of type 2 (red). Finally, if T first coalesces with a type 1–2 lineage (orange), then the two reference panels are equivalent (event C12). (B) To compute the probability of optimality for each reference panel, we condition on whether T coalesces before the divergence, on the quantities iD and jD (the numbers of lineages originating in populations 1 and 2, respectively, that remain at the time of divergence), and on iC, jC, and kC (the numbers of type 1, type 2, and type 1–2 lineages remaining at the instant when T first coalesces). In the realization pictured, T does not coalesce before the divergence time, and iD = 3, jD = 2, iC = 2, and jC = kC = 1. Because T first coalesces with a type 1–2 lineage (event C12), the two reference panels are equivalent for imputing T.
Derivation of the recursion
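| Lineage pair for next coalescence | Resulting lineage | Number of ways event can occur | ℙ(T first coalesces with type 1, given pair) |
|---|---|---|---|
| T, 1 | — | i | 1 |
| T, 2 | — | j | 0 |
| T, 1–2 | — | k | 0 |
| 1, 1 | 1 | i(i − 1)/2 | value at (i − 1, j, k) |
| 1, 2 | 1–2 | ij | value at (i − 1, j − 1, k + 1) |
| 1, 1–2 | 1–2 | ik | value at (i − 1, j, k) |
| 2, 1–2 | 1–2 | jk | value at (i, j − 1, k) |
| 2, 2 | 2 | j(j − 1)/2 | value at (i, j − 1, k) |
| 1–2, 1–2 | 1–2 | k(k − 1)/2 | value at (i, j, k − 1) |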
Assume that in addition to lineage T, i lineages of type 1, j lineages of type 2, and k lineages of type 1–2 exist in the ancestral population at some time t > tD. Conditional on this configuration (i, j, k), consider the probability that T first coalesces with a lineage of type 1. Column 1 lists each possible lineage pair for the next coalescence event. Column 2 gives the lineage type that results from the coalescence. Column 3 contains the number of ways each event can occur. Column 4 gives the probability that T first coalesces with a lineage of type 1, conditional on the pair of lineages in column 1 being the next to coalesce; for pairs that do not involve T, this is the same probability evaluated at the configuration that results from the coalescence. The recursive equation for this probability is obtained by conditioning on all the possible lineage pairs for the next coalescence.
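The conditioning argument described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `prob_first_type1` is hypothetical, and the sketch assumes that every pair of lineages in the ancestral population is equally likely to be the next to coalesce.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def prob_first_type1(i: int, j: int, k: int) -> Fraction:
    """Probability that T first coalesces with a type 1 lineage, given
    i type 1, j type 2, and k type 1-2 lineages in addition to T.

    Assumption: all pairs of lineages are equally likely to coalesce next.
    """
    if min(i, j, k) < 0:
        raise ValueError("lineage counts must be non-negative")
    if i == 0:
        # No coalescence ever creates a type 1 lineage, so none can appear.
        return Fraction(0)
    n = i + j + k
    total_pairs = (n + 1) * n // 2  # pairs among the n lineages plus T

    # (number of ways, configuration after the event) for pairs without T.
    transitions = [
        (i * (i - 1) // 2, (i - 1, j, k)),          # 1, 1     -> 1
        (i * j,            (i - 1, j - 1, k + 1)),  # 1, 2     -> 1-2
        (i * k,            (i - 1, j, k)),          # 1, 1-2   -> 1-2
        (j * k,            (i, j - 1, k)),          # 2, 1-2   -> 1-2
        (j * (j - 1) // 2, (i, j - 1, k)),          # 2, 2     -> 2
        (k * (k - 1) // 2, (i, j, k - 1)),          # 1-2, 1-2 -> 1-2
    ]
    # T joins a type 1 lineage in i ways (probability 1); joining a type 2
    # or type 1-2 lineage contributes probability 0 and is omitted.
    weighted = Fraction(i)
    for ways, config in transitions:
        if ways > 0:
            weighted += ways * prob_first_type1(*config)
    return weighted / total_pairs
```

For example, `prob_first_type1(1, 1, 0)` returns `Fraction(1, 3)`: with T, one type 1 lineage, and one type 2 lineage, each of the three pairs is equally likely to coalesce first, and only the pair (T, 1) leads to T joining a type 1 lineage. Exact rational arithmetic is used here only to make small cases easy to check by hand.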
Figure 2. Coalescence times between the target T and the reference panels. T1 indicates the time at which the target haplotype T first coalesces with a type 1 or type 1–2 lineage. We choose one of the descendant reference haplotypes from that coalescence event (highlighted in purple) to be the template from the internal reference panel. We assume that when using the internal panel, the number of mutations that result in incorrectly imputed sites follows a Poisson distribution with mean 2T1 · θω/2, where 2T1 is the total branch length separating the target T from the templates sampled from the internal panel in units of 2NA generations. Here, θ = 4NAμ is the per-base population-scaled mutation rate, μ is the per-base per-generation mutation rate, and ω is the number of bases genotyped in the reference population that will be imputed in T. Similarly, T2 is the time at which the target haplotype T first coalesces with a type 2 or type 1–2 lineage and 2T2 is the branch length between T and the set of potential templates from the external reference panel (the best external reference panel is highlighted in green).
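As a worked illustration of the Poisson mean described in this caption, the sketch below evaluates 2T · θω/2 for a given coalescence time T. The function names and the numerical values in the usage note are hypothetical and chosen only for illustration; the calculation conditions on fixed coalescence times, whereas the expectations reported in the article average over the distribution of those times.

```python
def scaled_mutation_rate(n_anc: float, mu: float) -> float:
    """Per-base population-scaled mutation rate, theta = 4 * N_A * mu."""
    return 4.0 * n_anc * mu


def expected_errors(t_coal: float, n_anc: float, mu: float, omega: float) -> float:
    """Poisson mean of imputation errors for coalescence time t_coal.

    t_coal is in units of 2 * N_A generations, so the branch length
    separating the target from the template is 2 * t_coal.
    """
    branch_length = 2.0 * t_coal
    return branch_length * scaled_mutation_rate(n_anc, mu) * omega / 2.0


def expected_excess_errors(t1: float, t2: float, n_anc: float,
                           mu: float, omega: float) -> float:
    """Difference in Poisson means, external minus internal, given t1 and t2."""
    return (expected_errors(t2, n_anc, mu, omega)
            - expected_errors(t1, n_anc, mu, omega))
```

With the illustrative values T1 = 0.05, NA = 10,000, μ = 1e-8, and ω = 100,000 bases, θω = 40 and the mean is 0.05 × 40 = 2 errors. Dividing `expected_excess_errors` by θω leaves T2 − T1, which is why differences in error counts can be reported in units of θω.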
Figure 3. The two-population coalescent model of divergence, assuming exponential growth in the descendant populations. The sizes of populations 1 and 2 change over time according to N1(t) = N1(0)e^(−α1t) and N2(t) = N2(0)e^(−α2t), respectively, for t ∈ [0, tD], with t measured backward from the present. The quantities α1, α2 > 0 are growth rates, and N1(0) and N2(0) are the sizes of populations 1 and 2 in the present. At time tD, populations 1 and 2 merge instantaneously into the ancestral population, which has constant size NA. In our analysis, to explore the effect of exponential population growth on imputation accuracy, we vary N1(0) and N2(0) while holding N1(tD) and N2(tD) fixed.
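The design of varying the present-day size while holding the size at the divergence time fixed can be made concrete with a short sketch. It assumes the usual backward-in-time parameterization of exponential growth, N(t) = N(0)·exp(−αt), which is an assumption of this example; the function names are hypothetical.

```python
import math


def growth_rate(n_present: float, n_divergence: float, t_div: float) -> float:
    """Growth rate alpha that takes the size from n_divergence at time
    t_div (in the past) to n_present at time 0, assuming
    N(t) = N(0) * exp(-alpha * t)."""
    return math.log(n_present / n_divergence) / t_div


def population_size(t: float, n_present: float, alpha: float) -> float:
    """Population size at time t before the present under exponential growth."""
    return n_present * math.exp(-alpha * t)
```

Holding N(tD) fixed and increasing N(0) therefore corresponds to increasing α, and setting N(0) = N(tD) gives α = 0, which recovers the constant-size model.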
Reformulation of the results of Equations (1) through (25) for the case of exponential growth
| Index | Number | Quantity | Dependencies | Description |
|---|---|---|---|---|
| 1 | 26 | | None | Conversion of elapsed time in units of 2 |
| 2 | 27 | | 1 | Probability that |
| 3 | 4 | | 2 | Joint probability that |
| 4 | 5 | | 3 | Probability that |
| 5 | 6 | ℙ( | 4 | Probability that |
| 6 | 15 | | None | Probability that |
| 7 | 16 | | None | Probability that |
| 8 | 17 | | None | Probability that |
| 9 | 1 | | 6, 5, 3, 2 | Probability that |
| 10 | 9 | | 7, 3, 2 | Probability that |
| 11 | 10 | | 8, 3, 2 | Probability that |
| 12 | 23 | | 5, 2 | Survival function of the time until lineage |
| 13 | 22 | | 12 | Expected time until |
| 14 | 24 | | 4, 3 | Expected time until |
| 15 | 21 | | 14, 13, 5, 4 | Expected time until |
| 16 | 25 | | 2 | Expected time until |
The derivation of each expression is the same as in the case of populations of constant size, except that the exponential-growth form of the function h, with arguments (t; N(0), α), is used rather than the constant-size form with arguments (t; N). The quantities on which the expressions in the table depend are given in the Dependencies column. The numbers in the Dependencies column correspond to those in the Index column. The number of each equation, or of its analog for the case of populations of constant size, is given in the Number column. Formulas for the case in which populations 1 and 2 have constant size are obtained by setting α1 and α2 equal to 0.
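The Dependencies column defines an order in which the quantities can be evaluated. The sketch below encodes the Index and Dependencies columns exactly as listed in the table and derives a valid evaluation order with Python's standard `graphlib` module (Python 3.9 or later). The variable names are hypothetical, and the sketch says nothing about how each quantity is actually computed.

```python
from graphlib import TopologicalSorter

# Index -> indices it depends on, copied from the Dependencies column.
DEPENDENCIES = {
    1: [],
    2: [1],
    3: [2],
    4: [3],
    5: [4],
    6: [],
    7: [],
    8: [],
    9: [6, 5, 3, 2],
    10: [7, 3, 2],
    11: [8, 3, 2],
    12: [5, 2],
    13: [12],
    14: [4, 3],
    15: [14, 13, 5, 4],
    16: [2],
}

# static_order() yields every index after all of its dependencies.
evaluation_order = list(TopologicalSorter(DEPENDENCIES).static_order())
```

Because every entry depends only on entries with a smaller index, evaluating the rows in Index order from 1 to 16 is itself a valid order.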
Comparison of closed-form, recursive, and simulated probabilities
| n1 | n2 | Closed form | Recursion | Simulation | Closed form | Recursion | Simulation |
|---|---|---|---|---|---|---|---|
| 5 | 5 | 0.4069 | 0.4069 | 0.4069 | 0.5083 | 0.5083 | 0.5082 |
| 5 | 10 | 0.2807 | 0.2807 | 0.2808 | 0.4164 | 0.4164 | 0.4164 |
| 5 | 50 | 0.1188 | 0.1188 | 0.1191 | 0.3034 | 0.3034 | 0.3038 |
| 10 | 10 | 0.4392 | 0.4392 | 0.4392 | 0.6051 | 0.6051 | 0.6053 |
| 10 | 50 | – | 0.2146 | 0.2153 | – | 0.4830 | 0.4836 |
| 50 | 50 | – | 0.6078 | 0.6068 | – | 0.8778 | 0.8790 |
| 50 | 100 | – | 0.5404 | 0.5378 | – | 0.8670 | 0.8682 |
| 100 | 100 | – | 0.7268 | 0.7274 | – | 0.9494 | 0.9499 |
ℙ(C1) computed analytically using closed-form (Equations 7 and 8) and recursive (Equation 15) expressions, and estimated from coalescent simulations using 10^6 replicates.
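A simulation-based check of this kind can be sketched for the ancestral phase alone. The code below is a simplified stand-in and not the authors' simulation: it starts from a given number of type 1, type 2, and type 1–2 lineages in a single population, assumes every pair is equally likely to coalesce next, and ignores the pre-divergence phase and the divergence time, so its output is not comparable to the values in the table. The function names are hypothetical.

```python
import random


def first_partner_type(i: int, j: int, k: int, rng: random.Random) -> str:
    """Simulate coalescences until T coalesces; return its partner's type."""
    if i + j + k == 0:
        raise ValueError("T needs at least one other lineage to coalesce with")
    lineages = ["T"] + ["1"] * i + ["2"] * j + ["1-2"] * k
    while True:
        # Pick the next pair to coalesce uniformly at random.
        a, b = rng.sample(range(len(lineages)), 2)
        type_a, type_b = lineages[a], lineages[b]
        if type_a == "T":
            return type_b
        if type_b == "T":
            return type_a
        # Two lineages of the same type keep it; mixed pairs become 1-2.
        merged = type_a if type_a == type_b else "1-2"
        for index in sorted((a, b), reverse=True):
            del lineages[index]
        lineages.append(merged)


def estimate_prob_type1(i: int, j: int, k: int,
                        replicates: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the probability that T first joins type 1."""
    rng = random.Random(seed)
    hits = sum(first_partner_type(i, j, k, rng) == "1"
               for _ in range(replicates))
    return hits / replicates
```

For one type 1 lineage and one type 2 lineage the exact probability is 1/3, so `estimate_prob_type1(1, 1, 0, 20000)` should fall close to that value. A Monte Carlo estimate of a probability p from R replicates has standard error sqrt(p(1 − p)/R), which is at most 0.0005 for R = 10^6, so agreement to about three decimal places is the most that a simulation of that size can show.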
Figure 4. Imputation performance for the constant-size two-population model. For two different divergence times tD, the figure shows the probability ℙ(C1) that the internal reference panel is optimal and the expectation E[S2 − S1] of the number of additional imputation errors made when imputing using the external reference panel rather than the internal reference panel. (A) ℙ(C1), tD = 0.005. (B) ℙ(C1), tD = 0.05. (C) E[S2 − S1], tD = 0.005. (D) E[S2 − S1], tD = 0.05. E[S2 − S1] is reported in units of the population-scaled mutation rate θω = 4NAμω for the imputed region of ω bases. Reference panel size is the number of haplotypes in the panel. For clarity, the scale of C and D differs from that of A and B.
Figure 5. Imputation performance for the exponential-growth two-population model. For two different divergence times tD, the figure shows the probability ℙ(C1) that the internal reference panel is optimal and the expectation E[S2 − S1] of the number of additional imputation errors made when imputing using the external reference panel rather than the internal reference panel. Values for the exponential growth model are plotted with dashed lines and, for comparison, the corresponding values for a constant-size model are shown with solid lines. (A) ℙ(C1), tD = 0.005. (B) ℙ(C1), tD = 0.05. (C) E[S2 − S1], tD = 0.005. (D) E[S2 − S1], tD = 0.05. E[S2 − S1] is reported in units of the population-scaled mutation rate θω = 4NAμω for the imputed region of ω bases. Reference panel size is the number of haplotypes in the panel. For clarity, the scale of C and D differs from that of A and B.
Figure 6. The effect of population growth on coalescent waiting times. Increasing the present-day size N(0) of a population while holding the size N(tD) at time tD fixed increases the mean waiting time for each coalescence event.
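The direction of this effect can be checked numerically for a single pair of lineages. The sketch below is an illustration under stated assumptions rather than the article's formula: it measures time in units of 2NA generations, takes the pairwise coalescence rate at time t to be NA/N(t), and uses N(t) = N(0)·exp(−αt) with α chosen so that N(tD) stays fixed. The function names are hypothetical.

```python
import math


def coalescence_intensity(t: float, n_present: float,
                          alpha: float, n_anc: float) -> float:
    """Integral of n_anc / N(s) over [0, t], with N(s) = n_present * exp(-alpha * s)."""
    if alpha == 0.0:
        return n_anc * t / n_present
    return n_anc * math.expm1(alpha * t) / (alpha * n_present)


def prob_pair_uncoalesced(t_div: float, n_present: float,
                          n_divergence: float, n_anc: float) -> float:
    """Probability that two lineages have not coalesced by time t_div,
    holding the population size at t_div fixed at n_divergence."""
    alpha = math.log(n_present / n_divergence) / t_div
    return math.exp(-coalescence_intensity(t_div, n_present, alpha, n_anc))
```

Calling `prob_pair_uncoalesced` with increasing values of `n_present` and the other arguments fixed gives increasing probabilities: a larger present-day population lowers the coalescence rate throughout the interval [0, tD], so each coalescence takes longer on average.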