Bryan Howie, Jonathan Marchini, Matthew Stephens.
Abstract
Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package.
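The idea of choosing a custom reference panel for each study haplotype can be illustrated with a minimal sketch. The abstract says only that "local sequence similarity" is used, so the Hamming distance below is an assumption made for illustration, and the names `hamming` and `select_local_panel` are hypothetical rather than part of IMPUTE2.

```python
def hamming(a, b):
    """Count the positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def select_local_panel(study_hap, ref_haps, k):
    """Return indices of the k reference haplotypes most similar to study_hap.

    study_hap: list of 0/1 alleles at the typed SNPs in one genomic region.
    ref_haps:  list of reference haplotypes typed at the same SNPs.
    Similarity is measured by Hamming distance (an assumed metric).
    """
    # sorted() is stable, so ties keep their original reference order.
    ranked = sorted(range(len(ref_haps)),
                    key=lambda i: hamming(study_hap, ref_haps[i]))
    return ranked[:k]


# Toy region with six typed SNPs and four reference haplotypes.
ref_haps = [
    [0, 0, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]
study_hap = [0, 0, 1, 1, 0, 1]
chosen = select_local_panel(study_hap, ref_haps, k=2)
print(chosen)
```

Because the selection is repeated for every study haplotype and region, the cost of the later imputation step depends on k rather than on the full size of the reference set, which is the efficiency argument the abstract makes.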
Keywords: GWAS; haplotype; human; linkage disequilibrium; reference panel
Year: 2011 PMID: 22384356 PMCID: PMC3276165 DOI: 10.1534/g3.111.001198
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
HapMap 3 panels used for cross-validation
| Panel ID | Panel Description | Number of Unrelated Individuals |
|---|---|---|
| ASW | African ancestry in Southwest USA | 63 |
| CEU | Utah residents with Northern and Western European ancestry from the CEPH collection | 117 |
| CHB | Han Chinese in Beijing, China | 84 |
| CHD | Chinese in Metropolitan Denver, Colorado | 85 |
| GIH | Gujarati Indians in Houston, Texas | 88 |
| JPT | Japanese in Tokyo, Japan | 86 |
| LWK | Luhya in Webuye, Kenya | 90 |
| MKK | Maasai in Kinyawa, Kenya | 143 |
| MXL | Mexican ancestry in Los Angeles, California | 52 |
| TSI | Toscani in Italia | 88 |
| YRI | Yoruba in Ibadan, Nigeria | 115 |
| All panels | | 1011 |
In panels that included trios (ASW, CEU, MXL, MKK, and YRI), we retained the trio parents as “unrelated” individuals. In panels that included parent-child duos (ASW, CEU, MXL, and YRI), we retained the observed duo parent and the inferred transmitted haplotype from the unobserved duo parent, yielding three “unrelated” haplotypes per duo; we then paired the inferred transmitted haplotypes at random to create diploid pseudo-individuals.
We combined the CHB and JPT panels into a single CHB+JPT panel with 170 individuals for all of the analyses in this paper.
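The totals quoted for the table can be checked with a few lines of arithmetic. This sketch copies the per-panel counts from the table and assumes two haplotypes per diploid individual; the variable names are illustrative.

```python
# Number of unrelated individuals per HapMap 3 panel, copied from the table.
panels = {
    "ASW": 63, "CEU": 117, "CHB": 84, "CHD": 85, "GIH": 88, "JPT": 86,
    "LWK": 90, "MKK": 143, "MXL": 52, "TSI": 88, "YRI": 115,
}

total_individuals = sum(panels.values())
# Each diploid individual contributes two haplotypes.
total_haplotypes = 2 * total_individuals
# Size of the combined CHB+JPT panel.
chb_jpt = panels["CHB"] + panels["JPT"]

print(total_individuals, total_haplotypes, chb_jpt)
```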
Figure 1 Imputation accuracy at low-frequency SNPs in HapMap 3 cross-validations in ASW and TSI, as a function of reference panel composition and k value. These plots show the imputation accuracy of IMPUTE2 in (A) the ASW panel and (B) the TSI panel. The accuracy of each experiment is plotted on the y-axis as the mean R² across all SNPs with MAF < 5% in the cross-validation panel (identified by the gray box in each plot). The x-axis shows the k parameter, which scales linearly with the computational burden of imputation updates in IMPUTE2. Each curve represents a different reference panel, with panels added cumulatively in the order shown in the legends, reading from bottom to top. Similar plots for other HapMap 3 target panels can be found in File S1.
Figure 2 Imputation accuracy at low-frequency SNPs in HapMap 3 cross-validations, as a function of target panel, reference panel composition, k value, and imputation method. These plots show the imputation accuracy of IMPUTE2 and Beagle in various cross-validation experiments. The accuracy of each experiment is plotted on the y-axis as the mean R² across all SNPs with MAF < 5% in the cross-validation panel (identified by the gray box in each plot). The x-axis shows the k parameter, which scales linearly with the computational burden of imputation updates in IMPUTE2. The solid black curves show how R² varies with k when using IMPUTE2 with a reference panel containing the full set of 2020 HapMap 3 haplotypes; the dashed black lines show the accuracy of Beagle with this reference panel. IMPUTE2 was also applied to subpanels of the full HapMap 3 panel, with results shown as orange curves. Similar plots for other observed SNP sets and imputed SNP MAFs can be found in File S3.
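The accuracy summary used in Figures 1 and 2 can be sketched as follows. This assumes the per-SNP accuracy is the squared Pearson correlation between the masked true genotypes (coded 0/1/2) and the imputed allele dosages, and that MAF is computed from the true genotypes; the function names and toy data are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation, or None when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant vector
    return sxy * sxy / (sxx * syy)


def minor_allele_freq(genotypes):
    """MAF from diploid genotypes coded as 0, 1, or 2 copies of one allele."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def mean_r2_low_freq(true_genos, dosages, maf_cutoff=0.05):
    """Mean per-SNP R^2 over SNPs whose MAF is below maf_cutoff."""
    scores = []
    for g, d in zip(true_genos, dosages):
        if minor_allele_freq(g) < maf_cutoff:
            r2 = pearson_r2(g, d)
            if r2 is not None:
                scores.append(r2)
    return sum(scores) / len(scores) if scores else None


# Toy data: 12 individuals, one rare SNP (MAF = 1/24) and one common SNP.
true_genos = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
]
dosages = [
    [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.2, 0.8, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 1.0],
]
result = mean_r2_low_freq(true_genos, dosages)
print(result)
```

Only the rare SNP passes the MAF filter here, so the poorly imputed common SNP does not affect the summary.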
Figure 3 Imputation accuracy in Gambian validation set as a function of reference panel composition and minor allele frequency. These plots show the accuracy obtained when imputing masked SNPs in 1216 Gambian individuals from the MalariaGEN dataset using IMPUTE2 with k = 500. Each reference panel is represented by a different color, and the results are shown for (A) all SNPs and (B) SNPs with MAF < 10% in the Gambian validation set. The results are binned by MAF, with 5% bins in (A) and 1% bins in (B). Each point on a curve is located in the middle of the corresponding MAF bin. The following reference panel codes are used in the legend: GMB (Gambia, 200 haplotypes); GHN (Ghana, 200 haplotypes); and HM3 (HapMap 3, 2022 haplotypes).
Figure 4 Comparison of imputation accuracy between IMPUTE2 and Beagle in Gambian validation set. This plot shows the accuracy obtained when imputing masked SNPs in 1216 Gambian individuals from the MalariaGEN dataset using either IMPUTE2 with k = 500 (solid lines) or Beagle on default settings (dashed lines). Imputation was performed with a reference panel of Gambian haplotypes (blue) and a reference panel of Gambian, Ghanaian, and HapMap 3 African ancestry haplotypes (gray). The results are grouped into 5% MAF bins, and each point on a curve is located in the middle of the corresponding MAF bin. The following reference panel codes are used in the legend: GMB (Gambia, 200 haplotypes); GHN (Ghana, 200 haplotypes); and HM3.afr (HapMap 3 African ancestry, 822 haplotypes).
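The MAF binning described for Figures 3 and 4 can be sketched as below: per-SNP accuracies are grouped into fixed-width MAF bins, and each bin's mean is placed at the bin midpoint. The function name, the handling of the upper boundary, and the toy values are assumptions for illustration.

```python
def bin_by_maf(mafs, scores, width=0.05, max_maf=0.5):
    """Return sorted (bin midpoint, mean score) pairs for non-empty bins."""
    n_bins = int(round(max_maf / width))
    sums = {}
    counts = {}
    for maf, score in zip(mafs, scores):
        # The small offset guards against floating-point error at bin edges;
        # a MAF equal to max_maf is assigned to the last bin.
        idx = min(int(maf / width + 1e-12), n_bins - 1)
        sums[idx] = sums.get(idx, 0.0) + score
        counts[idx] = counts.get(idx, 0) + 1
    return [((idx + 0.5) * width, sums[idx] / counts[idx])
            for idx in sorted(sums)]


# Toy per-SNP values: MAF and accuracy for four SNPs.
mafs = [0.02, 0.04, 0.12, 0.48]
scores = [0.5, 0.7, 0.9, 0.95]
points = bin_by_maf(mafs, scores, width=0.05)
print(points)
```

Calling the same function with `width=0.01` and `max_maf=0.10` would correspond to the 1% bins described for Figure 3B.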
Computational benchmarks for a simulated GWAS of 1000 European individuals imputed from reference panels with 10,000 SNPs
| Method | k | Reference Panel | Running Time (minutes) | RAM (GB) |
|---|---|---|---|---|
| IMPUTE2 | 500 | European | 90 | 0.26 |
| IMPUTE2 | 500 | Cosmopolitan | 127 | 0.60 |
| IMPUTE2 | 1000 | European | 157 | 0.30 |
| IMPUTE2 | 4800 | Cosmopolitan | 603 | 0.74 |
| Beagle | — | European | 655 | 5.2 |
| Beagle | — | Cosmopolitan | 5904 | 15.2 |
The European panel contains 1000 haplotypes.
The Cosmopolitan panel contains 4800 haplotypes with ancestry from Africa, Asia, and Europe.
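The benchmark figures can be turned into relative costs with a short calculation. This sketch copies the values from the table, assumes the unlabeled numeric column in the IMPUTE2 rows is the k parameter, and compares Beagle against IMPUTE2 at k = 500; that choice of comparison is illustrative, not one the source singles out.

```python
# (method, k, panel) -> (running time in minutes, RAM in GB), from the table.
benchmarks = {
    ("IMPUTE2", 500, "European"): (90, 0.26),
    ("IMPUTE2", 500, "Cosmopolitan"): (127, 0.60),
    ("IMPUTE2", 1000, "European"): (157, 0.30),
    ("IMPUTE2", 4800, "Cosmopolitan"): (603, 0.74),
    ("Beagle", None, "European"): (655, 5.2),
    ("Beagle", None, "Cosmopolitan"): (5904, 15.2),
}


def beagle_vs_impute2(panel, k=500):
    """Return (time ratio, RAM ratio) of Beagle relative to IMPUTE2."""
    b_time, b_ram = benchmarks[("Beagle", None, panel)]
    i_time, i_ram = benchmarks[("IMPUTE2", k, panel)]
    return b_time / i_time, b_ram / i_ram


eur_time, eur_ram = beagle_vs_impute2("European")
cos_time, cos_ram = beagle_vs_impute2("Cosmopolitan")
print(f"European:     {eur_time:.1f}x time, {eur_ram:.1f}x RAM")
print(f"Cosmopolitan: {cos_time:.1f}x time, {cos_ram:.1f}x RAM")
```

On these numbers the gap widens as the reference panel grows from 1000 to 4800 haplotypes, since IMPUTE2 at fixed k does not pay the full cost of the larger panel.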