Hou-Feng Zheng, Jing-Jing Rong, Ming Liu, Fang Han, Xing-Wei Zhang, J Brent Richards, Li Wang.
Abstract
Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most imputations have been run using HapMap samples as the reference, and the imputation of low-frequency and rare variants (minor allele frequency (MAF) < 5%) has not been systematically assessed. With the emergence of next-generation sequencing, large reference panels (such as the 1000 Genomes panel) are available to facilitate imputation of these variants. To estimate the performance of low-frequency and rare variant imputation, we imputed 153 individuals, each of whom had data from 3 different genotyping arrays (317k, 610k and 1 million SNPs), to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1), using IMPUTE version 2. These three releases of the 1000 Genomes data differ in sample size, ancestry diversity, number of variants and frequency spectrum. We found that both the reference panel and the GWAS chip density affect the imputation of low-frequency and rare variants. 1KGphase1 outperformed the other 2 panels, with a higher concordance rate, a higher proportion of well-imputed variants (info>0.4) and a higher mean info score in each MAF bin. Similarly, the 1M chip array outperformed the 610K and 317K arrays. However, for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low-frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel and dense genotyping, very rare variants remain difficult to impute.
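The per-bin summaries described in the abstract (the proportion of well-imputed variants and the mean info score in each MAF bin) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the input layout of `(maf, info)` pairs, and most of the bin edges are assumptions; only the 0.3% and 5% cut-offs and the info > 0.4 threshold come from the abstract.

```python
from statistics import mean

# Hypothetical MAF bin upper bounds (as fractions). Only 0.003 and 0.05 are
# quoted in the abstract; the remaining edges are illustrative assumptions.
BIN_EDGES = [0.003, 0.01, 0.05, 0.5]
INFO_THRESHOLD = 0.4  # "well-imputed" cut-off quoted in the abstract


def maf_bin(maf):
    """Return the index of the first bin whose upper bound is >= maf."""
    for i, upper in enumerate(BIN_EDGES):
        if maf <= upper:
            return i
    raise ValueError(f"MAF {maf} is above 0.5")


def summarise(variants):
    """Summarise (maf, info) pairs per MAF bin.

    Returns {bin_index: (n_variants, proportion_well_imputed, mean_info)}.
    """
    by_bin = {}
    for maf, info in variants:
        by_bin.setdefault(maf_bin(maf), []).append(info)
    return {
        b: (
            len(infos),
            sum(i > INFO_THRESHOLD for i in infos) / len(infos),
            mean(infos),
        )
        for b, infos in sorted(by_bin.items())
    }


# Made-up example values, not data from the study.
example = [(0.002, 0.10), (0.002, 0.50), (0.02, 0.70), (0.02, 0.30), (0.30, 0.95)]
summary = summarise(example)
```

Running the same summary once per array and reference panel would give the nine scenarios that the study compares.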
Year: 2015 PMID: 25621886 PMCID: PMC4306552 DOI: 10.1371/journal.pone.0116487
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The proportion of variants by Minor Allele Frequency (MAF) across imputation reference panels.
Overview of the imputation performances for the 3 genome-wide genotype arrays based on different reference panels.
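| Array | Genotyped SNPs* | 1KGpilot: imputed SNPs | 1KGpilot: well-imputed SNPs** | 1KGpilot: % well-imputed | 1KGinterim: imputed SNPs | 1KGinterim: well-imputed SNPs** | 1KGinterim: % well-imputed | 1KGphase1: imputed SNPs | 1KGphase1: well-imputed SNPs** | 1KGphase1: % well-imputed |
|---|---|---|---|---|---|---|---|---|---|---|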
| 317k | 281,641 | 8,508,091 | 7,123,480 | 84% | 11,577,780 | 7,526,749 | 65% | 37,427,201 | 10,642,325 | 28% |
| 610k | 488,822 | 8,510,853 | 7,412,689 | 87% | 11,581,767 | 7,767,264 | 67% | 37,427,643 | 10,649,233 | 28% |
| 1M | 841,995 | 8,522,561 | 7,610,312 | 89% | 11,591,081 | 7,939,987 | 69% | 37,429,304 | 10,743,754 | 29% |
* SNP QC was done
** Well-imputed SNPs were those with proper info ≥ 0.4
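The percentage columns in the table above can be checked directly from the counts. The sketch below assumes each row lists, after the array name and the genotyped SNP count, three groups of (imputed SNPs, well-imputed SNPs, percentage), and that the groups follow the panel order given in the abstract (1KGpilot, 1KGinterim, 1KGphase1). The panel labels and the helper name `percent_well_imputed` are assumptions made for illustration.

```python
# Counts copied from the table rows as (imputed SNPs, well-imputed SNPs).
# Panel labels are an assumption based on the order used in the abstract.
TABLE = {
    "317k": {
        "1KGpilot": (8_508_091, 7_123_480),
        "1KGinterim": (11_577_780, 7_526_749),
        "1KGphase1": (37_427_201, 10_642_325),
    },
    "610k": {
        "1KGpilot": (8_510_853, 7_412_689),
        "1KGinterim": (11_581_767, 7_767_264),
        "1KGphase1": (37_427_643, 10_649_233),
    },
    "1M": {
        "1KGpilot": (8_522_561, 7_610_312),
        "1KGinterim": (11_591_081, 7_939_987),
        "1KGphase1": (37_429_304, 10_743_754),
    },
}


def percent_well_imputed(total, well):
    """Share of imputed SNPs that pass the info threshold, in percent."""
    return 100.0 * well / total


# Round to whole percentages, matching the precision shown in the table.
percentages = {
    chip: {
        panel: round(percent_well_imputed(total, well))
        for panel, (total, well) in panels.items()
    }
    for chip, panels in TABLE.items()
}

for chip, panels in percentages.items():
    print(chip, panels)
```

Note that the overall percentage falls as the panel grows because the denominator (all imputed SNPs) grows faster than the number of well-imputed SNPs, while the absolute count of well-imputed SNPs still increases.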
Concordance of the 9 imputation scenarios.
| Array | Reference panel | Concordance (%) | | | | | |
|---|---|---|---|---|---|---|---|
| 317K | 1KGpilot | 93.83 | 0 | 0 | 0.72(0.348) | 0.85(0.274) | 0.93(0.187) |
| 317K | 1KGinterim | 93.81 | 0 | 0 | 0.74(0.349) | 0.86(0.281) | 0.93(0.192) |
| 317K | 1KGphase1 | 94.70 | 0 | 0 | 0.78(0.349) | 0.88(0.268) | 0.94(0.184) |
| 610K | 1KGpilot | 96.11 | 0 | 0 | 0.77(0.354) | 0.91(0.251) | 0.97(0.155) |
| 610K | 1KGinterim | 96.22 | 0 | 0 | 0.81(0.341) | 0.93(0.249) | 0.97(0.158) |
| 610K | 1KGphase1 | 96.99 | 0 | 0 | 0.87(0.311) | 0.95(0.224) | 0.98(0.145) |
| 1M | 1KGpilot | 97.05 | 0 | 0.0015 | 0.84(0.355) | 0.96(0.251) | 0.98(0.141) |
| 1M | 1KGinterim | 97.30 | 0 | 0 | 0.88(0.358) | 0.97(0.255) | 0.98(0.138) |
| 1M | 1KGphase1 | 97.98 | 0 | 0 | 0.92(0.354) | 0.99(0.226) | 0.99(0.122) |
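
A concordance rate of the kind reported above can be computed by comparing best-guess imputed genotypes with known genotypes at the same sites. The sketch below is an illustration under stated assumptions, not the study's exact procedure: genotypes are coded as allele counts (0, 1, 2), imputed calls are taken as the most probable of three genotype probabilities, and the function names `best_guess` and `concordance_rate` are hypothetical.

```python
def best_guess(probs):
    """Most likely genotype (0, 1 or 2) from a (P(0), P(1), P(2)) triple."""
    return max(range(3), key=lambda g: probs[g])


def concordance_rate(true_genotypes, imputed_genotypes, missing=None):
    """Percentage of non-missing genotype pairs that agree."""
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must have the same length")
    compared = 0
    matches = 0
    for truth, imputed in zip(true_genotypes, imputed_genotypes):
        if truth is missing or imputed is missing:
            continue  # skip pairs where either call is unavailable
        compared += 1
        matches += truth == imputed
    if compared == 0:
        raise ValueError("no non-missing genotype pairs to compare")
    return 100.0 * matches / compared


# Made-up example values, not data from the study.
truth = [0, 1, 2, 1, None]
probabilities = [
    (0.90, 0.08, 0.02),  # best guess 0, matches
    (0.10, 0.80, 0.10),  # best guess 1, matches
    (0.05, 0.55, 0.40),  # best guess 1, truth is 2, mismatch
    (0.20, 0.70, 0.10),  # best guess 1, matches
    (0.30, 0.40, 0.30),  # truth missing, ignored
]
imputed = [best_guess(p) for p in probabilities]
rate = concordance_rate(truth, imputed)
```

In this toy example three of the four comparable calls agree, so `rate` is 75.0.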
The MAF distribution of the genotyped variants.
| Array | Genotyped SNPs* | | | | | |
|---|---|---|---|---|---|---|
| 317k | 281,641 | 0 | 0 | 33 | 7,982 | 273,626 |
| 610k | 488,822 | 0 | 0 | 168 | 20,508 | 468,146 |
| 1M | 841,995 | 0 | 74 | 1,203 | 54,367 | 786,351 |
* SNP QC was done
Figure 2. The proportion of well-imputed SNPs (info>0.4) in different MAF bins across imputation reference panels (Panel A is for the 317K genotyping array, Panel B is for the 610K genotyping array, and Panel C is for the 1M genotyping array).
Panels D, E and F compare the median info score across the 3 reference panels for the 317K, 610K and 1M genotyping arrays, respectively.