Genevieve L Wojcik, Christian Fuchsberger, Daniel Taliun, Ryan Welch, Alicia R Martin, Suyash Shringarpure, Christopher S Carlson, Goncalo Abecasis, Hyun Min Kang, Michael Boehnke, Carlos D Bustamante, Christopher R Gignoux, Eimear E Kenny.
Abstract
The emergence of very large cohorts in genomic research has facilitated a focus on genotype-imputation strategies to power rare variant association. These strategies have benefited from improvements in imputation methods and association tests; however, little attention has been paid to ways in which array design can increase rare variant association power. Therefore, we developed a novel framework to select tag SNPs using the reference panel of 26 populations from Phase 3 of the 1000 Genomes Project. We evaluate tag SNP performance via mean imputed r² at untyped sites using leave-one-out internal validation and standard imputation methods, rather than pairwise linkage disequilibrium. Moving beyond pairwise metrics allows us to account for haplotype diversity across the genome to improve imputation accuracy, and it demonstrates population-specific biases from pairwise estimates. We also examine array design strategies that contrast multi-ethnic cohorts vs. single populations, and show that a boost in performance for the former can be obtained by prioritizing tag SNPs that contribute information across multiple populations simultaneously. Using our framework, we demonstrate increased imputation accuracy for rare variants (frequency < 1%) by 0.5-3.1% for an array of one million sites and 0.7-7.1% for an array of 500,000 sites, depending on the population. Finally, we show how recent explosive growth in non-African populations means tag SNPs capture on average 30% fewer other variants than in African populations. The unified framework presented here will enable investigators to make informed decisions for the design of new arrays, and help empower the next phase of rare variant association for global health.
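The two metrics contrasted in the abstract can both be written as a squared Pearson correlation, applied to different inputs. The sketch below is only an illustration under stated assumptions: the function name `squared_pearson` and all toy vectors are hypothetical, and it does not reproduce the leave-one-out pipeline or the imputation software used in the paper.

```python
from statistics import fmean


def squared_pearson(x, y):
    """Squared Pearson correlation of two equal-length numeric vectors.

    Returns 0.0 when either vector is constant (correlation undefined).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    mean_x, mean_y = fmean(x), fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


# Pairwise LD r²: correlation between allele indicators (0/1) at a tag SNP
# and a target site across phased haplotypes (toy, hypothetical data).
tag_haplotypes = [0, 0, 1, 1, 0, 1, 0, 0]
target_haplotypes = [0, 0, 1, 1, 0, 0, 0, 0]
pairwise_r2 = squared_pearson(tag_haplotypes, target_haplotypes)

# Imputation r²: correlation between true genotypes (0/1/2) and imputed
# dosages at one untyped site across individuals (toy, hypothetical data).
true_genotypes = [0, 1, 2, 0, 1, 0]
imputed_dosages = [0.1, 0.9, 1.8, 0.0, 1.2, 0.2]
imputation_r2 = squared_pearson(true_genotypes, imputed_dosages)

print(f"pairwise LD r2 = {pairwise_r2:.3f}")
print(f"imputation r2  = {imputation_r2:.3f}")
```

Averaging the imputation r² over all untyped sites gives a mean imputed r² of the kind the abstract describes, whereas the pairwise value depends on a single tag and target pair.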
Keywords: Genomics; Imputation; Statistical Genetics; array design; tag SNPs
Year: 2018 PMID: 30131328 PMCID: PMC6169386 DOI: 10.1534/g3.118.200502
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Imputation Accuracy by super population of tags selected in European populations for a scaffold assuming 500,000 genome-wide variants. Tags were required to have a MAF ≥ 1% and r² ≥ 0.5 with target sites. This trend is observed across all super populations (S1 Fig).
Figure 2. Proportion of tags that are informative by population with the three methods. (Left, lightest) tags selected from only a single population, (Center) tags selected by pooling all populations agnostically, and (Right) tags selected with the cross-population prioritization approach. Tag SNPs were informative if they were in linkage disequilibrium (r² > 0.5) with at least one untagged site.
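The definition of an informative tag in this caption can be illustrated with a short sketch. It assumes pairwise r² values have already been computed and stored in a dictionary keyed by (tag, site); the function name, the identifiers, and the values are hypothetical and are not taken from the paper.

```python
def informative_fraction(tags, r2_by_pair, untagged_sites, threshold=0.5):
    """Fraction of tags with r² above `threshold` for at least one untagged site.

    Pairs missing from `r2_by_pair` are treated as r² = 0.
    """
    if not tags:
        return 0.0
    informative = [
        tag
        for tag in tags
        if any(r2_by_pair.get((tag, site), 0.0) > threshold for site in untagged_sites)
    ]
    return len(informative) / len(tags)


# Hypothetical pairwise r² values between three tags and two untagged sites.
r2_by_pair = {
    ("tagA", "s1"): 0.82,
    ("tagA", "s2"): 0.31,
    ("tagB", "s1"): 0.12,
    ("tagB", "s2"): 0.44,
    ("tagC", "s2"): 0.67,
}
tags = ["tagA", "tagB", "tagC"]
untagged_sites = ["s1", "s2"]

fraction = informative_fraction(tags, r2_by_pair, untagged_sites)
print(f"{fraction:.2f} of tags are informative")
```

Here tagA and tagC each exceed the threshold for at least one site, while tagB does not, so two of the three tags count as informative.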
Figure 3. Increased imputation accuracy with cross-population prioritization (solid line) vs. naïve approach (dashed line) for a minimum pairwise correlation threshold of r² > 0.5 and MAF > 1% across different scaffold sizes. Imputation accuracy was calculated separately within minor allele frequency bins for each super population.
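Computing accuracy separately within minor allele frequency bins could look like the sketch below. The bin edges, the function name, and the per-site values are assumptions chosen for illustration; the source does not state which bins were used.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import fmean

# Assumed bin edges (not from the source). Bin 0: MAF < 0.005,
# bin 1: [0.005, 0.01), bin 2: [0.01, 0.05), bin 3: MAF >= 0.05.
MAF_BIN_EDGES = [0.005, 0.01, 0.05]


def mean_accuracy_by_maf_bin(sites, edges=MAF_BIN_EDGES):
    """Mean imputation r² per MAF bin.

    `sites` is an iterable of (maf, imputation_r2) pairs; the result maps
    a bin index to the mean r² of the sites that fall in that bin.
    """
    grouped = defaultdict(list)
    for maf, r2 in sites:
        grouped[bisect_right(edges, maf)].append(r2)
    return {bin_index: fmean(values) for bin_index, values in sorted(grouped.items())}


# Hypothetical per-site (MAF, imputation r²) values for one super population.
example_sites = [(0.002, 0.2), (0.004, 0.4), (0.02, 0.8), (0.3, 0.9), (0.3, 1.0)]
accuracy_by_bin = mean_accuracy_by_maf_bin(example_sites)
for bin_index, mean_r2 in accuracy_by_bin.items():
    print(f"bin {bin_index}: mean r2 = {mean_r2:.3f}")
```

Running the same function once per super population and per tag-selection approach would give the kind of per-bin comparison the figure describes.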
Figure 4. Influence of (A) minimum r² threshold and (B) lower MAF threshold on imputation accuracy and coverage (r² > 0.5 and r² > 0.8) within populations from the Americas with an allocation of 1M sites.
Figure 5. Tag SNP informativeness across populations. (A) Proportion of sites informative (r² > 0.5, MAF > 0.01, 1M site scaffold) across a number of populations, with lines corresponding to the index population. For example, for sites that are informative (r² > 0.5 with any untyped SNP in the genome) in five out of the six populations, only slightly more than half are informative in East Asian populations, while more than 90% are informative in African populations. (B) Proportion of sites shared across populations, conditional on index population. For example, for sites informative in African populations, less than half are informative in East Asian, European, and South Asian populations.
Performance per tag SNP to capture all variation possible with r² > 0.8 on chromosome 9, as well as within a one million site genome-wide scaffold allocation through cross-population prioritization.
| Population | All Possible Tags | | One Million Tag Scaffold | |
|---|---|---|---|---|
| AAC | 74,255 | 8.04 | 36,336 | 12.97 |
| AFR | 81,416 | 7.17 | 34,548 | 12.16 |
| AMR | 43,065 | 9.40 | 28,691 | 12.80 |
| EAS | 28,473 | 10.27 | 16,457 | 16.16 |
| EUR | 35,027 | 9.48 | 22,111 | 13.63 |
| SAS | 37,644 | 9.28 | 23,480 | 13.33 |
Lone sites by super population and their imputation accuracy for a one million site scaffold
| Population | Number of Individuals | Number of Lone Sites | Imputation Accuracy Quality | | | Number Unrecoverable with r2acc ≥ 0.2 (%) |
|---|---|---|---|---|---|---|
| AAC | 156 | 7,509 | 90.79% | 80.72% | 51.72% | 691 (9.2%) |
| AFR | 495 | 4,497 | 63.29% | 38.73% | 7.03% | 1,651 (36.7%) |
| AMR | 341 | 2,701 | 48.98% | 25.88% | 3.78% | 1,378 (51.02%) |
| EAS | 503 | 4,947 | 44.37% | 12.41% | 2.14% | 2,752 (55.63%) |
| EUR | 501 | 3,881 | 51.07% | 23.22% | 3.74% | 1,899 (48.93%) |
| SAS | 477 | 4,293 | 51.01% | 18.77% | 2.26% | 2,103 (48.99%) |
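The percentages in the last column of this table can be checked against the counts with a few lines of arithmetic. The sketch below uses only numbers copied from the table; the variable names are illustrative. It also shows that the first accuracy column and the unrecoverable percentage sum to roughly 100% in every row, which suggests (as an inference, not something the source states) that the first column is the share of lone sites recovered at the 0.2 threshold.

```python
# (number of lone sites, number unrecoverable), copied from the table above.
lone_site_counts = {
    "AAC": (7509, 691),
    "AFR": (4497, 1651),
    "AMR": (2701, 1378),
    "EAS": (4947, 2752),
    "EUR": (3881, 1899),
    "SAS": (4293, 2103),
}

# First "Imputation Accuracy Quality" column of the same table, in percent.
first_accuracy_column = {
    "AAC": 90.79,
    "AFR": 63.29,
    "AMR": 48.98,
    "EAS": 44.37,
    "EUR": 51.07,
    "SAS": 51.01,
}

# Percentage of lone sites that are unrecoverable, recomputed from the counts.
unrecoverable_pct = {
    pop: 100 * unrecoverable / total
    for pop, (total, unrecoverable) in lone_site_counts.items()
}

for pop, pct in unrecoverable_pct.items():
    combined = pct + first_accuracy_column[pop]
    print(f"{pop}: unrecoverable {pct:.2f}%, sum with first accuracy column {combined:.2f}%")
```

The recomputed values agree with the reported percentages to the precision shown in the table.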
Coverage of 1 million and 500,000 tag SNP set by super population for all polymorphic sites on chromosome 9 with MAF > 0.5%
| Super population | Total Number of Polymorphic Sites | Coverage (1,000,000 tags) | Imputation Accuracy (1,000,000 tags) | Coverage (500,000 tags) | Imputation Accuracy (500,000 tags) |
|---|---|---|---|---|---|
| | 780,896 | 63.64% | 30.27% | 34.03% | 14.07% |
| | 777,207 | 59.15% | 28.05% | 33.17% | 14.10% |
| | 503,804 | 79.90% | 53.60% | 61.00% | 37.02% |
| | 367,189 | 76.95% | 63.08% | 73.16% | 55.09% |
| | 414,184 | 78.77% | 62.65% | 72.87% | 52.86% |
| | 455,573 | 74.84% | 56.97% | 67.28% | 45.91% |
Figure 6. Coverage (dashed lines) vs. Imputation Accuracy (solid lines), assuming a genome-wide scaffold size of one million tags. Coverage is shown with an r² > 0.8. While pairwise tagging values are low, particularly in African-descent populations, multi-marker imputation accuracy remains high across groups.
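The distinction between the two curves can be made concrete with a small sketch. It assumes that coverage is the fraction of sites whose best pairwise r² with any tag exceeds the threshold, and that imputation accuracy is the mean per-site imputed r²; the function names and all values are hypothetical and chosen only to show how the two summaries can diverge.

```python
from statistics import fmean


def coverage(best_pairwise_r2, threshold=0.8):
    """Fraction of sites whose best pairwise r² with any tag exceeds `threshold`."""
    if not best_pairwise_r2:
        return 0.0
    return sum(r2 > threshold for r2 in best_pairwise_r2) / len(best_pairwise_r2)


def mean_imputation_accuracy(imputed_r2):
    """Mean per-site imputation r²."""
    return fmean(imputed_r2) if imputed_r2 else 0.0


# Hypothetical values for four untyped sites: only one site has a single tag
# with r² > 0.8, yet multi-marker imputation recovers all four sites well.
best_pairwise_r2 = [0.9, 0.6, 0.7, 0.5]
imputed_r2 = [0.95, 0.85, 0.9, 0.8]

site_coverage = coverage(best_pairwise_r2)
site_accuracy = mean_imputation_accuracy(imputed_r2)
print(f"coverage at r2 > 0.8: {site_coverage:.2f}")
print(f"mean imputation r2:   {site_accuracy:.3f}")
```

A thresholded pairwise summary discards sites that fall just below the cutoff, whereas the mean imputed r² credits partial information contributed by several markers at once.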