Corbin Quick1, Pramod Anugu2, Solomon Musani2, Scott T Weiss3,4,5, Esteban G Burchard6,7, Marquitta J White6, Kevin L Keys6, Francesco Cucca8,9, Carlo Sidore8, Michael Boehnke1, Christian Fuchsberger1,10,11.
Abstract
A key aim for current genome-wide association studies (GWAS) is to interrogate the full spectrum of genetic variation underlying human traits, including rare variants, across populations. Deep whole-genome sequencing is the gold standard to fully capture genetic variation, but remains prohibitively expensive for large sample sizes. Array genotyping interrogates a sparser set of variants, which can be used as a scaffold for genotype imputation to capture a wider set of variants. However, imputation quality depends crucially on reference panel size and genetic distance from the target population. Here, we consider sequencing a subset of GWAS participants and imputing the rest using a reference panel that includes both sequenced GWAS participants and an external reference panel. We investigate how imputation quality and GWAS power are affected by the number of participants sequenced for admixed populations (African and Latino Americans) and European population isolates (Sardinians and Finns), and identify powerful, cost-effective GWAS designs given current sequencing and array costs. For populations that are well-represented in existing reference panels, we find that array genotyping alone is cost-effective and well-powered to detect common- and rare-variant associations. For poorly represented populations, sequencing a subset of participants is often most cost-effective, and can substantially increase imputation quality and GWAS power.
Keywords: GWAS; WGS; genotype imputation; genotyping; rare variants; sequencing; study design
Year: 2020 PMID: 32519380 PMCID: PMC7449570 DOI: 10.1002/gepi.22326
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Table 1. Genotyping arrays used for comparisons
| Array | No. marker variants | List cost per sample (Illumina) |
|---|---|---|
| Illumina Infinium Core | 307 K | $49 |
| Illumina Infinium OmniExpress | 710 K | $94 |
| Illumina Infinium Omni2.5 | 2.5 M | $172 |
Figure 1. Imputation quality by population and genotyping array. Imputation coverage (upper panels) and mean imputation r² (lower panels) as functions of the number of population‐matched individuals included in augmented reference panels (number sequenced, x‐axis). Here and elsewhere, MAF is calculated separately within each population. MAF, minor allele frequency
Figure 2. Power and optimal design by population and genotyping array. Power to detect association for case–control studies with equal numbers of cases and controls as a function of sequenced subsample size (x‐axis) and imputed subsample size (y‐axis) for a variant with MAF 0.5% and RR 4 for a disease with prevalence 1%. Axes are scaled to reflect costs of genotyping arrays (Table 1) and sequencing ($1 K per sample). Dashed diagonal lines indicate study designs with the same total cost, given by y = a − bx, where a = (total cost)/(array cost per sample) and b = (sequencing cost per sample)/(array cost per sample). Circled points indicate optimal study designs, which attain the indicated power level at minimum total experimental cost (or, equivalently, maximize power at the indicated total experimental cost), shown only for optimal designs with total genotyping cost ≤ $2 M ($1.5 M for Latino Americans). MAF, minor allele frequency; RR, relative risk
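The equal-cost lines in Figure 2 can be illustrated with a small budget calculation. The sketch below uses the list prices from Table 1 and the $1 K sequencing cost stated in the caption. It assumes that sequenced participants are not also array-genotyped, so total cost is (sequencing cost × x) + (array cost × y); the function names and the short array keys are hypothetical and chosen only for illustration.

```python
# List costs per sample (USD) from Table 1; sequencing cost from the Figure 2 caption.
ARRAY_COST = {"Core": 49, "OmniExpress": 94, "Omni2.5": 172}
SEQ_COST = 1000


def iso_cost_line(total_cost, array):
    """Return (a, b) such that y = a - b*x traces designs with the given total cost.

    Assumption: total_cost = SEQ_COST * x + array_cost * y, i.e. sequenced
    participants (x) are not additionally array-genotyped.
    """
    c_array = ARRAY_COST[array]
    a = total_cost / c_array   # intercept: everyone array-genotyped, no one sequenced
    b = SEQ_COST / c_array     # imputed participants given up per participant sequenced
    return a, b


def max_imputed(total_cost, array, n_sequenced):
    """Largest whole number of array-genotyped participants affordable after sequencing."""
    remaining = total_cost - SEQ_COST * n_sequenced
    if remaining < 0:
        return 0
    return remaining // ARRAY_COST[array]


if __name__ == "__main__":
    for array in ARRAY_COST:
        a, b = iso_cost_line(2_000_000, array)
        print(f"{array}: y = {a:.0f} - {b:.1f} x")
```

Under these assumptions, each sequenced participant displaces roughly 20 Infinium Core samples but only about 6 Omni2.5 samples, which is why the slope of the dashed lines differs between arrays.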
Figure 3. Power as a function of MAF and effect size. Statistical power (y‐axis) to detect a rare large‐effect variant (MAF = 0.25%, RR = 3; top row) and common modest‐effect variant (MAF = 5%, RR = 1.3; bottom row) for a disease with prevalence 1% as a function of the number of participants array‐genotyped and imputed (x‐axis) when 0, 500, or 2,000 participants are sequenced and included in an augmented reference panel. The number of participants sequenced has a far greater impact on statistical power for the rare variant association. Importantly, statistical power is bounded above by the probability that the variant is imputable (r² > 0.3 and sufficient reference MAC), causing power to asymptote below 1 as a function of the number of imputed participants (e.g., upper‐left panel). MAC, minor allele count; MAF, minor allele frequency; RR, relative risk
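The ceiling on power described in the Figure 3 caption can be sketched numerically. The code below is a simplified illustration, not the authors' calculation: it assumes a 1-df chi-square association test, a genome-wide significance threshold of 5e-8 (a conventional value that is not stated in this record), and the common approximation that imputation scales the noncentrality parameter by r². The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_1df_power(ncp, alpha=5e-8):
    """Power of a 1-df chi-square test with noncentrality parameter `ncp`.

    Uses the identity chi2_1(ncp) = (Z + sqrt(ncp))^2 with Z standard normal.
    alpha=5e-8 is an assumed genome-wide threshold.
    """
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2)  # two-sided critical value
    shift = sqrt(ncp)
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - shift)
    lower_tail = _STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def imputed_power(ncp_perfect, r2, p_imputable, alpha=5e-8):
    """Power for an imputed variant under two simplifying assumptions.

    1. Imputation attenuates the noncentrality parameter by the factor r2.
    2. The variant is only testable with probability p_imputable, which
       caps power no matter how many participants are imputed.
    """
    return p_imputable * chi2_1df_power(r2 * ncp_perfect, alpha)


if __name__ == "__main__":
    # The noncentrality parameter grows with sample size, yet power plateaus at p_imputable.
    for ncp in (20, 40, 80, 160, 1000):
        print(ncp, round(imputed_power(ncp, r2=0.6, p_imputable=0.7), 4))
```

In this sketch, increasing the number of imputed participants raises the noncentrality parameter but cannot lift power above `p_imputable`; only a larger or better-matched reference panel changes that ceiling.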
Figure 4. Optimal design as a function of minor allele frequency and effect size. Percentage of participants sequenced (x‐axis) and total sample size (y‐axis) under optimal designs to attain statistical power 80% for rare and common variants across two effect size values for each of the four study populations using the Infinium Core array. Here, effect size refers to the χ² NCP for single‐variant association tests given perfect genotype accuracy, which is defined as η² in Section 2. RR values corresponding to each combination of MAF and NCP are indicated in the far‐right panel (for Sardinians). With NCP held constant, differences in optimal design for different MAF values are solely due to differences in imputation coverage and quality across the MAF spectrum. MAF, minor allele frequency; NCP, noncentrality parameter; RR, relative risk
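To give a sense of how MAF, RR, and NCP relate in the Figure 4 caption, the following sketch computes an approximate noncentrality parameter for a case–control allelic test. It is a textbook-style approximation and may differ from the definition of η² in Section 2, which is not reproduced here. It assumes Hardy-Weinberg equilibrium, a multiplicative per-allele relative risk, and perfectly known genotypes; all function names are hypothetical.

```python
def case_control_allele_freqs(maf, rr, prevalence):
    """Risk-allele frequencies in cases and controls.

    Assumes Hardy-Weinberg equilibrium and a multiplicative per-allele
    relative risk `rr`, under which cases are again in HWE with a shifted
    allele frequency.
    """
    p_case = maf * rr / (1 - maf + maf * rr)
    # Population frequency is a prevalence-weighted mix of cases and controls.
    p_ctrl = (maf - prevalence * p_case) / (1 - prevalence)
    return p_case, p_ctrl


def allelic_ncp(maf, rr, prevalence, n_cases, n_controls):
    """Approximate 1-df chi-square NCP for an allelic test with perfect genotypes."""
    p_case, p_ctrl = case_control_allele_freqs(maf, rr, prevalence)
    p_bar = (n_cases * p_case + n_controls * p_ctrl) / (n_cases + n_controls)
    # Variance of the difference in allele frequencies (two alleles per person).
    variance = p_bar * (1 - p_bar) * (1 / (2 * n_cases) + 1 / (2 * n_controls))
    return (p_case - p_ctrl) ** 2 / variance


if __name__ == "__main__":
    # The two scenarios from the Figure 3 caption, 5,000 cases and 5,000 controls.
    print(allelic_ncp(0.0025, 3.0, 0.01, 5000, 5000))
    print(allelic_ncp(0.05, 1.3, 0.01, 5000, 5000))
```

Holding this NCP fixed while varying MAF, as the caption describes, means a rarer variant must carry a larger RR; any remaining difference in optimal design then comes from how well that variant can be imputed.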