Laura L Faye, Mitchell J Machiela, Peter Kraft, Shelley B Bull, Lei Sun.
Abstract
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
Year: 2013 PMID: 23950724 PMCID: PMC3738448 DOI: 10.1371/journal.pgen.1003609
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Notation.
| Sequenced or imputed SNPs indexed | |
| Causal SNP | C |
| GWAS tag SNP | G |
| Test statistic at sequenced SNP | |
| Observed value of the test statistic at the tag SNP | |
| Re-ranking statistic at sequenced SNP | |
| Correlation between: | |
| Actual genotypes of causal and tag, causal and sequenced SNP | |
| Estimated genotypes for the tag and sequenced SNP | |
| Actual genotype of the causal SNP and estimated genotype at sequenced SNP | |
| Call rate (1-missing data rate) at sequenced SNP | |
| Joint call rate for tag SNP and sequenced SNP | |
| Correlation between actual and estimated genotypes at: sequenced SNP | |
| Estimated correlation (sample correlation) of the above | |
| Tag selection bias (E[T_G | threshold selection and ranking] − E[T_G]), Bootstrap estimate of the bias | |
| Genetic effect at the causal SNP, estimate | |
| Standard deviation of the estimate at the causal SNP, estimate | |
| Sample size | |
| Expected value of the test statistic at the causal SNP re-scaled for sample size | |
| Standard normal critical value at significance level | |
| Standard normal cumulative distribution and density functions | |
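The tag selection bias listed in the notation, E[T_G | selection] − E[T_G], has a closed form in the simplest case. The sketch below is an illustration only, not the bootstrap estimator used in the paper: it assumes the tag statistic is normal with unit variance, that selection is a one-sided threshold T_G > c, and it ignores the ranking step. The function names `critical_value` and `selection_bias` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides pdf, cdf, inv_cdf


def critical_value(alpha, two_sided=True):
    """Standard normal critical value at significance level alpha."""
    tail = alpha / 2 if two_sided else alpha
    return STD_NORMAL.inv_cdf(1 - tail)


def selection_bias(mu, c):
    """E[T | T > c] - mu for T ~ N(mu, 1) (assumed model).

    This is the inverse Mills ratio phi(c - mu) / (1 - Phi(c - mu)).
    The upper tail is computed as Phi(mu - c) for numerical stability.
    """
    z = c - mu
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(-z)


if __name__ == "__main__":
    c = critical_value(5e-7)  # threshold used for Scenarios 1 and 2
    for mu in (3.0, 5.0, 7.0):  # hypothetical expected tag statistics
        print(f"mu={mu:.1f}  c={c:.2f}  bias={selection_bias(mu, c):.3f}")
```

Under these assumptions the bias shrinks as the expected statistic moves above the threshold, so weakly powered tag SNPs carry the largest upward bias.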
Figure 1. Tagging effect decreases localization success rates with or without the selection effect.
The expected values of the association test statistics at a tag SNP (red) and the causal SNP (black), shading from 25th–75th percentiles (A, C), and the localization success rates (B, D) for association studies (1000 cases and 1000 controls) of one causal SNP (MAF = 0.12; OR = 1.25; perfect genotyping accuracy) and one tag SNP (MAF = 0.12; in varying degree of correlation with the causal SNP, r = 0.2 to 1; perfect genotyping accuracy) with no selection for significance at the tag SNP (A, B) or selection at the tag SNP requiring the test statistic T to be significant with p-value<0.05 (C, D).
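The tagging effect in this caption can be illustrated with a small Monte Carlo sketch. It is not the simulation design of the paper: it assumes the causal and tag statistics are bivariate normal with unit variances and correlation r, with expected values `mu_causal` and `r * mu_causal`, and no selection at the tag SNP. The value of `mu_causal` and the function name are hypothetical.

```python
import math
import random


def simulate_success_rate(mu_causal, r, n_rep=20000, seed=1):
    """Estimate P(|T_C| > |T_G|) under an assumed bivariate normal model.

    T_C ~ N(mu_causal, 1), T_G ~ N(r * mu_causal, 1), corr(T_C, T_G) = r.
    """
    rng = random.Random(seed)
    resid_sd = math.sqrt(1.0 - r * r)
    wins = 0
    for _ in range(n_rep):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        t_causal = mu_causal + z1
        # Tag statistic shares a fraction r of the causal SNP's signal and noise.
        t_tag = r * mu_causal + r * z1 + resid_sd * z2
        if abs(t_causal) > abs(t_tag):
            wins += 1
    return wins / n_rep


if __name__ == "__main__":
    for r in (0.2, 0.5, 0.8, 0.95):
        rate = simulate_success_rate(mu_causal=3.0, r=r)
        print(f"r={r:.2f}  estimated success rate={rate:.3f}")
```

In this simplified model the causal SNP out-ranks the tag less often as r approaches 1, because the two statistics become nearly indistinguishable.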
Figure 3. Well-tagged causal SNPs sequenced with low accuracy are unlikely to be correctly identified even as sample size increases.
Localization success rates for association studies (sample size from 50∶50 cases∶controls to 5000∶5000 cases∶controls, X-axis) of one causal SNP (MAF = 0.12; OR = 1.25; imperfect genotyping accuracy due to sequencing or imputation errors resulting in correlation between the actual and estimated genotypes ρ) and one tag SNP (MAF = 0.12; in high correlation with the causal SNP, r = 0.8 (purple solid) to 0.98 (red dashed); 100% genotyping accuracy with ρ = 1) with no selection for significance at the tag SNP.
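One way to see why larger samples do not help here is a normal approximation, sketched below under assumptions that are not taken from the paper: the statistic at the imperfectly genotyped causal SNP has mean ρ·δ·√n, the statistic at the perfectly genotyped tag has mean r·δ·√n, both have unit variance, and their correlation is ρ·r (genotyping error independent of the tag genotype). The symbol δ (expected causal statistic per √n) and the function name are hypothetical, and the comparison is one-sided.

```python
import math
from statistics import NormalDist


def approx_success_rate(n, delta, r, rho):
    """Approximate P(T_C > T_G) under the assumed model.

    T_C - T_G ~ N((rho - r) * delta * sqrt(n), 2 * (1 - rho * r)).
    """
    if rho * r >= 1.0:
        raise ValueError("statistics are identical when rho * r == 1")
    mean_diff = (rho - r) * delta * math.sqrt(n)
    sd_diff = math.sqrt(2.0 * (1.0 - rho * r))
    return NormalDist().cdf(mean_diff / sd_diff)


if __name__ == "__main__":
    delta = 0.05  # hypothetical effect size per sqrt(sample size)
    for n in (100, 1000, 10000):
        worse = approx_success_rate(n, delta, r=0.95, rho=0.90)
        better = approx_success_rate(n, delta, r=0.80, rho=1.00)
        print(f"n={n:>5}  rho<r: {worse:.3f}   rho>r: {better:.3f}")
```

When ρ < r the expected difference is negative and grows in magnitude with √n, so the approximate success rate falls toward zero as the sample grows; when ρ > r it rises toward one.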
Figure 2. Low genotyping accuracy further reduces localization success rates with or without the selection effect.
Localization success rates for association studies (1000 cases and 1000 controls) of one causal SNP (MAF = 0.12; OR = 1.25; imperfect genotyping accuracy due to sequencing or imputation errors resulting in correlation between the actual and estimated genotypes ρ = 0.80 (blue dash-dotted) to 1 (black solid)) and one tag SNP (MAF = 0.12; in varying degree of correlation with the causal SNP, r = 0.2 to 1 (X-axis); perfect genotyping accuracy with ρ = 1) with no selection for significance at the tag SNP (A) or selection at the tag SNP requiring the test statistic T to be significant with p-value<0.05 (B).
Parameters and parameter values of the main simulation studies.
| | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5 |
| | Effect of selection, tagging, genotype accuracy | Effect of selection, genotype accuracy | Effect of genotype accuracy | Effect of multiple causal SNPs | Effect of missing data |
| Sample sequenced | Same as the Discovery sample, conditional on significance at the GWAS tag SNP (p<5×10^−7) | Same as the Discovery sample, conditional on significance at any of the SNPs in the region (p<5×10^−7) | Independent sample | Independent sample | Independent sample |
| # of GWAS tag SNPs in the region | 1 | 1 | 1 | 1 | 1 |
| # of post-GWAS SNPs (# of causal SNPs) in the region | 10 (1) | 10 (1) | 10 (1) | 11 (2) | 10 (1) |
| OR of the causal SNP(s), β | 1.5 | 2 | 2 | 2, 2 | 2 |
| MAF of the tag and post-GWAS SNPs | 4.8% | 5% | 5% | 5% | 5% |
| Correlation between the tag and causal SNP(s), r | 0.78, 0.83, 0.85, 0.90, 0.93, 0.95, 1 | 0.95 | 0.95 | 0.80, 0.95 ( | 0.95 |
| Correlation between two adjacent non-causal post-GWAS SNPs | 0.975 | 0.975 | 0.975 | 0.975 | 0.975 |
| Correlation between the actual and called genotypes for sequenced/imputed SNPs, ρ | 0.82, 0.86, 0.90, 0.95, 0.97, 1 | Same as S1 | Same as S1 | Same as S1 | 1 |
| Call rates (1-missing data rate) | 100% | 100% | 100% | 100% | 80%, 90%, 95%, 98%, 99% or 100% |
| Sample size | 4901 (1963 cases and 2938 controls of the WTCCC T1D study) | 2500, 5000, 7500 or 10,000 (equal cases and controls) | Same as S2 | Same as S2 | Same as S2 |
| Simulation replicates for each configuration | 300 | 800 | 800 | 800 | 800 |
| Localization Success Rate | P(the causal SNP is top-ranked) | Same as S1 | Same as S1 | Defined for each of the 2 causal SNPs as P(the causal SNP ranks in top 2) | Same as S1 |
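The localization success rate in the last row is an empirical proportion over simulation replicates. A minimal sketch of that calculation follows. The function name, the use of absolute statistics for ranking, and the toy replicate values are assumptions made for illustration and are not data from the paper.

```python
def localization_success_rate(replicates, causal_index, top_k=1):
    """Fraction of replicates in which the causal SNP ranks in the top_k.

    replicates: list of per-replicate lists of test statistics, one per SNP.
    Ranking is by absolute value, largest first (assumed); ties are broken
    in favour of the lower SNP index because Python's sort is stable.
    """
    hits = 0
    for stats in replicates:
        order = sorted(range(len(stats)), key=lambda i: abs(stats[i]), reverse=True)
        if causal_index in order[:top_k]:
            hits += 1
    return hits / len(replicates)


if __name__ == "__main__":
    # Toy statistics for 3 SNPs over 4 replicates; SNP 0 plays the causal SNP.
    toy = [
        [3.1, 2.9, 1.0],
        [2.0, 2.5, 0.3],
        [4.2, -4.0, 0.1],
        [1.0, 1.5, 2.0],
    ]
    print(localization_success_rate(toy, causal_index=0))           # 0.5
    print(localization_success_rate(toy, causal_index=0, top_k=2))  # 0.75
```

Setting `top_k=2` mirrors the Scenario 4 definition, where each of the two causal SNPs counts as localized if it ranks in the top 2.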
Localization success rates for simulation Scenarios 1, 2, 3, 4.
| | | Average correlation between the actual and estimated genotypes of sequenced or imputed SNPs | | | | | | | | | | | |
| | | Low-coverage Sequencing | | | | | | | | | | High-coverage Sequencing | |
| Correlation between the tag and causal SNPs, r | Sample size | 0.82 | | 0.86 | | 0.90 | | 0.95 | | 0.97 | | 1.00 | |
| | | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked |
| Scenario 1 | | | | | | | | | | | | | |
| 0.78 | 4901 | 0.20 | 0.38 | 0.21 | 0.42 | 0.23 | 0.42 | 0.34 | 0.60 | 0.29 | 0.49 | 0.42 | 0.60 |
| 0.83 | 4901 | 0.12 | 0.35 | 0.16 | 0.41 | 0.20 | 0.47 | 0.26 | 0.52 | 0.34 | 0.54 | 0.48 | 0.64 |
| 0.85 | 4901 | 0.20 | 0.39 | 0.23 | 0.43 | 0.26 | 0.51 | 0.33 | 0.47 | 0.34 | 0.47 | 0.43 | 0.55 |
| 0.90 | 4901 | 0.09 | 0.32 | 0.15 | 0.43 | 0.18 | 0.45 | 0.31 | 0.56 | 0.34 | 0.43 | 0.40 | 0.57 |
| 0.93 | 4901 | 0.08 | 0.41 | 0.12 | 0.35 | 0.19 | 0.40 | 0.34 | 0.53 | 0.44 | 0.54 | 0.45 | 0.56 |
| 0.95 | 4901 | 0.11 | 0.32 | 0.09 | 0.31 | 0.21 | 0.42 | 0.23 | 0.33 | 0.32 | 0.42 | 0.42 | 0.54 |
| Tag is causal | 4901 | 0.93 | 0.17 | 0.89 | 0.26 | 0.89 | 0.25 | 0.80 | 0.36 | 0.72 | 0.39 | 0.55 | 0.29 |
| Scenario 2 | | | | | | | | | | | | | |
| 0.95 | 2500 | 0.12 | 0.28 | 0.12 | 0.29 | 0.17 | 0.34 | 0.26 | 0.38 | 0.33 | 0.35 | 0.50 | 0.50 |
| | 5000 | 0.13 | 0.36 | 0.18 | 0.41 | 0.23 | 0.43 | 0.32 | 0.50 | 0.45 | 0.53 | 0.58 | 0.58 |
| | 7500 | 0.13 | 0.46 | 0.16 | 0.50 | 0.22 | 0.52 | 0.39 | 0.58 | 0.51 | 0.64 | 0.72 | 0.72 |
| | 10000 | 0.12 | 0.48 | 0.16 | 0.53 | 0.23 | 0.55 | 0.37 | 0.67 | 0.54 | 0.68 | 0.74 | 0.74 |
| Scenario 3 | | | | | | | | | | | | | |
| 0.95 | 2500 | 0.12 | 0.24 | 0.14 | 0.28 | 0.18 | 0.29 | 0.28 | 0.35 | 0.33 | 0.36 | 0.46 | 0.46 |
| | 5000 | 0.13 | 0.37 | 0.18 | 0.35 | 0.21 | 0.44 | 0.33 | 0.49 | 0.43 | 0.52 | 0.56 | 0.56 |
| | 7500 | 0.14 | 0.47 | 0.17 | 0.46 | 0.23 | 0.53 | 0.37 | 0.56 | 0.50 | 0.59 | 0.64 | 0.64 |
| | 10000 | 0.12 | 0.52 | 0.19 | 0.56 | 0.22 | 0.57 | 0.42 | 0.63 | 0.55 | 0.68 | 0.78 | 0.78 |
| Scenario 4 | | | | | | | | | | | | | |
| 0.80 | 2500 | 0.07 | 0.14 | 0.07 | 0.16 | 0.10 | 0.17 | 0.13 | 0.19 | 0.18 | 0.20 | 0.23 | 0.23 |
| | 5000 | 0.05 | 0.14 | 0.07 | 0.16 | 0.08 | 0.17 | 0.13 | 0.17 | 0.16 | 0.20 | 0.21 | 0.21 |
| | 7500 | 0.06 | 0.18 | 0.06 | 0.17 | 0.10 | 0.20 | 0.14 | 0.21 | 0.18 | 0.24 | 0.22 | 0.22 |
| | 10000 | 0.04 | 0.15 | 0.05 | 0.16 | 0.08 | 0.18 | 0.10 | 0.21 | 0.16 | 0.23 | 0.23 | 0.23 |
| 0.95 | 2500 | 0.04 | 0.15 | 0.07 | 0.15 | 0.07 | 0.16 | 0.11 | 0.18 | 0.14 | 0.18 | 0.24 | 0.24 |
| | 5000 | 0.04 | 0.13 | 0.04 | 0.14 | 0.05 | 0.18 | 0.10 | 0.17 | 0.13 | 0.18 | 0.20 | 0.20 |
| | 7500 | 0.04 | 0.18 | 0.04 | 0.20 | 0.05 | 0.20 | 0.12 | 0.23 | 0.18 | 0.25 | 0.25 | 0.25 |
| | 10000 | 0.03 | 0.16 | 0.03 | 0.16 | 0.04 | 0.16 | 0.09 | 0.19 | 0.12 | 0.19 | 0.20 | 0.20 |
See Table 2 for details of the simulation models; scenario 4 has two causal loci.
1963 cases and 2938 controls for Scenario 1; equal numbers of cases and controls for Scenarios 2, 3, and 4.
Naïve is standard ranking without correction for selection or genotyping error.
Re-ranked is ranking by corrected statistic in Equation 1.
In this simulation, the GWAS tag SNP is causal and all post-GWAS SNPs are non-causal.
Localization success rates for simulation Scenario 5.
| | | Call rate (= 1 − missing data rate) | | | | | | | | | | | |
| Correlation between the tag and causal SNPs, r | Sample size | 0.80 | | 0.90 | | 0.95 | | 0.98 | | 0.99 | | 1.00 | |
| | | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked | Naïve | Re-ranked |
| 0.95 | 2500 | 0.17 | 0.20 | 0.26 | 0.25 | 0.25 | 0.25 | 0.30 | 0.30 | 0.33 | 0.33 | 0.45 | 0.45 |
| | 5000 | 0.21 | 0.29 | 0.30 | 0.32 | 0.36 | 0.40 | 0.39 | 0.41 | 0.50 | 0.51 | 0.51 | 0.51 |
| | 7500 | 0.18 | 0.34 | 0.33 | 0.43 | 0.46 | 0.50 | 0.53 | 0.55 | 0.61 | 0.60 | 0.71 | 0.71 |
| | 10000 | 0.18 | 0.38 | 0.37 | 0.42 | 0.48 | 0.53 | 0.58 | 0.60 | 0.66 | 0.67 | 0.77 | 0.77 |
See Table 2 for details of the simulation models.
Equal number of cases and controls.
Naïve is standard ranking without correction for selection or genotyping error.
Re-ranked is ranking by corrected statistic in Equation 1.
Figure 4. Naïve test statistics and re-ranking statistics for regions surrounding rs78246868 in the 8q24.21 region for association with prostate cancer risk.
Naïve test statistics (A), and re-ranking statistics adjusting for genotyping accuracy (B) for SNPs in LD (r² > 0.2) with rs78246868. Circles highlight SNPs whose rank changed considerably after re-ranking. Color indicates pair-wise correlation with the most significant SNP in the region selected based on the naïve ranking (purple diamond). Other shapes indicate genotyping accuracy over all 7 studies as measured by ρ. rs78246868 is no longer the most significant SNP in the region after re-ranking.
Figure 5. Naïve test statistics and re-ranking statistics for regions surrounding rs8071558 in the 17q24.3 region for association with prostate cancer risk.
Naïve test statistics (A), and re-ranking statistics adjusting for genotyping accuracy (B) for SNPs in LD (r² > 0.2) with rs8071558. Circles highlight SNPs whose rank changed considerably after re-ranking. Color indicates pair-wise correlation with the most significant SNP in the region selected based on the naïve ranking (purple diamond). Other shapes indicate genotyping accuracy over all 7 studies as measured by ρ. rs8071558 is no longer the most significant SNP in the region after re-ranking.