Christopher Pardy, Allan Motyer, Susan Wilson.
Abstract
Our goal is to identify common single-nucleotide polymorphisms (SNPs) (minor allele frequency > 1%) that add predictive accuracy above that gained by knowledge of easily measured clinical variables. We take an algorithmic approach to predict each phenotypic variable using a combination of phenotypic and genotypic predictors. We perform our procedure on the first simulated replicate and then validate against the others. Our procedure performs well when predicting Q1 but is less successful for the other outcomes. We use resampling procedures where possible to guard against false positives and to improve generalizability. The approach is based on finding a consensus regarding important SNPs by applying random forests and the least absolute shrinkage and selection operator (LASSO) on multiple subsamples. Random forests are used first to discard unimportant predictors, narrowing our focus to roughly 100 important SNPs. A cross-validation LASSO is then used to further select variables. We combine these procedures to guarantee that cross-validation can be used to choose a shrinkage parameter for the LASSO. If the clinical variables were unavailable, this prefiltering step would be essential. We perform the SNP-based analyses simultaneously rather than one at a time to estimate SNP effects in the presence of other causal variants. We analyzed the first simulated replicate of Genetic Analysis Workshop 17 without knowledge of the true model. Post-conference knowledge of the simulation parameters allowed us to investigate the limitations of our approach. We found that many of the false positives we identified were substantially correlated with genuine causal SNPs.
Year: 2011 PMID: 22373247 PMCID: PMC3287897 DOI: 10.1186/1753-6561-5-S9-S59
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Figure 1. LASSO cross-validation plots for affected status. Cross-validation fails to identify an appropriate shrinkage parameter when only the 4,755 SNPs are used (left-hand panel). A parameter can be chosen when there is additional adjustment for clinical variables (center panel) or when SNPs are prefiltered according to random forest importance score (right-hand panel).
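To illustrate how cross-validation can choose a LASSO shrinkage parameter, the following is a minimal sketch in pure Python; it is not the authors' implementation. It fits the LASSO by coordinate descent on simulated data and picks the candidate value with the lowest 5-fold cross-validated squared error. The function names (`soft_threshold`, `lasso_fit`, `cv_mse`), the simulated Gaussian predictors, and the grid of candidate values are all assumptions made for illustration.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.
    Assumption: no intercept, so predictors and outcome should be roughly centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals for beta = 0
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of predictor j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def cv_mse(X, y, lam, k=5):
    """Mean squared prediction error over k interleaved folds."""
    n = len(X)
    sq_err = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_train = [X[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        beta = lasso_fit(X_train, y_train, lam)
        for i in held_out:
            pred = sum(b * x for b, x in zip(beta, X[i]))
            sq_err += (y[i] - pred) ** 2
    return sq_err / n

# Simulated data (hypothetical): 10 predictors, only the first two have an effect.
rng = random.Random(0)
n, p = 100, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

lambdas = [1.0, 0.5, 0.2, 0.1, 0.05, 0.01]  # assumed candidate grid
cv_errors = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
beta_best = lasso_fit(X, y, best_lam)
print("chosen shrinkage parameter:", best_lam)
```

When the cross-validation error curve has no clear minimum inside the grid, as the caption describes for the unfiltered SNP set, a rule like `min(cv_errors, ...)` simply returns an endpoint of the grid, which is one way to recognize that no appropriate value was identified.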
Figure 2. Outline of our approach.
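The consensus idea described in the abstract, repeating a selection step on multiple subsamples and keeping the SNPs chosen consistently, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the selector used here is a simple top-k marginal-correlation filter standing in for the random forest and LASSO steps, and the names `consensus_selection` and `top_k_filter`, the subsample fraction, the number of subsamples, and the 80% threshold are all hypothetical.

```python
import math
import random
from collections import Counter

def abs_correlation(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy) / math.sqrt(sxx * syy)

def top_k_filter(X, y, k):
    """Stand-in selector: indices of the k predictors most correlated with y.
    This is NOT a random forest or a LASSO; it only keeps the sketch stdlib-only."""
    p = len(X[0])
    scores = [(abs_correlation([row[j] for row in X], y), j) for j in range(p)]
    return {j for _, j in sorted(scores, reverse=True)[:k]}

def consensus_selection(X, y, selector, n_subsamples=50, fraction=0.5,
                        threshold=0.8, seed=0):
    """Run `selector` on random subsamples and keep the predictors selected
    in at least `threshold` of them."""
    rng = random.Random(seed)
    n = len(X)
    m = int(fraction * n)
    counts = Counter()
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), m)  # subsample without replacement
        counts.update(selector([X[i] for i in idx], [y[i] for i in idx]))
    return sorted(j for j, c in counts.items() if c / n_subsamples >= threshold)

# Simulated genotypes (hypothetical): minor-allele counts 0/1/2 at 30 SNPs,
# with SNPs 3 and 7 affecting a quantitative outcome.
rng = random.Random(1)
n, p = 200, 30
X = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(p)] for _ in range(n)]
y = [1.5 * row[3] + 1.5 * row[7] + rng.gauss(0, 1) for row in X]

consensus = consensus_selection(X, y, lambda Xs, ys: top_k_filter(Xs, ys, k=5))
print("consensus SNP indices:", consensus)
```

Because `selector` is passed in as a function, a random forest importance filter followed by a cross-validated LASSO could be substituted without changing the consensus step.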
Final consensus models for Q1 with and without SNPs
| Predictor | Estimate | Standard error | p-value |
|---|---|---|---|
| Model with SNPs | | | |
| Intercept | −1.45 | 0.11 | <2 × 10^−16 |
| C10S4601 | −0.24 | 0.10 | 0.020 |
| C10S4927 | 0.10 | 0.04 | 0.020 |
| C12S2798 | −0.07 | 0.05 | 0.158 |
| C13S431* | 0.45 | 0.15 | 0.004 |
| C13S522* | 0.80 | 0.13 | 1.39 × 10^−9 |
| C13S523* | 0.64 | 0.09 | 9.52 × 10^−12 |
| C14S2902 | 0.15 | 0.05 | 0.002 |
| C18S794 | 0.14 | 0.05 | 0.004 |
| C19S5879 | 0.11 | 0.04 | 0.006 |
| C1S4244 | −0.37 | 0.16 | 0.018 |
| C1S7427 | 0.10 | 0.04 | 0.013 |
| C4S1220 | 0.07 | 0.04 | 0.081 |
| C5S221 | 0.11 | 0.05 | 0.023 |
| C6S4003 | −0.24 | 0.13 | 0.057 |
| C6S469 | 0.30 | 0.14 | 0.031 |
| C7S2893 | 0.09 | 0.05 | 0.062 |
| C8S2699 | −0.20 | 0.13 | 0.114 |
| C9S13 | 0.13 | 0.08 | 0.111 |
| Q2 | 0.26 | 0.03 | <2 × 10^−16 |
| Age | 0.02 | 0.00 | <2 × 10^−16 |
| Model without SNPs | | | |
| Intercept | −0.93 | 0.08 | <2 × 10^−16 |
| Q2 | 0.27 | 0.03 | 2.94 × 10^−15 |
| Age | 0.02 | 0.00 | <2 × 10^−16 |
| Smoke | 0.59 | 0.08 | 1.01 × 10^−14 |
Asterisks indicate a genuine causal variant.
Strongest correlations between false positives and genuine causal SNPs
| Outcome | Identified SNP | Causal SNP | Correlation |
|---|---|---|---|
| Consensus models (a) | | | |
| Q1 | C9S13 | C4S1878 | 0.21 |
| | C9S13 | C4S1884 | 0.23 |
| | C1S7427 | C6S5380 | 0.22 |
| | C12S2798 | C6S5380 | −0.28 |
| | C8S2699 | C6S5426 | 0.21 |
| | C1S7427 | C13S523 | 0.21 |
| | C12S2798 | C13S523 | −0.27 |
| | C12S7427 | C14S3706 | −0.27 |
| Q4 | C10S6324 | C13S523 | 0.23 |
| Affected | C19S3379 | C4S1878 | −0.21 |
| | C18S2310 | C6S5426 | 0.26 |
| | C19S3379 | C6S5426 | 0.22 |
| GWAS for Q1 (b) | C12S707 | C4S1878 | 0.27 |
| | C12S711 | C4S1878 | 0.28 |
| | C12S707 | C4S1884 | 0.26 |
| | C12S707 | C6S5380 | 0.21 |
| | C12S711 | C6S5380 | 0.21 |
| | C12S2798 | C6S5380 | −0.28 |
| | C12S707 | C13S522 | 0.21 |
| | C12S711 | C13S522 | 0.21 |
| | C12S707 | C13S523 | 0.45 |
| | C12S711 | C13S523 | 0.40 |
| | C12S2028 | C13S523 | 0.32 |
| | C12S2798 | C13S523 | −0.27 |
| | C12S707 | C14S1734 | 0.31 |
| | C12S711 | C14S1734 | 0.26 |
| | C12S2028 | C14S1734 | 0.25 |
| | C12S707 | C18S2492 | 0.41 |
| | C12S711 | C18S2492 | 0.26 |
(a) SNPs identified by our procedure.
(b) SNPs identified by a naïve genome-wide association study for Q1.
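As a rough illustration of how correlations like those tabulated above could be computed, the sketch below calculates the Pearson correlation between genotype vectors coded as minor-allele counts (0, 1, or 2) and lists the identified/causal pairs whose absolute correlation exceeds a cutoff. The coding scheme, the toy genotypes, the SNP labels `snp_a`, `snp_b`, and `snp_c`, the helper `flag_correlated`, and the 0.2 cutoff are assumptions for illustration and are not taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def flag_correlated(identified, causal, genotypes, cutoff=0.2):
    """Return (identified, causal, r) for pairs with |r| above the cutoff."""
    flagged = []
    for snp_i in identified:
        for snp_c in causal:
            r = pearson(genotypes[snp_i], genotypes[snp_c])
            if abs(r) > cutoff:
                flagged.append((snp_i, snp_c, round(r, 2)))
    return flagged

# Toy genotypes for six individuals (hypothetical labels and values).
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],  # identified SNP
    "snp_b": [0, 1, 2, 0, 1, 2],  # causal SNP, identical to snp_a
    "snp_c": [1, 1, 0, 0, 2, 2],  # causal SNP, weakly related to snp_a
}
pairs = flag_correlated(["snp_a"], ["snp_b", "snp_c"], genotypes)
print(pairs)
```

A selected SNP that is correlated with a causal variant in this way can improve prediction even though it is not itself causal, which is consistent with the observation that many false positives were correlated with genuine causal SNPs.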