Usman Roshan, Satish Chikkagoudar, Zhi Wei, Kai Wang, Hakon Hakonarson.
Abstract
We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, the support vector machine (SVM), and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-squared-ranked SNPs, where r is the number of SNPs with P-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as the stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications, we compare the ranks of previously replicated SNPs in real data, associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and disease risk prediction accuracies as given by top-ranked SNPs by the three methods. Software and a webserver are available at http://svmsnps.njit.edu.
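The two-stage ranking described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' software: the 1 df chi-squared statistic is computed from a 2x2 allele-count table, the significance level `alpha=0.05` is assumed, and `score_fn` is a hypothetical caller-supplied function standing in for whatever per-SNP score the SVM or RF produces (this record does not say how that score is derived). All function and parameter names are illustrative.

```python
import math


def chi2_1df(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-squared statistic (1 df) for a 2x2 allele-count table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row1, row2 = case_alt + case_ref, ctrl_alt + ctrl_ref
    col1, col2 = case_alt + ctrl_alt, case_ref + ctrl_ref
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom


def p_value_1df(stat):
    """Upper-tail P-value of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))


def rerank_top(tables, score_fn, multiplier=2, alpha=0.05):
    """Rank SNPs by chi-squared, then re-rank the top multiplier*r.

    tables   : one (case_alt, case_ref, ctrl_alt, ctrl_ref) tuple per SNP
    score_fn : hypothetical stand-in for SVM/RF scoring; takes a list of
               SNP indices and returns {index: score}, higher = better
    alpha    : assumed family-wise level for the Bonferroni cutoff
    Returns (r, top r by chi-squared, top r after re-ranking).
    """
    stats = [chi2_1df(*t) for t in tables]
    order = sorted(range(len(tables)), key=lambda i: stats[i], reverse=True)
    threshold = alpha / len(tables)  # Bonferroni-corrected threshold
    r = sum(1 for s in stats if p_value_1df(s) <= threshold)
    candidates = order[: multiplier * r]
    scores = score_fn(candidates)
    # The sort is stable, so ties keep their chi-squared order.
    reranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return r, order[:r], reranked[:r]
```

Setting `multiplier` to 2, 5, or 10 corresponds to the 2r, 5r, and 10r cutoffs compared in the abstract.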
Year: 2011 PMID: 21317188 PMCID: PMC3089490 DOI: 10.1093/nar/gkr064
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Mean number of causal variants and associated regions in top r SNPs given by each method at the three different relative risks
| RR | χ2 | SVM(2r) | SVM(5r) | SVM(10r) | RF(2r) | RF(5r) | RF(10r) |
|---|---|---|---|---|---|---|---|
| Mean number of causal variants | | | | | | | |
| 1.25 | 1.2 | 1.2 | 0.9 | 0.5 | 1.2 | 1.1 | 1 |
| 1.5 | 8.9 | 10.8 | 7.4 | 6.6 | 9.7 | 9.3 | 9.4 |
| 2 | 14 | 14.6 | 13.6 | 13.1 | 14.1 | 14.1 | 13.7 |
| Mean number of associated regions | | | | | | | |
| 1.25 | 1 | 0.8 | 0.8 | 0.8 | 0.9 | 0.8 | 0.8 |
| 1.5 | 4.5 | 5.9 | 3.5 | 2.6 | 5.3 | 4.9 | 4.8 |
| 2 | 10.6 | 11.9 | 9.6 | 9.4 | 11 | 10.9 | 10.8 |
Mean number of causal variants detected in top r ranked SNPs given by each method on different sample sizes at relative risk 1.25
| Sample size | Mean r | χ2 | SVM(2r) | RF(2r) |
|---|---|---|---|---|
| 2000 | 2.1 | 1.2 | 1.2 | 1.2 |
| 4000 | 9.2 | 4.8 | 5.7 | 4.4 |
| 8000 | 31.7 | 11.0 | 12.1 | 8.5 |
Mean number of causal variants detected in top r ranked SNPs by each method on data with causal allele frequencies at most 5% and two different sample sizes and relative risks
| Sample size and relative risk | χ2 | SVM(2r) | RF(2r) |
|---|---|---|---|
| 4000, 1.25 | 0.5 | 1 | 1 |
| 8000, 1.25 | 1.7 | 2.5 | 2.5 |
| 4000, 1.5 | 4.8 | 6.3 | 6.1 |
| 8000, 1.5 | 12.7 | 13.2 | 13.3 |
Mean number of causal variants and associated regions in SVM and RF applied to all SNPs in the GWAS
| RR | Causal variants (χ2) | Causal variants (SVM) | Causal variants (RF) | Associated regions (χ2) | Associated regions (SVM) | Associated regions (RF) |
|---|---|---|---|---|---|---|
| 1.25 | 1.2 | 1.3 | 1 | 1 | 0.8 | 0.9 |
| 1.5 | 8.9 | 8.1 | 8.7 | 4.8 | 4.3 | 4.8 |
| 2 | 14 | 12.4 | 14 | 10.6 | 9.4 | 10.7 |
Figure 1. Empirical power to detect k causal variants in simulated data of relative risk 1.5.
Correlation coefficient between original SNP ranks and mean SNP ranks across 100 jackknifed datasets of relative risk 1.5
| Method | r | 2r | 5r | 10r |
|---|---|---|---|---|
| χ2 | 0.99 | 0.99 | 0.97 | 0.94 |
| SVM | 0.99 | 0.98 | 0.97 | 0.96 |
| RF(100) | 0.87 | 0.84 | 0.65 | 0.59 |
| RF(10000) | 0.99 | 0.98 | 0.95 | 0.91 |
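The stability measure in the table above can be illustrated with a short sketch. It is an assumption-laden illustration rather than the authors' procedure: this record does not state how the jackknifed datasets were drawn or which correlation coefficient was used, so the code assumes each replicate drops a random 10% of samples (`drop_frac`) and uses the Pearson coefficient. `score_fn` is a hypothetical function that maps a list of samples to one score per SNP.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks_from_scores(scores):
    """Rank 1 goes to the highest score; ties are broken by index."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def jackknife_stability(samples, score_fn, n_rep=100, drop_frac=0.1, seed=0):
    """Correlate original SNP ranks with mean ranks over jackknifed data.

    samples   : list of per-individual records
    score_fn  : hypothetical scorer, samples -> list of per-SNP scores
    drop_frac : assumed fraction of samples removed in each replicate
    """
    rng = random.Random(seed)
    original = ranks_from_scores(score_fn(samples))
    keep = len(samples) - int(drop_frac * len(samples))
    totals = [0.0] * len(original)
    for _ in range(n_rep):
        subset = rng.sample(samples, keep)  # sample without replacement
        for j, rank in enumerate(ranks_from_scores(score_fn(subset))):
            totals[j] += rank
    mean_ranks = [t / n_rep for t in totals]
    return pearson(original, mean_ranks)
```

A value near 1 indicates that a method's ranking barely changes when samples are removed, which is how the coefficients in the table are read.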
Number of type 1 diabetes associated regions given by top r SNPs of chi-square, RF(2r) and SVM(2r)
| Method | Regions defined by replicated SNPs in Ref. ( | T1D Base regions |
|---|---|---|
| χ2 | 5 | 5 |
| RF(2r) | 8 | 13 |
| SVM(2r) | 9 | 15 |
Figure 2. ROC area under curve of the composite odds ratio score on the GoKinD type 1 diabetes study as a function of top-ranked SNPs obtained from the WTCCC study by the three different methods.