Daniel F Schwarz, Silke Szymczak, Andreas Ziegler, Inke R König.
Abstract
With the development of high-throughput single-nucleotide polymorphism (SNP) technologies, the vast number of SNPs typed in comparatively small samples poses a challenge to the application of classical statistical procedures. A possible solution is to use a two-stage approach for case-control data in which, in the first stage, a screening test selects a small number of SNPs for further analysis. The second stage then estimates the effects of the selected variables using logistic regression (logReg). Here, we introduce a novel approach in which the selection of SNPs is based on the permutation importance estimated by random forests (RFs). For this, we used the simulated data provided for the Genetic Analysis Workshop 15 without knowledge of the true model.
The data set was randomly split into a first and a second data set. In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg.
The highest effect estimates were obtained for five simulated loci. We detected smoking, gender, and the parental DR alleles as covariates. After correction for multiple testing, we identified two out of four genes simulated with a direct effect on rheumatoid arthritis risk and all covariates without any false positives.
We showed that a two-stage approach with a screening of SNPs by RFs is suitable to detect candidate SNPs in genome-wide association studies for complex diseases.
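The screening idea behind permutation importance can be illustrated with a minimal, model-agnostic sketch: shuffle one variable at a time and record how much the prediction accuracy drops. This is not the authors' implementation. Random forests compute this importance internally on out-of-bag samples, whereas the sketch below uses a hypothetical stand-in predictor (`toy_predict`) and hypothetical toy data, and all function names are assumptions made for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns intact.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: column 0 determines the label, column 1 is noise.
rows = [[i % 2, (i // 2) % 3] for i in range(40)]
labels = [row[0] for row in rows]

def toy_predict(row):
    """Stand-in for a fitted classifier (assumption, not a random forest)."""
    return row[0]

scores = permutation_importance(toy_predict, rows, labels)
# Rank variables from most to least important, as a screening step would.
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In a two-stage design, only the top-ranked variables from such a screening step would be carried forward to the logistic regression on the second half of the data.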
Year: 2007 PMID: 18466559 PMCID: PMC2367487 DOI: 10.1186/1753-6561-1-s1-s59
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Figure 1. Importance of SNPs. Global importance scores for the single SNPs in the genome-wide scan in chromosomal order. Vertical dotted lines show chromosomal boundaries.
Figure 2. Prediction error in random forests based on different numbers of variables. The prediction error was estimated in the out-of-bag (OOB) samples. Only the error estimates for the first 100 sets are displayed. The first local minimum in prediction error occurs at the set of 37 variables, which was selected for further analyses.
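The selection rule in the Figure 2 caption, taking the first local minimum of the prediction error, can be sketched as follows. The exact definition of a local minimum used by the authors is not given here, so the rule below (strictly lower than the previous value, no higher than the next) is an assumption, and the error values are hypothetical rather than taken from the paper.

```python
def first_local_minimum(errors):
    """Index of the first value that is strictly lower than its predecessor
    and no higher than its successor, or None if there is none."""
    for i in range(1, len(errors) - 1):
        if errors[i] < errors[i - 1] and errors[i] <= errors[i + 1]:
            return i
    return None

# Hypothetical OOB error estimates for successive variable sets.
oob_error = [0.40, 0.35, 0.31, 0.33, 0.30, 0.29, 0.32]
idx = first_local_minimum(oob_error)  # position of the selected set
```

Stopping at the first local minimum rather than the global one favours a smaller variable set, which suits a screening stage that feeds a regression model.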
Effect estimates of the selected variables
| | logRegᵃ | | | | |
| Variable | Coef | SE | Nominal p | Adjusted p | Simulated effects |
| DR allele from mother | 1.476 | 0.114 | <10⁻¹⁵ | <10⁻¹³ | |
| DR allele from father | 1.480 | 0.115 | <10⁻¹⁵ | <10⁻¹³ | |
| Chr 11 bp 110,204,257 | 0.822 | 0.123 | 2.40 × 10⁻¹¹ | 7.20 × 10⁻¹⁰ | Locus F |
| Lifetime smoking | 0.976 | 0.168 | 7.06 × 10⁻⁹ | 2.05 × 10⁻⁷ | Smoking |
| Gender | 0.804 | 0.169 | 2.14 × 10⁻⁶ | 5.99 × 10⁻⁵ | Gender |
| Chr 6 bp 32,521,277 | -0.861 | 0.201 | 1.87 × 10⁻⁵ | 0.0005 | DR/Locus C |
| Chr 18 bp 66,048,927 | 0.333 | 0.133 | 0.0121 | 0.3146 | Locus E |
| Chr 6 bp 36,582,440 | 0.634 | 0.255 | 0.0131 | 0.3275 | Locus D |
| Chr 6 bp 28,758,332 | 0.248 | 0.125 | 0.0477 | 1.0000 | |
| Chr 1 bp 26,043,914 | 0.185 | 0.143 | 0.1942 | 1.0000 | |
| Chr 2 bp 34,451,973 | 0.154 | 0.126 | 0.2228 | 1.0000 | |
| Chr 6 bp 30,266,243 | -0.184 | 0.165 | 0.2647 | 1.0000 | |
| Chr 7 bp 97,632,608 | 0.116 | 0.125 | 0.3522 | 1.0000 | |
| Chr 6 bp 26,075,047 | -0.126 | 0.138 | 0.3613 | 1.0000 | |
| Chr 18 bp 10,152,707 | 0.098 | 0.115 | 0.3907 | 1.0000 | |
| Chr 8 bp 127,252,736 | -0.101 | 0.121 | 0.4050 | 1.0000 | |
| Chr 13 bp 45,600,085 | -0.212 | 0.258 | 0.4094 | 1.0000 | |
| Chr 13 bp 31,890,164 | 0.090 | 0.116 | 0.4356 | 1.0000 | |
| Chr 11 bp 22,794,066 | 0.118 | 0.155 | 0.4475 | 1.0000 | |
| Chr 5 bp 57,110,585 | -0.251 | 0.337 | 0.4559 | 1.0000 | |
| Chr 6 bp 32,772,203 | 0.094 | 0.133 | 0.4769 | 1.0000 | |
| Chr 4 bp 15,714,556 | 0.066 | 0.112 | 0.5547 | 1.0000 | |
| Chr 6 bp 133,756,692 | 0.072 | 0.133 | 0.5885 | 1.0000 | |
| Chr 14 bp 37,328,424 | 0.073 | 0.142 | 0.6051 | 1.0000 | |
| Chr 1 bp 48,687,156 | -0.192 | 0.378 | 0.6115 | 1.0000 | |
| Chr 15 bp 77,852,281 | 0.097 | 0.195 | 0.6170 | 1.0000 | |
| Chr 10 bp 10,764,908 | 0.050 | 0.133 | 0.7034 | 1.0000 | |
| Chr 15 bp 66,671,014 | 0.049 | 0.222 | 0.8235 | 1.0000 | |
| Chr 2 bp 17,889,207 | 0.059 | 0.269 | 0.8261 | 1.0000 | |
| Chr 6 bp 155,580,230 | -0.020 | 0.131 | 0.8757 | 1.0000 | |
| Chr 7 bp 8,524,374 | -0.009 | 0.116 | 0.9332 | 1.0000 | |
| Chr 2 bp 157,502,490 | 0.018 | 0.570 | 0.9744 | 1.0000 | |
| Intercept | -11.464 | 1.594 | | | |
ᵃ logReg, logistic regression; Coef, estimated regression coefficient; SE, standard error; nominal p, two-sided nominal p-value; adjusted p, two-sided p-value adjusted according to the Bonferroni-Holm procedure; Chr, chromosome; bp, base pair.
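The Bonferroni-Holm adjustment named in the footnote can be sketched in a few lines: sort the m nominal p-values, multiply the k-th smallest by (m - k + 1), cap at 1, and enforce monotonicity. The function name `holm_adjust` is an assumption made for illustration, and the authors' software is not specified here. The input is the nominal p column of the table for the 32 variables; because the two DR allele rows are reported only as <10⁻¹⁵, the value 1e-15 is used for them as a placeholder upper bound.

```python
def holm_adjust(p_values):
    """Bonferroni-Holm step-down adjusted p-values, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Nominal p-values from the table, top to bottom (intercept excluded).
# The first two entries are placeholders for "<10^-15" (assumption).
nominal = [
    1e-15, 1e-15, 2.40e-11, 7.06e-9, 2.14e-6, 1.87e-5, 0.0121, 0.0131,
    0.0477, 0.1942, 0.2228, 0.2647, 0.3522, 0.3613, 0.3907, 0.4050,
    0.4094, 0.4356, 0.4475, 0.4559, 0.4769, 0.5547, 0.5885, 0.6051,
    0.6115, 0.6170, 0.7034, 0.8235, 0.8261, 0.8757, 0.9332, 0.9744,
]
adjusted = holm_adjust(nominal)
```

Under these assumptions the sketch agrees with the adjusted p column, for example 30 × 2.40 × 10⁻¹¹ = 7.20 × 10⁻¹⁰ for the chromosome 11 SNP and 26 × 0.0121 = 0.3146 for the chromosome 18 SNP, while every product above 1 is capped at 1.0000.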