Yan Meng, Qiong Yang, Karen T Cuenco, L Adrienne Cupples, Anita L Destefano, Kathryn L Lunetta.
Abstract
We used the simulated data set from Genetic Analysis Workshop 15 Problem 3 to assess a two-stage approach for identifying single-nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). In the first stage, we used random forests (RF) to screen large amounts of genetic data using the variable importance measure, which takes into account SNP interaction effects as well as main effects without requiring model specification. We used the 9187 simulated SNPs mimicking a 10 K SNP chip, along with the covariates DR (the simulated DRB1 genotype), smoking, and sex, as input to the RF analyses, with a training set consisting of 750 unrelated RA cases and 750 controls. We used an iterative RF screening procedure to identify a smaller set of variables for further analysis. In the second stage, we used the software program CaMML, which produces Bayesian networks, to develop complex etiologic models for RA risk from the variables identified by our RF screening procedure. We evaluated the performance of this method using independent test data sets for up to 100 replicates.
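The first-stage screening can be pictured as a loop that ranks the current variables by importance, discards the lowest-ranked ones, and repeats on the survivors. The following is a minimal Python sketch of that loop only, not the authors' procedure: the function names, `keep_fraction`, `min_vars`, and the toy importance function are all assumptions made for illustration. In a real analysis the scores would come from refitting a random forest on the retained variables at each round.

```python
import random

def iterative_screen(variables, importance_fn, keep_fraction=0.5, min_vars=10):
    """Rank variables by importance, keep the top fraction, and repeat.

    importance_fn is a hypothetical callback: it takes the current list of
    variables and returns a dict mapping each variable to a score. In a real
    analysis it would refit a random forest and return its importance measure.
    """
    if not 0 < keep_fraction < 1:
        raise ValueError("keep_fraction must be strictly between 0 and 1")
    current = list(variables)
    while len(current) > min_vars:
        scores = importance_fn(current)
        ranked = sorted(current, key=lambda v: scores[v], reverse=True)
        # Never shrink below min_vars in a single step.
        n_keep = max(min_vars, int(len(ranked) * keep_fraction))
        current = ranked[:n_keep]
    return current

def make_toy_importance(signal, seed=0):
    """Stand-in for a random forest: fixed signal plus small random noise."""
    rng = random.Random(seed)
    def importance(variables):
        return {v: signal.get(v, 0.0) + 0.01 * rng.random() for v in variables}
    return importance

# Made-up example: 200 variables, two of which carry signal.
toy_variables = [f"snp{i}" for i in range(200)]
toy_importance = make_toy_importance({"snp3": 1.0, "snp77": 0.8})
survivors = iterative_screen(toy_variables, toy_importance)
print(len(survivors), "variables retained")
```

Passing the importance computation in as a callback keeps the screening loop independent of any particular random forest implementation.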
Year: 2007 PMID: 18466556 PMCID: PMC2367609 DOI: 10.1186/1753-6561-1-s1-s56
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Figure 1. Bayesian network based on variables of ITbp for Replicate 1.
Power estimate of ITbp and CaMML
| Variables | Disease locus | | ITbp 100 replicates | ITbp Replicate 1–50 | CaMML Replicate 1–50 |
| NA | NA | ||||
| NA | NA | ||||
| NA | NA | ||||
| chr 6_155 | C | 0.104 | 97% | 98% | 98% |
| chr 6_150 | C | 0.027 | 13% | 8% | 4% |
| chr 6_149 | C | 0.014 | 6% | 8% | 4% |
| chr 6_139 | C | 0.009 | 18% | 20% | 10% |
| chr 6_138 | C | 0.009 | 17% | 18% | 10% |
| chr 6_140 | C | 0.007 | 1% | 2% | 2% |
| chr 6_134 | C | 0.007 | 4% | 4% | 4% |
| chr 6_137 | C | 0.006 | 2% | 2% | 2% |
| chr 6_130 | C | 0.006 | 8% | 6% | 2% |
| chr 6_148 | C | 0.005 | 9% | 8% | 4% |
| chr 6_147 | C | 0.004 | 9% | 10% | 8% |
| chr 6_135 | C | 0.002 | 3% | 6% | 2% |
| chr 6_145 | C | 0.001 | 35% | 32% | 24% |
| chr 6_132 | C | 0.0 | 7% | 6% | 6% |
| chr 6_156 | D | 0.001 | 11% | 14% | 2% |
| chr 11_387 | F | 0.135 | 5% | 6% | 6% |
| chr 11_388 | F | 0.064 | 5% | 4% | 4% |
| chr 11_391 | F | 0.031 | 1% | 2% | 2% |
| chr 16_29 | A | 0.001 | 1% | 0% | 0% |
| chr 18_269 | E | 0.171 | 51% | 48% | 10% |
| chr 8_442 | B | 0.001 | 0% | 0% | 0% |
| chr 9_186 | G | 0.021 | 0% | 0% | 0% |
| chr 9_189 | H | 0.014 | 0% | 0% | 0% |
a Surrogates and covariates are in bold.
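Assuming that "power" in this table means the proportion of replicates in which a variable is retained by the method (a common convention in simulation studies, though not stated in this record), the percentages could be computed as sketched below. The function name and the example selections are made up for illustration.

```python
def selection_power(selected_per_replicate, variable):
    """Fraction of replicates whose selected set contains `variable`."""
    if not selected_per_replicate:
        raise ValueError("need at least one replicate")
    hits = sum(1 for selected in selected_per_replicate if variable in selected)
    return hits / len(selected_per_replicate)

# Hypothetical selections from four replicates (not data from the study).
toy_selections = [
    {"snpA", "snpB"},
    {"snpA"},
    {"snpA", "snpC"},
    {"snpB"},
]
power_a = selection_power(toy_selections, "snpA")
print(f"{power_a:.0%}")
```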
Prediction error for random forest analyses
| Statistics | ITbp Training | ITbp Test | ITtop50 Training | ITtop50 Test | IT0 Training | IT0 Test | CaMML Test |
| Mean | 11.28 | 14.05 | 12.73 | 13.60 | 14.60 | 14.73 | 12.42 |
| SD | 0.83 | 0.90 | 0.95 | 0.85 | 0.96 | 0.91 | 0.97 |
| Min | 9.80 | 12.20 | 10.93 | 11.60 | 12.27 | 12.20 | 10.35 |
| Max | 14.73 | 16.87 | 16.00 | 15.47 | 18.00 | 16.53 | 16.00 |
| p-Value a (training vs. test) | 5.26 × 10⁻¹⁸ | | 1.35 × 10⁻⁹ | | 0.25 | | |
| Difference in median (training vs. test) | -2.77 | | -0.93 | | -0.17 | | |
a p-Value of the paired Wilcoxon rank test comparing training and test data prediction error.
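For readers unfamiliar with the paired Wilcoxon test, the following is a minimal Python sketch of a two-sided signed-rank test using the large-sample normal approximation. The record does not say which software or variant the authors used, so the approximation, the handling of zeros and ties, and the function name are all assumptions.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied absolute
    differences receive average ranks. No continuity or tie correction is
    applied, so this is only a rough sketch for small or heavily tied samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value

# Made-up prediction errors for five replicates (not data from the study).
toy_training = [11.0, 12.0, 10.5, 11.5, 12.5]
toy_test = [14.0, 13.0, 13.5, 14.5, 13.0]
w_plus, p_value = wilcoxon_signed_rank(toy_training, toy_test)
print(w_plus, p_value)
```

Under this approximation, 100 pairs whose differences all share one sign give a p-value on the order of 10⁻¹⁸, which may be why several of the reported p-values cluster near that magnitude.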
Paired Wilcoxon rank test of prediction errors from three RFs, using Replicates 1–100
| Comparison of prediction errors | Training data: p-Value | Training data: difference in median | Test data: p-Value | Test data: difference in median |
| ITbp vs. ITtop50 | 3.94 × 10⁻¹⁸ | -1.40 | 9.09 × 10⁻¹⁰ | 0.43 |
| ITbp vs. IT0 | 3.95 × 10⁻¹⁸ | -3.33 | 2.57 × 10⁻¹² | -0.73 |
| ITtop50 vs. IT0 | 3.94 × 10⁻¹⁸ | -1.87 | 1.20 × 10⁻¹⁷ | -2.10 |
Paired Wilcoxon rank test of prediction errors from three RFs and CaMML using test data and Replicates 1–50
| Comparison of prediction errors | Test data: p-Value | Test data: difference in median |
| CaMML vs. ITbp | 1.10 × 10⁻⁸ | -1.52 |
| CaMML vs. ITtop50 | 1.04 × 10⁻⁶ | -1.13 |
| CaMML vs. IT0 | 2.16 × 10⁻⁹ | -2.32 |