Yoonhee Kim, Robert Wojciechowski, Heejong Sung, Rasika A Mathias, Li Wang, Alison P Klein, Rhoshel K Lenroot, James Malley, Joan E Bailey-Wilson.
Abstract
Random forests (RF) is one of a broad class of machine learning methods that can handle large-scale data without model specification, which makes it an attractive method for genome-wide association studies (GWAS). The performance of RF and other association methods in the presence of interactions was evaluated using the simulated data from Genetic Analysis Workshop 16 Problem 3, with knowledge of the major causative markers, risk factors, and their interactions in the simulated traits. There was good power to detect the environmental risk factors using RF, trend tests, or regression analyses, but the power to detect the effects of the causal markers was poor for all methods. The causal marker that had an interactive effect with smoking did show moderate evidence of association in the RF and regression analyses, suggesting that RF may perform well at detecting such interactions in larger, more highly powered datasets.
Year: 2009 PMID: 20018058 PMCID: PMC2795965 DOI: 10.1186/1753-6561-3-s7-s64
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
PLINK and RF results for MI
| Predictors | PLINKᵃ p-valueᵇ, univariate | PLINKᵃ p-valueᵇ, multivariate | RF REPᶜ (top 1536), mtryᵈ = 200 | RF REPᶜ (top 1536), mtryᵈ = 7 | RF REPᶜ (top 3072), mtryᵈ = 200 | RF REPᶜ (top 3072), mtryᵈ = 7 |
|---|---|---|---|---|---|---|
| Covariates | ||||||
| Age | - | 7.66 × 10⁻⁷ | 100 | 100 | 100 | 100 |
| Sex | - | 6.97 × 10⁻¹² | 100 | 36 | 100 | 44 |
| Smoking | - | 1.72 × 10⁻¹⁰ | 100 | 71 | 100 | 84 |
| Cholesterol | - | <1.0 × 10⁻⁵⁰ | 100 | 100 | 100 | 100 |
| CAC | ||||||
| rs6743961 | 0.451 | 0.7394 | 2 | 10 | 9 | 22 |
| rs17714718 | 0.0188 | 0.093 | 20 | 13 | 33 | 29 |
| rs1894638 | 0.0406 | 0.3198 | 5 | 18 | 14 | 32 |
| rs1919811 | 0.2106 | 0.2088 | 8 | 14 | 9 | 21 |
| rs213952 | 0.3923 | 0.2319 | 11 | 3 | 25 | 11 |
| MI | ||||||
| rs12565497 | 0.00303 | 0.000192 | 56 | 21 | 71 | 37 |
| rs11927551 | 0.963 | 0.6993 | 1 | 4 | 7 | 9 |
ᵃ PLINK analyses assumed either additive or log-additive genetic models.
ᵇ p-Values for covariates were averages across all SNPs.
ᶜ REP: the number of RFs, out of 100 (1000 trees each), in which the given SNP of interest appeared in the top 1536 or 3072 predictors based on the Gini index.
ᵈ mtry: the number of predictors (200 or 7) randomly selected at each node to find the best split while growing trees.
PLINK and RF results for CAC
| Predictors | PLINKᵃ p-valueᵇ, univariate | PLINKᵃ p-valueᵇ, multivariate | RF REPᶜ (top 1536), mtryᵈ = 200 | RF REPᶜ (top 1536), mtryᵈ = 7 | RF REPᶜ (top 3072), mtryᵈ = 200 | RF REPᶜ (top 3072), mtryᵈ = 7 |
|---|---|---|---|---|---|---|
| Covariates | ||||||
| Age | - | 0.508 | 100 | 65 | 100 | 65 |
| Sex | - | 1.79 × 10⁻¹³ | 13 | 6 | 19 | 6 |
| Smoking | - | 0.259 | 11 | 7 | 17 | 7 |
| Cholesterol | - | <1.0 × 10⁻⁵⁰ | 100 | 100 | 100 | 100 |
| CAC | ||||||
| rs6743961 | 0.113 | 0.588 | 6 | 7 | 11 | 11 |
| rs17714718 | 0.0299 | 0.1392 | 3 | 5 | 7 | 8 |
| rs1894638 | 0.0129 | 0.5723 | 4 | 9 | 5 | 12 |
| rs1919811 | 0.128 | 0.1413 | 5 | 5 | 7 | 6 |
| rs213952 | 0.0786 | 0.4738 | 4 | 6 | 8 | 14 |
ᵃ PLINK analyses assumed either additive or log-additive genetic models.
ᵇ p-Values for covariates were averages across all SNPs.
ᶜ REP: the number of RFs, out of 100 (1000 trees each), in which the given SNP of interest appeared in the top 1536 or 3072 predictors based on the MSE index.
ᵈ mtry: the number of predictors (200 or 7) randomly selected at each node to find the best split while growing trees.
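
The REP statistic defined in the table footnotes can be illustrated with a minimal sketch. It assumes that each forest has already produced one importance score per predictor (for example a Gini- or MSE-based score) and that a higher score means a more important predictor. The function name `rep_counts`, the predictor names, and the toy scores are hypothetical and are not taken from the study; this is not the authors' code.

```python
from typing import Dict, Sequence


def rep_counts(
    importances_per_forest: Sequence[Sequence[float]],
    predictor_names: Sequence[str],
    top_k: int,
) -> Dict[str, int]:
    """For each predictor, count the forests in which it ranks among the
    top_k predictors by importance (assumption: higher score = more important)."""
    counts = {name: 0 for name in predictor_names}
    for scores in importances_per_forest:
        if len(scores) != len(predictor_names):
            raise ValueError("one importance score per predictor is required")
        # Rank predictor indices from most to least important in this forest.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for i in ranked[:top_k]:
            counts[predictor_names[i]] += 1
    return counts


# Toy, made-up importance scores: three forests, four predictors.
names = ["age", "smoking", "snp_a", "snp_b"]
toy_importances = [
    [0.50, 0.30, 0.15, 0.05],
    [0.45, 0.12, 0.35, 0.08],
    [0.60, 0.25, 0.05, 0.10],
]

rep_top2 = rep_counts(toy_importances, names, top_k=2)
rep_top3 = rep_counts(toy_importances, names, top_k=3)
print(rep_top2)
print(rep_top3)
```

In the study's setting there would be 100 forests and `top_k` would be 1536 or 3072. Because any predictor ranked in a smaller top list is also in every larger one, the count for a larger `top_k` can never be lower than the count for a smaller one.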