Alexandre Bureau, Josée Dupuis, Brooke Hayward, Kathleen Falls, Paul Van Eerdewegh.
Abstract
Random Forest is a prediction technique based on growing trees on bootstrap samples of data, in conjunction with a random selection of explanatory variables to define the best split at each node. In the case of a quantitative outcome, the tree predictor takes on a numerical value. We applied Random Forest to the first replicate of the Genetic Analysis Workshop 13 simulated data set, with the sibling pairs as our units of analysis and identity by descent (IBD) at selected loci as our explanatory variables. With the knowledge of the true model, we performed two sets of analyses on three phenotypes: HDL, triglycerides, and glucose. The goal was to approach the mapping of complex traits from a multivariate perspective. The first set of analyses mimics a candidate gene approach with a high proportion of true genes among the predictors while the second set represents a genome scan analysis using microsatellite markers. Random Forest was able to identify a few of the major genes influencing the phenotypes, such as baseline HDL and triglycerides, but failed to identify the major genes regulating baseline glucose levels.
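The procedure described in the abstract (trees grown on bootstrap samples, a random subset of explanatory variables tried at each node, a numerical prediction for a quantitative outcome) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `mtry` and `min_size` parameters, and the toy data (six variables taking values 0, 0.5, or 1, with only variable 0 affecting the outcome) are all assumptions made for illustration.

```python
import random
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y, rows, features):
    """Return the (variable, threshold) that most reduces SSE, or None."""
    best, best_score = None, sse([y[i] for i in rows])
    for j in features:
        # Candidate thresholds: every observed value except the largest,
        # so both children are guaranteed to be non-empty.
        for t in sorted({X[i][j] for i in rows})[:-1]:
            left = [y[i] for i in rows if X[i][j] <= t]
            right = [y[i] for i in rows if X[i][j] > t]
            score = sse(left) + sse(right)
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best

def grow_tree(X, y, rows, mtry, rng, min_size=5):
    """Grow a regression tree; each node tries a fresh random subset of mtry variables."""
    if len(rows) > min_size:
        features = rng.sample(range(len(X[0])), mtry)
        split = best_split(X, y, rows, features)
        if split is not None:
            j, t = split
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            return (j, t,
                    grow_tree(X, y, left, mtry, rng, min_size),
                    grow_tree(X, y, right, mtry, rng, min_size))
    return mean(y[i] for i in rows)  # leaf: numerical prediction

def predict_tree(node, x):
    """Follow the splits down to a leaf and return its value."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def grow_forest(X, y, n_trees, mtry, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree(X, y, rows, mtry, rng))
    return forest

def predict_forest(forest, x):
    """Forest prediction: the average of the individual tree predictions."""
    return mean(predict_tree(tree, x) for tree in forest)

# Hypothetical toy data: only variable 0 influences the outcome.
data_rng = random.Random(1)
X = [[data_rng.choice([0.0, 0.5, 1.0]) for _ in range(6)] for _ in range(200)]
y = [2.0 * x[0] + data_rng.gauss(0.0, 0.3) for x in X]

forest = grow_forest(X, y, n_trees=25, mtry=3)
low = predict_forest(forest, [0.0] * 6)
high = predict_forest(forest, [1.0] + [0.0] * 5)
print(low, high)
```

In this toy setting the prediction should rise with variable 0, because that is the only variable carrying signal. The forests reported in the figures are far larger (5000 or 10,000 trees, with 30 or 100 variables tried per split).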
Year: 2003 PMID: 14975132 PMCID: PMC1866502 DOI: 10.1186/1471-2156-4-S1-S64
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Figure 1. Importance of candidate genes. Importance is measured by percent increase in PE. Variables are the mean IBD at genes b12 to b38 and s3 to s12 and at four random locations (r) on each of chromosomes 2, 6, 10, and 16 (total of 52 loci, b37 and s12 being merged into one). Each random forest consists of 5000 trees. A sample of 30 variables is considered at each split.
Figure 2. Importance of genome scan markers. Importance is measured by percent increase in PE. Variables are the mean IBD at the 399 genome scan markers. Only the variables with non-zero importance are shown. Each random forest consists of 10,000 trees. A sample of 100 variables is considered at each split.
Figure 3. Importance of genome scan markers using Z2. (A) Importance of Z2 at the 399 genome scan markers. (B) Importance of mean IBD and Z2 at the 399 genome scan markers (798 variables). Only the variables with non-zero importance are shown. Each random forest consists of 10,000 trees. A sample of 100 variables is considered at each split.
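The figure captions report importance as a percent increase in PE. A minimal sketch of that kind of measure is given here, under two assumptions: PE is read as mean squared prediction error (the captions do not define it), and importance is obtained by permuting one variable and re-measuring the error. Random Forest importance is usually computed tree by tree on out-of-bag observations and then averaged; the sketch simplifies this to a single predictor and a single data set. The function names, the stand-in predictor, and the toy data are hypothetical.

```python
import random

def prediction_error(predict, X, y):
    """Mean squared prediction error of `predict` on (X, y)."""
    return sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y)

def percent_increase_in_pe(predict, X, y, j, rng):
    """Percent increase in PE after randomly permuting variable j."""
    base = prediction_error(predict, X, y)
    column = [x[j] for x in X]
    rng.shuffle(column)  # breaks the link between variable j and the outcome
    X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
    permuted = prediction_error(predict, X_perm, y)
    return 100.0 * (permuted - base) / base

# Hypothetical toy data: the outcome depends on variable 0 only.
rng = random.Random(0)
X = [[rng.choice([0.0, 0.5, 1.0]) for _ in range(4)] for _ in range(200)]
y = [2.0 * x[0] + rng.gauss(0.0, 0.3) for x in X]

def predict(x):
    """Stand-in for a fitted forest: uses variable 0 and ignores the rest."""
    return 2.0 * x[0]

importance = [percent_increase_in_pe(predict, X, y, j, rng) for j in range(4)]
print(importance)
```

A variable the predictor never uses gets an importance of exactly zero here, since permuting it leaves every prediction unchanged. This matches the captions' note that only variables with non-zero importance are shown.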
QTLs at candidate loci and in peak intervals with highest importance
| 1 | b12 | FPB | b17 | b12 | b12 | FP | FP | b19,b20 |
| 2 | b20 | FP | b14 | b20 | FP | FP | FP | b13 |
| 3 | b13 | FP | b12 | FP | b19,b20,b18 | b30A | b13 | b12 |
| 4 | s3A | FP | b27A | FP | FP | s3 | b14 | FP |
| 5 | s4A | FP | FP | b17 | FP | FP | FP | FP |
| 6 | FP | FP | FP | b19 | FP | FP | b17 | FP |
| 7 | b19 | FP | b33A | b20 | FP | b14 | FP | b29 |
| 8 | FP | b23A | b25 | FP | FP | FP | b12 | b22, s3 |
| 9 | FP | b22A | FP | b13 | FP | FP | FP | FP |
| 10 | s7A | b18 | FP | FP | b24A | FP | b35A | FP |
| 11 | b24A | FP | FP | s7A | b26A | FP | FP | FP |
| 12 | FP | s4 | FP | s3 | b20 | FP | FP | b14 |
List of genes influencing a trait directly or indirectly through a correlated trait. For peak intervals (genome scan), all genes directly affecting a trait are listed. Footnote markers appear as suffixes on table entries (e.g. s3A, FPB). A: genes influencing only a correlated trait. B: FP, false positive (for genome scan, peak intervals containing no gene influencing the trait).