Cyprien Mbogning, Hervé Perdry, Wilson Toussile, Philippe Broët.
Abstract
BACKGROUND: Dissecting the genomic spectrum of clinical disease entities is a challenging task. Recursive partitioning (or classification tree) methods provide powerful tools for exploring the complex interplay among genomic factors with respect to a main factor, and can reveal hidden genomic patterns. To take confounding variables into account, the partially linear tree-based regression (PLTR) model was recently published; it combines regression models with tree-based methodology. It is, however, computationally burdensome and not well suited to situations in which a large number of exploratory variables is expected.
Keywords: Disease taxonomy; Genomic; Lung cancer; Recursive partitioning; Tree-based regression
Year: 2014 PMID: 24739673 PMCID: PMC4129184 DOI: 10.1186/2043-9113-4-6
Source DB: PubMed Journal: J Clin Bioinforma ISSN: 2043-9113
Figure 1. Flow-chart of the four steps of the method.
Figure 2. Tree used for scenario 2 simulations. The leaves are represented by circles, and the number beneath each node is the true value of the coefficient considered in each leaf of the tree.
Figure 3. Tree used for scenario 3 simulations. The leaves are represented by circles, and the number beneath each node is the true value of the coefficient considered in each leaf of the tree.
Figure 4. Quantile-quantile plots of the observed statistics versus the “naïve” quantiles.
Figure 5. Quantile-quantile plots of the observed statistics versus the scaled quantiles.
Number of trees by number of leaves, for the 300 trees selected by the different methods under scenario 1.

| Method | | | | | | |
|---|---|---|---|---|---|---|
| BOOT | 273 | 8 | 8 | 7 | 1 | 3 |
| CV | 252 | 0 | 18 | 10 | 10 | 10 |
| BIC | 295 | 4 | 1 | 0 | 0 | 0 |
| AIC | 142 | 64 | 54 | 30 | 5 | 5 |
Number of trees by number of leaves, for the 300 trees selected by the different methods under scenario 2.

| Method | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| BOOT | 0 | 300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CV | 0 | 18 | 83 | 61 | 36 | 32 | 21 | 19 | 13 | 17 |
| BIC | 0 | 0 | 112 | 133 | 46 | 7 | 2 | 0 | 0 | 0 |
| AIC | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 24 | 265 |
Number of trees by number of leaves, for the 300 trees selected by the different methods under scenario 3.

| Method | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| BOOT | 0 | 300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CV | 0 | 0 | 41 | 16 | 22 | 154 | 22 | 12 | 17 | 16 |
| BIC | 0 | 0 | 0 | 1 | 89 | 162 | 36 | 9 | 3 | 0 |
| AIC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 297 |
Figure 6. Distribution of the 10-fold cross-validation generalization error for the AIC, BIC, and CV criteria across the 300 simulated data sets.
Variables selected by the procedure using the BIC criterion under scenario 2, with global percentages in parentheses.

| Correct variables | | | | | | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 134 (44.66%) | 22 (7.33%) | 10 (3.33%) | 0 | 0 | 1 (0.33%) |
| 3 | 112 (37.33%) | 19 (6.33%) | 2 (0.66%) | 0 | 0 | 0 |
Variables selected by the procedure using the BIC criterion under scenario 3, with global percentages in parentheses.

| Correct variables | | | | | | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 (0.33%) | 1 (0.33%) | 0 | 0 | 0 | 0 |
| 4 | 239 (79.66%) | 50 (16.66%) | 8 (2.66%) | 1 (0.33%) | 0 | 0 |
Figure 7. Optimal tree obtained with the two competing methods on the real data set: (a) BIC-selected tree, (b) original PLTR-selected tree. The leaves are represented by circles; the number in each leaf node is the number of observations falling inside the node, and the percentage is the proportion of cases inside the node.