Hemant Ishwaran, James D Malley.
Abstract
BACKGROUND: A synthetic random forest is defined as a kind of hyperforest. A collection of random forests is first constructed, each with a different terminal nodesize, and each forest generates a synthetic feature. The synthetic random forest is then calculated using these new synthetic features as inputs, along with the original features.
Keywords: Machine; Nodesize; Random forest; Synthetic feature; Trees
Year: 2014 PMID: 25614764 PMCID: PMC4279689 DOI: 10.1186/s13040-014-0028-y
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
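The construction described in the abstract can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses scikit-learn's `RandomForestRegressor`, treats `min_samples_leaf` as a stand-in for terminal nodesize, and takes each base forest's out-of-bag predictions as its synthetic feature on the training data. The function names, the nodesize grid, and the tree count are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_synthetic_forest(X, y, nodesizes=(1, 5, 10, 20), n_trees=200, seed=0):
    """Fit base forests over a nodesize grid, then a hyperforest on [X, synthetic]."""
    base, synth_cols = [], []
    for k, ns in enumerate(nodesizes):
        rf = RandomForestRegressor(
            n_estimators=n_trees,
            min_samples_leaf=ns,   # assumption: proxy for terminal nodesize
            bootstrap=True,
            oob_score=True,        # makes oob_prediction_ available after fit
            random_state=seed + k,
        )
        rf.fit(X, y)
        base.append(rf)
        # Assumption: the synthetic feature is the out-of-bag prediction.
        synth_cols.append(rf.oob_prediction_)
    # The hyperforest sees the original features plus one column per base forest.
    Z = np.column_stack([X] + synth_cols)
    hyper = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(Z, y)
    return base, hyper


def predict_synthetic_forest(base, hyper, X_new):
    """For new data, the synthetic features are the base forests' predictions."""
    synth = [rf.predict(X_new) for rf in base]
    return hyper.predict(np.column_stack([X_new] + synth))
```

Out-of-bag predictions are used on the training data so that the synthetic features are not fitted values from trees that saw the same observation. That is a design choice of this sketch.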
Regression benchmark performance
| Data set | n | p | COBRA | RF | RFopt | SRF |
|---|---|---|---|---|---|---|
| Air | 111 | 5 | 27.24 | 28.68 | 27.53 | 28.14 |
| Air2 | 111 | 5 | 28.40 | 30.72 | 28.85 | 28.36 |
| Automobile | 193 | 29 | 9.83 | 8.94 | 6.79 | 7.52 |
| Bodyfat | 252 | 13 | 31.36 | 32.02 | 31.67 | 32.19 |
| BostonHousing | 506 | 13 | 18.88 | 14.64 | 12.39 | 12.80 |
| BostonHousing2 | 506 | 16 | 17.44 | 13.57 | 11.32 | 11.61 |
| CMB | 899 | 4 | 96.33 | 100.90 | 90.32 | 89.86 |
| Crime | 47 | 15 | 61.74 | 59.99 | 59.51 | 59.03 |
| Diabetes | 442 | 10 | 57.58 | 53.22 | 53.14 | 55.20 |
| DiabetesI | 442 | 64 | 57.05 | 54.42 | 54.61 | 55.92 |
| Fitness | 31 | 6 | 83.34 | 66.48 | 59.61 | 57.76 |
| Highway | 39 | 11 | 38.84 | 43.67 | 33.95 | 32.18 |
| Iowa | 33 | 9 | 62.60 | 62.16 | 50.03 | 50.22 |
| Ozone | 203 | 12 | 26.90 | 26.19 | 26.20 | 26.42 |
| OzoneI | 203 | 134 | 27.42 | 26.14 | 26.32 | 26.08 |
| Pollute | 60 | 15 | 49.64 | 51.36 | 49.52 | 46.74 |
| Prostate | 97 | 8 | 87.32 | 46.02 | 46.95 | 50.12 |
| Servo | 167 | 19 | 15.22 | 21.47 | 11.27 | 11.99 |
| ServoFactor | 167 | 16 | 43.24 | 34.65 | 32.54 | 31.44 |
| Tecator | 215 | 22 | 13.84 | 16.11 | 13.48 | 6.19 |
| Tecator2 | 215 | 100 | 31.24 | 34.21 | 30.64 | 27.94 |
| Windmill | 1114 | 12 | 31.64 | 31.39 | 31.31 | 32.15 |
| expon | 250 | 2 | 47.76 | 46.04 | 46.48 | 47.60 |
| expon.noise | 250 | 17 | 62.13 | 67.49 | 66.44 | 53.04 |
| mlb.friedman1 | 250 | 10 | 21.46 | 26.11 | 24.15 | 19.04 |
| mlb.friedman1.noise | 250 | 10 | 30.91 | 34.77 | 33.13 | 30.48 |
| mlb.friedman1.bigp | 250 | 250 | 37.67 | 44.14 | 43.81 | 31.99 |
| mlb.friedman2 | 250 | 4 | 13.94 | 14.75 | 14.24 | 14.04 |
| mlb.friedman2.noise | 250 | 4 | 37.19 | 36.77 | 36.80 | 38.58 |
| mlb.friedman2.bigp | 250 | 254 | 22.92 | 29.01 | 28.10 | 17.73 |
| mlb.friedman3 | 250 | 4 | 19.21 | 22.01 | 19.87 | 15.59 |
| mlb.friedman3.noise | 250 | 4 | 37.47 | 39.38 | 38.53 | 36.97 |
| mlb.friedman3.bigp | 250 | 254 | 37.19 | 46.72 | 45.47 | 26.78 |
| mlb.peak | 250 | 20 | 14.75 | 17.24 | 16.28 | 6.21 |
| mlb.peak.bigp | 250 | 20 | 14.75 | 17.24 | 16.28 | 6.21 |
| mlb.noise | 250 | 500 | 101.69 | 100.75 | 100.47 | 100.29 |
| sine | 250 | 2 | 35.92 | 37.79 | 35.95 | 34.72 |
| sine.noise | 250 | 5 | 56.64 | 66.07 | 61.14 | 54.71 |
| syn.ex1 | 250 | 50 | 20.69 | 30.87 | 28.57 | 8.54 |
| syn.ex2 | 250 | 20 | 88.60 | 89.66 | 89.59 | 92.68 |
| syn.ex3 | 250 | 50 | 43.88 | 47.88 | 47.50 | 43.04 |
| syn.ex4 | 250 | 50 | 34.75 | 37.78 | 36.90 | 30.40 |
| syn.ex5 | 250 | 20 | 62.50 | 65.07 | 64.82 | 62.80 |
| syn.ex6 | 250 | 30 | 102.30 | 100.58 | 103.16 | |
| syn.ex7 | 250 | 300 | 55.13 | 61.68 | 61.38 | 52.41 |
| syn.ex8 | 250 | 50 | 117.93 | 58.11 | 57.76 | 52.01 |
Cross-validated and test-set standardized mean-squared error (MSE) performance over 100 independent replications. Standardized MSE is obtained by dividing the MSE by the variance of the Y-response and multiplying by 100.
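The standardization in the caption can be written out as a short Python function. This is an illustrative sketch: the function name is hypothetical, and the use of the population variance of the observed responses is an assumption, since the caption does not say which variance estimator was used.

```python
from statistics import fmean, pvariance


def standardized_mse(y_true, y_pred):
    """Return 100 * MSE / Var(Y) for paired observed and predicted responses."""
    mse = fmean((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Assumption: population variance of the observed responses.
    return 100.0 * mse / pvariance(y_true)
```

Under this convention, a predictor that always returns the mean of the observed responses scores 100, so table values well below 100 indicate an improvement over that baseline.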
Figure 1. Regression benchmark results. Cross-validated and test-set standardized mean-squared error (MSE) performance over 100 independent replications. Boxplots display results from the 100 replications for COBRA (gray square symbol), RF (red square symbol), optimized random forests RFopt (blue square symbol), and synthetic random forests SRF (■). Standardized MSE is obtained by dividing the MSE by the variance of the Y-response and multiplying by 100.
Regression benchmark performance
| | COBRA | RF | RFopt | SRF |
|---|---|---|---|---|
| COBRA | | 0.0968 | 0.3703 | 0.0000 |
| RF | 388 | | 0.0000 | 0.0000 |
| RFopt | 623 | 1029 | | 0.0018 |
| SRF | 1005 | 966 | 827 | |
Upper diagonal values are Wilcoxon signed rank p-values comparing two procedures; lower diagonal values are the corresponding test statistic. Diagonal values (in bold) record the overall rank of a procedure.
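A paired comparison of this kind can be sketched in Python as follows. This is a hedged illustration rather than the procedure used for the table: the function name is hypothetical, the statistic is taken to be the sum of ranks of the positive differences, zero differences are dropped, and the two-sided p-value uses a plain normal approximation with no tie or continuity correction. It assumes at least one nonzero difference.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(a, b):
    """Return (V, p) for paired samples a and b, one pair per data set."""
    # Drop pairs with no difference (a common convention, assumed here).
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give it the average rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # V: sum of ranks attached to positive differences.
    v = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (v - mean) / sd
    p = erfc(abs(z) / sqrt(2))  # two-sided normal-approximation p-value
    return v, p
```

Here `a` and `b` would be the per-data-set errors of two procedures, for example two columns of the benchmark table. Statistical packages differ in which statistic they report and in how they handle ties, so values from this sketch need not match the table.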
Multiclass benchmark performance
| Data set | n | p | Classes | RF | RFopt | SRF |
|---|---|---|---|---|---|---|
| BreastCancer | 683 | 10 | 2 | 2.59 | 2.50 | 2.28 |
| DNA | 3186 | 180 | 3 | 3.03 | 2.79 | 2.34 |
| Esophagus | 3127 | 28 | 2 | 18.33 | 17.81 | 18.27 |
| Glass | 214 | 9 | 6 | 6.16 | 6.20 | 5.78 |
| HouseVotes84 | 232 | 16 | 2 | 5.85 | 4.82 | 4.41 |
| Hypothyroid | 2000 | 24 | 2 | 1.20 | 1.18 | 1.14 |
| Ionosphere | 351 | 34 | 2 | 5.76 | 5.28 | 5.14 |
| PimaIndiansDiabetes | 768 | 8 | 2 | 15.69 | 15.66 | 16.21 |
| Prostate | 158 | 20 | 2 | 15.81 | 15.83 | 16.02 |
| Satellite | 6435 | 36 | 6 | 2.30 | 1.98 | 1.92 |
| SickEuthyroid | 2000 | 24 | 2 | 2.51 | 2.35 | 2.30 |
| Sonar | 208 | 60 | 2 | 12.91 | 12.46 | 9.73 |
| SouthAfricanHeart | 462 | 9 | 2 | 19.69 | 19.34 | 19.86 |
| Soybean | 562 | 35 | 15 | 0.82 | 0.71 | 0.77 |
| Spam | 4601 | 57 | 2 | 4.39 | 4.18 | 3.74 |
| Vehicle | 846 | 18 | 4 | 7.51 | 8.81 | 6.82 |
| Vowel | 990 | 10 | 11 | 2.66 | 1.81 | 1.09 |
| WisconsinBreast | 699 | 10 | 2 | 3.13 | 3.05 | 3.07 |
| Zoo | 101 | 16 | 7 | 1.53 | 0.51 | 1.30 |
| aging | 29 | 8740 | 3 | 16.64 | 16.96 | 16.54 |
| brain | 42 | 5597 | 5 | 8.32 | 7.07 | 7.99 |
| colon | 62 | 2000 | 2 | 12.88 | 12.78 | 12.78 |
| leukemia | 72 | 3571 | 2 | 4.06 | 3.95 | 2.45 |
| lymphoma | 62 | 4026 | 3 | 2.71 | 2.62 | 2.31 |
| prostate | 102 | 6033 | 2 | 8.36 | 8.25 | 5.85 |
| srbct | 63 | 2308 | 4 | 3.62 | 3.70 | 2.45 |
| mlb.cassini | 250 | 2 | 3 | 0.92 | 0.55 | 0.62 |
| mlb.circle | 250 | 2 | 2 | 5.27 | 4.59 | 4.26 |
| mlb.cuboids | 250 | 3 | 4 | 0.66 | 0.53 | 0.57 |
| mlb.dnormals | 250 | 2 | 2 | 6.24 | 6.30 | 6.31 |
| mlb.ringnorm | 250 | 20 | 2 | 10.71 | 10.17 | 4.83 |
| mlb.shapes | 250 | 2 | 4 | 0.87 | 0.70 | 0.52 |
| mlb.smiley | 250 | 2 | 4 | 0.51 | 0.26 | 0.58 |
| mlb.spirals | 250 | 2 | 2 | 1.66 | 0.72 | 0.18 |
| mlb.threenorm | 250 | 20 | 2 | 15.62 | 15.34 | 12.98 |
| mlb.twonorm | 250 | 20 | 2 | 8.50 | 7.87 | 4.31 |
| mlb.waveform | 250 | 21 | 3 | 9.34 | 9.47 | 7.83 |
| mlb.xor | 250 | 2 | 2 | 3.61 | 2.50 | 1.30 |
Cross-validated and test-set Brier score performance (× 100) over 100 independent replications.
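For reference, a multiclass Brier score scaled by 100 can be computed as in the Python sketch below. The function name is hypothetical, and the definition used (squared error between predicted class probabilities and the one-hot outcome, summed over classes and averaged over observations) is one common convention. The caption does not state the normalization, so treat this as an assumption.

```python
def brier_score_x100(probs, labels):
    """Return 100 * mean over observations of sum_j (p_ij - 1{y_i = j})^2.

    probs  : list of per-class probability lists, one per observation
    labels : list of integer class indices (0-based), one per observation
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pj - (1.0 if j == y else 0.0)) ** 2 for j, pj in enumerate(p))
    return 100.0 * total / len(labels)
```

Lower values are better: a perfectly confident, correct classifier scores 0 under this definition.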
Figure 2. Multiclass benchmark results. Cross-validated and test-set Brier score performance (× 100) over 100 independent replications. Boxplots display results from the 100 replications for RF (red square symbol), optimized random forests RFopt (blue square symbol), and synthetic random forests SRF (■).
Multiclass benchmark performance
| | RF | RFopt | SRF |
|---|---|---|---|
| RF | | 0.0000 | 0.0000 |
| RFopt | 648 | | 0.0045 |
| SRF | 688 | 563 | |
Upper diagonal values are Wilcoxon signed rank p-values comparing two procedures; lower diagonal values are the corresponding test statistic. Diagonal values (in bold) record the overall rank of a procedure.