Xiangkui Jiang1, Chang-An Wu2, Huaping Guo2.
Abstract
A forest is an ensemble whose members are decision trees. This paper proposes a novel strategy for pruning a forest to enhance the ensemble's generalization ability and reduce its size. Unlike conventional ensemble pruning approaches, the proposed method evaluates the importance of individual tree branches with respect to the whole ensemble, using a new metric called importance gain. The importance of a branch is defined by considering both ensemble accuracy and the diversity of ensemble members, so the metric estimates how much the ensemble accuracy improves when a branch is pruned. Our experiments show that the proposed method significantly reduces ensemble size and improves ensemble accuracy, whether the ensemble is constructed by an algorithm such as bagging or obtained by an ensemble selection algorithm, and whether each decision tree is pruned or unpruned.
Year: 2017 PMID: 28659973 PMCID: PMC5474283 DOI: 10.1155/2017/3162571
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1: Decision tree T0. v is a test node and v1 and v2 are two leaves.
Algorithm 1: The procedure of forest pruning.
The details of data sets used in this paper.
| Data set | #Size | #Attrs | #Cls |
|---|---|---|---|
| Australian | 226 | 70 | 24 |
| Autos | 205 | 26 | 6 |
| Backache | 180 | 33 | 2 |
| Balance-scale | 625 | 5 | 3 |
| Breast-cancer | 268 | 10 | 2 |
| Cars | 1728 | 7 | 4 |
| Credit-rating | 690 | 16 | 2 |
| German-credit | 1000 | 21 | 2 |
| Ecoli | 336 | 8 | 8 |
| Hayes-roth | 160 | 5 | 4 |
| Heart-c | 303 | 14 | 5 |
| Horse-colic | 368 | 24 | 2 |
| Ionosphere | 351 | 35 | 2 |
| Iris | 150 | 5 | 3 |
| Lymph | 148 | 19 | 4 |
| Page-blocks | 5473 | 11 | 5 |
| Pima | 768 | 9 | 2 |
| prnn-fglass | 214 | 10 | 6 |
| Vote | 439 | 17 | 2 |
Figure 2: Results on data sets. (a) Forest size (number of nodes) versus the number of times FP is run. (b) Forest accuracy versus the number of times FP is run.
Figure 3: Results on data sets. (a) Forest size (number of nodes) versus the number of decision trees. (b) Forest accuracy versus the number of decision trees. Solid curves and dashed curves represent the performance of FP and bagging, respectively.
The accuracy of FP, bagging, and random forest. ∙ represents that FP outperforms bagging in pairwise t-tests at 95% significance level and denotes that FP is outperformed by bagging.
| Dataset | Unpruned C4.5: PF | Unpruned C4.5: Bagging | Pruned C4.5: PF | Pruned C4.5: Bagging | RF: PF | RF |
|---|---|---|---|---|---|---|
| Australian | 87.14 (2.0) | 86.09 (5.0)∙ | 86.80 (3.0) | 85.86 (6.0)∙ | 87.21 (1.0) | 86.14 (4.0)∙ |
| Autos | 74.40 (2.0) | 73.30 (4.0)∙ | 74.20 (3.0) | 73.20 (5.0)∙ | 74.72 (1.0) | 73.10 (6.0)∙ |
| Backache | 85.07 (3.0) | 83.17 (5.5)∙ | 85.89 (1.0) | 83.17 (5.5)∙ | 85.21 (2.0) | 83.22 (4.0)∙ |
| Balance-scale | 78.89 (3.0) | 75.07 (6.0)∙ | 79.79 (1.0) | 76.64 (4.0)∙ | 79.65 (2.0) | 76.32 (5.0)∙ |
| Breast-cancer | 69.98 (2.0) | 67.10 (5.0)∙ | 69.97 (3.0) | 66.58 (6.0)∙ | 70.11 (1.0) | 68.88 (4.0)∙ |
| Cars | 86.51 (4.0) | 86.78 (2.0) | 86.88 (1.0) | 86.28 (5.0) | 86.55 (3.0) | 86.11 (6.0) |
| Credit-rating | 86.44 (2.0) | 85.54 (4.0)∙ | 86.34 (3.0) | 85.43 (5.0)∙ | 86.82 (1.0) | 85.42 (6.0)∙ |
| German-credit | 75.33 (1.0) | 73.83 (4.0)∙ | 74.86 (3.0) | 73.11 (6.0)∙ | 75.22 (2.0) | 73.18 (5.0)∙ |
| Ecoli | 84.47 (2.0) | 83.32 (6.0)∙ | 84.20 (3.0) | 83.40 (5.0)∙ | 84.52 (1.0) | 83.89 (4.0)∙ |
| Hayes-roth | 78.75 (3.0) | 78.63 (5.0) | 78.77 (1.0) | 76.31 (6.0)∙ | 78.76 (2.0) | 77.77 (4.0) |
| Heart-c | 80.94 (2.0) | 80.34 (5.0) | 81.01 (1.0) | 80.27 (6.0) | 80.90 (3.0) | 80.87 (4.0) |
| Horse-colic | 84.52 (1.0) | 83.29 (6.0)∙ | 84.33 (2.0) | 83.42 (5.0)∙ | 84.31 (3.0) | 83.99 (4.0) |
| Ionosphere | 93.99 (1.0) | 93.93 (2.0) | 93.59 (6.0) | 93.71 (4.0) | 93.87 (3.0) | 93.56 (5.0) |
| Iris | 93.55 (6.0) | 94.24 (4.0) | 94.52 (3.0) | 94.53 (2.0) | 94.21 (5.0) | 94.62 (1.0) |
| Lymphography | 83.81 (5.0) | 83.43 (6.0) | 84.55 (2.0) | 84.53 (3.0) | 84.38 (4.0) | 84.82 (1.0) |
| Page-blocks | 97.03 (4.5) | 97.04 (2.5) | 97.04 (2.5) | 97.06 (1.0) | 97.03 (4.5) | 97.01 (6.0) |
| Pima | 75.09 (3.0) | 74.27 (4.0)∙ | 75.46 (1.0) | 74.06 (5.0)∙ | 75.43 (2.0) | 73.21 (6.0)∙ |
| prnn-fglass | 78.14 (4.0) | 78.46 (1.0) | 77.62 (6.0) | 77.84 (5.0) | 78.18 (3.0) | 78.32 (2.0) |
| Vote | 95.77 (1.0) | 95.13 (6.0)∙ | 95.67 (3.0) | 95.33 (4.0) | 95.72 (2.0) | 95.31 (5.0) |
The average ranks of the algorithms under the Friedman test, where Alg1, Alg2, Alg3, Alg4, Alg5, and Alg6 indicate PF pruning bagging with unpruned C4.5, bagging with unpruned C4.5, PF pruning bagging with pruned C4.5, bagging with pruned C4.5, PF pruning random forest, and random forest, respectively.
| Algorithm | Alg5 | Alg3 | Alg1 | Alg2 | Alg6 | Alg4 |
|---|---|---|---|---|---|---|
| Ranks | 2.39 | 2.50 | 2.71 | 4.32 | 4.42 | 4.66 |
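
The average ranks above are the kind of quantity obtained by ranking the six algorithms on each data set (rank 1 for the highest accuracy, tied accuracies sharing the mean of their ranks) and then averaging each algorithm's rank over all data sets. The following is a minimal sketch of that ranking step, using three rows copied from the accuracy table; the function name `average_ranks` is hypothetical, and the sketch is not meant to reproduce the full table of ranks.

```python
def average_ranks(scores):
    """Rank scores in descending order (rank 1 = best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of scores tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


# Accuracies for Alg1..Alg6, copied from three rows of the accuracy table.
rows = {
    "Australian": [87.14, 86.09, 86.80, 85.86, 87.21, 86.14],
    "Backache": [85.07, 83.17, 85.89, 83.17, 85.21, 83.22],
    "Page-blocks": [97.03, 97.04, 97.04, 97.06, 97.03, 97.01],
}

per_dataset = {name: average_ranks(acc) for name, acc in rows.items()}

# Average rank of each algorithm over the (three) data sets used here.
mean_ranks = [sum(col) / len(col) for col in zip(*per_dataset.values())]

for name, ranks in per_dataset.items():
    print(name, ranks)
print("mean ranks over these rows:", [round(r, 2) for r in mean_ranks])
```

For these three rows the per-data-set ranks agree with the values shown in parentheses in the accuracy table, including the tied ranks 5.5, 2.5, and 4.5.
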
The results of the post hoc tests, where Alg1, Alg2, Alg3, Alg4, Alg5, and Alg6 indicate PF pruning bagging with unpruned C4.5, bagging with unpruned C4.5, PF pruning bagging with pruned C4.5, bagging with pruned C4.5, PF pruning random forest, and random forest, respectively.
| Comparison | Statistic | p-value |
|---|---|---|
| Alg1 versus Alg2 | 2.64469 | 0.04088 |
| Alg3 versus Alg4 | 3.55515 | 0.00189 |
| Alg5 versus Alg6 | 3.33837 | 0.01264 |
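
The statistic column appears consistent with the usual z statistic for comparing two average ranks after a Friedman test, z = |R_i - R_j| / sqrt(k(k + 1) / (6N)), with k = 6 algorithms and N = 19 data sets. The source does not state this formula, so it is an assumption here, and the function name `friedman_posthoc_z` is hypothetical. The sketch below uses the rounded average ranks from the rank table, so its results differ from the tabulated statistics in the second or third decimal place.

```python
import math


def friedman_posthoc_z(rank_a, rank_b, k, n):
    """z statistic for the difference of two average ranks (assumed formula)."""
    standard_error = math.sqrt(k * (k + 1) / (6.0 * n))
    return abs(rank_a - rank_b) / standard_error


K = 6   # number of algorithms compared
N = 19  # number of data sets

# Rounded average ranks copied from the rank table.
ranks = {
    "Alg1": 2.71, "Alg2": 4.32, "Alg3": 2.50,
    "Alg4": 4.66, "Alg5": 2.39, "Alg6": 4.42,
}

z_values = {}
for a, b in [("Alg1", "Alg2"), ("Alg3", "Alg4"), ("Alg5", "Alg6")]:
    z_values[(a, b)] = friedman_posthoc_z(ranks[a], ranks[b], K, N)
    print(f"{a} versus {b}: z = {z_values[(a, b)]:.3f}")
```
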
The size (number of nodes) of PF and bagging. ∙ denotes that the size of PF is significantly smaller than that of the corresponding compared method.
| Dataset | Unpruned C4.5: PF | Unpruned C4.5: Bagging | Pruned C4.5: PF | Pruned C4.5: Bagging | PF-RF | RF |
|---|---|---|---|---|---|---|
| Australian | 4440.82 ± 223.24 | 5950.06 ± 210.53∙ | 2194.71 ± 99.65 | 2897.88 ± 98.66∙ | 1989.67 ± 99.65 | 2653.88 ± 99.61∙ |
| Autos | 1134.83 ± 193.45 | 1813.19 ± 183.49∙ | 987.82 ± 198.22 | 1523.32 ± 193.22∙ | 954.26 ± 198.22 | 1429.12 ± 182.21∙ |
| Backache | 1162.79 ± 96.58 | 1592.80 ± 75.97∙ | 518.77 ± 40.49 | 764.24 ± 37.78∙ | 522.74 ± 40.49 | 789.23 ± 45.62∙ |
| Balance-scale | 3458.52 ± 74.55 | 4620.58 ± 78.20∙ | 3000.44 ± 71.76 | 3762.60 ± 65.55∙ | 2967.44 ± 71.76 | 3763.19 ± 79.46∙ |
| Breast-cancer | 2164.64 ± 156.41 | 3194.20 ± 144.95∙ | 843.96 ± 129.44 | 1189.33 ± 154.08∙ | 886.66 ± 129.44 | 1011.21 ± 148.92∙ |
| Cars | 1741.68 ± 60.59 | 2092.20 ± 144.95∙ | 1569.11 ± 57.55 | 1834.91 ± 46.80∙ | 1421.32 ± 56.65 | 1899.92 ± 68.88∙ |
| Credit-rating | 4370.65 ± 219.27 | 5940.51 ± 223.51∙ | 2168.11 ± 121.51 | 2904.40 ± 99.73∙ | 2015.21 ± 140.58 | 2650.40 ± 102.13∙ |
| German-credit | 9270.75 ± 197.62 | 11464.19 ± 168.63∙ | 4410.11 ± 114.94 | 5421.60 ± 107.24∙ | 4311.54 ± 124.68 | 5340.60 ± 217.48∙ |
| Ecoli | 1366.62 ± 61.68 | 1736.52 ± 64.91∙ | 1304.30 ± 54.39 | 1611.02 ± 56.31∙ | 1324.30 ± 54.42 | 1820.02 ± 88.74∙ |
| Hayes-roth | 498.65 ± 28.99 | 697.58 ± 40.87∙ | 272.30 ± 45.11 | 308.48 ± 53.86∙ | 264.24 ± 46.46 | 299.48 ± 63.84∙ |
| Heart-c | 1503.46 ± 65.47 | 1946.94 ± 62.52∙ | 647.89 ± 102.15 | 974.93 ± 129.83∙ | 647.89 ± 102.15 | 1032.93 ± 111.57∙ |
| Horse-colic | 2307.67 ± 106.99 | 3625.23 ± 116.63∙ | 684.29 ± 106.35 | 974.93 ± 129.83∙ | 647.89 ± 102.15 | 743.25 ± 120.43∙ |
| Ionosphere | 552.49 ± 61.41 | 680.43 ± 69.95∙ | 521.83 ± 58.01 | 634.73 ± 64.44∙ | 542.58 ± 96.02 | 665.84 ± 66.44∙ |
| Iris | 168.46 ± 111.12 | 222.66 ± 150.42∙ | 144.52 ± 97.26 | 191.84 ± 133.12∙ | 133.24 ± 98.32 | 212.55 ± 129.47∙ |
| Lymphography | 1089.87 ± 67.16 | 1394.37 ± 61.85∙ | 711.62 ± 37.61 | 856.44 ± 30.83∙ | 724.53 ± 37.61 | 924.33 ± 50.78∙ |
| Page-blocks | 1420.05 ± 278.51 | 2187.45 ± 555.02∙ | 1394.11 ± 600.06 | 2092.93 ± 403.79∙ | 1401.11 ± 588.03 | 2134.40 ± 534.97∙ |
| Pima | 2202.41 ± 674.18 | 2776.77 ± 852.95∙ | 2021.19 ± 698.02 | 2481.64 ± 747.19∙ | 1927.67 ± 625.27 | 2521.43 ± 699.82∙ |
| prnn-fglass | 1219.98 ± 39.85 | 1398.62 ± 36.29∙ | 1145.20 ± 39.76 | 1269.28 ± 35.52∙ | 1098.18 ± 34.26 | 1314.05 ± 60.97∙ |
| Vote | 303.06 ± 124.00 | 527.80 ± 225.05∙ | 174.04 ± 77.61 | 276.00 ± 127.46∙ | 182.14 ± 76.21 | 288.33 ± 113.76∙ |
The performance of FP on pruning the subensemble obtained by EPIC on bagging. ∙ represents that FP is significantly better (or smaller) than EPIC in pairwise t-tests at the 95% significance level and denotes that FP is significantly worse (or larger) than EPIC.
| Dataset | Error rate: PF | Error rate: EPIC | Size: PF | Size: EPIC |
|---|---|---|---|---|
| Australian | 86.83 ± 3.72 | 86.22 ± 3.69∙ | 2447.50 ± 123.93 | 3246.16 ± 116.07∙ |
| Autos | 84.83 ± 4.46 | 82.11 ± 5.89∙ | 708.01 ± 54.55 | 931.44 ± 51.16∙ |
| Backache | 84.83 ± 4.46 | 82.11 ± 5.89∙ | 708.01 ± 54.55 | 931.44 ± 51.16∙ |
| Balance-scale | 79.74 ± 3.69 | 78.57 ± 3.82∙ | 3277.76 ± 85.07 | 4030.82 ± 94.67∙ |
| Breast-cancer | 70.26 ± 7.24 | 67.16 ± 8.36∙ | 843.96 ± 129.44 | 1189.33 ± 154.08∙ |
| Cars | 87.02 ± 5.06 | 86.83 ± 5.04 | 178.32 ± 60.44 | 2022.81 ± 53.19∙ |
| Credit-rating | 86.13 ± 3.92 | 85.61 ± 3.95∙ | 2414.60 ± 123.66 | 3226.25 ± 131.46∙ |
| German-credit | 74.98 ± 3.63 | 73.13 ± 4.00∙ | 4410.11 ± 114.94 | 6007.28 ± 124.30∙ |
| Ecoli | 83.77 ± 5.96 | 83.24 ± 5.98∙ | 1498.86 ± 62.27 | 1806.26 ± 70.98∙ |
| Hayes-roth | 78.75 ± 9.57 | 76.81 ± 9.16∙ | 275.09 ± 47.90 | 311.32 ± 57.05∙ |
| Heart-c | 81.21 ± 6.37 | 79.99 ± 6.65∙ | 1230.14 ± 54.80 | 1510.57 ± 52.56∙ |
| Horse-colic | 84.53 ± 5.30 | 83.80 ± 6.11∙ | 940.07 ± 66.64 | 1337.60 ± 75.73∙ |
| Ionosphere | 93.90 ± 4.05 | 94.02 ± 3.83 | 590.63 ± 65.62 | 706.79 ± 73.17∙ |
| Iris | 94.47 ± 5.11 | 94.47 ± 5.02 | 152.58 ± 108.04 | 197.80 ± 141.31∙ |
| Lymphography | 81.65 ± 9.45 | 81.46 ± 9.39 | 858.42 ± 46.50 | 1022.67 ± 39.68∙ |
| Page-blocks | 97.02 ± 0.74 | 97.07 ± 0.69 | 1396.63 ± 237.03 | 2086.89 ± 399.10∙ |
| Pima | 74.92 ± 3.94 | 74.03 ± 3.58∙ | 2391.95 ± 764.16 | 2910.31 ± 936.70∙ |
| prnn-fglass | 78.13 ± 8.06 | 77.99 ± 8.44 | 1280.14 ± 43.85 | 1410.84 ± 39.59∙ |
| Vote | 95.70 ± 2.86 | 95.33 ± 2.97 | 177.36 ± 86.10 | 281.62 ± 140.60∙ |