Wojciech Wieczorek, Jan Kozak, Łukasz Strąk, Arkadiusz Nowakowski.
Abstract
A new two-stage method for the construction of a decision tree is developed. The first stage is based on the definition of a minimum query set, which is the smallest set of attribute-value pairs for which any two objects can be distinguished. To obtain this set, an appropriate linear programming model is proposed. The queries from this set are the building blocks of the second stage, in which we try to find an optimal decision tree using a genetic algorithm. In a series of experiments, we show that for some databases, our approach should be considered as an alternative to classical methods (CART, C4.5) and other heuristic approaches in terms of classification quality.
Keywords: classification; decision tree; query set
Year: 2021 PMID: 34945988 PMCID: PMC8700169 DOI: 10.3390/e23121682
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
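To make the notion of a minimum query set from the abstract concrete, the following minimal sketch finds one by exhaustive search on a toy data set. It is not the linear programming model of the paper: brute force is swapped in for brevity and is feasible only for tiny inputs. The function name `minimum_query_set`, the representation of objects as tuples, and the reading of a query `(a, v)` as the yes/no question "does attribute `a` equal `v`?" are assumptions made for illustration.

```python
from itertools import combinations


def separates(query, x, y):
    """A query (a, v) separates x and y if exactly one of them has x[a] == v."""
    a, v = query
    return (x[a] == v) != (y[a] == v)


def minimum_query_set(objects):
    """Return a smallest list of (attribute, value) queries that distinguishes
    every pair of distinct objects (exhaustive search, illustrative only)."""
    n_attrs = len(objects[0])
    # All attribute-value pairs that occur in the data (assumes comparable values).
    queries = sorted({(a, row[a]) for row in objects for a in range(n_attrs)})
    pairs = [(x, y) for x, y in combinations(objects, 2) if x != y]
    # Try subsets in order of increasing size, so the first hit is minimal.
    for size in range(len(queries) + 1):
        for subset in combinations(queries, size):
            if all(any(separates(q, x, y) for q in subset) for x, y in pairs):
                return list(subset)
    return None


# Toy data: four objects described by two binary attributes.
toy = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(minimum_query_set(toy))
```

On this toy input no single query can tell all four objects apart, so the search returns a set of two queries, one per attribute.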
Figure 1. An exemplary decision tree.
Characteristics of data sets.
| Data Set | Objects | Number of Attributes | Classes |
|---|---|---|---|
| balance-scale (bs) | 625 | 4 | 3 |
| breast-cancer-wisconsin (bcw) | 699 | 9 | 2 |
| car (car) | 1728 | 6 | 4 |
| dermatology (derm) | 366 | 34 | 6 |
| house-votes-84 (hv84) | 435 | 16 | 2 |
| lymphography (lymp) | 148 | 18 | 4 |
| monks-1 (monk1) | 432 | 6 | 2 |
| Somerville Happiness Survey 2015 (SHS) | 143 | 6 | 2 |
| soybean-large (soy-l) | 307 | 35 | 19 |
| tic-tac-toe (ttt) | 958 | 9 | 2 |
| zoo (zoo) | 101 | 16 | 7 |
The quality of classification depending on the approach (bold text is the best value).
| Data Set | Measure | MQS | C4.5 | CART | EVO | ACDT |
|---|---|---|---|---|---|---|
| bs | acc | 0.7551 | 0.6809 | | 0.7730 | 0.7936 |
| | pre | 0.5559 | 0.4562 | | 0.5196 | 0.5482 |
| | rec | 0.5360 | 0.4843 | | 0.5505 | 0.5646 |
| | f1 | 0.5436 | 0.4656 | | 0.5290 | 0.5538 |
| bcw | acc | 0.8817 | | 0.9190 | 0.9317 | 0.9192 |
| | pre | 0.8855 | | 0.9252 | 0.9270 | 0.9144 |
| | rec | 0.8808 | 0.9261 | 0.9059 | | 0.9173 |
| | f1 | 0.8812 | | 0.9135 | 0.9290 | 0.9158 |
| car | acc | 0.9210 | 0.9056 | | 0.7069 | 0.9492 |
| | pre | 0.7946 | 0.7667 | | 0.3029 | 0.8511 |
| | rec | 0.8565 | 0.7600 | | 0.2609 | 0.9131 |
| | f1 | 0.8205 | 0.7630 | | 0.2306 | 0.8714 |
| derm | acc | 0.8861 | | 0.9273 | 0.7879 | 0.9361 |
| | pre | 0.8605 | | 0.9152 | 0.7753 | 0.9276 |
| | rec | 0.8478 | 0.9244 | 0.9157 | 0.7225 | |
| | f1 | 0.8488 | | 0.9142 | 0.7293 | 0.9253 |
| hv84 | acc | 0.9078 | 0.9466 | 0.9313 | | 0.9450 |
| | pre | 0.8897 | 0.9300 | 0.9224 | | 0.9385 |
| | rec | 0.9096 | 0.9534 | 0.9326 | | 0.9442 |
| | f1 | 0.8981 | 0.9436 | 0.9269 | | 0.9412 |
| lymp | acc | | | | 0.7896 | 0.8163 |
| | pre | 0.5411 | | 0.6677 | 0.6178 | 0.5764 |
| | rec | 0.6683 | | 0.6722 | 0.4980 | 0.5741 |
| | f1 | 0.5837 | | 0.6679 | 0.5290 | 0.5718 |
| monk1 | acc | | 0.8385 | 0.9538 | 0.7959 | 0.9331 |
| | pre | | 0.8807 | 0.9548 | 0.8469 | 0.9330 |
| | rec | | 0.8333 | 0.9548 | 0.7899 | 0.9330 |
| | f1 | | 0.8323 | 0.9538 | 0.7857 | 0.9330 |
| SHS | acc | | 0.4419 | 0.4186 | 0.4682 | 0.4985 |
| | pre | | 0.5974 | 0.4378 | 0.5837 | 0.6118 |
| | rec | | 0.5428 | 0.4352 | 0.5532 | 0.5785 |
| | f1 | | 0.4028 | 0.4173 | 0.4481 | 0.4844 |
| soy-l | acc | 0.5634 | 0.8478 | | 0.4706 | 0.7789 |
| | pre | 0.4974 | | 0.8560 | 0.4912 | 0.7173 |
| | rec | 0.6348 | | 0.8382 | 0.3224 | 0.6909 |
| | f1 | 0.5294 | 0.8229 | | 0.3418 | 0.6367 |
| ttt | acc | | 0.8368 | 0.9132 | 0.7434 | 0.8927 |
| | pre | | 0.8092 | 0.8951 | 0.7387 | 0.8978 |
| | rec | | 0.8146 | 0.9066 | 0.6175 | 0.8485 |
| | f1 | | 0.8118 | 0.9005 | 0.6217 | 0.8675 |
| zoo | acc | 0.8800 | | | 0.8720 | 0.9505 |
| | pre | 0.7381 | | 0.7857 | 0.7998 | 0.9080 |
| | rec | 0.8163 | | 0.8571 | 0.7539 | 0.8964 |
| | f1 | 0.7636 | | 0.8095 | 0.7587 | 0.8857 |
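The measures reported in the table (acc, pre, rec, f1) can be computed as in the sketch below. The table does not state how precision, recall, and F1 are aggregated over classes; macro-averaging (the unweighted mean over classes) is assumed here, and `classification_scores` is a hypothetical helper name rather than anything taken from the paper.

```python
def classification_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Macro-averaging is an assumption; the source table does not specify it.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    acc = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        precisions.append(pre)
        recalls.append(rec)
        f1s.append(f1)

    k = len(labels)
    return {
        "acc": acc,
        "pre": sum(precisions) / k,
        "rec": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Tiny two-class example.
print(classification_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

With macro-averaging, a rare class that is classified poorly pulls precision and recall well below accuracy, which is the pattern seen for balance-scale (bs) in the table.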
Decision tree characteristics depending on the approach.
| Data Set | Parameter | MQS | C4.5 | CART | EVO | ACDT |
|---|---|---|---|---|---|---|
| bs | time [s] | 76.1 | <0.1 | <0.1 | 20.5 | 0.3 |
| | size | 257.1 | 31.0 | 241.0 | 15.1 | 79.4 |
| | height | 14.9 | 4.0 | 10.0 | 8.1 | 8.9 |
| bcw | time [s] | 11.7 | <0.1 | <0.1 | 12.5 | 0.2 |
| | size | 51.1 | 22.0 | 71.0 | 9.1 | 18.0 |
| | height | 8.0 | 3.0 | 12.0 | 5.4 | 5.7 |
| car | time [s] | 114.1 | <0.1 | <0.1 | 11.2 | 0.5 |
| | size | 318.9 | 134.0 | 163.0 | 1.7 | 109.4 |
| | height | 13.3 | 6.0 | 14.0 | 1.5 | 11.8 |
| derm | time [s] | 26.6 | <0.1 | <0.1 | 10.2 | 0.4 |
| | size | 64.7 | 25.0 | 27.0 | 10.8 | 16.6 |
| | height | 9.0 | 7.0 | 10.0 | 5.7 | 6.8 |
| hv84 | time [s] | 2.6 | <0.1 | <0.1 | 4.3 | 0.1 |
| | size | 31.6 | 7.0 | 41.0 | 3.8 | 16.4 |
| | height | 5.9 | 3.0 | 6.0 | 2.5 | 4.2 |
| lymp | time [s] | 1.2 | <0.1 | <0.1 | 5.5 | 0.1 |
| | size | 34.0 | 20.0 | 49.0 | 11.0 | 20.0 |
| | height | 6.0 | 6.0 | 7.0 | 6.2 | 5.0 |
| monk1 | time [s] | 0.1 | <0.1 | <0.1 | 3.9 | 0.1 |
| | size | 20.3 | 32.0 | 89.0 | 4.1 | 23.0 |
| | height | 5.0 | 5.0 | 10.0 | 2.6 | 6.1 |
| SHS | time [s] | 2.4 | <0.1 | <0.1 | 2.5 | 0.1 |
| | size | 64.3 | 9.0 | 87.0 | 8.0 | 15.6 |
| | height | 7.9 | 3.0 | 13.0 | 4.2 | 6.1 |
| soy-l | time [s] | 14.6 | <0.1 | <0.1 | 14.3 | 1.3 |
| | size | 151.8 | 67.0 | 75.0 | 14.0 | 45.8 |
| | height | 9.3 | 9.0 | 17.0 | 6.6 | 8.8 |
| ttt | time [s] | 24.9 | <0.1 | <0.1 | 17.5 | 0.4 |
| | size | 228.2 | 124.0 | 151.0 | 7.8 | 54.2 |
| | height | 9.0 | 7.0 | 11.0 | 4.1 | 8.0 |
| zoo | time [s] | 0.5 | <0.1 | <0.1 | 3.3 | <0.1 |
| | size | 19.1 | 15.0 | 19.0 | 9.0 | 13.2 |
| | height | 4.9 | 6.0 | 7.0 | 5.1 | 4.9 |
Figure 2. Box plot of classification accuracy for the MQS algorithm.
Figure 3. Box plot of classification accuracy for the EVO-Tree algorithm.
Figure 4. Box plot of classification accuracy for the ACDT algorithm.
The Friedman test results and mean ranks.
| Friedman test | Value |
|---|---|
| N | 44 |
| Chi-Square | 24.0594 |
| degrees of freedom | 4 |
| p-value | 0.0001 |
| 5% critical difference | 0.6192 |

| Approach | Mean rank |
|---|---|
| MQS | 3.1591 |
| C4.5 | 2.6932 |
| CART | 2.5568 |
| EVO | 3.9545 |
| ACDT | 2.6364 |
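As a rough consistency check, the Friedman chi-square statistic can be recomputed from the mean ranks listed above with the textbook formula χ² = 12N / (k(k+1)) · (Σ R_j² − k(k+1)² / 4), where N is the number of blocks, k the number of approaches, and R_j the mean rank of approach j. The sketch below assumes no correction for ties and uses a hypothetical helper name, `friedman_statistic`; it is not taken from the paper.

```python
def friedman_statistic(mean_ranks, n_blocks):
    """Friedman chi-square from mean ranks, without tie correction (assumption)."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks)
    return 12.0 * n_blocks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Mean ranks of MQS, C4.5, CART, EVO, ACDT with N = 44, as reported.
ranks = [3.1591, 2.6932, 2.5568, 3.9545, 2.6364]
print(friedman_statistic(ranks, 44))
```

The result is about 23.9, close to the reported 24.0594; the small gap may come from tie handling or from rounding of the ranks, which is a guess rather than something the source states.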
Friedman test results and mean ranks after rejection of the critically worse method.
| Friedman test | Value |
|---|---|
| N | 44 |
| Chi-Square | 5.8 |
| degrees of freedom | 3 |
| p-value | 0.1218 |
| 5% critical difference | 0.5305 |

| Approach | Mean rank |
|---|---|
| MQS | 2.8864 |
| C4.5 | 2.4205 |
| CART | 2.2614 |
| ACDT | 2.4318 |
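The values 0.0001 and 0.1218 in the two Friedman tables agree with the upper-tail probability of a chi-square distribution evaluated at the reported statistic and degrees of freedom. The sketch below checks this using only the standard library, via the closed-form survival function for integer degrees of freedom. The helper name `chi2_sf` is hypothetical, and reading those two values as p-values is an inference from this numerical agreement.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    # Odd df: complementary error function plus a finite correction sum.
    m = (df - 1) // 2
    tail = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, m + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


print(chi2_sf(24.0594, 4))  # all five approaches
print(chi2_sf(5.8, 3))      # after removing EVO
```

The first probability is below 0.0001 before rounding, so the hypothesis of equal ranks is rejected at the 5% level; the second is roughly 0.12, so no significant difference remains among MQS, C4.5, CART, and ACDT.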