Li Ma, Suohai Fan.
Abstract
BACKGROUND: The random forests algorithm is a classifier with broad applicability and robustness against overfitting, but it still has some drawbacks. To improve the performance of random forests, this paper addresses imbalanced data processing, feature selection and parameter optimization.
Keywords: Feature selection; Imbalance data; Intelligence algorithm; Parameter optimization; Random forests
Year: 2017 PMID: 28292263 PMCID: PMC5351181 DOI: 10.1186/s12859-017-1578-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Random forests algorithm
Fig. 2 Two random procedures diagram
Definitions in Borderline-SMOTE 1
| Point | Definition |
|---|---|
| Noisy point | |
| Boundary point/dangerous point | |
| Safe point | 0 ≤ |
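A minimal sketch of how such a categorisation could be computed, assuming the usual Borderline-SMOTE rule rather than anything specific to this paper: for each minority sample, take its m nearest neighbours and count how many of them, m', belong to the majority class. The sample is treated as noise if m' = m, as a boundary (dangerous) point if m/2 ≤ m' < m, and as safe if 0 ≤ m' < m/2. The function name, the symbols m and m', and the default m = 5 are illustrative assumptions.

```python
import math


def categorize_minority_points(points, labels, minority_label, m=5):
    """Label each minority point as 'noise', 'danger', or 'safe'.

    Assumed rule (standard Borderline-SMOTE): with m_prime majority samples
    among the m nearest neighbours, noise if m_prime == m, danger if
    m/2 <= m_prime < m, safe otherwise.
    """
    categories = {}
    for i, (p, lab) in enumerate(zip(points, labels)):
        if lab != minority_label:
            continue
        # Distances from p to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:m]
        k = len(neighbours)  # equals m unless the dataset is very small
        # m_prime: how many of the nearest neighbours are majority samples.
        m_prime = sum(1 for _, j in neighbours if labels[j] != minority_label)
        if m_prime == k:
            categories[i] = "noise"
        elif 2 * m_prime >= k:
            categories[i] = "danger"
        else:
            categories[i] = "safe"
    return categories
```

In Borderline-SMOTE 1, only the points in the boundary category would then be used as seeds for synthetic samples.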
Fig. 3 Diagrams of GA, PSO, and AFSA
Fig. 4 Binary coding
Fig. 5 The diagram of a hybrid algorithm based on RF and an artificial algorithm
Fig. 6 Crossover operation
Fig. 7 Mutation operation
Dataset
| Id | Dataset | N | M | Positive class | Negative class | IR | Label |
|---|---|---|---|---|---|---|---|
| 1 | Circle | 1362 | 2 | 229 | 1133 | 0.2021:1 | 1:0 |
| 2 | Blood-transfusion | 748 | 4 | 178 | 570 | 0.3123:1 | 4:2 |
| 3 | Haberman's survival | 306 | 3 | 81 | 225 | 0.36:1 | 2:1 |
| 4 | Breast-cancer-wisconsin | 702 | 10 | 243 | 459 | 0.5249:1 | 1:0 |
| 5 | SPECT.train | 80 | 23 | 26 | 54 | 0.4815:1 | 1:0 |
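The IR column appears to be the ratio of positive-class to negative-class sample counts. A short check under that assumption, using counts copied from four rows of the table (the helper name `imbalance_ratio` is illustrative):

```python
def imbalance_ratio(n_positive, n_negative):
    """Imbalance ratio, assumed here to be positive count / negative count."""
    return n_positive / n_negative


# (positive, negative) counts taken from the dataset table.
counts = {
    "Circle": (229, 1133),
    "Blood-transfusion": (178, 570),
    "Haberman's survival": (81, 225),
    "SPECT.train": (26, 54),
}

for name, (pos, neg) in counts.items():
    print(f"{name}: {imbalance_ratio(pos, neg):.4f}:1")
```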
Comparison of algorithms and references
| Algorithm | Reference | Algorithm | Reference |
|---|---|---|---|
| SMOTE | [ | Safe-level SMOTE | [ |
| Borderline-SMOTE 1 | [ | C-SMOTE | [ |
| k-means-SMOTE | [ | - | - |
Fig. 8 CURE-SMOTE algorithm diagram
Fig. 9 Artificial samples generated by different methods
Fig. 10 The CURE clustering result
The classification results of different sampling algorithms
| Dataset | Method | F | G-Mean | AUC | OOB error |
|---|---|---|---|---|---|
| 1. Circle | Original data | 0.9081 | 0.9339 | 0.9389 | 0.0296 |
| | Random oversampling | 0.9249 | 0.9553 | 0.9567 | |
| | SMOTE | 0.9086 | 0.9535 | 0.9579 | 0.0384 |
| | Borderline-SMOTE1 | 0.9110 | 0.9534 | 0.9619 | 0.0438 |
| | Safe-level-SMOTE | 0.9146 | 0.9595 | 0.9559 | 0.0431 |
| | C-SMOTE | 0.9302 | 0.9713 | 0.9813 | 0.0702 |
| | k-means-SMOTE | 0.9262 | 0.9589 | 0.9602 | 0.0323 |
| | CURE-SMOTE | | | | 0.0323 |
| 2. Blood-transfusion | Original data | 0.3509 | 0.5094 | 0.5083 | 0.2548 |
| | Random oversampling | 0.3903 | 0.5490 | 0.5449 | 0.2250 |
| | SMOTE | 0.4118 | 0.5798 | 0.5537 | 0.2152 |
| | Borderline-SMOTE1 | 0.4185 | 0.5832 | 0.5424 | |
| | Safe-level-SMOTE | 0.4494 | 0.6174 | 0.5549 | 0.2479 |
| | C-SMOTE | 0.4006 | 0.5549 | 0.5531 | 0.2418 |
| | k-means-SMOTE | 0.4157 | 0.5941 | 0.5433 | 0.1872 |
| | CURE-SMOTE | | | | 0.2531 |
| 3. Haberman's survival | Original data | 0.3279 | 0.5018 | 0.6063 | 0.3149 |
| | Random oversampling | 0.3504 | 0.5178 | 0.5959 | |
| | SMOTE | 0.4350 | 0.5971 | 0.6259 | 0.1728 |
| | Borderline-SMOTE1 | 0.4523 | 0.6119 | 0.6298 | 0.2589 |
| | Safe-level-SMOTE | 0.4762 | 0.6008 | 0.6030 | 0.3077 |
| | C-SMOTE | 0.4528 | 0.5487 | 0.5656 | 0.2780 |
| | k-means-SMOTE | 0.4685 | 0.6249 | 0.6328 | 0.1828 |
| | CURE-SMOTE | | | | 0.2717 |
| 4. Breast-cancer-wisconsin | Original data | 0.9486 | 0.9619 | 0.9491 | 0.0446 |
| | Random oversampling | 0.9451 | 0.9623 | 0.9620 | |
| | SMOTE | 0.9502 | 0.9666 | 0.9627 | 0.0341 |
| | Borderline-SMOTE1 | 0.9506 | 0.9661 | 0.9635 | 0.0379 |
| | Safe-level-SMOTE | 0.9509 | | | 0.0404 |
| | C-SMOTE | 0.9491 | 0.9636 | 0.9561 | 0.0380 |
| | k-means-SMOTE | 0.9449 | 0.9616 | 0.9562 | 0.0373 |
| | CURE-SMOTE | | 0.9664 | 0.9621 | 0.0427 |
| 5. SPECT.train | Original data | 0.6348 | 0.6764 | 0.6579 | 0.3634 |
| | Random oversampling | 0.6539 | 0.6924 | 0.6753 | 0.3468 |
| | SMOTE | 0.6618 | 0.6990 | 0.6825 | 0.3688 |
| | Borderline-SMOTE1 | 0.6710 | 0.6926 | 0.6746 | 0.3489 |
| | Safe-level-SMOTE | 0.6770 | 0.7074 | 0.6913 | 0.3160 |
| | C-SMOTE | 0.6564 | 0.6936 | 0.6764 | 0.3448 |
| | k-means-SMOTE | 0.6796 | 0.6941 | 0.6846 | 0.3599 |
| | CURE-SMOTE | | | | |
According to the classification results of the different sampling algorithms listed in Table 4, CURE-SMOTE achieves the best F-value, G-mean and AUC on the Circle dataset, and its OOB error is second-best, behind only random oversampling. The overall classification results on the blood-transfusion dataset are poorer, but CURE-SMOTE still achieves the best F-value, G-mean and AUC, while its OOB error is only slightly lower than that of the original data. On the Haberman's survival dataset, the F-value, G-mean and AUC achieved by CURE-SMOTE are superior to those of the other sampling algorithms. For the breast-cancer-wisconsin dataset, CURE-SMOTE achieves the best F-value, but its G-mean and AUC are slightly lower, although they differ little from those of the other sampling algorithms. On the SPECT dataset, CURE-SMOTE surpasses the other sampling algorithms with regard to F-value, G-mean, AUC and OOB error.
The best value of each performance evaluation criterion obtained by the algorithms is marked in boldface
Dataset
| id | Dataset | N | M | Positive class | Negative class | IR | Label |
|---|---|---|---|---|---|---|---|
| 1 | Connectionist Bench | 208 | 17 | 97 | 111 | 0.8739 | R:M |
| 2 | Wine | 130 | 13 | 59 | 71 | 0.831 | 1:2 |
| 3 | Ionosphere | 351 | 34 | 126 | 225 | 0.56 | b:g |
| 4 | Breast-cancer-wisconsin | 702 | 10 | 243 | 459 | 0.5249 | 1:0 |
| 5 | Steel Plates Faults | 1,941 | 27 | - | - | - | 7 labels |
| 6 | Libras Movement | 360 | 90 | - | - | - | 15 labels |
| 7 | mfeat-factors | 2,000 | 216 | - | - | - | 10 labels |
Parameter settings
| Algorithm | popsize | maxgen | Other parameters |
|---|---|---|---|
| Hybrid GA-RF | 5 | 20 | Pc: 0.6; Pm: 0.1 |
| Hybrid PSO-RF | 5 | 20 | c1: 1.5; c2: 1.5; r1, r2 ∈ [0, 1]; Vmin:Vmax = -0.5:0.5; w: 0.5 |
| Hybrid AFSA-RF | 5 | 20 | visual: 3; try_number: 5; delta: 0.618 |
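To make the GA settings concrete, the following is a rough sketch of a genetic algorithm over binary-coded individuals using the popsize, maxgen, Pc and Pm values from the table. It is not the authors' implementation: single-point crossover, per-bit mutation and binary tournament selection are assumed choices, and the fitness function is a caller-supplied stand-in. In a hybrid GA-RF the fitness would presumably be derived from random forest performance, which is not reproduced here.

```python
import random

POPSIZE, MAXGEN, PC, PM = 5, 20, 0.6, 0.1  # values from the parameter table


def crossover(a, b, rng):
    """Single-point crossover (assumed variant): swap the tails of two bit lists."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rng, pm=PM):
    """Flip each bit independently with probability pm."""
    return [1 - b if rng.random() < pm else b for b in bits]


def run_ga(fitness, n_bits, seed=0):
    """Maximise `fitness` over bit lists of length n_bits; return the best found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(POPSIZE)]
    best = max(pop, key=fitness)
    for _ in range(MAXGEN):
        # Binary tournament selection (an assumed selection scheme).
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(POPSIZE)]
        children = []
        for i in range(0, POPSIZE - 1, 2):
            a, b = parents[i], parents[i + 1]
            if rng.random() < PC:
                a, b = crossover(a, b, rng)
            children += [a, b]
        if len(children) < POPSIZE:
            children.append(parents[-1])  # odd population: carry one parent over
        pop = [mutate(c, rng) for c in children]
        best = max(pop + [best], key=fitness)  # remember the best seen so far
    return best
```

For example, `run_ga(sum, 10)` searches for a 10-bit string with as many ones as possible; replacing `sum` with a function that decodes the bits into forest parameters and a feature mask, then scores the resulting forest, would give the hybrid setting.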
The binary classification results
| Dataset | Metric | 1 | √M | ⌊log2(M) + 1⌋ | M | GA-RF | PSO-RF | AFSA-RF |
|---|---|---|---|---|---|---|---|---|
| Connectionist Bench | Accuracy | 0.6442 | 0.6442 | 0.6058 | 0.6635 | 0.6538 | 0.7308 | 0.6827 |
| | Sensitive | 0.5882 | 0.6122 | 0.6500 | 0.7556 | 0.5741 | 0.6744 | 0.5870 |
| | Precision | 0.6522 | 0.6250 | 0.4906 | 0.5862 | 0.7045 | 0.6744 | 0.6585 |
| | Specificity | 0.6981 | 0.6727 | 0.5781 | 0.5932 | 0.7400 | 0.7705 | 0.7586 |
| | F | 0.6186 | 0.6186 | 0.5591 | 0.6602 | 0.6327 | | 0.6207 |
| | G-mean | 0.6408 | 0.6418 | 0.6130 | 0.6695 | 0.6518 | | 0.6673 |
| | AUC | 0.4107 | 0.4119 | 0.3758 | 0.4482 | 0.4248 | | 0.4453 |
| | OOB | 0.3808 | 0.3889 | 0.3344 | 0.3391 | 0.3314 | 0.3085 | |
| | margin | 0.1078 | 0.1632 | 0.1991 | 0.2084 | 0.2056 | 0.1468 | |
| | | 100 | 100 | 100 | 100 | 315 | 193 | |
| | | 1 | 4 | 5 | 17 | 6 | 8 | |
| | | 17 | 17 | 17 | 17 | 13 | 16 | |
| Wine | Accuracy | 0.9846 | 0.9692 | 0.9846 | 0.9692 | 0.9846 | 0.9846 | 0.9692 |
| | Sensitive | 1.0000 | 0.9286 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | Precision | 0.9655 | 1.0000 | 0.9677 | 0.9333 | 0.9706 | 0.9643 | 0.9355 |
| | Specificity | 0.9730 | 1.0000 | 0.9714 | 0.9459 | 0.9688 | 0.9737 | 0.9444 |
| | F | 0.9825 | 0.9630 | 0.9836 | 0.9655 | | 0.9818 | 0.9667 |
| | G-mean | 0.9864 | 0.9636 | 0.9856 | 0.9726 | 0.9843 | | 0.9718 |
| | AUC | 0.9730 | 0.9286 | 0.9714 | 0.9459 | 0.9688 | | 0.9444 |
| | OOB | 0.0442 | 0.0502 | 0.0288 | 0.0748 | 0.0246 | | 0.0238 |
| | margin | 0.6951 | 0.7553 | 0.8149 | 0.7995 | 0.7863 | 0.7890 | |
| | | 100 | 100 | 100 | 100 | 349 | | 90 |
| | | 1 | 3 | 4 | 13 | 5 | | 5 |
| | | 13 | 13 | 13 | 13 | 12 | | 12 |
| Ionosphere | Accuracy | 0.9200 | 0.9314 | 0.9371 | 0.9257 | 0.9371 | 0.9257 | 0.9314 |
| | Sensitive | 0.9107 | 0.8475 | 0.8889 | 0.8824 | 0.8333 | 0.9032 | 0.9107 |
| | Precision | 0.8500 | 0.9434 | 0.9057 | 0.9231 | 0.9804 | 0.8889 | 0.8793 |
| | Specificity | 0.9244 | 0.9741 | 0.9587 | 0.9533 | 0.9913 | 0.9381 | 0.9412 |
| | F | 0.8793 | 0.8929 | 0.8972 | 0.9003 | | 0.8960 | 0.8947 |
| | G-mean | 0.9175 | 0.9086 | 0.9231 | 0.9171 | 0.9089 | 0.9205 | |
| | AUC | 0.8956 | 0.8651 | 0.9002 | 0.8975 | 0.8548 | 0.8835 | |
| | OOB | 0.1096 | 0.0860 | 0.1132 | 0.0884 | | 0.0831 | 0.0825 |
| | margin | 0.5696 | 0.6918 | 0.6511 | 0.7041 | | 0.6934 | 0.6351 |
| | | 100 | 100 | 100 | 100 | | 321 | 350 |
| | | 1 | 5 | 6 | 34 | | 15 | 2 |
| | | 34 | 34 | 34 | 34 | | 30 | 28 |
| Breast-cancer-wisconsin | Accuracy | 0.9801 | 0.9658 | 0.9715 | 0.9573 | 0.9544 | 0.9801 | 0.9658 |
| | Sensitive | 0.9914 | 0.9474 | 0.9583 | 0.9748 | 0.9919 | 1.0000 | 0.9474 |
| | Precision | 0.9504 | 0.9474 | 0.9583 | 0.9063 | 0.8905 | 0.9421 | 0.9474 |
| | Specificity | 0.9745 | 0.9747 | 0.9784 | 0.9483 | 0.9342 | 0.9705 | 0.9747 |
| | F | 0.9701 | 0.9474 | 0.9583 | 0.9393 | 0.9385 | | 0.9474 |
| | G-mean | 0.9829 | 0.9609 | 0.9683 | 0.9614 | 0.9626 | | 0.9609 |
| | AUC | 0.9844 | 0.9555 | 0.9595 | 0.9547 | 0.9601 | | 0.9474 |
| | OOB | 0.0422 | 0.0399 | 0.0433 | 0.0467 | | 0.0411 | 0.0372 |
| | margin | 0.8247 | 0.8569 | 0.8509 | 0.8652 | | 0.8179 | 0.8616 |
| | | 100 | 100 | 100 | 100 | | 420 | 351 |
| | | 1 | 3 | 4 | 10 | | 1 | 3 |
| | | 10 | 10 | 10 | 10 | | 9 | 7 |
The best value of each performance evaluation criterion obtained by the algorithms is marked in boldface
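The F and G-mean rows can be cross-checked against the Sensitive, Precision and Specificity rows. The sketch below assumes the standard definitions (F as the harmonic mean of precision and sensitivity, G-mean as the geometric mean of sensitivity and specificity); the function names are illustrative. The inputs are the first result column of the Connectionist Bench rows.

```python
import math


def f_value(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity * specificity)


# Connectionist Bench, first result column of the binary classification table.
sensitivity, precision, specificity = 0.5882, 0.6522, 0.6981

print(f"F      = {f_value(precision, sensitivity):.4f}")
print(f"G-mean = {g_mean(sensitivity, specificity):.4f}")
```

Both results agree with the tabulated 0.6186 and 0.6408 to within the rounding of the four-decimal inputs.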
The multi-classification results
| Dataset | Metric | 1 | √M | ⌊log2(M) + 1⌋ | M | GA-RF | PSO-RF | AFSA-RF |
|---|---|---|---|---|---|---|---|---|
| Steel Plates Faults | Accuracy | 0.7464 | 0.7485 | 0.7598 | 0.7814 | 0.7881 | | 0.7914 |
| | OOB | 0.3152 | 0.2819 | 0.2746 | 0.2640 | 0.2437 | 0.2276 | |
| | margin | 0.2456 | 0.3384 | 0.3484 | 0.3789 | 0.3803 | | 0.3810 |
| | | 100 | 100 | 100 | 100 | 397 | | 400 |
| | | 1 | 5 | 5 | 27 | 8 | | 6 |
| | | 27 | 27 | 27 | 27 | 23 | | 22 |
| Libras Movement | Accuracy | 0.7167 | 0.7556 | 0.6889 | 0.6444 | 0.7606 | 0.7767 | |
| | OOB | 0.3546 | 0.3397 | 0.3480 | 0.3163 | | 0.3323 | 0.3116 |
| | margin | 0.1464 | 0.1798 | 0.1990 | 0.2180 | 0.2443 | 0.2677 | |
| | | 100 | 100 | 100 | 100 | | 348 | 135 |
| | | 1 | 9 | 7 | 90 | | 8 | 9 |
| | | 90 | 90 | 90 | 90 | | 76 | 49 |
| mfeat-fac | Accuracy | 0.4280 | 0.9030 | 0.8010 | 0.9620 | | 0.9600 | 0.9611 |
| | OOB | 0.6949 | 0.1823 | 0.3192 | 0.0486 | 0.0416 | 0.0410 | |
| | margin | −0.0987 | 0.4561 | 0.2361 | 0.8708 | | 0.8615 | 0.8698 |
| | | 100 | 100 | 100 | 100 | 377 | 270 | |
| | | 1 | 15 | 8 | 215 | 14 | 18 | |
| | | 215 | 215 | 215 | 215 | 145 | 112 | |
The best value of each performance evaluation criterion obtained by the algorithms is marked in boldface
Fig. 11 Comparison of OOB errors among different methods and datasets