G. Sekhar Reddy, Suneetha Chittineni.
Abstract
Information efficiency is gaining importance in both the development and application sectors of information technology. Data mining is a computer-assisted process of massive data investigation that extracts meaningful information from datasets. The mined information is used in decision-making to understand the behavior of each attribute. Therefore, a new classification algorithm is introduced in this paper to improve information management. The classical C4.5 decision tree approach is combined with the Selfish Herd Optimization (SHO) algorithm to tune the gain of given datasets. The optimal weights for the information gain are updated based on SHO. Further, the dataset is partitioned into two classes based on quadratic entropy calculation and information gain. Decision tree gain optimization is the main aim of the proposed C4.5-SHO method. The robustness of the proposed method is evaluated on various datasets and compared with classifiers such as ID3 and CART. The accuracy and area under the receiver operating characteristic curve (AUROC) are estimated and compared with existing algorithms such as ant colony optimization, particle swarm optimization and cuckoo search.
Keywords: AUROC; C4.5 decision tree; Information gain; Selfish herd optimization
Year: 2021 PMID: 33954229 PMCID: PMC8049126 DOI: 10.7717/peerj-cs.424
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
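As a minimal sketch of the split criterion described in the abstract (quadratic entropy plus information gain), assuming categorical class labels; the function names and toy data below are illustrative, and the SHO gain-weighting step is not reproduced here:

```python
from collections import Counter

def quadratic_entropy(labels):
    """Quadratic entropy: sum of p * (1 - p) over class probabilities."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Gain of a split = parent entropy minus the sample-weighted
    entropy of the child partitions."""
    n = len(parent)
    weighted = sum(len(s) / n * quadratic_entropy(s) for s in splits)
    return quadratic_entropy(parent) - weighted

# Toy example: a pure split removes all entropy.
parent = [0, 0, 0, 1, 1, 1]
gain = information_gain(parent, [[0, 0, 0], [1, 1, 1]])  # 0.5
```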
Figure 1. Flow diagram of SHO.
Description of data sets.
| Data set | No. of attributes | No. of samples | Classes |
|---|---|---|---|
| Monks | 7 | 432 | 2 |
| Car | 6 | 1,728 | 4 |
| Chess | 6 | 28,056 | 36 |
| Breast-cancer | 10 | 699 | 2 |
| Hayes | 5 | 160 | 3 |
| Abalone | 8 | 4,177 | 2 |
| Wine | 13 | 178 | 2 |
| Ionosphere | 34 | 351 | 2 |
| Iris | 4 | 150 | 2 |
| Scale | 4 | 625 | 2 |
Algorithm parameters and values.
| SHO parameter | Value | ACO parameter | Value | PSO parameter | Value | CS parameter | Value |
|---|---|---|---|---|---|---|---|
| Number of populations | 50 | Number of populations | 50 | Number of populations | 100 | Number of populations | 50 |
| Maximum iterations | 500 | Maximum iterations | 500 | Maximum iterations | 500 | Maximum iterations | 500 |
| Dimension | 5 | Pheromone exponential weight | −1 | Inertia weight | −1 | Dimension | 5 |
| Lower boundary | −1 | Heuristic exponential weight | 1 | Inertia weight damping ratio | 0.99 | Lower bound and upper bound | −1 and 1 |
| Upper boundary | 1 | Evaporation rate | 1 | Personal and global learning coefficient | 1.5 and 2 | Number of nests | 20 |
| Prey’s rate | 0.7, 0.9 | Lower bound and upper bound | −1 and 1 | Lower bound and upper bound | −10 and 10 | Transition probability coefficient | 0.1 |
| Number of runs | 100 | Number of runs | 100 | Number of runs | 100 | Number of runs | 100 |
Algorithm parameters for decision trees.
| C4.5 parameter | Value | ID3 parameter | Value | CART parameter | Value |
|---|---|---|---|---|---|
| Confidence factor | 0.25 | Minimum number of instances in split | 10 | Complexity parameter | 0.01 |
| Minimum instance per leaf | 2 | Minimum number of instances in a leaf | 5 | Minimum number of instances in split | 20 |
| Minimum number of instances in a leaf | 5 | Maximum depth | 20 | Minimum number of instances in a leaf | 7 |
| Use binary splits only | False | – | – | Maximum depth | 30 |
Classification accuracy of the proposed C4.5-SHO classifier compared with C4.5, ID3 and CART.
| Data set | C4.5-SHO | C4.5 | ID3 | CART |
|---|---|---|---|---|
| Monks | 0.9832 | 0.966 | 0.951 | 0.954 |
| Car | 0.9725 | 0.923 | 0.9547 | 0.8415 |
| Chess | 0.9959 | 0.9944 | 0.9715 | 0.8954 |
| Breast-cancer | 0.9796 | 0.95 | 0.9621 | 0.9531 |
| Hayes | 0.9553 | 0.8094 | 0.9014 | 0.7452 |
| Abalone | 0.9667 | 0.9235 | 0.9111 | 0.9111 |
| Wine | 0.9769 | 0.963 | 0.9443 | 0.9145 |
| Ionosphere | 0.9899 | 0.9421 | 0.9364 | 0.9087 |
| Iris | 0.9986 | 0.9712 | 0.7543 | 0.8924 |
| Scale | 0.9437 | 0.7782 | 0.7932 | 0.7725 |
| Average value | 0.97623 | 0.92208 | 0.908 | 0.87884 |
Classification accuracy of the proposed C4.5-SHO algorithm compared with ACO, PSO and CS.
| Data set | C4.5-SHO | ACO | PSO | CS |
|---|---|---|---|---|
| Monks | 0.9832 | 0.9600 | 0.9435 | 0.9563 |
| Car | 0.9725 | 0.9322 | 0.9298 | 0.9202 |
| Chess | 0.9959 | 0.9944 | 0.9944 | 0.9742 |
| Breast-cancer | 0.9796 | 0.9555 | 0.954 | 0.9621 |
| Hayes | 0.9553 | 0.90311 | 0.9322 | 0.9415 |
| Abalone | 0.9667 | 0.9500 | 0.9345 | 0.9247 |
| Wine | 0.9769 | 0.9240 | 0.8999 | 0.8924 |
| Ionosphere | 0.9899 | 0.9583 | 0.9645 | 0.9645 |
| Iris | 0.9986 | 0.9796 | 0.9741 | 0.9764 |
| Scale | 0.9437 | 0.9060 | 0.9177 | 0.8911 |
| Average value | 0.97623 | 0.946311 | 0.94446 | 0.94034 |
Area under the ROC curve of the proposed C4.5-SHO compared with C4.5, ID3 and CART.
| Dataset | C4.5-SHO | C4.5 | ID3 | CART |
|---|---|---|---|---|
| Monks | 0.9619 | 0.95713 | 0.9636 | 0.9791 |
| Car | 0.9819 | 0.9393 | 0.9891 | 0.8933 |
| Chess | 0.9673 | 0.9252 | 0.9090 | 0.9049 |
| Breast-cancer | 0.9793 | 0.9171 | 0.9730 | 0.9218 |
| Hayes | 0.9874 | 0.9069 | 0.9108 | 0.8360 |
| Abalone | 0.9647 | 0.9224 | 0.9573 | 0.9082 |
| Wine | 0.9914 | 0.9772 | 0.9497 | 0.9739 |
| Ionosphere | 0.9943 | 0.9680 | 0.9059 | 0.9560 |
| Iris | 0.9890 | 0.9048 | 0.7945 | 0.9481 |
| Scale | 0.9850 | 0.8562 | 0.7845 | 0.8007 |
| Average value | 0.98022 | 0.92742 | 0.91374 | 0.9122 |
Area under the ROC curve of the proposed C4.5-SHO algorithm compared with ACO, PSO and CS.
| Dataset | C4.5-SHO | ACO | PSO | CS |
|---|---|---|---|---|
| Monks | 0.9935 | 0.9874 | 0.97668 | 0.9733 |
| Car | 0.98452 | 0.97908 | 0.97583 | 0.9659 |
| Chess | 0.99931 | 0.98612 | 0.9815 | 0.9503 |
| Breast-cancer | 0.9854 | 0.9795 | 0.9695 | 0.9581 |
| Hayes | 0.99616 | 0.92611 | 0.9442 | 0.9571 |
| Abalone | 0.9885 | 0.9828 | 0.9694 | 0.9566 |
| Wine | 0.9932 | 0.9830 | 0.8977 | 0.8964 |
| Ionosphere | 0.9954 | 0.9741 | 0.9630 | 0.9569 |
| Iris | 0.9873 | 0.9687 | 0.9656 | 0.9578 |
| Scale | 0.9858 | 0.9266 | 0.9165 | 0.8968 |
| Average value | 0.9909 | 0.96934 | 0.95599 | 0.94692 |
Entropy comparison.
| Dataset | C4.5-SHO (Shannon entropy) | C4.5-SHO (Havrda & Charvát entropy) | C4.5-SHO (Quadratic entropy) | C4.5-SHO (Rényi entropy) | C4.5-SHO (Taneja entropy) |
|---|---|---|---|---|---|
| Monks | 0.9429 | 0.9756 | 0.9859 | 0.9926 | 0.9415 |
| Car | 0.9585 | 0.9527 | 0.9753 | 0.9895 | 0.9700 |
| Chess | 0.9510 | 0.9535 | 0.9907 | 0.9809 | 0.9401 |
| Breast-cancer | 0.9852 | 0.9558 | 0.9863 | 0.9564 | 0.9672 |
| Hayes | 0.9579 | 0.9460 | 0.9981 | 0.9476 | 0.9102 |
| Abalone | 0.9556 | 0.9618 | 0.9789 | 0.9715 | 0.9447 |
| Wine | 0.9485 | 0.9731 | 0.9823 | 0.9297 | 0.9317 |
| Ionosphere | 0.9319 | 0.9415 | 0.9665 | 0.9636 | 0.9036 |
| Iris | 0.9465 | 0.9807 | 0.9832 | 0.9514 | 0.9428 |
| Scale | 0.9725 | 0.8936 | 0.9747 | 0.9617 | 0.9031 |
| Average Value | 0.95505 | 0.95343 | 0.98219 | 0.96449 | 0.93549 |
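The entropy families compared above have standard closed forms over a class-probability vector; a hedged sketch follows (α = 2 is an arbitrary choice here, and the Taneja variant is omitted because its parameterization differs across sources):

```python
import math

def shannon(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def quadratic(p):
    """Quadratic entropy: sum of p * (1 - p)."""
    return sum(pi * (1 - pi) for pi in p)

def renyi(p, alpha=2.0):
    """Rényi entropy of order alpha (alpha != 1)."""
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

def havrda_charvat(p, alpha=2.0):
    """Havrda & Charvát (structural) entropy of order alpha."""
    return (sum(pi ** alpha for pi in p) - 1) / (2 ** (1 - alpha) - 1)

probs = [0.5, 0.25, 0.25]
# shannon(probs) == 1.5; quadratic(probs) == 0.625
```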
Figure 2. Convergence evaluation of SHO.
Figure 3. Comparison of convergence plots.
Computational time.
| Algorithm | Time (sec) |
|---|---|
| ACO | 0.974 |
| PSO | 0.54 |
| CS | 0.6 |
| SHO | 0.49 |
Figure 4. Comparison of computational time.
Pseudo code for C4.5 decision tree algorithm.
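The C4.5 listing itself is not reproduced in this record. As a rough, self-contained sketch of the classical algorithm (categorical attributes, Shannon gain ratio, majority-class leaves; all names and the toy layout are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5 gain ratio: information gain normalized by split information."""
    n = len(labels)
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[attr], []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

def build_tree(rows, labels, depth=0, max_depth=5):
    """Recursively split on the attribute with the highest gain ratio."""
    if len(set(labels)) == 1 or depth == max_depth or not rows[0]:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    attr = max(range(len(rows[0])), key=lambda a: gain_ratio(rows, labels, a))
    children = {}
    for v in {r[attr] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        sub = [rows[i][:attr] + rows[i][attr + 1:] for i in idx]  # drop used attribute
        children[v] = build_tree(sub, [labels[i] for i in idx], depth + 1, max_depth)
    return {"attr": attr, "children": children}
```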
Pseudo code for the proposed SHO algorithm in data classification.
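The SHO listing is likewise absent from this record. The published algorithm has specific herd and predator roles and a prey rate (Figure 1); purely as an illustrative stand-in, here is a generic greedy population search over weight vectors in [−1, 1] (every name and update rule below is a simplification, not the paper's method):

```python
import random

def sho_like_search(fitness, dim=5, pop=50, iters=500, lo=-1.0, hi=1.0, seed=0):
    """Greedy population-search sketch: each member moves a random
    fraction toward the current best and keeps the move only if it
    improves its fitness. NOT the published SHO update rules."""
    rng = random.Random(seed)
    herd = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop)]
    best = max(herd, key=fitness)
    for _ in range(iters):
        for i, x in enumerate(herd):
            cand = [min(hi, max(lo, xi + rng.random() * (b - xi)))
                    for xi, b in zip(x, best)]
            if fitness(cand) > fitness(x):
                herd[i] = cand
        best = max(herd, key=fitness)
    return best
```

In the paper's setting, `fitness` would score a candidate information-gain weight vector by the accuracy of the resulting C4.5 tree.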