Muhammad Umar Chaudhry, Muhammad Yasir, Muhammad Nabeel Asghar, Jee-Hyong Lee.
Abstract
Complexity and high dimensionality are inherent concerns of big data. Feature selection has gained prime importance as a way to cope with these issues by reducing the dimensionality of datasets. The trade-off between maximum classification accuracy and minimum dimensionality remains an unsolved problem. Recently, Monte Carlo Tree Search (MCTS)-based techniques have been proposed that have attained great success in feature selection by constructing a binary feature selection tree and efficiently focusing on the most valuable features in the feature space. However, one challenging problem associated with such approaches is the trade-off between the tree search and the number of simulations. With a limited number of simulations, the tree might not reach sufficient depth, which biases feature subset selection towards randomness. In this paper, a new feature selection algorithm is proposed in which multiple feature selection trees are built iteratively in a recursive fashion. The state space of every successor feature selection tree is smaller than that of its predecessor, which increases the impact of tree search in selecting the best features while keeping the number of MCTS simulations fixed. Experiments are performed on 16 benchmark datasets for validation. We also compare the performance with state-of-the-art methods in the literature in terms of both classification accuracy and feature selection ratio.
Keywords: Monte Carlo Tree Search (MCTS); R-MOTiFS; dimensionality reduction; feature selection; heuristic feature selection
Year: 2020 PMID: 33286862 PMCID: PMC7597188 DOI: 10.3390/e22101093
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Notation used to explain the proposed method.
| Notation | Interpretation |
|---|---|
| | Original feature set |
| | Input feature set in |
| | Best feature subset in |
| | Node |
| | Number of times node |
| | Simulation reward |
Figure 1. The proposed method, Recursive-Monte Carlo Tree Search-Based Feature Selection (R-MOTiFS).
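The recursive scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: plain random subset sampling stands in for the MCTS tree search, the reward function `toy_reward` and the set `INFORMATIVE` are hypothetical, and the stopping rule (stop when a round no longer shrinks the feature set) is an assumption. The sketch only shows the structural idea: each round searches within the best subset found by the previous round, so the search space shrinks while the per-round simulation budget stays fixed.

```python
import random

# Hypothetical ground truth for the toy reward: only these features help.
INFORMATIVE = {0, 3, 7}


def toy_reward(subset):
    """Hypothetical reward: informative features found, minus a size penalty."""
    hits = len(INFORMATIVE.intersection(subset))
    return hits - 0.01 * len(subset)


def search_round(features, reward, n_simulations, rng):
    """Stand-in for one feature selection tree (random sampling, not MCTS).

    Samples random subsets of `features` and returns the best one found,
    starting from the full input set as the baseline.
    """
    best_subset, best_reward = list(features), reward(features)
    for _ in range(n_simulations):
        subset = [f for f in features if rng.random() < 0.5]
        if not subset:
            continue
        r = reward(subset)
        # Prefer a higher reward; break ties in favour of the smaller subset.
        if r > best_reward or (r == best_reward and len(subset) < len(best_subset)):
            best_subset, best_reward = subset, r
    return best_subset, best_reward


def recursive_selection(features, reward, n_simulations=200, seed=0):
    """Repeat the search, feeding each round's best subset into the next."""
    rng = random.Random(seed)
    current = list(features)
    history = [len(current)]  # size of the input feature set per round
    while True:
        best, _ = search_round(current, reward, n_simulations, rng)
        if len(best) >= len(current):  # assumed stopping rule: no shrinkage
            return current, history
        current = best
        history.append(len(current))


features = list(range(20))
selected, history = recursive_selection(features, toy_reward)
print("input sizes per round:", history)
print("selected features:", sorted(selected))
```

Because every round starts from the previous round's best subset and only accepts a strictly smaller one, the input size recorded in `history` decreases monotonically and the loop always terminates.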
Summary of the selected datasets.
| # | Dataset | No. of Features | No. of Instances | No. of Classes |
|---|---|---|---|---|
| 1 | Spambase | 57 | 4701 | 2 |
| 2 | Ionosphere | 34 | 351 | 2 |
| 3 | Arrhythmia | 195 | 452 | 16 |
| 4 | Multiple Features | 649 | 2000 | 10 |
| 5 | Waveform | 40 | 5000 | 3 |
| 6 | WDBC | 30 | 569 | 2 |
| 7 | German number | 24 | 1000 | 2 |
| 8 | DNA | 180 | 2000 | 2 |
| 9 | Sonar | 60 | 208 | 2 |
| 10 | Hillvalley | 100 | 606 | 2 |
| 11 | Musk 1 | 166 | 476 | 2 |
| 12 | Coil20 | 1024 | 1440 | 20 |
| 13 | Orl | 1024 | 400 | 40 |
| 14 | Lung_Discrete | 325 | 73 | 7 |
| 15 | Kr-vs-kp | 36 | 3196 | 2 |
| 16 | Spect | 22 | 267 | 2 |
Comparison of R-MOTiFS with MOTiFS and H-MOTiFS. Best results in each row are in bold.
| Dataset | Accuracy | ||
|---|---|---|---|
| R-MOTiFS | MOTiFS [ | H-MOTiFS [ | |
| Spambase | 0.907 | 0.907 | |
| Ionosphere | 0.890 ± 0.008 | 0.889 | |
| Arrhythmia | 0.650 | 0.640 | |
| Multiple features | 0.980 | ||
| Waveform | 0.817 ± 0.005 | 0.816 | |
| WDBC | 0.962 ± 0.002 | 0.964 | |
| German Number | 0.718 ± 0.014 | 0.725 | |
| DNA | 0.893 ± 0.002 | 0.810 | |
| Sonar | 0.834 ± 0.003 | 0.836 | |
| Hill valley | 0.552 ± 0.016 | 0.535 | |
| Musk 1 | 0.852 | 0.850 | |
| Coil20 | 0.981 ± 0.009 | 0.980 | |
| Orl | 0.862 ± 0.011 | 0.862 | |
| Lung_discrete | 0.807 ± 0.006 | 0.810 | |
| Kr-vs-kp | 0.964 ± 0.005 | 0.961 | |
| Spect | 0.813 ± 0.008 | 0.809 | |
Comparison of R-MOTiFS with MOTiFS and H-MOTiFS w.r.t FSR (feature selection ratio). Best results in each row are in bold.
| Dataset | R-MOTiFS | MOTiFS | H-MOTiFS |
|---|---|---|---|
| Spambase | | 0.029 | 0.050 |
| Ionosphere | | 0.072 | 0.127 |
| Arrhythmia | | 0.007 | 0.016 |
| Multiple ft. | | 0.003 | 0.005 |
| Waveform | 0.057 | 0.042 | |
| WDBC | 0.076 | 0.063 | |
| GermanNumber | 0.083 | 0.063 | |
| DNA | | 0.009 | 0.050 |
| Sonar | 0.059 | 0.029 | |
| HillValley | | 0.012 | 0.056 |
| Musk 1 | | 0.010 | 0.017 |
| Coil20 | | 0.002 | 0.003 |
| ORL | | 0.002 | 0.003 |
| Lung_discrete | | 0.005 | 0.008 |
| Kr-vs-Kp | 0.060 | 0.048 | |
| Spect | 0.093 | 0.079 | |
Comparison of R-MOTiFS with other methods. Best results in each row are bold and underlined. The second-best results in each row are in bold. “-” is placed wherever information is not available.
| Dataset | Accuracy, Number of Selected Features | ||||||
|---|---|---|---|---|---|---|---|
| R-MOTiFS | GA | SFSW | E-FSGA | PSO (4-2) | WOA | WOA-T | |
| Spambase | 0.910 | 0.885 |
| - | - | - | |
| Ionosphere | 0.875 | 0.883 | 0.862 | 0.873 | 0.884 | ||
| Arrhythmia | 0.635 | - | - | - | - | ||
| Multiple Feat | 0.976 | 0.945 | - | - | - | ||
| Waveform | 0.817 | - | - | 0.713 | 0.710 | ||
| WDBC | 0.961 | 0.941 |
| 0.940 | 0.955 | 0.950 | |
| GermanNumber | 0.713 | - | 0.685 | - | - | ||
| DNA | 0.831 | - | - | - | - | ||
| Sonar | 0.834 | 0.827 | 0.808 | 0.782 | 0.854 | ||
| HillValley | 0.552 | 0.564 | - | - | - | ||
| Musk 1 | 0.840 | 0.815 | - | - | - | ||
| Coil20 | - | 0.892 | - | - | - | ||
| ORL | - | 0.622 | - | - | - | ||
| Lung discrete | - | 0.713 | 0.784 | 0.730 | 0.737 | ||
| Kr-vs-Kp | - | - | - | 0.915 | 0.896 | ||
| Spect | - | - | - | 0.788 | 0.792 | ||
GA: Genetic Algorithm. SFSW: Simultaneous Feature Selection and Weighting. E-FSGA: Ensemble Feature Selection using bi-objective Genetic Algorithm. PSO (4-2): Particle Swarm Optimization. WOA: Whale Optimization Algorithm. WOA-T: Whale Optimization Algorithm-Tournament selection.
Comparison of R-MOTiFS with other methods w.r.t FSR (feature selection ratio). Best results in each row are bold and underlined. The second-best results in each row are in bold.
| Dataset | R-MOTiFS | GA | SFSW | PSO (4-2) | WOA | WOA-T |
|---|---|---|---|---|---|---|
| Spambase | | | 0.034 | - | - | - |
| Ionosphere | | 0.079 | 0.077 | | 0.041 | 0.044 |
| Arrhythmia | | | | - | - | - |
| Multiple Feat. | | 0.003 | | - | - | - |
| Waveform | | 0.045 | | - | 0.021 | 0.021 |
| WDBC | | 0.053 | 0.070 | | 0.046 | 0.046 |
| GermanNumber | | | 0.068 | 0.053 | - | - |
| DNA | | | 0.011 | - | - | - |
| Sonar | | 0.033 | 0.041 | | 0.020 | 0.022 |
| HillValley | | 0.017 | 0.014 | | - | - |
| Musk 1 | | 0.011 | | 0.011 | - | - |
| Coil20 | | | - | - | - | - |
| ORL | | | - | - | - | - |
| Lung_discrete | | 0.007 | - | | 0.010 | 0.011 |
| Kr-vs-Kp | | | - | - | 0.033 | 0.034 |
| Spect | | | - | - | 0.065 | 0.069 |
Results of the Wilcoxon Signed-Rank Test.
| R-MOTiFS vs. | R+ | R– | p-value | min(R+, R–) |
|---|---|---|---|---|
| MOTiFS | 136 | 0 | 0.0004 | 0 |
| H-MOTiFS | 72.5 | 63.5 | 0.8181 | 63.5 |
| GA | 136 | 0 | 0.0004 | 0 |
| SFSW | 66 | 0 | 0.0033 | 0 |
| PSO (4-2) | 9 | 19 | NA | 9 |
| WOA | 28 | 0 | NA | 0 |
| WOA-T | 28 | 0 | NA | 0 |
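As a reading aid for the table above, the following Python sketch shows how the signed-rank sums R+ and R– of a Wilcoxon test can be computed from paired scores. It is an illustration under assumptions, not the authors' procedure: the function name `signed_rank_sums` is hypothetical, tied absolute differences receive average ranks, and ranks of zero differences are split evenly between the two sums (one common convention). For n paired observations the two sums always add up to n(n+1)/2, which is consistent with the table: 136 = 16·17/2 for comparisons over all 16 datasets, 66 = 11·12/2, and 28 = 7·8/2.

```python
def signed_rank_sums(a, b):
    """Return (R_plus, R_minus) for paired samples a and b.

    R_plus sums the ranks of pairs where a > b, R_minus those where a < b.
    Assumed conventions: average ranks for tied |differences|, and ranks of
    zero differences split evenly between the two sums.
    """
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over the run of equal absolute differences.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    zero = sum(r for d, r in zip(diffs, ranks) if d == 0)
    return r_plus + zero / 2, r_minus + zero / 2


# Illustrative integer data (not taken from the paper): the first method
# wins on all 16 pairs, so every rank goes to R_plus.
wins_everywhere = signed_rank_sums(list(range(1, 17)), [0] * 16)
print(wins_everywhere)  # (136.0, 0.0)
```

Integer inputs are used in the example so that equal absolute differences compare exactly; with floating-point scores a tolerance-based tie check may be preferable.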
Results of the Friedman Test.
| Methods | Rank |
|---|---|
| R-MOTiFS | 1.36 |
| H-MOTiFS | 1.64 |
| MOTiFS | 4.64 |
| SFSW | 3.45 |
| GA | 3.72 |
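The Rank column above is presumably the mean rank of each method across datasets, with lower values being better. The following Python sketch shows one way such mean ranks can be computed; it is an illustration only, the function name `mean_ranks` and the example scores are hypothetical, and tied scores are assumed to share the average of the ranks they span.

```python
def mean_ranks(scores):
    """Mean rank per method, where scores[d][m] is the score of method m
    on dataset d and a higher score is better (rank 1 = best).

    Assumed convention: tied scores share the average of their ranks.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # Method indices ordered from best to worst score on this dataset.
        order = sorted(range(n_methods), key=lambda m: -row[m])
        i = 0
        while i < n_methods:
            # Extend j over the run of methods tied with position i.
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
            for k in range(i, j + 1):
                totals[order[k]] += avg
            i = j + 1
    return [t / len(scores) for t in totals]


# Hypothetical scores for three methods on two datasets (not from the paper).
example = [
    [0.9, 0.8, 0.7],  # ranks 1, 2, 3
    [0.9, 0.9, 0.5],  # ranks 1.5, 1.5, 3
]
print(mean_ranks(example))  # [1.25, 1.75, 3.0]
```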