Jamshid Pirgazi, Mohsen Alimoradi, Tahereh Esmaeili Abharian, Mohammad Hossein Olyaee.
Abstract
Feature selection is one of the most significant problems in data classification. Its purpose is to select the smallest number of features that increases accuracy and decreases the cost of classification. In recent years, the appearance of high-dimensional datasets with few samples has caused classification models to over-fit. Feature selection methods that remove redundant and irrelevant features are therefore needed. Although various methods have recently been proposed for selecting an optimal subset of features with high precision, they suffer from problems such as instability, long convergence time, and selection of a semi-optimal solution as the final result. In other words, they have not been able to fully extract the effective features. In this paper, a hybrid method based on the IWSSr method and the Shuffled Frog Leaping Algorithm (SFLA) is proposed to select effective features in large-scale gene datasets. The proposed algorithm is implemented in two phases: filtering and wrapping. In the filter phase, the Relief method is used to weight features. Then, in the wrapping phase, the SFLA and IWSSr algorithms search for effective features in a feature-rich area. The proposed method is evaluated on standard gene expression datasets. The experimental results show that, compared with similar methods, the proposed approach achieves a more compact set of features along with high accuracy. The source code and testing datasets are available at https://github.com/jimy2020/SFLA_IWSSr-Feature-Selection.
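The filter phase described in the abstract weights features with Relief. The sketch below is a minimal two-class version of the classic Relief update, not the authors' implementation: the function name `relief_weights`, the Manhattan distance, the `n_samples` and `seed` parameters, and the assumption that features are already scaled to [0, 1] are all illustrative choices.

```python
import random


def relief_weights(X, y, n_samples=None, seed=0):
    """Two-class Relief sketch. X: list of rows scaled to [0, 1]; y: class labels.

    Assumes every class has at least two samples (hypothetical helper).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    m = n if n_samples is None else n_samples
    weights = [0.0] * d

    def dist(a, b):
        # Manhattan distance between two rows.
        return sum(abs(p - q) for p, q in zip(a, b))

    for _ in range(m):
        i = rng.randrange(n)
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = X[min(same, key=lambda j: dist(X[i], X[j]))]    # nearest same-class row
        miss = X[min(other, key=lambda j: dist(X[i], X[j]))]  # nearest other-class row
        for f in range(d):
            # Reward features that differ on the miss, penalize those that differ on the hit.
            weights[f] += (abs(X[i][f] - miss[f]) - abs(X[i][f] - hit[f])) / m
    return weights


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))
```

On this toy data the first feature receives a positive weight and the second a negative one, so a ranking by weight would place the discriminative feature first.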
Year: 2019 PMID: 31819106 PMCID: PMC6901457 DOI: 10.1038/s41598-019-54987-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Pseudo code of SFLA.
Figure 2. Pseudo code of IWSSr.
Figure 3. The general scheme of the Relief algorithm.
Figure 4. Create frogs in the proposed algorithm.
Figure 5. Pseudo code of the proposed hybrid algorithm.
Figure 6. Leap algorithm for improving the worst frog (Fw) with the help of a better frog (Fb) (IWF).
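The pseudo code for IWSSr (Figure 2) is not reproduced in this record. As a rough illustration only, an incremental wrapper with replacement can be sketched as follows. The function `iwssr_sketch`, the `score` callback (standing in for a classifier's estimated accuracy), and the plain strict-improvement acceptance test are all assumptions of this sketch; the published method may use a stricter acceptance criterion.

```python
def iwssr_sketch(ranked, score):
    """Incremental wrapper with replacement (illustrative sketch).

    ranked: feature ids ordered best first by a filter such as Relief.
    score:  hypothetical callable mapping a frozenset of feature ids to a number.
    """
    selected = []
    best = float("-inf")
    for f in ranked:
        # Candidate moves: swap f in for one selected feature, or add f.
        # Swaps are listed first so that ties favour the smaller subset.
        candidates = [[f if g == s else g for g in selected] for s in selected]
        candidates.append(selected + [f])
        scored = [(score(frozenset(c)), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best:  # assumed acceptance test: strict improvement
            selected, best = top, top_score
    return selected, best


# Toy scorer: features 2 and 5 are useful, every feature has a small cost.
def toy_score(subset):
    return len(subset & {2, 5}) - 0.01 * len(subset)


print(iwssr_sketch([7, 2, 9, 5], toy_score))
```

In the toy run, feature 7 is selected first, replaced by feature 2, feature 9 is rejected, and feature 5 is added, which shows how replacement keeps the subset compact.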
Microarray data sets used in the experiments.
| Data set | Original Data | Training Data | Independent Data | #Gene | #Classes | #class1 | #class2 |
|---|---|---|---|---|---|---|---|
| Colon | 62 | 50 | 12 | 2000 | 2 | 40 | 22 |
| Arcene | 100 | 80 | 20 | 10000 | 2 | 44 | 56 |
| Prostate1 | 88 | 71 | 17 | 12625 | 2 | 38 | 50 |
| DLBCL | 77 | 61 | 16 | 11226 | 2 | 58 | 19 |
| Lung | 181 | 145 | 36 | 12533 | 2 | 150 | 31 |
| Dorothea | 800 | 640 | 160 | 100000 | 2 | 610 | 190 |
| Prostate | 136 | 109 | 27 | 12600 | 2 | 77 | 59 |
| CNS | 60 | 48 | 12 | 7129 | 2 | 21 | 39 |
| Leukemia | 72 | 58 | 14 | 7129 | 2 | 47 | 25 |
| Breast | 97 | 78 | 19 | 24481 | 2 | 51 | 46 |
SFLA parameters used in the problem.
| Parameter | Value | Comments |
|---|---|---|
|  | 100 |  |
|  | 10 | Number of memeplexes |
|  | 10 | Population size of each memeplex |
|  | 4 | Population size of submemeplexes |
|  | 40 | Total iteration number |
|  | 10 | The number of replications of the division of memeplexes into submemeplexes |
|  | 5 | The maximum leap length allowed to change |
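To make the parameter table concrete, the sketch below collects the listed values in a small configuration object and shows the standard SFLA step of dealing frogs, sorted by fitness, round-robin into memeplexes. The field names and the `partition` helper are hypothetical, and the unlabeled value 100 is only assumed to be the total population, which is consistent with 10 memeplexes of 10 frogs each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFLAParams:
    # Values copied from the table; field names are illustrative.
    n_memeplexes: int = 10
    memeplex_size: int = 10
    submemeplex_size: int = 4
    n_iterations: int = 40
    n_submemeplex_rounds: int = 10
    max_leap: int = 5

    @property
    def population(self):
        # Assumed to correspond to the unlabeled value 100 in the table.
        return self.n_memeplexes * self.memeplex_size


def partition(fitness, params):
    """Deal frog indices, sorted best first, round-robin into memeplexes."""
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    m = params.n_memeplexes
    return [order[k::m] for k in range(m)]


params = SFLAParams()
fitness = [float(i) for i in range(params.population)]  # toy fitness values
memeplexes = partition(fitness, params)
print(len(memeplexes), len(memeplexes[0]))
```

With round-robin dealing, each memeplex receives one frog from every fitness decile, so no memeplex starts with only strong or only weak candidates.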
Results of the feature selection algorithms (Acc: accuracy; Atts: number of selected attributes).
| DataSet | IWSS Acc | IWSS Atts | IWSSr Acc | IWSSr Atts | LFS Acc | LFS Atts | BARS Acc | BARS Atts | FCBF Acc | FCBF Atts | PCA Acc | PCA Atts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Colon | 80.65 | 3.8 | 83.87 | 2.8 | 80.80 | 4.1 | 85.70 | 3.0 | 77.40 | 14.6 | 72.50 | 28.9 |
| Arcene | 70.00 | 13.4 | 72.00 | 6.2 | 73.00 | 4.5 | 74.00 | 4.9 | 70.00 | 34.2 | — | — |
| Prostate1 | 76.23 | 12.8 | 77.42 | 8.3 | 73.12 | 3.6 | 85.34 | 4.1 | 63.12 | 32.4 | 59.12 | 37.1 |
| DLBCL | 83.11 | 3.2 | 81.23 | 2.7 | 88.67 | 4.1 | 75.21 | 2.8 | 96.45 | 56.2 | 68.11 | 42.7 |
| Lung | 97.20 | 2.7 | 97.20 | 2.4 | 93.60 | 2.5 | 98.30 | 3.0 | 99.40 | 115.2 | 85.61 | 125.2 |
| Dorothea | 93.50 | 7.4 | 92.90 | 6.3 | 90.30 | 5.5 | 93.80 | 7.3 | 92.60 | 92.8 | — | — |
| Prostate | 77.90 | 11.1 | 78.70 | 7.0 | 75.40 | 4.5 | 86.80 | 3.7 | 61.30 | 35.8 | 57.35 | 36.6 |
| CNS | 85.21 | 3.2 | 86.10 | 3.1 | 83.23 | 3.4 | 89.12 | 2.8 | 93.24 | 42.2 | 77.32 | 44.1 |
| Leukemia | 87.50 | 2.5 | 87.50 | 3.0 | 93.00 | 3.2 | 90.50 | 2.3 | 95.80 | 45.8 | 79.10 | 53.8 |
| Breast | 69.21 | 11.1 | 70.21 | 9.2 | 70.43 | 10.1 | 72.81 | 9.34 | 69.43 | 107.3 | 63.10 | 96.3 |
Comparison of the proposed method with GRASP and FICA.
| DataSet | GRASP + HC Acc | GRASP + HC Atts | GRASP + IWSS Acc | GRASP + IWSS Atts | GRASP + IWSSr Acc | GRASP + IWSSr Atts | GRASP + BARS Acc | GRASP + BARS Atts | GRASP + SFS Acc | GRASP + SFS Atts | FICA + IWSSr Acc | FICA + IWSSr Atts | F-Score Acc | F-Score Atts | SVM-RFE Acc | SVM-RFE Atts | Proposed method Acc | Proposed method Atts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Colon | 81.10 | 3.0 | 79.60 | 3.4 | 82.20 | 3.1 | 80.00 | 2.9 | 80.00 | 3.5 | 93.60 | 4.5 | 83.74 | 55 | 93.70 | 9.8 | 94.72 | 5.3 |
| Arcene | 80.00 | 5.7 | 79.30 | 6.0 | 78.50 | 5.7 | 79.00 | 5.2 | 79.30 | 6.3 | 93.40 | 7.1 | 73.25 | 110 | 89.11 | 13.5 | 95.16 | 8.5 |
| Prostate1 | 80.45 | 4.3 | 79.12 | 4.1 | 78.49 | 3.7 | 81.12 | 4.7 | 78.43 | 6.3 | — | — | 68.74 | 105 | 82.71 | 17.2 | 88.52 | 8.2 |
| DLBCL | 85.65 | 2.1 | 84.60 | 2.2 | 85.61 | 2.1 | 89.11 | 2.2 | 85.70 | 2.4 | 99.10 | 4.5 | 93.11 | 100 | 95.23 | 15.7 | 99.21 | 6.8 |
| Lung | 95.60 | 2.2 | 95.08 | 2.2 | 95.70 | 2.4 | 96.02 | 2.3 | 96.20 | 2.4 | 98.90 | 3 | 82.16 | 105 | 98.73 | 9.4 | 99.16 | 5.6 |
| Dorothea | 93.30 | 3.7 | 93.30 | 4.2 | 92.90 | 3.8 | 93.50 | 5.0 | 93.20 | 4.4 | 75.80 | 3 | 76.24 | 310 | 84.32 | 21.7 | 91.43 | 7.2 |
| Prostate | 77.80 | 5.0 | 78.60 | 5.7 | 77.50 | 4.6 | 78.60 | 5.1 | 78.10 | 5.6 | 92.40 | 4.4 | 54.33 | 250 | 92.20 | 14.4 | 94.18 | 7.8 |
| CNS | 91.46 | 2.6 | 93.12 | 2.8 | 87.32 | 2.8 | 92.14 | 3.1 | 91.12 | 3.1 | — | — | 66.53 | 90 | 76.96 | 16.3 | 95.64 | 6.7 |
| Leukemia | 92.60 | 2.7 | 93.70 | 2.7 | 91.60 | 2.8 | 93.30 | 2.8 | 93.60 | 3.3 | 99.60 | 1.8 | 75.57 | 70 | 100.00 | 8.6 | 99.62 | 5.2 |
| Breast | 79.63 | 4.3 | 80.11 | 3.1 | 78.38 | 3.5 | 81.24 | 2.7 | 80.91 | 3.6 | — | — | 73.82 | 120 | 86.09 | 17.3 | 88.17 | 10.2 |
| Mean | 85.75 | 3.56 | 85.65 | 3.64 | 84.82 | 3.45 | 86.40 | 3.60 | 85.65 | 4.09 | — | — | 74.74 | 131.5 | 89.90 | 14.39 | 93.34 | 7.12 |
Performance results of the proposed method on training and independent data.
| DataSet | Training Accuracy | Training Sensitivity | Training Specificity | Training Balance rate | Independent Accuracy | Independent Sensitivity | Independent Specificity | Independent Balance rate |
|---|---|---|---|---|---|---|---|---|
| Colon | 94.50 | 95.87 | 86.11 | 90.99 | 93.33 | 95.00 | 90.00 | 92.50 |
| Arcene | 94.75 | 92.57 | 96.44 | 94.50 | 94.00 | 92.21 | 95.45 | 93.83 |
| Prostate1 | 88.87 | 86.77 | 90.50 | 88.63 | 88.23 | 87.77 | 91.00 | 89.38 |
| DLBCL | 99.50 | 98.89 | 94.81 | 96.85 | 98.12 | 98.33 | 95.00 | 96.66 |
| Lung | 99.13 | 99.58 | 96.80 | 98.19 | 99.16 | 99.66 | 96.66 | 98.16 |
| Dorothea | 91.25 | 93.70 | 90.26 | 91.98 | 90.37 | 91.31 | 89.47 | 90.39 |
| Prostate | 94.18 | 98.36 | 90.45 | 94.41 | 94.44 | 96.25 | 91.81 | 94.03 |
| CNS | 95.31 | 90.54 | 97.21 | 93.88 | 94.99 | 92.50 | 96.25 | 94.37 |
| Leukemia | 99.34 | 100.00 | 97.29 | 98.64 | 98.57 | 99.00 | 97.50 | 98.25 |
| Breast | 88.12 | 88.23 | 87.95 | 88.09 | 87.89 | 89.00 | 85.55 | 87.27 |
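The balance rate values in the table agree, to within rounding, with the arithmetic mean of sensitivity and specificity (for example, Colon training: (95.87 + 86.11) / 2 = 90.99). The sketch below computes the four reported criteria from confusion-matrix counts using their standard definitions. The function names and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity and balance rate in percent."""
    sensitivity = 100.0 * tp / (tp + fn)   # true positive rate
    specificity = 100.0 * tn / (tn + fp)   # true negative rate
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balance_rate": balance_rate(sensitivity, specificity),
    }


def balance_rate(sensitivity, specificity):
    """Mean of sensitivity and specificity (assumed definition, consistent with the table)."""
    return (sensitivity + specificity) / 2.0


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=19, fn=1, tn=9, fp=1))
```

Unlike plain accuracy, the balance rate weights both classes equally, which matters for the imbalanced datasets in this study such as Lung (150 versus 31 samples).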
Figure 7. Distribution of selected feature values using the proposed method.
Figure 8. Mean accuracy of the frog populations over the 40 iterations of training.
Minimum, maximum, and average number of iterations performed by the proposed algorithm.
| Dataset | Minimum number of iterations | Average number of iterations | Maximum number of iterations | Average accuracy |
|---|---|---|---|---|
| Colon | 12 | 13.9 | 18 | 94.72 |
| Arcene | 4 | 70.8 | 9 | 95.16 |
| Prostate1 | 6 | 9.40 | 15 | 88.52 |
| DLBCL | 9 | 14.7 | 16 | 99.21 |
| Lung | 5 | 80.7 | 10 | 99.16 |
| Dorothea | 9 | 11.4 | 14 | 91.43 |
| Prostate | 17 | 22.2 | 30 | 94.18 |
| CNS | 8 | 9.80 | 21 | 95.64 |
| Leukemia | 11 | 16.4 | 19 | 99.62 |
| Breast | 12 | 9.50 | 22 | 88.17 |
| Average | 9.3 | 12.38 | 17.4 | 93.34 |
Figure 9. Comparison of the performance criteria (accuracy, specificity, sensitivity, and balanced rate) of the proposed method.