Lan Huang1, Xuemei Hu1, Yan Wang1,2, Yuan Fu3.
Abstract
Feature selection (FS) is a vital step in data mining and machine learning, especially when analyzing data in a high-dimensional feature space. Gene expression data usually consist of a few samples characterized by a high-dimensional feature space. As a result, they are not well suited to simple methods such as filter-based methods. In this study, we propose a novel feature selection algorithm based on the Explosion Gravitation Field Algorithm, called EGFAFS. To reduce the feature space to an acceptable number of dimensions, we construct a recommended feature pool using a series of Random Forests based on the Gini index. By paying more attention to the features in the recommended feature pool, we can then find the best subset more efficiently. To verify the performance of EGFAFS for FS, we tested it on eight gene expression datasets and compared it with four heuristic-based FS methods (GA, PSO, SA, and DE) and four other FS methods (Boruta, HSICLasso, DNN-FS, and EGSG). The results show that, in terms of the evaluation metrics, EGFAFS performs better on gene expression data than the other eight FS algorithms. The genes selected by EGFAFS play an essential role in the differential co-expression network and in some biological functions, which further demonstrates the success of EGFAFS in solving FS problems on gene expression data.
Keywords: Explosion Gravitation Field Algorithm; feature selection; gene expression data; heuristic algorithm
Year: 2022 PMID: 35885095 PMCID: PMC9322764 DOI: 10.3390/e24070873
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
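The recommended feature pool mentioned in the abstract can be illustrated with a small sketch. The record does not give the authors' implementation, so the following is only a simplified stand-in: instead of full Random Forests, each round draws a bootstrap sample and a random feature subset, scores every candidate feature by the largest Gini-impurity decrease of a single threshold split, and votes for the top-ranked features. The function names (`gini`, `best_gini_decrease`, `recommended_pool`), the parameters, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_gini_decrease(values, labels):
    """Largest impurity decrease achievable by one threshold split on a feature."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def recommended_pool(X, y, n_rounds=20, n_candidates=None, top_k=1, seed=0):
    """Rank features by how often they land in the top_k over bootstrap rounds."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    n_candidates = n_candidates or max(1, int(n_features ** 0.5))
    votes = Counter()
    for _ in range(n_rounds):
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]  # bootstrap sample
        feats = rng.sample(range(n_features), n_candidates)  # random feature subset
        labels = [y[r] for r in rows]
        scores = {f: best_gini_decrease([X[r][f] for r in rows], labels) for f in feats}
        ranked = sorted(feats, key=lambda f: scores[f], reverse=True)
        votes.update(f for f in ranked[:top_k] if scores[f] > 0)
    return [f for f, _ in votes.most_common()]


# Toy data (an assumption for illustration): feature 0 separates the two
# classes perfectly, the other seven features are uniform noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[float(label)] + [data_rng.random() for _ in range(7)] for label in y]
pool = recommended_pool(X, y, n_rounds=100, n_candidates=6, top_k=1)
print(pool)
```

A search procedure could then sample candidate features preferentially from the front of `pool`, which is the sense in which the abstract speaks of paying more attention to pooled features.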
Figure 1. The flow chart of EGFA.
Figure 2. The overall flow chart of EGFAFS.
Figure 3. The movement process. is a feature selected from original . is a feature selected from its center.
Figure 4. The explosion process. Step a. Copy to . Step b. Select features from randomly as, and select features from the original feature space as . Step c. Replace the features in with features in one by one.
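The three steps named in the explosion caption (copy, select, replace) can be sketched as follows. The symbols in the caption did not survive extraction, so this is only one plausible reading: the copied object is taken to be a center's feature subset, `k` features of the copy are swapped for `k` features drawn from the full feature space, and features already in the subset are excluded from the draw to avoid duplicates. The function name `explode`, its parameters, and the uniform sampling are assumptions, not the authors' definitions.

```python
import random


def explode(center, all_features, k, rng):
    """Return a new feature subset derived from `center` by swapping k features."""
    new_subset = list(center)  # Step a: copy the center's subset
    k = min(k, len(new_subset))
    # Step b: pick k positions in the copy, and k replacement features
    # from the original feature space (excluding ones already present).
    out_positions = rng.sample(range(len(new_subset)), k)
    present = set(center)
    candidates = [f for f in all_features if f not in present]
    incoming = rng.sample(candidates, min(k, len(candidates)))
    # Step c: replace the chosen features one by one.
    for pos, feat in zip(out_positions, incoming):
        new_subset[pos] = feat
    return new_subset


center = [0, 1, 2, 3, 4]
new_subset = explode(center, range(20), k=2, rng=random.Random(0))
print(new_subset)
```

Because the input list is copied first, the center itself is left unchanged and several new subsets can be generated from the same center.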
Detailed information of eight cancer datasets.
| ID | Dataset | No. of Genes | No. of Normal | No. of Tumor |
|---|---|---|---|---|
| 1 | HNSC | 19,214 | 44 | 502 |
| 2 | LIHC | 19,214 | 50 | 373 |
| 3 | LUAD | 19,214 | 59 | 515 |
| 4 | LUSC | 19,214 | 49 | 501 |
| 5 | PRAD | 19,214 | 52 | 496 |
| 6 | STAD | 19,214 | 32 | 375 |
| 7 | THCA | 19,214 | 50 | 510 |
| 8 | UCEC | 19,214 | 35 | 544 |
The comparison of EGFAFS with and without the recommended feature pool.
| | Dataset | ACC | F1 | Recall | PRE | MCC | AP | AUC |
|---|---|---|---|---|---|---|---|---|
| A | HNSC | | | | | | 0.9333 | 0.9935 |
| A | LIHC | 0.9764 | 0.9166 | 0.9166 | 0.9166 | 0.9029 | 0.9085 | 0.9897 |
| A | LUAD | 0.9913 | 0.9600 | 0.9230 | 1.0000 | 0.9560 | 0.9897 | 0.9984 |
| A | PRAD | 0.9636 | 0.7142 | 0.6250 | 0.8333 | 0.7035 | 0.8561 | 0.9791 |
| A | STAD | 0.9512 | 0.6000 | 0.5000 | 0.7500 | 0.5885 | 0.7164 | 0.9451 |
| B | HNSC | | | | | | | |
| B | LIHC | | | | | | | |
| B | LUAD | | | | | | | |
| B | PRAD | | | | | | | |
| B | STAD | | | | | | | |
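The threshold-based metrics in this table (ACC, PRE, Recall, F1, MCC) are all functions of the four confusion-matrix counts. A minimal sketch of the standard definitions follows. The function name `binary_metrics` is hypothetical, and the counts in the usage line are illustrative only: the source does not report confusion matrices, and these values were picked because they give results close to the LIHC row of group A. AP and AUC depend on ranked prediction scores and are not covered here.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": (tp + tn) / total,
        "PRE": precision,
        "Recall": recall,
        "F1": f1,
        "MCC": mcc,
    }


# Illustrative counts, not taken from the source.
m = binary_metrics(tp=11, fp=1, fn=1, tn=72)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With strongly imbalanced classes, as in these datasets, MCC and F1 are more informative than ACC, which is why a row such as STAD can show high accuracy alongside a much lower F1.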
Figure 5. Performance comparison of EGFAFS with different population sizes.
Figure 6. Performance comparison of EGFAFS with different numbers of RFs.
Figure 7. Performance comparison of EGFAFS and eight FS methods on eight datasets. The horizontal axis gives the nine FS methods, and the vertical axis gives the scores of the three metrics: F1, MCC, and AP.
Comparison of the running time of EGFAFS with eight FS methods.
| Method | HNSC | LIHC | LUAD | LUSC | PRAD | STAD | THCA | UCEC | Total |
|---|---|---|---|---|---|---|---|---|---|
| GA | 167.42 | 134.33 | 167.65 | 155.10 | 188.16 | 130.12 | 172.39 | 162.00 | 1277.17 |
| PSO | 52.41 | 43.33 | 50.38 | 48.11 | 59.48 | 44.53 | 51.93 | 51.70 | 401.87 |
| SA | 190.29 | 148.71 | 186.36 | 175.60 | 221.18 | 151.78 | 203.82 | 187.39 | 1465.13 |
| DE | 79.14 | 62.51 | 77.43 | 70.20 | 87.37 | 63.52 | 79.46 | 76.00 | 595.63 |
| EGSG | | | | | | | | | |
| Boruta | 209.31 | 195.87 | 191.85 | 180.71 | 194.19 | 197.39 | 201.10 | 192.56 | 1562.98 |
| HSICLasso | 48.10 | 37.22 | 50.17 | 52.37 | 47.23 | 35.88 | 47.87 | 50.32 | 369.16 |
| DNN-FS | 142.47 | 115.33 | 148.91 | 141.82 | 141.28 | 109.35 | 146.92 | 149.16 | 1095.24 |
| EGFAFS | 143.49 | 109.79 | 151.83 | 138.50 | 147.37 | 106.78 | 153.12 | 140.49 | 1091.37 |
Figure 8. The degree of the features selected by EGFAFS for the LIHC dataset in the differential co-expression network.
The information of genes selected by EGFAFS in differential co-expression networks.
| ID | Dataset | M_D | No_G | No_Select | No_Total |
|---|---|---|---|---|---|
| 1 | HNSC | 30 | 25 | 38 | 50 |
| 2 | LIHC | 48 | 15 | 25 | 50 |
| 3 | LUAD | 56 | 29 | 34 | 50 |
| 4 | LUSC | 52 | 27 | 37 | 50 |
| 5 | PRAD | 108 | 11 | 16 | 50 |
| 6 | STAD | 20 | 20 | 29 | 50 |
| 7 | THCA | 93 | 16 | 24 | 50 |
| 8 | UCEC | 35 | 45 | 49 | 50 |
Figure 9. GO enrichment analysis of genes selected by EGFAFS on the LIHC dataset.