Qi Chen, Zhaopeng Meng, Xinyi Liu, Qianguo Jin, Ran Su.
Abstract
Feature selection, which identifies a set of the most informative features from the original feature space, has been widely used to simplify predictors. Recursive feature elimination (RFE), one of the most popular feature selection approaches, is effective in reducing data dimensionality and increasing efficiency. RFE produces a ranking of features, together with candidate subsets and their corresponding accuracies. The subset with the highest accuracy (HA) or with a preset number of features (PreNum) is often taken as the final subset. However, the former may select a large number of features, and without prior knowledge of an appropriate preset number, the latter makes the final selection ambiguous and subjective. A proper decision variant is therefore in high demand to determine the optimal subset automatically. In this study, we conduct pioneering work exploring the decision variant applied after RFE yields a list of candidate subsets. We provide a detailed analysis and comparison of several decision variants for automatically selecting the optimal feature subset. A random forest (RF)-based recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two entirely different molecular biology datasets, one from a toxicogenomic study and the other from protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
Keywords: RFE; decision variant; feature selection; random forest; voting
Year: 2018 PMID: 29914084 PMCID: PMC6027449 DOI: 10.3390/genes9060301
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Statistical analysis of the 30 most recent publications that used recursive feature elimination (RFE) for feature selection. HA: used the highest classification accuracy as the decision variant; PreNum: used a pre-defined number of features as the variant; No: no choice was made; Other: used other variants for feature selection.
Figure 2. The main procedure of the recursive feature elimination (RFE) method.
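The RFE procedure in Figure 2 can be sketched with scikit-learn's `RFECV`, which repeatedly fits the estimator, eliminates the lowest-ranked feature(s), and records cross-validated accuracy for each candidate subset size. The dataset, tree count, step size, and fold count below are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of RF-RFE: random forest as the ranking estimator inside RFE,
# with cross-validated scores recorded for every candidate subset size.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data standing in for a real molecular biology dataset.
X, y = make_classification(n_samples=200, n_features=30, n_informative=6,
                           random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,                      # drop one feature per elimination round
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="balanced_accuracy",
)
selector.fit(X, y)

print("Optimal subset size:", selector.n_features_)
print("Feature ranking (1 = kept):", selector.ranking_)
```

`RFECV` automates the HA-style choice (it keeps the subset size with the best cross-validated score); the paper's contribution is comparing this kind of decision variant against alternatives such as 90% HA, PreNum, and voting.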
Figure 3. The three variants analyzed in this study: HA, 90% HA, and PreNum (set to 12). The results were analyzed on the TG-Gates_500 data.
Figure 4. The voting strategy used to select the optimal feature subset after 10-fold cross-validation. Here, we assume that the top two ranked features received 7 and 5 votes, respectively.
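The voting strategy in Figure 4 can be sketched as follows: each cross-validation fold nominates its selected feature subset, every feature receives one vote per fold that selected it, and the final subset keeps features whose votes exceed a threshold υ. The fold-level subsets below are made up for illustration; in practice they would come from running RF-RFE within each fold.

```python
# Sketch of the fold-level voting strategy (assumed, illustrative subsets).
from collections import Counter

# Feature indices selected in each of 10 hypothetical folds.
fold_subsets = [
    {0, 1, 3}, {0, 1}, {0, 2, 3}, {0, 1}, {1, 4},
    {0, 1, 3}, {0, 2}, {1, 3}, {0, 1, 4}, {0, 3},
]

# One vote per fold in which the feature was selected.
votes = Counter(f for subset in fold_subsets for f in subset)

def select_by_votes(votes, threshold):
    """Keep features whose vote count is strictly greater than the threshold."""
    return sorted(f for f, v in votes.items() if v > threshold)

print("Votes per feature:", votes.most_common())
print("Features with more than 4 votes:", select_by_votes(votes, 4))
```

Sweeping the threshold υ from high to low reproduces the trade-off shown in the first table: a stricter threshold yields fewer, more consistently selected features.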
Figure 5. The frequency of votes of the selected features in the candidate feature pool.
Performance using features with votes larger than a threshold υ.
| υ | Number of Features | Balanced Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| 7 | 1 | 52.93 | 42.22 | 63.64 |
| 6 | 1 | 52.93 | 42.22 | 63.64 |
| 5 | 2 | 57.47 | 42.22 | 72.73 |
| 4 | 4 | 61.21 | 46.67 | 75.76 |
| 3 | 6 | 66.41 | 55.56 | 77.27 |
| 2 | 12 | 72.78 | 62.22 | 83.33 |
| 1 | 42 | 66.16 | 44.44 | 87.88 |
| 0 | 151 | 55.40 | 24.44 | 86.36 |
| Without FS | 500 | 47.57 | 13.33 | 81.82 |
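In the tables above, balanced accuracy is the mean of sensitivity and specificity (e.g., for υ = 7: (42.22 + 63.64) / 2 ≈ 52.93). A minimal sketch of these metrics, assuming a binary labeling with 1 as the positive class (the labels below are toy values, not the paper's data):

```python
# Balanced accuracy, sensitivity, and specificity from binary labels.
def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)                    # true-positive rate
    specificity = tn / (tn + fp)                    # true-negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    return balanced_accuracy, sensitivity, specificity

# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
ba, se, sp = classification_metrics(y_true, y_pred)
print(f"balanced accuracy={ba:.2%}, sensitivity={se:.2%}, specificity={sp:.2%}")
```

Balanced accuracy is the appropriate headline metric here because both datasets appear class-imbalanced: without feature selection, TG-Gates_500 shows high specificity (81.82%) but very low sensitivity (13.33%).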
Classification performance using the three selection variants, and performance without any feature selection, for TG-Gates_500.
| Variant | Number of Features | Balanced Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| HA | 12 | 72.78 | 62.22 | 83.33 |
| 90% HA | 17 | 77.27 | 66.67 | 87.87 |
| PreNum (12) | 26 | 75.40 | 64.44 | 86.36 |
| Without FS | 500 | 47.57 | 13.33 | 81.82 |
Classification performance using the three selection variants, and performance without any feature selection, for CPPsite3.
| Variant | Number of Features | Balanced Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| HA | 17 | 70.05 | 66.84 | 73.26 |
| 90% HA | 17 | 68.18 | 64.17 | 72.19 |
| PreNum (17) | 24 | 70.05 | 67.91 | 72.19 |
| Without FS | 188 | 65.24 | 61.50 | 68.98 |