Ursula Neumann, Nikita Genze, Dominik Heider.
Abstract
BACKGROUND: Feature selection methods aim to identify a subset of features that improves the prediction performance of subsequent classification models and thereby also simplifies their interpretation. Preceding studies demonstrated that single feature selection methods can have specific biases, whereas ensemble feature selection can alleviate and compensate for these biases.
Keywords: Ensemble learning; Feature selection; Machine learning; R-package
Year: 2017 PMID: 28674556 PMCID: PMC5488355 DOI: 10.1186/s13040-017-0142-8
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Method overview
| Command | Parameter | Information |
|---|---|---|
| ensemble_fs | data | object of class data.frame |
| | classnumber | index of variable for binary classification |
| | NA_threshold | threshold for deletion of features with a greater proportion of NAs |
| | cor_threshold | correlation threshold within features |
| | runs | number of runs for randomForest and cforest |
| | selection | selection of feature selection methods to be conducted |
| barplot_fs | name | character string giving the name of the file |
| | efs_table | table object of class matrix retrieved from ensemble_fs |
| efs_eval | data | object of class data.frame |
| | efs_table | table object of class matrix retrieved from ensemble_fs |
| | file_name | character string, name which is used for the two possible PDF files |
| | classnumber | index of variable for binary classification |
| | NA_threshold | threshold for deletion of features with a greater proportion of NAs |
| | logreg | logical value indicating whether to conduct an evaluation via logistic regression or not |
| | permutation | logical value indicating whether to conduct a permutation of the class variable or not |
| | p_num | number of permutations; default set to 100 |
| | variances | logical value indicating whether to calculate the variances of importances retrieved from bootstrapping or not |
| | jaccard | logical value indicating whether to calculate the Jaccard index or not |
| | bs_num | number of bootstrap permutations of the importances |
| | bs_percentage | proportion of randomly selected samples for bootstrapping |
The R-package EFS provides three functions
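The `jaccard`, `bs_num` and `bs_percentage` parameters of efs_eval relate to how stable the selected feature subset is across bootstrap samples. As a rough illustration only, the following Python sketch computes the Jaccard index between feature subsets and averages it over all pairs of subsets. It is not the package's R implementation; the function names `jaccard` and `mean_pairwise_jaccard` and the example subsets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention assumed here: two empty subsets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(subsets):
    """Average Jaccard index over all unordered pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature subsets selected on three bootstrap samples.
bootstrap_subsets = [
    {"age", "chol", "bp"},
    {"age", "chol"},
    {"age", "bp", "bmi"},
]
stability = mean_pairwise_jaccard(bootstrap_subsets)
print(f"mean pairwise Jaccard index: {stability:.3f}")
```

A value close to 1 means that nearly the same features are selected on every bootstrap sample, whereas a value close to 0 indicates an unstable selection.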
Fig. 1 Cumulative barplot retrieved from barplot_fs function of R-package EFS
Fig. 2 Performance of LR model. On the y-axis the average true positive rate (i.e., sensitivity) and on the x-axis the false positive rate (i.e., 1-specificity) is shown. Two ROC curves are shown: of all features (black) and the EFS selected features (blue). The dotted line marks the performance of random guessing
Fig. 3 Boxplot of importances retrieved from the bootstrapping algorithm
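The idea behind a cumulative barplot of ensemble importances (as in Fig. 1) can be sketched as follows: the importances of each individual method are brought to a common scale, and the scaled values are stacked per feature. The Python sketch below is a minimal illustration under stated assumptions, not the package's code. It assumes non-negative raw importances, scales each method by its maximum, and divides by the number of methods so that the cumulative score of a feature lies between 0 and 1. The function name `ensemble_importance`, the method names and the values are hypothetical.

```python
def ensemble_importance(importances_by_method):
    """Scale each method's importances to [0, 1/n_methods] and sum per feature.

    importances_by_method: dict mapping method name -> dict mapping
    feature name -> raw, non-negative importance (assumption).
    Returns (per-method scaled table, cumulative score per feature).
    """
    n_methods = len(importances_by_method)
    features = sorted({f for imp in importances_by_method.values() for f in imp})
    table = {}
    for method, imp in importances_by_method.items():
        top = max(imp.values(), default=0.0)
        table[method] = {
            # Features a method did not score contribute 0 for that method.
            f: (imp.get(f, 0.0) / top / n_methods) if top > 0 else 0.0
            for f in features
        }
    totals = {f: sum(table[m][f] for m in table) for f in features}
    return table, totals


# Hypothetical raw importances from two feature selection methods.
raw = {
    "method_a": {"x1": 4.0, "x2": 2.0, "x3": 0.0},
    "method_b": {"x1": 0.8, "x2": 1.0, "x3": 0.25},
}
table, totals = ensemble_importance(raw)
ranking = sorted(totals, key=totals.get, reverse=True)
for feature in ranking:
    print(f"{feature}: {totals[feature]:.3f}")
```

Each bar of a cumulative barplot would then correspond to one feature, with one segment per method taken from `table` and the full bar height given by `totals`.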