Ursula Neumann1, Mona Riemenschneider2, Jan-Peter Sowa3, Theodor Baars4, Julia Kälsch3, Ali Canbay3, Dominik Heider1.
Abstract
MOTIVATION: Biomarker discovery methods are essential for identifying a minimal subset of features (e.g., serum markers in predictive medicine) that are relevant for developing prediction models with high accuracy. Diverse feature selection methods now exist, which are either embedded in, combined with, or independent of predictive learning algorithms. Many preceding studies have shown the deficiencies of single feature selection results, which make it difficult for professionals in a variety of fields (e.g., medical practitioners) to analyze and interpret the obtained feature subsets. Whereas each of these methods is highly biased, an ensemble feature selection has the advantage of alleviating and compensating for such biases. With regard to the reliability, validity, and reproducibility of these methods, we examined eight different feature selection methods for binary classification datasets and developed an ensemble feature selection system.
Keywords: Biomarker discovery; Ensemble learning; Feature selection; Machine learning; Random forest
Year: 2016 PMID: 27891179 PMCID: PMC5116216 DOI: 10.1186/s13040-016-0114-4
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Overview of datasets. Number of features after removing samples with missing values
| Dataset | No. of Samples | No. of Features | Categorical | Numeric |
|---|---|---|---|---|
| MI-Mortality | 406 | 14 | 7 | 7 |
| Fibrosis | 101 | 26 | 7 | 19 |
| FLIP | 103 | 13 | 6 | 7 |
| SPECTF | 267 | 44 | 44 | 0 |
| Sonar | 208 | 60 | 0 | 60 |
| WBC | 569 | 30 | 0 | 30 |
Fig. 1 Venn diagrams. Comparison of feature subsets retrieved from AUC importance and EFS importance
Types of selected features. Evaluation of the selected feature subsets of AUC-FS and EFS
| Dataset | AUC-FS selected | EFS selected | EFS/all in % | Numeric* | Categorical* |
|---|---|---|---|---|---|
| MI-Mortality | 4 | 5 | 35.7 | 3 | 2 |
| Fibrosis | 8 | 7 | 26.9 | 5 | 3 |
| FLIP | 4 | 5 | 38.5 | 3 | 2 |
| SPECTF | 15 | 19 | 43.2 | 0 | 19 |
| Sonar | 20 | 24 | 40.0 | 24 | 0 |
| WBC | 10 | 10 | 33.3 | 9 | 1 |
*refers to the EFS selected features
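
The "EFS/all in %" column can be reproduced from the feature counts in the two tables above. The following is a minimal sketch, assuming the percentage is simply the number of EFS-selected features divided by the total number of features, rounded to one decimal place; the names `counts`, `selected_share`, and `shares` are illustrative and not taken from the paper.

```python
# (total features, EFS-selected features), copied from the tables above.
counts = {
    "MI-Mortality": (14, 5),
    "Fibrosis": (26, 7),
    "FLIP": (13, 5),
    "SPECTF": (44, 19),
    "Sonar": (60, 24),
    "WBC": (30, 10),
}


def selected_share(total, selected):
    """Percentage of features kept by the selection, rounded to one decimal."""
    return round(100.0 * selected / total, 1)


# Assumption: the reported column is selected / total * 100.
shares = {name: selected_share(total, sel) for name, (total, sel) in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}")
```

Under this assumption the computed shares agree with the reported column for all six datasets.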
Results on datasets
| Dataset | All [CI] | AUC-FS [CI] | EFS [CI] | AUC-FS vs. EFS* | all vs. EFS** |
|---|---|---|---|---|---|
| MI-Mortality | 0.758 [0.700, 0.800] | 0.757 [0.704, 0.811] | 0.776 [0.725, 0.826] | 0.228 | 0.201 |
| Fibrosis | 0.493 [0.300, 0.600] | 0.681 [0.537, 0.824] | 0.746 [0.617, 0.874] | 0.273 | |
| FLIP | 0.759 [0.600, 0.900] | 0.723 [0.582, 0.863] | 0.761 [0.633, 0.890] | 0.254 | 0.971 |
| SPECTF | 0.807 [0.700, 0.900] | 0.856 [0.811, 0.901] | 0.865 [0.821, 0.910] | 0.444 | 4.68e-4 |
| Sonar | 0.792 [0.700, 0.900] | 0.840 [0.787, 0.894] | 0.862 [0.813, 0.911] | 0.200 | |
| WBC | 0.611 [0.600, 0.700] | 0.987 [0.977, 0.998] | 0.991 [0.981, 1.000] | | |
Columns 1 to 3 give the AUC values obtained with all features, with the features selected by AUC-FS, and with the features selected by EFS, with confidence intervals in brackets. The last two columns show the p-values of the comparison by the method of [28]. The function compares the AUCs of the ROC curves of (*) the AUC-FS and EFS methods and (**) all features and the EFS outcome. Statistically significant p-values are printed in bold
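
For readers unfamiliar with the quantity being compared, the sketch below shows one standard way to obtain an AUC from predicted scores: the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. It does not implement the statistical comparison of [28]; the function name `auc` and the labels and scores are hypothetical illustration data, not values from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive class) and model scores.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = auc(example_labels, example_scores)
print(round(example_auc, 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.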
Fig. 2 Performance of logistic regression models. The y-axis shows the sensitivity and the x-axis the specificity. Three ROC curves are shown per dataset: all features (solid), the AUC-FS selected features (dashed), and the EFS selected features (twodashed). The dotted line marks the performance of random guessing
Variance of feature importances. Variance of the five most important features of a 10-fold cross-validation
| Dataset | Variance #1 | Variance #2 | Variance #3 | Variance #4 | Variance #5 |
|---|---|---|---|---|---|
| MI-Mortality | 0.001759124 | 0.004694053 | 0.004904828 | 0.003720571 | 0.001580310 |
| Fibrosis | 0.003124527 | 0.008085472 | 0.019901386 | 0.009202372 | 0.019804508 |
| FLIP | 0.006604973 | 0.011325453 | 0.014731007 | 0.023499884 | 0.020140657 |
| SPECTF | 0.000380482 | 0.014946809 | 0.011520607 | 0.005807655 | 0.002880478 |
| Sonar | 0.003887830 | 0.001792209 | 0.003004598 | 0.003115140 | 0.002680274 |
| WBC | 0.001071784 | 0.001769331 | 0.002912278 | 0.000387555 | 0.001096465 |
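
The variances in this table summarize how much a feature's importance changes from fold to fold. A minimal sketch of that calculation follows, assuming each feature has one importance value per fold and that the sample variance (denominator n - 1) is used; the source does not state which variance estimator was applied, and the feature names and importance values here are hypothetical.

```python
from statistics import variance

# Hypothetical importances of two features across 10 cross-validation folds.
fold_importances = {
    "feature_a": [0.50, 0.55, 0.45, 0.52, 0.48, 0.50, 0.53, 0.47, 0.51, 0.49],
    "feature_b": [0.20] * 10,  # perfectly stable importance
}

# Assumption: sample variance over the folds, computed per feature.
importance_variance = {
    name: variance(values) for name, values in fold_importances.items()
}

for name, var in importance_variance.items():
    print(f"{name}: {var:.9f}")
```

A variance close to zero indicates that the feature receives nearly the same importance regardless of which folds are used for training.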
Quantity of selected features. Number of selected features of our EFS method with and without the AUC-FS
| Dataset | EFS | EFS without AUC-FS | Intersection |
|---|---|---|---|
| MI-Mortality | 5 | 5 | 5 |
| Fibrosis | 7 | 9 | 7 |
| FLIP | 5 | 5 | 5 |
| SPECTF | 19 | 20 | 19 |
| Sonar | 24 | 24 | 24 |
| WBC | 10 | 11 | 9 |