Rudolf Jagdhuber, Michel Lang, Arnulf Stenzl, Jochen Neuhaus, Jörg Rahnenführer.
Abstract
BACKGROUND: With modern methods in biotechnology, the search for biomarkers has advanced to a challenging statistical task exploring high dimensional data sets. Feature selection is a widely researched preprocessing step to handle huge numbers of biomarker candidates and has special importance for the analysis of biomedical data. Such data sets often include many input features not related to the diagnostic or therapeutic target variable. A less researched, but also relevant aspect for medical applications are costs of different biomarker candidates. These costs are often financial costs, but can also refer to other aspects, for example the decision between a painful biopsy marker and a simple urine test. In this paper, we propose extensions to two feature selection methods to control the total amount of such costs: greedy forward selection and genetic algorithms. In comprehensive simulation studies of binary classification tasks, we compare the predictive performance, the run-time and the detection rate of relevant features for the new proposed methods and five baseline alternatives to handle budget constraints.Entities:
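The first of the two proposed extensions, cost-constrained greedy forward selection, can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes a user-supplied scoring function `cv_auc(subset)` (e.g. cross-validated AUC), and the stopping rule shown here is an assumption for the sketch.

```python
def greedy_forward_selection(costs, cmax, cv_auc):
    """Add one feature per iteration, never letting total cost exceed cmax."""
    selected, spent = [], 0.0
    while True:
        # Only features that still fit into the remaining budget are candidates.
        candidates = [f for f in range(len(costs))
                      if f not in selected and spent + costs[f] <= cmax]
        if not candidates:
            return selected
        best = max(candidates, key=lambda f: cv_auc(selected + [f]))
        # Stop when the best remaining candidate no longer improves the score.
        if selected and cv_auc(selected + [best]) <= cv_auc(selected):
            return selected
        selected.append(best)
        spent += costs[best]
```

The only change relative to plain forward selection is the budget filter on the candidate set: expensive features drop out of consideration as the budget is spent.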
Keywords: Budget constraint; Cost limit; Feature cost; Feature selection; Genetic algorithm
Year: 2020 PMID: 31992203 PMCID: PMC6986087 DOI: 10.1186/s12859-020-3361-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1Optimization path of a (not cost-adapted) genetic algorithm that uses a fitness function accounting for the extent of constraint violation. Data: 298 features (each with cost 1), cmax=10. The first candidate meeting the constraint is found in iteration 42
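A fitness function of the kind described in Fig. 1 lets the genetic algorithm explore infeasible candidates while steering them toward the budget. A common way to sketch this is a penalty proportional to the extent of the constraint violation; `penalty_weight` is an assumed tuning constant, not a value from the paper.

```python
def fitness(bits, costs, cmax, auc, penalty_weight=1.0):
    """Penalized fitness for a 0/1 feature mask `bits`.

    `auc(bits)` is an assumed scoring function for the masked feature set.
    """
    total_cost = sum(c for b, c in zip(bits, costs) if b)
    violation = max(0.0, total_cost - cmax)  # extent of budget violation
    return auc(bits) - penalty_weight * violation
```

Candidates within the budget are scored purely by predictive performance; over-budget candidates are ranked lower the further they exceed cmax, which is why the first feasible candidate in Fig. 1 only appears after some iterations.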
Combinations of γ, p, p(rel) and β used for the simulation design
| Setting | γ | p | p(rel) | β |
|---|---|---|---|---|
| Setting A | | 30 | 18 | 0.3 |
| Setting B | | 30 | 3 | 1 |
| Setting C | | 300 | 30 | 0.5 |
| Setting D | | 300 | 3 | 0.5 |
| Setting E | 2 | 1500 | 15 | 0.5 |
| Setting F | | 1500 | 20 | 0.5 |
| Setting G | | 300 | 30 | 0.3 |
| Setting H | | 300 | 30 | 0.5 |
| Setting I | | 300 | 30 | |
| Setting J | | 300 | 30 | |
| Setting K | | 300 | 30 | 0.5 |
For every setting, B=100 training data sets are generated. Settings G to K are specialized settings that focus on changes in the data generation process. For details see the "Settings with altered simulation design" section
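The roles of p, p(rel), and β in such a simulation design can be sketched as follows: of p features, only the first p(rel) carry a mean shift of β between the two classes. This is only an illustration of the parameter roles; the exact generation (feature costs, covariance structure, γ) follows the paper's description.

```python
import numpy as np

def simulate(n, p, p_rel, beta, rng):
    """One simulated binary-classification training set (mean-shift design)."""
    y = rng.integers(0, 2, size=n)       # binary class labels
    X = rng.standard_normal((n, p))      # all features start as N(0, 1) noise
    X[:, :p_rel] += beta * y[:, None]    # relevant features shifted for class 1
    return X, y

rng = np.random.default_rng(0)
# Illustrative draw with the parameters of Setting C: p=300, p(rel)=30, beta=0.5
X, y = simulate(n=100, p=300, p_rel=30, beta=0.5, rng=rng)
```

Repeating the draw B=100 times with fresh seeds yields the per-setting collection of training sets evaluated in the study.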
Fig. 2 Illustration of a possible situation when applying a non-identity covariance matrix to the situation of (8). Two highly correlated, normally distributed features x1 and x2, where the first component x1 has different means for the two classes (μ=1 for y=1 and μ=0 for y=0) and the second component x2 has the same mean (μ=0) for both classes. The resulting multivariate structure is perfectly separable by a linear function
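The effect in Fig. 2 is easy to reproduce numerically: a univariately useless feature x2 cancels the shared noise in x1, so the pair becomes (almost) linearly separable. The construction below, with an assumed small independent-noise level, is a sketch of that situation rather than the paper's exact covariance setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
y = rng.integers(0, 2, size=n)
z = rng.standard_normal(n)                  # shared component -> high correlation
eps = 0.05 * rng.standard_normal((2, n))    # small independent noise
x1 = y + z + eps[0]                         # mean 1 for y=1, mean 0 for y=0
x2 = z + eps[1]                             # mean 0 for both classes
# The linear combination x1 - x2 strips the shared noise z and recovers y
acc = np.mean(((x1 - x2) > 0.5) == (y == 1))
```

Univariately, x2 carries no class information at all; jointly with x1 it makes the classes nearly perfectly separable, which is exactly why multivariate selection methods can outperform univariate filters here.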
Combination of γ, p, p(rel) and β used for the plasmode simulation Setting R
| Setting | p | p(rel) | β |
|---|---|---|---|
| Setting R | 298 | 30 | 0.5 |
Fig. 3 Performance results for simulation Settings A to K. Boxplots for every feature selection method illustrate the distribution of the AUC values obtained in the 100 data sets (transparent dots). The black diamonds depict the mean AUC values. A horizontal bar highlights the area between the 0.05 and 0.95 quantiles of the AUC values when always selecting the cheapest subset (green) or the best real cFS subset (golden) of relevant features that fits in the budget. Both correspond to univariately optimal solutions
Fig. 4 Precision-recall plot comparing the analyzed feature selection methods across all simulation settings. Precision is the number of relevant features detected divided by the total number of features in the model. Recall is the number of relevant features detected divided by the total number of existing relevant features. The cost budget defines an upper limit for the recall in the simulations; it is highlighted by a green line. To assess the quality of the feature selection methods, the precision and recall of selecting features at random are added to the plots as horizontal and vertical dashed lines. The plot boundaries are re-scaled to depict the area of interest between randomness and the optimal values
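The precision and recall definitions used in Fig. 4 can be written down directly; the helper below is a straightforward transcription of the caption, with hypothetical argument names.

```python
def precision_recall(selected, relevant):
    """Precision and recall of a selected feature set w.r.t. the relevant set."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)                       # relevant features found
    precision = hits / len(selected) if selected else 0.0 # fraction of model that is relevant
    recall = hits / len(relevant)                         # fraction of relevant set recovered
    return precision, recall
```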
Overview of precision and recall of all analyzed feature selection methods for different simulation settings
| | Recall | | | | | | | | | Precision | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | A | B | C | D | E | F | G | H | K | A | B | C | D | E | F | G | H | K |
| Methods | ||||||||||||||||||
| FS | 33.9 | 58.3 | 19.2 | 53.7 | 79.4 | 34.4 | 9.7 | 24.0 | 18.9 | 98.7 | 98.3 | 98.8 | 100.0 | 69.9 | 93.9 | 59.1 | 91.2 | 98.1 |
| cFS | 41.6 | 66.7 | 29.1 | 62.7 | 69.3 | 40.2 | 20.5 | 27.9 | 28.2 | 95.5 | 100.0 | 92.5 | 93.1 | 37.9 | 68.2 | 58.7 | 85.7 | 92.3 |
| cFS.mean | 38.8 | 65.7 | 25.2 | 59.3 | 78.3 | 40.3 | 18.6 | 27.0 | 24.7 | 97.8 | 99.0 | 97.4 | 97.8 | 59.8 | 89.9 | 65.8 | 92.0 | 97.5 |
| cFS.max | 37.6 | 64.0 | 23.6 | 57.0 | 79.1 | 38.7 | 16.3 | 26.3 | 22.9 | 98.7 | 99.0 | 98.6 | 98.8 | 63.5 | 91.1 | 65.6 | 93.0 | 97.9 |
| fGA | 40.6 | 66.3 | 27.4 | 55.0 | 81.7 | 38.0 | 19.0 | 27.8 | 23.8 | 97.9 | 99.5 | 98.1 | 87.3 | 59.3 | 83.7 | 68.1 | 93.8 | 98.3 |
| cGA | 40.3 | 66.3 | 27.3 | 61.7 | 79.9 | 38.6 | 19.0 | 27.8 | 23.8 | 97.8 | 99.5 | 97.0 | 96.9 | 61.2 | 90.0 | 69.7 | 94.2 | 97.4 |
| Filter.tTest | 34.3 | 58.7 | 18.4 | 53.0 | 98.4 | 34.7 | 8.8 | 23.9 | 18.9 | 99.5 | 99.4 | 99.8 | 100.0 | 88.2 | 95.6 | 61.5 | 93.6 | 99.8 |
| Filter.Symuncert | 24.1 | 56.0 | 18.4 | 47.0 | 84.2 | 33.0 | 8.7 | 24.1 | 18.6 | 99.3 | 100.0 | 99.6 | 97.2 | 84.0 | 95.0 | 57.9 | 99.3 | 99.8 |
| Filter.PraznikJMIM | 30.3 | 57.3 | 18.6 | 44.0 | 60.8 | 28.1 | 7.0 | 23.8 | 18.4 | 83.1 | 86.0 | 97.6 | 76.3 | 54.4 | 77.0 | 43.3 | 89.3 | 96.0 |
| Filter.RangerImpurity | 34.3 | 57.0 | 19.1 | 53.7 | 96.0 | 33.4 | 8.9 | 24.1 | 18.8 | 93.5 | 85.5 | 99.0 | 98.8 | 85.0 | 93.3 | 60.5 | 89.9 | 98.3 |
| Reference | ||||||||||||||||||
| Budget constraint | 50.0 | 66.7 | 33.3 | 66.7 | 100.0 | 50.0 | 33.3 | 33.3 | 33.3 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Random selection | 20.5 | 4.7 | 1.5 | 0.3 | 1.1 | 0.5 | 1.6 | 4.0 | 1.6 | 60.0 | 10.0 | 10.0 | 1.0 | 1.0 | 1.3 | 10.0 | 10.0 | 10.0 |
Values are given in percent. The results are an extended version of the data shown in Fig. 4. Please refer to the description of this figure for further details
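The random-selection reference row can be reproduced directly: the expected precision of a uniformly drawn feature subset equals p(rel)/p. A quick check against the settings table (p and p(rel) values for the first four settings):

```python
# (p, p_rel) per setting, taken from the simulation-design table
settings = {"A": (30, 18), "B": (30, 3), "C": (300, 30), "D": (300, 3)}

# Expected precision of random selection, in percent: 100 * p_rel / p
baseline = {s: 100 * p_rel / p for s, (p, p_rel) in settings.items()}
```

This yields 60% for Setting A, 10% for B and C, and 1% for D, matching the "Random selection" row of the table.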
Fig. 5 Selection frequency of the feature with the strongest benefit-cost ratio (first row) and of the feature with the strongest effect size (second row). The third row shows the average number of noise features across the simulations
Fig. 6 Performance results for the plasmode simulation Setting R and the real-world data Setting S. Boxplots for every feature selection method illustrate the distribution of the AUC values obtained for the 100 training-test splits (transparent dots). The black diamonds depict the mean AUC values. A green bar in the top plot highlights the area between the 0.05 and 0.95 quantiles of the AUC values when always selecting the optimal subset of relevant features that fits in the budget. For Setting S, in the bottom plot, the left elements show the results with cmax=1.5 and the right elements show the results with cmax=3
Fig. 7 Top: Setting R. Discretized violin plots of the relevant feature count distribution (blue) and the total model size distribution (black) for the 100 analyzed training-test splits of the plasmode simulation. The green bar indicates the maximum number of relevant features that can be added within the budget of this setting. Bottom: Setting S. Discretized violin plots of the distribution of the total model size for the analyzed budget limits cmax=1.5 (black) and cmax=3 (gray)