Natalia Becker, Grischa Toedt, Peter Lichter, Axel Benner.
Abstract
BACKGROUND: Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods, with a wide range of scientific applications, the SVM does not include automatic feature selection, and therefore a number of feature selection procedures have been developed. Regularisation approaches extend the SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which finds a globally optimal solution more rapidly and more precisely than a fixed grid search.
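To make the idea of combining SCAD and ridge penalties concrete, the following is a minimal sketch rather than the authors' implementation. It uses the standard piecewise SCAD penalty of Fan and Li with the conventional default a = 3.7 and adds a squared L2 (ridge) term. The function names, the parameter names `lam1` and `lam2`, and the exact scaling of the ridge term are assumptions made for illustration.

```python
def scad_penalty(w, lam, a=3.7):
    """Standard SCAD penalty for one coefficient w (requires lam > 0, a > 2)."""
    x = abs(w)
    if x <= lam:
        # Small coefficients: behaves like the LASSO (L1) penalty.
        return lam * x
    if x <= a * lam:
        # Intermediate coefficients: quadratic transition region.
        return -(x * x - 2 * a * lam * x + lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return (a + 1) * lam * lam / 2


def elastic_scad_penalty(weights, lam1, lam2, a=3.7):
    """SCAD penalty (tuning parameter lam1) plus a ridge term (lam2).

    The ridge scaling lam2 * sum(w^2) is an assumption of this sketch.
    """
    scad = sum(scad_penalty(w, lam1, a) for w in weights)
    ridge = lam2 * sum(w * w for w in weights)
    return scad + ridge
```

With `lam2 = 0` the sketch reduces to plain SCAD, and as `lam1` approaches 0 only the ridge term remains, which is one way to see why the combination has two tuning parameters to search over.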
Year: 2011 PMID: 21554689 PMCID: PMC3113938 DOI: 10.1186/1471-2105-12-138
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Mean misclassification rate of feature selection methods applied to simulated test data
| FS method | r = 10 | r = 50 | r = 100 | r = 200 |
|---|---|---|---|---|
| 34.8(2.2) | 33.1(2.0) | |||
| 28.3(2.8) | ||||
| SCAD SVM | ||||
| Elastic Net SVM | ||||
| Elastic SCAD SVM | ||||
Training and test data with 1000 features and 500 samples were simulated. The number of relevant features (r) was increased from r = 10 to r = 200 in four steps. Each simulation step was based on 100 simulations of training and test data. Bold marks the significantly best method(s) according to the MCB test at the family-wise significance level α = 0.05 and a non-inferiority margin of Δ = 5%.
Average Youden index for classifiers applied to simulated test data
| FS method | r = 10 | r = 50 | r = 100 | r = 200 |
|---|---|---|---|---|
| 0.81(0.11) | 0.32(0.16) | 0.14(0.10) | ||
| SCAD SVM | 0.28(0.12) | |||
| Elastic Net SVM | ||||
| Elastic SCAD SVM |
Bold marks the significantly best method(s) according to the MCB test at the family-wise significance level α = 0.05 and a non-inferiority margin of Δ = 0.10.
Median number of features selected
| FS method | r = 10 | r = 50 | r = 100 | r = 200 |
|---|---|---|---|---|
| 141(56) | 296(98) | 509(290) | 789(223) | |
| SCAD SVM | 593(382) | 726(181) | ||
| Elastic Net SVM | 38(25) | 242(110) | ||
| Elastic SCAD SVM |
Bold marks the median number of features that comes closest to the true number of relevant features per simulation scenario (median absolute deviation in parentheses); underlining marks the second best.
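The table entries have the form median(MAD). As a minimal sketch of how such a summary could be computed from per-simulation feature counts, the helper below returns the median and the unscaled median absolute deviation. The function name and the example counts are hypothetical, and whether the source applied a consistency constant (R's `mad()` multiplies by 1.4826 by default) is not stated, so the unscaled form is an assumption.

```python
from statistics import median


def median_and_mad(counts):
    """Return (median, unscaled median absolute deviation) of a list of counts."""
    counts = list(counts)
    med = median(counts)
    # MAD: median of absolute deviations from the median (no scaling constant).
    mad = median([abs(c - med) for c in counts])
    return med, mad


# Hypothetical numbers of selected features from five simulation runs.
example_counts = [10, 12, 15, 40, 11]
summary = median_and_mad(example_counts)
```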
Summary of classifiers for the NKI data set with distant metastasis as endpoint
| FS method | # features | test error(%) | sensitivity(%) | specificity(%) | Youden index | AUC |
|---|---|---|---|---|---|---|
| Standard SVM | 4919 (all) | 24 | 79 | 68 | 0.47 | 0.735 |
| RFE SVM | 256 | 25 | 83 | 59 | 0.42 | 0.71 |
| MammaPrint(R) | 70 | 37 | 74 | 40 | 0.14 | 0.57 |
| | 1573 | 17 | 84 | 81 | 0.65 | 0.825 |
| SCAD SVM | 476 | 25 | 84 | 56 | 0.39 | 0.695 |
| Elastic Net SVM | 109 | 25 | 83 | 59 | 0.42 | 0.71 |
| Elastic SCAD SVM | 459 | 24 | 84 | 57 | 0.41 | 0.705 |
Misclassification error, sensitivity, specificity, Youden index and AUC value for four feature selection methods, RFE SVM and standard SVM based on ten-fold stratified cross validation.
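The Youden index reported in these summaries is sensitivity + specificity − 1. The sketch below computes it, together with the area under a ROC curve that consists of a single operating point joined to (0, 0) and (1, 1), which equals (sensitivity + specificity) / 2. That the tabulated AUC values were obtained this way is an assumption inferred from the numbers, not something the source states; the function names are illustrative.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1 (inputs as fractions)."""
    return sensitivity + specificity - 1.0


def one_point_auc(sensitivity, specificity):
    """Trapezoidal area under a ROC curve with a single operating point."""
    return (sensitivity + specificity) / 2.0


# Values taken from one table row: sensitivity 84%, specificity 81%.
j = youden_index(0.84, 0.81)
auc = one_point_auc(0.84, 0.81)
```

For that row the sketch gives a Youden index of 0.65 and an area of 0.825, matching the tabulated values.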
Figure 1. ROC plot for the NKI breast data set. The characteristics for the different feature selection methods were derived using ten-fold stratified cross validation. TPR and FPR values are presented as points (x axis: 1 - specificity = FPR; y axis: sensitivity = TPR). RFE_256 is RFE SVM with 256 top-ranked features, ENet is Elastic Net SVM, ESCAD is Elastic SCAD SVM. '70_sign' stands for the 70-gene signature classifier. Gray dashed lines depict isolines of the Youden index.
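An isoline of the Youden index J in ROC space is the set of points with TPR − FPR = J, that is, a line parallel to the diagonal. The helper below, whose name and signature are hypothetical, returns evenly spaced points on such a line inside the unit square; it only illustrates the geometry and is not the plotting code used for the figure.

```python
def youden_isoline(j, n_points=5):
    """Points (FPR, TPR) on the line TPR = FPR + j for 0 <= j <= 1."""
    if not 0.0 <= j <= 1.0 or n_points < 2:
        raise ValueError("need 0 <= j <= 1 and n_points >= 2")
    # The line leaves the unit square where TPR reaches 1, i.e. FPR = 1 - j.
    fpr_max = 1.0 - j
    points = []
    for i in range(n_points):
        fpr = fpr_max * i / (n_points - 1)
        points.append((fpr, fpr + j))
    return points


isoline = youden_isoline(0.5, n_points=3)
```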
Summary of classifiers for the MAQC-II data set with pCR status as endpoint
| FS method | # features | test error(%) | sensitivity(%) | specificity(%) | Youden index | AUC |
|---|---|---|---|---|---|---|
| Standard SVM | 22283 (all) | 19 | 32 | 97 | 0.25 | 0.62 |
| RFE SVM | 2048 | 20 | 27 | 93 | 0.20 | 0.895 |
| | 7299 | 21 | 27 | 93 | 0.20 | 0.60 |
| SCAD SVM | 1072 | 21 | 35 | 91 | 0.26 | 0.63 |
| Elastic Net SVM | 398 | 24 | 15 | 91 | 0.06 | 0.53 |
| Elastic SCAD SVM | 148 | 15 | 52 | 94 | 0.46 | 0.73 |
Misclassification error, sensitivity, specificity, Youden index and AUC value for four feature selection methods, RFE SVM and standard SVM based on ten-fold stratified cross validation.
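Ten-fold stratified cross validation splits the samples into ten folds so that each fold keeps roughly the class proportions of the full data set. The sketch below shows one simple way to assign samples to folds in plain Python. It is an illustration under assumptions (the function name, the seed handling and the round-robin assignment are all hypothetical), not the procedure used by the authors.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return a list giving a fold number in 0..k-1 for every sample."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    fold_of = [None] * len(labels)
    offset = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)  # randomise the order within each class
        for pos, idx in enumerate(idxs):
            # Deal samples round-robin, continuing where the last class stopped.
            fold_of[idx] = (offset + pos) % k
        offset += len(idxs)
    return fold_of


# Hypothetical labels: 70 samples of class 0 and 30 of class 1.
labels = [0] * 70 + [1] * 30
folds = stratified_folds(labels, k=10, seed=1)
```

In this example every fold receives 7 samples of class 0 and 3 of class 1, so each test fold mirrors the 70/30 class balance.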
Figure 2. ROC plot for the MAQC-II breast data set with pCR as endpoint. The characteristics for the different feature selection methods were derived using ten-fold stratified cross validation. TPR and FPR values are presented as points (x axis: 1 - specificity = FPR; y axis: sensitivity = TPR). RFE_256 is RFE SVM with 1024 top-ranked features, ENet is Elastic Net SVM, ESCAD is Elastic SCAD SVM. Gray dashed lines depict isolines of the Youden index.
Summary of classifiers for the MAQC-II data set with ER status as endpoint
| FS method | # features | test error(%) | sensitivity(%) | specificity(%) | Youden index | AUC |
|---|---|---|---|---|---|---|
| Standard SVM | 22283 (all) | 10 | 93 | 84 | 0.77 | 0.855 |
| RFE SVM | 2048 | 14 | 89 | 81 | 0.79 | 0.895 |
| | 860 | 11 | 89 | 88 | 0.77 | 0.885 |
| SCAD SVM | 32 | 9 | 91 | 91 | 0.83 | 0.915 |
| Elastic Net SVM | 3 | 9 | 93 | 82 | 0.75 | 0.875 |
| Elastic SCAD SVM | 59 | 7 | 96 | 88 | 0.84 | 0.92 |
Misclassification error, sensitivity, specificity, Youden index and AUC value for four feature selection methods, RFE SVM and standard SVM without feature selection based on ten-fold stratified cross validation.
Figure 3. ROC plot for the MAQC-II breast data set with ER as endpoint. The characteristics for the different feature selection methods were derived using ten-fold stratified cross validation. TPR and FPR values are presented as points (x axis: 1 - specificity = FPR; y axis: sensitivity = TPR). RFE_256 is RFE SVM with 1024 top-ranked features, ENet is Elastic Net SVM, ESCAD is Elastic SCAD SVM. Gray dashed lines depict isolines of the Youden index.
Summary of classifiers for Mainz cohort, validated on Rotterdam cohort with relapse as endpoint
| FS method | # features | test error(%) | sensitivity(%) | specificity(%) | Youden index | AUC |
|---|---|---|---|---|---|---|
| Standard SVM | 22283 (all) | 44 | 68 | 48 | 0.16 | 0.58 |
| RFE SVM | 512 | 37 | 38 | 77 | 0.16 | 0.58 |
| | 1861 | 37 | 47 | 72 | 0.19 | 0.595 |
| SCAD SVM | 915 | 37 | 35 | 80 | 0.15 | 0.575 |
| Elastic Net SVM | 278 | 43 | 51 | 60 | 0.12 | 0.56 |
| Elastic SCAD SVM | 2823 | 37 | 34 | 81 | 0.15 | 0.575 |
Misclassification error, sensitivity, specificity, Youden index and AUC value for four feature selection methods, RFE SVM and standard SVM trained on the Mainz cohort and applied to the Rotterdam cohort.
Summary of classifiers for Rotterdam cohort, validated on Mainz cohort with relapse as endpoint
| FS method | # features | test error(%) | sensitivity(%) | specificity(%) | Youden index | AUC |
|---|---|---|---|---|---|---|
| Standard SVM | 22283 (all) | 25 | 11 | 93 | 0.04 | 0.52 |
| RFE SVM | 22283 (all) | 25 | 11 | 93 | 0.04 | 0.52 |
| | 8319 | 28 | 30 | 84 | 0.14 | 0.57 |
| SCAD SVM | 1284 | 35 | 41 | 72 | 0.13 | 0.565 |
| Elastic Net SVM | 272 | 28 | 37 | 81 | 0.19 | 0.595 |
| Elastic SCAD SVM | 2074 | 26 | 30 | 87 | 0.17 | 0.585 |
Misclassification error, sensitivity, specificity, Youden index and AUC value for four feature selection methods, RFE SVM and standard SVM trained on the Rotterdam cohort and applied to the Mainz cohort.