Manli Zhou, Youxi Luo, Guoquan Sun, Guoqin Mai, Fengfeng Zhou.
Efficient and intuitive characterization of biological big data is becoming a major challenge for modern bio-OMIC based scientists. Interactive visualization and exploration of big data is proven to be one of the successful solutions. Most of the existing feature selection algorithms do not allow the interactive inputs from users in the optimizing process of feature selection. This study investigates this question as fixing a few user-input features in the finally selected feature subset and formulates these user-input features as constraints for a programming model. The proposed algorithm, fsCoP (feature selection based on constrained programming), performs well similar to or much better than the existing feature selection algorithms, even with the constraints from both literature and the existing algorithms. An fsCoP biomarker may be intriguing for further wet lab validation, since it satisfies both the classification optimization function and the biomedical knowledge. fsCoP may also be used for the interactive exploration of bio-OMIC big data by interactively adding user-defined constraints for modeling.Entities:
Year: 2015 PMID: 26075274 PMCID: PMC4437250 DOI: 10.1155/2015/910515
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
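The excerpt does not reproduce fsCoP's actual constrained-programming formulation, but the core idea, forcing a few user-chosen features into the selected subset and optimizing the remaining choices for classification performance, can be sketched with a simple greedy forward selection. The nearest-centroid scorer, the synthetic data, and the feature indices below are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def subset_accuracy(X, y, cols):
    """Score a feature subset with a nearest-centroid classifier
    (a stand-in for the paper's classifiers: SVM, NBayes, etc.)."""
    Xs = X[:, cols]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1)
            < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (pred == y).mean()

def select_with_constraints(X, y, fixed, n_total=5):
    """Greedy forward selection that always keeps the user-fixed
    features (the constraint) and adds the rest to maximize accuracy.
    Illustrative only; fsCoP's constrained-programming model is not
    reproduced here."""
    selected = list(fixed)                      # constraint: these stay in
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    while len(selected) < n_total:
        best_j = max(candidates,
                     key=lambda j: subset_accuracy(X, y, selected + [j]))
        selected.append(best_j)
        candidates.remove(best_j)
    return selected

# Synthetic two-class data with a few informative features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 20))
X[:, 3] += y
X[:, 11] += 2 * y

subset = select_with_constraints(X, y, fixed=[3, 7])
print(subset)  # the user-fixed features 3 and 7 are always in the subset
```

This mirrors how fsCoP(ACE2) differs from plain fsCoP in the tables below: the two predetermined features are held in the subset regardless of whether an unconstrained search would have picked them.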
Performance comparison of the algorithm fsCoP. fsCoP has no pre-fixed features, and the model fsCoP(ACE2) has two predetermined features.
GSE5406

| fsCoP | Sn | Sp | Acc | Avc | MCC |
|---|---|---|---|---|---|
| SVM | | | | | |
| NBayes | 1.000 | 0.998 | 1.000 | 0.999 | 0.999 |
| DTree | 0.992 | 0.800 | 0.967 | 0.896 | 0.848 |
| Lasso | 0.999 | 0.900 | 0.987 | 0.950 | 0.939 |
| KNN | 1.000 | 0.871 | 0.983 | 0.936 | 0.923 |

| fsCoP(ACE2) | Sn | Sp | Acc | Avc | MCC |
|---|---|---|---|---|---|
| SVM | 1.000 | 0.996 | 0.999 | 0.998 | 0.998 |
| NBayes | | | | | |
| DTree | 0.993 | 0.796 | 0.967 | 0.894 | 0.847 |
| Lasso | 1.000 | 0.907 | 0.988 | 0.953 | 0.944 |
| KNN | 0.999 | 0.860 | 0.982 | 0.930 | 0.916 |

GSE1869

| fsCoP | Sn | Sp | Acc | Avc | MCC |
|---|---|---|---|---|---|
| SVM | 1.000 | 0.955 | 0.983 | 0.978 | 0.965 |
| NBayes | 1.000 | 0.972 | 0.990 | 0.986 | 0.979 |
| DTree | 0.907 | 0.000 | 0.567 | 0.453 | NaN |
| Lasso | 0.960 | 0.989 | 0.971 | 0.974 | 0.943 |
| KNN | | | | | |

| fsCoP(ACE2) | Sn | Sp | Acc | Avc | MCC |
|---|---|---|---|---|---|
| SVM | 1.000 | 0.939 | 0.977 | 0.970 | 0.953 |
| NBayes | | | | | |
| DTree | 0.987 | 0.000 | 0.617 | 0.493 | NaN |
| Lasso | 0.990 | 0.967 | 0.981 | 0.978 | 0.962 |
| KNN | | | | | |
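The five measurements in the tables are standard confusion-matrix statistics. Judging from the reported rows (e.g. DTree on GSE5406: (0.992 + 0.800)/2 = 0.896), Avc appears to be the average of Sn and Sp; that reading is an inference from the data, not a definition given in this excerpt. MCC becomes NaN when its denominator contains a zero factor, which can happen for degenerate classifiers such as the DTree rows with Sp = 0.000.

```python
import math

def metrics(tp, fn, tn, fp):
    """Sn = sensitivity, Sp = specificity, Acc = overall accuracy,
    Avc = (Sn + Sp) / 2 (inferred from the table rows),
    MCC = Matthews correlation coefficient (NaN if a denominator
    factor is zero)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    avc = (sn + sp) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float('nan')
    return sn, sp, acc, avc, mcc

# Worked example: 100 positives (90 detected), 100 negatives (80 detected).
sn, sp, acc, avc, mcc = metrics(tp=90, fn=10, tn=80, fp=20)
print(round(sn, 3), round(sp, 3), round(acc, 3),
      round(avc, 3), round(mcc, 3))  # → 0.9 0.8 0.85 0.85 0.704
```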
Running time of fsCoP and fsCoP(ACE2) on GSE5406. All running times are measured in seconds, and the column "Repeat" gives the number of repeats of each model with a different random seed.
| Repeat | fsCoP | Avg (fsCoP) | fsCoP(ACE2) | Avg (fsCoP(ACE2)) |
|---|---|---|---|---|
| 5 | 11.95 | 2.39 | 11.78 | 2.36 |
| 10 | 23.83 | 2.38 | 23.96 | 2.40 |
| 50 | 120.01 | 2.40 | 117.79 | 2.36 |
| 100 | 240.23 | 2.40 | 236.75 | 2.37 |
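The "Avg" columns appear to be the total running time divided by the number of repeats, i.e. the per-repeat cost; a quick arithmetic check against the fsCoP column confirms this reading.

```python
# Total fsCoP running time (seconds) per repeat count, from the table.
totals = {5: 11.95, 10: 23.83, 50: 120.01, 100: 240.23}

# Dividing by the repeat count reproduces the Avg(fsCoP) column
# to two decimals: 2.39, 2.38, 2.40, 2.40.
for repeat, total in totals.items():
    print(repeat, round(total / repeat, 2))
```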
Figure 1. Classification performance comparison of the five feature selection algorithms on the datasets (a) GSE5406 and (b) GSE1869. The histograms give the detailed values of the classification performance measurements, that is, Sn, Sp, Acc, Avc, and MCC.
Figure 2. Improvements of fsCoP over the four investigated feature selection algorithms, obtained by fixing the features selected by each algorithm. The average "Avg()" and standard deviation "StdEv()" of the five classification performance measurements, that is, Sn, Sp, Acc, Avc, and MCC, are calculated over 30 runs of 5-fold cross-validation for a given feature subset.