Zhenqiu Liu, Fengzhu Sun, Dermot P McGovern.
Abstract
BACKGROUND: Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are the L1, SCAD and MC+ penalties. However, none of the existing algorithms optimizes the L0 penalty, which directly penalizes the number of nonzero features.
Keywords: Big data mining; Classification; GLM; L0 penalty; Multi-omics data; Sparse modeling; Suboptimal debulking
Year: 2017 PMID: 29270229 PMCID: PMC5735537 DOI: 10.1186/s13040-017-0159-z
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
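To illustrate the distinction drawn in the abstract, the following minimal sketch contrasts an L0 penalty (a count of nonzero coefficients) with an L1 penalty (a sum of absolute values). The function names, the tolerance `tol`, and the example vectors are assumptions made for illustration and do not come from the paper.

```python
def l0_penalty(beta, tol=1e-12):
    """Count the coefficients whose magnitude exceeds tol (the L0 'norm')."""
    return sum(1 for b in beta if abs(b) > tol)


def l1_penalty(beta):
    """Sum of absolute coefficient values (the L1 norm)."""
    return sum(abs(b) for b in beta)


# Two hypothetical coefficient vectors with the same support
# but very different magnitudes.
small = [0.0, 0.5, 0.0, -0.5]
large = [0.0, 5.0, 0.0, -5.0]

# L0 depends only on how many features are selected ...
print(l0_penalty(small), l0_penalty(large))
# ... whereas L1 also grows with the size of the coefficients.
print(l1_penalty(small), l1_penalty(large))
```

Because the L0 value is unchanged when coefficients are rescaled, it penalizes model size only, while L1 also shrinks the coefficients that are kept.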
Link functions for linear, logistic and Poisson regression models in GLM, where different models have different A(∗), B(∗), and C(∗)
| GLM models | | | Link | |
|---|---|---|---|---|
| Linear regression | | | Identity | 1 |
| Logistic regression | log(1+ | | logit | |
| Poisson regression | exp( | exp( | log | |
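As a rough illustration of the link functions named in the table (identity, logit, log), the sketch below pairs each standard link with its inverse. It uses only the textbook definitions. The dictionary `LINKS` and the function names are hypothetical, and the model-specific A(∗), B(∗), and C(∗) terms are not reproduced here.

```python
import math


def identity_link(mu):
    return mu


def identity_inverse(eta):
    return eta


def logit_link(mu):
    # Maps a probability in (0, 1) to the real line.
    return math.log(mu / (1.0 - mu))


def logit_inverse(eta):
    # Logistic (sigmoid) function: maps the real line back to (0, 1).
    return 1.0 / (1.0 + math.exp(-eta))


def log_link(mu):
    # Maps a positive mean (e.g. a Poisson rate) to the real line.
    return math.log(mu)


def log_inverse(eta):
    return math.exp(eta)


# Hypothetical lookup from model name to (link, inverse link).
LINKS = {
    "linear": (identity_link, identity_inverse),
    "logistic": (logit_link, logit_inverse),
    "poisson": (log_link, log_inverse),
}

# Applying a link and then its inverse should recover the mean.
for name, (link, inverse) in LINKS.items():
    mu = 0.3
    print(name, inverse(link(mu)))
```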
Performance of different GLM methods for Poisson regression over 100 simulations. Values in parentheses are standard deviations. ANSF: average number of selected features; rMSE: average square root of the mean squared error. The remaining measure is the average absolute bias between the true and estimated parameters.
| PMS | glmnet | SparseReg |
| ||
|---|---|---|---|---|---|
|
| SCAD | MC+ | |||
| rMSE | 1.10(±.091) | 1.090(±.092) |
| 1.937(±.222) | |
| N =100 |
| 1.755(±.274) | 1.754(±.275) | 1.737(±.273) |
|
| P =100 | ANSF | 43.03(±3.52) | 43.07(±3.57) | 42.06(±3.51) |
|
| PTM | 0 | 0 | 0 |
| |
| FDR | 90.6 | 90.6 | 90.6 |
| |
| rMSE | 0.503(±.017) | 0.502(±.017) |
| 2.108(±.359) | |
| N =100 |
| 2.671(±.421) | 2.673(±.425) | 2.821(±2.012) |
|
|
| ANSF | 75.47(±5.61) | 75.82(±5.71) | 75.14(±8.69) |
|
| PTM | 0 | 0 | 0 |
| |
| FDR | 94.7 | 94.7 | 94.6 |
| |
| rMSE |
| 0.272(±.012) | 0.275(±.025) | 1.916(±.081) | |
| N =500 |
| 5.845(±.280) | 6.185(±2.359) | 5.807(±.273) |
|
|
| ANSF | 465.6(±14.1) | 475.1(±15.5) | 463.6(±13.9) |
|
| PTM | 0 | 0 | 0 |
| |
| FDR | 99.1 | 99.2 | 99.1 |
| |
PMS: Performance Measures. PTM: Percentage of true models. FDR: False discovery rate. The values in boldface indicate the best performance
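The performance measures listed above can be sketched for a single simulated fit as follows. This is a minimal illustration under stated assumptions: a feature counts as selected when its estimated coefficient exceeds a small tolerance `tol`, FDR is the share of selected features that are not in the true model, and the absolute bias is averaged over all coefficients. The paper may define these quantities differently, and all function names are hypothetical.

```python
import math


def selected(beta, tol=1e-8):
    """Indices of coefficients treated as nonzero (assumed tolerance)."""
    return {j for j, b in enumerate(beta) if abs(b) > tol}


def false_discovery_rate(beta_true, beta_hat, tol=1e-8):
    """Fraction of selected features that are absent from the true model."""
    true_set = selected(beta_true, tol)
    hat_set = selected(beta_hat, tol)
    if not hat_set:
        return 0.0  # nothing selected, so no false discoveries
    return len(hat_set - true_set) / len(hat_set)


def is_true_model(beta_true, beta_hat, tol=1e-8):
    """True when the selected support matches the true support exactly."""
    return selected(beta_true, tol) == selected(beta_hat, tol)


def rmse(y, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y))


def mean_abs_bias(beta_true, beta_hat):
    """Average absolute difference between true and estimated coefficients."""
    return sum(abs(a - b) for a, b in zip(beta_true, beta_hat)) / len(beta_true)


# Hypothetical true and estimated coefficients for one simulation.
beta_true = [1.0, 0.0, -2.0, 0.0]
beta_hat = [0.9, 0.3, -1.8, 0.0]

print(len(selected(beta_hat)))                    # number of selected features
print(false_discovery_rate(beta_true, beta_hat))  # one false feature out of three
print(is_true_model(beta_true, beta_hat))
print(mean_abs_bias(beta_true, beta_hat))
```

Averaging the number of selected features over all simulations would give ANSF, and the percentage of simulations for which `is_true_model` returns `True` would give PTM.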
Performance of different GLM methods for logistic regression over 100 simulations. ANSF: average number of selected features; trMSE: average square root of the mean squared error on test data. The remaining measure is the average absolute bias between the true and estimated parameters.
| PMS |
| SparseReg |
| ||
|---|---|---|---|---|---|
| SCAD | MC+ | ||||
| trMSE | 0.0474(±.0035) | 0.0469(±.0039) | 0.0456(±.0042) |
| |
| N =100 |
| 0.2984(±.1262) | 0.3129(±.1249) | 0.1625(±.0752) |
|
| P =100 | ANSF | 17.10(±9.32) | 18.35(±10.185) | 10.410(±6.174) |
|
| PTM | 0 | 0 | 2 |
| |
| FDR | 77.7 | 78.4 | 62.4 |
| |
| trMSE | 0.0517(±.0045) | 0.0496(±.0046) | 0.0468(±.0045) |
| |
| N =100 |
| 0.5968(±.2599) | 0.6465(±.2205) | 0.2818(±.1030) |
|
|
| ANSF | 50.92(±39.974) | 73.030(±40.792) | 24.80(±13.314) |
|
| PTM | 0 | 0 | 0 |
| |
| FDR | 90.5 | 93 | 83.9 |
| |
PMS: Performance Measures. PTM: Percentage of true models. FDR: False discovery rate. The values in boldface indicate the best performance
Fig. 1 Gene signatures associated with suboptimal debulking. Nodes in red: mRNA signatures; nodes in green: microRNA signatures; nodes in pink: methylation signatures. Edges in red: positive partial correlation; edges in blue: negative partial correlation
Fig. 2 Predictive AUCs for integrated data, mRNA expression only, microRNA expression only, and methylation only
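For readers unfamiliar with the AUC reported in Fig. 2, the sketch below computes it as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case, with ties counted as one half. This is the standard pairwise definition rather than the authors' code, and the labels and scores are invented for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive class) and predicted scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration. A rank-based formula gives the same value more efficiently on large data sets.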