Zhenqiu Liu, Gang Li.
Abstract
Variable selection for regression with high-dimensional big data has found many applications in bioinformatics and computational biology. One appealing approach is L0 regularized regression, which penalizes the number of nonzero features in the model directly. However, it is well known that L0 optimization is NP-hard and computationally challenging. In this paper, we propose efficient EM (L0EM) and dual L0EM (DL0EM) algorithms that directly approximate the L0 optimization problem. While L0EM is efficient with large sample sizes, DL0EM is efficient with high-dimensional (n ≪ m) data. They also provide a natural solution to all Lp problems with p ∈ [0,2], including the lasso with p = 1 and the elastic net with p ∈ [1,2]. The regularization parameter λ can be determined through cross validation or AIC and BIC. We demonstrate our methods through simulation and high-dimensional genomic data. The results indicate that L0 outperforms lasso, SCAD, and MC+, and that L0 with AIC or BIC performs similarly to computationally intensive cross validation. The proposed algorithms are efficient in identifying the nonzero variables with less bias and in constructing biologically important networks with high-dimensional big data.
Year: 2016 PMID: 27843486 PMCID: PMC5098106 DOI: 10.1155/2016/3456153
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Algorithm 1: L0EM algorithm.
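The paper's Algorithm 1 is not reproduced in this extract. A minimal NumPy sketch of an L0EM-style solver, assuming the common iteratively reweighted ridge approximation to the L0 penalty (each M-step solves a ridge problem whose per-coefficient weight λ/(β_j² + ε) drives small coefficients to exactly zero); the function name, initialization, and tolerances are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def l0em(X, y, lam=1.0, n_iter=100, tol=1e-8, eps=1e-10):
    """L0EM-style iteratively reweighted ridge (sketch; see assumptions above).

    Each iteration solves a ridge problem with per-coefficient penalty
    lam / (beta_j**2 + eps); coefficients near zero receive a huge weight
    and collapse to zero, approximating the L0 penalty.
    """
    n, m = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = Xty / n                                       # cheap marginal start
    for _ in range(n_iter):
        w = lam / (beta ** 2 + eps)                      # E-step-like reweighting
        beta_new = np.linalg.solve(XtX + np.diag(w), Xty)  # M-step: weighted ridge
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0                      # hard-zero numerical dust
    return beta
```

In an orthogonal design this fixed-point iteration behaves like hard thresholding: a coefficient whose least-squares value is below roughly 2·sqrt(λ/n) has no nonzero fixed point and collapses to zero.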
Algorithm 2: DL0EM algorithm.
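Algorithm 2 (DL0EM) targets the n ≪ m regime by working in the n-dimensional dual space, so that no m × m system is ever factorized. A sketch, assuming the standard ridge duality identity (W + XᵀX)⁻¹Xᵀy = W⁻¹Xᵀ(Iₙ + XW⁻¹Xᵀ)⁻¹y for the per-step solve rather than the paper's exact derivation:

```python
import numpy as np

def dl0em(X, y, lam=1.0, n_iter=100, tol=1e-8, eps=1e-10):
    """Dual-form sketch of the reweighted ridge iteration for n << m.

    Each step applies (W + X^T X)^{-1} X^T y via the n x n dual system
    (I_n + X W^{-1} X^T), with W = diag(lam / (beta_j**2 + eps)).
    Hypothetical sketch; the paper's DL0EM may differ in detail.
    """
    n, m = X.shape
    beta = X.T @ y / n                       # cheap marginal start
    for _ in range(n_iter):
        w_inv = (beta ** 2 + eps) / lam      # diagonal of W^{-1}
        Xw = X * w_inv                       # X W^{-1}, shape (n, m)
        K = np.eye(n) + Xw @ X.T             # n x n dual system
        beta_new = w_inv * (X.T @ np.linalg.solve(K, y))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0          # hard-zero numerical dust
    return beta
```

The per-iteration cost drops from O(m³) to O(n²m + n³), which is the point of the dual form when m is in the thousands and n in the hundreds.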
Performance measures for L0 and L1 regularized regression over 100 simulations; values in parentheses are standard deviations. # SF: average number of selected features; MSE: average mean squared error; Bias: average absolute bias between true and estimated parameters.
| r | # SF (L0) | MSE (L0) | Bias (L0) | # SF (L1) | MSE (L1) | Bias (L1) |
|---|---|---|---|---|---|---|
| 0 | 3.39 (±1.1) | 1.01 (±0.14) | 0.206 (±0.12) | 14.5 (±3.45) | 1.19 (±0.19) | 0.38 (±0.1) |
| 0.3 | 3.37 (±0.9) | 1.02 (±0.16) | 0.23 (±0.12) | 14.5 (±2.91) | 1.21 (±0.19) | 0.41 (±0.19) |
| 0.6 | 3.49 (±1.7) | 1.02 (±0.23) | 0.23 (±0.16) | 13.5 (±3.0) | 1.26 (±0.2) | 0.54 (±0.15) |
| 0.8 | 3.32 (±0.9) | 1.06 (±0.15) | 0.28 (±0.21) | 11.7 (±2.69) | 1.3 (±0.21) | 0.89 (±0.25) |
Performance measures for L0 and L1 regularized regression with λ = max{λ_MSE, λ_SS} over 100 simulations; values in parentheses are standard deviations. # SF: average number of selected features; MSE: average mean squared error; Bias: average absolute bias between true and estimated parameters.
| r | # SF (L0) | MSE (L0) | Bias (L0) | # SF (L1) | MSE (L1) | Bias (L1) |
|---|---|---|---|---|---|---|
| 0 | 3.09 (±0.53) | 1.04 (±0.15) | 0.18 (±0.11) | 13.3 (±4.56) | 1.21 (±0.17) | 0.39 (±0.1) |
| 0.3 | 3.08 (±0.54) | 1.04 (±0.15) | 0.17 (±0.07) | 14.5 (±4.20) | 1.22 (±0.17) | 0.42 (±0.19) |
| 0.6 | 3.10 (±0.46) | 1.07 (±0.17) | 0.21 (±0.10) | 13.8 (±5.4) | 1.27 (±0.47) | 0.57 (±0.25) |
| 0.8 | 3.02 (±0.14) | 1.04 (±0.14) | 0.26 (±0.13) | 13.4 (±4.91) | 1.25 (±0.21) | 0.74 (±0.25) |
Performance measures for L0, L1, SCAD, and MC+ regularized regressions with cross validation and λ = max{λ_MSE, λ_SS} over 100 simulations, with sample size n = 100 and m = 1000 features; values in parentheses are standard deviations. # SF: average number of selected features; Bias: average absolute bias between true and estimated parameters; # true model: number of runs (out of 100) in which the exact true model was selected.
| Method | Measure |  |  |  |
|---|---|---|---|---|
| L0 | # SF | 3 (±0) | 2.9 (±0.47) | 2 (±0.73) |
| | Bias | 0.14 (±0.09) | 0.39 (±0.63) | 1.69 (±1.25) |
| | Test MSE | 1.14 (±0.34) | 1.59 (±1.3) | 2.8 (±1.72) |
| | # true model | 100/100 | 78/100 | 23/100 |
| L1 | # SF | 24 (±18.4) | 31.3 (±20.7) | 36.7 (±16.5) |
| | Bias | 0.57 (±0.11) | 0.73 (±0.13) | 1.14 (±0.25) |
| | Test MSE | 1.50 (±0.25) | 1.63 (±0.29) | 1.92 (±0.41) |
| | # true model | 0/100 | 0/100 | 0/100 |
| SCAD | # SF | 106.8 (±110.6) | 73 (±111) | 56.2 (±62.4) |
| | Bias | 0.62 (±0.13) | 0.72 (±0.14) | 1.13 (±0.26) |
| | Test MSE | 1.32 (±0.27) | 1.54 (±0.27) | 2.04 (±0.51) |
| | # true model | 0/100 | 0/100 | 0/100 |
| MC+ | # SF | 60.3 (±38.6) | 70.5 (±26.0) | 78.73 (±16.5) |
| | Bias | 0.56 (±0.14) | 0.66 (±0.12) | 0.78 (±0.17) |
| | Test MSE | 1.25 (±0.21) | 1.31 (±0.27) | 1.46 (±0.27) |
| | # true model | 0/100 | 0/100 | 0/100 |
Figure 1: Regularization path for L0-penalized regression with n = 100, m = 1000, and r = 0.3.
Performance measures for L0 regularized regression with AIC and BIC over 100 simulations with n = 100 and m = 1000; values in parentheses are standard deviations. # SF: average number of selected features; MSE: in-sample average mean squared error; Bias: average absolute bias between true and estimated parameters.
| Criterion | Measure |  |  |  |
|---|---|---|---|---|
| AIC | # SF | 3.26 (±0.54) | 3.72 (±1.94) | 4.8 (±2.77) |
| | Bias | 0.19 (±0.09) | 0.36 (±0.58) | 1.02 (±1.2) |
| | MSE | 0.96 (±0.14) | 1.02 (±0.31) | 1.27 (±0.51) |
| | # true model | 78/100 | 73/100 | 59/100 |
| BIC | # SF | 3.0 (±0.0) | 3.0 (±0.38) | 2.89 (±0.80) |
| | Bias | 0.16 (±0.08) | 0.45 (±0.69) | 1.80 (±1.20) |
| | MSE | 0.97 (±0.15) | 1.29 (±0.81) | 2.48 (±1.17) |
| | # true model | 100/100 | 94/100 | 53/100 |
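The table shows BIC recovering the true model without the cost of cross validation. A minimal sketch of BIC-based selection of λ over a grid, using BIC = n·log(RSS/n) + df·log(n) with df the number of nonzero coefficients; the inner solver here is a compact iteratively reweighted ridge stand-in for L0 (the function name, grid, and tolerances are illustrative, not the paper's code):

```python
import numpy as np

def bic_path(X, y, lambdas, n_iter=50, eps=1e-10):
    """Pick lambda by BIC over a grid (sketch; see assumptions above).

    For each lambda, fit an approximate L0 solution with an iteratively
    reweighted ridge, then score it with
        BIC = n * log(RSS / n) + df * log(n),
    where df is the number of nonzero coefficients. Returns the best
    (bic, lambda, beta) triple.
    """
    n, m = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    best = (np.inf, None, None)
    for lam in lambdas:
        beta = Xty / n                                   # marginal start
        for _ in range(n_iter):
            beta = np.linalg.solve(XtX + np.diag(lam / (beta ** 2 + eps)), Xty)
        beta[np.abs(beta) < 1e-6] = 0.0                  # hard-zero numerical dust
        rss = np.sum((y - X @ beta) ** 2)
        df = np.count_nonzero(beta)
        bic = n * np.log(rss / n) + df * np.log(n)
        if bic < best[0]:
            best = (bic, lam, beta)
    return best
```

Because BIC only needs one fit per grid point, the whole path costs roughly one cross-validation fold, which is the efficiency argument made in the abstract.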
Performance measures for L0 regularized regression for graphical structure detection over 100 simulations; values in parentheses are standard deviations.
AIC:

| n | AUC (Band 1) | FDR (%) (Band 1) | FNR (%) (Band 1) | AUC (Band 2) | FDR (%) (Band 2) | FNR (%) (Band 2) |
|---|---|---|---|---|---|---|
| | .95 (±.01) | .29 (±.08) | 9.4 (±2.6) | .82 (±.01) | .10 (±.05) | 36.7 (±1.5) |
| 100 | .99 (±.005) | .20 (±.06) | 1.2 (±1.1) | .84 (±.01) | .11 (±.04) | 32.7 (±1.9) |
| 200 | .999 (±.0003) | .20 (±.05) | 0 (±0) | .93 (±.01) | .11 (±.04) | 14.2 (±2.4) |

BIC:

| n | AUC (Band 1) | FPR (%) (Band 1) | FNR (%) (Band 1) | AUC (Band 2) | FPR (%) (Band 2) | FNR (%) (Band 2) |
|---|---|---|---|---|---|---|
| | .90 (±.02) | .10 (±.05) | 20 (±3.6) | .803 (±.008) | .02 (±.02) | 39.3 (±1.5) |
| 100 | .991 (±.007) | .03 (±.03) | 1.8 (±1.3) | .83 (±.01) | .03 (±.02) | 34.9 (±1.6) |
| 200 | .9999 (±.0005) | .01 (±.01) | .01 (±.10) | .82 (±.01) | .03 (±.02) | 36.7 (±1.8) |

| n | AUC (Band 1) | FPR (%) (Band 1) | FNR (%) (Band 1) | AUC (Band 2) | FPR (%) (Band 2) | FNR (%) (Band 2) |
|---|---|---|---|---|---|---|
| | .91 (±.03) | 3.5 (±.05) | 11 (±3.6) | 0.77 (±.01) | 5.3 (±.07) | 40.9 (±.62) |
| 100 | .99 (±.003) | 1.52 (±.22) | .33 (±.67) | 0.78 (±.007) | 7.1 (±1.4) | 36.3 (±1.1) |
| 200 | .99 (±.003) | 1.21 (±.07) | .45 (±.53) | 0.79 (±.01) | 8.1 (±.57) | 34.0 (±1.4) |
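Graphical structure detection of this kind is commonly done by node-wise penalized regression: regress each variable on all the others, and connect two nodes when either regression retains the other variable (an OR rule). A sketch of that idea with an iteratively reweighted ridge stand-in for the L0 solver (the function name, OR rule, and parameter choices are illustrative assumptions, not necessarily the paper's exact procedure):

```python
import numpy as np

def neighborhood_select(Z, lam=1.0, n_iter=50, eps=1e-10):
    """Estimate an undirected graph by node-wise sparse regression (sketch).

    For each column j of the n x m data matrix Z, regress Z[:, j] on the
    remaining columns with an iteratively reweighted ridge (an approximate
    L0 penalty), and add edge j-k whenever either regression keeps the
    other variable (OR rule). Returns a boolean m x m adjacency matrix.
    """
    n, m = Z.shape
    adj = np.zeros((m, m), dtype=bool)
    for j in range(m):
        X = np.delete(Z, j, axis=1)          # all variables except j
        y = Z[:, j]
        XtX, Xty = X.T @ X, X.T @ y
        beta = Xty / n                       # marginal start
        for _ in range(n_iter):
            beta = np.linalg.solve(XtX + np.diag(lam / (beta ** 2 + eps)), Xty)
        beta[np.abs(beta) < 1e-6] = 0.0      # hard-zero numerical dust
        others = np.delete(np.arange(m), j)
        adj[j, others[beta != 0]] = True     # j's selected neighborhood
    return adj | adj.T                       # symmetrize with the OR rule
```

On band-structured (e.g. chain) Gaussian data, the true neighborhood of each node is its immediate band, so the recovered adjacency should be banded as in the table above.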
Figure 2: Subnetwork constructed with L0-penalized regression, multisource gene expression profiling, and BIC.
Figure 3: Known and predicted protein-protein interactions among the 22 genes in the subnetwork of Figure 2, where nodes represent proteins (genes) and edges indicate direct (physical) and indirect (functional) associations; thicker lines represent stronger associations.