Hai-Hui Huang, Xiao-Ying Liu, Yong Liang.
Abstract
Cancer classification and feature (gene) selection play an important role in knowledge discovery from genomic data. Although logistic regression is one of the most popular classification methods, it does not induce feature selection. In this paper, we present a new hybrid L1/2+2 regularization (HLR) function, a linear combination of the L1/2 and L2 penalties, to select relevant genes in logistic regression. The HLR approach inherits attractive characteristics from both the L1/2 penalty (sparsity) and the L2 penalty (a grouping effect, where highly correlated variables enter or leave the model together). We also propose a novel univariate HLR thresholding approach for updating the estimated coefficients and develop a coordinate descent algorithm for the HLR-penalized logistic regression model. Empirical results and simulations indicate that the proposed method is highly competitive with several state-of-the-art methods.
Year: 2016 PMID: 27136190 PMCID: PMC4852916 DOI: 10.1371/journal.pone.0149675
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
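A minimal sketch may help make the penalized objective concrete. The parametrization below (a single λ with a mixing weight α, mirroring the Elastic net) is an assumption for illustration; the paper's exact weighting of the two penalty terms may differ.

```python
import numpy as np

def hlr_penalty(beta, lam, alpha):
    """Hybrid L1/2 + L2 (HLR) penalty: a linear combination of the
    L1/2 term (sparsity) and the L2 term (grouping effect)."""
    l_half = np.sum(np.abs(beta) ** 0.5)   # L1/2 component
    l2 = np.sum(beta ** 2)                 # ridge component
    return lam * (alpha * l_half + (1.0 - alpha) * l2)

def penalized_nll(beta, X, y, lam, alpha):
    """Average negative log-likelihood of logistic regression plus the
    HLR penalty. Labels y are coded 0/1; no intercept, for brevity."""
    z = X @ beta
    # per-sample loss log(1 + exp(z)) - y*z, computed stably
    nll = np.sum(np.logaddexp(0.0, z) - y * z)
    return nll / len(y) + hlr_penalty(beta, lam, alpha)
```

At α = 1 the penalty reduces to the pure L1/2 regularization, and at α = 0 to the ridge penalty, so the mixing weight interpolates between sparsity and grouping.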
unbiasedness, and oracle properties [5-7]. However, like most regularization methods, the L1/2 penalty ignores the correlation between features and is consequently unable to analyze data with dependent structures. If there is a group of variables among which the pair-wise correlations are very high, the L1/2 method tends to select only one variable to represent the corresponding group. In gene expression studies, genes are often highly correlated when they share the same biological pathway [8]. Some efforts have been made to deal with the problem of highly correlated variables. Zou and Hastie proposed the Elastic net penalty [9], a linear combination of the L1 and L2 (ridge) penalties; this method exhibits a grouping effect, where strongly correlated genes tend to be in or out of the model together. Becker et al. [10] proposed the Elastic SCAD (SCAD − L2), a combination of the SCAD and L2 penalties. By introducing the L2 penalty term, Elastic SCAD also works for groups of predictors.
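The grouping behavior of the Elastic net comes from its coordinate-wise update, which combines the Lasso's soft threshold with ridge-style shrinkage. A sketch of the standard update (shown for a standardized feature in the squared-error case; the logistic case wraps this in quadratic approximations):

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) used by the Lasso."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def enet_update(z, lam, alpha):
    """Coordinate-wise Elastic net update for a standardized feature:
    the Lasso soft threshold, followed by ridge-style shrinkage of the
    surviving coefficient. alpha=1 gives the Lasso update, alpha=0 the
    ridge update; the extra denominator is what pulls the coefficients
    of highly correlated features toward each other."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```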
As a representative of the Lq (0 < q < 1) penalties, the L1/2 regularization can be solved through its thresholding representation [15]. With the thresholding representation, solving the L1/2 regularization is much easier than solving the L0 regularization. Moreover, the L1/2 penalty enjoys unbiasedness and oracle properties [5-7]. These characteristics make the L1/2 penalty an efficient tool for high-dimensional problems [16,17]. However, because it is insensitive to highly correlated data, the L1/2 penalty tends to select only one variable to represent a correlated group. This drawback may deteriorate the performance of the L1/2 method.
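The thresholding representation can be written down explicitly. The sketch below follows the half-thresholding operator derived by Xu et al. for the L1/2 penalty; the constants come from that derivation and should be checked against the paper's own formulation:

```python
import numpy as np

def half_threshold(z, lam):
    """Half-thresholding operator for the L1/2 penalty (Xu et al.):
    coefficients whose magnitude falls below the threshold are set
    exactly to zero (sparsity), while larger coefficients are shrunk
    far less than under the Lasso's soft thresholding (less bias)."""
    t = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)  # threshold point
    if abs(z) <= t:
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(z) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * z * (1.0 + np.cos(2.0 * np.pi / 3.0 - (2.0 / 3.0) * phi))
```

Unlike soft thresholding, the operator jumps from zero to a nonzero value at the threshold, which is what gives the L1/2 penalty its stronger sparsity and lower bias on large coefficients.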
Fig 1. Exact solutions of the Lasso, L1/2, Elastic net, and HLR thresholding operators. The regularization parameters are λ = 0.1 and α = 0.8 for Elastic net and HLR (β-OLS is the ordinary least-squares (OLS) estimator).
Fig 2. Contour plots (two-dimensional) for the regularization methods. The regularization parameters are λ = 1 and α = 0.2 for the HLR method.
Mean results of the simulation.
In bold–the best performance amongst all the methods.
| r | Method | Sensitivity of feature selection | | | | Specificity of feature selection | | | | Accuracy of classification (test set) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Scenario | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 |
| 0.3 | Lasso | 0.966 | 0.798 | 0.344 | 0.361 | 0.996 | 0.968 | 0.967 | 0.966 | 89.26% | 81.47% | 84.76% | 80.26% |
| | L1/2 | 0.971 | 0.888 | 0.411 | 0.355 | 0.998 | | | | 92.05% | 82.22% | 81.45% | |
| | SCAD − L2 | 1.000 | 0.913 | 0.722 | 0.674 | 0.995 | 0.928 | 0.890 | 0.723 | 93.21% | 84.51% | 82.51% | |
| | ElasticNet | 0.997 | 0.916 | 0.737 | 0.662 | 0.994 | 0.926 | 0.886 | 0.735 | 91.03% | 81.34% | 84.47% | 80.27% |
| | HLR | 1.000 | 0.931 | 0.892 | 0.769 | | | | | 82.66% | 84.99% | 85.05% | |
| 0.6 | Lasso | 0.887 | 0.723 | 0.351 | 0.270 | 0.995 | 0.981 | 0.923 | | 94.24% | 84.10% | 91.88% | 85.88% |
| | L1/2 | 0.755 | 0.630 | 0.275 | 0.220 | 1.000 | 0.974 | | | 95.90% | 86.50% | 90.20% | 84.20% |
| | SCAD − L2 | 1.000 | 0.866 | 0.800 | 0.629 | 1.000 | 0.949 | 0.929 | 0.849 | 96.33% | 86.43% | 89.20% | |
| | ElasticNet | 1.000 | 0.854 | 0.795 | 0.621 | 1.000 | 0.953 | 0.939 | 0.837 | 96.22% | 86.41% | 92.12% | 91.01% |
| | HLR | 1.000 | 1.000 | 0.968 | 0.942 | 0.841 | | | | 92.82% | | | |
| 0.9 | Lasso | 0.548 | 0.548 | 0.174 | 0.145 | 0.938 | 0.972 | 0.987 | 0.934 | 96.05% | 86.79% | 93.22% | 91.15% |
| | L1/2 | 0.337 | 0.495 | 0.159 | 0.139 | 0.999 | | | | 97.89% | 87.90% | 93.70% | 92.70% |
| | SCAD − L2 | 1.000 | 0.872 | 0.809 | 0.636 | 1.000 | 0.954 | 0.952 | 0.861 | 97.28% | 88.60% | 93.70% | 93.19% |
| | ElasticNet | 1.000 | 0.856 | 0.818 | 0.622 | 0.995 | 0.951 | 0.949 | 0.875 | 98.22% | 88.14% | 93.52% | 93.82% |
| | HLR | 1.000 | 1.000 | 0.966 | 0.956 | 0.880 | | | | | | | |
Mean results are based on 500 repeats. Sensitivity and specificity both measure the quality of the selected features; accuracy evaluates the classification performance of the different regularization approaches on the test sets.
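For reference, the two selection metrics in this footnote can be computed directly from the true and selected feature supports. A minimal sketch (the set-based interface is illustrative):

```python
def selection_metrics(true_support, selected, n_features):
    """Sensitivity and specificity of feature selection.
    Sensitivity: fraction of truly relevant features that were selected.
    Specificity: fraction of truly irrelevant features correctly left out."""
    true_support, selected = set(true_support), set(selected)
    tp = len(true_support & selected)                 # relevant and selected
    tn = n_features - len(true_support | selected)    # irrelevant and excluded
    sensitivity = tp / len(true_support)
    specificity = tn / (n_features - len(true_support))
    return sensitivity, specificity
```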
Real datasets used in this paper.
| Dataset | No. of Samples (Total) | No. of Genes | Classes |
|---|---|---|---|
| Prostate | 102 | 12600 | Normal/Tumor |
| Lymphoma | 77 | 7129 | DLBCL/FL |
| Lung cancer | 164 | 22401 | Normal/Tumor |
Mean results on the empirical datasets.
In bold–the best performance.
| Dataset | Method | Training accuracy (10-CV) | Accuracy (testing) | No. of selected genes |
|---|---|---|---|---|
| Prostate | Lasso | 96.22% | 92.40% | 13.7 |
| | L1/2 | 96.13% | 92.18% | 8.2 |
| | SCAD − L2 | 95.99% | 91.33% | 22 |
| | ElasticNet | 96.28% | 91.35% | 15.2 |
| | HLR | | | 12.6 |
| Lymphoma | Lasso | 96.03% | 91.11% | 13.2 |
| | L1/2 | 95.15% | 91.20% | 10.7 |
| | SCAD − L2 | 95.78% | 92.99% | 20.9 |
| | ElasticNet | 96.01% | 92.17% | 21.2 |
| | HLR | | | 15.1 |
| Lung cancer | Lasso | 96.32% | 96.99% | 13.8 |
| | L1/2 | 97.17% | 97.20% | 11.5 |
| | SCAD − L2 | 97.95% | 98.17% | 25.1 |
| | ElasticNet | 97.21% | | 28.9 |
| | HLR | 98.35% | | 15.6 |
Mean results are based on 500 repeats.
Fig 3. AUC performance from ROC analyses of each method on the prostate, lymphoma, and lung cancer datasets.
The 10 genes most frequently selected by the five sparse logistic regression methods from the lung cancer dataset.
| Rank | Lasso | L1/2 | SCAD − L2 | ElasticNet | HLR |
|---|---|---|---|---|---|
| 1 | STX11 | A2M | ABCA8 | CCDC69 | ACADL |
| 2 | GABARAPL1 | ACADL | ADH1B | STX11 | CCDC69 |
| 3 | PDLIM2 | PNLIP | CAT | GABARAPL1 | STX11 |
| 4 | CAV1 | AAAS | CAV1 | TNXB | ABCA8 |
| 5 | ABCA8 | A4GALT | CCDC69 | PDLIM2 | PAEP |
| 6 | GPM6A | ABHD8 | GABARAPL1 | FAM13C | AGER |
| 7 | GRK5 | ADD2 | GPM6A | GPM6A | GATA2 |
| 8 | TNXB | SLN | GRK5 | SFTPC | PNLIP |
| 9 | ADH1B | ACTL7B | PDLIM2 | ARHGAP44 | A2M |
| 10 | PTRF | ADAR | PTRF | CAT | ACAN |
The validation results of the classifiers based on the top rank selected genes from lung cancer dataset.
In bold–the best performance.
| Dataset | Method | SVM with top 2 genes | SVM with top 5 genes | SVM with top 10 genes |
|---|---|---|---|---|
| GSE19804 | Lasso | 89.17% | 92.50% | |
| | L1/2 | 85.83% | 90.83% | 91.67% |
| | SCAD − L2 | 89.17% | 89.17% | 93.33% |
| | ElasticNet | 86.67% | 87.50% | 89.17% |
| | HLR | 92.50% | | |
| GSE32863 | Lasso | 93.10% | 95.69% | 93.97% |
| | L1/2 | 93.97% | 94.83% | 95.69% |
| | SCAD − L2 | 90.28% | 92.24% | 94.83% |
| | ElasticNet | 89.66% | 91.38% | 93.97% |
| | HLR | | | |
We used the SVM approach to build classifiers based on the first two, first five, and first ten genes selected by the different regularization approaches from the lung cancer dataset (Table 4); each classifier was trained on the lung cancer dataset (Table 2). These classifiers were then applied to two independent lung cancer datasets, GSE19804 and GSE32863.
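This validation protocol can be sketched with scikit-learn. The arrays, labels, and gene indices below are synthetic stand-ins; the kernel and SVM settings used in the paper are not stated here and are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for expression matrices: rows = samples, cols = genes.
X_train = rng.normal(size=(100, 50))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # labels driven by two "genes"
X_valid = rng.normal(size=(60, 50))                        # independent validation cohort
y_valid = (X_valid[:, 0] + X_valid[:, 1] > 0).astype(int)

top_genes = [0, 1]  # e.g. the first two genes ranked by a regularization method

# Train the SVM on the training cohort using only the top-ranked genes ...
clf = SVC(kernel="linear").fit(X_train[:, top_genes], y_train)
# ... then score it on the independent cohort, as done for GSE19804/GSE32863.
accuracy = clf.score(X_valid[:, top_genes], y_valid)
```

Repeating this with the top 2, 5, and 10 genes from each method reproduces the structure of Table 5.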
Results reported in the literature.
In bold–the best performance.
| Dataset | Author | Accuracy (CV) | No. of selected features |
|---|---|---|---|
| Prostate | T.K. Paul et al. [ | 96.60% | 48.5 |
| | Wessels et al. [ | 93.40% | 14 |
| | Shen et al. [ | 94.60% | unknown |
| | Lecocke et al. [ | 90.10% | unknown |
| | Dagliyan et al. [ | 94.80% | unknown |
| | Glaab et al. [ | 94.00% | 30 |
| | HLR | | 12.6 |
| Lymphoma | Wessels et al. [ | 95.70% | 80 |
| | Liu et al. [ | 93.50% | 6 |
| | Shipp et al. [ | 92.20% | 30 |
| | Goh et al. [ | 91.00% | 10 |
| | Lecocke et al. [ | 90.20% | unknown |
| | Hu et al. [ | 87.01% | unknown |
| | Dagliyan et al. [ | 92.25% | unknown |
| | Glaab et al. [ | 95.00% | 30 |
| | HLR | | 15.1 |