
Feature Selection and Cancer Classification via Sparse Logistic Regression with the Hybrid L1/2 +2 Regularization.

Hai-Hui Huang¹, Xiao-Ying Liu¹, Yong Liang¹.

Abstract

Cancer classification and feature (gene) selection play an important role in knowledge discovery in genomic data. Although logistic regression is one of the most popular classification methods, it does not by itself perform feature selection. In this paper, we present a new hybrid L1/2 +2 regularization (HLR) function, a linear combination of the L1/2 and L2 penalties, to select the relevant genes in logistic regression. The HLR approach inherits some attractive characteristics from the L1/2 penalty (sparsity) and the L2 penalty (a grouping effect, whereby highly correlated variables are in or out of the model together). We also propose a novel univariate HLR thresholding approach to update the estimated coefficients and develop a coordinate descent algorithm for the HLR penalized logistic regression model. The empirical results and simulations indicate that the proposed method is highly competitive with several state-of-the-art methods.

Year: 2016 PMID: 27136190 PMCID: PMC4852916 DOI: 10.1371/journal.pone.0149675

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

1. Introduction

With advances in high-throughput molecular techniques, researchers can study the expression of tens of thousands of genes simultaneously. Cancer classification based on gene expression levels is one of the central problems in genome research. Logistic regression is a popular classification method with an explicit statistical interpretation: it yields class probabilities for the cancer phenotype. However, in most gene expression studies the number of genes far exceeds the number of samples. This is the high-dimension, low-sample-size problem, and ordinary logistic regression cannot be used directly to estimate the regression parameters. A popular way to deal with high dimensionality is regularization. A well-known regularization method is the L1 penalty [1], the least absolute shrinkage and selection operator (Lasso), which performs continuous shrinkage and gene selection at the same time. Other L1-type regularization methods include the smoothly clipped absolute deviation (SCAD) penalty [2], which is symmetric, nonconcave, and singular at the origin so as to produce sparse solutions, and the adaptive Lasso [3], which penalizes different coefficients with dynamic weights in the L1 penalty. However, L1-type regularization may yield inconsistent feature selection in some situations [3] and often introduces extra bias in the estimation of the logistic regression parameters [4]. Xu et al. [5] proposed the L1/2 penalty, which can be taken as a representative of the Lq (0 < q < 1) penalties and has attractive properties such as unbiasedness and oracle properties [5-7]. However, like most regularization methods, the L1/2 penalty ignores the correlation between features and is consequently unable to analyze data with dependent structures.
If there is a group of variables among which the pairwise correlations are very high, the L1/2 method tends to select only one variable to represent the group. In gene expression studies, genes are often highly correlated if they share the same biological pathway [8]. Some efforts have been made to deal with highly correlated variables. Zou and Hastie proposed the Elastic net penalty [9], a linear combination of the L1 and L2 (ridge) penalties, which emphasizes a grouping effect: strongly correlated genes tend to be in or out of the model together. Becker et al. [10] proposed the Elastic SCAD (SCAD − L2), a combination of the SCAD and L2 penalties; by introducing the L2 penalty term, Elastic SCAD also works for groups of predictors.

In this article, we propose the HLR (Hybrid L1/2 + 2 Regularization) approach to fit logistic regression models for gene selection, where the regularization is a linear combination of the L1/2 and L2 penalties. The L1/2 penalty achieves feature selection. In theory, a strictly convex penalty function provides a sufficient condition for the grouping effect of variables, and the L2 penalty guarantees strict convexity [11]. Therefore, the L2 penalty induces the grouping effect in the HLR approach. Experimental results on artificial and real gene expression data in this paper demonstrate that our proposed method is very promising.

The rest of the article is organized as follows. In Section 2, we define the HLR approach and present an efficient algorithm for solving the logistic regression model with the HLR penalty. In Section 3, we evaluate the performance of the proposed approach on simulated data and five public gene expression datasets. We conclude the paper in Section 4.

2. Methods

2.1 Regularization

Suppose that dataset D has n samples $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the ith sample with p dimensions and $y_i$ is the corresponding dependent variable. For any non-negative λ, the normal regularization form is:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \ell(\beta) + \lambda P(\beta) \right\} \tag{1}$$

where $\ell(\beta)$ denotes the loss that measures the fit to D and P(β) represents the regularization term. Many regularization methods have been proposed in recent years. One of the most popular is the L1 regularization (Lasso), where $P(\beta) = \sum_{j=1}^{p} |\beta_j|$. Other L1-type regularizations include SCAD, the adaptive Lasso, Elastic net, Stagewise Lasso [12], the Dantzig selector [13] and Elastic SCAD. However, in genomic research, the result of L1-type regularization may not be sparse enough for interpretation. A typical microarray or RNA-seq dataset has many thousands of predictors (genes), and researchers often wish to select fewer but informative genes. Besides, the L1 regularization is asymptotically biased [14,15]. Although the L0 regularization, where $P(\beta) = \sum_{j=1}^{p} |\beta_j|^0$ counts the non-zero coefficients, yields the sparsest solutions, it requires solving an NP-hard combinatorial optimization problem. To obtain a more concise solution and improve the predictive accuracy of the classification model, we need to think beyond the L1 and L0 regularizations to the Lq (0 < q < 1) regularization. The L1/2 regularization can be taken as a representative of the Lq (0 < q < 1) penalties and admits an analytically expressive thresholding representation [5]. With the thresholding representation, solving the L1/2 regularization is much easier than solving the L0 regularization. Moreover, the L1/2 penalty is unbiased and has oracle properties [5-7]. These characteristics make the L1/2 penalty an efficient tool for high-dimensional problems [16,17]. However, because it is insensitive to high correlation among features, the L1/2 penalty tends to select only one variable to represent a correlated group. This drawback may degrade the performance of the L1/2 method.

2.2 Hybrid L1/2 +2 Regularization (HLR)

For any fixed non-negative λ1 and λ2, we define the hybrid L1/2 +2 regularization (HLR) criterion:

$$L(\lambda_1, \lambda_2, \beta) = |y - X\beta|^2 + \lambda_1 |\beta|_{1/2} + \lambda_2 |\beta|^2 \tag{2}$$

where $\beta = (\beta_1, \ldots, \beta_p)$ are the coefficients to be estimated and

$$|\beta|_{1/2} = \sum_{j=1}^{p} |\beta_j|^{1/2}, \qquad |\beta|^2 = \sum_{j=1}^{p} \beta_j^2 .$$

The HLR estimator $\hat{\beta}$ is the minimizer of Eq (2):

$$\hat{\beta} = \arg\min_{\beta} \left\{ L(\lambda_1, \lambda_2, \beta) \right\} \tag{3}$$

Let α = λ1/(λ1 + λ2); then solving for $\hat{\beta}$ in Eq (3) is equivalent to the optimization problem:

$$\hat{\beta} = \arg\min_{\beta} |y - X\beta|^2, \quad \text{subject to } \alpha |\beta|_{1/2} + (1 - \alpha) |\beta|^2 \le t \text{ for some } t \tag{4}$$

We call the function $\alpha |\beta|_{1/2} + (1 - \alpha) |\beta|^2$ the HLR penalty, a combination of the L1/2 and L2 penalties. When α = 0, the HLR penalty becomes the ridge regularization. When α = 1, it becomes the L1/2 regularization. The L2 penalty provides the grouping effect and the L1/2 penalty induces sparse solutions. Combining the two penalties makes the HLR approach capable of dealing with correlated data while still producing a succinct result.

Fig 1 shows four regularization methods, the Lasso, L1/2, Elastic net and HLR penalties, with an orthogonal design matrix in the regression model. The estimators of the Lasso and Elastic net are biased, whereas the L1/2 penalty is asymptotically unbiased. Like the L1/2 method, the HLR approach also performs better than the Lasso and Elastic net in terms of unbiasedness.
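The two limiting cases of the HLR penalty can be checked numerically. The following is a minimal sketch, assuming that |β|1/2 denotes the sum of the square roots of the absolute coefficients and |β|2 the sum of squared coefficients; the function name `hlr_penalty` is hypothetical and is not taken from the authors' MATLAB code.

```python
import math


def hlr_penalty(beta, alpha):
    """HLR penalty: alpha * sum(|b|^(1/2)) + (1 - alpha) * sum(b^2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l_half = sum(math.sqrt(abs(b)) for b in beta)  # assumed meaning of |beta|_{1/2}
    l_two = sum(b * b for b in beta)               # assumed meaning of |beta|^2
    return alpha * l_half + (1.0 - alpha) * l_two


beta = [0.0, 0.25, -4.0]
print(hlr_penalty(beta, 0.0))  # alpha = 0: only the ridge (L2) part remains
print(hlr_penalty(beta, 1.0))  # alpha = 1: only the L1/2 part remains
print(hlr_penalty(beta, 0.8))  # a mixture of both parts
```

Zero coefficients contribute nothing to either part, which is why the penalty rewards sparse solutions while the squared term still penalizes large coefficients.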

Fig 1

Exact solutions of (a) Lasso, (b) L1/2, (c) Elastic net and (d) HLR in an orthogonal design. The regularization parameters are λ = 0.1 and α = 0.8 for Elastic net and HLR. (β-OLS is the ordinary least-squares (OLS) estimator.)

Fig 2 shows two-dimensional contour plots of the penalty functions of the Lasso, Elastic net, L1/2 and HLR approaches. The plots suggest that the L1/2 penalty is non-convex, whereas the HLR penalty is convex for the given α. The following theorem shows how the HLR strengthens the L1/2 regularization.

Fig 2

Contour plots (two-dimensional) for the regularization methods.

The regularization parameters are λ = 1 and α = 0.2 for the HLR method.

Theorem 1

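Given dataset (y, X) and (λ1, λ2), the HLR estimates $\hat{\beta}$ are given by

$$\hat{\beta} = \arg\min_{\beta}\ \beta^{T} \left( \frac{X^{T}X + \lambda_2 I}{1 + \lambda_2} \right) \beta - 2 y^{T} X \beta + \lambda_1 |\beta|_{1/2} \tag{5}$$

The L1/2 regularization can be rewritten as

$$\hat{\beta}(L_{1/2}) = \arg\min_{\beta}\ \beta^{T} \left( X^{T}X \right) \beta - 2 y^{T} X \beta + \lambda_1 |\beta|_{1/2} \tag{6}$$

The proof of Theorem 1 can be found in S1 File. Theorem 1 shows that the HLR approach is a stabilized version of the L1/2 regularization. Note that $\hat{\Sigma} = X^{T}X$ is a sample version of the correlation matrix Σ and

$$\frac{X^{T}X + \lambda_2 I}{1 + \lambda_2} = (1 - \delta) \hat{\Sigma} + \delta I,$$

where δ = λ2/(1 + λ2), shrinks $\hat{\Sigma}$ towards the identity matrix. The classification accuracy can often be enhanced by replacing $\hat{\Sigma}$ by a more shrunken estimate in linear discriminant analysis [18,19]. In other words, the HLR improves the L1/2 technique by regularizing $\hat{\Sigma}$ in Eq (6).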
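The shrinkage towards the identity matrix can be illustrated on a small correlation matrix. This is a sketch under the assumption that the shrunken matrix has the form (1 − δ)Σ̂ + δI with δ = λ2/(1 + λ2); the helper name `shrink_towards_identity` is hypothetical.

```python
def shrink_towards_identity(sigma, lam2):
    """Return (sigma + lam2 * I) / (1 + lam2) for a square matrix (nested lists)."""
    p = len(sigma)
    return [
        [(sigma[i][j] + (lam2 if i == j else 0.0)) / (1.0 + lam2) for j in range(p)]
        for i in range(p)
    ]


# Two standardized features with sample correlation 0.9.
sigma_hat = [[1.0, 0.9], [0.9, 1.0]]
for lam2 in (0.0, 1.0, 9.0):
    shrunk = shrink_towards_identity(sigma_hat, lam2)
    # The diagonal stays at 1; the off-diagonal is scaled by 1 / (1 + lam2).
    print(lam2, shrunk[0][0], shrunk[0][1])
```

With λ2 = 0 nothing changes, and as λ2 grows the off-diagonal correlation is pulled towards zero, which is the sense in which the L2 part stabilizes the estimate when features are highly correlated.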

2.3 The sparse logistic regression with the HLR method

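Suppose that dataset D has n samples $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the ith sample with p genes and $y_i$ is the corresponding dependent variable, which takes a binary value of 0 or 1. Define a classifier $f(x) = e^{x} / (1 + e^{x})$; the logistic regression is then defined as:

$$P(y_i = 1 \mid X_i) = f(X_i \beta) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} \tag{7}$$

where $\beta = (\beta_1, \ldots, \beta_p)$ are the coefficients to be estimated. With simple algebra, fitting the regression model can be presented as minimizing the negative log-likelihood:

$$\ell(\beta) = -\sum_{i=1}^{n} \left\{ y_i \log f(X_i \beta) + (1 - y_i) \log\left[ 1 - f(X_i \beta) \right] \right\} \tag{8}$$

In this paper, we apply the HLR approach to the logistic regression model. For any fixed non-negative λ and α, the sparse logistic regression model based on the HLR approach is defined as:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \ell(\beta) + \lambda \left[ \alpha |\beta|_{1/2} + (1 - \alpha) |\beta|^2 \right] \right\} \tag{9}$$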
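The penalized objective can be written down directly. The sketch below assumes that the loss is the summed (not averaged) negative log-likelihood of the logistic model and that the penalty is λ[α|β|1/2 + (1 − α)|β|2]; the scaling of the loss and the names `sigmoid` and `hlr_logistic_objective` are assumptions made for illustration.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function f(t) = e^t / (1 + e^t)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def hlr_logistic_objective(X, y, beta, lam, alpha):
    """Negative log-likelihood plus lam * [alpha*|b|_{1/2} + (1-alpha)*|b|^2]."""
    eps = 1e-12  # guards log(0) when a fitted probability saturates
    nll = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(sum(x * b for x, b in zip(row, beta)))
        nll -= yi * math.log(p + eps) + (1 - yi) * math.log(1.0 - p + eps)
    penalty = sum(alpha * math.sqrt(abs(b)) + (1.0 - alpha) * b * b for b in beta)
    return nll + lam * penalty


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
print(hlr_logistic_objective(X, y, [0.0, 0.0], lam=0.1, alpha=0.8))
print(hlr_logistic_objective(X, y, [1.0, -1.0], lam=0.1, alpha=0.8))
```

With all coefficients at zero every fitted probability is 0.5 and the penalty vanishes, so the objective reduces to n log 2; moving the coefficients in the right direction lowers the loss but pays a penalty.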

2.4 Solving algorithm for the sparse logistic regression with the HLR approach

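The coordinate descent algorithm [20] is an efficient method for solving regularization models because its computational time increases linearly with the dimension of the problem. Its standard procedure is as follows: for every $\beta_j$ (j = 1, 2, …, p), partially optimize the target function with respect to $\beta_j$ with the remaining elements of β fixed at their most recently updated values, and cycle iteratively through all coefficients until convergence. The specific form of the coefficient update is determined by the thresholding operator of the penalty.

Suppose that dataset D has n samples $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the ith sample with p dimensions and $y_i$ is the corresponding dependent variable. The variables are standardized. Following Friedman et al. [20] and Liang et al. [16], in this paper we present the original coordinate-wise update form for the HLR approach:

$$\beta_j \leftarrow \frac{\mathrm{Half}(\omega_j, \lambda\alpha)}{1 + \lambda(1 - \alpha)} \tag{10}$$

where $\omega_j = \sum_{i=1}^{n} x_{ij} \left( y_i - \tilde{y}_i^{(j)} \right)$ and $\tilde{y}_i^{(j)} = \sum_{k \neq j} x_{ik} \beta_k$, so that $y_i - \tilde{y}_i^{(j)}$ is the partial residual for fitting $\beta_j$. Half(·) is the L1/2 thresholding operator [5,16], evaluated with π = 3.14.

Eq (9) can be linearized by a one-term Taylor series expansion:

$$L(\lambda, \alpha, \beta) \approx \frac{1}{2n} \sum_{i=1}^{n} W_i \left( Z_i - X_i \beta \right)^2 + \lambda \sum_{j=1}^{p} \left( \alpha |\beta_j|^{1/2} + (1 - \alpha) \beta_j^2 \right) \tag{12}$$

where $Z_i = X_i \tilde{\beta} + \dfrac{y_i - f(X_i \tilde{\beta})}{f(X_i \tilde{\beta}) \left( 1 - f(X_i \tilde{\beta}) \right)}$ is the estimated response, $W_i = f(X_i \tilde{\beta}) \left( 1 - f(X_i \tilde{\beta}) \right)$ is the weight for the estimated response, and $f(X_i \tilde{\beta}) = \exp(X_i \tilde{\beta}) / \left( 1 + \exp(X_i \tilde{\beta}) \right)$ is the value evaluated under the current parameters. Thus, we can redefine the partial residual for fitting the current $\tilde{\beta}_j$ as $Z_i - \tilde{Z}_i^{(j)}$ with $\tilde{Z}_i^{(j)} = \sum_{k \neq j} x_{ik} \tilde{\beta}_k$, and $\omega_j = \sum_{i=1}^{n} W_i x_{ij} \left( Z_i - \tilde{Z}_i^{(j)} \right)$.

The procedure of the coordinate descent algorithm for the HLR penalized logistic model is as follows.

Step 1: Initialize all $\beta_j(m) \leftarrow 0$ (j = 1, 2, …, p) given X and y, and set m ← 0; λ and α are chosen by cross-validation;
Step 2: Calculate Z(m) and W(m) and approximate the loss function (12) based on the current β(m);
Step 3: Update each $\beta_j(m)$, cycling over j = 1, …, p;
Step 3.1: Compute $\tilde{Z}_i^{(j)}(m) \leftarrow \sum_{k \neq j} x_{ik} \beta_k(m)$ and $\omega_j(m) \leftarrow \sum_{i=1}^{n} W_i(m)\, x_{ij} \left( Z_i(m) - \tilde{Z}_i^{(j)}(m) \right)$;
Step 3.2: Update $\beta_j(m) \leftarrow \mathrm{Half}(\omega_j(m), \lambda\alpha) / \left( 1 + \lambda(1 - \alpha) \right)$;
Step 4: Let m ← m + 1 and β(m + 1) ← β(m); if β(m) has not converged, repeat Steps 2 and 3.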
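The building blocks of one coordinate update can be sketched as follows. This is not the authors' implementation: `half_threshold` uses the closed-form half-thresholding expression published for L1/2 regularization by Xu et al. [5], which minimizes (b − ω)² + λ√|b|, and the cutoff constant comes from that reference rather than from this text. `hlr_update` assumes the update is the half-thresholded value divided by 1 + λ(1 − α), by analogy with the elastic-net update of Friedman et al. [20]. `working_response` computes the usual reweighted least-squares response and weight, with a weight floor added here as a numerical safeguard. All three names are hypothetical, and `math.pi` is used instead of 3.14.

```python
import math


def half_threshold(omega, lam):
    """Closed-form minimizer of (b - omega)^2 + lam * sqrt(|b|) over b."""
    if lam <= 0.0:
        return omega
    cutoff = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(omega) <= cutoff:
        return 0.0  # small inputs are set exactly to zero (sparsity)
    phi = math.acos((lam / 8.0) * (abs(omega) / 3.0) ** -1.5)
    return (2.0 / 3.0) * omega * (1.0 + math.cos(2.0 * (math.pi - phi) / 3.0))


def hlr_update(omega, lam, alpha):
    """Assumed univariate HLR update: half-threshold, then ridge-style shrink."""
    return half_threshold(omega, lam * alpha) / (1.0 + lam * (1.0 - alpha))


def working_response(eta, y, min_weight=1e-5):
    """Estimated response z and weight w for one sample with linear predictor eta."""
    if eta >= 0:
        p = 1.0 / (1.0 + math.exp(-eta))
    else:
        p = math.exp(eta) / (1.0 + math.exp(eta))
    w = max(p * (1.0 - p), min_weight)  # floor avoids dividing by almost zero
    z = eta + (y - p) / w
    return z, w


print(half_threshold(0.5, 1.0))   # below the cutoff, so the coefficient is zeroed
print(half_threshold(2.0, 1.0))   # above the cutoff, shrunk towards zero
print(hlr_update(2.0, 1.0, 0.8))  # thresholding followed by the L2 shrink
print(working_response(0.0, 1))   # response and weight at beta = 0
```

A full solver would wrap these in the loop of Steps 1 to 4: recompute the responses and weights, sweep over the coordinates applying `hlr_update` to each weighted inner product, and stop when the coefficients no longer change.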

3. Results and Discussion

3.1 Analysis of simulated data

The goal of this section is to evaluate the performance of logistic regression with the HLR approach in a simulation study. Four approaches are compared with our proposed method: logistic regression with the Lasso, L1/2, SCAD − L2 and Elastic net regularizations. We simulate data from the true model […], where X ∼ N(0, 1), ϵ is the independent random error and σ is the parameter that controls the signal-to-noise ratio. Four scenarios are presented here. In every scenario, the dimension of the predictors is 1000. The notation ·/· denotes the numbers of observations in the training and test sets respectively, e.g. 100/100. The details of the four scenarios are as follows.

In scenario 1, the dataset consists of 100/100 observations, we set σ = 0.3 and […], and we simulated a grouped-variable situation […], where ρ is the correlation coefficient of the grouped variables.

Scenario 2 was defined similarly to scenario 1, except that other independent factors also contribute to the classification variable y […].

In scenario 3, we set σ = 0.4, the dataset consists of 200/200 observations and […], and we defined two groups of variables […].

In scenario 4, the true features made up 20% of the total features, σ = 0.4, the dataset consists of 400/400 observations and […], and we defined three groups of variables […]. In this scenario, there were three groups of correlated features and some single independent features. An ideal sparse regression method would select only the 200 true features and set the coefficients of the 800 noise features to zero.

In our experiments, the correlation coefficient ρ of the features was set to 0.3, 0.6 and 0.9 in turn. The Lasso and Elastic net were run with Glmnet (a Matlab package, version 2014-04-28, available at http://web.stanford.edu/~hastie/glmnet_matlab/).
The optimal regularization (tuning) parameters, which balance the tradeoff between data fit and model complexity, of the Lasso, L1/2, SCAD − L2, Elastic net and HLR approaches were tuned by 10-fold cross-validation (CV) in the training set. The Elastic net and HLR methods were tuned by 10-fold CV on two-dimensional parameter surfaces, and SCAD − L2 was tuned by 10-fold CV on a three-dimensional parameter surface. The different classifiers were then built by these sparse logistic regressions with the estimated tuning parameters. Finally, the resulting classifiers were applied to the test set for classification and prediction. We repeated the simulations 500 times for each penalty method and computed the mean classification accuracy on the test sets.

To evaluate the quality of the features selected by the regularization approaches, the sensitivity and specificity of the feature selection performance [21] were defined as follows:

$$\mathrm{TN} := |\bar{\beta} \mathbin{.*} \bar{\hat{\beta}}|_0, \quad \mathrm{FP} := |\bar{\beta} \mathbin{.*} \hat{\beta}|_0, \quad \mathrm{FN} := |\beta \mathbin{.*} \bar{\hat{\beta}}|_0, \quad \mathrm{TP} := |\beta \mathbin{.*} \hat{\beta}|_0,$$

$$\mathrm{Sensitivity} := \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Specificity} := \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$$

where .* is the element-wise product, $|\cdot|_0$ counts the number of non-zero elements in a vector, and $\bar{\beta}$ and $\bar{\hat{\beta}}$ are the logical "not" operators applied to the vectors β and $\hat{\beta}$.

As shown in Table 1, in all scenarios our proposed HLR procedure generally gave higher or comparable classification accuracy compared with the Lasso, SCAD − L2, Elastic net and L1/2 methods. The HLR approach also attained the highest (or tied-highest) sensitivity for identifying true features in every setting, and a much higher sensitivity than the Lasso and L1/2 methods when the correlation is strong. For example, in scenario 1 with ρ = 0.9, our proposed method achieved an accuracy of 99.87% with perfect sensitivity and specificity. The specificity of the HLR approach is somewhat lower than that of the Lasso and L1/2 methods in some settings, but this loss is small compared with the gain in sensitivity.
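The feature-selection sensitivity and specificity can be computed from the true and estimated coefficient vectors by comparing which entries are non-zero. The sketch below assumes the usual confusion-matrix reading of these two measures; the function name and the `tol` parameter (a threshold below which a coefficient counts as zero) are hypothetical.

```python
def selection_sensitivity_specificity(beta_true, beta_hat, tol=0.0):
    """Sensitivity and specificity of feature selection from two coefficient vectors."""
    true_nz = [abs(b) > tol for b in beta_true]  # truly relevant features
    est_nz = [abs(b) > tol for b in beta_hat]    # features kept by the model
    tp = sum(t and e for t, e in zip(true_nz, est_nz))
    fn = sum(t and not e for t, e in zip(true_nz, est_nz))
    fp = sum((not t) and e for t, e in zip(true_nz, est_nz))
    tn = sum((not t) and (not e) for t, e in zip(true_nz, est_nz))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


beta_true = [2.0, 0.0, -1.0, 0.0, 0.0]
beta_hat = [1.5, 0.0, 0.0, 0.3, 0.0]
print(selection_sensitivity_specificity(beta_true, beta_hat))
```

In this toy example one of the two true features is recovered and one of the three noise features is wrongly selected, so the sensitivity is 1/2 and the specificity is 2/3.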

Table 1

Mean results of the simulation.

In bold–the best performance amongst all the methods.

		Sensitivity of feature selection				Specificity of feature selection				Accuracy of classification (test set)			
ρ	Method	Scenario 1	Scenario 2	Scenario 3	Scenario 4	Scenario 1	Scenario 2	Scenario 3	Scenario 4	Scenario 1	Scenario 2	Scenario 3	Scenario 4
0.3	Lasso	0.966	0.798	0.344	0.361	0.996	0.968	0.967	0.966	89.26%	81.47%	84.76%	80.26%
0.3	L_1/2	0.971	0.888	0.411	0.355	0.998	0.974	0.975	0.970	92.05%	82.22%	85.11%	81.45%
0.3	SCAD − L₂	1.000	0.913	0.722	0.674	0.995	0.928	0.890	0.723	93.21%	82.90%	84.51%	82.51%
0.3	EN	0.997	0.916	0.737	0.662	0.994	0.926	0.886	0.735	91.03%	81.34%	84.47%	80.27%
0.3	HLR	1.000	0.924	0.791	0.708	0.999	0.931	0.892	0.769	95.27%	82.66%	84.99%	85.05%
0.6	Lasso	0.887	0.723	0.351	0.270	0.995	0.975	0.981	0.923	94.24%	84.10%	91.88%	85.88%
0.6	L_1/2	0.755	0.630	0.275	0.220	1.000	0.974	0.988	0.928	95.90%	86.50%	90.20%	84.20%
0.6	SCAD − L₂	1.000	0.866	0.800	0.629	1.000	0.949	0.929	0.849	96.33%	86.43%	89.20%	93.03%
0.6	EN	1.000	0.854	0.795	0.621	1.000	0.953	0.939	0.837	96.22%	86.41%	92.12%	91.01%
0.6	HLR	1.000	0.875	0.816	0.636	1.000	0.968	0.942	0.841	99.53%	87.16%	92.71%	92.82%
0.9	Lasso	0.548	0.548	0.174	0.145	0.938	0.972	0.987	0.934	96.05%	86.79%	93.22%	91.15%
0.9	L_1/2	0.337	0.495	0.159	0.139	0.999	0.977	0.991	0.944	97.89%	87.90%	93.70%	92.70%
0.9	SCAD − L₂	1.000	0.872	0.809	0.636	1.000	0.954	0.952	0.861	97.28%	88.60%	93.70%	93.19%
0.9	EN	1.000	0.856	0.818	0.622	0.995	0.951	0.949	0.875	98.22%	88.14%	93.52%	93.82%
0.9	HLR	1.000	0.897	0.824	0.645	1.000	0.966	0.956	0.880	99.87%	89.40%	94.76%	94.40%

Mean results are based on 500 repeats. The sensitivity and specificity both measure the quality of the selected features; the accuracy evaluates the classification performance of the different regularization approaches on the test sets.

3.2 Analysis of real data

To further evaluate the effectiveness of our proposed method, in this section we used several publicly available datasets: Prostate, DLBCL and Lung cancer. The prostate and DLBCL datasets were both downloaded from http://ico2s.org/datasets/microarray.html, and the lung cancer dataset can be downloaded from http://www.ncbi.nlm.nih.gov/geo with accession number [GSE40419]. More information on these datasets is given in Table 2.

Table 2

Real datasets used in this paper.

Dataset	No. of Samples (Total)	No. of Genes	Classes
Prostate	102	12600	Normal/Tumor
Lymphoma	77	7129	DLBCL/FL
Lung cancer	164	22401	Normal/Tumor

Prostate

This dataset was originally proposed by Singh et al. [22]; it contains the expression profiles of 12,600 genes for 50 normal tissues and 52 prostate tumor tissues.

Lymphoma

This dataset (Shipp et al. [23]) contains 77 microarray gene expression profiles of the two most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphomas (DLBCL) and 19 follicular lymphomas (FL). The original data contains 7,129 gene expression values.

Lung cancer

As the RNA-sequencing (RNA-seq) technique is now widely used, it is important to test whether the proposed method can handle RNA-seq data. To verify this, one dataset generated by next-generation sequencing was included in our analysis. This dataset [24] contains 164 samples: 87 lung adenocarcinomas and 77 adjacent normal tissues.

We evaluated the performance of the HLR penalized logistic regression models using random partitions: each dataset was divided at random so that approximately 75% of the samples became training samples and the other 25% test samples. The optimal tuning parameters were found by 10-fold cross-validation in the training set. The classification model was then built by sparse logistic regression with the estimated tuning parameters. Finally, applying the classifier to the test set provided the prediction characteristics, such as the classification accuracy and the area under the receiver operating characteristic (ROC) curve (AUC). The above procedure was repeated 500 times with different random partitions. The mean number of selected genes and the training and testing classification accuracies are summarized in Table 3, and the averaged AUC performances are shown in Fig 3.
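The evaluation protocol (a random 75%/25% partition, 10-fold cross-validation inside the training part, 500 repeats) can be sketched with index bookkeeping alone. The helper names below are hypothetical, and the `evaluate` callback stands in for the whole "tune by cross-validation, fit, score on the test part" step, which is not implemented here.

```python
import random


def random_partition(n_samples, train_fraction=0.75, rng=None):
    """Shuffle sample indices and split them into training and test indices."""
    rng = rng or random.Random()
    idx = list(range(n_samples))
    rng.shuffle(idx)
    cut = round(train_fraction * n_samples)
    return idx[:cut], idx[cut:]


def k_folds(indices, k=10):
    """Split indices into k validation folds whose sizes differ by at most one."""
    return [indices[i::k] for i in range(k)]


def repeated_holdout(n_samples, evaluate, repeats=500, seed=0):
    """Average evaluate(train_idx, test_idx) over repeated random partitions."""
    rng = random.Random(seed)
    scores = [evaluate(*random_partition(n_samples, 0.75, rng)) for _ in range(repeats)]
    return sum(scores) / len(scores)


train, test = random_partition(164, 0.75, random.Random(1))
print(len(train), len(test))                   # sizes for a 164-sample dataset
print([len(f) for f in k_folds(train, k=10)])  # fold sizes inside the training part
```

Fixing the seed makes the sequence of partitions reproducible, so different penalties can be compared on exactly the same splits.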

Table 3

Mean results of empirical datasets.

In bold–the best performance.

Dataset	Method	Training accuracy (10-CV)	Accuracy (testing)	No. of selected genes
Prostate	Lasso	96.22%	92.40%	13.7
Prostate	L_1/2	96.13%	92.18%	8.2
Prostate	SCAD − L₂	95.99%	91.33%	22
Prostate	ElasticNet	96.28%	91.35%	15.2
Prostate	HLR	97.61%	93.68%	12.6
Lymphoma	Lasso	96.03%	91.11%	13.2
Lymphoma	L_1/2	95.15%	91.20%	10.7
Lymphoma	SCAD − L₂	95.78%	92.99%	20.9
Lymphoma	ElasticNet	96.01%	92.17%	21.2
Lymphoma	HLR	96.55%	94.23%	15.1
Lung cancer	Lasso	96.32%	96.99%	13.8
Lung cancer	L_1/2	97.17%	97.20%	11.5
Lung cancer	SCAD − L₂	97.95%	98.17%	25.1
Lung cancer	ElasticNet	97.21%	98.38%	28.9
Lung cancer	HLR	98.59%	98.35%	15.6

Mean results are based on 500 repeats.

Fig 3

AUC performance from the ROC analyses of each method on the prostate, lymphoma and lung cancer datasets.

As shown in Table 3, for the prostate dataset the classifier with the HLR approach gives an average 10-fold CV accuracy of 97.61% and an average test accuracy of 93.68% with about 12.6 genes selected. The classifiers with the Lasso, L1/2, SCAD − L2 and Elastic net methods give average 10-fold CV accuracies of 96.22%, 96.13%, 95.99% and 96.28% and average test accuracies of 92.40%, 92.18%, 91.33% and 91.35%, with 13.7, 8.2, 22 and 15.2 genes selected respectively. For the lymphoma dataset, the HLR method also achieves the best classification performance, with the highest accuracy in both the training and test sets. For lung cancer, our method achieved the best training accuracy, while the testing performance of Elastic net was slightly better than that of our method. However, the HLR method achieved its result using only about 15.6 predictors (genes), compared with 28.9 genes for the Elastic net method. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. Although the Lasso and L1/2 methods gave the sparsest solutions, their classification performance was worse than that of the HLR method. As shown in Fig 3, our proposed method achieved the best classification performance on these three real datasets among all the competitors. For example, the AUC from the ROC analysis of the HLR method for the prostate, lymphoma and lung cancer datasets was estimated to be 0.9353, 0.9347 and 0.9932 respectively. The AUC results of the Lasso method for the three datasets were 0.9327, 0.9253 and 0.9813 respectively, which are worse than those of the proposed HLR method.
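The AUC values quoted above summarize how well the predicted probabilities rank tumor samples above normal ones. As a minimal sketch of that quantity (not the routine used to produce Fig 3), the AUC can be computed as the fraction of positive/negative pairs that are ranked correctly, with ties counted as one half; the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count correctly ordered pairs; a tie contributes half a pair.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of samples, which is acceptable for test sets of a few dozen samples such as those used here.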
We summarize the 10 most frequently selected genes for each of the five regularization methods on the lung cancer gene expression dataset in Table 4; the top 10 ranked genes for the other datasets can be found in S2 File. Note that in Table 1 the proposed HLR method performed well at selecting the true features in the simulated data. This suggests that the genes selected by the HLR method in these three cancer datasets are valuable to researchers who want to identify the key factors associated with cancer development. For example, in Table 4 the biomarkers selected by our HLR method include the advanced glycosylation end product receptor (AGER), a member of the immunoglobulin superfamily that is predominantly expressed in the lung. AGER plays a role in epithelial organization, and decreased expression of AGER in lung tumors may contribute to loss of epithelial tissue structure, potentially leading to malignant transformation [25]. The distinctive function of AGER in the lung means that it could be used as an additional diagnostic tool for lung cancer [26], and even as a target [27]. GATA2 (GATA binding protein 2) is expressed principally in hematopoietic lineages and has essential roles in the development of multiple hematopoietic cells, including erythrocytes and megakaryocytes. It is crucial for the proliferation and maintenance of hematopoietic stem cells and multi-potential progenitors [28]. Kumar et al. [29] showed a strong relationship between GATA2 and RAS-pathway mutant lung tumor cells.

Table 4

The most frequently selected 10 genes found by the five sparse logistic regression methods from the lung cancer dataset.

Rank	Lasso	L_1/2	SCAD − L₂	ElasticNet	HLR
1	STX11	A2M	ABCA8	CCDC69	ACADL
2	GABARAPL1	ACADL	ADH1B	STX11	CCDC69
3	PDLIM2	PNLIP	CAT	GABARAPL1	STX11
4	CAV1	AAAS	CAV1	TNXB	ABCA8
5	ABCA8	A4GALT	CCDC69	PDLIM2	PAEP
6	GPM6A	ABHD8	GABARAPL1	FAM13C	AGER
7	GRK5	ADD2	GPM6A	GPM6A	GATA2
8	TNXB	SLN	GRK5	SFTPC	PNLIP
9	ADH1B	ACTL7B	PDLIM2	ARHGAP44	A2M
10	PTRF	ADAR	PTRF	CAT	ACAN

To further verify the biomarkers selected by our method, we collected two independent lung cancer datasets for validation. GSE19804 [30] contains 120 samples: 60 lung adenocarcinomas and 60 adjacent normal tissues. GSE32863 [31] contains 116 samples: 58 lung adenocarcinomas and 58 healthy controls. These two datasets are available from GEO under the series accession numbers [GSE19804] and [GSE32863]. We used the support vector machine (SVM) approach to build classifiers based on the first two, first five and first ten genes selected by the different regularization approaches from the lung cancer dataset (Table 4); the classifiers were trained on the lung cancer dataset (Table 2) and then applied to the two independent lung cancer datasets, GSE19804 and GSE32863. It is known that prediction models may be applicable only to samples from the same platform, cell type, environmental conditions and experimental procedure. Interestingly, however, Table 5 shows that all the classification accuracies obtained by the classifiers with the genes selected by the HLR approach are higher than 90%. In particular, the classification accuracy on the GSE32863 dataset is 97.41% with the classifier based on the first ten genes. In five of the six comparisons the HLR-selected genes give the best accuracy among all methods; the exception is GSE19804 with the first five genes, where the Lasso reaches 93.33% against 92.50% for the HLR. For example, the accuracy of the classifier with the first two genes selected by Elastic net on GSE19804 was 86.67%, which is worse than the 90.83% obtained with the genes selected by our method. The accuracy of the classifier with the first five genes selected by SCAD − L2 on GSE32863 was 92.24%, which is worse than the 96.55% obtained with the genes selected by our HLR method. The results indicate that sparse logistic regression with the HLR approach can select powerful discriminatory genes.

Table 5

The validation results of the classifiers based on the top-ranked genes selected from the lung cancer dataset.

In bold–the best performance.

Dataset	Method	SVM with the top 2 genes	SVM with the top 5 genes	SVM with the top 10 genes
GSE19804	Lasso	89.17%	93.33%	92.50%
GSE19804	L_1/2	85.83%	90.83%	91.67%
GSE19804	SCAD − L₂	89.17%	89.17%	93.33%
GSE19804	ElasticNet	86.67%	87.50%	89.17%
GSE19804	HLR	90.83%	92.50%	94.17%
GSE32863	Lasso	93.10%	95.69%	93.97%
GSE32863	L_1/2	93.97%	94.83%	95.69%
GSE32863	SCAD − L₂	90.28%	92.24%	94.83%
GSE32863	ElasticNet	89.66%	91.38%	93.97%
GSE32863	HLR	94.83%	96.55%	97.41%

We used the SVM approach to build classifiers based on the first two, first five and first ten genes selected by the different regularization approaches from the lung cancer dataset (Table 4); the classifiers were trained on the lung cancer dataset (Table 2). These classifiers were then applied to the two independent lung cancer datasets, GSE19804 and GSE32863.

In addition to the comparison with the Lasso, L1/2, SCAD − L2 and Elastic net techniques, we also compared our results with those of other methods published in the literature for the prostate and lymphoma datasets. Note that we only considered methods that used the CV approach for evaluation, since approaches based on a mere training/test set partition are now widely known to be unreliable [32]. Table 6 displays the best classification accuracy of the other methods. In Table 6, the classification accuracy achieved by the HLR approach is greater than that of the other methods. Meanwhile, the number of genes it selects is smaller than for the other methods, except on the lymphoma dataset.

Table 6

Results reported in the literature.

In bold–the best performance.

Dataset	Author	Accuracy (CV)	No. of selected features
Prostate	T.K. Paul et al. [33]	96.60%	48.5
Prostate	Wessels et al. [34]	93.40%	14
Prostate	Shen et al. [35]	94.60%	unknown
Prostate	Lecocke et al. [36]	90.10%	unknown
Prostate	Dagliyan et al. [37]	94.80%	unknown
Prostate	Glaab et al. [38]	94.00%	30
Prostate	HLR	97.61%	12.6
Lymphoma	Wessels et al. [34]	95.70%	80
Lymphoma	Liu et al. [39]	93.50%	6
Lymphoma	Shipp et al. [23]	92.20%	30
Lymphoma	Goh et al. [40]	91.00%	10
Lymphoma	Lecocke et al. [36]	90.20%	unknown
Lymphoma	Hu et al. [41]	87.01%	unknown
Lymphoma	Dagliyan et al. [37]	92.25%	unknown
Lymphoma	Glaab et al. [38]	95.00%	30
Lymphoma	HLR	96.55%	15.1

4. Conclusion

In this paper, we have proposed the HLR function, a new shrinkage and selection method. The HLR approach inherits some valuable characteristics from the L1/2 penalty (sparsity) and the L2 penalty (a grouping effect, whereby highly correlated variables are in or out of the model together). We also proposed a novel univariate HLR thresholding function to update the estimated coefficients and developed a coordinate descent algorithm for the HLR penalized logistic regression model. The empirical results and simulations show that the HLR method is highly competitive with the Lasso, L1/2, SCAD − L2 and Elastic net in analyzing high-dimensional, low-sample-size data (microarray and RNA-seq data). Thus, logistic regression with the HLR approach is a promising tool for feature selection in classification problems. The source code of sparse logistic regression with the HLR approach is provided in S3 File.

S1 File. The proof of Theorem 1. (PDF)

S2 File. The most frequently selected 10 genes information. Top-10 ranked genes selected by all the methods for prostate and lymphoma datasets. (PDF)

S3 File. Source code of the HLR method. MATLAB code of sparse logistic regression with the HLR approach. (RAR)
