Literature DB >> 28122498

Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer.

Jing Zhai¹, Chiu-Hsieh Hsu¹, Z John Daye².

Abstract

BACKGROUND: Many questions in statistical genomics can be formulated in terms of variable selection of candidate biological factors for modeling a trait or quantity of interest. Often, in these applications, additional covariates describing clinical, demographical or experimental effects must be included a priori as mandatory covariates while allowing the selection of a large number of candidate or optional variables. As genomic studies routinely require mandatory covariates, it is of interest to propose principled methods of variable selection that can incorporate mandatory covariates.
METHODS: In this article, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates coefficients of mandatory covariates while allowing selection for others. The ridle provides a principled approach to mitigate effects of multicollinearity among the mandatory covariates and possible dependency between mandatory and optional variables. We provide detailed empirical and theoretical studies to evaluate our method. In addition, we develop an efficient algorithm for the ridle. Software, based on efficient Fortran code with R-language wrappers, is publicly and freely available at https://sites.google.com/site/zhongyindaye/software .
RESULTS: The ridle is useful when mandatory predictors are known to be significant due to prior knowledge or must be kept for additional analysis. Both theoretical and comprehensive simulation studies have shown that the ridle to be advantageous when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves. A microarray gene expression analysis of the histologic grades of breast cancer has identified 24 genes, in which 2 genes are selected only by the ridle among current methods and found to be associated with tumor grade.
CONCLUSIONS: In this article, we proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Results suggest that the ridle is advantageous when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves.

Entities: Chemical Disease Gene Species

Keywords: Gene expression analysis; Lasso; Linear models; Penalized regression; Ridge; Variable selection

Mesh：

Substances：
Biomarkers, Tumor

Year: 2017 PMID： 28122498 PMCID： PMC5267467 DOI： 10.1186/s12874-017-0291-y

Source DB: PubMed Journal: BMC Med Res Methodol ISSN： 1471-2288 Impact factor: 4.615

Background

Many essential problems in statistical genomics may be formulated in terms of variable selection of candidate biological factors for modeling of some trait or quantity of interest [1-3]. Often, additional covariates describing clinical, demographical, or other experimental factors must be included a priori as mandatory covariates while allowing the selection of possibly a large number of candidate or optional variables. Substantial progress has been made recently in the analysis of high-dimensional data with sparse regression methods. The lasso was proposed that induces sparsity using an L1-norm penalty on all coefficients [4]. With the introduction of computationally efficient algorithms [5, 6], the lasso has since become a widely-applied variable selection method. Other methods for sparse regression include the smoothly clipped absolute deviation (SCAD) [7], adaptive lasso [8], Dantzig selector [9], etc. However, these methods were not designed for applications with mandatory covariates. An ad hoc approach is often employed where the response is regressed on mandatory covariates without penalization, as if in an ordinary least squares (OLS), while penalized regression is applied upon the optional variables, independently of the mandatory ones, to achieve variable selection. However, standard statistical principle advocates the consideration of all covariates simultaneously in order to account for complex dependencies among covariates. By penalizing coefficients disparately on some of the variables while not on others, this approach can yield both poor prediction accuracy and unreliable selection of optional variables. As mandatory covariates are routinely encountered in genomic-data analysis, it is of interest to develop a principled approach towards sparse regression with mandatory covariates. In this article, we consider the problem of efficient estimation of coefficients of mandatory covariates and simultaneous variable selection of optional variables. Cancer arises as a disorder of the cell life cycle that leads to excessive cell proliferation and poor differentiation. Pathologists often use grading systems to measure the degree of cell differentiation in tumors [10, 11]. Tumor grade is one of the most important indicators for clinicians to guide treatment options and make prognosis for patients [12]. Histologic grade of breast cancer is representative of its aggressive potential [13]. Cancer cells with higher grades tend to be more aggressive and require quite different treatment strategies than those with lower grades. Due to the importance of tumor grade as an essential measure in clinical prognosis, treatment and of the survival of breast cancer patients, understanding genetic factors that may be predictive of tumor grade has become a desideratum of current research in breast cancers. In this article, we will propose a principled method to identify genes which may affect tumor grade while accounting for their clinical phenotypes such as age at diagnosis, p53 sequence mutation status, etc. by incorporating them as mandatory covariates. We propose the ridge-lasso hybrid estimator (ridle), a novel penalized regression procedure that can simultaneously estimate coefficients of mandatory covariates while allowing selection for others. The ridle employs the L2-norm penalty to estimate mandatory coefficients and the L1-norm penalty to perform variable selection on the optional set. The L2-norm penalty has been successfully employed in ridge regression to efficiently estimate coefficients under a spectrum of dependency structures [14-16]. In this article, we provide theoretical, simulation, and real-data analysis to suggest the ridle as an efficient method for sparse regression with mandatory covariates. In particular, we will show that the ridle can achieve improved prediction accuracy and variable selection under commonly encountered scenarios when (1) the mandatory covariates are highly correlated among themselves or (2) the mandatory variables are correlated with the optional ones. The rest of the article is presented as follows: “Methods” section introduces the ridle procedure, where an efficient algorithm is introduced and theoretical results are provided to suggest the efficacy of the ridle for sparse regression under mandatory predictors. “Results” section evaluates our method on simulated data. Further, we apply our method to a gene selection analysis of microarray data, where we identified more genes in breast-cancer related pathways with the ridle. Additionally, the ridle is the only method that identified two genes AREG and TRPM4 from the ErbB signaling pathway and ion-channel family, respectively, which are known to be related to cancer. Further discussions are provided in “Discussion” section, and we conclude with “Conclusions” section.

Methods

The Ridle

Consider the linear regression model, where y is an n-dimensional vector of random responses, X=(x 1,x 2,…,x ) is the n×d design matrix, is a vector of regression parameters, and is an n-dimensional vector of independent and identically distributed (i.i.d.) random variables with mean 0 and variance σ 2. We further assume that the response is centered and each predictor x =(x 1,x 2,…,x ) is standardized to have variance 1. We define the ridle as, where and are non-intersecting subsets of the indices such that . Subsets and comprise, respectively, indices of optional and mandatory variables. The ridle penalizes coefficients in by the L1-norm penalty and coefficients in by the L2-norm penalty. It allows variable selection, as in the lasso, for predictors in and estimation without selection, as in the ridge, for predictors in . If λ 2 is equal to 0, the ridle is equivalent to thresholding the coefficients of some predictors for variable selection and estimating the rest without penalization. As the lasso penalty is applied to optional variables while no penalization is imposed on coefficients in , we call the special case of the ridle when λ 2=0 as the -unpenalized lasso. For further insight, we examine the ridle estimator under two special situations.

Ridle estimator in two special cases

Orthogonal design case.

For X X/n equal to the identity matrix I, we can easily obtain the ridle solution in terms of the ordinary least squares estimates , where (·)+ denotes the positive part of the value, such that the expression is set to 0 for negative quantities. The ridle estimates equate to those of the lasso for and the ridge for . When λ 2=0, the -unpenalized lasso estimates equate to those of the lasso for and the OLS for . It is clear that, when the design matrix is orthogonal, the L1-norm and L2-norm penalties work independently to penalize coefficients with indices in and , respectively. The situation is more involved when predictors are correlated.

Two-predictor case

Consider the case when d=2. Let and for the ridle estimates. Figure 1 presents the penalty contours of the lasso, ridle, and ridge estimators. The ellipses centered at the OLS solutions are the contours of the quadratic loss function, plus a constant. With standardized predictors, these elliptical contours are at a ± 45°angle to the horizontal axes. Solutions occur when the ellipses first contact the penalty contours.

Fig. 1

Penalty contour when d=2 for (a) lasso, (b) ridle, and (c) ridge regressions

Penalty contour when d=2 for (a) lasso, (b) ridle, and (c) ridge regressions In Fig. 1 a, we obtain the lasso solution as an ellipse hits a corner of the lasso penalty contour, setting β 1 to 0. In Fig. 1 c, we see that the ridge penalty contour is circular, and an ellipse hitting the penalty contour gives nonzero estimates. The ridle penalty is described in Fig. 1 b. It has both the characteristics of the lasso and ridge with an oval shape along the horizontal and sharp corners on the vertical axis. The ridle solution occurs when an ellipse centered on the OLS estimates hits a sharp corner on the vertical axis, yielding β 1=0 and a nonzero β 2. Thus, we see that the ridle may provide sparse solutions for coefficients in while preserving non-sparsity for coefficients in . When d=2, we have the design matrix, with pairwise correlation ρ. We can show that the ridle estimates are where , θ 1=λ 1(n+λ 2)/(2n(n+λ 2−n ρ 2)), and θ 2=n(1−ρ 2)/(n+λ 2−n ρ 2). We see that the coefficient can be thresholded to 0 with increasing θ 1(λ 1,λ 2,ρ) and θ 2(λ 2,ρ) functions to increase (temper) the thresholding of when is negative (positive). On the other hand, converges to a weighted average of and without necessarily thesholding it to 0 as θ 1 increases to . In the special case when λ 2=0, the ridle is reduced to the -unpenalized lasso, with estimates -unpenalized lasso, with estimates , if , and otherwise. Under multicollinearity when ρ is large, the OLS estimates are known to have large variability. In this case, the ridge is often employed to improve prediction accuracy by regulating variances. Compared with the ridle, the -unpenalized lasso that imposes no penalization on mandatory coefficients can be less effective in tempering the effects of multicollinearity. For example, when ρ=1 and θ 1 is large, the -unpenalized lasso estimate for β 2 is , such that the -unpenalized lasso can have larger prediction error than the OLS. The lasso has the solution for j=1,2 and does not involve the correlation ρ when d=2 [4]. In contrast, ridge coefficients tend to be averaged with increasing correlation. This property helps ridge to reduce variances of its estimates and improve prediction accuracy when data is multicollinear [16]. The ridle estimates are also defined in terms of weighted averages of and according to correlation ρ. In the following, we will show via theoretical studies how this property can improve variable selection for ridle.

Theoretical properties

In this section, we provide theoretical properties of the ridle estimator. These results are useful in providing a window to understanding the proposed method and a guide as to how the methods might perform in practice. Here, we use the sign-consistency approach [17] for theoretical derivations, which can provide results that are easy to interpret and relate to applications. More involved theoretical approaches, such as the asymptotic and non-asympotic oracle properties [7, 18], often rely on complex conditions that are difficult to interpret. Proofs for theoretical results in this section are provided in Additional file 1. Without loss of generality, we assume that the true coefficients are partitioned such that , , and . Let and . With the columns of X partitioned as β 0, C has the expression We assume that where C is a positive definite matrix, and

Asymptotic normality of the Ridle

Theorem 1

With (5) and (6), satisfies the following for λ 1,λ 2>0 such that and . where and W∼N(0,σ 2 C). Theorem 1 shows that the coefficients of mandatory covariates can contribute to the biasness of the ridle estimates. If the coefficients of variables in are relatively large, then a small c 2 is required to keep the bias low. On the other hand, when coefficients of variables in are small, a wider spectrum of values of c 2 can be chosen to improve prediction accuracy. Hence, we expect the ridle to perform the best when coefficients of mandatory variables tend to be small.

Variable selection consistency of the Ridle.

In this section, we provide sign consistency results for the ridle. We define the ridle estimates to be sign consistent if there exist λ 1=λ 1(n) and λ 2=λ 2(n) such that Sign consistency, as a stronger condition, directly implies variable selection consistency. With (5)-(7), we give the following conditions for sign consistency of the ridle estimator. Sufficient condition: There exists η>0 such that where . Necessary condition:

Theorem 2

Under (5)-(7), is sign consistent if condition (10) holds for λ 1,λ 2>0 such that λ 1/n→0, , and λ 2/λ 1→c<∞.

Theorem 3

Under (5)-(7), is sign consistent only if condition (11) holds for λ 1,λ 2>0 such that λ 2/n→0.

Remark 1

Let . Then conditions (10) and (11) are satisfied with left-hand sides equal to 0. Thus, the ridle estimator is sign consistent when predictors with nonzero coefficients are unrelated with predictors with zero coefficients.

Remark 2

Suppose . Then, conditions (10) and (11) become respectively. This is equivalent to the Irrepresentable conditions of [17] for lasso sign consistency. Thus, when variables in are uncorrelated with those in , sign consistency for the ridle is equivalent to that of the lasso for predictors in .

Remark 3

Consider performing the lasso on predictors in both and . The following Irrepresentable conditions are derived in Zhao and Yu (2006) for the lasso [17], where and . Compared with the lasso, the ridle conditions in (10) and (11) involve the parameter λ 2 that can allow it to more easily satisfy sign consistency conditions with suitable choices of λ 2. Consider a toy example to better understand the general results through a simple scenario. If and , then the left-hand sides of conditions (10) and (11) for the ridle become whereas the left-hand sides of conditions (12) and (13) for the lasso are Here, the ridle is sign consistent for λ 2/λ 1 sufficiently small, but the lasso is not if the elements of are large or is nearly singular. Thus, we expect that the ridle may perform better than lasso using predictors from both and in terms of model selection if predictors in are correlated with the irrelevant predictors in or the predictors in are highly correlated among themselves.

Remark 4

When λ 2=0, the ridle is reduced to the -unpenalized lasso. In this case, as no penalization is involved on mandatory coefficients, sign consistency conditions can be trivially obtained as where is as defined following conditions (12) and (13) for the lasso. Compared with the -unpenalized lasso, the ridle composes of an additional offsetting factor in conditions (10) and (11) that allows sign consistency inequalities to be more easily satisfied. For example, if , , and , then the left-hand sides of conditions (10) and (11) for the ridle become In this case, if , sign consistency conditions for the -unpenalized lasso in (14) are violated, whereas sign consistency conditions for the ridle may still be satisfied with suitably chosen λ 2.

Remark 5

When the mandatory covariates are irrelevant, , the offsetting terms in (10) and (11) for ridle sign consistency would vanish. Indeed, with λ 2/n→0, ridle sign consistency conditions are equivalent to those of both the lasso and -unpenalized lasso. However, this does not mean that the methods will perform similarly under finite samples. We will examine their finite-sample performances in the Analysis of Simulated Data in the “Results” section.

Efficient algorithm

We provide an efficient algorithm for computing the ridle. Programming code, written in Fortran, and its R-language wrapper for the algorithm described in this section are freely available online at http://sites.google.com/site/zhongyindaye/software. The ridle (1) minimizes over an objective function with convex and separable penalties. This allows us to employ the coordinate descent strategy [19-21] to compute for the ridle. In the coordinate descent, we update first for all coefficients of mandatory variables and, then, each optional coefficient , one at a time. This is iterated till practical convergence is reached. The algorithm is further sped up by iterating only through the mandatory and active set till convergence before updating all variables. We provide the coordinate descent updating equations as the following, where . Maximum value for λ 1 is , where are initial estimates for coefficients of mandatory covariates. The matrix inverse in (15) can be computed efficiently by taking the inverse of individual eigenvalues added to λ 2 after an initial singular value decomposition of .

Results

Analysis of simulated data

We evaluate the performances of the ridle via simulation studies. We examine effects of having different magnitudes of coefficients, correlations between mandatory and irrelevant predictors, and degrees of multicollinearity among mandatory covariates. We compare the ridle to the ridge, lasso, elastic net, and the lasso and elastic net without penalization on the mandatory covariates. We use the R package glmnet 2.0 to compute for the lasso and elastic net, where the penalization on mandatory covariates is specified using the penalty.factor option. In each example, we simulate 200 times from the true model, y=X β+σ , where ∼N(0,I). We use n=50 number of observations and p=250 predictors. Tuning parameters are estimated using 5-fold cross-validation. We measure prediction accuracy using the relative prediction error, , where is the population covariance matrix. Further, we examine variable selection performances using sensitivity, specificity, and g-measure. Sensitivity and specificity are, respectively, the marginal proportions of selecting relevant variables and discarding irrelevant variables correctly. In other words, sensitivity is the proportion of true positives among all relevant variables, whereas specificity is the proportion of true negatives among all irrelevant variables. The false positive rate is equal to 1 - specificity. As proportions, sensitivity and specificity allow intuitive comparisons across simulation settings with varying numbers of relevant and irrelevant variables in high-dimensional variable selection. We examine overall variable selection performances using the g-measure, . A g-measure close to 1 indicates accurate variable selection, whereas a g-measure close to 0 implies that few relevant variables or too many irrelevant variables are selected, or both. In Tables 1, 2, 3 and 4, we report medians and bootstrapped standard deviations of medians out of 500 re-samplings, in parentheses. Further, we boldface, as top measurements, the smallest rpe and two largest g-measures in each case.

Table 1

Simulation example 1: effect of signal strengths

	Method	rpe	g-measure	Sensitivity	Specificity
β ₀=0.5	Ridge	1.008 (0.009)
	Lasso	1.004 (0.018)	0.582 (0.009)	0.350 (0.018)	0.957 (0.006)
	Elastic net	0.923 (0.020)	0.676 (0.007)	0.600 (0.041)	0.848 (0.023)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	0.675 (0.028)	1.000 (0.000)	1.000 (0.000)	1.000 (0.000)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	0.697 (0.026)	1.000 (0.001)	1.000 (0.000)	1.000 (0.002)
	Ridle	0.281 (0.016)	0.998 (0.001)	1.000 (0.000)	0.996 (0.002)
β ₀=1.5	Ridge	6.549 (0.056)
	Lasso	3.300 (0.083)	0.839 (0.005)	0.750 (0.017)	0.926 (0.003)
	elastic net	3.230 (0.118)	0.853 (0.004)	0.900 (0.008)	0.850 (0.005)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	0.691 (0.023)	1.000 (0.000)	1.000 (0.000)	1.000 (0.000)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	0.701(0.028)	1.000 (0.001)	1.000 (0.000)	1.000 (0.001)
	Ridle	0.473 (0.014)	0.998 (0.001)	1.000 (0.000)	0.996 (0.002)
β ₀=3	Ridge	24.559 (0.317)
	Lasso	8.074 (0.433)	0.908 (0.005)	0.900 (0.013)	0.935 (0.002)
	Elastic net	6.735 (0.339)	0.903 (0.002)	0.950 (0.013)	0.852 (0.003)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	0.676 (0.032)	1.000 (0.000)	1.000 (0.000)	1.000 (0.000)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	0.725 (0.030)	1.000 (0.000)	1.000 (0.000)	1.000 (0.001)
	Ridle	0.605 (0.025)	0.998 (0.001)	1.000 (0.000)	0.996 (0.002)

The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates

n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced

Table 2

Simulation example 2: effect of correlation between mandatory and irrelevant predictors

	Method	rpe	g-measure	Sensitivity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ)	Sensitivity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O)	Specificity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O)
ρ ₀=0.25	Ridge	1.671 (0.012)
	Lasso	1.911 (0.022)	0.383 (0.034)	0.100 (0.032)	0.200 (0.028)	0.975 (0.008)
	Elastic net	1.744 (0.019)	0.585 (0.015)	0.400 (0.054)	0.600 (0.050)	0.835 (0.036)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	1.741 (0.028)	0.742 (0.012)	1.000 (0.000)	0.200 (0.037)	0.938 (0.003)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	1.657 (0.017)	0.757 (0.008)	1.000 (0.000)	0.500 (0.064)	0.833 (0.022)
	Ridle	1.492 (0.031)	0.773 (0.006)	1.000 (0.000)	0.200 (0.048)	0.931 (0.006)
ρ ₀=0.5	Ridge	1.807 (0.014)
	Lasso	2.045 (0.035)	0.571 (0.013)	0.300 (0.046)	0.400 (0.039)	0.925 (0.007)
	Elastic net	1.773 (0.034)	0.667 (0.008)	0.600 (0.014)	0.800 (0.048)	0.756 (0.020)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	1.922 (0.044)	0.794 (0.003)	1.000 (0.000)	0.400 (0.047)	0.929 (0.004)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	1.729 (0.040)	0.796 (0.007)	1.000 (0.000)	0.700 (0.048)	0.785 (0.022)
	Ridle	1.438 (0.057)	0.852 (0.006)	1.000 (0.000)	0.600 (0.049)	0.900 (0.004)
ρ ₀=0.75	Ridge	1.564 (0.022)
	Lasso	1.365 (0.029)	0.684 (0.008)	0.400 (0.032)	0.600 (0.012)	0.900 (0.003)
	Elastic net	1.237 (0.030)	0.745 (0.005)	0.700 (0.048)	0.900 (0.011)	0.775 (0.014)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	1.423 (0.037)	0.839 (0.005)	1.000 (0.000)	0.700 (0.026)	0.904 (0.006)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	1.310 (0.041)	0.847 (0.005)	1.000 (0.000)	0.800 (0.012)	0.840 (0.008)
	Ridle	0.886 (0.029)	0.875 (0.003)	1.000 (0.000)	0.700 (0.038)	0.908 (0.003)

The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. g-measure is estimated from all predictors. Sensitivity () is computed in terms of the mandatory variables only, whereas sensitivity () and specificity () are computed in terms of the optional variables only

n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced

Table 3

Simulation example 3: effect of multicollinearity among mandatory covariates

	Method	rpe	g-measure	Sensitivity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ)	Sensitivity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O)	Specificity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O)
ρ=0.75	Ridge	6.353 (0.022)
	Lasso	4.649 (0.167)	0.802 (0.011)	0.800 (0.000)	0.700 (0.048)	0.908 (0.004)
	Elastic net	4.410 (0.128)	0.804 (0.005)	1.000 (0.009)	0.700 (0.006)	0.858 (0.006)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	4.776 (0.260)	0.829 (0.005)	1.000 (0.000)	0.700 (0.031)	0.902 (0.007)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	5.402 (0.190)	0.823 (0.006)	1.000 (0.000)	0.700 (0.013)	0.871 (0.009)
	Ridle	2.699 (0.152)	0.893 (0.007)	1.000 (0.000)	0.900 (0.048)	0.904 (0.004)
ρ=0.9	Ridge	6.270 (0.026)
	Lasso	4.914 (0.148)	0.784 (0.010)	0.600 (0.089)	0.700 (0.036)	0.908 (0.004)
	Elastic net	4.336 (0.135)	0.816 (0.005)	0.800 (0.092)	0.700 (0.018)	0.867 (0.008)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	6.992 (0.337)	0.828 (0.008)	1.000 (0.000)	0.700 (0.031)	0.902 (0.006)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	7.245 (0.237)	0.827 (0.005)	1.000 (0.000)	0.700 (0.045)	0.860 (0.011)
	Ridle	3.000 (0.214)	0.890 (0.006)	1.000 (0.000)	0.800 (0.045)	0.900 (0.004)
ρ=0.99	Ridge	6.231 (0.031)
	Lasso	7.322 (0.200)	0.745 (0.005)	0.400 (0.000)	0.700 (0.000)	0.913 (0.003)
	Elastic net	5.003 (0.155)	0.804 (0.006)	0.800 (0.049)	0.700 (0.019)	0.883 (0.006)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	36.214 (2.064)	0.824 (0.006)	1.000 (0.000)	0.700 (0.046)	0.904 (0.005)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	33.583 (2.197)	0.830 (0.004)	1.000 (0.010)	0.700 (0.045)	0.867 (0.010)
	Ridle	4.193 (0.343)	0.890 (0.005)	1.000 (0.000)	0.800 (0.029)	0.904 (0.004)

n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced

Table 4

Simulation example 4: mandatory covariates are irrelevant

	Method	rpe	g-measure	Specificity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ)	Sensitivity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O)	Specificity (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O)
ρ ₀=0.25	Ridge	1.671 (0.012)
	Lasso	1.911 (0.022)	0.383 (0.034)	1.000 (0.000)	0.200 (0.028)	0.975 (0.008)
	Elastic net	1.744 (0.019)	0.585 (0.015)	0.600 (0.053)	0.600 (0.050)	0.835 (0.036)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	2.357 (0.032)	0.215 (0.103)	0.000 (0.000)	0.050 (0.024)	0.995 (0.003)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	2.210 (0.034)	0.308 (0.054)	0.000 (0.000)	0.525 (0.065)	0.732 (0.057)
	Ridle	1.854 (0.012)	0.309 (0.029)	0.000 (0.000)	0.100 (0.024)	0.982 (0.005)
ρ ₀=0.5	Ridge	1.807 (0.014)
	Lasso	2.045 (0.035)	0.571 (0.013)	0.800 (0.006)	0.400 (0.039)	0.925 (0.007)
	Elastic net	1.773 (0.034)	0.667 (0.008)	0.500 (0.048)	0.800 (0.048)	0.756 (0.020)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	2.242 (0.023)	0.299 (0.035)	0.000 (0.000)	0.100 (0.021)	0.982 (0.004)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	2.080 (0.028)	0.305 (0.094)	0.000 (0.000)	0.550 (0.072)	0.700 (0.079)
	Ridle	1.801 (0.039)	0.528 (0.032)	0.000 (0.000)	0.300 (0.038)	0.943 (0.005)
ρ ₀=0.75	Ridge	1.564 (0.022)
	Lasso	1.365 (0.029)	0.684 (0.008)	0.700 (0.041)	0.600 (0.012)	0.900 (0.003)
	Elastic net	1.237 (0.030)	0.745 (0.005)	0.300 (0.046)	0.900 (0.011)	0.775 (0.014)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	1.747 (0.043)	0.428 (0.003)	0.000 (0.000)	0.200 (0.000)	0.964 (0.003)
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	1.662 (0.043)	0.514 (0.016)	0.000 (0.000)	0.350 (0.023)	0.900 (0.015)
	Ridle	1.253 (0.042)	0.596 (0.017)	0.000 (0.000)	0.400 (0.026)	0.945 (0.003)

The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. g-measure is estimated from all predictors. specificity () is computed in terms of the mandatory variables only, whereas sensitivity () and specificity () are computed in terms of the optional variables only

n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced

Simulation example 1: effect of signal strengths The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced Simulation example 2: effect of correlation between mandatory and irrelevant predictors The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. g-measure is estimated from all predictors. Sensitivity () is computed in terms of the mandatory variables only, whereas sensitivity () and specificity () are computed in terms of the optional variables only n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced Simulation example 3: effect of multicollinearity among mandatory covariates The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. g-measure is estimated from all predictors. Sensitivity () is computed in terms of the mandatory variables only, whereas sensitivity () and specificity () are computed in terms of the optional variables only n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced Simulation example 4: mandatory covariates are irrelevant The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. g-measure is estimated from all predictors. specificity () is computed in terms of the mandatory variables only, whereas sensitivity () and specificity () are computed in terms of the optional variables only n=50, p=250, . The smallest rpe and largest two g-measures are boldfaced

Example 1 (Effect of signal strengths)

This example has β =β 0 for j∈{1,…,10,21,…,30} and β =0 otherwise. Predictors are generated from X∼N(0,Σ) where Σ =0.5|. σ=3. We assume the mandatory covariates to be comprised of the relevant variables so that . Table 1 displays prediction accuracy and variable selection performances for this example. First of all, by utilizing a priori information on mandatory covariates, the ridle has significantly smaller rpe’s than those of the ridge, lasso and elastic net with or without penalization on mandantory covariates. Additionally, the ridle has larger g-measure than those of the lasso and elastic net and similar g-measure with the mandantory-unpenalized lasso and elastic-net method. Sensitivity for the lasso and elastic net decreases dramatically as the signal strength weakens or β 0 becomes smaller. On the other hand, specificity for the lasso decreases while increasing for elastic net when β 0 becomes larger. Furthermore, the lasso and elastic net without penalization on mandantory variables outperforms the lasso and elastic net with penaliztion on the mandantory variables in terms of both prediction accuracy and variable selection. These suggest that, even though the elastic net does a better job than the lasso in terms of prediction accuracy, both methods may not be able to distinguish well between mandatory and irrelevant variables, and incorporating a priori knowledge on mandatory covariates can yield significant improvements.

Example 2 (Effect of correlation between mandatory and irrelevant predictors)

In this example, we have β =2 for j∈{2k:k=1,…,10}, β =1.5 for j∈{2k:k=11,…,20}, and β =0 otherwise. Predictors are generated from X∼N(0,Σ) where each element . Thus, relevant predictors are interspersed with irrelevant ones, to which they are correlated. Further, we assume and σ=6. presents an autocorrelated dependence structure, such that a variable x has a correlation of ρ 0 with its immediate neighbors x and x for 1 Table 2 presents prediction accuracy and variable selection performances for this example. The ridle performs the best in terms of rpe’s. When ρ 0 is large at 0.75, mandatory covariates are strongly correlated with some of the optional variables, and the -unpenalized lasso performs the worst in terms of prediction accuracy, except that of the ridge. This corroborates comments in Two-Predictor Case of “Methods” section that suggest the -unpenalized lasso can have large prediction errors under multicollinearity. Further, the ridle performs the best in terms of g-measures for overall variable selection in all scenarios.

Example 3 (Effect of multicollinearity among mandatory covariates)

Here, we have β =3 for j∈{1,…,5}, β =1.5 for j∈{6,…,10}, β =2 for j∈{16,…,20}, and β =0 otherwise. We set σ=3 and assume . Let Z∼N(0,1) and ε ∼N(0,1). We generate predictors as for and x ∼N(0,1) otherwise. This creates correlations of ρ among the mandatory covariates. In Table 3, we see that sensitivity () decreases for the lasso and elastic net as ρ increases. Additionally, the lasso and elastic net without penalization on mandantory variables have identical sensitivity () with the ridle. Furthermore, prediction error for the lasso without penalization on mandatory covariates increases dramatically as ρ increases, whereas the ridle has the lowest rpe’s. This corroborates Remark 3 of Variable Selection Consistency of the Ridle in “Methods” section, which suggests that the ridle may outperform the lasso when mandatory variables are highly correlated among themselves.

Example 4 (Mandatory covariates are irrelevant)

We repeat the simulation setting from example 2, but with the mandatory covariates defined as . In this case, the mandatory covariates are irrelevant. Table 4 presents prediction accuracy and variable selection performances for this scenario. The ridle underperforms the elastic net but outperforms all other variable selection methods in terms of prediction accuracy. Indeed, the ridle has significantly smaller rpe’s compared with the -unpenalized lasso and -unpenalized elastic net. Moreover, the ridle underperforms both the lasso and elastic net in terms of g-measures for overall variable selection. However, ridle outperforms the -unpenalized lasso and -unpenalized elastic net at ρ 0=0.5 and ρ 0=0.75, when the irrelevant mandatory covariates are moderately and highly correlated, respectively, with some of the relevant optional variables. These suggest that the ridle, although not designed to exclude mandatory covariates when they are irrelevant, can be more advantageous than related methods that include mandatory covariates, as the ridle penalizes coefficients of irrelevant mandatory covariates towards, although not equal to, 0 with the ridge penalty.

Gene expression analysis on histologic grades of breast cancer

Histologic grades are an important determinant of the aggressive potential of breast cancers and are of practical importance in the assessment and choice of treatment options. In this section, we apply our proposed method on a microarray gene expression dataset to determine genes that may be predictive of breast tumor histologic grade [22]. In this experiment, 251 frozen tumor tissure were collected from primary breast cancer patients and more than 12,000 genes were assayed on 251 subjects. We removed 2 subjects with missing outcomes and performed our analysis with the remaining 249 observations. Clinicopathological variables, such as ER status, PgR status, age and tumor size, measured at diagnosis, were obtained from patient records. Histologic grades are based on the widely used Nottingham Histologic Score system for prognosis of breast cancer [23]. There are three factors that pathologists consider in this scoring system: cell differentiation, nuclear features and mitotic activity [24]. Considerations of these factors allow the Nottingham Prognostic Index (NPI) to provide comprehensive prognosis of breast cancers. The three factors are each assigned a score from 1–3 based on clinical observations. A tumor is assigned a score of 1, 2, or 3 for cell differentiation if >75%, 10%-75%, or <10% of tumor area form glandular structures, respectively. A tumor has a score of 1, 2, or 3 for nuclear features if nuclei have little increase in size, larger than normal breast epithelial cells, or prominent nucleoli with occasionally very large sizes, respectively. Further, breast tumors have scores of 1, 2, or 3 for mitotic activity if ≤7, 8–14, or ≥15 mitoses per 10 high power microscopic fields are observed, respectively. Overall tumor grades are obtained by summing the scores for the three factors. Breast tumors with total scores of 3–5, 6–7, and 8–9 are assigned with tumor grades 1 (low), 2 (intermediate), and 3 (high), respectively, that represent the aggressive potential of breast tumors. The higher the grade is, the more likely it will spread or become aggressive. This dataset is available at the NCBI Gene Expression Omnibus (GEO) repository with GEO accession: gse3294. We focused our analysis on 430 genes from several well-known cancer-related pathways: PI3K [25, 26], p53 signaling [27-29], VEGF [30, 31], Hedgehog signaling [32, 33], ErbB signaling [34, 35], Ras signaling [36, 37] and Ion-channel family [38, 39]. Significant genes are selected as predictors of breast tumor grade along the 7 pathways by utilizing a sparse regression approach [40, 41]. In this strategy, tumor grade is regressed upon both the 4 clinicopathological covariates (ER status, PgR status, age and tumor size) and 430 gene expression levels, and significant predictor to tumor grade based on clinical covariates and genes are identified if they are retained in sparse regression analysis. We applied the ridle to perform variable selection on gene expression levels while conditioning on the 4 clinicopathological variables that we incorporated as mandatory covariates. We further compared our results with those from the ridge, lasso, elastic net, and lasso and elastic net without penalization on the 4 mandatory covariates. Table 5 presents the numbers of genes and clinical covariates selected and the mean-squared error (MSE) using the strategies of ridge, lasso, and elastic net on both the 4 clinicopathological variables and 430 gene expression levels. Moreover, we present results from the lasso and elastic net without penalization on the 4 mandatory covariates, in addition to the ridle. The ridge has the largest MSE, suggesting the need for sparse regression. The lasso and elastic net performed similarly with the lasso and elastic net without penalization on mandatory covariates, respectively, in terms of the MSE. On the other hand, the ridle performed the best with the smallest MSE among all methods. This suggests that the ridle may be advantageous in predicting histologic grades of breast cancer.

Table 5

Gene expression analysis on histologic grades of breast cancer

	No. selected \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ	No. selected \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {O}$\end{document}O	MSE
Ridge	4	430	0.487
Lasso	2	19	0.260
Elastic net	2	14	0.286
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized lasso	4	21	0.257
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {M}$\end{document}ℳ-unpenalized elastic net	4	7	0.296
Ridle	4	24	0.239

The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. The elastic net and -unpenalized elastic net are built with alpha=0.2575 and alpha=0.8462, respectively, selected by cross-validation. Numbers of selected mandatory covariates and optional variables , and mean-squared error (MSE) are shown. Smallest MSE is boldfaced

Gene expression analysis on histologic grades of breast cancer The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the mandatory covariates. The elastic net and -unpenalized elastic net are built with alpha=0.2575 and alpha=0.8462, respectively, selected by cross-validation. Numbers of selected mandatory covariates and optional variables , and mean-squared error (MSE) are shown. Smallest MSE is boldfaced Figure 2 depicts the genes selected via sparse regression methods. The ridle encompasses all of the genes selected by the lasso and elastic net with penalization on mandatory covariates, and nearly all of the genes selected by the elastic net (except 1 gene) and lasso without penalization on mandatory covariates (except 2 genes). Two genes are selected only by the ridle: the AREG and TRPM4 genes, which belong to the ErbB signaling pathway and ion-channel family, respectively.

Fig. 2

Selection of genes and clinicopathological variables. PgR, ER, p53Status, and AgeDiagnosis are clinicopathological covariates, whereas all others are genes. The -unpenalized lasso and -unpenalized elastic net were performed without penalization on the clinicopathological variables as mandatory covariates Cells are continuously exposed to stimuli from paracrine and endocrine factors. It is essential that the extracellular signals are interpreted by cell correctly in order to facilitate proper proliferative response. The ErbB family belongs to receptors of the tyrosine kinase family and plays pivotal roles in this process [34]. Members of the ErbB signaling pathway have been suggested as potential therapeutic targets [42]. Initial studies have also suggested that expression levels of AREG (amphiregulin) are associated with larger and more aggressive tumors through cell proliferation [43, 44]. Only the ridle identified AREG as predictive of histologic grades of breast cancers. The other gene selected only by the ridle is TRPM4 from the ion-channel family. Researches over the past few years have shown that ion channels are involved in the progression and pathology of a myriad of human cancers [39, 45, 46]. In addition, ion channels are known to play critical roles in gene expression, hormone secretion, cell volume regulation, and cell proliferation [47, 48]. The expression levels of ion-channel genes, including TRPM4, have been found to be predictive of and significantly associated with tumor progression [38]. Breast cancer is known to be highly correlated with hormone secretion. Breast tumors that are ER or PgR-positive are much more likely to respond to hormone therapy than tumors that are negative. Many of these may not be related to histologic grades of breast cancer. For example, in a previous study, twenty-four ion-channel genes were found to be differentially expressed between ER-negative and ER-positive tumors [38]. However, in our analysis, we only identified 1 gene, AREG, from the ion-channel family to be predictive of histologic grades of cancer. Thus, many of the 430 breast-cancer related genes may not be predictive of histologic grades but are expected to be highly correlated with the mandatory covariates, i.e. ER and PgR statuses. As suggested by both theoretical and simulation studies, the ridle can be advantageous when mandatory variables are correlated with the irrelevant optional ones. Results from the gene expression analysis further validate and demonstrate the performances of the ridle under this commonly seen scenario.

Discussion

In this article, we proposed the ridle for sparse regression with mandatory covariates. We provided both theoretical and simulation studies that demonstrated the efficacy of our method. In particular, our results suggest that the ridle may outperform the lasso and elastic net when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves. The ridle can also improve upon performances of the lasso and elastic net when mandatory covariates have small or moderate effects. We employed the L1-norm penalty to induce sparsity on the optional set. This is chosen for its simplicity, computational ease, and successes in a myriad of applications; for example, L1-norm penalized regressions have been successfully applied in large-scale genome-wide association [3] and eQTL data studies [49]. However, other sparse regularization methods, such as the SCAD [7], adaptive lasso [8], Dantzig selector [9], etc. can also be utilized in place of the L1-norm penalty in (1). The ridle is related to the elastic net [41] that also employs both the L1-norm and L2-norm penalties. However, the elastic net applies both penalties upon all coefficients of the optional set, whereas the ridle applies the L1-norm to coefficients of the optional set and the L2-norm to coefficients of the mandatory set for simultaneous estimation of mandatory covariates while allowing selection for others. In this article, we applied our method in an interesting application to gene expression analysis where we identified more genes related to tumor grade while incorporating clinicopathological variables as mandatory covariates. In addition, the ridle can be applied in a myriad of other genomic studies where mandatory covariates are routinely required, such as when clinical, demographical, or experimental effects have to be incorporated in regression analysis of genomic data sets.

Conclusions

In this article, we proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Mandatory covariates are routinely encountered in the analysis of genetic-biomedical data. For example, additional covariates describing clinical, demographical or experimental effects need to be included a priori without subjecting them to variable selection. Results suggest that the ridle may outperform current methods when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves.

34 in total

Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer.

Background

Methods

The Ridle

Ridle estimator in two special cases

Orthogonal design case.

Two-predictor case

Theoretical properties

Asymptotic normality of the Ridle

Theorem 1

Variable selection consistency of the Ridle.

Theorem 2

Theorem 3

Remark 1

Remark 2

Remark 3

Remark 4

Remark 5

Efficient algorithm

Results

Analysis of simulated data

Example 1 (Effect of signal strengths)

Example 2 (Effect of correlation between mandatory and irrelevant predictors)

Example 3 (Effect of multicollinearity among mandatory covariates)

Example 4 (Mandatory covariates are irrelevant)

Gene expression analysis on histologic grades of breast cancer

Discussion

Conclusions

Review 1. The ErbB signaling network: receptor heterodimerization in development and cancer.

Review 2. The Hedgehog and Wnt signalling pathways in cancer.

3. The Nottingham Prognostic Index in primary breast cancer.

Review 4. Voltage-sensitive ion channels and cancer.

5. Genome-wide association analysis by lasso penalized logistic regression.

6. Prediction of prognosis for prostatic adenocarcinoma by combined histological grading and clinical staging.

Review 7. Lessons from phase III clinical trials on anti-VEGF therapy for cancer.

8. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Review 9. Aberrant function of the Ras signal transduction pathway in human breast cancer.

10. Learning a prior on regulatory potential from eQTL data.

1. Variance Component Selection With Applications to Microbiome Taxonomic Data.