
The lasso for high dimensional regression with a possible change point.

Sokbae Lee, Myung Hwan Seo, Youngki Shin.

Abstract

We consider a high dimensional regression model with a possible change point due to a covariate threshold and develop the lasso estimator of regression coefficients as well as the threshold parameter. Our lasso estimator not only selects covariates but also selects a model between linear and threshold regression models. Under a sparsity assumption, we derive non-asymptotic oracle inequalities for both the prediction risk and the ℓ1-estimation loss for regression coefficients. Since the lasso estimator selects variables simultaneously, we show that oracle inequalities can be established without pretesting the existence of the threshold effect. Furthermore, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a factor that is nearly n^{-1}, even when the number of regressors can be much larger than the sample size n. We illustrate the usefulness of our proposed estimation method via Monte Carlo simulations and an application to real data.


Keywords:  Lasso; Oracle inequalities; Sample splitting; Sparsity; Threshold models

Year:  2015        PMID: 27656104      PMCID: PMC5014306          DOI: 10.1111/rssb.12108

Source DB:  PubMed          Journal:  J R Stat Soc Series B Stat Methodol        ISSN: 1369-7412            Impact factor:   4.488


Introduction

The lasso and related methods have received rapidly increasing attention in statistics since the seminal work of Tibshirani (1996). For example, see Bühlmann and van de Geer (2011) as well as Fan and Lv (2010) and Tibshirani (2011) for a general overview and recent developments. In this paper, we develop a method for estimating a high dimensional regression model with a possible change point due to a covariate threshold, while selecting relevant regressors from a set of many potential covariates. In particular, we propose the penalized least squares (lasso) estimator of the parameters, including the unknown threshold parameter, and analyse its properties under a sparsity assumption when the number of possible covariates can be much larger than the sample size. To be specific, let {(Y_i, X_i, Q_i) : i = 1, …, n} be a sample of independent observations such that

Y_i = X_i'β_0 + X_i'δ_0 1{Q_i < τ_0} + U_i,    i = 1, …, n,    (1.1)

where, for each i, X_i is an M×1 deterministic vector, Q_i is a deterministic scalar, U_i follows N(0, σ²) and 1{·} denotes the indicator function. The scalar variable Q_i is the threshold variable and τ_0 is the unknown threshold parameter. Since Q_i is a fixed variable in our set-up, expression (1.1) includes a regression model with a change point at unknown time (e.g. Q_i = i/n). In this paper, we focus on the fixed design for X_i and Q_i and independent normal errors U_i. This set-up has been used extensively in the literature (e.g. Bickel et al. (2009)).

A regression model such as model (1.1) offers applied researchers a simple yet useful framework to model non-linear relationships by splitting the data into subsamples. Empirical examples include cross-country growth models with multiple equilibria (Durlauf and Johnson, 1995), racial segregation (Card et al., 2008) and financial contagion (Pesaran and Pick, 2007), among many others. Typically, the choice of the threshold variable is well motivated in applied work (e.g. initial per capita output in Durlauf and Johnson (1995), and the minority share in a neighbourhood in Card et al.
(2008)), but selection of the other covariates is subject to applied researchers' discretion. However, covariate selection is important in identifying threshold effects (i.e. non-zero δ_0), since a statistical model favouring threshold effects with a particular set of covariates could be overturned by a linear model with a broader set of regressors. Therefore, it seems natural to consider the lasso as a tool to estimate model (1.1). The statistical problem that we consider is to estimate the unknown parameters when M is much larger than n. For the classical set-up (estimation of the parameters without covariate selection when M is smaller than n), estimation of model (1.1) has been well studied (e.g. Tong (1990), Chan (1993) and Hansen (2000)). Also, a general method for testing threshold effects in regression (i.e. testing H_0 : δ_0 = 0 in model (1.1)) is available for the classical set-up (e.g. Lee et al. (2011)). Although there are many references on lasso-type methods, and equally many on change points, sample splitting and threshold models, there seem to be only a handful of references that intersect both topics. Wu (2008) proposed an information-based criterion for carrying out change point analysis and variable selection simultaneously in linear models with a possible change point; however, the method proposed in Wu (2008) would be infeasible in a sparse high dimensional model. In change point models without covariates, Harchaoui and Lévy-Leduc (2008, 2010) proposed a method for estimating the location of change points in one-dimensional piecewise constant signals observed in white noise, using a penalized least squares criterion with an ℓ1-type penalty. Zhang and Siegmund (2012) developed Bayes information criterion type criteria for determining the number of changes in the mean of multiple sequences of independent normal observations when the number of change points can increase with the sample size.
Ciuperca (2014) considered a similar estimation problem to ours, but the corresponding analysis was restricted to the case when the number of potential covariates is small. In this paper, we consider the lasso estimator of the regression coefficients as well as the threshold parameter. Since the change point parameter does not enter additively in model (1.1), the resulting optimization problem in the lasso estimation is non-convex. We overcome this problem by comparing the values of standard lasso objective functions on a grid over the range of possible values of τ. Theoretical properties of the lasso and related methods for high dimensional data have been examined by Fan and Peng (2004), Bunea et al. (2007), Candès and Tao (2007), Huang et al. (2008a, b), Kim et al. (2008), Bickel et al. (2009) and Meinshausen and Yu (2009), among many others. Most of these references consider quadratic objective functions and linear or non-parametric models with an additive mean-zero error. There has been recent interest in extending this framework to generalized linear models (e.g. van de Geer (2008) and Fan and Lv (2011)), to quantile regression models (e.g. Belloni and Chernozhukov (2011a), Bradic et al. (2011) and Wang et al. (2012)), and to hazards models (e.g. Bradic et al. (2012) and Lin and Lv (2013)). We contribute to this literature by considering a regression model with a possible change point and then deriving non-asymptotic oracle inequalities for both the prediction risk and the ℓ1-estimation loss for the regression coefficients under a sparsity scenario. Our theoretical results build on Bickel et al. (2009). Since the lasso estimator selects variables simultaneously, we show that oracle inequalities that are similar to those obtained in Bickel et al. (2009) can be established without pretesting the existence of the threshold effect. In particular, when there is no threshold effect (δ_0 = 0), we prove oracle inequalities that are basically equivalent to those in Bickel et al. (2009).
Furthermore, when δ_0 ≠ 0, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a factor of nearly n^{-1}, even when the number of regressors can be much larger than the sample size. To achieve this, we develop some sophisticated chaining arguments and provide sufficient regularity conditions under which we prove oracle inequalities. The superconsistency of the threshold estimate is well known when the number of covariates is small (see, for example, Chan (1993) and Seijo and Sen (2011a, b)). To the best of our knowledge, our paper is the first work that demonstrates the possibility of a nearly n^{-1}-bound in the context of sparse high dimensional regression models with a change point. The remainder of this paper is as follows. In Section 2 we propose the lasso estimator, and in Section 3 we give a brief illustration of our proposed estimation method by using a real data example in economics. In Section 4 we establish the prediction consistency of our lasso estimator. In Section 5 we establish sparsity oracle inequalities in terms of both the prediction loss and the ℓ1-estimation loss for the unknown parameters, while providing low level sufficient conditions for two possible cases of threshold effects. In Section 6 we present results of some simulation studies, and Section 7 concludes. The on-line appendices consist of six sections: appendix A provides sufficient conditions for one of our main assumptions, appendix B gives some additional discussion of identifiability of τ_0, appendices C, D and E contain all the proofs, and appendix F provides additional numerical results.

Notation

We collect the notation that is used in the paper here. For α = (β′, δ′)′ following model (1.1), let X_i(τ) denote the 2M×1 vector such that X_i(τ) = (X_i′, X_i′1{Q_i < τ})′ and let X(τ) denote the n×2M matrix whose ith row is X_i(τ)′. For an L-dimensional vector a, let |a|_p denote the ℓ_p-norm of a, J(a) = {j ∈ {1, …, L} : a_j ≠ 0}, and |J(a)| denote the cardinality of J(a). In addition, let M(a) denote the number of non-zero elements of a, i.e. M(a) = |J(a)|. Let a_J denote the vector in R^L that has the same co-ordinates as a on a set J and zero co-ordinates on the complement of J. For any n-dimensional vector W = (W_1, …, W_n)′, define the empirical norm as ||W||_n = {n^{-1} Σ_{i=1}^n W_i²}^{1/2}. Let the superscript ‘(j)’ denote the jth element of a vector or the jth column of a matrix, depending on the context. Finally, define f_i = X_i′β_0 + X_i′δ_0 1{Q_i < τ_0}, f = (f_1, …, f_n)′ and f̂ = X(τ̂)α̂, where (α̂, τ̂) denotes the lasso estimator defined in Section 2. Then, we define the prediction risk as ||f̂ − f||_n.

Lasso estimation

Let α = (β′, δ′)′. Then, using the notation defined above, we can rewrite model (1.1) as

Y_i = X_i(τ_0)′α_0 + U_i.    (2.1)

Let y = (Y_1, …, Y_n)′. For any fixed τ ∈ T_0, where T_0 is a parameter space for τ_0, consider the residual sum of squares

S_n(α, τ) = n^{-1} Σ_{i=1}^n {Y_i − X_i(τ)′α}²,

where α = (α_1, …, α_{2M})′. We define the following 2M×2M diagonal matrix:

D(τ) = diag{||X^{(j)}(τ)||_n : j = 1, …, 2M}.    (2.3)

For each fixed τ, define the lasso solution α̂(τ) by

α̂(τ) = argmin_{α ∈ A} [S_n(α, τ) + λ|D(τ)α|_1],

where λ is a tuning parameter that depends on n and A is a parameter space for α_0. It is important to note that the scale normalizing factor D(τ) depends on τ, since different values of τ generate different dictionaries X(τ). To see this more clearly, note that, for each τ and for each j = 1, …, M, we have X^{(j)}(τ) = X^{(j)} and X^{(M+j)}(τ) = X^{(j)}1{Q < τ}, where the indicator is applied element by element. Using this notation, we rewrite the ℓ1-penalty as

λ|D(τ)α|_1 = λ Σ_{j=1}^{2M} ||X^{(j)}(τ)||_n |α_j|.

Therefore, for each fixed τ, α̂(τ) is the weighted lasso that uses a data-dependent ℓ1-penalty to balance covariates adequately. We now estimate τ_0 by

τ̂ = argmin_{τ ∈ T_0} [S_n{α̂(τ), τ} + λ|D(τ)α̂(τ)|_1].

In fact, for any finite n, the argmin above is given by an interval, and we simply define the maximum of the interval as our estimator. If we wrote the model by using 1{Q_i ⩾ τ} then the convention would be the minimum of the interval being the estimator. Then the estimator of α_0 is defined as α̂ = α̂(τ̂). In fact, our proposed estimator of (α, τ) can be viewed as the one-step minimizer such that

(α̂, τ̂) = argmin_{α ∈ A, τ ∈ T_0} [S_n(α, τ) + λ|D(τ)α|_1].    (2.5)

It is worth noting that we penalize β and δ in expression (2.5), where δ is the change of the regression coefficients between the two regimes. Model (1.1) can be written as

Y_i = X_i′β_0 1{Q_i ⩾ τ_0} + X_i′φ_0 1{Q_i < τ_0} + U_i,    (2.6)

where φ_0 = β_0 + δ_0. In view of model (2.6), alternatively, one might penalize β and φ instead of β and δ. We opted to penalize δ in this paper since the case δ_0 = 0 corresponds to the linear model. If δ̂ = 0, then this case amounts to selecting the linear model.
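The profiled estimation just described can be sketched numerically. The sketch below is an illustration, not the authors' code: it replaces the least angle regression implementation with plain coordinate descent, handles D(τ) by rescaling columns to unit empirical norm, and resolves ties over τ by the first grid minimum rather than the maximum-of-interval convention above; all function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(Z, y, lam, n_sweeps=200):
    """Coordinate descent for (1/n)|y - Z a|_2^2 + lam * |a|_1, assuming each
    column of Z has unit empirical norm, (1/n) * sum_i Z_ij^2 = 1."""
    n, p = Z.shape
    a = np.zeros(p)
    r = y.astype(float).copy()                  # current residual y - Z a
    for _ in range(n_sweeps):
        for j in range(p):
            rho = a[j] + Z[:, j] @ r / n        # partial residual correlation
            new = soft_threshold(rho, lam / 2.0)
            if new != a[j]:
                r -= (new - a[j]) * Z[:, j]
                a[j] = new
    return a

def threshold_lasso(X, Q, y, tau_grid, lam):
    """Profiled lasso for y_i = X_i'b + X_i'd 1{Q_i < tau} + noise: for each
    tau on the grid, solve a weighted lasso on the dictionary X(tau) (columns
    rescaled by the diagonal of D(tau)), then pick the tau with the smallest
    penalized objective."""
    best = None
    for tau in tau_grid:
        Xt = np.hstack([X, X * (Q < tau)[:, None]])   # dictionary X(tau)
        d = np.sqrt((Xt ** 2).mean(axis=0))           # diagonal of D(tau)
        d[d == 0.0] = 1.0                             # guard a regime with no data
        a = lasso_cd(Xt / d, y, lam)                  # lasso in rescaled coordinates
        alpha = a / d                                 # back to the original scale
        obj = ((y - Xt @ alpha) ** 2).mean() + lam * np.abs(a).sum()
        if best is None or obj < best[0]:
            best = (obj, tau, alpha)
    return best[1], best[2]                           # (tau_hat, alpha_hat)
```

Note that the plain ℓ1-penalty on the rescaled problem equals the weighted penalty λ|D(τ)α|_1, so each inner fit is exactly a weighted lasso in the sense of expression (2.5).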

Empirical illustration

In this section, we apply the proposed lasso method to growth regression models in economics. The neoclassical growth model predicts that economic growth rates converge in the long run. This theory has been tested empirically by looking at the negative relationship between the long-run growth rate and the initial gross domestic product (GDP), given other covariates (see Barro and Sala-i-Martin (1995) and Durlauf et al. (2005) for literature reviews). Although empirical results confirmed the negative relationship between growth rate and initial GDP, there has been some criticism that the results depend heavily on the selection of covariates. Recently, Belloni and Chernozhukov (2011b) showed that lasso estimation can help to select the covariates in the linear growth regression model and that the lasso estimation results reconfirm the negative relationship between long-run growth rate and initial GDP. We consider the growth regression model with a possible threshold. Durlauf and Johnson (1995) provided the theoretical background for the existence of multiple steady states and estimated the model with two possible threshold variables. They checked the robustness of their results by adding other available covariates to the model, but this is still not free from the criticism of ad hoc variable selection. Our proposed lasso method might be a good alternative in this situation. Furthermore, as we shall show later, our method works well even if there is no threshold effect in the model. Therefore, one might expect more robust results from our approach. The regression model that we consider takes the form of model (1.1) (model (3.1) in what follows), where the dependent variable gr_i is the annualized GDP growth rate of country i from 1960 to 1985, lgdp60 (the log-GDP in 1960) is a key regressor and Q_i is a possible threshold variable, for which we use the initial GDP or the adult literacy rate in 1960, following Durlauf and Johnson (1995). Finally, X_i includes a vector of additional covariates related to education, market efficiency, political stability, market openness and demographic characteristics.
In addition, X_i contains cross-product terms between lgdp60 and the education variables. Table 1 gives a list of all the covariates used and a description of each variable. We include as many covariates as possible, which might mitigate potential omitted variable bias. The data set mostly comes from Barro and Lee (1994), and the additional adult literacy rate is from Durlauf and Johnson (1995). Because of missing observations, we have 80 observations with 46 covariates (including a constant term) when Q_i is the initial GDP (n=80 and M=46), and 70 observations with 47 covariates when Q_i is the literacy rate (n=70 and M=47). It is worth noting that the number of covariates in the threshold models is larger than the number of observations (2M > n in our notation). Thus, we cannot adopt the standard least squares method to estimate the threshold regression model.
Table 1

List of variables

Variable name: Description

Dependent variable
gr: Annualized GDP growth rate in the period 1960–1985

Threshold variables
gdp60: Real GDP per capita in 1960 (1985 price)
lr: Adult literacy rate in 1960

Covariates
lgdp60: Log-GDP per capita in 1960 (1985 price)
lr: Adult literacy rate in 1960 (only included when Q=lr)
lsk: log(investment/output) annualized over 1960–1985; a proxy for log(physical savings rate)
lgrpop: log(population growth rate) annualized over 1960–1985
pyrm60: log(average years of primary schooling) in the male population in 1960
pyrf60: log(average years of primary schooling) in the female population in 1960
syrm60: log(average years of secondary schooling) in the male population in 1960
syrf60: log(average years of secondary schooling) in the female population in 1960
hyrm60: log(average years of higher schooling) in the male population in 1960
hyrf60: log(average years of higher schooling) in the female population in 1960
nom60: Percentage of no schooling in the male population in 1960
nof60: Percentage of no schooling in the female population in 1960
prim60: Percentage of primary schooling attained in the male population in 1960
prif60: Percentage of primary schooling attained in the female population in 1960
pricm60: Percentage of primary schooling complete in the male population in 1960
pricf60: Percentage of primary schooling complete in the female population in 1960
secm60: Percentage of secondary schooling attained in the male population in 1960
secf60: Percentage of secondary schooling attained in the female population in 1960
seccm60: Percentage of secondary schooling complete in the male population in 1960
seccf60: Percentage of secondary schooling complete in the female population in 1960
llife: log(life expectancy at age 0) averaged over 1960–1985
lfert: log(fertility rate) averaged over 1960–1985
edu/gdp: Government expenditure on education per GDP averaged over 1960–1985
gcon/gdp: Government consumption expenditure net of defence and education per GDP averaged over 1960–1985
revol: Number of revolutions per year over 1960–1984
revcoup: Number of revolutions and coups per year over 1960–1984
wardum: Dummy for countries that participated in at least one external war over 1960–1984
wartime: Fraction of time over 1960–1985 involved in external war
lbmp: log(1 + black market premium averaged over 1960–1985)
tot: Terms-of-trade shock
lgdp60 × ‘educ’: Product of two covariates (interaction of lgdp60 and the education variables from pyrm60 to seccf60); 16 variables in total
Table 2 summarizes the model selection and estimation results when Q is the initial GDP. In the on-line appendix F (see Table 4), we report additional empirical results with Q being the literacy rate. To compare different model specifications, we also estimate a linear model, i.e. all δs are 0s in model (3.1), by standard lasso estimation. In each case, the regularization parameter λ is chosen by the ‘leave-one-out’ cross-validation method. For the range of the threshold parameter, we consider an interval between the 10% and 90% sample quantiles of each threshold variable.
Table 2

Model selection and estimation results with Q=gdp60a

Variable  Value for the linear model  Values for the threshold model, τ̂ = 2898 (β̂, δ̂)

Constant   −0.0923   −0.0811
lgdp60   −0.0153   −0.0120
lsk   0.0033   0.0038
lgrpop   0.0018
pyrf60   0.0027
syrm60   0.0157
hyrm60   0.0122   0.0130
hyrf60   −0.0389   −0.0807
nom60   2.64 × 10^{-5}
prim60   −0.0004   −0.0001
pricm60   0.0006   −1.73 × 10^{-4}   0.35 × 10^{-4}
pricf60   −0.0006
secf60   0.0005
seccm60   0.0010   0.0014
llife   0.0697   0.0523
lfert   −0.0136   −0.0047
edu/gdp   −0.0189
gcon/gdp   −0.0671   −0.0542
revol   −0.0588
revcoup   0.0433
wardum   −0.0043   −0.0022
wartime   −0.0019   −0.0143   −0.0023
lbmp   −0.0185   −0.0174   −0.0015
tot   0.0971   0.0974
lgdp60 × pyrf60   3.81 × 10^{-6}
lgdp60 × syrm60   0.0002
lgdp60 × hyrm60   0.0050
lgdp60 × hyrf60   −0.0003
lgdp60 × nom60   8.26 × 10^{-6}
lgdp60 × prim60   6.02 × 10^{-7}
lgdp60 × prif60   3.47 × 10^{-6}   8.11 × 10^{-6}
lgdp60 × pricf60   8.46 × 10^{-6}
lgdp60 × secm60   −0.0001
lgdp60 × seccf60   −0.0002   2.87 × 10^{-6}
λ   0.0004   0.0034
M(α̂)   28   26
Number of covariates   46   92
Number of observations   80   80

The regularization parameter λ is chosen by the ‘leave-one-out’ cross-validation method. M(α̂) denotes the number of covariates selected by the lasso estimator, and a blank entry indicates that the regressor is not selected. Recall that β̂ estimates the coefficient when Q ⩾ τ̂ and that δ̂ estimates the change in the coefficient value when Q < τ̂.

Main empirical findings are as follows. First, the marginal effect of lgdp60, which is given by the lgdp60 coefficient plus the interaction coefficients multiplied by educ (where educ is the vector of education variables and the relevant interaction coefficients are the subvectors of β̂ and δ̂ corresponding to educ), is estimated to be negative for all the observed values of educ. This confirms the theory of the neoclassical growth model. Second, some non-zero coefficients of the interaction terms between lgdp60 and various education variables show the existence of threshold effects in both threshold model specifications. This result implies that the growth convergence rates can vary according to different levels of the initial GDP or the adult literacy rate in 1960. Specifically, in both threshold models, the δ̂ coefficient on lgdp60 itself is 0, but some δ̂s on the interaction terms are not 0. Thus, conditionally on other covariates, there are different technological diffusion effects according to the threshold point. For example, a developing country (lower Q) with a higher education level will converge faster, perhaps by absorbing advanced technology more easily and more quickly. Finally, the lasso with the threshold model specification selects a more parsimonious model (26 covariates) than that with the linear specification (28 covariates), even though the former doubles the number of potential covariates.

Prediction consistency of lasso estimator

In this section, we consider the prediction consistency of the lasso estimator. We make the following assumptions.

Assumption 1. (a) For the parameter space A for α_0, any α ∈ A, including α_0, satisfies |α|_∞ ⩽ C_1 for some constant C_1 < ∞. In addition, τ_0 ∈ T_0. (b) There are universal constants c_1 > 0 and c_2 < ∞ such that c_1 ⩽ ||X^{(j)}(τ)||_n uniformly in j and τ ∈ T_0, and ||X^{(j)}(τ)||_n ⩽ c_2 uniformly in j, where j = 1, …, 2M. (c) There is no i ≠ j such that Q_i = Q_j.

Assumption 1(a) imposes boundedness on each component of the parameter vector. The first part of assumption 1(a) seems weak, since the sparsity assumption implies that the number of non-zero components of α_0 is much smaller than 2M. Furthermore, in the literature on change point and threshold models, it is common to assume that the parameter space is compact; for example, see Seijo and Sen (2011a, b). The lasso estimator in expression (2.5) can be computed without knowing the constants above, but T_0 must be specified. In practice, researchers tend to choose some strict subset of the range of observed values of the threshold variable. Assumption 1(b) imposes that each covariate is of the same magnitude uniformly over τ. In view of the assumption that ||X^{(j)}(τ)||_n ⩽ c_2, it is not stringent to assume that ||X^{(j)}(τ)||_n is also bounded away from zero. Assumption 1(c) imposes that there is no tie among the Q_i's. This is a convenient assumption, under which we can always transform a general threshold variable into one with distinct values without loss of generality; it holds with probability 1 in the random-design case if Q_i is continuously distributed. Assumption 1(b) also implies that the diagonal entries of D(τ), which are defined in expression (2.3), are bounded away from zero uniformly in τ. To establish the theoretical results in the paper (in particular, the oracle inequalities in Section 5), let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with

λ = Aσ √{log(2M)/n}    (4.2)

for a constant A > 2√2/μ, where μ ∈ (0, 1) is a fixed constant. We now present the first theoretical result of this paper.
Theorem 1 (consistency of the lasso). Let assumption 1 hold. Let μ be a constant such that 0 < μ < 1, and let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with λ given by equation (4.2). Then, with probability at least 1 − (2M)^{1−A²μ²/8}, the prediction risk satisfies ||f̂ − f||²_n ⩽ C(μ) λ M(α_0), where C(μ) depends only on μ and the constants of assumption 1.

The non-asymptotic upper bound on the prediction risk in theorem 1 can be translated easily into asymptotic convergence: theorem 1 implies the consistency of the lasso, provided that n → ∞, M → ∞ and λ M(α_0) → 0. Recall that M(α_0) represents the sparsity of model (2.1). In view of equation (4.2), the condition λ M(α_0) → 0 requires that M(α_0) = o[{n/log(2M)}^{1/2}]; in particular, M(α_0) can increase with n. Note that the prediction error bound increases as A or μ increases; however, the probability of correct recovery also increases as A or μ increases. Therefore, there is a trade-off between the prediction error and the probability of correct recovery.

Oracle inequalities

In this section, we establish finite sample sparsity oracle inequalities in terms of both the prediction loss and the ℓ1-estimation loss for the unknown parameters. First of all, we make the following assumption.

Assumption 2 (uniform restricted eigenvalue, URE(s, c_0, T)). For some integer s such that 1 ⩽ s ⩽ 2M, a positive number c_0 and some set T ⊆ T_0, the following condition holds:

κ(s, c_0, T) ≡ min_{τ ∈ T} min_{J_0 ⊆ {1,…,2M} : |J_0| ⩽ s} min_{γ ≠ 0 : |γ_{J_0^c}|_1 ⩽ c_0|γ_{J_0}|_1} |X(τ)γ|_2 / {√n |γ_{J_0}|_2} > 0.

If τ_0 were known, then assumption 2 would just be a restatement of the restricted eigenvalue assumption of Bickel et al. (2009) with T = {τ_0}. Bickel et al. (2009) provided sufficient conditions for the restricted eigenvalue condition. In addition, van de Geer and Bühlmann (2009) showed the relationships between the restricted eigenvalue condition and other conditions on the design matrix, and Raskutti et al. (2010) proved that restricted eigenvalue conditions hold with high probability for a large class of correlated Gaussian design matrices. Since τ_0 is unknown in our set-up, it seems necessary to assume that the restricted eigenvalue condition holds uniformly over τ. We consider separately two cases, depending on whether δ_0 = 0 or not. On the one hand, if δ_0 = 0, so that τ_0 is not identifiable, then we need to assume that the URE condition holds uniformly on the whole parameter space T_0. On the other hand, if δ_0 ≠ 0, so that τ_0 is identifiable, then it suffices to impose that the URE condition holds uniformly on a neighbourhood of τ_0. In the on-line appendix A, we provide two types of sufficient conditions for assumption 2. One type is based on modifications of assumption 2 of Bickel et al. (2009) and the other type is in the same spirit as van de Geer and Bühlmann (2009), section 10.1. Using the second type of results, we verify primitive sufficient conditions for the URE condition in the context of our simulation designs; see the on-line appendix A for details. The URE condition is useful for improving the result in theorem 1. Recall that, in theorem 1, the prediction risk is bounded by a term of order λ M(α_0). This bound is too large to give us an oracle inequality.
We shall show below that we can establish non-asymptotic oracle inequalities for the prediction risk as well as the ℓ1-estimation loss, thanks to the URE condition. The strength of the proposed lasso method is that it is not necessary to know or pretest whether δ_0 = 0 or not. It is worth noting that we do not have to know whether there is a threshold in the model to establish oracle inequalities for the prediction risk and the ℓ1-estimation loss for α_0, although we divide our theoretical results into two cases below. This implies that we can make predictions and estimate α_0 precisely without knowing the presence of a threshold effect or pretesting for it.
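Why uniformity over τ matters, and why T_0 is taken to be a strict subset of the support of Q, can be seen numerically. The sketch below uses a low dimensional random design and the smallest eigenvalue of the full Gram matrix of X(τ) (a stronger requirement than the restricted eigenvalue, used purely for illustration; the design and sizes are assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 200, 5
X = rng.normal(size=(n, M))     # low dimensional illustrative design
Q = rng.uniform(size=n)

def min_eig(tau):
    """Smallest eigenvalue of the Gram matrix X(tau)'X(tau)/n."""
    Xt = np.hstack([X, X * (Q < tau)[:, None]])
    return np.linalg.eigvalsh(Xt.T @ Xt / n)[0]

# A uniform lower bound over a strict subset of the support of Q ...
taus = np.linspace(0.15, 0.85, 71)
uniform_bound = min(min_eig(t) for t in taus)
# ... degrades near the edge, where one regime contains almost no data.
edge_value = min_eig(0.05)
print(uniform_bound, edge_value)
```

Here edge_value is noticeably smaller than uniform_bound, which illustrates why restricting the threshold range away from the edges of the observed Q values (as with the 10%-90% quantile rule of Section 3) helps keep a URE-type constant κ bounded away from zero.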

Case I: no threshold

We first consider the case that δ_0 = 0. In other words, we estimate a threshold model via the lasso method, but the true model is simply the linear model Y_i = X_i′β_0 + U_i. This is an important case to consider in applications, because one may be unsure not only about covariate selection but also about the existence of a threshold in the model. Let φ_max denote the supremum (over τ ∈ T_0) of the largest eigenvalue of the Gram matrix X(τ)′X(τ)/n; then, by definition, the largest eigenvalue of X(τ)′X(τ)/n is bounded uniformly in τ by φ_max. The following theorem gives oracle inequalities for the first case.

Theorem 2. Suppose that δ_0 = 0. Let assumptions 1 and 2 hold with T = T_0, s ⩾ M(α_0) and a constant c_0 that depends only on μ, for 0 < μ < 1. Let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with λ given by expression (4.2). Then, with at least the probability stated in theorem 1, we have

||f̂ − f||²_n ⩽ C λ² M(α_0)/κ²    and    |α̂ − α_0|_1 ⩽ C λ M(α_0)/κ²

for some universal constant C.

To appreciate the usefulness of the inequalities derived above, it is worth comparing the inequalities in theorem 2 with those in theorem 7.2 of Bickel et al. (2009). The latter corresponds to the case that τ_0 is known a priori and δ_0 = 0 in our notation. If we compare theorem 2 with theorem 7.2 of Bickel et al. (2009), we can see that the lasso estimator in expression (2.5) gives qualitatively the same oracle inequalities as the lasso estimator in the linear model, even though our model is much more heavily parameterized, in that δ and τ are added to β as parameters to estimate. Also, as in Bickel et al. (2009), there is no requirement that the minimum absolute value of the non-zero components of α_0 be bounded away from zero; in other words, there is no need to assume a minimum strength of the signals. Furthermore, α_0 is well estimated here even though τ_0 is not identifiable at all. Finally, note that the value of the constant C is given in the proof of theorem 2 and that theorem 2 can be translated easily into asymptotic oracle results as well, since both κ and the diagonal entries of D(τ) are bounded away from zero by the URE condition and assumption 1 respectively.

Case II: fixed threshold

This subsection explores the case where the threshold effect δ_0 is well identified and discontinuous. We begin with the following additional assumptions to reflect this.

Assumption 3 (identifiability under sparsity and discontinuity of regression). For a given s and for any η and τ such that M(η) ⩽ s and |τ − τ_0| is no smaller than a vanishing lower bound, there is a constant c > 0 such that the discrepancy ||X(τ)η − f||²_n is at least c|τ − τ_0|.

Assumption 3 implies, among other things, that, for some c > 0 and for any such τ,

inf_{η : M(η) ⩽ s} ||X(τ)η − f||²_n ⩾ c|τ − τ_0|.    (5.1)

This condition can be regarded as identifiability of τ_0. If τ_0 were known, then a sufficient condition for identifiability under the sparsity would be that URE holds for suitable s. Thus, the main point in result (5.1) is that there is no sparse representation that is equivalent to f when the sample is split by some τ ≠ τ_0. In fact, assumption 3 is stronger than just the identifiability of τ_0, as it specifies the rate of deviation in f as τ moves away from τ_0, which in turn dictates the bound for the estimation error of τ̂. We provide further discussion of assumption 3 in the on-line appendix B. The lower bound restriction on |τ − τ_0| in assumption 3 is necessary since we consider the fixed design for both X_i and Q_i. Throughout this section, we implicitly assume that the sample size n is sufficiently large that this lower bound is very small, implying that the restriction never binds in any of the inequalities below. This is typically true in the random-design case if Q_i is continuously distributed.

Assumption 4 (smoothness of design). For any η > 0, there is a constant C < ∞ such that, uniformly in j, the empirical second moment of X^{(j)} over observations with Q_i lying between τ and τ_0 is at most C|τ − τ_0| whenever |τ − τ_0| ⩽ η.

Assumption 4 has been assumed in the classical set-up with a fixed number of stochastic regressors, to exclude cases in which, for example, the distribution of Q_i has a point mass at τ_0 or a relevant second moment is unbounded. In our set-up, assumption 4 amounts to a deterministic version of a smoothness assumption on the distribution of the threshold variable Q_i. When (X_i, Q_i) is a random vector, it is satisfied under the standard assumption that Q_i is continuously distributed and that the conditional second moment of X_i^{(j)} given Q_i, multiplied by the density of Q_i, is continuous and bounded in a neighbourhood of τ_0 for each j. To simplify the notation in the following theorem, we impose a harmless normalization without loss of generality.
In addition, the constants appearing below are those of theorem 1.

Assumption 5 (well-defined second moments). For any admissible η, the average second moments of the fixed design over the relevant blocks of observations are bounded, where block sizes are measured by the integer part [·] of real numbers.

Assumption 5 amounts to a weak regularity condition on the second moments of the fixed design. Assumption 3 implies that the corresponding quantity is bounded away from zero; hence, assumptions 3 and 5 together imply that it is bounded and bounded away from zero. To present the theorem below, it is necessary to make one additional technical assumption (see assumption 6 in the on-line appendix E). We opted not to state assumption 6 here, since we believe that it is just a sufficient condition that does not add much to the understanding of the main result. However, we would like to point out that assumption 6 can hold for all sufficiently large n, provided that the relevant quantities vanish as n → ∞; see remark 4 in the on-line appendix E for details. We now give the main result of this section.

Theorem 3. Suppose that assumptions 1 and 2 hold with T a neighbourhood of τ_0, s ⩾ M(α_0) and a constant c_0 that depends only on μ, for 0 < μ < 1. Furthermore, suppose that assumptions 3, 4 and 5 hold and let n be sufficiently large that assumption 6 in the on-line appendix E holds. Let (α̂, τ̂) be the lasso estimator defined by expression (2.5) with λ given by expression (4.2). Then, with probability at least that of theorem 1 minus a term that vanishes as n → ∞, the prediction risk ||f̂ − f||²_n and the estimation loss |α̂ − α_0|_1 satisfy the same inequalities (up to constants) as in theorem 2 and, in addition, |τ̂ − τ_0| ⩽ C M(α_0)λ² for some universal constant C.

Theorem 3 gives the same inequalities (up to constants) as those in theorem 2 for the prediction risk as well as the ℓ1-estimation loss for α_0. It is important to note that |α̂ − α_0|_1 is bounded by a constant times λ M(α_0), whereas |τ̂ − τ_0| is bounded by a constant times λ² M(α_0), which is nearly n^{-1} under the conditions above. This can be viewed as a non-asymptotic version of the superconsistency of τ̂ for τ_0. As noted at the end of Section 5.1, since both κ and the diagonal entries of D(τ) are bounded away from zero by the URE condition and assumption 1 respectively, theorem 3 implies asymptotic rate results immediately. The values of the constants are given in the proof of theorem 3.
The main contribution of this section is that we have extended the well-known superconsistency result for the threshold estimate from the classical setting, in which M is small and fixed, to sparse high dimensional models in which M can be much larger than n.

Monte Carlo experiments

In this section we conduct some simulation studies and check the properties of the lasso estimator proposed. The baseline model is model (1.1), where X_i is an M-dimensional vector generated from N(0, I), Q_i is a scalar generated from the uniform distribution on the interval (0,1) and the error term U_i is generated from N(0, 0.5²). The threshold parameter τ_0 is set to 0.3, 0.4 or 0.5, depending on the simulation design, and the coefficients β_0 and δ_0 are set to sparse vectors, with δ_0 proportional to a jump scale c, where c=0 or c=1. Note that there is no threshold effect when c=0. The number of observations is set to n=200. Finally, the dimension of X_i in each design is set to M=50, 100, 200, 400, so that the total numbers of regressors are 100, 200, 400 and 800 respectively. The range of τ is a fixed interval strictly inside (0,1). We can estimate the parameters by the standard lasso–least angle regression algorithm of Efron et al. (2004) without much modification. Given a regularization parameter value λ, we estimate the model for each grid point of τ, where the grid spans 71 equispaced points over this range. This procedure can be conducted by using the standard linear lasso. Next, we plug the estimated parameters for each τ into the objective function and choose τ̂ as its minimizer. Finally, α_0 is estimated by α̂ = α̂(τ̂). The regularization parameter λ is chosen by expression (4.2), where σ=0.5 is assumed to be known. For the constant A, we use four different values: A=2.8, 3.2, 3.6, 4.0. Table 3 and Figs 1 and 2 summarize these simulation results. To compare the performance of the lasso estimator, we also report the estimation results of least squares estimation (‘least squares’), available only when M=50, and two oracle models (oracle 1 and oracle 2). Oracle 1 assumes that the regressors with non-zero coefficients are known. In addition to that, oracle 2 assumes that the true threshold parameter is known. Thus, when c≠0, oracle 1 estimates the non-zero coefficients and τ_0 by using least squares estimation whereas oracle 2 estimates only the non-zero coefficients. When c=0, both oracle 1 and oracle 2 estimate only the non-zero coefficients. All results are based on 400 replications of each sample.
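A minimal sketch of one replication of this baseline design follows. The specific sparse β_0 and δ_0, the seed and the grid range [0.15, 0.85] are illustrative assumptions rather than the paper's exact choices (the grid range is chosen so that 71 equispaced points give a step of 0.01):

```python
import numpy as np

def simulate(n=200, M=50, tau0=0.5, c=1.0, sigma=0.5, seed=0):
    """One draw from the baseline design: X ~ N(0, I_M), Q ~ U(0, 1),
    U ~ N(0, sigma^2), y = X'beta0 + X'delta0 1{Q < tau0} + U.
    beta0 and delta0 are illustrative sparse vectors, not the paper's values."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, M))
    Q = rng.uniform(size=n)
    U = sigma * rng.normal(size=n)
    beta0 = np.zeros(M)
    beta0[:2] = 1.0                         # hypothetical sparse beta0
    delta0 = np.zeros(M)
    delta0[:2] = c                          # threshold effect vanishes when c = 0
    y = X @ beta0 + (X @ delta0) * (Q < tau0) + U
    return X, Q, y, beta0, delta0

# 71 equispaced grid points for tau (assumed range [0.15, 0.85], step 0.01)
tau_grid = np.linspace(0.15, 0.85, 71)
X, Q, y, beta0, delta0 = simulate()
```

Setting c=0 in simulate() gives the no-threshold designs reported in the bottom panel of Table 3.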
Table 3

Simulation results with M = 50 (a)

Threshold parameter | Estimation method | Constant for λ | Prediction error, mean | Prediction error, median | Prediction error, s.d. | E[M(α̂)] | E|α̂ − α0|_1 | E|τ̂ − τ0|_1

Jump scale: c = 1
τ0 = 0.5 | Least squares | None    | 0.285 | 0.276 | 0.074  | 100.00 | 7.066  | 0.008
         | Lasso         | A = 2.8 | 0.041 | 0.030 | 0.035  | 12.94  | 0.466  | 0.010
         |               | A = 3.2 | 0.048 | 0.033 | 0.049  | 10.14  | 0.438  | 0.013
         |               | A = 3.6 | 0.067 | 0.037 | 0.086  | 8.44   | 0.457  | 0.024
         |               | A = 4.0 | 0.095 | 0.050 | 0.120  | 7.34   | 0.508  | 0.040
         | Oracle 1      | None    | 0.013 | 0.006 | 0.019  | 4.00   | 0.164  | 0.004
         | Oracle 2      | None    | 0.005 | 0.004 | 0.004  | 4.00   | 0.163  | 0.000
τ0 = 0.4 | Least squares | None    | 0.317 | 0.304 | 0.095  | 100.00 | 7.011  | 0.008
         | Lasso         | A = 2.8 | 0.052 | 0.034 | 0.063  | 13.15  | 0.509  | 0.016
         |               | A = 3.2 | 0.063 | 0.037 | 0.083  | 10.42  | 0.489  | 0.023
         |               | A = 3.6 | 0.090 | 0.045 | 0.121  | 8.70   | 0.535  | 0.042
         |               | A = 4.0 | 0.133 | 0.061 | 0.162  | 7.68   | 0.634  | 0.078
         | Oracle 1      | None    | 0.014 | 0.006 | 0.022  | 4.00   | 0.163  | 0.004
         | Oracle 2      | None    | 0.005 | 0.004 | 0.004  | 4.00   | 0.163  | 0.000
τ0 = 0.3 | Least squares | None    | 2.559 | 0.511 | 16.292 | 100.00 | 12.172 | 0.012
         | Lasso         | A = 2.8 | 0.062 | 0.035 | 0.091  | 13.45  | 0.602  | 0.030
         |               | A = 3.2 | 0.089 | 0.041 | 0.125  | 10.85  | 0.633  | 0.056
         |               | A = 3.6 | 0.127 | 0.054 | 0.159  | 9.33   | 0.743  | 0.099
         |               | A = 4.0 | 0.185 | 0.082 | 0.185  | 8.43   | 0.919  | 0.168
         | Oracle 1      | None    | 0.012 | 0.006 | 0.017  | 4.00   | 0.177  | 0.004
         | Oracle 2      | None    | 0.005 | 0.004 | 0.004  | 4.00   | 0.176  | 0.000

Jump scale: c = 0
(b)      | Least squares | None    | 6.332 | 0.460 | 41.301 | 100.00 | 20.936 | (b)
         | Lasso         | A = 2.8 | 0.013 | 0.011 | 0.007  | 9.30   | 0.266  |
         |               | A = 3.2 | 0.014 | 0.012 | 0.008  | 6.71   | 0.227  |
         |               | A = 3.6 | 0.015 | 0.014 | 0.009  | 4.95   | 0.211  |
         |               | A = 4.0 | 0.017 | 0.016 | 0.010  | 3.76   | 0.204  |
         | Oracle 1 and oracle 2 | None | 0.002 | 0.002 | 0.003 | 2.00 | 0.054 |

(a) M denotes the column size of the covariate vector and τ denotes the threshold parameter. Oracle 1 and oracle 2 are estimated by least squares when the set of non-zero coefficients is known and when both that set and τ0 are known, respectively. All simulations are based on 400 replications of a sample with 200 observations.

(b) Not applicable.

Figure 1

Mean prediction errors and mean M(α̂) (♦, τ=0.3; □, τ=0.4; ◯, τ=0.5; △, c=0): (a) M=100; (b) M=200; (c) M=400

Figure 2

Mean ℓ1-errors for α and τ (♦, τ=0.3; □, τ=0.4; ◯, τ=0.5; △, c=0): (a) M=100; (b) M=200; (c) M=400

The reported mean-squared prediction error PE for each sample is calculated numerically as follows. For each sample s, we have the estimated coefficients and the estimated threshold. Given these estimates, we generate new data of 400 observations and calculate the prediction error PE_s as the average, over the new observations, of the squared difference between the estimated and the true regression functions. The mean, median and standard deviation of the prediction error are then calculated from the 400 replications. We also report the mean of M(α̂), the number of selected covariates, and the ℓ1-errors for α and τ. Table 3 reports the simulation results for M = 50. For simulation designs with M > 50, the least squares estimator is not available, and we summarize the same statistics only for the lasso estimator in Figs 1 and 2. When M = 50, across all designs, the proposed lasso estimator performs better than the least squares estimator in terms of mean and median prediction errors, the mean of M(α̂) and the ℓ1-error for α. The advantage of the lasso estimator is even larger when there is no threshold effect, i.e. c = 0. This result confirms the robustness of the lasso estimator to whether or not there is a threshold effect. However, the least squares estimator performs better than the lasso estimator in terms of estimation of τ when c = 1, although the difference here is much smaller than the differences in prediction errors and the ℓ1-error for α. Figs 1 and 2 reconfirm the robustness of the lasso estimator when M = 100, 200, 400.
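The Monte Carlo prediction error computation described above can be sketched as follows. This is an illustrative reconstruction under the paper's design (standard normal covariates, uniform threshold variable); the function name `prediction_error` and its argument layout are ours.

```python
import numpy as np

def prediction_error(beta_hat, delta_hat, tau_hat,
                     beta0, delta0, tau0, n_new=400, seed=0):
    """Monte Carlo estimate of the mean-squared prediction error:
    the average squared gap between the fitted regression function
    x'beta_hat + x'delta_hat * 1{q < tau_hat} and the true one,
    evaluated on a fresh sample of n_new observations."""
    rng = np.random.default_rng(seed)
    M = len(beta0)
    X = rng.standard_normal((n_new, M))     # fresh covariates ~ N(0, I)
    q = rng.uniform(0.0, 1.0, n_new)        # fresh threshold variable
    f_hat = X @ beta_hat + (X @ delta_hat) * (q < tau_hat)
    f0 = X @ beta0 + (X @ delta0) * (q < tau0)
    return ((f_hat - f0) ** 2).mean()
```

Repeating this for each of the 400 simulated samples and then taking the mean, median and standard deviation across samples yields the summary statistics reported in Table 3.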
As predicted by the theory developed in previous sections, the prediction error and the ℓ1-errors for α and τ increase only slowly as M increases. The graphs also show that the results are quite uniform across the regularization parameter values, except for A = 4.0. In the on-line appendix F, we report additional simulation results that allow correlation between covariates. Specifically, the M-dimensional covariate vector is generated from a multivariate normal N(0, Σ) distribution, where the (i, j) element of the M × M covariance matrix Σ is governed by a correlation parameter ρ = 0.3. All other random variables are the same as above. We obtained results very similar to those for the previous cases: the lasso outperforms the least squares estimator, and the prediction error, the mean of M(α̂) and the ℓ1-errors increase very slowly as M increases. Further details are in the on-line appendix F, which also reports satisfactory simulation results on the frequencies of selecting the true parameters when ρ = 0 and when ρ = 0.3. In sum, the simulation results confirm the theoretical results developed earlier and show that the proposed lasso estimator is useful for the high dimensional threshold regression model.
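A correlated design of this kind can be simulated as follows. The AR(1)-type choice Σ_ij = ρ^|i−j| used here is a common convention and only an assumption on our part; the exact form used in the paper's appendix F may differ.

```python
import numpy as np

def correlated_design(n, M, rho, seed=0):
    """Draw n rows from N(0, Sigma) with Sigma[i, j] = rho**|i - j|,
    an AR(1)-type correlation structure (illustrative choice)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(M)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)           # Sigma = L L'
    Z = rng.standard_normal((n, M))         # i.i.d. N(0, 1) draws
    return Z @ L.T                          # rows now have covariance Sigma
```

With ρ = 0.3 as in the appendix, adjacent covariates have correlation 0.3 and the correlation decays geometrically with distance between indices.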

Conclusions

We have considered a high dimensional regression model with a possible change point due to a covariate threshold and have developed the lasso method for it. We have derived non-asymptotic oracle inequalities and have illustrated the usefulness of the proposed estimation method via simulations and a real data application. We conclude by suggesting some areas of future research. First, it would be interesting to extend other penalized estimators (e.g. the adaptive lasso of Zou (2006) and the smoothly clipped absolute deviation penalty of Fan and Li (2001)) to our set-up and to see whether they would improve the performance of our estimation method. Second, an extension to multiple change points is also an important research topic. There have been some advances in this direction, especially regarding key issues such as computational cost and the determination of the number of change points (see, for example, Harchaoui and Lévy-Leduc (2010) and Frick et al. (2014)). However, these methods are confined to the single-regressor case, and an extension to a large number of regressors would be highly interesting. Finally, it would also be interesting to investigate minimax lower bounds for the proposed estimator and its prediction risk, as Raskutti et al. (2011, 2012) did in high dimensional linear regression set-ups.