Sokbae Lee, Myung Hwan Seo, Youngki Shin.
Abstract
We consider a high dimensional regression model with a possible change point due to a covariate threshold and develop the lasso estimator of regression coefficients as well as the threshold parameter. Our lasso estimator not only selects covariates but also selects a model between linear and threshold regression models. Under a sparsity assumption, we derive non-asymptotic oracle inequalities for both the prediction risk and the ℓ1 estimation loss for regression coefficients. Since the lasso estimator selects variables simultaneously, we show that oracle inequalities can be established without pretesting the existence of the threshold effect. Furthermore, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a factor that is nearly n^{-1} even when the number of regressors can be much larger than the sample size n. We illustrate the usefulness of our proposed estimation method via Monte Carlo simulations and an application to real data.
Keywords: Lasso; Oracle inequalities; Sample splitting; Sparsity; Threshold models
Year: 2015 PMID: 27656104 PMCID: PMC5014306 DOI: 10.1111/rssb.12108
Source DB: PubMed Journal: J R Stat Soc Series B Stat Methodol ISSN: 1369-7412 Impact factor: 4.488
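The estimator described in the abstract can be sketched as a grid search over candidate thresholds with one lasso fit per candidate. The block below is a minimal, dependency-free illustration and not the authors' implementation. The indicator direction (Q < τ), the unweighted ℓ1 penalty, the coordinate-descent solver, all function names and the toy data are assumptions made for this sketch.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/n) * sum((y - X a)^2) + lam * sum(|a_j|) by coordinate descent."""
    n, p = len(X), len(X[0])
    a = [0.0] * p
    resid = list(y)  # residuals y - X a; a starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # all-zero column: its coefficient stays at 0
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * a[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - a[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                a[j] = new
    return a


def objective(X, y, a, lam):
    """Penalised least-squares criterion evaluated at coefficient vector a."""
    sse = sum((yi - sum(x * c for x, c in zip(row, a))) ** 2 for row, yi in zip(X, y))
    return sse / len(X) + lam * sum(abs(c) for c in a)


def threshold_lasso(X, q, y, lam, grid):
    """For each candidate tau, fit a lasso on [X, X * 1{q < tau}]; keep the best tau."""
    best = None
    for tau in grid:
        Z = [row + [x if qi < tau else 0.0 for x in row] for row, qi in zip(X, q)]
        a = lasso_cd(Z, y, lam)
        value = objective(Z, y, a, lam)
        if best is None or value < best[0]:
            best = (value, tau, a)
    return best[1], best[2]


def simulate(n=200, tau0=0.5, seed=0):
    """Toy data (assumed design): the slope on x1 jumps by 2 when q < tau0."""
    rng = random.Random(seed)
    X, q, y = [], [], []
    for _ in range(n):
        x1, x2, qi = rng.gauss(0, 1), rng.gauss(0, 1), rng.random()
        jump = 2.0 * x1 if qi < tau0 else 0.0
        X.append([1.0, x1, x2])
        q.append(qi)
        y.append(1.0 * x1 + jump + rng.gauss(0, 0.1))
    return X, q, y


if __name__ == "__main__":
    X, q, y = simulate()
    grid = [k / 20 for k in range(4, 17)]  # candidate thresholds 0.20, 0.25, ..., 0.80
    tau_hat, alpha_hat = threshold_lasso(X, q, y, lam=0.05, grid=grid)
    print("estimated threshold:", tau_hat)
    print("coefficients (baseline, then change):", [round(c, 3) for c in alpha_hat])
```

In this sketch the second half of the coefficient vector holds the change in each coefficient across the threshold. If the lasso sets all of those entries to zero, the fitted model is a linear regression, which is how a single fit can choose between the linear and threshold specifications without a pretest.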
List of variables
| Variable | Description |
|---|---|
| gr | Annualized GDP growth rate in the period 1960–1985 |
| gdp60 | Real GDP |
| lr | Adult literacy rate in 1960 |
| lgdp60 | Log‐GDP |
| lr | Adult literacy rate in 1960 (only included when |
| ls | log(investment/output) annualized over 1960–1985; a proxy for log(physical savings rate) |
| lgr | log(population growth rate) annualized over 1960–1985 |
| pyrm60 | log(average years of primary schooling) in the male population in 1960 |
| pyrf60 | log(average years of primary schooling) in the female population in 1960 |
| syrm60 | log(average years of secondary schooling) in the male population in 1960 |
| syrf60 | log(average years of secondary schooling) in the female population in 1960 |
| hyrm60 | log(average years of higher schooling) in the male population in 1960 |
| hyrf60 | log(average years of higher schooling) in the female population in 1960 |
| nom60 | Percentage of no schooling in the male population in 1960 |
| nof60 | Percentage of no schooling in the female population in 1960 |
| prim60 | Percentage of primary schooling attained in the male population in 1960 |
| prif60 | Percentage of primary schooling attained in the female population in 1960 |
| pricm60 | Percentage of primary schooling complete in the male population in 1960 |
| pricf60 | Percentage of primary schooling complete in the female population in 1960 |
| secm60 | Percentage of secondary schooling attained in the male population in 1960 |
| secf60 | Percentage of secondary schooling attained in the female population in 1960 |
| seccm60 | Percentage of secondary schooling complete in the male population in 1960 |
| seccf60 | Percentage of secondary schooling complete in the female population in 1960 |
| llife | log(life expectancy at age 0) averaged over 1960–1985 |
| lfert | log(fertility rate) averaged over 1960–1985 |
| edu/gdp | Government expenditure on education per GDP averaged over 1960–1985 |
| gcon/gdp | Government consumption expenditure net of defence and education per GDP averaged over 1960–1985 |
| revol | Number of revolutions per year over 1960–1984 |
| revcoup | Number of revolutions and coups per year over 1960–1984 |
| wardum | Dummy for countries that participated in at least one external war over 1960–1984 |
| wartime | Fraction of time over 1960–1985 involved in external war |
| lbmp | log(1 + black market premium averaged over 1960–1985) |
| tot | Terms‐of‐trade shock |
| lgdp60 × ‘educ’ | Product of two covariates (interaction of lgdp60 and education variables from pyrm60 to seccf60); total 16 variables |
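The last row of the table describes 16 interaction covariates formed by multiplying lgdp60 with each education variable from pyrm60 to seccf60. A minimal sketch of that construction follows. The variable names come from the table; the inclusion of a constant, the omission of lr, the function name and the placeholder numbers are assumptions for illustration. Under those assumptions the list has 46 entries, which agrees with the number of covariates reported for the linear model in the results table.

```python
EDUCATION = [
    "pyrm60", "pyrf60", "syrm60", "syrf60", "hyrm60", "hyrf60",
    "nom60", "nof60", "prim60", "prif60", "pricm60", "pricf60",
    "secm60", "secf60", "seccm60", "seccf60",
]
OTHER = [
    "lgdp60", "ls", "lgr", "llife", "lfert", "edu/gdp", "gcon/gdp",
    "revol", "revcoup", "wardum", "wartime", "lbmp", "tot",
]


def build_covariates(row):
    """Return (names, values) for one country from a dict of raw variables."""
    names = ["const"] + OTHER + EDUCATION
    values = [1.0] + [row[name] for name in OTHER + EDUCATION]
    # Interactions of lgdp60 with each of the 16 education variables.
    for edu in EDUCATION:
        names.append("lgdp60 x " + edu)
        values.append(row["lgdp60"] * row[edu])
    return names, values


if __name__ == "__main__":
    # Placeholder numbers only; these are not values from the data set.
    fake_row = {name: 0.5 for name in OTHER + EDUCATION}
    fake_row["lgdp60"] = 7.0
    names, values = build_covariates(fake_row)
    print(len(names))  # 1 constant + 13 + 16 + 16 interactions = 46
```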
Model selection and estimation results with Q = gdp60

| Variable | Linear model | Threshold model: coefficient | Threshold model: change in coefficient |
|---|---|---|---|
| Constant | −0.0923 | −0.0811 | — |
| lgdp60 | −0.0153 | −0.0120 | — |
| ls | 0.0033 | 0.0038 | — |
| lgr | 0.0018 | — | — |
| pyrf60 | 0.0027 | — | — |
| syrm60 | 0.0157 | — | — |
| hyrm60 | 0.0122 | 0.0130 | — |
| hyrf60 | −0.0389 | — | −0.0807 |
| nom60 | — | — | 2.64 × 10 |
| prim60 | −0.0004 | −0.0001 | — |
| pricm60 | 0.0006 | −1.73 × 10 | |
| pricf60 | −0.0006 | — | — |
| secf60 | 0.0005 | — | — |
| seccm60 | 0.0010 | — | 0.0014 |
| llife | 0.0697 | 0.0523 | — |
| lfert | −0.0136 | −0.0047 | — |
| edu/gdp | −0.0189 | — | — |
| gcon/gdp | −0.0671 | −0.0542 | — |
| revol | −0.0588 | — | — |
| revcoup | 0.0433 | — | — |
| wardum | −0.0043 | — | −0.0022 |
| wartime | −0.0019 | −0.0143 | −0.0023 |
| lbmp | −0.0185 | −0.0174 | −0.0015 |
| tot | 0.0971 | — | 0.0974 |
| lgdp60 × pyrf60 | — | | — |
| lgdp60 × syrm60 | — | — | 0.0002 |
| lgdp60 × hyrm60 | — | — | 0.0050 |
| lgdp60 × hyrf60 | — | −0.0003 | — |
| lgdp60 × nom60 | — | — | |
| lgdp60 × prim60 | | — | — |
| lgdp60 × prif60 | | — | |
| lgdp60 × pricf60 | | — | — |
| lgdp60 × secm60 | −0.0001 | — | — |
| lgdp60 × seccf60 | −0.0002 | | — |
| Regularization parameter λ | 0.0004 | 0.0034 | |
| Number of selected covariates | 28 | 26 | |
| Number of covariates | 46 | 92 | |
| Number of observations | 80 | 80 | |

The regularization parameter λ is chosen by the ‘leave‐one‐out’ cross‐validation method. ‘Number of selected covariates’ is the number of covariates selected by the lasso estimator, and a dash indicates that the regressor is not selected. For the threshold model, the first column gives the coefficient in one regime of the threshold variable Q and the second column gives the change in the coefficient value in the other regime.
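The note above says that λ is chosen by leave-one-out cross-validation. The procedure can be sketched as follows. To keep the block short it uses a one-covariate lasso with a closed-form solution instead of the full estimator; that simplification, the λ grid, the function names and the toy data are all assumptions for illustration.

```python
import random


def fit_lasso_1d(x, y, lam):
    """Closed-form lasso slope for one covariate and no intercept:
    minimises (1/n) * sum((y - a*x)^2) + lam * |a|."""
    n = len(x)
    sxx = sum(v * v for v in x) / n
    sxy = sum(u * v for u, v in zip(x, y)) / n
    if sxx == 0.0:
        return 0.0
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # soft-thresholding at lam / 2
    return (shrunk if sxy > 0 else -shrunk) / sxx


def loo_cv_error(x, y, lam, fit=fit_lasso_1d):
    """Average squared prediction error when each observation is held out in turn."""
    n = len(x)
    total = 0.0
    for i in range(n):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        a = fit(x_train, y_train, lam)
        total += (y[i] - a * x[i]) ** 2
    return total / n


def choose_lambda(x, y, lam_grid):
    """Return the grid value with the smallest leave-one-out error, plus all errors."""
    errors = {lam: loo_cv_error(x, y, lam) for lam in lam_grid}
    best = min(lam_grid, key=lambda lam: errors[lam])
    return best, errors


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(60)]
    y = [2.0 * xi + rng.gauss(0, 0.5) for xi in x]
    best, errors = choose_lambda(x, y, [0.0, 0.1, 1.0, 10.0])
    print("chosen lambda:", best)
```

With n observations the model is refitted n times per candidate λ, so the cost grows with both the sample size and the grid length; with 80 observations, as in the table, that remains cheap.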
Simulation results with M = 50

| Threshold parameter τ | Estimation method | Constant for λ | Prediction error: mean | Prediction error: median | Prediction error: SD | Mean number of selected covariates | Mean ℓ1‐error for α | Mean ℓ1‐error for τ |
|---|---|---|---|---|---|---|---|---|
| | Least squares | None | 0.285 | 0.276 | 0.074 | 100.00 | 7.066 | 0.008 |
| | Lasso | | 0.041 | 0.030 | 0.035 | 12.94 | 0.466 | 0.010 |
| | Lasso | | 0.048 | 0.033 | 0.049 | 10.14 | 0.438 | 0.013 |
| | Lasso | | 0.067 | 0.037 | 0.086 | 8.44 | 0.457 | 0.024 |
| | Lasso | | 0.095 | 0.050 | 0.120 | 7.34 | 0.508 | 0.040 |
| | Oracle 1 | None | 0.013 | 0.006 | 0.019 | 4.00 | 0.164 | 0.004 |
| | Oracle 2 | None | 0.005 | 0.004 | 0.004 | 4.00 | 0.163 | 0.000 |
| | Least squares | None | 0.317 | 0.304 | 0.095 | 100.00 | 7.011 | 0.008 |
| | Lasso | | 0.052 | 0.034 | 0.063 | 13.15 | 0.509 | 0.016 |
| | Lasso | | 0.063 | 0.037 | 0.083 | 10.42 | 0.489 | 0.023 |
| | Lasso | | 0.090 | 0.045 | 0.121 | 8.70 | 0.535 | 0.042 |
| | Lasso | | 0.133 | 0.061 | 0.162 | 7.68 | 0.634 | 0.078 |
| | Oracle 1 | None | 0.014 | 0.006 | 0.022 | 4.00 | 0.163 | 0.004 |
| | Oracle 2 | None | 0.005 | 0.004 | 0.004 | 4.00 | 0.163 | 0.000 |
| | Least squares | None | 2.559 | 0.511 | 16.292 | 100.00 | 12.172 | 0.012 |
| | Lasso | | 0.062 | 0.035 | 0.091 | 13.45 | 0.602 | 0.030 |
| | Lasso | | 0.089 | 0.041 | 0.125 | 10.85 | 0.633 | 0.056 |
| | Lasso | | 0.127 | 0.054 | 0.159 | 9.33 | 0.743 | 0.099 |
| | Lasso | | 0.185 | 0.082 | 0.185 | 8.43 | 0.919 | 0.168 |
| | Oracle 1 | None | 0.012 | 0.006 | 0.017 | 4.00 | 0.177 | 0.004 |
| | Oracle 2 | None | 0.005 | 0.004 | 0.004 | 4.00 | 0.176 | 0.000 |
| — | Least squares | None | 6.332 | 0.460 | 41.301 | 100.00 | 20.936 | — |
| — | Lasso | | 0.013 | 0.011 | 0.007 | 9.30 | 0.266 | — |
| — | Lasso | | 0.014 | 0.012 | 0.008 | 6.71 | 0.227 | — |
| — | Lasso | | 0.015 | 0.014 | 0.009 | 4.95 | 0.211 | — |
| — | Lasso | | 0.017 | 0.016 | 0.010 | 3.76 | 0.204 | — |
| — | Oracle 1 and oracle 2 | None | 0.002 | 0.002 | 0.003 | 2.00 | 0.054 | — |

M denotes the number of columns of the covariate matrix and τ denotes the threshold parameter. Oracle 1 and oracle 2 are estimated by least squares when the sparsity pattern is known and when both the sparsity pattern and the threshold parameter are known, respectively. All simulations are based on 400 replications of a sample with 200 observations. A dash in the threshold columns means not applicable.
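Summary statistics of the kind reported in the simulation table can be computed from per-replication results along the following lines. This is only a sketch: it assumes that the six numeric columns are the mean, median and standard deviation of the prediction error, the mean number of nonzero coefficients, and the mean ℓ1-errors for α and τ. That reading of the columns, the use of the sample standard deviation, the dictionary layout and the function names are all assumptions.

```python
import statistics


def count_selected(alpha_hat, tol=0.0):
    """Number of coefficients whose absolute value exceeds tol."""
    return sum(1 for a in alpha_hat if abs(a) > tol)


def l1_error(estimate, truth):
    """Sum of absolute deviations between an estimate and the true vector."""
    return sum(abs(e - t) for e, t in zip(estimate, truth))


def summarise(replications, alpha0, tau0):
    """replications: list of dicts with keys 'pe', 'alpha_hat' and 'tau_hat'."""
    pe = [r["pe"] for r in replications]
    return {
        "pe_mean": statistics.mean(pe),
        "pe_median": statistics.median(pe),
        "pe_sd": statistics.stdev(pe),  # sample standard deviation (assumed)
        "mean_selected": statistics.mean(
            count_selected(r["alpha_hat"]) for r in replications
        ),
        "mean_l1_alpha": statistics.mean(
            l1_error(r["alpha_hat"], alpha0) for r in replications
        ),
        "mean_l1_tau": statistics.mean(
            abs(r["tau_hat"] - tau0) for r in replications
        ),
    }


if __name__ == "__main__":
    # Three made-up replications, purely to show the call pattern.
    reps = [
        {"pe": 1.0, "alpha_hat": [1.0, 0.0, 0.5], "tau_hat": 0.4},
        {"pe": 2.0, "alpha_hat": [1.0, 0.0, 0.0], "tau_hat": 0.5},
        {"pe": 6.0, "alpha_hat": [2.0, 1.0, 0.0], "tau_hat": 0.75},
    ]
    print(summarise(reps, alpha0=[1.0, 0.0, 0.0], tau0=0.5))
```

Reporting the median next to the mean is useful when a few replications produce very large errors, as the least squares rows of the table suggest.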
Figure 1. Mean prediction errors and mean number of selected covariates (♦, τ = 0.3; □, τ = 0.4; ◯, τ = 0.5; △, c = 0): (a) M = 100; (b) M = 200; (c) M = 400
Figure 2. Mean ℓ1‐errors for α and τ (♦, τ = 0.3; □, τ = 0.4; ◯, τ = 0.5; △, c = 0): (a) M = 100; (b) M = 200; (c) M = 400