Abstract
Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold in the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.
Year: 2010 PMID: 20107611 PMCID: PMC2810828 DOI: 10.1111/j.1467-9868.2009.00723.x
Source DB: PubMed Journal: J R Stat Soc Series B Stat Methodol ISSN: 1369-7412 Impact factor: 4.488
Variable selection performances of SPLS–NIPALS versus SPLS–SIMPLS algorithms
| SPLS–NIPALS | 9.75 / 12 / 13 | 0 / 0 / 2 |
| SPLS–SIMPLS | 7 / 9 / 13 | 0 / 2 / 5 |
First quartile/median/third quartile.
p case.
Mean-squared prediction error for simulations I and II†
| 40/400/10/0.1 | 31417.9 | 15717.1 | 31444.4 | 208.3 | 199.8 | 201.4 | 198.6 | 200.1 |
| | (552.5) | (224.2) | (554.0) | (10.4) | (9.0) | (11.2) | (9.5) | (10.0) |
| 40/400/10/0.2 | 31872.0 | 16186.5 | 31956.9 | 697.3 | 661.4 | 658.7 | 658.8 | 685.5 |
| | (544.4) | (231.4) | (548.9) | (15.7) | (13.9) | (15.7) | (14.2) | (17.7) |
| 40/400/30/0.1 | 31409.1 | 20914.2 | 31431.7 | 205.0 | 203.3 | 205.5 | 202.7 | 203.1 |
| | (552.5) | (1324.4) | (554.2) | (9.5) | (10.1) | (11.1) | (9.4) | (9.7) |
| 40/400/30/0.2 | 31863.7 | 21336.0 | 31939.3 | 678.6 | 661.2 | 663.5 | 663.5 | 684.9 |
| | (544.1) | (1307.6) | (549.1) | (13.6) | (14.4) | (15.6) | (14.4) | (19.3) |
| 80/40/20/0.1 | 29121.4 | 15678.0 | 485.2 | 538.4 | 494.6 | 720.0 | 533.9 | |
| | (1583.2) | (652.9) | (48.4) | (70.5) | (63.0) | (240.0) | (75.3) | |
| 80/40/20/0.2 | 30766.9 | 16386.5 | 1099.2 | 1019.5 | 965.5 | 2015.8 | 1050.7 | |
| | (1386.0) | (636.8) | (86.0) | (74.6) | (74.7) | (523.6) | (84.5) | |
| 80/40/40/0.1 | 29116.2 | 17416.1 | 502.4 | 506.9 | 497.7 | 522.7 | 545.3 | |
| | (1591.7) | (924.2) | (54.0) | (66.9) | (62.8) | (69.4) | (77.1) | |
| 80/40/40/0.2 | 29732.4 | 17940.8 | 1007.2 | 1013.3 | 964.4 | 1080.6 | 1018.7 | |
| | (1605.8) | (932.2) | (82.9) | (78.7) | (74.6) | (165.6) | (74.9) | |
p, the number of covariates; n, the sample size; q, the number of spurious variables; ns, noise-to-signal ratio; SPLS1, SPLS tuned by FDR control (FDR = 0.1); SPLS2, SPLS tuned by CV; SE, standard error.
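Each entry in the table above pairs a mean-squared prediction error with a standard error in parentheses. The following is a minimal sketch of how such a pair could be computed from replicated simulation runs. It assumes the standard error is the sample standard deviation of the per-replicate errors divided by the square root of the number of replicates; the source does not state the exact computation, and the function names `mspe` and `mean_and_se` are hypothetical.

```python
import math
import statistics


def mspe(y_true, y_pred):
    """Mean-squared prediction error for one test set."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def mean_and_se(values):
    """Mean of per-replicate errors and its standard error.

    Assumption: SE = sample standard deviation / sqrt(number of replicates).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates for a standard error")
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se


if __name__ == "__main__":
    # Toy per-replicate errors (illustrative numbers, not taken from the paper).
    replicate_errors = [
        mspe([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]),
        mspe([0.5, 1.5, 2.5], [0.4, 1.7, 2.4]),
        mspe([2.0, 2.0, 2.0], [2.1, 1.8, 2.3]),
    ]
    mean, se = mean_and_se(replicate_errors)
    print(f"{mean:.4f} ({se:.4f})")
```

Reporting the value as `mean (SE)` mirrors the layout of the table rows.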
Model accuracy for simulations I and II†
| 40/400/10/0.1 | 0.76 | 1.00 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| 40/400/10/0.2 | 0.67 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 | 0.97 |
| 40/400/30/0.1 | 1.00 | 0.98 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| 40/400/30/0.2 | 0.96 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| 80/40/20/0.1 | 0.15 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.97 | 0.93 | 0.72 | 0.99 |
| 80/40/20/0.2 | 0.12 | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 | 0.86 | 0.83 | 0.80 | 0.98 |
| 80/40/40/0.1 | 0.21 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 0.93 | 0.72 | 0.99 |
| 80/40/40/0.2 | 0.15 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.97 | 0.90 | 0.80 | 0.98 |
p, the number of covariates; n, the sample size; q, the number of spurious variables; ns, noise-to-signal ratio; SPLS1, SPLS tuned by FDR control (FDR = 0.1); SPLS2, SPLS tuned by CV.
Mean-squared prediction errors†
| PCR1 | 320.67 (8.07) | 308.93 (7.13) | 241.75 (5.62) | 2730.53 (75.82) |
| PLS1 | 301.25 (7.32) | 292.70 (7.69) | 209.19 (4.58) | 1748.53 (47.47) |
| Ridge regression | 304.80 (7.47) | 296.36 (7.81) | 211.59 (4.70) | 1723.58 (46.41) |
| Supervised PC | 252.01 (9.71) | 248.26 (7.68) | 134.90 (3.34) | 263.46 (14.98) |
| SPLS1(FDR) | 256.22 (13.82) | 246.28 (7.87) | 139.01 (3.74) | 290.78 (13.29) |
| SPLS1(CV) | 257.40 (9.66) | 261.14 (8.11) | 120.27 (3.42) | 195.63 (7.59) |
| Mixed variance–covariance | 301.05 (7.31) | 292.46 (7.67) | 209.45 (4.58) | 1748.65 (47.58) |
| Gene shaving | 255.60 (9.28) | 292.46 (7.67) | 119.39 (3.31) | 203.46 (7.95) |
| True | 224.13 (5.12) | 218.04 (6.80) | 96.90 (3.02) | 99.12 (2.50) |
PCR1, PCR with one component; PLS1, PLS with one component; SPLS1(FDR), SPLS with one component tuned by FDR control (FDR = 0.4); SPLS1(CV), SPLS with one component tuned by CV; True, true model.
Comparison of the number of selected TFs†
| Method | Number of TFs selected (s) | Number of confirmed TFs (k) | Prob(K≥k) |
| Multivariate SPLS | 32 | 10 | 0.034 |
| Univariate SPLS | 70 | 17 | 0.058 |
| Lasso | 100 | 21 | 0.256 |
| Total | 106 | 21 | |
Prob(K≥k) denotes the probability of observing at least k confirmed variables out of 85 unconfirmed and 21 confirmed variables in a random draw of s variables.
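The probability described in this footnote is a hypergeometric upper tail. As a minimal sketch, it can be evaluated with exact integer arithmetic as below. The function name `prob_at_least` is hypothetical, and reading the three table columns as s, k and Prob(K≥k) is an assumption based on the footnote.

```python
from math import comb


def prob_at_least(k, s, confirmed=21, unconfirmed=85):
    """P(K >= k) when s variables are drawn at random, without replacement,
    from a pool of `confirmed` + `unconfirmed` variables, where K counts
    the confirmed variables in the draw (hypergeometric upper tail)."""
    total = confirmed + unconfirmed
    if not 0 <= s <= total:
        raise ValueError("s must lie between 0 and the pool size")
    upper = min(s, confirmed)  # K cannot exceed either bound
    favourable = sum(
        comb(confirmed, j) * comb(unconfirmed, s - j)
        for j in range(max(k, 0), upper + 1)
    )
    return favourable / comb(total, s)


if __name__ == "__main__":
    # (s, k) pairs as read from the table rows (assumed column meaning).
    for label, s, k in [
        ("Multivariate SPLS", 32, 10),
        ("Univariate SPLS", 70, 17),
        ("Lasso", 100, 21),
    ]:
        print(label, round(prob_at_least(k, s), 3))
```

For the lasso row the tail reduces to a single term, C(85, 79) / C(106, 100), which is approximately 0.256 and agrees with the tabulated value.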
Fig. 1 Estimated TF activities for the 21 confirmed TFs, as estimated by the multivariate SPLS regression and by univariate SPLS (plots for ABF-1, CBF-1, GCR2 and SKN7 are not displayed since the TF activities of these factors were zero by both the univariate and the multivariate SPLS; the y-axis denotes estimated coefficients and the x-axis is time; multivariate SPLS regression yields smoother estimates and exhibits periodicity)
Fig. 2 Estimated TF activities selected only by the multivariate SPLS regression; the magnitudes of the estimated TF activities are small but consistent across the time points