Xue Yu1, Yifan Sun2, Hai-Jun Zhou3,4,5. 1. Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, 100872, China. 2. Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, 100872, China. sunyifan@ruc.edu.cn. 3. CAS Key Laboratory for Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, 100190, China. zhouhj@itp.ac.cn. 4. School of Physical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China. 5. MinJiang Collaborative Center for Theoretical Physics, MinJiang University, Fuzhou, 350108, China.
Abstract
The high-dimensional linear regression model is the most popular statistical model for high-dimensional data, but obtaining a sparse set of regression coefficients remains a challenging task. In this paper, we propose a simple heuristic algorithm to construct sparse high-dimensional linear regression models, which is adapted from the shortest-solution guided decimation algorithm and is referred to as ASSD. This algorithm constructs the support of the regression coefficients under the guidance of the shortest least-squares solution of the recursively decimated linear models, and it applies an early-stopping criterion and a second-stage thresholding procedure to refine this support. Our extensive numerical results demonstrate that ASSD outperforms LASSO, adaptive LASSO, vector approximate message passing, and two other representative greedy algorithms in solution accuracy and robustness. ASSD is especially suitable for linear regression problems with the highly correlated measurement matrices encountered in real-world applications.
Detecting the relationship between a response and a set of predictors is a common problem encountered in many branches of scientific research. In statistics, this problem is referred to as regression analysis. A major focus of regression analysis has been on linear regression models, which search for a linear relationship between the responses and the predictors. Consider the linear regression model of the following form:

$$\mathbf{y} = X \boldsymbol{\beta}^{*} + \boldsymbol{\epsilon}, \tag{1}$$

where $\mathbf{y} \in \mathbb{R}^{n}$ is the response vector, $X = (\mathbf{x}_1, \ldots, \mathbf{x}_p)$ is an $n \times p$ measurement matrix with $\mathbf{x}_i$ being the i-th column, $\boldsymbol{\beta}^{*} \in \mathbb{R}^{p}$ is the vector of p true regression coefficients, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^{\top}$ are random errors with mean zero. The variance of each $\epsilon_i$ is $\sigma^2$, with $\sigma$ being the noise level (the typical magnitude of the noise). Let $s$ be the number of nonzero entries in $\boldsymbol{\beta}^{*}$. We focus on the case where $p > n$ and $s \ll p$, and the goal is to construct a sparse vector $\hat{\boldsymbol{\beta}}$ which serves as the best approximation to the hidden truth vector $\boldsymbol{\beta}^{*}$, given $\mathbf{y}$ (the measurement results) and $X$ (the measurement matrix) but with $\boldsymbol{\epsilon}$ (the noise vector) unknown.

Such linear regression models are widely adopted in many practical applications because of their simplicity and interpretability. With the advancement of measurement technologies, high-dimensional data are nowadays accumulating at a fast pace in a variety of fields such as genomics, neuroscience, systems biology, economics, and social science. In these high-dimensional data, the number p of predictors is often larger than the number n of samples or measurements ($p > n$), making the solution of the linear regression problem far from unique. Additional criteria need to be imposed to reduce the degeneracy of solutions and to select the most appropriate linear regression model. One of the most important criteria is sparsity. Motivated by empirical findings in genomics and other fields, we usually assume that high-dimensional regression models are sparse, in the sense that only a relatively small number of predictors are important for explaining the observed data[1]. Associated with this sparsity criterion are two highly nontrivial issues in high-dimensional linear regression: (1) variable selection, namely specifying the most relevant predictors; and (2) parameter or coefficient estimation, namely determining the individual contributions of the chosen predictors. Sparse high-dimensional linear regression has also been studied from the angle of compressed sensing[2].

In principle, the regression coefficients can be specified by searching for the solution with the least number of nonzero elements, but this non-convex minimization problem is intractable in practice. Over the years a variety of approaches have been proposed to approximate the optimal solution. The existing approaches can roughly be divided into three categories: relaxation methods, physics-inspired message-passing methods, and greedy methods. The basic idea of the relaxation methods is to replace the intractable $\ell_0$-norm penalty with a tractable approximation. Among them the Least Absolute Shrinkage and Selection Operator (LASSO)[3,4], which uses the $\ell_1$-norm penalty, is the most popular one. LASSO is a convex optimization problem, which can be solved by methods such as LARS[5-7], coordinate descent[8] and proximal gradient descent[9]. However, due to the over-shrinking of large coefficients, LASSO is known to produce biased estimates.
To remedy this problem, alternative methods have been proposed, including multi-stage methods such as adaptive LASSO[10] and the three-stage method[11], and non-convex penalties such as the smoothly clipped absolute deviation (SCAD) penalty[12] and the minimax concave penalty (MCP)[13].

An alternative strategy comes from the approximate message-passing (AMP) methods, which are closely related to the Thouless-Anderson-Palmer equation in statistical physics and are capable of dealing with high-dimensional inference problems. They have shown remarkable success in sparse regression and compressed sensing[14-17]. However, convergence issues limit the practical application of the AMP methods, especially on problems with highly correlated predictors. Recently, several algorithms such as Generalized AMP (GAMP)[18], SwAMP[19], adaptive damping[20], mean removal[20] and direct free-energy minimization[21] were proposed to fix this problem. In particular, the orthogonal or vector AMP (VAMP) algorithm[22,23] offers a robust alternative to conventional AMP.

Another line of research focuses on greedy methods for $\ell_0$-norm minimization, such as orthogonal least squares (OLS)[24] and orthogonal matching pursuit (OMP)[25]. The main idea is to select, at each iteration step, the single variable vector that has the largest magnitude of (rescaled) inner product with the current residual response vector. A sure-independence-screening (SIS) method based on correlation learning was proposed to improve variable selection[26], and an iterative version of this approach (ISIS) can further enhance its performance[27]. Several more recently developed greedy methods select several variables at a time, including the iterative hard thresholding (IHT) algorithm[28,29], the primal-dual active set (PDAS) methods[30], and the adaptive support detection and root finding (ASDAR) approach[31].

Most of the above-mentioned approximate methods assume that the measurement matrix satisfies some regularity conditions, such as the irrepresentable condition and the sparse Riesz condition, for mathematical convenience or good algorithmic performance. Roughly speaking, these conditions require the predictors to be fully uncorrelated or only weakly correlated. Such strict conditions are often not met in real-world applications. It is therefore desirable to develop an efficient and robust method applicable to more general correlation structures of the predictors. Recently, the shortest-solution guided decimation (SSD) algorithm[32] was proposed as a greedy method for solving high-dimensional linear regression. Similar to OLS and OMP, at each iteration step SSD selects a single variable as a candidate predictor. The difference is that this selection is based on the dense least-squares (i.e., shortest Euclidean length) solution of the decimated linear equations. Initial simulation results demonstrated that SSD significantly outperforms several of the most popular algorithms ($\ell_1$-penalty methods, OLS, OMP, and AMP) when the measurement matrices are highly correlated.

Although the SSD algorithm is highly competitive with other heuristic algorithms for both uncorrelated and correlated measurement matrices, a crucial assumption in its original implementation is that there is no measurement noise ($\sigma = 0$). As we will demonstrate later, when the measurement noise is no longer negligible, the naive noise-free SSD algorithm fails to extract the sparse solution of linear regression.
To overcome this difficulty, here we extend the SSD algorithm and propose the adaptive SSD (ASSD) algorithm to estimate sparse high-dimensional regression models. Compared with the original SSD, the new ASSD algorithm adopts a much more relaxed termination condition to allow early stopping. More significantly, we add a second-stage screening to single out the truly important predictors after the first-stage estimation is completed.

We test the performance of ASSD both on synthetic data (predictors and responses both simulated) and on semi-synthetic data (real predictors but simulated responses, using gene expression data from cancer samples). In comparison with the representative algorithms LASSO, adaptive LASSO (ALASSO), VAMP, and two greedy methods (ASDAR and SIS+LASSO), our extensive simulation results demonstrate that ASSD outperforms all these competing algorithms in the accuracy and robustness of variable selection and coefficient estimation. ASSD appears especially suitable for linear regression problems with the highly correlated measurement matrices encountered in real-world applications. On the other hand, ASSD is generally slower than the other algorithms, which points to a direction for further improvement. It would also be interesting to analyze the SSD and ASSD algorithms theoretically.
Methods
The shortest solution as a guidance vector
We first briefly summarize the key ideas behind the SSD algorithm[32]. Consider the singular value decomposition (SVD) of the measurement matrix $X$: $X = U \Lambda V^{\top}$, where $U = (\mathbf{u}_1, \ldots, \mathbf{u}_n)$ is an $n \times n$ orthogonal matrix, $V = (\mathbf{v}_1, \ldots, \mathbf{v}_p)$ is a $p \times p$ orthogonal matrix, and $\Lambda$ is an $n \times p$ diagonal matrix of the singular values $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$. Here $\{\mathbf{u}_i\}$ and $\{\mathbf{v}_j\}$ form complete sets of orthonormal basis vectors for the n- and p-dimensional real spaces respectively, so the vectors satisfy $\mathbf{u}_i^{\top} \mathbf{u}_j = \delta_{ij}$ and $\mathbf{v}_i^{\top} \mathbf{v}_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker symbol: $\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \neq j$. We can express the true coefficient vector $\boldsymbol{\beta}^{*}$ as a linear combination of the basis vectors $\mathbf{v}_j$:

$$\boldsymbol{\beta}^{*} = \sum_{j=1}^{p} c_j \mathbf{v}_j. \tag{2}$$

Substituting this expression into the regression function of model (Eq. 1), we obtain

$$\mathbf{y} = \sum_{i=1}^{n} \lambda_i c_i \mathbf{u}_i + \boldsymbol{\epsilon}, \tag{3}$$

with the parameter $c_i$ for $i \leq n$ being

$$c_i = \Theta(\lambda_i)\, \frac{\mathbf{u}_i^{\top} (\mathbf{y} - \boldsymbol{\epsilon})}{\lambda_i}. \tag{4}$$

Here $\Theta(x)$ is the Heaviside function: $\Theta(x) = 1$ for $x > 0$ and $\Theta(x) = 0$ for $x \leq 0$. We define a vector $\boldsymbol{\beta}^{0}$ as

$$\boldsymbol{\beta}^{0} \equiv \sum_{i=1}^{n} \Theta(\lambda_i)\, \frac{\mathbf{u}_i^{\top} (\mathbf{y} - \boldsymbol{\epsilon})}{\lambda_i}\, \mathbf{v}_i. \tag{5}$$

Then

$$\mathbf{y} = X \boldsymbol{\beta}^{0} + \boldsymbol{\epsilon}. \tag{6}$$

We call $\boldsymbol{\beta}^{0}$ the guidance vector[32]. This vector is dense and is not the true coefficient vector $\boldsymbol{\beta}^{*}$ we are seeking. Interestingly, however, this dense vector does provide information about the locations of the nonzero elements of $\boldsymbol{\beta}^{*}$ (see Fig. 1 and also the earlier empirical observations[32]). To understand this important property of $\boldsymbol{\beta}^{0}$, we first rewrite $V$ and $\Lambda$ as partitioned matrices: $V = (V_1, V_2)$ with $V_1 \in \mathbb{R}^{p \times n}$ and $V_2 \in \mathbb{R}^{p \times (p-n)}$, and $\Lambda = (\Lambda_1, 0)$ with $\Lambda_1 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Then we have

$$\boldsymbol{\beta}^{0} = V_1 V_1^{\top} \boldsymbol{\beta}^{*}. \tag{7}$$

We define $H \equiv V_1 V_1^{\top}$, which is a symmetric $p \times p$ matrix. According to Eq. (7), each element of $\boldsymbol{\beta}^{0}$ is

$$\beta_i^{0} = H_{ii}\, \beta_i^{*} + \sum_{j \neq i} H_{ij}\, \beta_j^{*}, \tag{8}$$

where $H_{ij} = \sum_{k=1}^{n} V_{ik} V_{jk}$. Since the entries of an orthonormal basis vector have typical magnitude $1/\sqrt{p}$, we may expect that $V_{ik}^2 \approx 1/p$, and thus $H_{ii} \approx n/p = \alpha$ (here $\alpha \equiv n/p$ is the compression ratio). Expecting that $V_{ik}$ and $V_{jk}$ ($j \neq i$) are almost independent of each other, we get that $H_{ij} \approx \pm \sqrt{\alpha/p}$, where $\pm$ means that $H_{ij}$ is positive or negative with roughly equal probability. Define $\rho \equiv s/p$ as the sparsity of $\boldsymbol{\beta}^{*}$. Because $\boldsymbol{\beta}^{*}$ is a sparse vector with only $s$ nonzero entries, the summation on the right-hand side of Eq. (8) contains at most $s$ terms. Neglecting the possible weak correlations among the $H_{ij}$ ($j \neq i$), we have $\sum_{j \neq i} H_{ij} \beta_j^{*} \approx \pm \sqrt{\rho \alpha}\, \bar{\beta}$, where $\bar{\beta}$ is the mean magnitude of the nonzero coefficients. Putting the above approximations together, we finally get

$$\beta_i^{0} \approx \alpha\, \beta_i^{*} \pm \sqrt{\rho \alpha}\, \bar{\beta}. \tag{9}$$

Notice that the second term on the right-hand side of Eq. (9) is independent of the index i. For the element $\beta_k^{0}$ that has the largest magnitude among all the elements of $\boldsymbol{\beta}^{0}$, we expect that the two terms on the right-hand side of Eq. (9) have the same sign, so that it has a relatively large magnitude. It then follows that the corresponding $\beta_k^{*}$ is very likely to be nonzero, with $\mathrm{sign}(\beta_k^{0}) = \mathrm{sign}(\beta_k^{*})$.
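The support-revealing property of the guidance vector is easy to verify numerically. The following sketch (Python with NumPy; all sizes are illustrative choices, not the paper's settings) computes the noise-free guidance vector via Eq. (7) and checks how much of the true support appears among its top-ranked entries.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 1000, 30  # illustrative dimensions

X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta_true[support] = rng.uniform(0.5, 1.0, size=s)

# Noise-free guidance vector beta0 = V1 V1^T beta* (Eq. 7), where V1 holds
# the right singular vectors associated with the n singular values.
U, lam, Vt = np.linalg.svd(X, full_matrices=False)
V1 = Vt.T                                  # p x n
beta0 = V1 @ (V1.T @ beta_true)

# The top-ranked |beta0| entries should concentrate on the true support.
top = np.argsort(-np.abs(beta0))[:s]
print("fraction of true support among top-s ranks:",
      np.isin(top, support).mean())
```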
Figure 1
Estimated guidance vector $\hat{\boldsymbol{\beta}}^{0}$ for an uncorrelated Gaussian measurement matrix. Each nonzero coefficient is uniformly distributed in [0.5, 1]. Top: rank curves for the estimated guidance vector. Bottom: proportion q(r) of nonzero elements of $\boldsymbol{\beta}^{*}$ among the r top-ranked indices i. The two subfigures on the left panel and the two subfigures on the right panel correspond to two different settings of the sparsity and noise levels.
The above analysis offers a qualitative explanation of why the guidance vector $\boldsymbol{\beta}^{0}$ can help us locate the nonzero elements of the sparse vector $\boldsymbol{\beta}^{*}$. When there is no noise ($\sigma = 0$), this guidance vector is easy to determine: it is the shortest Euclidean-length solution of an underdetermined linear system. In the presence of measurement noise ($\sigma > 0$), however, the noise-free combination $\mathbf{y} - \boldsymbol{\epsilon}$ in Eq. (5) is unknown. We then cannot obtain the exact guidance vector but only an approximation of it. Consider the estimator

$$\hat{\boldsymbol{\beta}}^{0} = X^{+} \mathbf{y}, \tag{10}$$

where $X^{+}$ is the Moore-Penrose inverse of $X$. Notice that $\hat{\boldsymbol{\beta}}^{0}$ is nothing but the shortest-length (i.e., minimum $\ell_2$-norm) least-squares solution of the linear model (Eq. 1), hereinafter referred to as the "shortest solution". It can be proved that $\hat{\boldsymbol{\beta}}^{0}$ is the best linear unbiased estimator of $\boldsymbol{\beta}^{0}$ (see Supplementary Section A). Combined with the above theoretical analysis, we conjecture that $\hat{\boldsymbol{\beta}}^{0}$ is also helpful for guessing which elements of the true coefficient vector $\boldsymbol{\beta}^{*}$ are nonzero. The validity of this conjecture has been confirmed by simulation. Figure 1 shows the magnitudes of the elements of the estimated guidance vector $\hat{\boldsymbol{\beta}}^{0}$ in descending order (top) and the proportion q(r) of nonzero elements of $\boldsymbol{\beta}^{*}$ among the r top-ranked indices i (bottom), for datasets generated from models with uncorrelated Gaussian measurement matrices. It can be seen that $\hat{\boldsymbol{\beta}}^{0}$ indeed contains important clues about the nonzero elements of $\boldsymbol{\beta}^{*}$: for the indices k that are top-ranked in terms of the magnitude of $\hat{\beta}_k^{0}$, the corresponding values $\beta_k^{*}$ have a high probability of being nonzero. In particular, for the 10 top-ranked indices in these examples, the corresponding entries of $\boldsymbol{\beta}^{*}$ are nearly all nonzero.

In practice, the estimated guidance vector can be computed through LQ decomposition or convex optimization, which is more efficient than the full SVD. In our simulation studies, we employ the convex optimization method[32]; see Supplementary Section B for the explicit formula.
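A minimal sketch of the estimator in Eq. (10): for an underdetermined system, NumPy's np.linalg.lstsq already returns the minimum-norm least-squares solution, so the Moore-Penrose inverse never needs to be formed explicitly. (This uses an SVD-based routine rather than the LQ or convex-optimization route mentioned above.)

```python
import numpy as np

def guidance_vector(X, y):
    """Minimum-norm least-squares solution of y = X beta (Eq. 10).

    For an underdetermined system, np.linalg.lstsq returns the shortest
    Euclidean-length solution, i.e. pinv(X) @ y, without explicitly
    constructing the Moore-Penrose inverse.
    """
    beta0_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta0_hat
```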
Shortest-solution guided decimation
Based on the above theoretical analysis and empirical results, we now solve the linear model (Eq. 1) through a shortest-solution guided decimation algorithm. Specifically, let $\boldsymbol{\beta}$ be a p-dimensional coefficient vector. Assume that the k-th element of the guidance vector $\hat{\boldsymbol{\beta}}^{0}$ has the largest magnitude. If all the other elements of the vector $\boldsymbol{\beta}$ are known, $\beta_k$ can be uniquely determined as the solution of the minimization problem

$$\beta_k = \arg\min_{\beta}\, \Big\| \mathbf{y} - \sum_{j \neq k} \beta_j \mathbf{x}_j - \beta\, \mathbf{x}_k \Big\|_2^2 = \frac{\mathbf{x}_k^{\top} \big( \mathbf{y} - \sum_{j \neq k} \beta_j \mathbf{x}_j \big)}{\|\mathbf{x}_k\|_2^2}. \tag{11}$$

Plugging Eq. (11) into model (Eq. 1), we obtain

$$\mathbf{y}' = X' \boldsymbol{\beta}' + \boldsymbol{\epsilon}', \tag{12}$$

where $\boldsymbol{\beta}'$ is the $(p-1)$-dimensional vector formed by deleting $\beta_k$ from $\boldsymbol{\beta}$; $X'$ is an $n \times (p-1)$ decimated measurement matrix with its column vectors being

$$\mathbf{x}_j' = \mathbf{x}_j - \frac{\mathbf{x}_k^{\top} \mathbf{x}_j}{\|\mathbf{x}_k\|_2^2}\, \mathbf{x}_k, \qquad j \neq k, \tag{13}$$

and $\mathbf{y}'$ is the residual of the original response vector,

$$\mathbf{y}' = \mathbf{y} - \frac{\mathbf{x}_k^{\top} \mathbf{y}}{\|\mathbf{x}_k\|_2^2}\, \mathbf{x}_k. \tag{14}$$

Notice that Eq. (12) has the same form as the original linear model (Eq. 1). Therefore, we can obtain the corresponding estimated guidance vector through the least-squares solution of Eq. (12). We then repeat the above decimation process (Eqs. 11–14) to further shrink the residual response vector, until a stopping criterion is met. Suppose that a total of L elements of $\boldsymbol{\beta}$ have been picked during this whole decimation process. We can uniquely and easily determine the values of these L elements by setting all the other elements to zero and then backtracking the L constructed equations of the form Eq. (11).

This SSD algorithm is, up to this point, the same as the original algorithm[32]. In the original SSD algorithm, the stopping criterion is that the magnitude of the residual response vector becomes less than a prespecified small threshold. To check whether this stopping criterion remains appropriate in the noisy situation, we test the performance of SSD on a single noisy problem instance. Figure 2 shows the trace of the SSD process on two datasets, one noise-free (left) and one noisy (right). The two datasets have the identical measurement matrix, and the nonzero elements of the true coefficient vector $\boldsymbol{\beta}^{*}$ are sampled from a uniform distribution. For the noise-free situation, the decimation stops (i.e., $\|\mathbf{y}'\|_2 \approx 0$) after a small number of steps (top left), with all the nonzero elements of $\boldsymbol{\beta}^{*}$ recovered exactly (bottom left). However, once noise is added, there is a significant increase in the number of decimation steps (top right), and the resulting coefficient vector is dense and dramatically different from $\boldsymbol{\beta}^{*}$ (bottom right). These results suggest that the stopping criterion used in the original SSD algorithm is no longer appropriate for the linear regression model with noise and needs to be improved.
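A single decimation step (Eqs. 11–14) amounts to projecting the remaining columns and the response onto the orthogonal complement of the selected column. The sketch below (NumPy) illustrates this; a full implementation must also track the original column indices and record the constructed Eq. (11) for later backtracking.

```python
import numpy as np

def decimation_step(X, y):
    """One SSD decimation step (Eqs. 11-14), as a sketch.

    Selects the index k whose guidance-vector entry has the largest
    magnitude, then projects the remaining columns and the response
    onto the orthogonal complement of x_k.
    """
    beta0_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # guidance vector
    k = int(np.argmax(np.abs(beta0_hat)))
    xk = X[:, k]
    nk2 = xk @ xk
    X_new = X - np.outer(xk, (xk @ X) / nk2)   # Eq. (13) for every column
    X_new = np.delete(X_new, k, axis=1)        # drop the decimated column
    y_new = y - xk * (xk @ y) / nk2            # Eq. (14)
    return k, X_new, y_new
```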
Figure 2
Simulation results of SSD with the naive stopping criterion on a single uncorrelated Gaussian measurement matrix. Top: trace of the $\ell_2$-norm of the residual response vector $\mathbf{y}'$. The red stars on the horizontal axis signify that the identified element $\beta_k^{*}$, with k being the index of the largest-magnitude element of the guidance vector, is indeed nonzero. Bottom: results of coefficient estimation. The two subfigures on the left display the noise-free situation ($\sigma = 0$), and those on the right display the noisy situation ($\sigma > 0$).
A further examination of the bottom right panel of Fig. 2 shows that during the early steps of the decimation process the identification of the nonzero elements of $\boldsymbol{\beta}^{*}$ is highly accurate. Specifically, there are only four misidentifications in the initial 30 decimation steps. In later decimation steps, however, the index k of the largest-magnitude element is no longer reliable, in the sense that the true value of $\beta_k^{*}$ may be zero. These misidentified elements are too numerous to be corrected by the subsequent backtracking process of SSD, and the resulting coefficient vector is then quite different from $\boldsymbol{\beta}^{*}$. These observations indicate the necessity of stopping the decimation process earlier.

First, we set an upper bound $L_{\max}$ for the number of decimation steps. It has been established that the true coefficient vector cannot be reconstructed consistently from a sample of size n if there are more than $n/(2 \log p)$ nonzero elements[33]. We therefore take

$$L_{\max} = \frac{n}{2 \log p}. \tag{15}$$

We repeat the shortest-solution guided decimation only up to $L_{\max}$ steps. Additionally, consider estimating $\boldsymbol{\beta}^{*}$ by the solution of the minimization problem[34]

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}\, \|\boldsymbol{\beta}\|_1 \quad \text{subject to} \quad \|\mathbf{y} - X \boldsymbol{\beta}\|_2 \leq \varepsilon. \tag{16}$$

Under certain conditions on the RIP (restricted isometry property) constant of $X$, the estimation error of $\hat{\boldsymbol{\beta}}$ measured in $\ell_2$-norm is of the order of $\varepsilon$[34]. Inspired by this insight, we terminate the SSD process earlier than $L_{\max}$ steps once the Euclidean length of the residual response vector, $\|\mathbf{y}'\|_2$, becomes smaller than a prespecified value $\varepsilon$.
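Putting the bound of Eq. (15) and the residual threshold together, the decimation loop with early stopping can be sketched as follows. The default choice eps ≈ σ√n is an illustrative reading of the criterion above, not the paper's exact prescription.

```python
import numpy as np

def ssd_with_early_stopping(X, y, sigma=None, eps=None):
    """SSD decimation with the relaxed stopping rule (sketch).

    Stops after L_max = n / (2 log p) steps (Eq. 15), or earlier once the
    residual norm drops below eps. The default eps ~ sigma * sqrt(n) is
    an illustrative assumption.
    """
    n, p = X.shape
    L_max = int(np.ceil(n / (2.0 * np.log(p))))
    if eps is None:
        eps = (sigma if sigma is not None else 0.1) * np.sqrt(n)

    idx = list(range(p))        # original column indices still in play
    picked = []                 # indices chosen by the decimation
    Xc, yc = X.copy(), y.copy()
    for _ in range(L_max):
        if np.linalg.norm(yc) < eps:
            break
        beta0_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        k = int(np.argmax(np.abs(beta0_hat)))
        picked.append(idx.pop(k))
        xk = Xc[:, k]
        nk2 = xk @ xk
        yc = yc - xk * (xk @ yc) / nk2                        # Eq. (14)
        Xc = np.delete(Xc - np.outer(xk, (xk @ Xc) / nk2), k, axis=1)
    return picked
```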
Second-stage thresholding after SSD
Even after the early-stopping strategy is applied to the decimation process, we find that some of the zero-valued coefficients are still predicted to be nonzero by the algorithm. To reduce this false-positive fraction as much as possible, we add a second-stage thresholding procedure to the SSD algorithm. The idea is to reset a coefficient to zero if the value predicted by the SSD algorithm is below a certain threshold. This refinement procedure turns out to be rather effective in improving the variable-selection accuracy.

Suppose that after early stopping L elements of $\hat{\boldsymbol{\beta}}$ are assigned nonzero values, and the indices of all the zero-valued coefficients form a set A (i.e., $\hat{\beta}_i = 0$ if and only if $i \in A$). We sort the absolute values of these L estimated coefficients in ascending order (say $|\hat{\beta}_{(1)}| \leq |\hat{\beta}_{(2)}| \leq \cdots \leq |\hat{\beta}_{(L)}|$) and use the first L/2 of them to calculate an empirical measure of coefficient uncertainty,

$$\hat{\sigma} = \sqrt{\frac{2}{L} \sum_{l=1}^{L/2} \big( |\hat{\beta}_{(l)}| - m \big)^2 }, \tag{17}$$

where m is the average value of the considered L/2 elements, $m = \frac{2}{L} \sum_{l=1}^{L/2} |\hat{\beta}_{(l)}|$. Notice that $\hat{\sigma}$ is distinct in meaning from the noise magnitude $\sigma$ of the original model system (Eq. 1). We adopt a data-driven procedure to determine the optimal thresholding level. First we set

$$\tau_0 = \hat{\sigma} \sqrt{2 \log p} \tag{18}$$

to be the basic thresholding level[35] (see also the initial work on thresholding of wavelet coefficients[36]). Next, we take the actual thresholding level to be $\mu \tau_0$, with $\mu$ taking discrete values. As $\mu$ increases from zero to a relatively large value R, the threshold becomes more and more elevated. At a given value of $\mu$, we first update the index set A by adding every index i with $|\hat{\beta}_i| < \mu \tau_0$, and then we update the remaining elements of $\hat{\boldsymbol{\beta}}$ by solving the least-squares problem

$$\hat{\boldsymbol{\beta}}_{\bar{A}} = \arg\min_{\boldsymbol{\beta}_{\bar{A}}}\, \big\| \mathbf{y} - X_{\bar{A}}\, \boldsymbol{\beta}_{\bar{A}} \big\|_2^2, \tag{19}$$

where $\bar{A}$ denotes the complement of A and $X_{\bar{A}}$ the corresponding columns of $X$. Finally, we compute the BIC (Bayesian Information Criterion) index[33] as

$$\mathrm{BIC}(\mu) = n \log\!\Big( \frac{\|\mathbf{y} - X \hat{\boldsymbol{\beta}}\|_2^2}{n} \Big) + \|\hat{\boldsymbol{\beta}}\|_0 \log n, \tag{20}$$

where $\|\hat{\boldsymbol{\beta}}\|_0$ means the number of nonzero elements in the vector $\hat{\boldsymbol{\beta}}$. The BIC value is a trade-off between the prediction error and the model complexity. We choose the value of $\mu$ such that $\mathrm{BIC}(\mu)$ achieves the minimum value, and consider the corresponding coefficient vector as the final solution of the linear regression problem (Eq. 1).
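The second-stage procedure can be sketched as follows. The exact forms of the uncertainty measure and the BIC term follow the reconstructions in Eqs. (17)–(20) and should be read as assumptions rather than the authors' verbatim implementation.

```python
import numpy as np

def second_stage_thresholding(X, y, beta_hat, R=10):
    """Refine an SSD estimate by thresholding plus BIC selection (sketch)."""
    n, p = X.shape
    nz = np.flatnonzero(beta_hat)
    mags = np.sort(np.abs(beta_hat[nz]))
    half = mags[: max(1, len(mags) // 2)]         # L/2 smallest magnitudes
    sigma_hat = half.std()                        # uncertainty measure (Eq. 17)
    tau0 = sigma_hat * np.sqrt(2.0 * np.log(p))   # basic threshold (Eq. 18)

    best_bic, best_beta = np.inf, beta_hat
    for mu in range(R + 1):
        keep = nz[np.abs(beta_hat[nz]) >= mu * tau0]
        beta_mu = np.zeros(p)
        if keep.size:
            # Refit the surviving coefficients by least squares (Eq. 19).
            coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
            beta_mu[keep] = coef
        resid = y - X @ beta_mu
        bic = n * np.log(resid @ resid / n) + keep.size * np.log(n)  # Eq. 20
        if bic < best_bic:
            best_bic, best_beta = bic, beta_mu
    return best_beta
```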
An initial demonstration of ASSD performance
We summarize the above ideas in the pseudo-code of Algorithm 1. This ASSD algorithm has two parts: decimation with early stopping, followed by refinement through second-stage thresholding.

Let us work through a small example to better appreciate the working characteristics of ASSD. We generate a random Gaussian measurement matrix with $p = 1000$ columns and i.i.d. Gaussian elements. The true coefficient vector $\boldsymbol{\beta}^{*}$ has 30 nonzero elements, each sampled from one of two uniform distributions with equal probability, and 970 zero elements. The response vector is generated from the linear regression model (Eq. 1) with a small noise level. We compare the performance of ASSD with that of the original SSD, which applies neither early stopping nor second-stage thresholding, and with that of SSD1, which adopts early stopping but skips the second-stage thresholding.

The results shown in Fig. 3 reveal that all three algorithms assign good approximate values to the nonzero elements of $\boldsymbol{\beta}^{*}$. SSD has a high false-positive rate (154 of the zero elements of $\boldsymbol{\beta}^{*}$ are misclassified as nonzero), and early stopping dramatically reduces this rate (only 8 false-positive predictions for SSD1). By applying the second-stage thresholding, ASSD achieves a zero false-positive rate. In addition, ASSD and SSD1 are more efficient than SSD (SSD, 27.2 s; SSD1, 7.83 s; ASSD, 8.2 s). Overall, the presence of measurement noise usually causes SSD to produce a solution with a high false-positive rate, and the two modifications in ASSD, a modified early-stopping criterion coupled with a second-stage thresholding, reduce this rate as much as possible.
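As a usage sketch, the two stages can be chained using the functions defined in the earlier snippets (ssd_with_early_stopping and second_stage_thresholding); the dimensions follow the example above, while n, the noise level, and the single uniform distribution are illustrative choices. The direct least-squares refit below is a simplification of backtracking the stored Eq. (11) chain.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, s, sigma = 300, 1000, 30, 0.1   # n and sigma are illustrative

X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta_true[support] = rng.uniform(0.5, 1.0, size=s)
y = X @ beta_true + sigma * rng.standard_normal(n)

# Stage 1: decimation with early stopping (see earlier sketch).
picked = ssd_with_early_stopping(X, y, sigma=sigma)

# Determine the picked coefficients by a least-squares refit, a shortcut
# for backtracking the constructed equations of the form Eq. (11).
beta_hat = np.zeros(p)
coef, *_ = np.linalg.lstsq(X[:, picked], y, rcond=None)
beta_hat[picked] = coef

# Stage 2: thresholding refinement with BIC selection (see earlier sketch).
beta_final = second_stage_thresholding(X, y, beta_hat)
print("false positives:", int(np.sum((beta_final != 0) & (beta_true == 0))))
```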
Figure 3
Performance comparison of the original SSD (left), SSD1 (SSD with early stopping), and ASSD. The measurement matrix and noise level are as described in the text. The blue crosses mark the 30 nonzero elements of the true coefficient vector $\boldsymbol{\beta}^{*}$; the red circles mark the nonzero elements of the estimated coefficient vector $\hat{\boldsymbol{\beta}}$. The parameter settings of ASSD are as described in the text.
Results
Model implementation
To better gauge the performance of ASSD, we compare it with five alternative methods: LASSO, adaptive LASSO (ALASSO), VAMP, SIS+LASSO, and ASDAR. We implemented all these methods in Matlab. Our implementation of LASSO uses the built-in function lasso. For ALASSO, we use the LASSO solution as the initial estimator and set the adaptive weights from it. For VAMP, we use the publicly available Matlab package[37]. The algorithm SIS+LASSO first selects variables based on SIS and then runs LASSO to further reduce the number of falsely identified nonzero coefficients. We implement ASDAR using the Matlab package sdar[31].

For ASSD, we set $L_{\max}$ and the residual threshold $\varepsilon$ as described above (in practical applications, if $\sigma$, the standard deviation of the noise, is unknown, we can set it to a small value, e.g. 0.1 as used in Fig. 3). For LASSO, ALASSO, and SIS+LASSO, the tuning parameters are selected by 10-fold cross validation. For ASDAR, we stop the iteration if the number of identified nonzero elements exceeds the prescribed bound, if the residual norm falls below the prescribed tolerance, or if the distance between two consecutive solutions (measured in $\ell_2$-norm) is smaller than 1. For VAMP, a small amount of damping is useful when the measurement matrix is ill-conditioned; we set the damping parameter to 0.95. The other parameters of VAMP, including the maximum number of iterations and the stopping tolerance, are set to their default values in the public-domain GAMPmatlab toolbox[37].

We focus on four metrics for algorithmic comparison: (1) the relative error of estimation, $\mathrm{RE} = \|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^{*}\|_2 / \|\boldsymbol{\beta}^{*}\|_2$; (2) the true-positive count (TP) and (3) the false-positive count (FP) of variable selection; and (4) the CPU time in seconds. In each scenario, we calculate the average and standard deviation of these four metrics over 96 independent runs.
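For reference, the three accuracy metrics can be computed as in the following sketch (TP and FP compare the supports of the estimated and true coefficient vectors):

```python
import numpy as np

def metrics(beta_hat, beta_true):
    """Relative error, true-positive count, and false-positive count."""
    re = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)
    est, true = beta_hat != 0, beta_true != 0
    tp = int(np.sum(est & true))     # correctly identified nonzeros
    fp = int(np.sum(est & ~true))    # zeros misclassified as nonzero
    return re, tp, fp
```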
Results on three types of measurement matrices
We first consider different types of measurement matrices of the same size. The number of nonzero coefficients in $\boldsymbol{\beta}^{*}$ is set to $s = 40$, with each of them generated from a uniform distribution. The response vector is generated via the linear regression model Eq. (1), in which the random errors are independently drawn from a normal distribution with mean 0 and variance $\sigma^2$. Three types of measurement matrix are considered:

Correlated Gaussian matrix: Each row of the matrix is drawn independently from $\mathcal{N}(\mathbf{0}, \Sigma)$, where the covariance $\Sigma$ is controlled by a correlation parameter $\rho$, with $\rho = 0$ and 0.7 corresponding to no and strong correlations.

Structured matrix: The matrix $X$ is the product of an $n \times r$ Gaussian matrix and an $r \times p$ Gaussian matrix, both with elements independently generated from a normal distribution. The rank r is closely related to the degree of correlation between the elements of $X$: when $r \gg n$, the elements of $X$ are weakly correlated or even uncorrelated, and as r approaches n from above, the elements become more and more correlated. We consider two scenarios, $r = 2300$ and $r = 305$, corresponding to weakly correlated and highly correlated (or structured) matrices, respectively.

Real-world matrix: We choose the gene expression data from The Cancer Genome Atlas (TCGA) ovarian cancer samples[38] and use the dataset provided by two earlier studies[39,40], available at https://bioinformatics.mdanderson.org/Supplements/ResidualDisease/. There are 594 samples and 22,277 genes in the original dataset. We randomly subsample the samples and genes to obtain a measurement matrix $X$.
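The two synthetic designs can be sampled as in the sketch below. Note that the AR(1) covariance Sigma_ij = rho^|i-j| is our assumption for illustration, since the text only specifies a correlation parameter rho; the dimensions are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 300, 1000, 305   # illustrative sizes; r is the structured-design rank

# Correlated Gaussian matrix: rows drawn from N(0, Sigma). The AR(1) form
# Sigma_ij = rho**|i-j| is an assumption made here for illustration only.
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
L = np.linalg.cholesky(Sigma)
X_gauss = rng.standard_normal((n, p)) @ L.T   # rows have covariance L @ L.T

# Structured matrix: product of an n x r and an r x p Gaussian matrix;
# column correlations strengthen as r approaches n from above.
X_struct = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))
```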
Correlated Gaussian matrix
Table 1 shows the results on Gaussian matrices. Here and hereinafter, the standard deviations of the metrics are shown in parentheses, and in each column the numbers in boldface indicate the best performers. ASSD has the best performance in variable selection. ASDAR achieves performance similar to ASSD when there is no correlation ($\rho = 0$), but it identifies more false positives when $\rho$ increases to 0.7. For estimation, ASSD again has the best or close-to-best performance. Although VAMP produces a smaller relative error than ASSD when $\rho = 0$, its performance deteriorates significantly when the correlation is high.
Table 1
Simulation results on Gaussian measurement matrices with $\rho = 0$ or 0.7; other settings as described in the text. Significant values are in bold.
ASSD shows no advantage in speed: it is similar to LASSO and ALASSO in computation time, but much slower than ASDAR and VAMP.
Structured matrices
Results on the structured matrices are reported in Table 2. When the rank number r of the matrix $X$ is large ($r = 2300$), VAMP, ASDAR, and ASSD top the list both in the variable-selection metrics (TP and FP) and in the estimation metric (RE). As the rank number r approaches n ($r = 305$), ASDAR becomes less accurate in variable selection and coefficient estimation than VAMP and ASSD. The favorable performance of VAMP is not unexpected, because it achieves Bayes-optimal estimation for a class of structured measurement matrices, namely rotationally invariant matrices. It is encouraging that ASSD performs comparably well in variable selection and estimation even when the measurement matrix is highly structured.
Table 2
Simulation results on structured measurement matrices with $r = 2300$ or 305; other settings as described in the text.
r    | Methods   | TP        | FP          | RE                  | Time
2300 | LASSO     | 40 (0)    | 32 (9.53)   | 4.05E−02 (3.59E−03) | 9.40 (2.05)
     | ALASSO    | 40 (0)    | 32 (9.50)   | 4.03E−02 (3.17E−03) | 12.66 (2.67)
     | VAMP      | 40 (0)    | 0 (0)       | 1.69E−03 (2.06E−04) | 0.09 (0.05)
     | SIS+LASSO | 22 (2.10) | 26 (3.47)   | 7.02E−01 (5.16E−02) | 0.80 (0.05)
     | ASDAR     | 40 (0)    | 0 (0)       | 1.69E−03 (2.06E−04) | 0.13 (0.03)
     | ASSD      | 40 (0)    | 0 (0)       | 1.69E−03 (2.04E−04) | 15.76 (2.67)
305  | LASSO     | 40 (0)    | 101 (19.64) | 9.41E−02 (2.43E−02) | 30.38 (15.04)
     | ALASSO    | 40 (0)    | 100 (19.47) | 9.31E−02 (2.28E−02) | 44.84 (22.98)
     | VAMP      | 40 (0)    | 0 (0)       | 4.98E−03 (6.40E−04) | 0.08 (0.05)
     | SIS+LASSO | 14 (2.36) | 30 (3.88)   | 8.70E−01 (4.43E−02) | 0.99 (0.07)
     | ASDAR     | 38 (5.26) | 30 (41.50)  | 1.25E−01 (3.06E−01) | 0.37 (0.26)
     | ASSD      | 40 (0)    | 0 (0)       | 4.99E−03 (6.36E−04) | 17.22 (7.06)

Significant values are in bold.
Real-world matrix
Table 3 shows the results on a real measurement matrix, a subsample of the gene expression data from TCGA ovarian cancer samples. ASSD again has the best performance in both variable selection and coefficient estimation. LASSO, ALASSO, SIS+LASSO, and VAMP do not work well on the real matrix: they identify too many false positives and produce significantly larger estimation errors. ASDAR is similar to ASSD in terms of estimation error, but it is inferior to ASSD in terms of variable selection. It is quite remarkable that only the ASSD algorithm achieves almost perfect accuracy on this real-world problem instance.
Table 3
Simulation results on a real-world measurement matrix; settings as described in the text.
Methods   | TP        | FP          | RE                  | Time
LASSO     | 39 (1.84) | 118 (15.87) | 4.80E−01 (9.02E−02) | 32.10 (3.34)
ALASSO    | 39 (1.84) | 103 (14.70) | 4.36E−01 (8.12E−02) | 48.07 (5.99)
VAMP      | 40 (0.32) | 90 (375.89) | 5.40E+00 (4.11E+01) | 0.18 (0.10)
SIS+LASSO | 7 (2.36)  | 20 (4.05)   | 9.46E−01 (3.77E−02) | 2.14 (0.73)
ASDAR     | 35 (6.09) | 18 (16.46)  | 4.01E−01 (2.79E−01) | 0.33 (0.16)
ASSD      | 36 (4.79) | 8 (7.81)    | 3.55E−01 (2.06E−01) | 31.16 (3.52)
We conducted additional simulations to examine the dependence of ASSD on the distribution of the coefficients. First, consider the strict-sparsity case, with the nonzero coefficients (a) sampled from two uniform distributions with equal probability, or (b) sampled from a single uniform distribution; the other settings are the same as above. We use the real-world matrix as a representative and present the results in Supplementary Tables S1-S2. Second, consider the weak-sparsity case, where a few coefficients are sampled from a uniform distribution and all the other coefficients are set to 0.001. Supplementary Table S3 reports the results on the real-world matrix. We also consider a larger noise level, with the nonzero coefficients sampled from a uniform distribution; the simulation results on the real-world matrix are displayed in Supplementary Table S4. Under all four scenarios, ASSD generally has favorable performance in variable selection and coefficient estimation; in particular, it has the lowest false-positive counts. It is noted that, compared to LASSO and ALASSO, ASSD identifies fewer true nonzero coefficients. A possible reason is that the adopted stopping criterion is slightly too strict (i.e., L is too small), and the values of TP are expected to increase with a looser stopping criterion. These results demonstrate that ASSD is especially well suited to scenarios that put a higher value on the precision than on the recall of variable selection.
Influence of model parameters
We now investigate more closely the effect of each of the model parameters (the sample size n, the number of predictors p, and the sparsity level s) on the performance of LASSO, VAMP, SIS+LASSO, ASDAR, and ASSD. From the simulation results above, we observe that ALASSO performs better than LASSO in variable selection; however, the improvement over LASSO is not large and comes at greater computational cost, so we do not report results for ALASSO in this section. The same three types of measurement matrices are examined: the correlated Gaussian matrix, the structured matrix, and the real-world matrix. The nonzero elements of $\boldsymbol{\beta}^{*}$ are i.i.d. random values drawn from the uniform distribution over [0.5, 1]. We generate the response vector from the linear regression model (Eq. 1), with random errors generated independently from a normal distribution. The simulation results are based on 96 independent repeats.

Figure 4 shows the influence of the sample size n on the relative errors (top left panel), true positives (top right panel), false positives (bottom left panel), and probability of exact identification of nonzero coefficients (bottom right panel) when the measurement matrix is a correlated Gaussian one. Results obtained on the other two types of measurement matrices can be found in Supplementary Figs. S1 and S2. (For the real-world and structured measurement matrices, the results of VAMP are too unstable to be shown, here and hereinafter.) As expected, the performance of all the methods improves as n increases. For the real-world and the correlated Gaussian matrices, ASDAR and ASSD perform comparably well in estimation accuracy, but ASSD performs significantly better than ASDAR in the accuracy of variable selection. In particular, ASSD exactly recovers the support once n is sufficiently large, whereas the success probability of ASDAR remains markedly lower. Similar observations hold for the real-world measurement matrix. For the structured measurement matrices, VAMP has the best performance, and ASSD again has close-to-best performance compared with ASDAR, LASSO, and SIS+LASSO.
Figure 4
Simulation results on a correlated Gaussian measurement matrix: influence of the sample size n on relative errors (top left), true positives (top right), false positives (bottom left, with a semi-logarithmic inset), and probability of exact identification of nonzero coefficients (bottom right).
Figure 5 shows the influence of the number of covariates p on the performance of the methods for a correlated Gaussian measurement matrix; the other model parameters are fixed as above. ASSD always produces the lowest relative errors and FP, and the highest TP. In particular, the probability that ASSD exactly recovers the support of the true coefficient vector stays higher than that of the other methods as p increases, indicating that ASSD is more robust to the number of covariates. Similar observations are made for the other types of measurement matrices (Supplementary Figs. S3 and S4).
Figure 5
Simulation results on a correlated Gaussian measurement matrix: influence of the covariates number p on relative errors (top left), true positives (top right), false positives (bottom left, with a semi-logarithmic inset), and probability of exact identification of nonzero coefficients (bottom right).
The influence of the sparsity level s on the performance of the methods for a correlated Gaussian measurement matrix is presented in Fig. 6; the corresponding results for the other types of matrices are presented in Supplementary Figs. S5 and S6. The same upper bound $L_{\max}$ is used for ASSD and ASDAR. As the number s of nonzero elements increases, the performance of all the methods degrades. When s is small, ASSD generally has the best performance in the accuracy of estimation and variable selection (i.e., the highest TP and probability of exact identification of nonzero coefficients). However, when s is large, ASSD performs worse than some of the comparison methods; for example, at the largest sparsity level considered, ASSD produces higher RE and lower TP than LASSO. These results indicate that ASSD is best suited to "strong sparsity" scenarios where the number of nonzero coefficients is small.
Figure 6
Simulation results on a correlated Gaussian measurement matrix: influence of the sparsity level s on relative errors (top left), true positives (top right), false positives (bottom left, with a semi-logarithmic inset), and probability of exact identification of nonzero coefficients (bottom right).
In summary, our simulation results demonstrate that the proposed ASSD algorithm is more accurate and robust in variable selection and coefficient estimation than LASSO, VAMP, SIS+LASSO, and ASDAR, and that it is a promising heuristic method for highly correlated random and real-world measurement matrices.
Discussion
In this paper, we proposed the adaptive shortest-solution guided decimation (ASSD) algorithm to estimate high-dimensional sparse linear regression models. Compared to the original SSD algorithm, which was developed for linear regression models without noise[32], ASSD takes the measurement noise into account and adopts an early-stopping strategy and a second-stage thresholding procedure, resulting in significantly better performance in variable selection (which columns of $X$ are relevant) and coefficient estimation (the values of the corresponding regression coefficients). Extensive simulation studies demonstrate that ASSD has favorable performance: it outperforms the comparison methods in variable selection and is competitive with or outperforms VAMP and ASDAR in coefficient estimation. It is robust to the model parameters and to different types of measurement matrices, including those whose entries are highly correlated. These numerical results suggest that ASSD can serve as an efficient and robust tool for real-world sparse estimation problems.

In terms of speed, ASSD is slower than VAMP and ASDAR, and this issue should be addressed in future work. To accelerate ASSD, we could, on the one hand, select a small fraction of the coefficient-vector elements instead of just one at each decimation step, and, on the other hand, adopt a more delicate early-stopping strategy to further reduce unnecessary decimation steps. In addition, a rigorous theoretical understanding of ASSD remains to be pursued. We have only considered the linear regression model in this paper; it will be interesting to generalize ASSD to other types of models, such as the logistic model and the Cox model.

Supplementary Information