Applications of Spectral Gradient Algorithm for Solving Matrix ℓ2,1-Norm Minimization Problems in Machine Learning.

Abstract

The main purpose of this study is to propose, analyze, and test a spectral gradient algorithm for solving a convex minimization problem. The problem considered covers the matrix ℓ2,1-norm regularized least squares model, which is widely used in multi-task learning to capture the features shared among tasks. To solve the problem, we first minimize a quadratic approximation of the objective function to derive a search direction at the current iterate. We show that this direction is automatically a descent direction and that it reduces to the original spectral gradient direction if the regularization term is removed. Second, we incorporate a nonmonotone line search along this direction to improve the algorithm's numerical performance. Furthermore, we show that the proposed algorithm converges to a critical point under some mild conditions. An attractive feature of the proposed algorithm is that it is easy to implement and requires only the gradient of the smooth function and the values of the objective function at each step. Finally, we report experiments on synthetic data, which show that the proposed algorithm works well and performs better than the solvers it is compared with.

Year: 2016 PMID: 27861526 PMCID: PMC5115710 DOI: 10.1371/journal.pone.0166169

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

1 Introduction

The tasks in medical diagnosis [1], text classification [2-5], biomedical informatics [6, 7] and other applications [8-12] are often related to each other. Hence, capturing the information shared among the tasks becomes the key issue in learning [13-15]. Suppose we are given the training set of t tasks $\{A_j\}_{j=1}^{t}$ and $\{b_j\}_{j=1}^{t}$, where $A_j \in \mathbb{R}^{m_j \times n}$ is the data for the j-th task and $b_j \in \mathbb{R}^{m_j}$ is the corresponding response. We let $x_j \in \mathbb{R}^{n}$ be the sparse feature for the j-th task, and let $X = [x_1, \ldots, x_t] \in \mathbb{R}^{n \times t}$ be the joint feature to be learned. In order to select features globally, one encourages several rows of X to be zero and solves the following ℓ2,1-norm regularized least squares problem [16, 17]

$$\min_{X \in \mathbb{R}^{n \times t}} \; \frac{1}{2}\sum_{j=1}^{t} \|A_j x_j - b_j\|_2^2 + \mu \|X\|_{2,1}, \qquad (1)$$

where μ > 0 is a weighting parameter, and ‖X‖2,1 is defined as the sum of the ℓ2-norms of the rows of the matrix. It is well known that the ℓ2,1-norm encourages the predictors of different tasks to share similar sparsity patterns. In the past few years, several algorithms have been proposed, analyzed, and tested to solve the nonsmooth convex minimization Problem (1). The algorithm in [18] transformed Eq (1) equivalently into a smooth convex optimization problem and then minimized it by Nesterov’s gradient method. The method in [16] reformulated Eq (1) as a constrained optimization problem and minimized it alternately. The algorithm in [19] and its variant [20] reformulated the problem as an equivalent constrained minimization by introducing an auxiliary variable, and then minimized the corresponding augmented Lagrangian function alternately. Finally, for an accelerated proximal gradient version of the algorithm in [19], one can refer to [21]. Unlike the work above, which is mainly concerned with Problem (1), in this paper we focus on the following generalized nonsmooth optimization problem

$$\min_{X \in \mathbb{R}^{n \times t}} \; F(X) + \mu \|X\|_{2,1}, \qquad (2)$$

where $F: \mathbb{R}^{n \times t} \to \mathbb{R}$ is continuously differentiable (possibly non-convex) and bounded below. Clearly, Model (2) includes Eq (1) as a special case when F is a least squares function.
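As a concrete illustration of the quantities above, the following NumPy sketch evaluates the ℓ2,1-norm (the sum of the ℓ2-norms of the rows) and the regularized least squares objective. The function names, the list-of-arrays data layout, the convention that column j of X belongs to task j, and the factor 1/2 on the loss are assumptions made for this example rather than details confirmed by the source.

```python
import numpy as np

def l21_norm(X):
    """Sum of the Euclidean norms of the rows of X."""
    return float(np.linalg.norm(X, axis=1).sum())

def multitask_objective(A_list, b_list, X, mu):
    """Least squares loss over all tasks plus mu times the l2,1-norm.

    Assumed layout: A_list[j] has shape (m_j, n), b_list[j] has shape (m_j,),
    and column j of X (shape (n, t)) holds the features of task j.
    """
    loss = 0.0
    for j, (A, b) in enumerate(zip(A_list, b_list)):
        residual = A @ X[:, j] - b
        loss += 0.5 * float(residual @ residual)  # the 1/2 scaling is an assumption
    return loss + mu * l21_norm(X)

# Small example: a zero row contributes nothing to the l2,1-norm.
X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
print(l21_norm(X))  # 6.0  (row norms are 5, 0 and 1)
```

Because the penalty acts on whole rows, driving one row of X to zero removes that feature from every task at once, which is the "global" feature selection described above.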
The spectral gradient method was originated by Barzilai and Borwein [22] for solving smooth unconstrained minimization problems, was later developed in [23-26], and was then extended to ℓ1-regularized nonsmooth minimization [27]. However, its numerical performance on nonsmooth minimization problems involving the matrix ℓ2,1-norm has not yet been explored. Therefore, extending the spectral gradient algorithm to solve Problem (2) may be significant both in theory and in practice. The first contribution of this study lies in the design of the search direction at each iteration, which is derived by minimizing a quadratic approximation of the objective function while making full use of the special structure of the ℓ2,1-norm. We also show that the generated direction is automatically a descent direction provided that the spectral coefficient is positive. The second contribution of the paper is the use of a nonmonotone line search to improve the algorithm’s performance. At each iteration, the algorithm requires only the gradient of the smooth term and the value of the objective function, which makes it suitable for high-dimensional problems. Finally, we compare its performance with that of the solvers IADM_MFL and SLEP; the results illustrate that the proposed method is fast, efficient, and competitive. The paper is organized as follows. In Section 2, we provide some notation and preliminaries, and construct the new algorithm together with its properties. In Section 3, we establish the global convergence of the algorithm. In Section 4, we report some numerical results and performance comparisons. Finally, we conclude the paper in Section 5.

2 Algorithm

2.1 Notations and preliminaries

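We first summarize the notation used in this paper. Matrices are written as uppercase letters and vectors as lowercase letters. For a matrix X, its i-th row and j-th column are denoted by $X_{i,:}$ and $X_{:,j}$, respectively. The Frobenius norm and the ℓ2,1-norm of a matrix $X \in \mathbb{R}^{n \times t}$ are defined as, respectively,

$$\|X\|_F = \Big(\sum_{i=1}^{n}\sum_{j=1}^{t} X_{i,j}^2\Big)^{1/2}, \qquad \|X\|_{2,1} = \sum_{i=1}^{n} \|X_{i,:}\|_2.$$

For any two matrices $X, Y \in \mathbb{R}^{n \times t}$, we define 〈X, Y〉 = tr(X⊤ Y) (the standard trace inner product in $\mathbb{R}^{n \times t}$), so that $\|X\|_F = \sqrt{\langle X, X\rangle}$. If $x \in \mathbb{R}^{n}$, we denote by “Diag(x)” the diagonal matrix with the components of the vector x on its diagonal. The superscript “⊤” denotes the transpose of a vector or a matrix. For the sake of simplicity, we let Φ(X) = F(X) + μ‖X‖2,1. Additional notation will be introduced when it occurs. We now briefly review the spectral gradient method for the unconstrained smooth minimization problem

$$\min_{x \in \mathbb{R}^{n}} f(x),$$

where $f: \mathbb{R}^{n} \to \mathbb{R}$ is a continuously differentiable function. The spectral gradient method is defined by

$$x_{k+1} = x_k - \lambda_k \nabla f(x_k),$$

where one of the choices of $\lambda_k$ (called the spectral coefficient) is given by

$$\lambda_k = \frac{s_{k-1}^\top s_{k-1}}{s_{k-1}^\top y_{k-1}},$$

where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$. Obviously, if $s_{k-1}^\top y_{k-1} > 0$, i.e. $\lambda_k > 0$, the search direction is automatically a descent direction at the current point.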
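The spectral gradient iteration reviewed above can be sketched in a few lines of NumPy. This is a minimal illustration without any line search, which is only known to be safe for strictly convex quadratics; the function name `spectral_gradient`, the initial coefficient `lam0`, and the rule of keeping the previous coefficient when the curvature term is not positive are assumptions of this example.

```python
import numpy as np

def spectral_gradient(grad, x0, lam0=1.0, tol=1e-10, max_iter=1000):
    """Barzilai-Borwein (spectral) gradient iteration x <- x - lam * grad(x)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    lam = lam0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - lam * g
        g_new = grad(x_new)
        s = x_new - x          # difference of iterates
        y = g_new - g          # difference of gradients
        sy = float(s @ y)
        if sy > 0:             # a positive coefficient keeps -lam * g a descent direction
            lam = float(s @ s) / sy
        x, g = x_new, g_new
    return x

# Strictly convex quadratic f(x) = 0.5 * x^T Q x - b^T x with minimizer Q^{-1} b.
Q = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = spectral_gradient(lambda x: Q @ x - b, np.zeros(2))
print(x_star)  # close to [1.0, 0.1]
```

The only information the iteration uses is the gradient at the current and previous points, which is why the method scales well to large problems.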

2.2 Algorithm

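Now, we turn our attention to the original Model (2). Since the ℓ2,1-norm is nondifferentiable, we approximate the objective function at the current iterate $X_k$ by the following quadratic function $Q_k$:

$$Q_k(D) = F(X_k) + \langle \nabla F(X_k), D\rangle + \frac{1}{2\Lambda_k}\|D\|_F^2 + \mu\|X_k + D\|_{2,1}, \qquad (3)$$

where $\nabla F(X_k)$ is the gradient of F at $X_k$, and $\Lambda_k$ is the so-called spectral coefficient, which is defined by

$$\Lambda_k = \frac{\langle S_{k-1}, S_{k-1}\rangle}{\langle S_{k-1}, Y_{k-1}\rangle},$$

where $S_{k-1} = X_k - X_{k-1}$ and $Y_{k-1} = \nabla F(X_k) - \nabla F(X_{k-1})$. Minimizing Eq (3) yields

$$D_k = \arg\min_{D} \; \frac{1}{2}\|D + \Lambda_k \nabla F(X_k)\|_F^2 + \mu\Lambda_k\|X_k + D\|_{2,1}. \qquad (4)$$

Denote $M_k = X_k + D_k$ and $G_k = X_k - \Lambda_k \nabla F(X_k)$. One gets

$$M_k = \arg\min_{M} \; \frac{1}{2}\|M - G_k\|_F^2 + \mu\Lambda_k\|M\|_{2,1}. \qquad (5)$$

The favorable structure of Eq (5) allows the i-th row of the matrix $M_k$ to be written explicitly as

$$(M_k)_{i,:} = \max\{\|(G_k)_{i,:}\|_2 - \mu\Lambda_k,\, 0\}\,\frac{(G_k)_{i,:}}{\|(G_k)_{i,:}\|_2},$$

where the convention 0 ⋅ 0/0 = 0 is followed. Hence, the search direction at the current point can be expressed row by row as

$$(D_k)_{i,:} = \max\{\|(G_k)_{i,:}\|_2 - \mu\Lambda_k,\, 0\}\,\frac{(G_k)_{i,:}}{\|(G_k)_{i,:}\|_2} - (X_k)_{i,:}. \qquad (6)$$

Obviously, Eq (6) reduces to $D_k = -\Lambda_k \nabla F(X_k)$ in the case μ = 0, which means that Eq (6) covers the traditional spectral gradient direction as a special case. The following lemma verifies that $D_k$ is a descent direction when the optimal solution has not been reached.

Lemma 1. Suppose that $\Lambda_k > 0$ and $D_k$ is determined by Eq (6). Then, for any θ ∈ (0, 1],

$$\Phi(X_k + \theta D_k) \le \Phi(X_k) + \theta\big[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\big] + o(\theta) \qquad (7)$$

and

$$\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1} \le -\frac{1}{2\Lambda_k}\|D_k\|_F^2. \qquad (8)$$

Proof. By the differentiability of F and the convexity of ‖X‖2,1, we have that for any θ ∈ (0, 1],

$$\Phi(X_k + \theta D_k) = F(X_k + \theta D_k) + \mu\|\theta(X_k + D_k) + (1-\theta)X_k\|_{2,1} \le F(X_k) + \theta\langle \nabla F(X_k), D_k\rangle + o(\theta) + \theta\mu\|X_k + D_k\|_{2,1} + (1-\theta)\mu\|X_k\|_{2,1},$$

which is exactly Eq (7). Noting that $D_k$ is the minimizer of Eq (3) and θ ∈ (0, 1], by Eq (3) and the convexity of ‖X‖2,1, one gets

$$\langle \nabla F(X_k), D_k\rangle + \frac{1}{2\Lambda_k}\|D_k\|_F^2 + \mu\|X_k + D_k\|_{2,1} \le \theta\langle \nabla F(X_k), D_k\rangle + \frac{\theta^2}{2\Lambda_k}\|D_k\|_F^2 + \theta\mu\|X_k + D_k\|_{2,1} + (1-\theta)\mu\|X_k\|_{2,1}.$$

Hence,

$$(1-\theta)\big[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\big] \le -\frac{1-\theta^2}{2\Lambda_k}\|D_k\|_F^2,$$

i.e., dividing by 1 − θ for θ ∈ (0, 1),

$$\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1} \le -\frac{1+\theta}{2\Lambda_k}\|D_k\|_F^2.$$

Recalling θ ∈ (0, 1], so that 1 + θ ≥ 1, the above inequality indicates that Eq (8) is correct.

To improve the algorithm’s performance, we use the classical nonmonotone line search [28] to find a suitable stepsize along the direction. It is well known that this technique allows the function values to increase occasionally in some iterations while decreasing over the whole iterative process. Letting δ ∈ (0, 1), ρ ∈ (0, 1), $\tilde{\alpha} > 0$ and $\tilde{m}$ be a given positive integer, we choose the smallest nonnegative integer j such that the stepsize $\alpha_k = \tilde{\alpha}\rho^{j}$ satisfies

$$\Phi(X_k + \alpha_k D_k) \le \max_{0 \le i \le m(k)} \Phi(X_{k-i}) + \delta\alpha_k\Delta_k, \qquad (9)$$

where $m(k) = \min\{m(k-1) + 1, \tilde{m}\}$ (m(0) = 0) and

$$\Delta_k = \langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}. \qquad (10)$$

From Eq (8), it is clear that $\Delta_k < 0$ whenever $D_k \ne 0$, which shows that Eq (9) is well-defined. In summary, the full steps of the Nonmonotone Spectral Gradient algorithm for ℓ2,1-norm minimization (abbr. NSGL21) can be described as follows:

Algorithm 1 (NSGL21)
Step 0. Choose an initial point X0, constants μ > 0, $\tilde{\alpha} > 0$, ρ ∈ (0, 1), δ ∈ (0, 1) and a positive integer $\tilde{m}$. Set k := 0.
Step 1. Stop if $\|D_k\| = 0$. Otherwise, continue.
Step 2. Compute $D_k$ via Eq (6).
Step 3. Compute $\alpha_k$ via Eq (9).
Step 4. Let $X_{k+1} := X_k + \alpha_k D_k$.
Step 5. Let k := k + 1. Go to Step 1.

As stated above, the generated direction is automatically a descent direction whenever $\Lambda_k > 0$. To ensure $\Lambda_k > 0$, we choose a sufficiently small Λ(min) > 0 and a sufficiently large Λ(max) > 0, such that $\Lambda_k$ is forced as

$$\Lambda_k := \min\{\Lambda_{(\max)},\, \max\{\Lambda_{(\min)},\, \Lambda_k\}\}.$$

This safeguard ensures that the descent property holds at every step.

Remark 1. The steps of the proposed algorithm are novel and differ from other existing approaches. The well-known approach [18] reformulated Problem (2) as the following constrained smooth convex optimization problem

$$\min_{X,\, r} \; F(X) + \mu\sum_{i=1}^{n} r_i \quad \text{s.t.} \quad \|X_{i,:}\|_2 \le r_i, \; i = 1, \ldots, n,$$

and then solved it via Nesterov’s method. The method in [19] focused on the least squares Model (1) and used an auxiliary variable to transform the model equivalently into

$$\min_{X,\, Z} \; \frac{1}{2}\sum_{j=1}^{t}\|A_j x_j - b_j\|_2^2 + \mu\|Z\|_{2,1} \quad \text{s.t.} \quad X = Z.$$

An alternating direction method of multipliers is then used to solve the resulting model, and closed-form solutions are derived for each subproblem. Clearly, our proposed algorithm differs from the above-mentioned approaches in the sense that we solve the original Model (2) directly without any transformation.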
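The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab code: the search direction is computed by row-wise shrinkage of X − Λ∇F(X), the line search compares against the largest of the most recent objective values, and the names `direction` and `nsgl21`, the default parameter values, the initial coefficient of 1, and the tolerance-based stopping test (instead of an exact zero test) are all choices made for this example.

```python
import numpy as np

def l21_norm(X):
    """Sum of the Euclidean norms of the rows of X."""
    return float(np.linalg.norm(X, axis=1).sum())

def direction(X, grad, lam, mu):
    """D = M - X, where M shrinks each row of G = X - lam * grad by mu * lam."""
    G = X - lam * grad
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    # Convention 0 * 0/0 = 0: rows of G with zero norm are mapped to zero.
    scale = np.maximum(norms - mu * lam, 0.0) / np.where(norms > 0, norms, 1.0)
    return scale * G - X

def nsgl21(F, grad_F, X0, mu, alpha0=1.0, rho=0.5, delta=1e-4, m_max=5,
           lam_min=1e-20, lam_max=1e20, tol=1e-6, max_iter=500):
    """Nonmonotone spectral gradient sketch for min F(X) + mu * ||X||_{2,1}."""
    phi = lambda Z: F(Z) + mu * l21_norm(Z)
    X = np.array(X0, dtype=float)
    g = grad_F(X)
    lam = 1.0                      # initial spectral coefficient (assumption)
    history = [phi(X)]
    for _ in range(max_iter):
        D = direction(X, g, lam, mu)
        if np.linalg.norm(D) <= tol:
            break
        # Predicted decrease; negative whenever D is nonzero and lam > 0.
        dec = float(np.sum(g * D)) + mu * (l21_norm(X + D) - l21_norm(X))
        ref = max(history[-(m_max + 1):])   # nonmonotone reference value
        alpha = alpha0
        while phi(X + alpha * D) > ref + delta * alpha * dec:
            alpha *= rho
        X_new = X + alpha * D
        g_new = grad_F(X_new)
        S, Y = X_new - X, g_new - g
        sy = float(np.sum(S * Y))
        if sy != 0.0:
            lam = float(np.sum(S * S)) / sy
        lam = min(lam_max, max(lam_min, lam))  # force lam into [lam_min, lam_max]
        X, g = X_new, g_new
        history.append(phi(X))
    return X, history
```

A usage example on a small least squares problem, where a single data matrix shared by all tasks is assumed purely to keep the example short:

```python
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
X_true = np.zeros((8, 3))
X_true[:3] = rng.standard_normal((3, 3))
B = A @ X_true + 0.01 * rng.standard_normal((30, 3))
F = lambda X: 0.5 * np.sum((A @ X - B) ** 2)
grad_F = lambda X: A.T @ (A @ X - B)
X_hat, hist = nsgl21(F, grad_F, np.zeros((8, 3)), mu=0.5)
```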

3 Convergence analysis

This section is devoted to establishing the global convergence of algorithm NSGL21. For this purpose, we make the following assumption. Assumption 1. The level set Ω = {X: F(X) ≤ F(X0)} is bounded. Lemma 2. Suppose that Assumption 1 holds and the sequence $\{X_k\}$ is generated by Algorithm 1. Then $X_k$ is a stationary point of Problem (2) if and only if $D_k = 0$. Proof. In the case of $D_k \ne 0$, Lemma 1 shows that $D_k$ is a descent direction, which implies that $X_k$ is not a stationary point of Problem (2). On the other hand, since $D_k = 0$ is the solution of Eq (5), for any with ξ > 0 we have Combining the fact $F(X_k + \xi D) - F(X_k) = \langle \nabla F(X_k), \xi D\rangle + o(\xi)$ with Eq (11), it yields which indicates that $X_k$ is a stationary point of Problem (2). Lemma 3. Let l(k) be an integer such that Then the sequence $\{\Phi(X_{l(k)})\}$ is nonincreasing and the search direction D satisfies Proof. It is not difficult to see that $\Phi(X_{l(k+1)}) \le \Phi(X_{l(k)})$, which indicates that the maximum of the objective function over the most recent iterates is nonincreasing. Moreover, by Eq (9), we have that for all , By Assumption 1, the sequence $\{\Phi(X_{l(k)})\}$ admits a limit as k → ∞. Hence, it follows that On the other hand, by the definition of Δ in Eq (10) and the inequality Eq (8), it is easy to deduce that Combining with Eq (13), one gets which indicates the desired result Eq (12). Theorem 1. Let the sequences $\{X_k\}$ and $\{D_k\}$ be generated by Algorithm 1. Then, there exists a subsequence such that Proof. Let be a limit point of $\{X_k\}$, and be a subsequence of $\{X_k\}$ converging to . Then by Eq (12) either , or there exists a subsequence () such that In the latter case, we assume that there exists a constant ϵ > 0 such that Since $\alpha_k$ is the first value to satisfy Eq (9), it follows from Step 3 in Algorithm 1 that there exists an index such that, for all and , Since F is continuously differentiable, by the mean-value theorem applied to F, there exists a constant θ ∈ (0, 1) such that Combining with Eq (17), we have Since $\alpha_k \to 0$ in Eq (15), we have $\alpha_k < \rho$ as k → ∞.
It is not difficult to show that Subtracting Δ from the left side of Eq (18) and noting the definition of Δ, it is clear that Noting Eq (19), Eq (18) thus shows that Taking the limit as , k → ∞ on both sides of Eq (20) and using the smoothness of F, we obtain which implies $\|D_k\| \to 0$ as , k → ∞. This yields a contradiction because Eq (16) indicates that $\|D_k\|$ is bounded away from zero.

4 Numerical experiments

In this section, we present numerical results to illustrate the feasibility and efficiency of the algorithm NSGL21. In particular, we also test against the recent solvers IADM_MFL and SLEP for performance comparison. In running SLEP (Sparse Learning with Efficient Projections), we use the code at http://www.public.asu.edu/~jye02/Software/SLEP/index.htm in its Matlab package, and choose mFlag = 1 and lFlag = 1 to use an adaptive line search. All experiments are carried out under Windows 7 and Matlab v7.8 (2009a) running on a Lenovo laptop with an Intel Pentium CPU at 2.5 GHz and 4 GB of memory. As in [16], in the first test, the true feature vector of each task is generated from a 5-dimensional Gaussian distribution with zero mean and covariance diag{1, 0.64, 0.49, 0.36, 0.25}. To each of these vectors, we keep adding up to 20 irrelevant dimensions which are exactly zero. The training and test data $A_j$ are Gaussian matrices, and their response data $b_j$ are generated by applying $A_j$ to the true feature vector of the j-th task and adding ω, where ω is zero-mean Gaussian noise with standard deviation 1e−2. We start NSGL21 from the zero point and terminate the iterative process when the stopping criterion is met with tolerance tol > 0. The quality of the solution X* is measured by its relative error with respect to the true feature matrix. In this test, we take μ = 1e−2, t = 200, n = 15, tol = 1e−3, Λ(min) = $10^{-20}$, Λ(max) = $10^{20}$, and $m_j = 100$ for all j = 1, 2, …, t. Moreover, to compare the performance of these algorithms fairly, we run each code from the zero point, use all the default parameter values, and observe their convergence behavior in obtaining solutions of similar accuracy. To illustrate the performance of each algorithm, Figs 1 and 2 show how the relative error evolves with the number of iterations and with the computing time, respectively.
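The synthetic test described above can be mimicked with the following NumPy sketch. It is one plausible reading of the setup, with several assumptions: n is taken to be the total feature dimension (5 relevant rows plus n − 5 zero rows), every task has the same number of samples m, the relative error is measured in the Frobenius norm, and the names `make_synthetic` and `relative_error` are hypothetical.

```python
import numpy as np

def make_synthetic(t=200, n=15, m=100, noise_std=1e-2, seed=0):
    """Generate per-task data A[j], responses b[j] and the true matrix W (n x t)."""
    rng = np.random.default_rng(seed)
    # Covariance diag{1, 0.64, 0.49, 0.36, 0.25} means these standard deviations.
    std = np.sqrt(np.array([1.0, 0.64, 0.49, 0.36, 0.25]))
    W_relevant = rng.standard_normal((5, t)) * std[:, None]
    # Irrelevant dimensions are rows that are exactly zero for every task.
    W = np.vstack([W_relevant, np.zeros((n - 5, t))])
    A = [rng.standard_normal((m, n)) for _ in range(t)]
    b = [A[j] @ W[:, j] + noise_std * rng.standard_normal(m) for j in range(t)]
    return A, b, W

def relative_error(X, W):
    """Frobenius-norm relative error of X with respect to the true matrix W."""
    return float(np.linalg.norm(X - W) / np.linalg.norm(W))

A, b, W = make_synthetic(t=4, n=8, m=10)
print(W.shape, A[0].shape, b[0].shape)  # (8, 4) (10, 8) (10,)
```

With this layout, a solver that recovers the zero rows of W exactly has performed the global feature selection that the ℓ2,1-norm is meant to encourage.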

Fig 1

Comparison results of NSGL21, IADM_MFL, and SLEP.

The x-axis represents the number of iterations and the y-axis represents the relative error.

Fig 2

Comparison results of NSGL21, IADM_MFL, and SLEP.

The x-axis represents the CPU time in seconds and the y-axis represents the relative error.

Figs 1 and 2 show that IADM_MFL and NSGL21 produce faithful results, whereas SLEP does not. In preliminary experiments we tried running SLEP with more iterations, but it made no further progress. Meanwhile, NSGL21 requires fewer iterations than IADM_MFL to achieve solutions of similar quality. In both plots, the green line (NSGL21) lies at the bottom in most cases, which indicates that NSGL21 is superior to the other two solvers. This simple test alone is not enough to verify that NSGL21 is the winner. To further illustrate the benefit of NSGL21, we examine its behavior for different dimensions and different numbers of tasks. The results are listed in Table 1, which contains the number of iterations (Iter), the CPU time in seconds (Time), the relative errors (RelErr), and the final function values (Fun).

Table 1

Comparison results of NSGL21 with IADM_MFL and SLEP.

		NSGL21				IADM_MFL				SLEP
t	n	Iter	Time (s)	RelErr	Fun	Iter	Time (s)	RelErr	Fun	Iter	Time (s)	RelErr	Fun
50	5	12	0.03	1.32e-3	0.49	23	0.05	3.49e-3	0.53	32	0.06	1.66e-2	2.27
50	10	12	0.03	1.95e-3	0.48	32	0.06	2.28e-3	0.49	29	0.06	1.67e-2	2.26
50	15	14	0.03	2.33e-3	0.47	34	0.08	2.53e-3	0.48	29	0.03	1.67e-2	2.26
50	20	15	0.03	3.00e-3	0.45	39	0.06	2.79e-3	0.46	30	0.06	1.66e-2	2.26
50	25	14	0.05	3.49e-3	0.44	42	0.06	2.87e-3	0.45	33	0.09	1.65e-2	2.25
100	5	11	0.06	1.39e-3	0.83	24	0.05	1.62e-3	0.84	32	0.09	1.51e-2	3.61
100	10	12	0.05	2.13e-3	0.81	29	0.06	2.25e-3	0.83	41	0.09	1.52e-2	3.66
100	15	17	0.06	2.49e-3	0.79	33	0.09	2.55e-3	0.82	32	0.12	1.49e-2	3.57
100	20	15	0.09	2.99e-3	0.75	38	0.11	2.37e-3	0.79	32	0.11	1.50e-2	3.59
100	25	19	0.12	3.43e-3	0.74	43	0.14	2.72e-3	0.80	28	0.16	1.55e-2	3.73
150	5	12	0.06	1.43e-3	1.14	24	0.08	1.81e-3	1.16	35	0.14	1.51e-2	5.19
150	10	14	0.09	1.98e-3	1.11	29	0.09	2.44e-3	1.15	33	0.16	1.49e-2	5.18
150	15	17	0.12	2.57e-3	1.08	34	0.17	2.91e-3	1.15	32	0.22	1.51e-2	5.20
150	20	15	0.16	3.04e-3	1.03	40	0.20	2.79e-3	1.11	35	0.17	1.50e-2	5.16
150	25	19	0.22	3.45e-3	0.99	45	0.23	3.00e-3	1.08	35	0.28	1.49e-2	5.14
200	5	12	0.12	1.41e-3	1.45	24	0.12	1.68e-3	1.46	45	0.12	1.53e-2	7.10
200	10	12	0.12	1.94e-3	1.41	29	0.19	2.09e-3	1.45	41	0.14	1.53e-2	7.10
200	15	17	0.19	2.57e-3	1.35	33	0.25	2.54e-3	1.41	33	0.25	1.51e-2	6.98
200	20	15	0.19	3.10e-3	1.32	38	0.25	3.09e-3	1.41	34	0.25	1.51e-2	6.95
200	25	19	0.28	3.52e-3	1.26	43	0.31	3.22e-3	1.35	27	0.28	1.57e-2	7.30
250	5	11	0.12	1.43e-3	1.74	24	0.17	1.58e-3	1.75	38	0.28	1.55e-2	8.80
250	10	14	0.25	2.01e-3	1.68	31	0.25	2.30e-3	1.74	37	0.31	1.55e-2	8.77
250	15	17	0.28	2.58e-3	1.61	36	0.28	3.00e-3	1.71	33	0.31	1.54e-2	8.70
250	20	15	0.31	3.02e-3	1.56	39	0.36	3.13e-3	1.66	34	0.37	1.53e-2	8.70
250	25	19	0.37	3.46e-3	1.50	46	0.45	3.63e-3	1.62	30	0.25	1.61e-2	9.26
300	5	12	0.22	1.40e-3	2.04	26	0.25	1.77e-3	2.07	35	0.28	1.54e-2	10.55
300	10	12	0.23	2.04e-3	1.96	30	0.27	2.31e-3	2.03	45	0.42	1.57e-2	10.77
300	15	17	0.37	2.52e-3	1.90	35	0.37	3.10e-3	2.03	35	0.39	1.53e-2	10.50
300	20	14	0.37	3.03e-3	1.83	41	0.51	3.52e-3	1.96	34	0.31	1.54e-2	10.52
300	25	20	0.58	3.52e-3	1.72	45	0.62	4.36e-3	1.91	29	0.44	1.62e-2	11.26

From Table 1, we observe that each algorithm requires more computing time as the problem dimension and the number of tasks increase. Meanwhile, the number of iterations required by NSGL21 and IADM_MFL increases slightly in the higher-dimensional cases. We also observe that, for all the tested problems, both NSGL21 and IADM_MFL terminate normally and produce solutions of similar quality in the sense of comparable relative errors and final function values. However, SLEP does not generate acceptable solutions, even though more iterations were permitted in preliminary experiments. Hence, we conclude that NSGL21 and IADM_MFL perform better than SLEP. We now turn to the comparison between IADM_MFL and NSGL21. To obtain solutions of similar quality, NSGL21 is no slower than IADM_MFL in all but one case (t = 100, n = 5) and saves roughly half of the iterations (between about 48% and 67% across the rows of Table 1). It is therefore reasonable to conclude that NSGL21 is the winner among the compared solvers.
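The iteration savings discussed above can be checked directly from the Iter columns of Table 1. The short script below is only an arithmetic aid: the two lists are copied row by row from the table, and the variable names are chosen for this example.

```python
# Iter columns of Table 1, in row order (t = 50, ..., 300; n = 5, ..., 25).
nsgl21_iters = [12, 12, 14, 15, 14, 11, 12, 17, 15, 19,
                12, 14, 17, 15, 19, 12, 12, 17, 15, 19,
                11, 14, 17, 15, 19, 12, 12, 17, 14, 20]
iadm_iters = [23, 32, 34, 39, 42, 24, 29, 33, 38, 43,
              24, 29, 34, 40, 45, 24, 29, 33, 38, 43,
              24, 31, 36, 39, 46, 26, 30, 35, 41, 45]

# Fraction of IADM_MFL iterations that NSGL21 avoids on each test problem.
savings = [1 - a / b for a, b in zip(nsgl21_iters, iadm_iters)]

print(f"smallest saving: {min(savings):.1%}")
print(f"largest saving:  {max(savings):.1%}")
print(f"rows with at least 50% saving: {sum(s >= 0.5 for s in savings)} of {len(savings)}")
```

The smallest saving occurs for t = 50, n = 5 (12 versus 23 iterations) and the largest for t = 50, n = 25 (14 versus 42 iterations).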

5 Conclusions

In this paper, we have proposed, analyzed, and tested a nonmonotone spectral gradient algorithm for solving ℓ2,1-norm regularized minimization problems. Problems of this type mainly appear in computer vision, text classification and biomedical informatics. Due to the nonsmoothness of the regularization term, minimizing such problems is challenging. To the best of our knowledge, SLEP and IADM_MFL are the only available solvers for this problem. However, both solvers transform the problem into an equivalent constrained minimization problem before solving it. The spectral gradient algorithm is known to be very effective for smooth minimization problems. Hence, its performance on ℓ2,1-norm regularized problems is worth investigating, and this is the main motivation of our paper. At each iteration, the method proposed in this paper minimizes an approximate quadratic model of the objective function to produce a search direction. We showed that the generated direction is automatically a descent direction and that the algorithm converges globally under some mild conditions. Additionally, the numerical experiments illustrate that the proposed algorithm is competitive with, or even performs better than, SLEP and IADM_MFL. This is the numerical contribution of our paper. As noted, the ℓ2,1-norm regularized minimization problem arises in multi-task learning for capturing the joint features shared among tasks. However, we did not test the algorithm on real data; this is left for further investigation. Finally, we expect that the proposed method and its extensions could find further applications to problems in related areas of machine learning.
