
Single directional SMO algorithm for least squares support vector machines.

Xigao Shao, Kun Wu, Bifeng Liao.

Abstract

Working set selection is a major step in decomposition methods for training least squares support vector machines (LS-SVMs). In this paper, a new technique for selecting the working set in sequential minimal optimization (SMO) type decomposition methods is proposed. With the new method, a single direction is selected to reach the optimality condition. A simple asymptotic convergence proof for the new algorithm is given. Experimental comparisons demonstrate that the classification accuracy of the new method is not largely different from that of existing methods, while training is faster.

Year:  2013        PMID: 23509447      PMCID: PMC3590457          DOI: 10.1155/2013/968438

Source DB:  PubMed          Journal:  Comput Intell Neurosci


1. Introduction

In a classification problem, we consider a set of training samples, that is, the input vectors $\{x_i\}$ along with the corresponding class labels $\{y_i\}$. Our task is to find a deterministic function that best represents the relation between input vectors and class labels. For classification or forecasting problems in machine learning, the support vector machine (SVM) has been adopted in many applications because of its high precision [1-4]. SVMs require the solution of a quadratic programming problem. Another successful method for machine learning is the least squares support vector machine (LS-SVM) [5]. Instead of solving a quadratic programming problem as in SVMs, LS-SVMs obtain the solution from a set of linear equations.
Many algorithms have been proposed for training LS-SVMs: Suykens et al. proposed an iterative algorithm based on conjugate gradient (CG) algorithms [6]; Ferreira et al. presented a gradient system that can train the LS-SVM model effectively [7]; Chua introduced efficient computations for large least squares support vector machine classifiers [8]; Chu et al. improved the efficiency of the CG algorithm by using one reduced system of linear equations [9]; Keerthi and Shevade extended the sequential minimal optimization (SMO) algorithm to solve the linear equations in LS-SVMs, selecting the maximum violating pair (MVP) as the working set [10]; based on the idea of the SMO algorithm, Bo et al. presented an improved method for working set selection using functional gain (FG) [11]; Jian et al. designed a multiple kernel learning algorithm for LS-SVMs by convex programming [12]; and so on. These numerical algorithms are computationally attractive. Empirical comparisons show that the SMO algorithm is more efficient than the CG algorithm on large-scale datasets. Fast SVM training with the SMO algorithm is an important goal for practitioners, and many other proposals toward it have appeared in the literature.
Initially, Platt presented two heuristics that resulted in a somewhat cumbersome selection [13]. Later, Keerthi et al. introduced the concept of a violating pair to denote two coefficients that cause a violation of the KKT optimality conditions of the dual, and suggested always selecting the pair that violates them the most, that is, the maximum violating pair (MVP) [14]. Finally, Fan et al. proposed a second order selection that usually results in faster training than the MVP rule [15]. These improvements decrease the computational expense of the SMO algorithm, but some concrete updating patterns are still selected repeatedly in sequential minimal optimization. These repeated patterns are called training cycles. Barbero et al. studied their presence from a geometrical point of view [16]. They pointed out that training cycles can be partially collapsed into a single updating vector that gives better optimization directions. This idea can reduce the number of iterations and kernel operations of the SMO algorithm.
Inspired by Barbero et al. [16], we present a single directional SMO algorithm for LS-SVMs, abbreviated as the SD-SMO algorithm. In the optimization procedure, an adaptive objective function is selected and single directional steps are taken for the Lagrangian multipliers, which lessens the number of training cycles and further reduces the iterations and kernel operations of the SMO algorithm. Experiments show that the training time for LS-SVMs with the SD-SMO algorithm can be reduced significantly, while its testing accuracy is not largely different from that of the traditional SMO algorithm. The rest of this paper is structured as follows. In the next section, LS-SVMs are briefly reviewed. In Section 3, the SD-SMO algorithm for LS-SVMs is provided and its convergence is proved theoretically. Computational experiments on standard datasets describing the effectiveness of the improved algorithm are presented in Section 4. Finally, Section 5 is devoted to concluding remarks.

2. LS-SVM

In this section, we concisely review the basic principles of LS-SVMs. Given a training dataset of $N$ points $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i$ and output data $y_i$, we consider the following optimization problem in primal weight space:
$$\min_{w, b, e} J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \tag{1}$$
such that
$$y_i = w^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N, \tag{2}$$
where $\gamma$ is a regularization factor, $e_i$ is the difference between the desired output $y_i$ and the actual output, and $\varphi(\cdot)$ is a nonlinear function mapping the data points into a high-dimensional Hilbert space; in addition, the dot product in the high-dimensional space is equivalent to a positive-definite kernel function $k(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. In primal weight space, a linear classifier in the new space takes the following form:
$$y(x) = \operatorname{sign}\left(w^T \varphi(x) + b\right). \tag{3}$$
The weight vector $w$ may be infinite dimensional; hence, using (1) to find the solution is impossible in general. To solve this problem, we compute the model in the dual space instead of the primal space. Let $b = 0$; as in the paper by Keerthi and Shevade [10], only the simple problem without a bias term is considered in this paper. The Lagrangian for the simple problem is
$$L(w, e; \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left(w^T \varphi(x_i) + e_i - y_i\right), \tag{4}$$
where the $\alpha_i$ are Lagrangian multipliers and are called support values. The Karush-Kuhn-Tucker (KKT) conditions for optimality are
$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \qquad \frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow w^T \varphi(x_i) + e_i - y_i = 0. \tag{5}$$
After elimination of $w$ and $e$, we obtain the following linear system:
$$\left(K + \frac{1}{\gamma} I\right) \alpha = y, \tag{6}$$
where $y = [y_1, y_2, \ldots, y_N]^T$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$, and $K \in \mathbb{R}^{N \times N}$ is the kernel matrix with entries $K_{ij} = k(x_i, x_j)$. Solving the linear system (6) yields the $\alpha_i$; hence, LS-SVM greatly simplifies the problem. The resulting LS-SVM model for function estimation is
$$y(x) = \sum_{i=1}^{N} \alpha_i k(x, x_i). \tag{7}$$
For the choice of the kernel function $k(\cdot, \cdot)$, there are several possibilities: $k(x, x_i) = x_i^T x$ (linear LS-SVM); $k(x, x_i) = (x_i^T x + 1)^d$ (polynomial LS-SVM of degree $d$); $k(x, x_i) = \exp\{-\|x - x_i\|_2^2 / \sigma^2\}$ (RBF LS-SVM); $k(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP LS-SVM). In the sequel, we focus on the RBF LS-SVM. Large linear systems call for iterative methods applied to (6), as discussed by Jiao et al. [17].
The speed of convergence depends on the condition number of the matrix in (6), which in the case of the RBF LS-SVM is influenced by the choice of $(\gamma, \sigma)$. In the following section, we discuss SMO-type algorithms and give the convergence proof for the SD-SMO algorithm.
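
As a minimal sketch of the dual formulation above, the linear system (6) can be solved directly when $N$ is small. The function names `rbf_kernel`, `lssvm_fit`, and `lssvm_predict` and the toy data are illustrative assumptions, not part of the paper, and the dense direct solve is used only because the example is tiny; large problems are the reason iterative and decomposition methods are needed.

```python
import numpy as np


def rbf_kernel(A, B, sigma2):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / sigma2), as in Section 2."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma2)


def lssvm_fit(X, y, gamma, sigma2):
    """Solve (K + I / gamma) alpha = y, i.e. system (6) with b = 0."""
    K = rbf_kernel(X, X, sigma2)
    return np.linalg.solve(K + np.eye(len(y)) / gamma, y)


def lssvm_predict(X_train, alpha, X_new, sigma2):
    """Evaluate model (7): y(x) = sum_i alpha_i k(x, x_i)."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha


# Toy data (made up for illustration): two well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = lssvm_fit(X, y, gamma=10.0, sigma2=1.0)
labels = np.sign(lssvm_predict(X, alpha, X, sigma2=1.0))
```

Taking the sign of the model output, as in (3) with $b = 0$, turns the real-valued estimate into a class label.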

3. SMO and SD-SMO Algorithms for LS-SVM

When solving the LS-SVM problem, the matrix in (6) is usually fully dense and may be too large to be stored. Decomposition methods are designed to handle this difficulty; see Jiao et al. [17]. Unlike other optimization algorithms, which update the whole vector of Lagrangian multipliers $\alpha$ in each iteration, a decomposition algorithm modifies only a subset of $\alpha$ per iteration. We denote this subset as the working set $B$. The SMO algorithm was developed in [10] as a decomposition method to solve the dual problems arising in LS-SVM formulations. In each iteration, the SMO algorithm restricts $B$ to only two elements. Because problem (4) has no bias term $b$, SMO can be simplified so that $B$ contains only one element per iteration. By substituting the KKT conditions (5) into the Lagrangian (4), the dual problem is to maximize the following objective function:
$$\max_{\alpha} L(\alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j Q(x_i, x_j) + \sum_{i=1}^{N} \alpha_i y_i, \tag{8}$$
where $Q(x_i, x_j) = K(x_i, x_j) + \sigma_{ij}/\gamma$, and $\sigma_{ij} = 1$ if $i = j$ and 0 otherwise (the Kronecker delta, not to be confused with the kernel width $\sigma$). The SMO algorithm for (8) is sketched in the following.

Algorithm 1

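SMO algorithm for (8) is as follows.
(1) Set $k = 1$ and take $\alpha^1 = 0$ as the initial feasible solution.
(2) If the stop criterion is satisfied, stop. If not, find a one-element working set $B = \{i\} \subset \{1, \ldots, N\}$. Define $D \equiv \{1, \ldots, N\} \setminus B$, and let $\alpha_B^k$ and $\alpha_D^k$ be the subvectors of $\alpha^k$ corresponding to $B$ and $D$, respectively.
(3) Solve the following subproblem with the variable $\alpha_B$:
$$\max_{\alpha_B} \; -\frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_D^k)^T \end{bmatrix} \begin{bmatrix} Q_{BB} & Q_{BD} \\ Q_{DB} & Q_{DD} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_D^k \end{bmatrix} + \begin{bmatrix} y_B^T & y_D^T \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_D^k \end{bmatrix}, \tag{9}$$
where $\begin{bmatrix} Q_{BB} & Q_{BD} \\ Q_{DB} & Q_{DD} \end{bmatrix}$ is a permutation of the matrix $Q$.
(4) Set $\alpha_B^{k+1}$ to be the optimal solution of (9) and $\alpha_D^{k+1} \equiv \alpha_D^k$. Set $k \leftarrow k + 1$ and go back to step (2).
To find the working set $B$, we usually check whether the KKT conditions are violated. The KKT conditions for the dual problem (8) are $\partial L / \partial \alpha_i = 0$, which lead to $y_i - \sum_j \alpha_j Q(x_i, x_j) = 0$, $i = 1, 2, \ldots, N$. If we define
$$f_i = y_i - \sum_{j=1}^{N} \alpha_j Q(x_i, x_j), \tag{10}$$
then the KKT optimality condition is violated if there exists any index $i$ such that $f_i \neq 0$. The SMO algorithm for (8) converges to the optimum when $f_i \to 0$ for all $i$. A simple illustration of this is shown in Figure 1.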
Figure 1

SMO sketch map, where $f_i^k$ denotes the value of $f_i$ at the $k$th iteration, for all $i$.
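
The one-element scheme of Algorithm 1 can be sketched as follows. The working-set rule used here (pick the index with the largest $|f_i|$), the tolerance, and the function name `smo_one_element` are assumptions of this sketch; Algorithm 1 itself leaves the selection rule open. For a single index $i$, the subproblem (9) is a one-dimensional concave quadratic whose maximizer moves $\alpha_i$ by $f_i / Q(x_i, x_i)$.

```python
import numpy as np


def smo_one_element(Q, y, tol=1e-8, max_iter=100_000):
    """Coordinate-wise solver for Q alpha = y, Q symmetric positive definite.

    Working-set rule (an assumption of this sketch): pick the index with the
    largest |f_i|, where f_i = y_i - sum_j alpha_j Q_ij as in (10).
    Returns the multipliers and the number of iterations used.
    """
    alpha = np.zeros(len(y))        # step (1): alpha = 0
    f = y.astype(float).copy()      # f = y - Q @ alpha equals y at the start
    for k in range(max_iter):
        i = int(np.argmax(np.abs(f)))   # most violated KKT condition
        if abs(f[i]) <= tol:            # step (2): stop criterion
            return alpha, k
        t = f[i] / Q[i, i]              # exact maximizer of subproblem (9)
        alpha[i] += t                   # step (4): only alpha_i changes
        f -= t * Q[:, i]                # every f_j changes because Q is dense
    return alpha, max_iter
```

Each iteration touches one column of $Q$, so the full matrix never has to be factorized; the price is that many iterations may be needed.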

Since only one component is updated per iteration, the decomposition method can be quite costly and suffers from slow convergence. For this reason, many researchers have improved the SMO algorithm. For example, Chen et al. improved the SMO algorithm by using shrinking and caching techniques [18]; Barbero et al. presented a cycle-breaking acceleration of SVM training [16]; and Lin et al. provided a three-parameter sequential minimal optimization for support vector machines [19]. As mentioned by Barbero et al. [16], the SMO algorithm is not free of cycle-related problems. For all $i$ in the working set $B$, if $\alpha_i$ is optimized with a step $t$ ($t > 0$ or $t < 0$) in a single direction per iteration, the number of cycles in the SD-SMO algorithm will be reduced. We now detail the SD-SMO formulation in the LS-SVM training process. Define
$$F_i = f_i^2, \quad i = 1, 2, \ldots, N. \tag{11}$$
Then, the KKT optimality condition is violated if there exists any index $i$ such that $F_i \neq 0$. The SD-SMO algorithm works by optimizing only one $\alpha_i$ at each iteration and keeping the others fixed; that is, $\alpha_i$ is adjusted by a sign-invariant step $t$ ($t > 0$ or $t < 0$) per iteration as follows:
$$\alpha_i^{k+1} = \alpha_i^k + t. \tag{12}$$
The update of $\alpha_i$ changes all the $f_j$ as
$$f_j^{k+1} = f_j^k - t\, Q(x_i, x_j), \quad j = 1, 2, \ldots, N, \tag{13}$$
and therefore the value of each $F_j$ changes as well. At each iteration we need to be sure that the sign of $f_j$ does not change, that is, if $f_j^k \geq 0$ (or $\leq 0$), then $f_j^{k+1} \geq 0$ (or $\leq 0$). As $k$ increases, $f_j^k \to 0^+$ (or $0^-$) with the sign kept invariant. A simple illustration of this is shown in Figure 2.
Figure 2

SD-SMO sketch map, where $f_j^k$ denotes the value of $f_j$ at the $k$th iteration, for all $j$.

To derive the optimal step $t$ and the termination condition of the iteration, we define $F_i(t)$ as in (14). Because $f_i \to 0$ as $k \to \infty$, $F_i(t) \leq F_i(0)$. Therefore, let $\Delta F_i = -(F_i(t) - F_i(0))$, which can be written as in (15). The optimal step is obtained by maximizing $\Delta F_i$ as
$$t_{\mathrm{opt}} = \frac{f_i}{Q(x_i, x_i)}, \tag{16}$$
and the optimal step $t_{\mathrm{opt}}$ induces the change of $F_i$ given in (17). Hence we can choose an index $j$ that has the maximum value of $F_j / 2Q(x_j, x_j)$ and update $\alpha_j$ by (12) and (16). Suppose $F(\alpha) = (f_1, f_2, \ldots, f_j, \ldots, f_N)$ and $\|F(\alpha^k)\|_2^2 = \sum_j F_j^k$; then $\{\|F(\alpha^k)\|_2^2\}$ is a decreasing sequence. In fact, as $k \to \infty$, $\|F(\alpha^k)\|_2^2 \to 0$. Therefore $\|F(\alpha^k)\|_2^2$ can be used as a termination criterion for the iterative algorithm:
$$\|F(\alpha^k)\|_2^2 \leq \varepsilon, \tag{18}$$
where $\varepsilon$ is a positive constant. The steps of the SD-SMO algorithm are listed in Algorithm 2.
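
As a cross-check, the step size $f_i / Q(x_i, x_i)$ and the selection quantity $F_j / 2Q(x_j, x_j)$ are what one obtains by changing a single coordinate of the dual objective (8). The derivation below is a reader's sketch under that assumption and is not necessarily the authors' own definition of the function of $t$ used above; $e_i$ denotes the $i$th unit vector, a notation introduced only for this sketch.

```latex
\begin{aligned}
L(\alpha + t\,e_i) - L(\alpha)
  &= t\Bigl(y_i - \sum_{j} \alpha_j Q(x_i, x_j)\Bigr) - \tfrac{1}{2}\,t^2\,Q(x_i, x_i)
   = t f_i - \tfrac{1}{2}\,t^2\,Q(x_i, x_i), \\
\frac{d}{dt}\Bigl(t f_i - \tfrac{1}{2}\,t^2\,Q(x_i, x_i)\Bigr) = 0
  &\;\Longrightarrow\; t_{\mathrm{opt}} = \frac{f_i}{Q(x_i, x_i)}, \\
L(\alpha + t_{\mathrm{opt}}\,e_i) - L(\alpha)
  &= \frac{f_i^2}{2\,Q(x_i, x_i)}.
\end{aligned}
```

If $F_i$ is read as $f_i^2$, the last line is exactly $F_i / 2Q(x_i, x_i)$, and since each $f_j$ changes by $-t\,Q(x_i, x_j)$, the selected $f_i$ becomes zero after the step.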

Algorithm 2

SD-SMO algorithm for (8) is as follows.
(1) Set $k = 1$ and choose $\alpha^1$ such that $f_j^1 \geq 0$ (or $f_j^1 \leq 0$) for all $j = 1, 2, \ldots, N$.
(2) If $\alpha^k$ satisfies (18), stop. If not, select $p_1 = \arg\max_j \left(F_j^k / 2Q(x_j, x_j)\right)$.
(3) Update $\alpha_{p_1}$ using $t_{\mathrm{opt}} = f_{p_1} / Q(x_{p_1}, x_{p_1})$ and (12).
(4) While $f_j^{k+1} \geq 0$ ($f_j^{k+1} \leq 0$), set $k = k + 1$ and go back to step (2).
One theoretical property of the SD-SMO algorithm is presented in the following.

Theorem 3

The sequence $\{\alpha^k\}$ generated by the SD-SMO algorithm converges to the global optimal solution of (8).

Proof

According to the definition of $\|F(\alpha^k)\|_2^2$ and combining (16) and (17), equation (19) holds. The positive-definite kernel function implies $K(x_i, x_i) \geq 0$; furthermore, $\|\alpha^{k+1} - \alpha^k\|_2^2 = (t_{\mathrm{opt}})^2$, and equation (20) is obtained. Equality (20) yields that $\{\|F(\alpha^k)\|_2^2\}$ is a decreasing sequence. Together with $\|F(\alpha^k)\|_2^2 \geq 0$, we have that $\{\|F(\alpha^k)\|_2^2\}$ converges. Applying (20) again, we get that $\{\alpha^{k+1} - \alpha^k\}$ converges to 0 as $k \to \infty$. Since $F_j$ ($\forall j$) is a positive-definite quadratic form, $\|F(\alpha)\|_2^2 = \sum_j F_j$ is a positive-definite quadratic form too. Therefore, the set $\{\alpha \mid \|F(\alpha)\|_2^2 \leq \|F(\alpha^0)\|_2^2\}$ is a compact set. $\{\alpha^k\}$ lies in this set, so it is a bounded sequence.
Let $\bar{\alpha}$ be the limit point of any convergent subsequence $\{\alpha^k\}$, $k \in \Gamma$. For all $j$, $\lim_{k \in \Gamma,\, k \to \infty} F_j(\alpha^k) = F_j(\bar{\alpha})$ by continuity. According to the definition of $F(\alpha^k)$, $0 \leq F_j(\alpha^k) \leq \|F(\alpha^k)\|_2^2$. Inequality (18) yields $\lim_{k \to \infty} \|F(\alpha^k)\|_2^2 = 0$; furthermore, for all $j$, $\lim_{k \to \infty} F_j(\alpha^k) = 0$. Since $F_j(\alpha^k) \to F_j(\bar{\alpha})$ along $\Gamma$, we have $F_j(\bar{\alpha}) = 0$ and hence $f_j(\bar{\alpha}) = 0$. From the KKT conditions, $\bar{\alpha}$ is the global optimal solution of (8).
Since $L(\alpha)$ is strictly concave, (8) has a unique global solution, which we denote as $\alpha^*$. Assume that $\{\alpha^k\}$ does not converge to $\alpha^*$. Then there exist $\epsilon > 0$ and an infinite subset $\bar{\Gamma}$ such that for all $k \in \bar{\Gamma}$, $\|\alpha^k - \alpha^*\| > \epsilon$. Because $\{\alpha^k\}$, $k \in \bar{\Gamma}$, lies in a compact set, it has a convergent subsequence. Without loss of generality, we assume its limit to be $\bar{\alpha}$. Thus, $\|\bar{\alpha} - \alpha^*\| \geq \epsilon$. Since $\bar{\alpha}$ is a global optimal solution of (8), this contradicts the fact that $\alpha^*$ is the unique global optimal solution. The proof of the theorem is completed.
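
Algorithm 2 can be sketched as follows. Several points are assumptions of this sketch rather than statements from the paper: labels in {-1, +1}; a kernel with nonnegative entries and unit diagonal (such as RBF); the starting point $\alpha = -1$ in every component, which under those conditions gives $f_j > 0$ for all $j$; and, most importantly, capping the step so that no $f_j$ crosses zero, which is only one possible way to enforce the sign condition of step (4). With this cap the loop may stop before (18) is met, so the sketch illustrates the one-sided descent, not the convergence claim of Theorem 3.

```python
import numpy as np


def sd_smo(Q, y, eps=1e-6, max_iter=10_000):
    """Single-direction coordinate updates that keep every f_j >= 0.

    Q is K + I / gamma with nonnegative entries; y holds labels in {-1, +1}.
    Returns alpha, the residual vector f, and the history of ||F(alpha)||^2.
    """
    n = len(y)
    alpha = -np.ones(n)             # assumed start: f = y + Q @ 1 > 0
    f = y - Q @ alpha
    diag = np.diag(Q)
    history = [float(f @ f)]        # ||F(alpha^k)||_2^2 = sum_j f_j^2
    for _ in range(max_iter):
        if history[-1] <= eps:      # termination criterion (18)
            break
        p = int(np.argmax(f**2 / (2.0 * diag)))  # max F_j / 2Q(x_j, x_j)
        col = Q[:, p]
        pos = col > 0
        # Optimal step (16), capped (assumption) so that no f_j turns negative.
        t = min(f[p] / diag[p], float(np.min(f[pos] / col[pos])))
        if t <= 0.0:                # no sign-preserving progress is possible
            break
        alpha[p] += t               # update (12)
        # Update (13); the clamp only removes rounding noise below zero.
        f = np.maximum(f - t * col, 0.0)
        history.append(float(f @ f))
    return alpha, f, history
```

Because every update subtracts a nonnegative multiple of a nonnegative column from a nonnegative residual, the recorded values of $\|F(\alpha^k)\|_2^2$ cannot increase in this capped variant.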

4. Numerical Experiments

In this section, we conduct experiments under the framework of Algorithm 2 to check whether SD-SMO is really faster than SMO. Two techniques for working set selection in SMO-type decomposition methods for LS-SVM classifiers are considered [20]: first order SMO (FO-SMO), which uses first order information to achieve fast convergence, and second order SMO (SO-SMO), which uses second order information. Two groups of experiments have been carried out to compare SD-SMO with these two algorithms. All methods are implemented in MATLAB and executed on a personal computer with an Intel(R) Core(TM) i3 2.53 GHz processor, 2.00 GB of memory, and the Windows 7 operating system. For all algorithms, the optimization process is terminated when the maximal violation of the KKT conditions is within $\varepsilon = 0.001$. For simplicity, we consider only the Gaussian kernel $k(x, x_i) = \exp\{-\|x - x_i\|_2^2 / 2\sigma^2\}$ to construct the LS-SVM.

4.1. The Comparison of SD-SMO with First Order SMO

In this section, we compare SD-SMO with first order SMO on four benchmark datasets to evaluate the performance of the proposed method. We compare the two methods in terms of computational cost, which is measured by the number of iterations. The examples introduced by Keerthi and Shevade [10] are used. The datasets used for this purpose are Banana, Image, Waveform, and Splice. For each dataset, the value of $\sigma^2$ is determined by five-fold cross validation on a small random subset. In the first experiment, we vary $\gamma$ over a small range because extremely small and large $\gamma$ values are usually of little interest. We try the following nine $\gamma$ values: $2^i$, $i = -4, -3, \ldots, 3, 4$. Table 1 gives the computational costs for the four datasets as functions of $\gamma$ at the point where the optimization process terminates.
Table 1

Computational costs for first order SMO (FO-SMO) and SD-SMO algorithms.

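Kernel widths: Banana $\sigma^2 = 1.8221$; Image $\sigma^2 = 2.7183$; Waveform $\sigma^2 = 24.5325$; Splice $\sigma^2 = 29.9612$.

| $\log_2 \gamma$ | Banana FO-SMO | Banana SD-SMO | Image FO-SMO | Image SD-SMO | Waveform FO-SMO | Waveform SD-SMO | Splice FO-SMO | Splice SD-SMO |
|---|---|---|---|---|---|---|---|---|
| −4 | 0.4460 | 0.3548 | 0.4838 | 0.1104 | 0.5375 | 0.2234 | 0.4375 | 0.3166 |
| −3 | 0.5023 | 0.3542 | 0.5150 | 0.1191 | 0.5854 | 0.2499 | 0.4683 | 0.3152 |
| −2 | 0.6379 | 0.3381 | 0.5844 | 0.1217 | 0.6109 | 0.2343 | 0.5066 | 0.3029 |
| −1 | 0.8733 | 0.2632 | 0.7413 | 0.1248 | 0.6682 | 0.2245 | 0.6060 | 0.2662 |
| 0 | 1.3545 | 0.2231 | 0.9816 | 0.1283 | 0.7440 | 0.1879 | 0.7738 | 0.2105 |
| 1 | 2.3782 | 0.1607 | 1.4816 | 0.1326 | 0.8512 | 0.1672 | 1.3078 | 0.1775 |
| 2 | 2.4793 | 0.0679 | 1.8371 | 0.2927 | 0.9569 | 0.1490 | 1.3537 | 0.1675 |
| 3 | 2.6521 | 0.0486 | 2.3751 | 0.2136 | 1.0829 | 0.1369 | 1.7175 | 0.1481 |
| 4 | 2.8906 | 0.0231 | 2.9305 | 0.2205 | 1.2195 | 0.1344 | 2.1520 | 0.1402 |

Note: each unit corresponds to $10^4$ iterations.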
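
The sweep over $\gamma = 2^i$ behind a table of this kind can be sketched as below. This is only an illustration of the protocol: it uses a generic one-element SMO with max-$|f_i|$ selection (an assumption, not the FO-SMO or SD-SMO code of the paper), synthetic stand-in data instead of the benchmark datasets, and an assumed $\sigma^2 = 1$. The names `count_iterations` and `gamma_sweep` are hypothetical.

```python
import numpy as np


def count_iterations(Q, y, eps=1e-3, max_iter=200_000):
    """Run one-element SMO on Q alpha = y and return the iterations used.

    Only the residuals f_i are tracked; alpha is not needed for counting.
    """
    f = y.astype(float).copy()      # residuals for the start point alpha = 0
    for k in range(max_iter):
        i = int(np.argmax(np.abs(f)))
        if abs(f[i]) <= eps:        # maximal KKT violation within eps
            return k
        f -= (f[i] / Q[i, i]) * Q[:, i]
    return max_iter


def gamma_sweep(K, y, exponents=range(-4, 5)):
    """Iteration count for each gamma = 2**i, i = -4, ..., 4."""
    n = len(y)
    return {i: count_iterations(K + np.eye(n) / 2.0**i, y) for i in exponents}


# Synthetic stand-in data (not one of the benchmark datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dist / (2.0 * 1.0))  # Gaussian kernel, sigma^2 = 1 (assumed)
costs = gamma_sweep(K, y)
```

A larger $\gamma$ shrinks the ridge term $1/\gamma$ on the diagonal of $Q$, which worsens its conditioning; this is one plausible reason why the cost of a plain coordinate method grows with $\gamma$.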

As a basis for the comparisons, Table 1 shows the computational costs of the first order SMO and SD-SMO algorithms at different values of the parameter $\gamma$. For the first order SMO algorithm, the computational cost increases as $\gamma$ increases. For the SD-SMO algorithm this is not the case; see, for instance, its computational cost on the Banana and Waveform datasets, which generally falls as $\gamma$ grows. From Table 1, we can see that the number of iterations of the SD-SMO algorithm is much smaller than that of the first order SMO algorithm, especially for the Image dataset. To further show the performance of the SD-SMO algorithm, Tables 2 and 3 are given. The tables report the training time and the generalization performance of the first order SMO and SD-SMO algorithms on the four benchmark datasets. The generalization performance is measured by the classification accuracy on an independent test set for each dataset.
Table 2

Training time (in seconds) and classification accuracy in parentheses for first order SMO (FO-SMO) and SD-SMO algorithms.

Kernel widths: Banana $\sigma^2 = 1.8221$; Image $\sigma^2 = 2.7183$.

| $\log_2 \gamma$ | Banana FO-SMO | Banana SD-SMO | Image FO-SMO | Image SD-SMO |
|---|---|---|---|---|
| −4 | 43.6589 (0.8675) | 35.947 (0.895) | 7.90140 (0.9012) | 2.47260 (0.9214) |
| −3 | 47.3385 (0.8753) | 35.3045 (0.8712) | 8.41620 (0.9156) | 2.50380 (0.9324) |
| −2 | 59.8110 (0.8832) | 34.3882 (0.8653) | 9.76570 (0.9223) | 2.59740 (0.9348) |
| −1 | 88.6335 (0.8889) | 28.9070 (0.8377) | 11.7874 (0.9382) | 2.57400 (0.9358) |
| 0 | 129.505 (0.8877) | 22.5036 (0.8667) | 15.3895 (0.9430) | 2.58180 (0.9410) |
| 1 | 220.437 (0.8900) | 16.1617 (0.8502) | 23.4157 (0.9521) | 2.60520 (0.9511) |
| 2 | 229.891 (0.8943) | 8.42400 (0.7853) | 31.1026 (0.9588) | 3.93120 (0.9602) |
| 3 | 238.068 (0.8977) | 3.47140 (0.7032) | 41.611 (0.967) | 4.2979 (0.963) |
| 4 | 259.36 (0.898) | 2.02800 (0.6126) | 50.6560 (0.9616) | 4.50900 (0.9578) |

Table 3

Training time (in seconds) and classification accuracy in parentheses for first order SMO (FO-SMO) and SD-SMO algorithms.

Kernel widths: Waveform $\sigma^2 = 24.5325$; Splice $\sigma^2 = 29.9612$.

| $\log_2 \gamma$ | Waveform FO-SMO | Waveform SD-SMO | Splice FO-SMO | Splice SD-SMO |
|---|---|---|---|---|
| −4 | 43.4541 (0.9094) | 35.4434 (0.8404) | 31.9303 (0.8649) | 44.4478 (0.6507) |
| −3 | 46.5039 (0.9108) | 36.1884 (0.8918) | 33.2688 (0.8736) | 44.0110 (0.7061) |
| −2 | 48.8049 (0.9114) | 37.635 (0.908) | 36.0175 (0.8910) | 41.9830 (0.8944) |
| −1 | 52.907 (0.912) | 35.3499 (0.8948) | 43.6085 (0.8963) | 37.730 (0.911) |
| 0 | 58.9295 (0.9096) | 29.9522 (0.8974) | 55.2503 (0.9037) | 33.4865 (0.8866) |
| 1 | 67.2830 (0.9071) | 26.6060 (0.8955) | 72.543 (0.911) | 26.1801 (0.8826) |
| 2 | 79.3185 (0.9068) | 24.5008 (0.8859) | 94.8392 (0.9060) | 23.3596 (0.8769) |
| 3 | 86.3930 (0.9004) | 22.9251 (0.8876) | 121.219 (0.9054) | 21.7434 (0.8750) |
| 4 | 95.7465 (0.9100) | 22.4251 (0.8860) | 153.243 (0.9032) | 21.0508 (0.8746) |

From Tables 2 and 3, we can see that the generalization capabilities of both methods are comparable, but the training time of the SD-SMO algorithm is shorter than that of the first order SMO algorithm. For instance, in the case of the Image dataset, the training time of the first order SMO algorithm at its best generalization performance is 41.6108 s, roughly ten times the cost of the SD-SMO algorithm. The classification accuracy on the Image dataset with the SD-SMO algorithm is 0.963, almost equal to that of the first order SMO algorithm. Consequently, the efficacy and feasibility of the proposed SD-SMO algorithm are superior to those of the first order SMO algorithm for LS-SVMs.
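
The "roughly ten times" figure can be checked with a few lines of arithmetic on the Image columns of Table 2. The snippet below only divides the reported training times; the variable names are illustrative.

```python
# Training times (seconds) for the Image dataset, copied from Table 2.
log2_gamma = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
fo_times = [7.90140, 8.41620, 9.76570, 11.7874, 15.3895,
            23.4157, 31.1026, 41.611, 50.6560]
sd_times = [2.47260, 2.50380, 2.59740, 2.57400, 2.58180,
            2.60520, 3.93120, 4.2979, 4.50900]

# Ratio of FO-SMO time to SD-SMO time for each value of log2(gamma).
speedup = {g: fo / sd for g, fo, sd in zip(log2_gamma, fo_times, sd_times)}
for g, s in speedup.items():
    print(f"log2(gamma) = {g:+d}: FO-SMO / SD-SMO = {s:.2f}")
```

At $\log_2 \gamma = 3$, where first order SMO reaches its best accuracy on Image, the ratio is about 9.7, and it is smaller for the smaller values of $\gamma$.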

4.2. The Comparison of SD-SMO with Second Order SMO

To further explore the performance of the proposed method, we compare SD-SMO with second order SMO in a second set of experiments on the datasets Titanic, Heart, Breast Cancer, Thyroid, and Pima (available in [21]). We use the datasets provided in [21] to verify the generalization properties of the proposed method. Table 4 reports the number of iterations and the execution time per experiment, together with the misclassification rates.
Table 4

Number of iterations (in thousands), execution times (in seconds), and average misclassification rates for second order SMO (SO-SMO) and SD-SMO algorithms.

| Dataset | Iterations SO-SMO | Iterations SD-SMO | Execution time SO-SMO | Execution time SD-SMO | Misclassification rate SO-SMO | Misclassification rate SD-SMO |
|---|---|---|---|---|---|---|
| Titanic | 277.1512 | 59.7346 | 1129.2009 | 80.9348 | 23.5723 | 23.5612 |
| Heart | 5.8993 | 2.2315 | 10.3623 | 4.4652 | 16.1117 | 17.1092 |
| Cancer | 10.1908 | 4.1127 | 21.6765 | 9.0972 | 27.6643 | 27.8764 |
| Thyroid | 30.1537 | 17.7325 | 77.3341 | 52.5521 | 5.5123 | 5.6725 |
| Pima | 60.6751 | 30.7366 | 104.9616 | 69.8546 | 25.0155 | 25.7761 |

It can be seen that for these datasets it is better to use SD-SMO on Cancer, Pima, and Titanic. The results in Table 4 show that the biggest improvement with SD-SMO occurs for Titanic. This is further evidence for the previous observation that SD-SMO outperforms second order SMO on large-scale problems.
The final set of experiments aims to ascertain how well the SMO algorithm scales to large-scale datasets under the different working set selections. To test this, we use the datasets a8a and covtype.binary, available with several increasing numbers of patterns in [22]. In Figure 3, we plot the results for a8a with C = 2, σ² = 10 and for covtype.binary with C = 10, σ² = 10. As can be seen, the number of iterations scales linearly with the training set size in both cases. Note that SD-SMO needs fewer iterations to converge, as expected, and that the reduction is greater for covtype.binary because of its larger value of C.
Figure 3

Variation of the number of iterations with training set size for a8a (a) and covtype (b).

5. Conclusion

In this paper, a new algorithm, SD-SMO, is proposed. It can be used to select the working set for LS-SVM classifier training, and its asymptotic convergence is proved theoretically. Based on the SMO formulation, our method makes effective use of a one-sided convergence path. The SD-SMO algorithm needs fewer iterations and kernel operations than the traditional SMO algorithm, so it converges faster. Simulation experiments have been carried out on four benchmark datasets. The empirical comparisons demonstrate that the SD-SMO algorithm is much more efficient in terms of computational time than first order and second order SMO, while there are no large differences in terms of accuracy.

1.  SMO algorithm for least-squares SVM formulations.

Authors:  S S Keerthi; S K Shevade
Journal:  Neural Comput       Date:  2003-02       Impact factor: 2.026

2.  An improved conjugate gradient scheme to the solution of least squares SVM.

Authors:  Wei Chu; Chong Jin Ong; S Sathiya Keerthi
Journal:  IEEE Trans Neural Netw       Date:  2005-03

3.  SMO-based pruning methods for sparse least squares support vector machines.

Authors:  Xiangyan Zeng; Xue-Wen Chen
Journal:  IEEE Trans Neural Netw       Date:  2005-11

4.  A study on SMO-type decomposition methods for support vector machines.

Authors:  Pai-Hsuen Chen; Rong-En Fan; Chih-Jen Lin
Journal:  IEEE Trans Neural Netw       Date:  2006-07

5.  Fast sparse approximation for least squares support vector machine.

Authors:  Licheng Jiao; Liefeng Bo; Ling Wang
Journal:  IEEE Trans Neural Netw       Date:  2007-05

6.  Working set selection using functional gain for LS-SVM.

Authors:  Liefeng Bo; Licheng Jiao; Ling Wang
Journal:  IEEE Trans Neural Netw       Date:  2007-09

7.  Design of a multiple kernel learning algorithm for LS-SVM by convex programming.

Authors:  Ling Jian; Zhonghang Xia; Xijun Liang; Chuanhou Gao
Journal:  Neural Netw       Date:  2011-03-12