
Online Sequential Projection Vector Machine with Adaptive Data Mean Update.

Lin Chen1, Ji-Ting Jia1, Qiong Zhang1, Wan-Yu Deng1, Wei Wei2.   

Abstract

We propose a simple online learning algorithm designed especially for high-dimensional data. The algorithm, referred to as the online sequential projection vector machine (OSPVM), is derived from the projection vector machine and can learn from data in one-by-one or chunk-by-chunk mode. In OSPVM, data centering, dimension reduction, and neural network training are integrated seamlessly. In particular, the model parameters, including (1) the projection vectors for dimension reduction, (2) the input weights, biases, and output weights, and (3) the number of hidden nodes, can be updated simultaneously. Moreover, only one parameter, the number of hidden nodes, needs to be determined manually, which makes the algorithm easy to use in real applications. OSPVM was compared on various high-dimensional classification problems against other fast online algorithms, including the budgeted stochastic gradient descent (BSGD) approach, the adaptive multihyperplane machine (AMM), the primal estimated subgradient solver (Pegasos), the online sequential extreme learning machine (OSELM), and SVD + OSELM (feature selection based on SVD is performed before OSELM). The results demonstrate the superior generalization performance and efficiency of OSPVM.

Year:  2016        PMID: 27143958      PMCID: PMC4838813          DOI: 10.1155/2016/5197932

Source DB:  PubMed          Journal:  Comput Intell Neurosci


1. Introduction

In many real applications, such as text mining, visual tracking, and dynamical interest perception, two problems always arise: (1) new data arrive sequentially and (2) the data lie in a high-dimensional space. For the first problem, many online sequential algorithms have been proposed [1-13]. SGBP [1] is one of the main variants of BP for sequential learning applications, in which the network parameters are learned iteratively on the basis of first-order information. Crammer and Lee [7] proposed a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. Hoi et al. [8] proposed an online multiple kernel classification algorithm which learns a kernel-based prediction function by selecting a subset of predefined kernel functions in an online learning fashion. Wang et al. [9] proposed a Fourier online gradient descent algorithm that applies random Fourier features for approximating kernel functions. Zhao et al. [14] proposed a fast bounded online gradient descent algorithm for scalable kernel-based applications that aims to constrain the number of support vectors by a predefined budget. Zhang et al. [11] proposed an online kernel learning algorithm which measures the difficulty of correctly classifying a training example by the derivative of a smooth loss function and, via a sampling scheme, gives a difficult example more chance of becoming a support vector than an easy one. Shalev-Shwartz et al. [12] proposed a simple and effective stochastic subgradient descent algorithm, the primal estimated subgradient solver (Pegasos), for solving the optimization problem cast by Support Vector Machines (SVMs). Wang et al. [13] proposed an adaptive multihyperplane machine (AMM) model that consists of a set of linear hyperplanes (weights), each assigned to one of the multiple classes, and predicts based on the associated class of the weight that provides the largest prediction. Wang et al. [10] proposed a budgeted stochastic gradient descent (BSGD) approach for training SVMs which keeps the number of support vectors bounded during training through several budget maintenance strategies. OSELM [15] is a very fast sequential algorithm derived from the batch extreme learning machine (ELM) [16], in which the input weights are randomly generated and the output weights are determined by incremental least squares.

The aforementioned algorithms each have their own advantages in solving online learning problems for new data. However, they all treat data preprocessing as independent of the online learning of the model. In contrast to these approaches, we propose an online learning algorithm, OSPVM (online sequential projection vector machine), based on batch-PVM, which combines data preprocessing (data centering and dimension reduction) and model learning as a whole. In our earlier work we proposed incremental PVM [17], which can learn PVM incrementally; however, it cannot update the data mean automatically. Updating the data mean is very important for improving the generalization performance of OSPVM. When new samples arrive, if the data mean is not updated, the components (features) obtained by SVD/PCA will shift and degrade the generalization performance. The proposed OSPVM algorithm enjoys three properties: (1) the mean of the data can be updated dynamically, (2) the projection vectors can be updated incrementally to capture more useful features from new data, and (3) the number of hidden nodes can be adjusted adaptively to ensure enough learning capability.

The paper is organized as follows. Section 2 gives a brief review of the batch-PVM. Section 3 presents the derivation of OSPVM. Performance evaluation of OSPVM on benchmark problems from different areas is shown in Section 4. Conclusions based on the study and experiments are given in Section 5.

2. Review of Projection Vector Machine

This section briefly reviews the batch-PVM developed by Deng et al. [18] to provide the necessary background for the development of OSPVM in Section 3. For readability, the following notation is used: $[A, B]$: horizontal concatenation of matrices $A$ and $B$; $[A; B]$: vertical concatenation of matrices $A$ and $B$; $\bar{A}$: mean vector of $A$; $\mathbf{1}_{1\times n} = [1, 1, \ldots, 1]_{1\times n}$.

2.1. Single Hidden Layer Feedforward Neural Network (SLFN)

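For $n$ arbitrary distinct samples $\aleph = \{(x_j, t_j)\}_{j=1}^{n}$, where $x_j = [x_{j1}, x_{j2}, \ldots, x_{jm}]^T \in R^m$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jk}]^T \in R^k$, a standard SLFN with $\tilde{N}$ hidden nodes and activation function $g(x)$ is mathematically modeled as

$$\sum_{i=1}^{\tilde{N}} \beta_i\, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, n, \tag{1}$$

where $w_i = [w_{i1}, w_{i2}, \ldots, w_{im}]^T \in R^m$ is the input weight vector connecting the $i$th hidden node and the input nodes, $b_i \in R$ is the threshold of the $i$th hidden node, and $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{ik}]^T \in R^k$ is the output weight vector connecting the $i$th hidden node and the output nodes. $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. If $b_i$ is treated as an input weight and denoted as $w_{i(m+1)}$, then $w_i$ can be extended to $w_i = [w_{i1}, w_{i2}, \ldots, w_{im}, w_{i(m+1)}]^T \in R^{m+1}$ and the sample $x_j$ is extended to $[x_j; 1]$. Equation (1) can be transformed as

$$\sum_{i=1}^{\tilde{N}} \beta_i\, g(w_i \cdot [x_j; 1]) = t_j, \quad j = 1, \ldots, n. \tag{2}$$

The above $n$ equations can be written compactly as

$$\beta\, g\left(W^T [X; \mathbf{1}_{1\times n}]\right) = T, \tag{3}$$

where $W = [w_1, \ldots, w_{\tilde{N}}] \in R^{(m+1)\times\tilde{N}}$, $\beta = [\beta_1, \ldots, \beta_{\tilde{N}}] \in R^{k\times\tilde{N}}$, $X = [x_1, \ldots, x_n]$, and $T = [t_1, \ldots, t_n]$. To train an SLFN, one may wish to find specific $W$, $\beta$ that minimize the following cost function:

$$\xi(W, \beta) = \left\| \beta\, g\left(W^T [X; \mathbf{1}_{1\times n}]\right) - T \right\|^2. \tag{4}$$

Gradient-based learning algorithms [19] are generally used to search for $(W, \beta)$ by minimizing $\xi(W, \beta)$, but they are time-consuming and may stop at a local minimum. The extreme learning machine (ELM) [16, 20] randomly chooses the input weights $W$ and analytically determines the output weights $\beta$ by the Moore-Penrose generalized inverse. ELM can learn hundreds of times faster than gradient-based learning algorithms. However, for high-dimensional, small-sample data, ELM becomes seriously unstable, especially when the data is sparse (there are many zero features). To tackle this problem, we proposed the batch projection vector machine (Batch-PVM) [18].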

2.2. Batch Projection Vector Machine (Batch-PVM)

Batch-PVM combines SLFN with SVD seamlessly: the input weights of the SLFN are calculated from the SVD. Given data $X$, through data centralization and extension, the data is transformed as

$$\tilde{X} = [X - \bar{X}\mathbf{1}_{1\times n}; \mathbf{1}_{1\times n}], \tag{5}$$

and its low-rank SVD is

$$\tilde{X} \approx U_d \Lambda_d V_d^T, \tag{6}$$

where $d$ is the truncated rank and $U_d$ contains the projection vectors by which the data is mapped into a low-dimensional space:

$$Z = U_d^T \tilde{X}. \tag{7}$$

Since the role of the input weights of the SLFN can be treated as dimension reduction, they can be directly obtained by

$$W = U_d. \tag{8}$$

Naturally, the number of hidden nodes is determined by

$$\tilde{N} = d. \tag{9}$$

The problem becomes a linear problem, and thus the output weights $\beta$ can be obtained by

$$\beta = T \left( g(W^T \tilde{X}) \right)^{\dagger}, \tag{10}$$

where $\dagger$ denotes the Moore-Penrose generalized inverse. The experimental results on many classification and regression problems show that Batch-PVM is faster and more accurate than the familiar two-stage methods in which dimension reduction and SLFN training are independent. The batch-PVM assumes that all the training data (samples) are available. In real applications, however, some training data has been accumulated while new data keep arriving chunk-by-chunk or one-by-one (a special case of chunk). The batch-PVM has to be modified for this case so that it can learn online sequentially [21, 22].
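
As a rough illustration of the batch-PVM pipeline described above, the following NumPy sketch centers and extends the data, takes a rank-d SVD, uses the leading left singular vectors as input weights, and solves for the output weights with a pseudoinverse. The function names (`fit_batch_pvm`, `predict`), the one-sample-per-column layout, and the toy data are assumptions made for illustration; this is not the authors' MATLAB implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_batch_pvm(X, T, d):
    """X: (m, n) inputs, one sample per column; T: (k, n) targets; d: truncated rank."""
    n = X.shape[1]
    mean = X.mean(axis=1, keepdims=True)
    # Centering plus an extension row of ones (assumed form of the extended data).
    X_ext = np.vstack([X - mean, np.ones((1, n))])
    U, _, _ = np.linalg.svd(X_ext, full_matrices=False)
    W = U[:, :d]                      # projection vectors reused as input weights
    H = sigmoid(W.T @ X_ext)          # hidden-layer outputs, shape (d, n)
    beta = T @ np.linalg.pinv(H)      # least-squares output weights, shape (k, d)
    return mean, W, beta

def predict(model, X):
    mean, W, beta = model
    X_ext = np.vstack([X - mean, np.ones((1, X.shape[1]))])
    return beta @ sigmoid(W.T @ X_ext)

# Toy usage: two well-separated classes in 20 dimensions (hypothetical data).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
X = rng.normal(0.0, 0.3, size=(20, 60)) + np.where(labels == 0, -1.0, 1.0)
T = np.eye(2)[:, labels]              # one-hot targets, shape (2, 60)
model = fit_batch_pvm(X, T, d=3)
accuracy = np.mean(predict(model, X).argmax(axis=0) == labels)
```

In this sketch the number of hidden nodes equals the truncated rank `d`, so the rank is the only structural parameter to choose.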

3. The Proposed Online Sequential Algorithm

The seamless combination of dimension reduction and SLFN training facilitates the design of sequential online learning. Once the SVD is updated for new samples, the dimension reduction projection matrix and all the parameters of SLFN can be updated conveniently.

3.1. Data Mean and Projection Vectors Update

Assume that $n$ training samples $\aleph = \{(x_i, t_i)\}_{i=1}^{n}$ have been available so far; the inputs and targets are denoted as $A = [x_1, x_2, \ldots, x_n]$ and $T_A = [t_1, t_2, \ldots, t_n]$, respectively. By centralization (subtracting the mean of the inputs) and extension, the data can be transformed as

$$\tilde{A} = [A - \bar{A}\mathbf{1}_{1\times n}; \mathbf{1}_{1\times n}]. \tag{11}$$

The SVD of $\tilde{A}$ with the truncated rank $r$ is

$$\tilde{A} \approx U_r \Lambda_r V_r^T. \tag{12}$$

Assume that the $k$th chunk of data $\aleph_k = \{(x_i, t_i)\}_{i=1}^{n_k}$ is presented, where the new inputs and targets are denoted as $B = [x_1, x_2, \ldots, x_{n_k}]$ and $T_B = [t_1, t_2, \ldots, t_{n_k}]$, respectively, and the horizontal concatenation of $A$ and $B$ is denoted as $C = [A, B]$. The update task is to get the new mean $\bar{C}$ and the SVD of $\tilde{C}$; that is,

$$\tilde{C} = [C - \bar{C}\mathbf{1}_{1\times(n+n_k)}; \mathbf{1}_{1\times(n+n_k)}] \approx U' \Lambda' V'^T. \tag{13}$$

Many sophisticated algorithms have been developed to efficiently update the SVD as more data arrive [23]. However, most approaches assume that the sample mean is fixed when updating the eigenbasis or, equivalently, that the data is inherently zero-mean. This assumption does not hold in many applications. New samples change the data mean, and thus the mean needs to be recomputed before updating the SVD. One approach, proposed by Hall et al. [24], considers the change of the mean while updating the SVD as one set of new data arrives. However, its high computational cost is a bottleneck in many applications. Here we extend the Sequential Karhunen-Loeve algorithm [25] to make it suitable for updating the SVD efficiently together with the mean.

First we update the mean. The mean vectors of $A$ and $B$ are $\bar{A}$ and $\bar{B}$, so the mean vector of $C$ is

$$\bar{C} = \frac{n}{n+n_k}\bar{A} + \frac{n_k}{n+n_k}\bar{B}. \tag{14}$$

It is not difficult to find that […] (15), where […] and […]. Since the SVD of $\tilde{A}$ is known, this means that we can compute the SVD of […] by the incremental algorithm [18]: […] (16). Denote […]; then we have […]. Therefore, […] (17). Substituting it into (15), we have […] (18). It is obvious that the SVD of […] can be calculated based on the SVD of […]. Performing QR-decomposition of […] gives […] (19). Substituting (19) into (18), we have […] (20). Performing SVD on […] gives […] (21). Substituting (21) into (20), we get the SVD of […]: […] (22).

Going back to the SVD of […], let […] be the component of […] orthogonal to $U$; that is, […] (23). We can get the following partitioned form: […] (24). Let $M$ = […]. The SVD of $M$ can be computed in constant time regardless of […]: […] (25). So we get the SVD of […]: […] (26).
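
The update described above can be illustrated with a different but well-known construction, which is not the authors' derivation. The sketch below updates the mean with the weighted average of the two means, appends one extra "mean-correction" column to the centered chunk, and applies a Sequential Karhunen-Loeve style step (projection, QR of the residual, SVD of a small matrix). It omits the extension row of ones and tracks only the left singular vectors and singular values; the function name `update_mean_and_svd` and the toy data are hypothetical.

```python
import numpy as np

def update_mean_and_svd(U, s, mean_a, n_a, B):
    """U: (m, r), s: (r,) from the SVD of A - mean_a; B: (m, n_b) new chunk."""
    n_b = B.shape[1]
    mean_b = B.mean(axis=1, keepdims=True)
    n_c = n_a + n_b
    mean_c = (n_a * mean_a + n_b * mean_b) / n_c        # weighted mean update
    # Extra column that accounts for the shift between the two means.
    correction = np.sqrt(n_a * n_b / n_c) * (mean_b - mean_a)
    B_hat = np.hstack([B - mean_b, correction])
    proj = U.T @ B_hat                                  # part inside span(U)
    resid = B_hat - U @ proj                            # part orthogonal to U
    Q, R = np.linalg.qr(resid)
    M = np.block([[np.diag(s), proj],
                  [np.zeros((R.shape[0], s.size)), R]])
    U_m, s_new, _ = np.linalg.svd(M, full_matrices=False)
    U_new = np.hstack([U, Q]) @ U_m                     # rotate the enlarged basis
    return U_new, s_new, mean_c, n_c

# Toy check against a batch SVD of the re-centered data (hypothetical data).
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 8)) + 3.0
B = rng.normal(size=(12, 5)) - 1.0
mean_a = A.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(A - mean_a, full_matrices=False)
U_new, s_new, mean_c, n_c = update_mean_and_svd(U, s, mean_a, A.shape[1], B)
C = np.hstack([A, B])
s_batch = np.linalg.svd(C - C.mean(axis=1, keepdims=True), compute_uv=False)
```

Without truncation, the leading values in `s_new` agree with `s_batch`. In an online setting one would keep only the leading columns of `U_new` and entries of `s_new` after each update, which makes the result approximate.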

3.2. Hidden Nodes Update Adaptively

The number of hidden nodes is very important for an SLFN [6]. Too many hidden nodes lead to overfitting, while too few hidden nodes may lead to insufficient learning capability. When new training samples are presented, hidden nodes should be added to ensure that the SLFN model possesses enough learning capability. OP-ELM [26] ranks the hidden nodes by multiresponse sparse regression (MRSR) and then makes the final decision on the appropriate number of nodes by Leave-One-Out (LOO) validation. I-ELM [27] adds random hidden nodes one-by-one until the residual error is smaller than a given threshold value. EI-ELM [28] selects the optimal random hidden node from a set of random hidden nodes before adding hidden nodes one-by-one. C-ELM [29] associates each model term with a regularization parameter; as a result, insignificant terms are automatically penalized and unselected. Since, in PVM, the number of hidden nodes is equal to the target low rank $d$ of the SVD, we adopt the accumulation ratio of principal components to determine the number of nodes. The accumulation ratio is defined by [30] as follows:

$$A(\tilde{N}) = \frac{\sum_{i=1}^{\tilde{N}} \sigma_i}{\sum_{i=1}^{r} \sigma_i},$$

where $\sigma_i$ denotes the $i$th singular value on the diagonal of the singular value matrix $\Lambda = \mathrm{Diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$, $\tilde{N}$ denotes the number of hidden nodes, and $r$ is the number of nonzero singular values. By choosing a proper value of $\tilde{N}$ that makes $A(\tilde{N}) \geq \theta$ hold, where $\theta$ is a given threshold value, we get the new number of hidden nodes. The new input weights can be updated by […]. The output weight is updated by […]. The algorithm can be summarized as Algorithm 1.
Algorithm 1

OSPVM algorithm.
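
The accumulation-ratio rule for choosing the number of hidden nodes can be sketched as follows. The sketch assumes the ratio is taken over the singular values themselves (not their squares) and picks the smallest count that reaches the threshold; the function name `choose_hidden_nodes` is hypothetical.

```python
def choose_hidden_nodes(singular_values, theta):
    """Return the smallest node count whose accumulation ratio reaches theta."""
    positive = sorted((s for s in singular_values if s > 0), reverse=True)
    total = sum(positive)
    running = 0.0
    for count, s in enumerate(positive, start=1):
        running += s
        if running / total >= theta:
            return count
    # Reached only through floating-point rounding or an empty input.
    return len(positive)

nodes_low = choose_hidden_nodes([4.0, 3.0, 2.0, 1.0], theta=0.85)
nodes_high = choose_hidden_nodes([4.0, 3.0, 2.0, 1.0], theta=0.95)
```

With the toy singular values above, the first three values account for 90% of the total, so a threshold of 0.85 keeps three nodes while 0.95 keeps all four.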

3.3. Theoretical Analysis: OSPVM versus OSELM

It is very difficult to prove strictly that OSPVM is better than OSELM. Here we give only some theoretical analysis, from the feature learning point of view, of why OSPVM is better than OSELM. As discussed in the literature [31], minimizing the reconstruction error is one very important condition for learning useful features. The reconstruction error of OSELM can be written as $E_{OSELM}$ = […], where $X \in R^{n\times m}$ is the inputs ($n$ is the number of instances and $m$ is the dimensionality of the data), $X_{rand} \in R^{m\times N}$ ($N$ is the number of hidden nodes) is the input weights, which are random values, and $\|\cdot\|_2$ is the Frobenius norm. The reconstruction error of OSPVM can be written as $E_{OSPVM}$ = […], where $W_{SVD}$ is the input weights, obtained by singular value decomposition (SVD) as follows: […]. Substituting into $E_{OSPVM}$, we have […], which is the error of the optimal rank-$N$ approximation of $X$; that is, it is the minimum of the reconstruction error with rank $N$. Therefore, the reconstruction error $E_{OSELM}$ of OSELM must be larger than that of OSPVM: $E_{OSELM} > E_{OSPVM}$. In summary, when OSELM and OSPVM have the same number of hidden nodes $N$, $E_{OSPVM}$ is always smaller than $E_{OSELM}$.

Another condition for obtaining better generalization performance is to make the hidden nodes as few as possible (Occam's Razor). Considering these two conditions, we can draw two inferences: (1) when OSPVM and OSELM have the same number of hidden nodes and satisfy $N \ll n$, the reconstruction error of OSPVM is smaller than that of OSELM ($E_{OSPVM} < E_{OSELM}$), which helps OSPVM to obtain better generalization performance in general, and (2) for the same reconstruction error, OSPVM always needs fewer hidden nodes than OSELM. According to Occam's Razor, OSPVM will therefore produce better generalization performance than OSELM with fewer hidden nodes.

Next, we briefly explain why OSPVM is better than SVD + OSELM in generalization performance in most cases. Similar to OSPVM, SVD + OSELM represents the data by SVD to obtain more useful features. However, SVD + OSELM discards the projection vectors obtained by SVD and still uses random values as input weights. In contrast, OSPVM uses the resulting projection vectors as input weights and thus avoids the instability of random weights. So OSPVM can produce better generalization performance than SVD + OSELM in most cases.
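
The reconstruction-error argument can be checked numerically with a small NumPy sketch. It makes a simplifying assumption that is not taken from the paper: the error is measured as the Frobenius norm of X minus its orthogonal projection onto the subspace spanned by the N weight vectors, ignoring centering and the activation function. The helper name `projection_error` and the random data are hypothetical.

```python
import numpy as np

def projection_error(X, W):
    """Frobenius error after projecting the rows of X onto span of W's columns."""
    Q, _ = np.linalg.qr(W)                 # orthonormal basis of the weight subspace
    return np.linalg.norm(X - X @ Q @ Q.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20)) @ rng.standard_normal((20, 20))   # n = 50, m = 20
N = 5                                       # number of hidden nodes

W_rand = rng.standard_normal((20, N))       # random input weights
_, s, Vt = np.linalg.svd(X, full_matrices=False)
W_svd = Vt[:N].T                            # top-N right singular vectors

err_rand = projection_error(X, W_rand)
err_svd = projection_error(X, W_svd)
# Error of the best rank-N approximation: the discarded singular values.
err_optimal = np.sqrt(np.sum(s[N:] ** 2))
```

Here `err_svd` coincides with `err_optimal`, and `err_rand` cannot fall below it, because no rank-N reconstruction beats the truncated SVD in the Frobenius norm.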

4. Performance Evaluation

4.1. Datasets and Experimental Settings

We select OSELM, BSGD, AMM, and Pegasos to compare with OSPVM on various UCI benchmark problems, as shown in Table 1. For fair comparison, feature selection by SVD is conducted before these algorithms are applied. The number of reduced dimensions $d$ and the number of hidden nodes are both gradually increased by an interval of 5, and the nearly optimal combinations are selected by cross-validation. The OSELM code is downloaded from the ELM homepage (http://www.ntu.edu.sg/home/egbhuang/). BSGD, AMM, and Pegasos are downloaded from the BudgetedSVM website (http://www.dabi.temple.edu/budgetedsvm/). OSPVM and SVD + OSELM are implemented by ourselves. For OSELM and Batch-PVM, the number of hidden nodes is gradually increased by an interval of 5 and the nearly optimal one is then selected by cross-validation. For OSPVM, the accumulation rate threshold $\theta$ is chosen in the range [0.95, 0.99] by cross-validation for each specific application. The activation functions for OSELM, OSPVM, SVD + OSELM, and Batch-PVM are all set to the sigmoid function $g(x) = 1/(1 + e^{-x})$. For BSGD we set the kernel to the Gaussian kernel $\mathcal{K}(x_i, x_j) = \exp(-(1/\sigma)\|x_i - x_j\|^2)$, the budget maintenance strategy is set to "merging," which is more accurate than the alternative "removing," and the number of budgeted support vectors is determined by cross-validation. For AMM, the limit on the number of weights per class is determined by cross-validation, and the learning rate is set to 0.0001. All the simulations were run in MATLAB 7 on a Pentium i7 920 @ 2.67 GHz CPU with 6 GB RAM. Average results of 20 trials of simulations for each fixed size of SLFN are obtained, and the best performance, including training accuracy, testing accuracy, training time, testing time, and t-test, is reported in this paper. The t-test [32] is used to evaluate the performance difference between the algorithms. Denoting the testing accuracies of the $i$th algorithm on the five datasets as $a_i = (a_{i1}, a_{i2}, \ldots, a_{i5})$, the $t$ value can be computed as follows:

$$t = \frac{\bar{a}_i - \bar{a}_j}{\sqrt{v_i^2/n_i + v_j^2/n_j}},$$

where $\bar{a}_i$ and $\bar{a}_j$ denote the mean values of $a_i$ and $a_j$, $v_i^2$ and $v_j^2$ represent the variances of $a_i$ and $a_j$, and $n_i$ and $n_j$ denote the numbers of datasets (here $n_i = n_j = 5$). By checking the t-table, we can obtain the significance level $p$. Notice that the smaller the $p$ value, the more significant the difference.
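
A minimal sketch of a two-sample t value computed from means and variances, as described above, is given below. It assumes the unpooled form with sample variances (which equals the pooled form when both algorithms are evaluated on the same number of datasets); the function name `t_value` and the accuracy lists are hypothetical and are not taken from the tables.

```python
from math import sqrt
from statistics import mean, variance

def t_value(acc_i, acc_j):
    """t statistic for the difference between two lists of accuracies."""
    n_i, n_j = len(acc_i), len(acc_j)
    v_i, v_j = variance(acc_i), variance(acc_j)   # sample variances (n - 1)
    return (mean(acc_i) - mean(acc_j)) / sqrt(v_i / n_i + v_j / n_j)

# Hypothetical accuracies of two algorithms on three datasets.
t_example = t_value([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

The significance level is then read from a t-table for the appropriate degrees of freedom, as in the text.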
Table 1

The specifications of the benchmark problems.

Dataset #Training set #Testing set #Attributes #Classes
Face 200 200 1600 10
Secom 1254 313 590 2
Arcene 400 500 10000 2
Dexter 1400 1200 20000 2
Multi.fea. 400 1600 650 10
News20 3993 15935 62061 20
Sector 3207 6412 55197 105

OSPVM is first compared with Batch-PVM, BSGD, AMM, and Pegasos in this section. The number of hidden nodes, training time, testing time, training accuracy, and testing accuracy are reported in Table 2. The t-test results, including the t value and significance level p, are summarized in Table 3. We can find from Table 2 that OSPVM achieves nearly the same generalization as Batch-PVM, while its training time is longer than that of Batch-PVM. The 16-by-16 mode is faster than the one-by-one mode. Taking the "Face" dataset as an example, the training time of OSPVM is about 1.5 seconds and 13.07 seconds in 16-by-16 and 1-by-1 mode, respectively. The reason is that the bigger the chunk size, the fewer updates are needed. Batch-PVM needs just 0.46 seconds for the "Face" dataset. In fact, Batch-PVM is the extreme case in which the initial data is the entire dataset and no update is needed. For new samples, OSPVM can learn incrementally, while Batch-PVM has to be retrained from the start. Taking the "Face" dataset as an example, the average updating time of OSPVM per sample is around 1.5/200 = 0.0075 seconds, while for Batch-PVM, since it has to be retrained from the start, the updating time per sample is about 0.460 seconds. OSPVM is thus much faster than Batch-PVM in updating time per sample. Table 2 also reports the results of the algorithms BSGD, AMM, and Pegasos. The training time of BSGD, AMM, and Pegasos consists of the costs of dimension reduction and model training. From Tables 2 and 3 we can find that OSPVM obtains competitive generalization performance in comparison to BSGD (t = 0.183 and p > 0.1) and significantly better performance than AMM (t = 4.141 and 0.01 > p > 0.001) and Pegasos (t = 2.267 and 0.1 > p > 0.05) while taking shorter training time. Still taking the "Face" dataset as an example, BSGD, AMM, and Pegasos need 1.542, 1.99, and 1.530 seconds to obtain 91.63%, 88.75%, and 86.38% testing accuracy, while OSPVM takes 1.50 seconds for 92.87% accuracy.
Table 2

Comparison of OSPVM, Batch-PVM, BSGD, AMM, and Pegasos.

Dataset Algorithms Nodes (θ) Training time (s) Testing time (s) Training accuracy Testing accuracy
Face OSPVM (40, 16-by-16) 51 (0.96) 1.50 s 0.0004 s 99.89% 92.87%
OSPVM (40, 1-by-1) 43 (0.99) 13.07 s 0.0005 s 99.20% 91.20%
Batch-PVM 65 0.460 s 0.0005 s 99.81% 92.30%
SVD + BSGD [10] 200 1.542 s 0.0835 s 99.92% 91.63%
SVD + AMM Online [13] 200 1.990 s 0.0300 s 99.82% 88.75%
SVD + Pegasos [12] 1.530 s 0.0240 s 99.11% 86.38%

Secom OSPVM (40, 16-by-16) 61 (0.96) 1.67 s 0.007 s 94.08% 93.14%
OSPVM (40, 1-by-1) 16 (0.96) 4.01 s 0.0004 s 94.14% 93.3%
Batch-PVM 60 0.525 s 0.0073 s 93.37% 93.35%
SVD + BSGD 100 1.801 s 0.0083 s 95.12% 93.13%
SVD + AMM Online 100 12.19 s 0.031 s 94.11% 87.87%
SVD + Pegasos 1.660 s 0.026 s 93.16% 89.12%

Arcene OSPVM (40, 16-by-16) 106 (0.96) 61.17 s 0.0005 s 95.88% 90.50%
OSPVM (40, 1-by-1) 39 (0.96) 130.6 s 0.0004 s 93.5% 86.7%
Batch-PVM 85 5.06 s 0.00038 s 94.63% 90.80%
SVD + BSGD 200 65.22 s 0.0335 s 95.92% 90.43%
SVD + AMM Online 200 81.69 s 0.06 s 94.89% 87.75%
SVD + Pegasos 56.41 s 0.044 s 94.42% 86.31%

Dexter OSPVM (40, 16-by-16) 176 (0.96) 131.1 s 0.004 s 97.88% 92.25%
OSPVM (40, 1-by-1) 86 (0.96) 619.3 s 0.004 s 96.0% 91.20%
Batch-PVM 160 10.36 s 0.005 s 98.38% 91.25%
SVD + BSGD 200 148.54 s 0.003 s 97.98% 92.63%
SVD + AMM Online 200 178.19 s 0.003 s 96.81% 89.95%
SVD + Pegasos 119.40 s 0.004 s 95.87% 87.36%

Multi.fea. OSPVM (40, 16-by-16) 55 (0.96) 4.93 s 0.0053 s 98.16% 94.40%
OSPVM (40, 1-by-1) 38 (0.96) 13.4 s 0.0047 s 96.6% 93.4%
Batch-PVM 160 1.83 s 0.0192 s 99.98% 95.67%
SVD + BSGD 200 5.54 s 0.0095 s 98.42% 94.63%
SVD + AMM Online 200 10.79 s 0.03 s 99.82% 92.15%
SVD + Pegasos 4.46 s 0.034 s 99.82% 91.88%

News20 OSPVM (40, 16-by-16) 1110 (0.96) 1283 s 19.8 s 85.26% 83.10%
OSPVM (40, 1-by-1) 1100 (0.96) 1949 s 19.9 s 85.6% 83.14%
Batch-PVM 1000 1060 s 19.2 s 84.89% 83.12%
SVD + BSGD 1200 2289 s 18.6 s 83.52% 82.33%
SVD + AMM Online 1200 2679 s 21.3 s 83.83% 82.25%
SVD + Pegasos 1679 s 19.2 s 83.22% 81.81%

Sector OSPVM (40, 16-by-16) 130 (0.96) 10.12 s 0.20 s 88.86% 78.40%
OSPVM (40, 1-by-1) 150 (0.96) 18.4 s 0.21 s 86.6% 79.04%
Batch-PVM 160 2.13 s 0.21 s 87.98% 79.01%
SVD + BSGD 200 7.53 s 0.34 s 87.44% 76.68%
SVD + AMM Online 200 12.69 s 0.33 s 86.81% 76.65%
SVD + Pegasos 6.45 s 0.34 s 86.12% 75.88%

Note: since OSPVM is equivalent to PVM rather than an approximation of it, under the same experimental setting (same number of hidden nodes and same training and testing splits) OSPVM and PVM would obtain the same performance (training accuracy and testing accuracy).

Table 3

t value and significant level of OSPVM versus BSGD, AMM, and Pegasos.

SVD + BSGD (88.78%) SVD + AMM (86.47%) SVD + Pegasos (85.53%)
OSPVM (16-by-16) (89.23%) t = 0.183, p > 0.1 t = 4.141, 0.01 > p > 0.001 t = 2.267, 0.1 > p > 0.05
OSPVM (1-by-1) (88.18%) t = 1.434, p > 0.1 t = 1.958, 0.1 > p > 0.05 t = 2.932, 0.05 > p > 0.01

4.2. One-by-One

In this section we compare OSPVM, OSELM, and SVD + OSELM in the one-by-one case. Their training and testing accuracies are reported in Table 4, the t values are shown in Table 6, and the training and testing times are reported in Table 5. As observed from Tables 4 and 5, although OSELM learns at the fastest speed, OSPVM produces better generalization performance than OSELM, with t = 0.950 and p > 0.1. OSPVM obtains improved performance in most cases compared to SVD + OSELM while saving training time. Taking the "Face" dataset as an example, SVD + OSELM takes 22.40 s to produce 91.0% accuracy, while OSPVM takes 13.07 s to reach 91.2% accuracy. The reason is that OSPVM learns useful features in a way similar to SVD + OSELM but removes the redundancy between dimension reduction and neural network training. For SVD + OSELM, two control parameters, the target dimension and the number of hidden nodes, need to be tuned, while for OSPVM only one parameter needs to be determined. This makes the parameter settings of OSPVM simpler to determine and OSPVM more convenient to use in real applications than SVD + OSELM. As shown in Table 7, where the hidden nodes and target dimensions are reported, OSPVM needs fewer hidden nodes than OSELM and SVD + OSELM. This means that OSPVM yields a more compact network, and thus better responsiveness, than the other algorithms.
Table 4

Comparison of training and testing accuracy (in %) (one-by-one).

Dataset SVD + OSELM OSPVM OSELM
Training accuracy Testing accuracy Training accuracy Testing accuracy Training accuracy Testing accuracy
Face 99.8% 91.0% 99.2% 91.2% 98.1% 88.5%
Secom 93.3% 93.0% 94.14% 93.3% 93.2% 92.4%
Arcene 93.0% 83.0% 93.5% 86.7% 86.1% 81.1%
Dexter 95.7% 91.4% 96.0% 91.2% 75.6% 86.2%
Multi.fea. 99.0% 92.8% 96.6% 93.4% 96.5% 93.0%
News20 85.12% 82.9% 85.6% 83.14% 85.5% 83.0%
Sector 89.11% 77.8% 88.6% 79.04% 89.1% 78.1%
Table 6

t value and significant level of OSPVM versus SVD + ELM (1-by-1) and OSELM (1-by-1).

SVD + OSELM (1-by-1) (86.73%) OSELM (1-by-1) (86.04%)
OSPVM (1-by-1) (88.18%) t = 0.7858, p > 0.1 t = 0.950, p > 0.1
Table 5

Comparison of training and testing time (in seconds) (one-by-one).

Dataset SVD + OSELM OSPVM OSELM
Training time Testing time Training time Testing time Training time Testing time
Face 22.40 s 0.0006 s 13.07 s 0.0005 s 0.156 s 0.035 s
Secom 7.809 s 0.015 s 4.010 s 0.0004 s 0.346 s 0.029 s
Arcene 131.5 s 0.0004 s 130.6 s 0.0004 s 4.390 s 0.337 s
Dexter 619.3 s 0.001 s 519.8 s 0.0006 s 9.218 s 0.281 s
Multi.fea. 13.51 s 0.042 s 13.40 s 0.0167 s 1.164 s 0.097 s
News20 1987 s 19.1 s 1949 s 19.9 s 611 s 19.7 s
Sector 18.79 s 0.22 s 18.4 s 0.21 s 3.34 s 0.39 s
Table 7

The number of hidden nodes (one-by-one).

Dataset SVD + OSELM OSPVM OSELM
#Target dimensions #Hidden nodes #Hidden nodes #Hidden nodes
Face 43 60 43 72
Secom 16 60 16 72
Arcene 39 110 39 160
Dexter 86 170 86 200
Multi.fea. 38 180 38 160
News20 780 1200 1100 1200
Sector 90 150 150 250

4.3. Chunk-by-Chunk

The performance of OSPVM, SVD + OSELM, and OSELM in chunk-by-chunk mode (here we select 16-by-16 as an example) is reported in Tables 8, 9, 10, and 11. The results are similar to those of the one-by-one mode. Table 9 shows that OSPVM needs longer training time than OSELM but shorter training time than SVD + OSELM. Tables 8, 10, and 11 show that OSPVM obtains better generalization performance and a more compact structure than OSELM and SVD + OSELM in most cases. This means that OSPVM improves the stability of OSELM in solving small-sample, high-dimensional problems while inheriting the learning efficiency of OSELM.
Table 8

Comparison of training and testing accuracy (in %) (16-by-16).

Dataset SVD + OSELM OSPVM OSELM
Training accuracy Testing accuracy Training accuracy Testing accuracy Training accuracy Testing accuracy
Face 99.82% 91.50% 99.89% 92.07% 98.22% 87.7%
Secom 93.34% 93.36% 94.08% 93.14% 93.32% 93.3%
Arcene 93.63% 89.90% 95.88% 90.50% 94.1% 89.7%
Dexter 89.75% 91.90% 97.88% 92.25% 72.6% 88.5%
Multi.fea. 99.48% 93.49% 98.16% 94.40% 96.78% 93.0%
News20 86.11% 83.09% 85.26% 83.10% 85.24% 81.0%
Sector 89.18% 78.19% 88.86% 78.40% 88.78% 76.20%
Table 9

Comparison of training and testing time (in seconds) (16-by-16).

Dataset SVD + OSELM OSPVM OSELM
Training time Testing time Training time Testing time Training time Testing time
Face 1.58 s 0.0005 s 1.5 s 0.0004 s 0.078 s 0.035 s
Secom 1.85 s 0.018 s 1.67 s 0.007 s 0.061 s 0.040 s
Arcene 61.4 s 0.0008 s 61.17 s 0.0005 s 2.03 s 0.55 s
Dexter 135.7 s 0.0006 s 131.1 s 0.0004 s 4.88 s 0.718 s
Multi.fea. 5.15 s 0.0218 s 4.93 s 0.0053 s 0.26 s 0.098 s
News20 1283 s 19.8 s 1283 s 19.8 s 0.26 s 0.098 s
Sector 10.7 s 0.21 s 10.12 s 0.20 s 5.26 s 0.38 s
Table 10

t value and significant level (p) of OSPVM versus SVD + ELM (16-by-16) and OSELM (16-by-16).

SVD + OSELM (16-by-16) (88.7%) OSELM (16-by-16) (87.05%)
OSPVM (16-by-16) (89.23%) t = 0.7900, p > 0.1 t = 2.45, 0.05 > p > 0.01
Table 11

The number of hidden nodes (16-by-16).

Dataset SVD + OSELM OSPVM OSELM
#Target dimensions #Hidden nodes #Hidden nodes #Hidden nodes
Face 62 60 62 72
Secom 54 60 54 72
Arcene 110 110 106 300
Dexter 176 170 176 400
Multi.fea. 55 180 55 160
News20 780 1200 1100 1200
Sector 90 150 150 250

4.4. Adaptive Increase of the Number of Hidden Nodes

Figure 1(a) shows how the number of hidden nodes changes as the number of training samples increases. We can see that the hidden nodes of OSPVM grow adaptively as new samples (chunk size is 40) are presented. Figure 1(b) shows how the training accuracy and testing accuracy change as the number of samples increases. We can observe that the fitting capability (training accuracy) and generalization performance (testing accuracy) of the model remain stable throughout.
Figure 1

Adaptive model updating with respect to new samples (on Face dataset).

4.5. Equivalence of OSPVM and PVM

Updating the data mean together with the projection vectors ensures that the resulting OSPVM is an exact model, equivalent to PVM rather than an approximation of it (without the data mean update, only an approximate model would be obtained). This means that, under the same parameter setting (same number of hidden nodes, same training and testing splits, etc.), OSPVM and PVM obtain the same performance (training accuracy and testing accuracy). To verify this equivalence, we run the two algorithms under the same setting on the benchmarks. From the results shown in Table 12, it can be seen that OSPVM obtains the same training accuracy and testing accuracy as PVM. This illustrates experimentally that OSPVM is equivalent to PVM rather than an approximation of it and thus attains the same generalization ability.
Table 12

Training and testing accuracy of PVM and OSPVM with the same hidden nodes.

Dataset Algorithms #Hidden nodes Training accuracy Testing accuracy
Face OSPVM (1-by-1 and 16-by-16) 65 99.81% 92.30%
Batch-PVM 65 99.81% 92.30%

Secom OSPVM (1-by-1 and 16-by-16) 60 93.37% 93.35%
Batch-PVM 60 93.37% 93.35%

Arcene OSPVM (1-by-1 and 16-by-16) 85 94.63% 90.80%
Batch-PVM 85 94.63% 90.80%

Dexter OSPVM (1-by-1 and 16-by-16) 160 98.38% 91.25%
Batch-PVM 160 98.38% 91.25%

Multi.fea. OSPVM (1-by-1 and 16-by-16) 160 99.98% 95.67%
Batch-PVM 160 99.98% 95.67%

News20 OSPVM (1-by-1 and 16-by-16) 1000 84.89% 83.12%
Batch-PVM 1000 84.89% 83.12%

Sector OSPVM (1-by-1 and 16-by-16) 160 87.98% 79.01%
Batch-PVM 160 87.98% 79.01%

4.6. The Influence of Mean Update to Generalized Performance of OSPVM

To show the influence of the mean update on the generalization performance of OSPVM, we run OSPVM under two settings, "with mean update" and "no mean update," on the same datasets, including Face, Secom, Arcene, Dexter, and Multi.fea. In the "with mean update" setting, the data is centered on the mean, which is dynamically adjusted when each subsequent chunk of data arrives. The variation curves of the testing accuracy with respect to the chunks of training data under these two settings are illustrated in Figure 2 (labeled "with mean update" and "no mean update," respectively). It can be seen that, on each dataset, OSPVM with mean update always obtains better generalization performance than OSPVM without it. Taking the Face dataset as an example, on the first 40 training samples, OSPVM with mean update attains a testing accuracy of 73.5%, while "no mean update" attains 72.3%. As subsequent training data arrive, OSPVM with mean update remains superior. When the last chunk of data arrives, the testing accuracy "with mean update" reaches 94%, while "no mean update" reaches 90%. From the point of view of theoretical analysis, the performance improvement is possibly due to two aspects:

(1) From the principal component analysis perspective, the useful features are those directions with maximum variance [33]. In order to capture these directions, the data should first be centered because, without centering, the first direction obtained is shifted toward the direction from the origin to the centre of the data, and the successive directions are shifted as a consequence.

(2) From the multivariate probability distribution perspective [34], a dataset is usually treated as a multivariate Gaussian distribution, represented as the mean plus the variation along the principal vectors. By centering the data on the mean, the mean component is cancelled out, so that the purely variational component of the data is captured.

These experimental and theoretical analyses show that the mean update has an important positive influence on the generalization performance of OSPVM. With the help of the mean update, OSPVM can process dynamic data more adaptively and effectively.

Figure 2

The influence of mean update to OSPVM.
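
The centering argument can be illustrated on a toy 2-D dataset. In the sketch below, all points lie far from the origin and vary only along the direction (1, -1). The leading direction is computed in closed form from the 2 x 2 second-moment matrix, once about the origin (no centering) and once about the data mean; the function name `leading_direction` and the data are hypothetical.

```python
from math import atan2, cos, sin

def leading_direction(points, center):
    """Unit vector of largest second moment of 2-D points about `center`."""
    sxx = sum((x - center[0]) ** 2 for x, _ in points)
    syy = sum((y - center[1]) ** 2 for _, y in points)
    sxy = sum((x - center[0]) * (y - center[1]) for x, y in points)
    # Major-axis angle of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    angle = 0.5 * atan2(2.0 * sxy, sxx - syy)
    return cos(angle), sin(angle)

# Points around (10, 10) that vary only along the direction (1, -1).
points = [(10.0 + s / 10.0, 10.0 - s / 10.0) for s in range(-10, 11)]
mean = (sum(x for x, _ in points) / len(points),
        sum(y for _, y in points) / len(points))

uncentered = leading_direction(points, (0.0, 0.0))   # points toward the mean
centered = leading_direction(points, mean)           # follows the true variation
```

Without centering, the leading direction is approximately (0.707, 0.707), i.e. it points from the origin to the data centre. After centering it is approximately (0.707, -0.707), the direction along which the data actually varies.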

5. Conclusion and Future Work

In this paper, an effective online sequential learning algorithm (OSPVM) has been proposed for high-dimensional and nonstationary data. The data mean, projection vectors, and neural network model can be updated simultaneously in a single pass over the new samples. The algorithm can handle new data arriving one-by-one or chunk-by-chunk. Apart from the threshold value of the accumulation ratio, no other parameter needs to be determined. The performance of OSPVM, including training time and generalization performance, is compared with several typical online learning algorithms on real-world benchmark problems. The results show that OSPVM produces better generalization performance with a more compact network structure than the other algorithms in most cases. In future work, we will further study how to improve computational efficiency to make the algorithm suitable for large-scale data analytics. Additionally, we will study smarter methods for determining the threshold value of the accumulation ratio adaptively.
1.  Extreme learning machine for regression and multiclass classification.

Authors:  Guang-Bin Huang; Hongming Zhou; Xiaojian Ding; Rui Zhang
Journal:  IEEE Trans Syst Man Cybern B Cybern       Date:  2011-10-06

2.  Compact extreme learning machines for biological systems.

Authors:  Kang Li; Jing Deng; Hai-Bo He; Yurong Li; Da-Jun Du
Journal:  Int J Comput Biol Drug Des       Date:  2010-09-16

3.  An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks.

Authors:  Guang-Bin Huang; P Saratchandran; Narasimhan Sundararajan
Journal:  IEEE Trans Syst Man Cybern B Cybern       Date:  2004-12

4.  A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation.

Authors:  Guang-Bin Huang; P Saratchandran; Narasimhan Sundararajan
Journal:  IEEE Trans Neural Netw       Date:  2005-01

5.  A fast and accurate online sequential learning algorithm for feedforward networks.

Authors:  Nan-Ying Liang; Guang-Bin Huang; P Saratchandran; N Sundararajan
Journal:  IEEE Trans Neural Netw       Date:  2006-11

6.  Sequential Karhunen-Loeve basis extraction and its application to images.

Authors:  A Levey; M Lindenbaum
Journal:  IEEE Trans Image Process       Date:  2000

7.  Incremental learning of chunk data for online pattern classification systems.

Authors:  Seiichi Ozawa; Shaoning Pang; Nikola Kasabov
Journal:  IEEE Trans Neural Netw       Date:  2008-06

8.  OP-ELM: optimally pruned extreme learning machine.

Authors:  Yoan Miche; Antti Sorjamaa; Patrick Bas; Olli Simula; Christian Jutten; Amaury Lendasse
Journal:  IEEE Trans Neural Netw       Date:  2009-12-08

9.  A sequential learning scheme for function approximation using minimal radial basis function neural networks.

Authors:  Y Lu; N Sundararajan; P Saratchandran
Journal:  Neural Comput       Date:  1997-02-15

10.  Information potential fields navigation in wireless ad-hoc sensor networks.

Authors:  Wei Wei; Yong Qi
Journal:  Sensors (Basel)       Date:  2011-05-03
