
The generalization error bound for the multiclass analytical center classifier.

Zeng Fanzi, Ma Xiaolong.

Abstract

This paper presents a multiclass classifier based on the analytical center of the feasible space (MACM). The classifier is formulated as a single quadratically constrained optimization problem and does not need to repeatedly construct binary classifiers to separate each class from all the others. Its generalization error upper bound is proved theoretically, and experiments on benchmark datasets validate the generalization performance of MACM.

Year:  2013        PMID: 24459436      PMCID: PMC3891430          DOI: 10.1155/2013/574748

Source DB:  PubMed          Journal:  ScientificWorldJournal        ISSN: 1537-744X


1. Introduction

Multiclass classification is an important and ongoing research subject in machine learning, with applications in machine vision [1, 2], text and speech categorization [3, 4], natural language processing [5], and disease diagnosis [6, 7]. Two kinds of approaches have been proposed to solve the multiclass classification problem [8]. The first extends a binary classifier to handle the multiclass case directly; this includes neural networks, decision trees, support vector machines, naive Bayes, and K-nearest neighbors. The second decomposes the multiclass classification problem into several binary classification tasks. Several methods are used for this decomposition: one-versus-all [9], all-versus-all [10], and error-correcting output coding [11]. The one-versus-all approach reduces the problem of classifying among K classes to K binary problems, where each problem discriminates a given class from the other K − 1 classes. For the all-versus-all method, a binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes; this requires building K(K − 1)/2 binary classifiers for a K-class problem. When testing a new example, voting is performed among the classifiers and the class with the maximum number of votes wins. Error-correcting output coding works by training N binary classifiers to distinguish between the K different classes; each class is given a codeword of length N according to a binary matrix M, each row of which corresponds to a certain class. All of the above multiclass classification algorithms must construct binary classifiers repeatedly to separate a single class from all the others for a K-class problem, which leads to daunting computation and low classification efficiency. Reference [12] proposes the multiclass support vector machine (MSVM), which corresponds to a single quadratic optimization problem and does not require repeatedly constructing binary classifiers.
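The one-versus-all decomposition described above can be sketched in a few lines. The centroid-based binary learner below is a toy stand-in for any real binary classifier (all function names are ours, not from the paper):

```python
import numpy as np

def fit_binary(X, t):
    # Toy binary learner: score = distance to the negative centroid minus
    # distance to the positive centroid (stands in for any binary classifier).
    pos, neg = X[t == 1].mean(axis=0), X[t == -1].mean(axis=0)
    return lambda x: np.linalg.norm(x - neg) - np.linalg.norm(x - pos)

def train_one_vs_all(X, y, classes):
    """One binary problem per class: class k versus the other K - 1 classes."""
    return {k: fit_binary(X, np.where(y == k, 1, -1)) for k in classes}

def predict_one_vs_all(models, x):
    """The class whose one-versus-all classifier scores highest wins."""
    return max(models, key=lambda k: models[k](x))

X = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
y = np.array([1, 1, 2, 2, 3, 3])
models = train_one_vs_all(X, y, classes=[1, 2, 3])
print(predict_one_vs_all(models, np.array([5.0, 5.5])))  # → 2
```

Note that K models must be trained, which is exactly the repeated construction that MACM avoids.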
However, the support vector machine solution corresponds to the center of the largest hypersphere inscribed in the feasible space. When the feasible space, that is, the space of hypotheses consistent with the training data, is elongated or asymmetric, the support vector machine is not effective [13]. To address this problem, a multiclass classifier based on the analytical center of the feasible space (MACM) is proposed. To validate its generalization performance theoretically, its generalization error upper bound is formulated and proved, and experiments on benchmark datasets confirm the generalization performance of MACM.

2. Multiclass Analytical Center Classifier

To facilitate the discussion of multiclass analytical center classifier, the following definitions are introduced.

Definition 1 (chunk)

A vector v = (v_1,…, v_{kd}) ∈ ℜ^{kd} is broken into k chunks (v^1,…, v^k), where the ith chunk is v^i = (v_{(i−1)d+1},…, v_{id}) ∈ ℜ^d.

Definition 2 (expansion)

Let Vec⁡(x, i) ∈ ℜ^{kd} be the vector obtained by embedding x ∈ ℜ^d into kd-dimensional space, writing the coordinates of x in the ith chunk of a vector in ℜ^{kd}. Let 0_ℓ denote the zero vector of length ℓ. Then Vec⁡(x, i) can be written formally as the concatenation of three vectors, Vec⁡(x, i) = (0_{(i−1)d}, x, 0_{(k−i)d}) ∈ ℜ^{kd}. Define Vec⁡(x, i, j) = Vec⁡(x, i) − Vec⁡(x, j), the vector in ℜ^{kd} with x embedded in the ith chunk and −x embedded in the jth chunk.

Definition 3

Given the sample (x, y) ∈ ℜ^d × {1,…, k}, its expansion is defined as G(x, y) = {Vec⁡(x, i, j) | i = y, j ∈ {1,…, k}∖{i}}; the expansion of the whole sample set S = {(x_1, y_1), (x_2, y_2),…} is defined as G(S) = ⋃_{(x,y)∈S} G(x, y).
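Definitions 1–3 can be made concrete with a short sketch (function names are ours, used only for illustration):

```python
import numpy as np

def vec(x, i, k):
    """Definition 2: embed x in the i-th chunk (1-indexed) of a k*d vector."""
    d = len(x)
    v = np.zeros(k * d)
    v[(i - 1) * d : i * d] = x
    return v

def vec_diff(x, i, j, k):
    """Vec(x, i, j) = Vec(x, i) - Vec(x, j)."""
    return vec(x, i, k) - vec(x, j, k)

def expansion(x, y, k):
    """Definition 3: G(x, y) holds Vec(x, y, j) for every label j != y."""
    return [vec_diff(x, y, j, k) for j in range(1, k + 1) if j != y]

x = np.array([1.0, 2.0])
print(vec_diff(x, 1, 3, k=3))  # → [ 1.  2.  0.  0. -1. -2.]
```

Each sample thus contributes k − 1 expanded vectors, one per competing label.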

Definition 4 (piecewise linear separability)

The point sets A^i ∈ ℜ^{m_i×d}, i = 1,…, k (where k is the number of classes and m_i the number of samples belonging to the ith class), are piecewise linear separable if there exist w^i ∈ ℜ^d and γ^i ∈ ℜ, i = 1,…, k, where d is the dimension of a point, such that

x·w^i − γ^i ≥ x·w^j − γ^j + 1 for every row x of A^i, i, j = 1,…, k, i ≠ j.  (1)

Definition 5 (piecewise linear classifier)

Assume w = (w^1,…, w^k), where (w^1,…, w^k) ∈ ℜ^{kd} = ℜ^d × ⋯ × ℜ^d. Given a new point x ∈ ℜ^d, a piecewise linear classifier is a function f : ℜ^d → {1,…, k} defined as follows:

f(x) = arg max_{i=1,…,k} (x·w^i − γ^i),  (2)

where arg max returns the class label corresponding to the maximum value.

To simplify the notation for the formulation of the multiclass analytical center classifier, we consider an augmented weight space as follows. Let

w̃^i = (w^i, −γ^i) ∈ ℜ^{d+1},  x̃ = (x, 1) ∈ ℜ^{d+1};  (3)

then inequality (1) can be rewritten as

x̃·w̃^i − x̃·w̃^j ≥ 1 for every augmented row x̃ of A^i, i, j = 1,…, k, i ≠ j.  (4)

Let w̃ = (w̃^1,…, w̃^k) ∈ ℜ^{k(d+1)}. According to Definition 2, embedding x̃ into the ℜ^{k(d+1)} space, inequality (4) has the following form:

w̃·Vec⁡(x̃, i) − w̃·Vec⁡(x̃, j) ≥ 1.  (5)

Consider that

Vec⁡(x̃, i) − Vec⁡(x̃, j) = Vec⁡(x̃, i, j).  (6)

Thus, inequality (5) can be rewritten as follows:

w̃·Vec⁡(x̃, i, j) ≥ 1,  Vec⁡(x̃, i, j) ∈ G(S̃).  (7)

Inequality (7) represents the feasible space of w̃ in the higher-dimensional space ℜ^{k(d+1)}. Similar to the binary classification based on the analytical center of version space [8], we define the slack variables s_{ij}^ℓ, i, j = 1,…, k, i ≠ j, ℓ = 1,…, m_i, and then have the following minimization problem, whose solution corresponds to the analytical center of the feasible space:

min_{w̃,s} −∑_{i≠j} ∑_ℓ ln s_{ij}^ℓ
s.t.  s_{ij}^ℓ = w̃·Vec⁡(x̃_ℓ, i, j),  ||w̃||² = 1.  (8)

In order to further simplify the formulation of the multiclass analytical center classifier, we introduce some notation: M = k(k − 1)∑_i m_i; B ∈ ℜ^{M×k(d+1)} is the matrix whose rows are the vectors Vec⁡(x̃_ℓ, i, j); B_i denotes the ith row vector of B. Then the optimization problem (8) can be rewritten as follows:

min_{w̃} −∑_{i=1}^{M} ln (B_i·w̃)
s.t.  ||w̃||² = 1.  (9)

After solving the optimization problem (9) to get the optimal weight w̃*, we have a piecewise linear classifier f : ℜ^d → {1,…, k} computed in the following way:

f(x) = arg max_{i=1,…,k} x̃·w̃*^i,  (10)

where arg max returns the class label corresponding to the maximum value. If the dataset is not piecewise linear separable, a kernel function is used to map the data into a higher-dimensional space in which it is linearly separable.
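A minimal sketch of the classifier of Definition 5 and of problem (9). The projected-gradient loop below is a simple stand-in for the Newton-type barrier solvers usually used to locate analytical centers, not the authors' algorithm:

```python
import numpy as np

def piecewise_linear_predict(W, gamma, x):
    """Definition 5: f(x) = argmax_i (w^i . x - gamma^i); labels are 1..k."""
    return int(np.argmax(W @ x - gamma)) + 1

def analytical_center(B, steps=2000, lr=0.01):
    """Sketch of problem (9): minimize -sum(ln(B_i . w)) subject to
    ||w|| = 1, via projected gradient descent on the unit sphere."""
    w = B.sum(axis=0)
    w /= np.linalg.norm(w)                 # start inside the feasible cone
    for _ in range(steps):
        s = B @ w                          # slack values B_i . w
        if np.any(s <= 0):                 # must stay strictly feasible
            break
        w -= lr * -(B / s[:, None]).sum(axis=0)  # barrier gradient step
        w /= np.linalg.norm(w)             # project back onto ||w|| = 1
    return w

# Toy feasible space defined by three constraint rows.
B = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
w = analytical_center(B)
print(np.all(B @ w > 0))  # → True: the center satisfies every constraint
```

Unlike the largest inscribed hypersphere, the analytical center balances the logarithmic barrier of all constraints, which is what makes it robust to elongated feasible spaces.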

3. Generalization Error Bound of Multiclass Analytical Center Classifier

In order to analyze the generalization error bound theoretically, we introduce the definition of classification margin and data radius and then deduce the margin-based generalization error bound of MACM.

Definition 6 (classification margin)

Given the linear classifier w̃ = (w̃^1,…, w̃^k), the classification margin of the sample (x_ℓ, y_ℓ) ∈ ℜ^d × {1,…, k} is defined as follows:

r(x_ℓ, y_ℓ) = min_{j≠y_ℓ} w̃·Vec⁡(x̃_ℓ, y_ℓ, j) = min_{j≠y_ℓ} (x̃_ℓ·w̃^{y_ℓ} − x̃_ℓ·w̃^j).  (11)

For the whole training set S = {(x_1, y_1), (x_2, y_2),…, (x_m, y_m)}, the minimal margin is as follows:

r(S) = min_{ℓ=1,…,m} r(x_ℓ, y_ℓ).  (12)

Definition 7 (data radius)

Given the dataset S = {(x_1, y_1), (x_2, y_2),…, (x_m, y_m)}, the data radius is defined as follows:

ς(S) = max_{ℓ=1,…,m} ||x_ℓ||.  (13)
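Definitions 6 and 7 translate directly into code (scores are left unnormalized; function names are ours):

```python
import numpy as np

def classification_margin(W, x, y):
    """Definition 6: how far the true class score exceeds the best
    competing class score for the sample (x, y); labels are 1..k."""
    scores = W @ x
    rival = np.max(np.delete(scores, y - 1))
    return scores[y - 1] - rival

def minimal_margin(W, X, Y):
    """The minimal margin r(S) over the whole training set."""
    return min(classification_margin(W, x, y) for x, y in zip(X, Y))

def data_radius(X):
    """Definition 7: the largest norm among the sample points."""
    return max(np.linalg.norm(x) for x in X)

X = np.array([[3.0, 4.0], [0.0, 1.0]])
print(data_radius(X))  # → 5.0
```

A positive minimal margin means every training sample is classified correctly with some slack, which is the quantity the bounds below depend on.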

Theorem 8

Define the data radius of the dataset S as ς(S) and the data radius of the expanded dataset Vec⁡(S, i, j) as ς(Vec⁡(S, i, j)); if ς(S) ≤ R, then ς(Vec⁡(S, i, j)) ≤ 2R.

Proof

Consider the following:

||Vec⁡(x, i, j)|| = ||Vec⁡(x, i) − Vec⁡(x, j)|| ≤ ||Vec⁡(x, i)|| + ||Vec⁡(x, j)||.  (14)

Because ||Vec⁡(x, i)|| = ||(0, x, 0)|| = ||x||, ς(Vec⁡(S, i, j)) ≤ 2·max_ℓ ||x_ℓ|| = 2R. This ends the proof of Theorem 8.
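A quick numerical check of Theorem 8. In fact, since the two chunks of Vec(x, i, j) are disjoint, ||Vec(x, i, j)|| = √2·||x||, comfortably below the 2R given by the triangle-inequality argument:

```python
import numpy as np

def vec_diff(x, i, j, k):
    """Vec(x, i, j): x in the i-th chunk, -x in the j-th chunk (1-indexed)."""
    d = len(x)
    v = np.zeros(k * d)
    v[(i - 1) * d : i * d] = x
    v[(j - 1) * d : j * d] = -x
    return v

x = np.array([3.0, 4.0])                   # ||x|| = 5, so R = 5
r = np.linalg.norm(vec_diff(x, 1, 2, k=3))
print(r <= 2 * 5)  # → True, as Theorem 8 guarantees
```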

Theorem 9 (see [14])

Consider thresholding real-valued linear functions ℋ with unit weight vectors ||w|| = 1 on the inner product space X and fix a margin r ∈ ℜ^+. For any probability distribution 𝒟 on X × {−1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any hypothesis h ∈ ℋ with margin m_S(h) ≥ r on S has error no more than

ε(m, ℋ, δ, r) = (2/m)·((64R²/r²) log⁡(emr/(8R²)) log⁡(32m/r²) + log⁡(4/δ)),  (15)

provided m > 2/ε and 64R²/r² < m.

From Definition 3 and inequality (7), it is shown that the sample A^i, i = 1,…, k, is correctly classified if

w̃·Vec⁡(x̃, i, j) ≥ 1,  Vec⁡(x̃, i, j) ∈ G(A),

is satisfied. Here, one introduces the sample pairs {Vec⁡(x̃, i, j), +1} and {−Vec⁡(x̃, i, j), −1}, which form the sets P^+(A) and P^−(A), respectively. So one can construct the new sample set P(A) = P^+(A) ⋃ P^−(A) ⊂ ℜ^{k(d+1)} × {−1, +1}.

Theorem 10

The binary classification of the sample set P(A) by the analytical center classifier is equivalent to the multiclass classification of the sample set A = {A^i} by the multiclass analytical center classifier.

Proof

Assume that {X_i^+, Y_i^+} ∈ P^+(A) and {X_i^−, Y_i^−} ∈ P^−(A), i = 1,…, |P^+(A)|; then the binary classification is to solve the following feasibility problem:

Y_i(w·X_i + b) ≥ 0,  i = 1,…, |P(A)|.  (16)

Suppose the bias b equals 0, because P^−(A) and P^+(A) are symmetric about the origin. The feasible constraints can then be rewritten as follows:

w·X_i^+ ≥ 0,  i = 1,…, |P^+(A)|.  (17)

The feasible constraints (17) define the feasible space of the weight vector w ∈ ℜ^{k(d+1)}; the binary classification by the analytical center classifier can be formulated as follows:

min_w −∑_{i=1}^{|P^+(A)|} ln (w·X_i^+)
s.t.  ||w||² = 1.  (18)

Because X_i^+ ∈ G(A) and |P^+(A)| = |G(A)|, problem (18) is equivalent to problem (8). This ends the proof of Theorem 10.

Theorem 11

Consider the classifier set ℋ from Definition 5 with ∑_{i=1}^{k} ||w̃^i||² = 1 on the inner product space X, where h : ℜ^d → {1,…, k}, and fix a margin r ∈ ℜ^+. For any probability distribution 𝒟 on X × {1,…, k} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any hypothesis h ∈ ℋ with margin m_S(h) ≥ r on S has error no more than

ε = 2(k − 1)·(2/m)·((256R²/r²) log⁡(emr/(32R²)) log⁡(32m/r²) + log⁡(4/δ)),  (19)

provided m > 2/ε and 64R²/r² < m.

Proof

Because the samples in P(S) are not independent, the generalization error bound cannot be obtained from Theorem 9 directly. However, Theorem 9 is independent of the sample distribution, so we can construct a new sample distribution 𝒟′. According to the new distribution and the dataset S, generate an independent sample set P′(S) with m samples; that is, for every (x, y) ∈ S, define P′(x, y) as a point sampled uniformly and randomly from P(x, y) according to the distribution 𝒟′; then we have P′(S) = ⋃_{(x,y)∈S} P′(x, y). From Theorem 8, the data radius ς(P′(S)) of P′(S) satisfies ς(P′(S)) ≤ 2R. The generalization error of hypothesis h over P′(S) from Theorem 9, applied with radius 2R, can be calculated as follows:

ε′ = (2/m)·((256R²/r²) log⁡(emr/(32R²)) log⁡(32m/r²) + log⁡(4/δ)).  (20)

Let event E denote that a sample in P(S) is wrongly classified and let event C denote that the misclassification occurs in P′(S). From the above analysis, the misclassification of any sample in P(S) causes the misclassification of the corresponding point in P′(S) with probability at least 1/(2(k − 1)), because P′(x, y) is drawn uniformly from P(x, y); the probabilities of events E and C therefore satisfy the following inequality:

P(C) ≥ P(E)/(2(k − 1)).  (21)

Because the cardinality of P(x, y) equals 2(k − 1), the probability of sample misclassification in P(S) is written as follows:

P(E) ≤ 2(k − 1)·P(C).  (22)

From the union bound theorem and P(C) ≤ ε′, we have the following inequality:

P(E) ≤ 2(k − 1)·ε′.  (23)

So the generalization error of hypothesis h over S is ε ≤ 2(k − 1)·ε′, which is the bound (19). This ends the proof of Theorem 11.

4. Computational Experiments

In this section, we present computational results comparing the multiclass analytical center classifier (MACM) and the multiclass support vector machine (MSVM) [12]. A description of each dataset follows. The kernel function for the piecewise nonlinear MACM and MSVM methods is k(x, x′) = (x·x′/N + 1)^n, where N is the number of features and n is the desired polynomial degree.

Wine Recognition Data. The wine dataset uses the chemical analysis of wine to determine the cultivar. There are 178 points with 13 features. This is a three-class dataset distributed as follows: 59 points in class 1, 71 points in class 2, and 48 points in class 3.

Glass Identification Database. The glass dataset is used to identify the origin of a sample of glass through chemical analysis. This dataset comprises six classes of 214 points with 9 features. The distribution of points by class is as follows: 70 float-processed building windows, 17 float-processed vehicle windows, 76 non-float-processed building windows, 13 containers, 9 tableware, and 29 headlamps.

Table 1 contains the results for MACM and MSVM on the wine and glass datasets. As anticipated, MACM produces better testing generalization than MSVM.
Table 1

The generalization performance (testing accuracy, %) of MACM and MSVM.

Dataset   Classifier   Degree of polynomial
                          1        3
Wine      MACM          97.74    98.65
          MSVM          97.19    97.75
Glass     MACM          56.46    69.38
          MSVM          55.14    66.15
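The polynomial kernel used in the experiments above, as a one-line sketch. Reading N as the number of input features is our assumption, consistent with the normalization x·x′/N:

```python
import numpy as np

def poly_kernel(x, z, n, N):
    """Polynomial kernel from the experiments: k(x, z) = (x.z/N + 1)^n,
    with N the number of input features and n the polynomial degree."""
    return (np.dot(x, z) / N + 1.0) ** n

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(poly_kernel(x, z, n=3, N=2))  # → 274.625
```

Degree n = 1 reduces to a shifted linear kernel; the table shows both methods improve substantially on the glass data when moving to degree 3.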

5. Summary

In this paper, a multiclass classifier based on the analytical center of the feasible space, which corresponds to a simple quadratically constrained optimization problem, is proposed. To validate its generalization performance theoretically, its generalization error upper bound is formulated and proved. Experiments on the wine recognition and glass identification datasets show that the multiclass analytical center classifier outperforms the multiclass support vector machine in generalization error.