Generalization Bounds for Coregularized Multiple Kernel Learning.

Abstract

Multiple kernel learning (MKL), as an approach to automated kernel selection, plays an important role in machine learning. Several learning theories have been developed to analyze the generalization of multiple kernel learning. However, little work has studied multiple kernel learning in the framework of semisupervised learning. In this paper, we analyze the generalization of multiple kernel learning in the framework of semisupervised multiview learning. We apply Rademacher chaos complexity to control the performance of the candidate class of coregularized multiple kernels and obtain a generalization error bound for coregularized multiple kernel learning. Furthermore, we show that the existing results on multiple kernel learning and coregularized kernel learning can be regarded as special cases of our main results.

Year: 2018 PMID: 30515195 PMCID: PMC6236656 DOI: 10.1155/2018/1853517

Source DB: PubMed Journal: Comput Intell Neurosci

1. Introduction

Kernel-based learning achieves nonlinear machine learning tasks by reducing them to linear ones. In real applications, selecting a suitable kernel for kernel-based learning is an important and difficult task. To this end, an approach named multiple kernel learning has been developed, which automatically chooses the best kernel from a predefined kernel class. The earliest work on multiple kernel learning can be traced back to [1], where the authors proposed to automatically select a linear combination of candidate kernels for support vector machines based on a semidefinite programming approach. The generalization of multiple kernel learning has been studied theoretically by many researchers [1-7]. In particular, Ying and Campbell [2] proposed a novel generalization bound, based on Rademacher chaos complexity, for the study of multiple kernel learning. However, the discussion in [2] was restricted to single-view supervised learning. In this paper, we employ the Rademacher chaos complexity proposed in [2] to study the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework.

Semisupervised multiview learning is an area of machine learning in which models are trained with both labeled and unlabeled samples; the unlabeled samples help to reduce the number of labeled samples required. It supposes that the training samples can be represented by multiple views. The coregularized least squares algorithm, a semisupervised version of regularized least squares with two views, is a typical multiview learning model that uses the unlabeled samples to estimate the view incompatibility of models [8, 9]. Rosenberg in [10] extended the coregularized least squares algorithm to the case of kernel cotraining, and Brefeld et al. and Rosenberg in [11, 12, 13] discussed the generalization bound of kernel-based learning with multiple (or two) views in the semisupervised learning framework. However, the discussions in [11, 12, 13] supposed that the kernel used to construct the reproducing kernel Hilbert space is predefined. Therefore, their results cannot be applied to the analysis of multiple kernel learning. In this paper, we discuss the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework, and we show that the results in [2] and [11] can be regarded as special cases of our main results.

The rest of the paper is organized as follows. In Section 2, we introduce some basic notations and definitions for later discussion. In Section 3, we discuss the related research and put forward the question studied in this paper. In Section 4, we present our main results. In Section 5, we give the proofs of the main results stated in Section 4. In Section 6, we compare our results with the existing work and show that the results on multiple kernel learning in [2] and coregularized kernel learning in [11] can be regarded as special cases of our main results. Section 7 concludes the paper.

2. Notations and Definitions

In this section, we introduce notations and definitions for later discussions.

Let ℕ be the set of natural numbers and ℝ be the set of real numbers. Let ℕ_n={1,2,…, n}, n ∈ ℕ. Let (Ω, 𝒜, P) be a probability space; that is, Ω is called the sample space, 𝒜 is a σ-algebra on Ω, and P is a probability measure on (Ω, 𝒜). Ω has the structure Ω=𝒳 × 𝒴 (⊂ℝ), where 𝒳 and 𝒴 are the input space and output space, respectively. Denote by P_𝒳 the marginal distribution on 𝒳. To avoid measure-theoretic details, we simply write (Ω, P) for (Ω, 𝒜, P). Let ℱ be the set of all measurable functions f : 𝒳⟶𝒴, and let ℋ ⊂ ℱ; the set ℋ is called the hypothesis class. Let S={z_i=(x_i, y_i), i ∈ ℕ_m} be a finite set of labeled training samples, and assume these samples are independent and identically distributed (i.i.d.) according to P. Bold letters denote vectors; for example, 𝐳 denotes the vector (z1, z2,…, z_m). For the sign |·|: if D is a set, |D| denotes the number of elements of D, and if D is a function, |D| denotes the absolute value of D. If A is a matrix, A^T denotes the transpose of A. Let L : 𝒴 × 𝒴⟶[0, +∞) be the loss function; the loss of f on a sample point z=(x, y) is written L(f, z) or L(f(x), y).

In learning theory, one of the purposes is to select a function f in the hypothesis space ℋ that minimizes the generalization error E_P[L(f(X), Y)] (Equation (1)). Generally speaking, the distribution P in Equation (1) is unknown. Rather than minimizing E_P[L(f(X), Y)], we usually minimize the empirical or training error […], where S represents the finite labeled sample and z_i=(x_i, y_i) ∈ S, i ∈ ℕ_m. In this paper, the main quantity we are interested in is the following uniform estimate of the difference between the generalization error and the empirical error: […].

For the discussion in later sections, we introduce the following four definitions and one lemma (Definition 4 is proposed by us).

Definition 1 (Empirical Rademacher Complexity) […]. Let ℋ be a class of functions f : Ω⟶ℝ. The samples x_i, i ∈ ℕ_n, are independently drawn from the probability space (Ω, P). The empirical Rademacher complexity is defined as […], where the random variables σ_i, i ∈ ℕ_n, are Rademacher variables, and σ denotes the vector (σ1, σ2,…, σ_n).

Definition 2 (Empirical Rademacher Chaos Complexity) […]. Let ℋ be a class of functions f : Ω×Ω⟶ℝ. The samples x_i, i ∈ ℕ_n, are independently drawn from the probability space (Ω, P). The empirical Rademacher chaos complexity is defined as […], where the random variables σ_i, i ∈ ℕ_n, are Rademacher variables, and σ denotes the vector (σ1, σ2,…, σ_n).

Definition 3 (Reproducing Kernel Hilbert Space, RKHS) […]. The function K : 𝒳 × 𝒳⟶ℝ is a reproducing kernel of the Hilbert space ℋ if and only if (1) for any x ∈ 𝒳, K(·, x) ∈ ℋ, and (2) for any f ∈ ℋ and any x ∈ 𝒳, 〈f, K(·, x)〉ℋ=f(x). A Hilbert space of functions that possesses a reproducing kernel is called a reproducing kernel Hilbert space.
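Definitions 1 and 2 can be made concrete with a small Monte Carlo sketch. The displayed formulas are not preserved here, so the code below assumes the common conventions: the Rademacher complexity averages the supremum of (1/n) Σ_i σ_i f(x_i), and the chaos complexity averages the supremum of |(1/n) Σ_{i<j} σ_i σ_j K(x_i, x_j)|. The normalization, the function names, and the Gaussian kernel class are all illustrative assumptions, not the paper's own choices.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate for a finite class; values[k, i] = f_k(x_i)."""
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    # For each draw, take the supremum over the class of (1/n) * sum_i sigma_i f(x_i).
    return float((sigma @ values.T / n).max(axis=1).mean())

def empirical_rademacher_chaos(grams, n_draws=2000, seed=0):
    """Monte Carlo estimate for a finite kernel class; grams[k] is the Gram matrix of kernel k."""
    rng = np.random.default_rng(seed)
    n = grams.shape[1]
    strict_upper = np.triu(np.ones((n, n)), k=1)  # keeps only the pairs i < j
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # chaos[d, k] = sum_{i<j} sigma_i sigma_j K_k(x_i, x_j) for draw d and kernel k.
    chaos = np.einsum("di,kij,dj->dk", sigma, grams * strict_upper, sigma)
    return float(np.abs(chaos / n).max(axis=1).mean())

# Usage: a hypothetical class of three Gaussian kernels on 15 random points.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
grams = np.stack([np.exp(-sq_dists / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)])
chaos_single = empirical_rademacher_chaos(grams[:1])
chaos_all = empirical_rademacher_chaos(grams)
```

Because both calls reuse the same seed, the estimate for the full class can never fall below the estimate for a single kernel, which mirrors the fact that the complexity grows with the candidate class.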

Remark 1.

The second condition in Definition 3 is called “the reproducing property”: the value of the function f at the point x is reproduced by the inner product of f with K(·, x). From the above two conditions, for any (x, y) ∈ 𝒳 × 𝒳, it is clear that K(x, y)=〈K(·, x), K(·, y)〉ℋ.

In real applications, the solution to many reproducing kernel Hilbert space optimization problems is contained in a special subspace of the reproducing kernel Hilbert space, often known as the “span of the data”. The span of the data for a reproducing kernel Hilbert space ℋ is the linear subspace (Appendix A.3 on page 75 of [13]) […]. For simplicity, we denote this linear subspace by ℒ.

Let 𝒦 ⊂ {K : 𝒳 × 𝒳⟶ℝ} be a class of kernels. In this paper, we assume that 𝒦 is finite.

Definition 4 (Subnormalized Functional with Degenerate Dimension n). If a loss function ℓ : (∏ℋ) × 𝒴⟶[0, +∞) satisfies […], then we call ℓ a subnormalized functional with degenerate dimension n on (∏ℋ) × 𝒴.
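The reproducing property and the span of the data can be illustrated numerically. The sketch below assumes a Gaussian kernel and a function f = Σ_i α_i K(·, x_i) built from six random points; the kernel, the data, and the helper name `gaussian_gram` are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    """Matrix of Gaussian kernel values K(a, b) for rows a of A and rows b of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))      # data points x_1, ..., x_6
alpha = rng.normal(size=6)       # coefficients of f in the span of the data
K = gaussian_gram(X, X)          # Gram matrix K[i, j] = K(x_i, x_j)

def f(points):
    # f = sum_i alpha_i K(., x_i), evaluated at the given points
    return gaussian_gram(points, X) @ alpha

# <f, K(., x_j)> = sum_i alpha_i <K(., x_i), K(., x_j)> = sum_i alpha_i K(x_i, x_j)
inner_products = K @ alpha
# Squared RKHS norm of f: ||f||^2 = alpha^T K alpha
rkhs_norm_sq = float(alpha @ K @ alpha)
```

The inner products with K(·, x_j) coincide with the point evaluations f(x_j), and the squared norm α^T K α is the quantity that the regularization terms penalize.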

Lemma 1 .

[13] Let ℋ be a reproducing kernel Hilbert space with kernel K : 𝒳 × 𝒳⟶ℝ, and consider any point x ∈ 𝒳. If ℒ is a closed subspace containing K(·, x), then the projection of f onto ℒ has the same value at x as f does. That is, (Proj_ℒ f)(x)=f(x).

For multiple kernel learning, the main task is to automatically choose a kernel K from a predefined class 𝒦 of kernels and to find the function in the reproducing kernel Hilbert space ℋ_K that best models the given samples. In this paper, our purpose is to minimize (Equation (11)) […] over the class (∪_{K1∈𝒦} ℋ_{K1}) × (∪_{K2∈𝒦′} ℋ_{K2}). Here, 𝒦 and 𝒦′ are classes of kernels, and ℋ_{K1} and ℋ_{K2} denote the corresponding reproducing kernel Hilbert spaces. m is the number of labeled points (x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, and u is the number of unlabeled points x_i ∈ 𝒳. γ1, γ2, and λ are the regularization parameters. The functions f1 and f2 represent the two views, and ℓ(f1(·), f2(·), ·) is the labeled loss function, which measures the performance of f1 and f2 on the labeled points z_i, i ∈ ℕ_m.
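For orientation, one plausible form of the objective described above is sketched below. It is modeled on the coregularized least squares objective of [11] and is not the paper's own display: the normalization of the disagreement term and the indexing of the unlabeled points as m+1, …, m+u are assumptions.

```latex
% Hedged sketch: normalization and unlabeled index range are assumptions.
\min_{K_1 \in \mathcal{K},\; K_2 \in \mathcal{K}'} \;
\min_{f_1 \in \mathcal{H}_{K_1},\; f_2 \in \mathcal{H}_{K_2}} \;
\frac{1}{m} \sum_{i=1}^{m} \ell\bigl(f_1(x_i), f_2(x_i), y_i\bigr)
+ \gamma_1 \|f_1\|_{\mathcal{H}_{K_1}}^{2}
+ \gamma_2 \|f_2\|_{\mathcal{H}_{K_2}}^{2}
+ \lambda \sum_{i=m+1}^{m+u} \bigl(f_1(x_i) - f_2(x_i)\bigr)^{2}
```

The first term is the labeled loss, the two norm terms regularize each view, and the last term penalizes disagreement between the two views on the unlabeled points.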

3. Related Work

In [11], Rosenberg and Bartlett used Rademacher complexity to bound the coregularized kernel class in the semisupervised two-view learning framework, where the two views are two predefined reproducing kernel Hilbert spaces (ℋ1 and ℋ2, respectively). Take labeled points z_i=(x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, and unlabeled points x_i ∈ 𝒳. The coregularized least squares algorithm discussed in [11] can be described as follows (Equation (12)): […], where m is the number of labeled points (x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, and u is the number of unlabeled points x_i ∈ 𝒳. γ1, γ2, and λ are the regularization parameters. The functions f1 and f2 represent the two views, and ℓ(f1(·), f2(·), ·) is the labeled loss function, which measures the performance of f1 and f2 on the labeled points z_i, i ∈ ℕ_m. In [11], the final output is (f1+f2)/2.

In [2], Ying and Campbell applied the Rademacher chaos complexity to study the generalization of multiple kernel learning in the supervised learning framework. The multiple kernel learning model they considered is as follows (Equation (13)): […], where m is the number of labeled points (x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, ℓ(f(x), y) is the loss function, λ is the regularization parameter, and ℋ_K denotes the reproducing kernel Hilbert space with kernel K. In Equation (13), ℋ_K is not predefined and depends on the kernel chosen from the class 𝒦 of kernels.

In this paper, we are interested in coregularized multiple kernel learning; that is, the two reproducing kernel Hilbert spaces are not defined in advance. Our discussion is in the framework of semisupervised multiview learning. We formulate this learning problem as the following two-layer minimization (Equation (14)): […], where 𝒦 and 𝒦′ are classes of kernels, and ℋ_{K1} and ℋ_{K2} denote the reproducing kernel Hilbert spaces. m is the number of labeled points (x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, and u is the number of unlabeled points x_i ∈ 𝒳. γ1, γ2, and λ are the regularization parameters. The functions f1 and f2 represent the two views, and ℓ(f1(·), f2(·), ·) is the labeled loss function, which measures the performance of f1 and f2 on the labeled points z_i, i ∈ ℕ_m.
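The two-layer minimization can be sketched in code under explicit assumptions: the labeled loss is the squared error of the averaged predictor (f1+f2)/2, the kernel classes are small finite grids of Gaussian bandwidths, and each f is written with coefficients over all m+u points. None of these choices, nor the function names, come from the paper; they only make the structure "outer search over kernel pairs, inner minimization over functions" concrete.

```python
import itertools
import numpy as np

def gaussian_gram(A, B, bandwidth):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def objective(theta, K1, K2, y, m, g1, g2, lam):
    """Labeled loss + RKHS penalties + disagreement on the unlabeled points."""
    n = K1.shape[0]
    a, b = theta[:n], theta[n:]
    f1, f2 = K1 @ a, K2 @ b
    labeled = np.mean((0.5 * (f1[:m] + f2[:m]) - y) ** 2)
    penalty = g1 * (a @ K1 @ a) + g2 * (b @ K2 @ b)
    disagreement = lam * np.sum((f1[m:] - f2[m:]) ** 2)
    return labeled + penalty + disagreement

def inner_min(K1, K2, y, m, g1, g2, lam):
    """Inner layer: minimize over (f1, f2) for fixed kernels via the normal equations."""
    n = K1.shape[0]
    A = 0.5 * np.hstack([K1[:m], K2[:m]])   # averaged predictor on labeled points
    B = np.hstack([K1[m:], -K2[m:]])        # f1 - f2 on unlabeled points
    R = np.block([[g1 * K1, np.zeros((n, n))], [np.zeros((n, n)), g2 * K2]])
    H = A.T @ A / m + R + lam * (B.T @ B)
    theta = np.linalg.lstsq(H, A.T @ y / m, rcond=None)[0]
    return theta, objective(theta, K1, K2, y, m, g1, g2, lam)

def outer_min(X1, X2, y, m, bandwidths1, bandwidths2, g1=0.1, g2=0.1, lam=0.1):
    """Outer layer: search the finite kernel classes for the best kernel pair."""
    best = None
    for s1, s2 in itertools.product(bandwidths1, bandwidths2):
        K1, K2 = gaussian_gram(X1, X1, s1), gaussian_gram(X2, X2, s2)
        theta, value = inner_min(K1, K2, y, m, g1, g2, lam)
        if best is None or value < best[0]:
            best = (value, s1, s2, theta)
    return best

# Usage on synthetic data: labeled points come first, then the unlabeled ones.
rng = np.random.default_rng(0)
m, u = 10, 20
X1 = rng.normal(size=(m + u, 2))                  # view 1
X2 = X1 + 0.1 * rng.normal(size=(m + u, 2))       # view 2, a noisy copy (illustrative only)
y = np.sign(X1[:m, 0])
value, s1, s2, theta = outer_min(X1, X2, y, m, [0.5, 1.0, 2.0], [0.5, 1.0, 2.0])
```

Setting `lam=0` removes the coupling between the views, and restricting each bandwidth list to one element fixes the kernels in advance, which corresponds to the predefined-kernel setting of [11].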

Remark 2.

We explain the following two points.

(1) Equation (14) given in this paper is different from Equation (12). The solution of Equation (14) minimizes over the class (∪_{K1∈𝒦} ℋ_{K1}) × (∪_{K2∈𝒦′} ℋ_{K2}), while the solution of Equation (12) minimizes over the class ℋ1 × ℋ2. The solution of Equation (14) is obtained through two minimization steps: first, it finds the most suitable reproducing kernels for the given samples; second, it obtains the best function/model from the reproducing kernel Hilbert spaces found in the first step. Equation (12), in contrast, only needs to find the best function/model in the predefined reproducing kernel Hilbert spaces.

(2) Equation (14) given in this paper is different from Equation (13). The solution of Equation (14) minimizes over the product space (∪_{K1∈𝒦} ℋ_{K1}) × (∪_{K2∈𝒦′} ℋ_{K2}), while the solution of Equation (13) minimizes over the space ∪_{K∈𝒦} ℋ_K. The objective in Equation (13) is much simpler because its analysis is limited to single-view supervised learning and does not deal with unlabeled samples.

From the above discussion, we can see that the generalization analysis of Equation (14) is more meaningful and more challenging. In the next section, we present the main results of this paper.

4. Generalization Bounds

In this section, we assume that the loss function ℓ in Equation (11) is a subnormalized functional with degenerate dimension 2 on ℋ_{K1} × ℋ_{K2} × 𝒴, K1 ∈ 𝒦, K2 ∈ 𝒦′. In Equation (11), letting f1=0 and f2=0, we have […]. Note that 𝒬(f1, f2) ≥ 0; then, for any labeled samples (x_i, y_i), i ∈ ℕ_m, and unlabeled samples x_i, the class of candidate reproducing kernel Hilbert spaces is defined as follows (Equation (16)): […]. That is, the solution (f1, f2) minimizing 𝒬(f1, f2) belongs to the class ℋ × ℋ. For simplicity, we use ℋ to denote ℋ × ℋ in the next sections. As assumed in [11], the final predictor for coregularized multiple kernel learning is selected from the following class (Equation (17)): […].
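The step from evaluating the objective at f1=f2=0 to a bounded candidate class can be sketched as follows. This is a hedged reconstruction rather than the paper's display: it assumes that "subnormalized" means ℓ(0, 0, y) ≤ 1 for every y, as in the analogous assumption of [11], and that the objective 𝒬 has the coregularized form with a squared disagreement term over unlabeled points indexed m+1, …, m+u.

```latex
% Assumption: \ell(0,0,y) \le 1 for all y (one reading of "subnormalized").
\mathcal{Q}(0,0) = \frac{1}{m}\sum_{i=1}^{m} \ell(0,0,y_i) \le 1 .
% A minimizer (f_1^*, f_2^*) satisfies \mathcal{Q}(f_1^*, f_2^*) \le \mathcal{Q}(0,0),
% and the labeled loss is nonnegative, hence
\gamma_1 \|f_1^*\|_{\mathcal{H}_{K_1}}^{2} + \gamma_2 \|f_2^*\|_{\mathcal{H}_{K_2}}^{2}
+ \lambda \sum_{i=m+1}^{m+u} \bigl(f_1^*(x_i) - f_2^*(x_i)\bigr)^{2}
\le \mathcal{Q}(f_1^*, f_2^*) \le 1 .
```

Under these assumptions, the search for a minimizer can be restricted to pairs (f1, f2) whose regularization and disagreement terms sum to at most 1, which is what makes the candidate class depend on the kernels and on the unlabeled data.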

Remark 3.

In Equations (16) and (17), we can see that ℋ and […] mainly depend on the prescribed set of candidate kernels and the unlabeled data. For any […], we define the expected loss as in Equation (2), and we can use the loss function L to compute the labeled empirical loss in Equation (11). For the given samples (x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, the loss can also be computed as in Equation (3). If L satisfies the Lipschitz continuity condition on […], we introduce the constant defined by […] and the local Lipschitz constant denoted as (Equation (19)) […]. At the end of this section, we give the main theorems of this paper.

Theorem 1 .

Let the function L be a subnormalized functional with degenerate dimension 1 on […] and satisfy the Lipschitz continuity condition on […]. Let the labeled samples z_i=(x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, be independent random variables drawn from the probability space (𝒳 × 𝒴, P), and let the unlabeled samples x_i ∈ 𝒳 be independent random variables drawn from the probability space (𝒳, P_𝒳). Then, for any δ ∈ (0,1), with probability at least 1 − δ, for any […], the following inequality holds: […], where σ denotes the vector (σ1, σ2,…) of Rademacher variables σ_i, (1/γ1) · K1+(1/γ2) · K2=O · D(K1, K2) · O with a diagonal matrix D(K1, K2) whose diagonal elements are nonnegative and an orthogonal matrix O, and […].

Corollary 1 .

Under the assumptions of Theorem 1, we have the following inequality: […].

Corollary 2 .

Under the assumptions of Theorem 1, and assuming 𝒦=𝒦′ and γ1=γ2, the following inequality holds: […].

Remark 4.

We will prove Theorem 1 and Corollaries 1 and 2 in Section 5. In Section 6, we will show that Theorem 1 and Corollary 1 extend the results in [11] and [2], respectively.
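The diagonalization of (1/γ1) · K1+(1/γ2) · K2 that appears in Theorem 1 can be checked numerically. The sketch below assumes the factorization has the usual symmetric form O · D · O^T (the position of the transpose is not preserved in the text) and uses linear-kernel Gram matrices on random data purely for illustration; the function name is hypothetical.

```python
import numpy as np

def weighted_kernel_decomposition(K1, K2, gamma1, gamma2):
    """Return M = K1/gamma1 + K2/gamma2 with its eigenvalues and orthogonal eigenvectors."""
    M = K1 / gamma1 + K2 / gamma2
    eigenvalues, O = np.linalg.eigh(M)  # symmetric eigendecomposition, ascending eigenvalues
    return M, eigenvalues, O

rng = np.random.default_rng(0)
X1 = rng.normal(size=(8, 3))        # view 1 features
X2 = rng.normal(size=(8, 2))        # view 2 features
K1, K2 = X1 @ X1.T, X2 @ X2.T       # linear-kernel Gram matrices (positive semidefinite)
M, d, O = weighted_kernel_decomposition(K1, K2, gamma1=0.5, gamma2=2.0)
```

Since both Gram matrices are positive semidefinite and the regularization parameters are positive, the diagonal entries in `d` are nonnegative up to rounding, in line with the condition D(K1, K2) ≥ 0.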

5. Proofs

In this section, we prove Theorem 1 and Corollaries 1 and 2 of Section 4. In preparation for the proofs, we first give the following two lemmas (Lemmas 2 and 3).

Lemma 2 .

Let the function L be a subnormalized functional with degenerate dimension 1 on […] and satisfy the Lipschitz continuity condition on […]. Let the labeled samples z_i, i ∈ ℕ_m, be independently drawn from the probability space (𝒳 × 𝒴, P), and let the unlabeled samples x_i ∈ 𝒳 be independently drawn from the probability space (𝒳, P_𝒳). Then, for any δ ∈ (0,1), with probability at least 1 − δ, we have the following inequality: […], where σ denotes the vector (σ1, σ2,…) of Rademacher variables σ_i.

Proof

For simplicity, let […]. Replacing the ith sample (x_i, y_i) in the labeled samples with (x_i′, y_i′), we have […]. By McDiarmid's inequality (Theorem D.3 on page 372 of [15]), with probability at least 1 − δ/2, we have […]. By the symmetrization argument (page 36 of [15]), we bound the expectation of the first term on the right-hand side of inequality (28) as follows: […]. Applying McDiarmid's inequality again to the right-hand side of inequality (29), with probability at least 1 − δ/2, we have […]. Combining inequalities (28), (29), and (30) yields, with probability at least 1 − δ, […].
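For reference, McDiarmid's inequality in its standard form is stated below, together with the deviation term it produces when replacing one of m samples changes the function by at most c/m. The constant c is a hypothetical placeholder: the bounded-difference constant used in the proof is not preserved in the text above.

```latex
% Bounded differences: for every i and all z_1,\dots,z_m, z_i',
% |g(z_1,\dots,z_i,\dots,z_m) - g(z_1,\dots,z_i',\dots,z_m)| \le c_i .
\Pr\bigl[\, g(Z_1,\dots,Z_m) - \mathbb{E}\, g(Z_1,\dots,Z_m) \ge \varepsilon \,\bigr]
\le \exp\!\left( \frac{-2\varepsilon^{2}}{\sum_{i=1}^{m} c_i^{2}} \right).
% With c_i = c/m, setting the right-hand side to \delta/2 gives, with probability at least 1-\delta/2,
g(Z_1,\dots,Z_m) \le \mathbb{E}\, g(Z_1,\dots,Z_m) + c \sqrt{\frac{\ln(2/\delta)}{2m}} .
```

Applying the inequality twice, each time with failure probability δ/2, and taking a union bound is what yields a statement that holds with probability at least 1 − δ.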

Lemma 3 .

Under the assumptions of Lemma 2, for any δ ∈ (0,1), with probability at least 1 − δ, we have the following inequality: […], where σ denotes the vector (σ1, σ2,…) of Rademacher variables σ_i. As defined in Equation (19), Lsuploc is the local Lipschitz constant of L. By the contraction property of Rademacher complexity (Lemma 26.9 on page 331 of [16] and Theorem 7 of [17]), the result follows.

Lemma 4 .

If Equation (11) has a solution, then, for a fixed K1 ∈ 𝒦 and a fixed K2 ∈ 𝒦′, it has a solution (f1, f2) as follows: […], for some real coefficient vectors α=(α1, α2,…) and β=(β1, β2,…). That is, the solution belongs to (∪ℒ) × (∪ℒ). The result follows in a similar way to Proposition 2.3.1 in [11].

Lemma 5 .

Let the labeled samples z_i=(x_i, y_i) ∈ 𝒳 × 𝒴, i ∈ ℕ_m, be independent random variables drawn from the probability space (𝒳 × 𝒴, P), and let the unlabeled samples x_i ∈ 𝒳 be independent random variables drawn from the probability space (𝒳, P_𝒳). Then the following inequality holds: […], where σ denotes the vector (σ1, σ2,…) of Rademacher variables σ_i, (1/γ1) · K1+(1/γ2) · K2=O · D(K1, K2) · O with a diagonal matrix D(K1, K2) whose diagonal elements are nonnegative and an orthogonal matrix O, and […].

As defined in Equation (17), we can rewrite the left-hand side of inequality (35) as […], where ℋ is defined in Equation (16). From the assumptions, we have […]. By the reproducing property and Lemma 1, for any K1 ∈ 𝒦 and K2 ∈ 𝒦′ and any index i, we have […]. Combining Equations (39), (40), (41), and (42) yields […], where […]. By Lemma 4, for any K1 ∈ 𝒦 and K2 ∈ 𝒦′, we have (this is similar to Section 5.2.1, on converting to Euclidean space, in [11]) […], where […]. Hence, we have […], where […]. Note that the matrix Λ is not full rank; using steps similar to those in [11], we can rewrite Equation (47) as […], where […]. In the above equations, the matrices E and E are, respectively, diagonal matrices containing the nonzero eigenvalues, and we write the projections of α and β onto the column spaces of K1 and K2 as V · a and W · b.

Next, we relate Equation (47) to Rademacher chaos complexity. Note that […]. The first equation can be easily obtained from the discussion in Section 5.2.4 of [11]. The second equation follows from […]. Then, we have […]. The second equation uses Equation (51), the fourth inequality is obtained by using Jensen's inequality twice, and the last inequality uses the definition of Rademacher chaos complexity and the finiteness of the kernel classes. For any […], it is easy to show that […].

Combining Lemmas 2, 3, 4, and 5, the result follows.

By the Gershgorin circle theorem (Theorem 7.2.1 on page 381 of [18]), the D(K1, K2) in Equation (21) can be estimated as follows: […]. Then, we can write Equation (22) as […]. So, the result follows.

By the assumption in Corollary 2, we have […]. Substituting Equations (57) and (58) into Equation (24), Corollary 2 follows.
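The role of the Gershgorin circle theorem in the corollary proofs is to bound the eigenvalues of a matrix by its entries, without computing the spectrum. A minimal sketch on a random symmetric matrix follows; the specific estimate of D(K1, K2) used in the paper is not preserved, so only the general eigenvalue bound is shown, and the function name is hypothetical.

```python
import numpy as np

def gershgorin_upper_bound(M):
    """Upper bound on the real eigenvalues of a symmetric matrix from its entries."""
    # Every eigenvalue lies in some disc centred at M[i, i] with radius sum_{j != i} |M[i, j]|.
    radii = np.abs(M).sum(axis=1) - np.abs(np.diag(M))
    return float(np.max(np.diag(M) + radii))

rng = np.random.default_rng(0)
A = rng.normal(size=(7, 7))
M = A @ A.T                                  # symmetric positive semidefinite test matrix
largest_eigenvalue = float(np.linalg.eigvalsh(M).max())
bound = gershgorin_upper_bound(M)
```

For a Gram matrix with bounded kernel values, the same argument bounds every eigenvalue by the largest absolute row sum, which depends only on the kernel bound and the number of points.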

6. Discussion

In the above two sections, we gave our main results and proved them. In this section, we compare our results with the existing work in [2] and [11]. First, we can see that the term […] in Theorem 1 reflects the compatibility of the two views on the unlabeled samples.

(1) For multiple kernel learning in supervised learning. If we let u=0 and γ1=γ2=λ, then we have […]. By Corollary 1 and Equations (60) and (61), we obtain […]. Thus the main result in [2] is recovered from Corollary 1; that is, the main result in [2] is a special case of Corollary 1.

Remark 5.

For the discussion in Section 3, if we set u=0 and γ1=γ2=λ and combine this with Equation (17), then Equation (14) reduces to Equation (13). Furthermore, let |𝒦|=|𝒦′|=1 and 𝒦=𝒦′; then we have […]. Substituting inequality (63) into inequality (62), we obtain the generalization bound for single kernel learning in the framework of supervised learning as follows: […].

(2) For coregularized kernel learning in semisupervised learning. If we let K1 ∈ 𝒦, |𝒦|=1, K2 ∈ 𝒦′, and |𝒦′|=1, then by Equation […] we have […]. Note that Equation (65) is the same as the supremum evaluation in Section 5.2.2 of [11]. Thus the main result in [11] is recovered from Theorem 1; that is, the main result in [11] can be regarded as a special case of Theorem 1.

Remark 6.

For the discussion in Section 3, if we set K1 ∈ 𝒦, |𝒦|=1, K2 ∈ 𝒦′, and |𝒦′|=1, then Equation (14) reduces to Equation (12). Figure 1 shows the relations between the main results in this paper and the results in [2] and [11].

Figure 1

Semisupervised learning reduces to supervised learning when u=0, and if 𝒦 contains a single kernel, we regard it as a special case of multiple kernel learning. The scope of the discussion in [2] is the intersection of the green and blue ellipses, the scope of the discussion in [11] is the yellow ellipse, and the scope of the discussion in this paper is the blue ellipse.

7. Conclusion

In this paper, based on semisupervised two-view learning, we study the generalization bound of coregularized multiple kernel learning. The main research tool is Rademacher chaos complexity, which we use to control the performance of the candidate class of coregularized multiple kernels. We combine the work in [2] and [11] to discuss the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework. First, we discuss the differences between the learning problem proposed here and the learning problems in [2] and [11]. Then, we analyze the generalization bound of coregularized multiple kernel learning in the semisupervised multiview learning framework. Finally, we show that the existing results in [2] and [11] can be regarded as special cases of our main results.

2 in total

1. Rademacher chaos complexities for learning the kernel problem.

Authors: Yiming Ying; Colin Campbell
Journal: Neural Comput Date: 2010-11 Impact factor: 2.026

2. Refined Rademacher chaos complexity bounds with applications to the multikernel learning problem.

Authors: Yunwen Lei; Lixin Ding
Journal: Neural Comput Date: 2014-01-30 Impact factor: 2.026
