
Semi-Supervised Minimum Error Entropy Principle with Distributed Method.

Baobin Wang, Ting Hu.

Abstract

The minimum error entropy (MEE) principle is an alternative to classical least squares, owing to its robustness to non-Gaussian noise. This paper studies the gradient descent algorithm for MEE in a semi-supervised setting with a distributed method, and shows that the additional information carried by unlabeled data can enhance the learning ability of the distributed MEE algorithm. Our result proves that the mean squared error of the distributed gradient descent MEE algorithm remains minimax optimal for regression even when the number of local machines increases polynomially with the total data size.


Keywords:  MEE algorithm; distributed method; gradient descent; information theoretical learning; reproducing kernel Hilbert spaces; semi-supervised approach

Year:  2018        PMID: 33266692      PMCID: PMC7512566          DOI: 10.3390/e20120968

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

The minimum error entropy (MEE) principle is an important criterion proposed in information theoretical learning (ITL) [1] and was first applied to adaptive system training by Erdogmus and Principe [2]. It has since been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, minimum cross-entropy model selection, and other topics [3,4,5,6,7,8]. By taking entropy as the measure of the error, the MEE principle fully extracts the information contained in the data and is robust to outliers in the implementation of algorithms.

Let X be an explanatory variable with values in a compact metric space $\mathcal{X}$, let Y be a real response variable with values in $\mathcal{Y} \subset \mathbb{R}$, and let $f: \mathcal{X} \to \mathbb{R}$ be a prediction function. For a given set of labeled examples $D = \{(x_i, y_i)\}_{i=1}^N$ (N denotes the sample size) and a windowing function G, the MEE principle is to find a minimizer of the empirical quadratic entropy

$$\hat{H}_D(f) = -\log \Big( \frac{1}{N^2 h} \sum_{i=1}^N \sum_{j=1}^N G\Big( \frac{(e_i - e_j)^2}{2h^2} \Big) \Big), \qquad e_i = y_i - f(x_i),$$

where $h > 0$ is the scaling parameter. Its goal is to solve the regression problem $Y = g_\rho(X) + \epsilon$, where $\epsilon$ is the noise and $g_\rho$ is the target function. Since $e_i - e_j = (y_i - y_j) - (f(x_i) - f(x_j))$ depends only on example pairs, we may replace $f(x_i) - f(x_j)$ by a pairwise function $f(x, u)$; MEE thus belongs to the class of pairwise learning problems, which involve interactions between example pairs. Since the logarithmic function is monotonic, we only consider the empirical information error of MEE in the optimization process:

$$\mathcal{E}_D(f) = -\frac{h^2}{N^2} \sum_{i=1}^N \sum_{j=1}^N G\Big( \frac{((y_i - y_j) - f(x_i, x_j))^2}{2h^2} \Big). \qquad (1)$$

Borrowing the idea from Reference [9], we introduce a Mercer kernel K on $\mathcal{X} \times \mathcal{X}$ and employ its reproducing kernel Hilbert space (RKHS) as our hypothesis space. With $K_{(x,u)} := K((x,u), \cdot)$, the RKHS $\mathcal{H}_K$ is defined as the closure of the linear span of the function set $\{K_{(x,u)} : (x,u) \in \mathcal{X} \times \mathcal{X}\}$, equipped with the inner product $\langle K_{(x,u)}, K_{(x',u')} \rangle_K = K((x,u), (x',u'))$ and the reproducing property $f(x,u) = \langle f, K_{(x,u)} \rangle_K$ for every $f \in \mathcal{H}_K$.

Since G is nonconvex, we usually solve Equation (1) by the kernel-based gradient descent method. It starts with $f_{1,D} = 0$ and is updated in the t-th step by

$$f_{t+1,D} = f_{t,D} - \eta_t \nabla \mathcal{E}_D(f_{t,D}), \qquad (2)$$

where $\eta_t > 0$ is a step size, $\nabla$ is the gradient operator in $\mathcal{H}_K$, and

$$\nabla \mathcal{E}_D(f) = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N G'\Big( \frac{((y_i - y_j) - f(x_i, x_j))^2}{2h^2} \Big) \big( (y_i - y_j) - f(x_i, x_j) \big) K_{(x_i, x_j)}.$$

As is well known, the number of example pairs grows quadratically with the sample size N, which brings a heavy computational burden to the MEE implementation. It is therefore necessary to reduce the algorithmic complexity by a distributed method based on a divide-and-conquer strategy [10].

Semi-supervised learning (SSL) [11] has attracted extensive attention as an emerging field in machine learning research and data mining. In many practical problems, only few labeled data are given while a large number of unlabeled data are available, since labeling data requires considerable time, effort, or money. In this paper, we study a distributed MEE algorithm in the framework of SSL and show that the learning ability of the MEE algorithm can be enhanced by the distributed method together with the combination of labeled and unlabeled data.

This paper makes three main contributions. First, we derive an explicit learning rate of the gradient descent method for distributed MEE in the context of SSL, which is comparable to the minimax-optimal rate of least squares regression; this implies that the MEE algorithm can serve as an alternative to least squares in SSL, in the sense that both have the same prediction power. Second, we provide a theoretical upper bound on the number of local machines that guarantees the optimal rate in distributed computation. Third, we extend the range of target functions allowed in the distributed MEE algorithm. Table 1 summarizes the notation used in this paper.
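To make the update of Equation (2) concrete, the following minimal sketch (our own illustration, not the authors' code) runs the kernel gradient descent with the Gaussian window $G(t) = e^{-t}$ and an assumed product-form pairwise kernel $K((x,u),(x',u')) = k(x,x')\,k(u,u')$ built from a univariate Gaussian kernel k; all function names and parameter values are illustrative.

```python
import numpy as np

def gaussian_base_kernel(X, Xp, sigma=0.5):
    """Univariate Gaussian kernel matrix k(x, x') between input arrays."""
    X, Xp = np.asarray(X, float), np.asarray(Xp, float)
    return np.exp(-(X[:, None] - Xp[None, :]) ** 2 / (2 * sigma ** 2))

def mee_gradient_descent(X, y, T=200, eta=0.5, h=1.0, sigma=0.5):
    """Kernel gradient descent for the empirical information error (1).

    Represents f_t = sum_{i,j} A[i,j] K_{(x_i,x_j)} with the assumed product
    pairwise kernel K((x,u),(x',u')) = k(x,x') k(u,u') and the Gaussian
    window G(t) = exp(-t), so that G'(t) = -exp(-t).
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    N = len(X)
    Kx = gaussian_base_kernel(X, X, sigma)     # N x N base kernel matrix
    V = y[:, None] - y[None, :]                # pairwise labels v_ij = y_i - y_j
    A = np.zeros((N, N))                       # expansion coefficients of f_t
    for _ in range(T):
        F = Kx @ A @ Kx                        # f_t evaluated at all training pairs
        R = V - F                              # residuals v_ij - f_t(x_i, x_j)
        W = np.exp(-R ** 2 / (2 * h ** 2))     # equals -G'((v - f)^2 / (2 h^2))
        A += eta * (W * R) / N ** 2            # f_{t+1} = f_t - eta * grad E_D(f_t)
    return A, Kx
```

A new pair (x, u) is then predicted by $f(x,u) = \sum_{i,j} A_{ij}\, k(x_i, x)\, k(x_j, u)$; note how pairs with large residuals receive exponentially small weights W, which is the source of MEE's robustness to outliers.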
Table 1

List of notations used throughout the paper.

Notation      Meaning of the Notation
X             the explanatory variable
Y             the response variable
𝒳             the input space containing X, a compact subset of a Euclidean space $\mathbb{R}^n$
𝒴             the output space containing Y, a subset of $\mathbb{R}$
ρ(·,·)        a Borel probability measure on 𝒳 × 𝒴
ρ_X           the marginal probability measure of ρ on 𝒳
ρ(y|x)        the conditional probability measure of y ∈ 𝒴 given X = x
g_ρ(x)        the mean regression function $g_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y|x)$
f_ρ(x,u)      the target function of MEE, induced by $f_\rho(x,u) = g_\rho(x) - g_\rho(u)$
K             a reproducing (Mercer) kernel on 𝒳 × 𝒳
D             the labeled data set $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$
N             the size of the labeled data set D
⌊N/4⌋         the largest integer not exceeding N/4
|D|           the cardinality of D, |D| = N
D*            the unlabeled data set $D^* = \{x_1, \ldots, x_S\}$
S             the size of the unlabeled data set D*
|D*|          the cardinality of D*, |D*| = S
D̃             the training data set used in the distributed MEE algorithm, constructed from D and D*
|D̃|           the cardinality of D̃, |D̃| = N + S
m             the number of local machines
D̃_l           the l-th subset of D̃, 1 ≤ l ≤ m
G             the windowing (loss) function of the MEE algorithm
L_K           the integral operator associated with K
L_{K,D̃}       the empirical version of L_K on D̃
f_{t+1,D}     the output of the kernel gradient descent MEE algorithm with data D and kernel K after t iterations
f_{t+1,D̃_l}   the output of the kernel gradient descent MEE algorithm with data D̃_l and kernel K after t iterations
f̄_{t+1,D̃}     the global output, averaging the local outputs f_{t+1,D̃_l}, l = 1, …, m

2. Algorithms and Main Results

We consider MEE for the regression problem. To allow noise in the sampling process, we assume that a Borel probability measure ρ is defined on the product space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let ρ(y|x) be the conditional distribution of y for any given x, and $\rho_X$ the marginal distribution on $\mathcal{X}$. For the semi-supervised MEE algorithm, our goal is to estimate the target function $f_\rho(x,u) = g_\rho(x) - g_\rho(u)$ from labeled examples $D = \{(x_i, y_i)\}_{i=1}^N$ and unlabeled examples $D^* = \{x_i\}_{i=1}^S$, drawn from the distributions ρ and $\rho_X$, respectively.

Based on the divide-and-conquer strategy, both D and D* are partitioned equally into m subsets, $D = \cup_{l=1}^m D_l$ and $D^* = \cup_{l=1}^m D_l^*$, so that each local machine holds $|D_l| = N/m$ labeled and $|D_l^*| = S/m$ unlabeled examples. On the l-th machine, we construct a new data set $\tilde{D}_l$ consisting of the points of $D_l$ and $D_l^*$ with the rescaled labels

$$\tilde{y} = \begin{cases} \frac{|\tilde{D}_l|}{|D_l|}\, y, & \text{if } (x,y) \in D_l, \\ 0, & \text{if } x \in D_l^*. \end{cases}$$

Based on the gradient descent algorithm (Equation (2)), we can get a set of local estimators $f_{t+1,\tilde{D}_l}$, one for each subset $\tilde{D}_l$. Then, the global estimator averaging over these local estimators is given by

$$\bar{f}_{t+1,\tilde{D}} = \frac{1}{m} \sum_{l=1}^m f_{t+1,\tilde{D}_l}. \qquad (3)$$

In the pairwise setting, our target function is $f_\rho(x,u) = g_\rho(x) - g_\rho(u)$, the difference of the regression function. Denote by $L^2_{\rho_X \times \rho_X}$ the space of square-integrable functions on the product space $\mathcal{X} \times \mathcal{X}$, with norm

$$\|f\|_\rho = \Big( \int_{\mathcal{X} \times \mathcal{X}} |f(x,u)|^2 \, d\rho_X(x) \, d\rho_X(u) \Big)^{1/2}.$$

The goodness of $\bar{f}_{t+1,\tilde{D}}$ is usually measured by the mean squared error $\|\bar{f}_{t+1,\tilde{D}} - f_\rho\|_\rho^2$. Throughout the paper, we assume that $|y| \le M$ almost surely for some constant M > 0, and that $\kappa := \sup_{(x,u) \in \mathcal{X} \times \mathcal{X}} \sqrt{K((x,u),(x,u))} < \infty$. Without loss of generality, the windowing function G is assumed to be differentiable and to satisfy $G'(t) \le 0$ for $t \ge 0$ and $G'(0) = -1$, and there exist some p > 0 and a constant $c_p > 0$ such that

$$|G'(t) - G'(0)| \le c_p t^p, \qquad t \ge 0.$$

It is easy to check that the Gaussian windowing function $G(t) = e^{-t}$ satisfies the assumptions above with p = 1. Before we present our main results, define the integral operator $L_K$ associated with the kernel K by

$$L_K f = \int_{\mathcal{X} \times \mathcal{X}} f(x,u) \, K_{(x,u)} \, d\rho_X(x) \, d\rho_X(u), \qquad f \in L^2_{\rho_X \times \rho_X}.$$

Our error analysis for the distributed MEE algorithm (Equation (3)) is stated in terms of the following regularity condition:

$$f_\rho = L_K^r(u_\rho) \quad \text{for some } r > 0 \text{ and } u_\rho \in L^2_{\rho_X \times \rho_X}, \qquad (5)$$

where $L_K^r$ denotes the r-th power of $L_K$ on $L^2_{\rho_X \times \rho_X}$, which is well defined since the operator $L_K$ is positive and compact for the Mercer kernel K. We use the effective dimension [12,13] to measure the complexity of $\mathcal{H}_K$ with respect to $\rho_X \times \rho_X$, defined to be the trace

$$\mathcal{N}(\lambda) = \mathrm{Tr}\big( (L_K + \lambda I)^{-1} L_K \big), \qquad \lambda > 0.$$

To obtain optimal learning rates, we need to quantify the decay of $\mathcal{N}(\lambda)$. A suitable assumption is that there exist $0 < \beta \le 1$ and $C_0 > 0$ such that

$$\mathcal{N}(\lambda) \le C_0 \lambda^{-\beta}, \qquad \lambda > 0. \qquad (6)$$

When β = 1, this condition always holds with $C_0 = \mathrm{Tr}(L_K) \le \kappa^2$; smaller β corresponds to a faster-decaying spectrum of $L_K$. The following theorem shows that the distributed gradient descent algorithm (Equation (3)) can achieve the optimal rate, by prescribing the iteration time T and the maximal number of local machines; its proof can be found in Section 3.

Theorem 1. Assume Equations (5) and (6), and let the iteration time T, the step sizes $\eta_t$, and the scaling parameter h be chosen appropriately in terms of the total data size $|\tilde{D}|$. Then, for any 0 < δ < 1, with confidence at least 1 − δ, the mean squared error of the global estimator $\bar{f}_{T,\tilde{D}}$ achieves the minimax-optimal learning rate of regression, provided the number m of local machines increases at most polynomially with $|\tilde{D}|$.

Under the same conditions of Theorem 1, if the scaling parameter h grows suitably with the sample size, the same conclusion holds for any 0 < δ < 1. The rate obtained here coincides with the minimax-optimal rate known for least squares regression. If no unlabeled data are engaged in the algorithm (Equation (3) with S = 0), the admissible range of m becomes much more restrictive. A series of distributed works, including the divide-and-conquer literature [10], obtained comparable rates for distributed least squares; our result extends them to the MEE setting.
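A minimal sketch of the divide-and-conquer pipeline just described (again our own illustration, reusing mee_gradient_descent from the sketch in Section 1; the rescaled-label construction follows the display above):

```python
import numpy as np

def make_semisupervised_labels(x_lab, y_lab, x_unlab):
    """Merge labeled and unlabeled points on one machine, rescaling labels
    by |D~_l| / |D_l| and assigning 0 to unlabeled points."""
    n, s = len(x_lab), len(x_unlab)
    scale = (n + s) / n
    x = np.concatenate([x_lab, x_unlab])
    y = np.concatenate([scale * np.asarray(y_lab, float), np.zeros(s)])
    return x, y

def distributed_mee(X, y, X_star, m, **gd_kwargs):
    """Split D and D* into m parts, run local kernel gradient descent on
    each augmented subset D~_l, and keep the local models for averaging."""
    parts = zip(np.array_split(np.asarray(X, float), m),
                np.array_split(np.asarray(y, float), m),
                np.array_split(np.asarray(X_star, float), m))
    local_models = []
    for xl, yl, xs in parts:
        xt, yt = make_semisupervised_labels(xl, yl, xs)
        A, _ = mee_gradient_descent(xt, yt, **gd_kwargs)
        local_models.append((xt, A))
    return local_models

def predict_pair(local_models, x, u, sigma=0.5):
    """Global estimator: average the m local predictions at the pair (x, u).
    sigma must match the kernel width used in training."""
    vals = []
    for xt, A in local_models:
        kx = np.exp(-(xt - x) ** 2 / (2 * sigma ** 2))
        ku = np.exp(-(xt - u) ** 2 / (2 * sigma ** 2))
        vals.append(kx @ A @ ku)
    return float(np.mean(vals))
```

Each machine only touches $(|\tilde{D}|/m)^2$ example pairs, so the quadratic pair cost is reduced by a factor of m, while the final averaging step retains the statistical accuracy of the whole sample.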

3. Proof of Main Result

In this section, we prove our main results in Theorem 1. To this end, we introduce the data-free (population) gradient descent method in $\mathcal{H}_K$ for least squares, defined by $f_1 = 0$ and

$$f_{t+1} = f_t - \eta_t L_K (f_t - f_\rho). \qquad (10)$$

Recalling the definition of $L_K$, it can be written as

$$f_{t+1} = (I - \eta_t L_K) f_t + \eta_t L_K f_\rho.$$

Following the standard decomposition technique in learning theory, we split the error into the sample error and the approximation error:

$$\|\bar{f}_{t+1,\tilde{D}} - f_\rho\|_\rho \le \|\bar{f}_{t+1,\tilde{D}} - f_{t+1}\|_\rho + \|f_{t+1} - f_\rho\|_\rho.$$

3.1. Approximation Error

Firstly, we estimate the approximation error $\|f_{t+1} - f_\rho\|_\rho$. The bound has been proven in Reference [20] and is shown in the following lemma.

Lemma 1. Define $\{f_t\}$ by Equation (10) with $f_1 = 0$ and step sizes satisfying $\eta_t \|L_K\| \le 1$. When Equation (5) holds with r > 0,

$$\|f_{t+1} - f_\rho\|_\rho \le \Big( \frac{r}{e} \Big)^r \Big( \sum_{k=1}^t \eta_k \Big)^{-r} \|u_\rho\|_\rho.$$

Moreover, when $r \ge 1/2$, Equation (10) also yields a uniform bound for the sequence $\{f_t\}$ in $\mathcal{H}_K$, which is useful in our analysis. Here and in the sequel, denote by $\pi_s^t(L)$ the polynomial operator associated with an operator L, defined by

$$\pi_s^t(L) = \prod_{k=s}^t (I - \eta_k L),$$

and we use the conventional notation $\pi_{t+1}^t(L) = I$.

Proof. Using Equation (10) iteratively from t down to 1, together with $f_1 = 0$, we have

$$f_{t+1} - f_\rho = -\pi_1^t(L_K) f_\rho.$$

With Equation (5),

$$f_{t+1} - f_\rho = -\pi_1^t(L_K) L_K^r u_\rho. \qquad (12)$$

Let $\{\sigma_i\}_i$ be the eigenvalues of the operator $L_K$. Since $L_K$ is positive and $\eta_k \|L_K\| \le 1$, the operator norm satisfies

$$\|\pi_1^t(L_K) L_K^r\| = \sup_i \, \sigma_i^r \prod_{k=1}^t (1 - \eta_k \sigma_i).$$

For each i, by a simple calculation with $1 - x \le e^{-x}$, we have

$$\prod_{k=1}^t (1 - \eta_k \sigma_i) \le \exp\Big( -\sigma_i \sum_{k=1}^t \eta_k \Big).$$

Thus, we have

$$\|\pi_1^t(L_K) L_K^r\| \le \sup_{\sigma \ge 0} \, \sigma^r \exp\Big( -\sigma \sum_{k=1}^t \eta_k \Big).$$

By the elementary inequality $x^r e^{-x} \le (r/e)^r$ with $x = \sigma \sum_{k=1}^t \eta_k$, it follows that

$$\|\pi_1^t(L_K) L_K^r\| \le \Big( \frac{r}{e} \Big)^r \Big( \sum_{k=1}^t \eta_k \Big)^{-r}.$$

Together with Equation (12), the proof is completed by taking norms. □
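For orientation, here is a worked instance of Lemma 1 (our own specialization, not a display from the paper): with a constant step size, the bound exhibits the familiar polynomial decay of gradient descent.

```latex
% Specialization of the approximation-error bound to a constant step size
% \eta_k \equiv \eta (our own worked instance). It assumes \eta\|L_K\| \le 1
% and the regularity condition f_\rho = L_K^r u_\rho of Equation (5).
\[
  \|f_{t+1} - f_\rho\|_\rho
  \;\le\; \Big( \sup_{\sigma \ge 0} \sigma^{r} e^{-\sigma \eta t} \Big) \|u_\rho\|_\rho
  \;=\; \Big( \frac{r}{e\,\eta} \Big)^{r} t^{-r}\, \|u_\rho\|_\rho .
\]
% Running more iterations thus plays the role of weakening regularization:
% the approximation error decays like t^{-r}, faster for smoother targets.
```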

3.2. Sample Error

Define the empirical operator $L_{K,\tilde{D}}$ by

$$L_{K,\tilde{D}} f = \frac{1}{|\tilde{D}|^2} \sum_{(x,y) \in \tilde{D}} \sum_{(u,v) \in \tilde{D}} f(x,u) \, K_{(x,u)}$$

for any $f \in \mathcal{H}_K$. Then, the MEE gradient descent algorithm (Equation (2)) on $\tilde{D}$ can be written as a linear iteration driven by $L_{K,\tilde{D}}$, plus a data term built from the rescaled labels and a remainder term caused by the nonlinearity of G'; the remainder is controlled by the scaling parameter h through the assumption $|G'(t) - G'(0)| \le c_p t^p$. With these preliminaries in place, we now turn to the estimates of the sample error, presented in the following lemma, whose proof can be found in the Appendix.

Lemma 3. The sample error $\|\bar{f}_{t+1,\tilde{D}} - f_{t+1}\|_\rho$ is bounded by two terms: one involving the regularized deviations between $L_K$ and its empirical versions on the subsets $\tilde{D}_l$, and one involving the windowing remainder, with constants depending on κ, M, and G.

To apply Lemma 3, we first need to estimate these deviation quantities. In previous work [19,21,22,23], it has been shown that each of the relevant deviation inequalities holds with confidence at least 1 − δ, with bounds expressed through the effective dimension $\mathcal{N}(\lambda)$ and the subset cardinalities. By Lemma 3, we also see that the regularization level λ is crucial in determining the final bound. To get a tight bound for the learning error, we should choose λ according to the regularity of the target function: when $f_\rho \in \mathcal{H}_K$, i.e., $r \ge 1/2$ in Equation (5), we take $\lambda = (\sum_{k=1}^t \eta_k)^{-1}$, the effective regularization induced by early stopping; when $f_\rho$ is out of the space $\mathcal{H}_K$, i.e., 0 < r < 1/2, we let λ be larger, so as to balance the sample and approximation errors. Now, we give the first main result, for the case when the target function is out of $\mathcal{H}_K$ with 0 < r < 1/2.

Theorem 2. Assume Equation (5) with 0 < r < 1/2 and Equation (6). Then, with confidence at least 1 − δ, the error $\|\bar{f}_{t+1,\tilde{D}} - f_\rho\|_\rho$ satisfies the bound of Equation (17).

Proof. Decompose $\bar{f}_{t+1,\tilde{D}} - f_\rho$ into the sample error and the approximation error. The estimate of the approximation error is presented in Lemma 1, so we only need to handle the sample error by Lemma 3. For any fixed λ and t, Equation (11) bounds the intermediate iterates, and taking this bound in Lemma 3 splits the sample error into two terms. An elementary inequality simplifies both terms, and plugging the resulting estimates into term 1 and term 2 gives bounds whose coefficients depend on κ, $\mathcal{N}(\lambda)$, and the step sizes. By Equation (16), for any fixed δ there exist three events, each with measure at least 1 − δ, on which the corresponding deviation bounds hold; thus, with confidence at least 1 − 3δ, all of them hold simultaneously, and therefore so do the resulting estimates for the two terms. By Equation (18), after rescaling 3δ to δ, the first term obeys the desired estimate with confidence at least 1 − δ; similarly, by Equation (19) and the same rescaling, so does the second. Together with Lemma 1, we obtain the desired bound (Equation (17)). □

Next, we give the result for the case when the target function is in $\mathcal{H}_K$ with $r \ge 1/2$.

Theorem 3. Assume Equation (5) with $r \ge 1/2$ and Equation (6). Then, with confidence at least 1 − δ, the error $\|\bar{f}_{t+1,\tilde{D}} - f_\rho\|_\rho$ satisfies the analogous bound. The proof is similar to that of Theorem 2, and we omit it here.

With these preliminaries in place, we can prove our main result in Theorem 1. We first prove Equation (8) by Theorem 2 when 0 < r < 1/2. Choose the iteration time T, the scaling parameter h, and λ as prescribed in Theorem 1; the assumptions on the step sizes, together with Equation (7), control the sums $\sum_{k=1}^t \eta_k$ from above and below, and hence all the quantities entering Theorem 2. Putting these estimates into Theorem 2, we obtain the desired conclusion (Equation (8)). When $r \ge 1/2$, we apply Theorem 3 and take the same proof procedure as above; the conclusion (Equation (8)) again follows. The proof is completed. □
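The mechanism of the sample-error analysis above can be summarized in one schematic display (our own presentation; $\Delta_{k,\tilde{D}}$ is our shorthand for the remainder collecting the nonlinearity of G', controlled by the scaling parameter h):

```latex
% Subtracting the population iteration (10) from the empirical iteration on
% \tilde{D} and telescoping with f_{1,\tilde D} = f_1 = 0 gives
\[
  f_{t+1,\tilde D} - f_{t+1}
  = \sum_{k=1}^{t} \pi_{k+1}^{t}(L_{K,\tilde D})\, \eta_k
    \Big( (L_K - L_{K,\tilde D})(f_k - f_\rho) + \Delta_{k,\tilde D} \Big).
\]
% The first summand is a pure deviation term, bounded through concentration
% inequalities involving the effective dimension N(lambda); the second is
% the windowing remainder, made small by choosing h large.
```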

4. Simulation and Conclusions

In this section, we provide a simulation to verify our theoretical statements. We assume that the inputs are independently drawn according to the uniform distribution on the input domain. Consider the regression model $y = g_\rho(x) + \epsilon$, where $\epsilon$ is independent Gaussian noise. We define the pairwise kernel K on example pairs from a univariate Mercer kernel and apply K to the distributed algorithm (Equation (3)). In Figure 1, we plot the mean squared error of Equation (3) for unlabeled data sizes S = 0, 300 and 600 as the number of local machines m varies. Note that S = 0 corresponds to the standard distributed MEE algorithm without unlabeled data. When m becomes large, the corresponding (red) error curve increases dramatically. However, when we add 300 or 600 unlabeled data points, the error curves increase only very slowly. This coincides with our theory that using unlabeled data can enlarge the admissible range of m in the distributed method.
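For readers who wish to replay the experiment qualitatively, here is a hedged sketch of the Figure 1 setup, reusing distributed_mee and predict_pair from the sketch in Section 2. The paper's exact regression function, kernel, and labeled sample size did not survive extraction, so g, N, and the noise level below are stand-ins of our own choosing.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def g(x):
    """Stand-in regression function (the paper's choice is not recoverable)."""
    return np.sin(2 * np.pi * x)

N = 200                                      # labeled sample size (our choice)
X = rng.uniform(0, 1, N)
y = g(X) + 0.2 * rng.normal(size=N)          # independent Gaussian noise

# Test pairs: the error of f(x, u) is measured against g(x) - g(u).
Xt, Ut = rng.uniform(0, 1, 200), rng.uniform(0, 1, 200)
target = g(Xt) - g(Ut)

ms = [2, 4, 8, 16, 32]                       # numbers of local machines
for S in (0, 300, 600):                      # unlabeled data sizes, as in Figure 1
    X_star = rng.uniform(0, 1, S)
    errs = []
    for m in ms:
        models = distributed_mee(X, y, X_star, m, T=100, eta=0.5, h=1.0)
        preds = np.array([predict_pair(models, a, b) for a, b in zip(Xt, Ut)])
        errs.append(np.mean((preds - target) ** 2))
    plt.plot(ms, errs, marker="o", label=f"S = {S}")
plt.xlabel("number of local machines m")
plt.ylabel("mean squared error")
plt.legend()
plt.show()
```

With S = 0, the error should blow up once the local labeled subsets become too small, while S = 300 or 600 should keep the curves nearly flat, mirroring the behavior reported for Figure 1.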
Figure 1

The mean squared errors for different sizes S of unlabeled data as the number of local machines m varies.

This paper studied the convergence rate of the distributed gradient descent MEE algorithm in a semi-supervised setting. Our results demonstrate that using additional unlabeled data can improve the learning performance of the distributed MEE algorithm, in particular by enlarging the range of m that guarantees the optimal learning rate. As is well known, gaps remain between theory and empirical studies; we regard this paper as mainly a theoretical contribution and expect the analysis to give some guidance for real applications.
References (3 in total):

1.  The MEE principle in data classification: a perceptron-based analysis.

Authors:  Luís M Silva; J Marques de Sá; Luís A Alexandre
Journal:  Neural Comput       Date:  2010-10       Impact factor: 2.026

2.  Learning bounds for kernel regression using effective data dimensionality.

Authors:  Tong Zhang
Journal:  Neural Comput       Date:  2005-09       Impact factor: 2.026

3.  Online Pairwise Learning Algorithms.

Authors:  Yiming Ying; Ding-Xuan Zhou
Journal:  Neural Comput       Date:  2016-02-18       Impact factor: 2.026

