
Online Gradient Descent for Kernel-Based Maximum Correntropy Criterion.

Baobin Wang, Ting Hu.

Abstract

In the framework of statistical learning, we study the online gradient descent algorithm generated by correntropy-induced losses in reproducing kernel Hilbert spaces (RKHS). As a generalized correlation measure, correntropy has been widely applied in practice owing to its robustness. Although the online gradient descent method is an efficient way to implement the maximum correntropy criterion (MCC) in non-parametric estimation, no consistency analysis or rigorous error bounds have been available for it. We provide a theoretical understanding of the online algorithm for MCC and show that, with a suitably chosen scaling parameter, its convergence rate can be minimax optimal (up to a logarithmic factor) in regression analysis. Our results show that the scaling parameter plays an essential role in both robustness and consistency.


Keywords:  correntropy; maximum correntropy criterion; online algorithm; reproducing kernel Hilbert spaces; robustness

Year:  2019        PMID: 33267358      PMCID: PMC7515137          DOI: 10.3390/e21070644

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Regression analysis is an important problem in many fields of science. The traditional least squares method is perhaps the most widely used regression algorithm in practice. However, it relies only on the mean squared error and belongs to second-order statistics, so its optimality depends heavily on the assumption of Gaussian noise; it usually performs poorly when the noise is not normally distributed. Alternative approaches have been proposed to deal with outliers or heavy-tailed distributions. A generalized correlation function named correntropy [1] was introduced as a substitute for the least squares loss, and the maximum correntropy criterion (MCC) [2,3,4,5] is used to improve robustness in situations with non-Gaussian and heavy-tailed error distributions. Recently, MCC has been successfully applied in many real applications, e.g., wind power forecasting and pattern recognition [6,7].

In the standard framework of statistical learning, let $X$ be an explanatory variable taking values in a compact metric space $\mathcal{X}$ and let $Y$ be a real response variable with values in $\mathcal{Y} \subseteq \mathbb{R}$. Here we investigate the application of MCC in the regression model
$$Y = f_\rho(X) + \epsilon, \qquad (1)$$
where $\epsilon$ is the noise and $f_\rho$ is the regression function, defined as the conditional mean $f_\rho(x) = \mathbb{E}(Y \mid X = x)$ at each $x \in \mathcal{X}$. The purpose of regression is to estimate the unknown target function $f_\rho$ from the sample $z = \{(x_t, y_t)\}_{t=1}^{T}$, which is drawn independently from the underlying unknown probability distribution $\rho$ on $\mathcal{X} \times \mathcal{Y}$.

For a hypothesis function $f : \mathcal{X} \to \mathbb{R}$ with the scaling parameter $\sigma > 0$, the correntropy between $f(X)$ and $Y$ is defined by
$$V_\sigma(f) = \mathbb{E}\Big[ G\big( (f(X) - Y)/\sigma \big) \Big],$$
where $G$ is the exponential function $G(u) = \exp(-u^2/2)$. For the given sample $z$, the empirical form of $V_\sigma(f)$ is
$$\widehat{V}_\sigma(f) = \frac{1}{T} \sum_{t=1}^{T} G\big( (f(x_t) - y_t)/\sigma \big).$$
When applied to regression problems, MCC maximizes the empirical correntropy over a certain underlying hypothesis space $\mathcal{H}$, that is,
$$f_z = \arg\max_{f \in \mathcal{H}} \widehat{V}_\sigma(f).$$
MCC has shown its efficiency in regression problems when the noise is non-Gaussian or contains large outliers; see [8,9,10]. It has also drawn much attention in the signal processing, machine learning and optimization communities [2,5,11,12,13,14].

Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel, i.e., a continuous, symmetric and positive semi-definite function. We say that $K$ is positive semi-definite if, for any finite set $\{u_1, \dots, u_n\} \subset \mathcal{X}$, the matrix $(K(u_i, u_j))_{i,j=1}^{n}$ is positive semi-definite. The RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ associated with the Mercer kernel $K$ is defined as the completion of the linear span of the function set $\{K_x := K(x, \cdot) : x \in \mathcal{X}\}$. It has the reproducing property
$$f(x) = \langle f, K_x \rangle_K \quad \text{for any } f \in \mathcal{H}_K \text{ and } x \in \mathcal{X}. \qquad (2)$$
Since $\mathcal{X}$ is compact, the RKHS $\mathcal{H}_K$ is contained in the space $C(\mathcal{X})$ of continuous functions on $\mathcal{X}$ with the norm $\|\cdot\|_\infty$. Moreover, if $\mathcal{X}$ is a Euclidean ball in $\mathbb{R}^d$, then the Sobolev space $H^s(\mathcal{X})$ with some $s > d/2$ is an RKHS. For more families of RKHS in statistical learning, one can refer to [15]. Denote $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$; then, by the reproducing property (2), there holds
$$\|f\|_\infty \le \kappa \|f\|_K \quad \text{for any } f \in \mathcal{H}_K. \qquad (3)$$

Denote by $\ell_\sigma$ the correntropy-induced regression loss, given by
$$\ell_\sigma(y, f(x)) = \sigma^2 \Big( 1 - G\big( (y - f(x))/\sigma \big) \Big).$$
Associated with this regression loss and the RKHS $\mathcal{H}_K$, MCC for the regression model (1) in the context of learning theory is reformulated as
$$f_z = \arg\min_{f \in \mathcal{H}_K} \frac{1}{T} \sum_{t=1}^{T} \ell_\sigma(y_t, f(x_t)). \qquad (4)$$
Notice that $\ell_\sigma$ is not convex, so MCC algorithms are usually implemented by various gradient descent methods [14,16,17]. In this paper, we solve the optimization scheme (4) by the following online gradient descent method, since it is scalable to large datasets and applicable to situations where the samples are presented in sequence. Given the sample $z$, define $f_1 = 0$ and
$$f_{t+1} = f_t - \eta_t \big( f_t(x_t) - y_t \big)\, G\big( (f_t(x_t) - y_t)/\sigma \big)\, K_{x_t}, \qquad t = 1, \dots, T, \qquad (5)$$
where $\{\eta_t > 0\}$ is a sequence of step sizes.

In the literature, most MCC algorithms have been implemented for linear models and cannot be applied to the analysis of data with nonlinear structures. Kernel methods provide efficient non-parametric learning algorithms for dealing with nonlinear features, so RKHS are used in this work as hypothesis spaces in the design of learning algorithms.
Online algorithms for MCC have been used in practical applications for more than a decade, but a theoretical guarantee or rigorous analysis of their asymptotic convergence is still lacking. Because the optimization problem arising from MCC is not convex, the global convergence of the online algorithm (5) for MCC is not unconditionally guaranteed, which makes the theoretical analysis of MCC essentially difficult. At the same time, extensive numerical studies show that MCC leads to robust estimators while retaining convenient convergence properties. Our goal is therefore to fill this gap between the theoretical analysis and the optimization process, by showing that the output of the online algorithm (5) converges to the global target even though existing work cannot guarantee the global optimality of this output. To this end, we study how well the output $f_{T+1}$ generated by (5) after $T$ iterations approximates the regression function $f_\rho$. We derive explicit error rates for (5) with suitable choices of step sizes, which are competitive with existing rates in regression analysis. In this work, we show that the scaling parameter $\sigma$ plays an important role in providing both robustness and a fast convergence rate.
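For readers who wish to experiment with the update (5), the following is a minimal Python sketch, not the authors' implementation. The iterate $f_t$ is stored as a kernel expansion over the inputs seen so far; the Gaussian RBF kernel, the polynomially decaying step-size schedule `eta0 * t**(-theta)`, and all parameter values are illustrative assumptions rather than choices from the paper.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (Mercer) kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def online_mcc(X, Y, sigma=1.0, eta0=0.5, theta=0.5, gamma=1.0):
    """One pass of the online update (5) for the correntropy-induced loss
    l_sigma(y, f(x)) = sigma^2 * (1 - exp(-(y - f(x))^2 / (2 * sigma^2))).

    The iterate f_t is kept as a kernel expansion
    f_t = sum_j coef[j] * K(centers[j], .), so each step appends one coefficient.
    """
    coef, centers = [], []
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        # evaluate the current iterate f_t at the new input x_t
        f_x = sum(a * rbf_kernel(c, x, gamma) for a, c in zip(coef, centers))
        residual = f_x - y
        # redescending weight: gross outliers receive an exponentially small weight
        weight = np.exp(-residual ** 2 / (2.0 * sigma ** 2))
        eta_t = eta0 * t ** (-theta)   # polynomially decaying step size
        coef.append(-eta_t * residual * weight)
        centers.append(x)
    return centers, coef

def predict(centers, coef, x, gamma=1.0):
    """Evaluate the learned function f_{T+1} at a new point x."""
    return sum(a * rbf_kernel(c, x, gamma) for a, c in zip(coef, centers))
```

Storing the iterate as a kernel expansion is what makes the RKHS update implementable in practice: each step only introduces one new kernel section $K_{x_t}$, so step $t$ costs $O(t)$ kernel evaluations, and the weight factor is always bounded by one, in line with the boundedness of the iterates used in the analysis below.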

2. Preliminaries and Main Results

We begin with some preliminaries and notation. Throughout the paper, we assume that the unknown distribution $\rho$ on $\mathcal{X} \times \mathcal{Y}$ can be decomposed into the marginal distribution $\rho_{\mathcal{X}}$ on $\mathcal{X}$ and the conditional distribution $\rho(\cdot \mid x)$ at each $x \in \mathcal{X}$. We also require that $|y| \le M$ almost surely for some $M > 0$. In regression analysis, the approximation power of $f_{T+1}$ produced by (5) is usually measured by the mean squared error in the $L^2_{\rho_{\mathcal{X}}}$-metric, defined as
$$\|f_{T+1} - f_\rho\|_\rho^2 = \int_{\mathcal{X}} \big( f_{T+1}(x) - f_\rho(x) \big)^2 \, d\rho_{\mathcal{X}}.$$

To present our main result on the error bound of $f_{T+1}$, we state the assumption on the target function $f_\rho$. Define the integral operator $L_K : L^2_{\rho_{\mathcal{X}}} \to L^2_{\rho_{\mathcal{X}}}$ associated with the kernel $K$ by
$$L_K f = \int_{\mathcal{X}} f(x) K_x \, d\rho_{\mathcal{X}}(x).$$
By the reproducing property (2), for any $f \in L^2_{\rho_{\mathcal{X}}}$ it can be expressed as $L_K f(u) = \int_{\mathcal{X}} K(u, x) f(x) \, d\rho_{\mathcal{X}}(x)$. Since $K$ is a Mercer kernel, $L_K$ is compact and positive. Denote $L_K^r$ as the $r$-th power of $L_K$; it is well defined for any $r > 0$ by the spectral theorem. Let $\{\lambda_i\}_i$ be the eigenvalues of $L_K$ arranged in decreasing order, whose corresponding eigenfunctions $\{\phi_i\}_i$ form an orthonormal basis of $L^2_{\rho_{\mathcal{X}}}$. Hence, the regularity space $L_K^r(L^2_{\rho_{\mathcal{X}}})$ is expressed as [18]
$$L_K^r\big(L^2_{\rho_{\mathcal{X}}}\big) = \Big\{ f = \sum_i a_i \phi_i : \ \sum_i \big( a_i / \lambda_i^r \big)^2 < \infty \Big\}.$$
It implies that every $f \in L_K^r(L^2_{\rho_{\mathcal{X}}})$ can be written as $f = L_K^r g$ for some $g \in L^2_{\rho_{\mathcal{X}}}$. In particular, $L_K^{1/2}(L^2_{\rho_{\mathcal{X}}}) = \mathcal{H}_K$, and for any $f \in \mathcal{H}_K$ there holds $\|f\|_K = \|L_K^{-1/2} f\|_\rho$.

Throughout the paper, we assume the regularity condition $f_\rho \in L_K^r(L^2_{\rho_{\mathcal{X}}})$ for some $r > 0$, i.e., $f_\rho = L_K^r g_\rho$ with $g_\rho \in L^2_{\rho_{\mathcal{X}}}$. This assumption is called the source condition [19] in inverse problems, and it characterizes the smoothness of the target function $f_\rho$: the larger the parameter $r$ is, the higher the regularity of $f_\rho$. The general source conditions considered in inverse problems usually take the form $f_\rho \in \varphi(L_K)(L^2_{\rho_{\mathcal{X}}})$ (9), where $\varphi$ is non-decreasing and called the index function. The above assumption is the special case of (9) with $\varphi(u) = u^r$. It should be pointed out that our analysis in this work can also be applied to more general cases by taking the source conditions (9).

We are now in a position to state our convergence rates for (5), in the $L^2_{\rho_{\mathcal{X}}}$-metric as well as in the $\mathcal{H}_K$-norm, obtained by choosing the step sizes $\eta_t$ appropriately (decaying polynomially in $t$). For brevity, denote by $\mathbb{E}$ the expectation with respect to the sample $z$. Our main result, Theorem 1, shows that under the above regularity assumption, with suitably chosen step sizes and a scaling parameter $\sigma$ chosen appropriately in terms of $T$, the expected error $\mathbb{E}\|f_{T+1} - f_\rho\|_\rho^2$ converges at a rate that is minimax optimal up to a logarithmic factor (bound (10)); a corresponding bound (11) holds in the $\mathcal{H}_K$-norm.

Several earlier works have derived error rates for online least squares algorithms in RKHS and for kernel-based MCC under related conditions. Based on these remarks, we see that the convergence rate of online kernel-based MCC is comparable to that of the least squares algorithm that has appeared in the literature [24]. Meanwhile, MCC's redescending property (the derivative of $\ell_\sigma$ as a function of the residual $u$, namely $u \mapsto u\,G(u/\sigma)$, increases only up to $|u| = \sigma$ and then decays to zero) produces robustness to various outliers, including sub-Gaussian noise, Student's t-distribution and the Cauchy distribution. All of this shows the advantages of MCC in a variety of applications, such as clustering, classification and feature selection [14]. At the end of this section, we would like to point out that, although our work is carried out under the boundedness condition on the output $y$, it can be extended to more general situations such as moment conditions [20].
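To illustrate numerically the role of the scaling parameter $\sigma$ discussed above, here is a small sketch of the correntropy-induced loss and its derivative; the function names and the particular residual values are assumptions made only for this illustration. It shows that for large $\sigma$ the loss behaves like the squared loss, while for moderate $\sigma$ the gradient is redescending and essentially ignores gross outliers.

```python
import numpy as np

def correntropy_loss(u, sigma):
    """Correntropy-induced loss l_sigma(u) = sigma^2 * (1 - exp(-u^2 / (2 sigma^2)))."""
    return sigma ** 2 * (1.0 - np.exp(-u ** 2 / (2.0 * sigma ** 2)))

def correntropy_grad(u, sigma):
    """Derivative of l_sigma: u * exp(-u^2 / (2 sigma^2)); it peaks at |u| = sigma
    and decays to zero for larger residuals (the redescending property)."""
    return u * np.exp(-u ** 2 / (2.0 * sigma ** 2))

residuals = np.array([0.1, 1.0, 5.0, 50.0])   # the last value mimics a gross outlier
for sigma in (0.5, 2.0, 100.0):
    print(f"sigma = {sigma}")
    print("  loss    :", np.round(correntropy_loss(residuals, sigma), 3))
    print("  gradient:", np.round(correntropy_grad(residuals, sigma), 3))
# For sigma = 100 the loss values are close to u^2 / 2 (least-squares behaviour),
# whereas for sigma = 0.5 the gradient at the outlier u = 50 is essentially zero.
```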

3. Proofs of Main Result

In this section, we prove the main results of Theorem 1. First, we derive a uniform bound for the iteration sequence $\{f_t\}$ generated by (5); this bound is stated as (12), together with an auxiliary relation (13).

We prove (12) by induction. It trivially holds for $t = 1$ since $f_1 = 0$. Suppose that (12) holds for some $t \ge 1$. By the update rule (5) and the reproducing property (2), the increment $f_{t+1} - f_t$ is a multiple of $K_{x_t}$, whose $\mathcal{H}_K$-norm is controlled by $\kappa$, the step size $\eta_t$, the residual $|f_t(x_t) - y_t|$ and the weight $G\big((f_t(x_t) - y_t)/\sigma\big) \le 1$. Since $|y_t| \le M$ almost surely and $\|f_t\|_\infty \le \kappa \|f_t\|_K$ by (3), the induction hypothesis yields (12) for $t + 1$. Then the proof is completed. □

Next, we establish a proposition which is crucial for proving the convergence rates in Theorem 1. It is closely related to the generalization error of the iterates. Define the generalization error for any measurable function $f$ by
$$\mathcal{E}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \big( f(x) - y \big)^2 \, d\rho.$$
The regression function $f_\rho$ that we want to learn or approximate is a minimizer of $\mathcal{E}$, that is, $f_\rho = \arg\min_f \mathcal{E}(f)$, and a simple computation yields the relation
$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2 \quad \text{for any } f \in L^2_{\rho_{\mathcal{X}}}.$$

Proposition 1 provides the error bounds (14) and (15) for $f_{T+1}$ in the $L^2_{\rho_{\mathcal{X}}}$-metric and in the $\mathcal{H}_K$-norm, respectively. For brevity, we introduce the rank-one operator associated with $K_{x_t}$ and define a random variable measuring the deviation of the stochastic gradient from its expectation. By (5), the error $f_{t+1} - f_\rho$ satisfies a recursive equality; applying this equality iteratively from $t = 1$ to $t = T$ and using an elementary inequality for the squared norm of a sum gives the decomposition (18). To prove (14), we consider the first term on the right-hand side of (18). Observe that $f_t$ depends only on the samples $z_1, \dots, z_{t-1}$, not on $z_t$; using this fact, the second term on the right-hand side of (19) can be rewritten by means of (20), and, with (7), we obtain (21), where the remaining factor is bounded through (3) and Lemma A1. The last term on the right-hand side of (19) is estimated by using (20) again, which yields (22). Plugging (21) and (22) into (19), and combining the result with (18), gives the desired conclusion (14). Now we turn to bounding the error in the $\mathcal{H}_K$-norm: starting from (17) again and following the same procedure as in the proof of (14), and noticing the relation between the two norms, the bound (15) is obtained. □

Based on the error bounds of Proposition 1, we also need to estimate the generalization error along the iterates; the resulting bound is (25). We prove (25) by induction. Obviously, (25) holds for $t = 1$. Suppose that (25) holds up to step $t$, and apply (14). Since the Gaussian $G$ is Lipschitz continuous, the increment of the loss is bounded for each $x$ in terms of the residual, where the last inequality is derived from (3). Using the boundedness of the outputs, the last term on the right-hand side of (26) is bounded as in (27), and the first term is bounded directly. Putting these estimates into (26) and using the relation (13), and then invoking the restriction (24) on the step sizes together with Lemma A3, we can plug the resulting bound into (28), which completes the induction. Then the proof is completed. □

With these preliminaries in place, we prove our main results. We prove Theorem 1 by Proposition 1. First, we use (14) to estimate the error rate of (5) in the $L^2_{\rho_{\mathcal{X}}}$-metric. For the first term on the right-hand side of (14), we apply Lemma A2. For the second term, the choice of the step sizes and $T$ in Theorem 1 implies that the restriction (24) holds, so we can put the bound (12) into (25); this together with Lemma A3 controls the second term. Finally, we bound the last term on the right-hand side of (14) by using the estimate (27) and the bound (12). Based on the above analysis, the conclusion (10) is obtained by taking the parameters as specified in Theorem 1. Similarly, we can obtain the conclusion (11) by applying (15) instead of (14). □
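The convergence behaviour analysed above can also be probed empirically. Below is a small, self-contained simulation, a sketch under assumed settings (a Gaussian kernel on [0, 1], Student-t noise with 2 degrees of freedom, step-size exponent 2/3, and sigma = 1), that runs the online MCC update (5) alongside the plain online least squares update and reports the empirical $L^2_{\rho_{\mathcal{X}}}$ error of the final iterate; it is not the experimental protocol of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(u, v, gamma=20.0):
    """Gaussian kernel on the unit interval."""
    return np.exp(-gamma * (u - v) ** 2)

def f_star(x):
    """Target regression function used to generate the synthetic data."""
    return np.sin(2 * np.pi * x)

def run(T, sigma=None, eta0=1.0, theta=2.0 / 3.0):
    """Online gradient descent in the RKHS: correntropy loss if sigma is given,
    otherwise the plain least-squares update (weight fixed to 1)."""
    xs = rng.uniform(0.0, 1.0, size=T)
    ys = f_star(xs) + rng.standard_t(df=2, size=T)   # heavy-tailed noise
    coef, centers = [], []
    for t in range(1, T + 1):
        x, y = xs[t - 1], ys[t - 1]
        f_x = sum(a * k(c, x) for a, c in zip(coef, centers))
        r = f_x - y
        w = 1.0 if sigma is None else np.exp(-r ** 2 / (2.0 * sigma ** 2))
        coef.append(-eta0 * t ** (-theta) * r * w)
        centers.append(x)
    # empirical L2 error of f_{T+1} over a uniform grid approximating rho_X
    grid = np.linspace(0.0, 1.0, 200)
    fhat = np.array([sum(a * k(c, g) for a, c in zip(coef, centers)) for g in grid])
    return np.mean((fhat - f_star(grid)) ** 2)

for T in (200, 800):
    err_mcc = run(T, sigma=1.0)
    err_ls = run(T, sigma=None)
    print(f"T = {T}:  MCC error = {err_mcc:.3f}   least-squares error = {err_ls:.3f}")
```

Under these assumed settings, the error of the MCC iterate typically decreases as $T$ grows while remaining insensitive to the heavy-tailed noise, which is the qualitative behaviour predicted by Theorem 1.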
References (2 in total):

1.  Robust principal component analysis based on maximum correntropy criterion.

Authors:  Ran He; Bao-Gang Hu; Wei-Shi Zheng; Xiang-Wei Kong
Journal:  IEEE Trans Image Process       Date:  2011-01-06       Impact factor: 10.856

2.  Maximum Correntropy Criterion for Robust Face Recognition.

Authors:  Ran He; Wei-Shi Zheng; Bao-Gang Hu
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-12-10       Impact factor: 6.226

