Kevin R Moon1, Kumar Sricharan2, Kristjan Greenewald3, Alfred O Hero4. 1. Genetics Department and Applied Math Program, Yale University, New Haven, CT 06520, USA. 2. Intuit Inc., Mountain View, CA 94043, USA. 3. IBM Research, Cambridge, MA 02142, USA. 4. Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109, USA.
Abstract
Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary, and therefore, the boundary correction is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for the tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We show extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate the bounds on the Bayes error rate of a cell classification problem.
Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary, and therefore, the boundary correction is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for the tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We show extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate the bounds on the Bayes error rate of a cell classification problem.
Information divergences are integral functionals of two probability distributions and have many applications in the fields of information theory, statistics, signal processing, and machine learning. Some applications of divergences include estimating the decay rates of error probabilities [1], estimating bounds on the Bayes error [2,3,4,5,6,7,8] or the minimax error [9] for a classification problem, extending machine learning algorithms to distributional features [10,11,12,13], testing the hypothesis that two sets of samples come from the same probability distribution [14], clustering [15,16,17], feature selection and classification [18,19,20], blind source separation [21,22], image segmentation [23,24,25], and steganography [26]. For many more applications of divergence measures, see reference [27]. There are many information divergence families including Alpha- and Beta-divergences [28] as well as f-divergences [29,30]. In particular, the f-divergence family includes the well-known Kullback–Leibler (KL) divergence [31], the Rényi- divergence integral [32], the Hellinger–Bhattacharyya distance [33,34], the Chernoff- divergence [5], the total variation distance, and the Henze–Penrose divergence [6].Despite the many applications of divergences between continuous random variables, there are no nonparametric estimators of these functionals that achieve the parametric mean squared error (MSE) convergence rate, are simple to implement, do not require knowledge of the boundary of the density support set, and apply to a large set of divergence functionals. In this paper, we present the first information divergence estimator that achieves all of the above. Specifically, we address the problem of estimating divergence functionals when only a finite population of independent and identically distributed (i.i.d.) samples is available from the two d-dimensional distributions that are unknown, nonparametric, and smooth. Our contributions are as follows:We propose the first information divergence estimator, referred to as EnDive, that is based on ensemble methods. The ensemble estimator takes a weighted average of an ensemble of weak kernel density plug-in estimators of divergence where the weights are chosen to improve the MSE convergence rate. This ensemble construction makes it very easy to implement EnDive.We prove that the proposed ensemble divergence estimator achieves the optimal parametric MSE rate of , where N is the sample size when the densities are sufficiently smooth. In particular, EnDive achieves these rates without explicitly performing boundary correction which is required for most other estimators. Furthermore, we show that the convergence rates are uniform.We prove that EnDive obeys a central limit theorem and thus, can be used to perform inference tasks on the divergence such as testing that two populations have identical distributions or constructing confidence intervals.
1.1. Related Work
Much work has focused on the problem of estimating the entropy and the information divergence of discrete random variables [1,29,35,36,37,38,39,40,41,42,43]. However, the estimation problem for discrete random variables differs significantly from the continuous case and thus employs different tools for both estimation and analysis.One approach to estimating the differential entropy and information divergence of continuous random variables is to assume a parametric model for the underlying probability distributions [44,45,46]. However, these methods perform poorly when the parametric model does not fit the data well. Unfortunately, the structure of the underlying data distribution is unknown for many applications, and thus the chance for model misspecification is high. Thus, in many of these applications, parametric methods are insufficient, and nonparametric estimators must be used.While several nonparametric estimators of divergence functionals between continuous random variables have been previously defined, the convergence rates are known for only a few of them. Furthermore, the asymptotic distributions of these estimators are unknown for nearly all of them. For example, Póczos and Schneider [10] established a weak consistency for a bias-corrected k-nearest neighbor (nn) estimator for Rényi- and other divergences of a similar form where k was fixed. Li et al. [47] examined k-nn estimators of entropy and the KL divergence using hyperspherical data. Wang et al. [48] provided a k-nn based estimator for KL divergence. Plug-in histogram estimators of mutual information and divergence have been proven to be consistent [49,50,51,52]. Hero et al. [53] provided a consistent estimator for Rényi- divergence when one of the densities is known. However none of these works studied the convergence rates or the asymptotic distribution of their estimators.There has been recent interest in deriving convergence rates for divergence estimators for continuous data [54,55,56,57,58,59,60]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition [61]:Let
for allFrom Definition 1, it is clear that if a function (f) belongs to , then f is continuously differentiable up to order . In this work, we show that EnDive achieves a parametric MSE convergence rate of when and , depending on the specific form of the divergence function.Nguyen et al. [56] proposed an f-divergence estimator that estimates the likelihood ratio of the two densities by solving a convex optimization problem and then plugging it into the divergence formulas. The authors proved that the minimax MSE convergence rate is parametric when the likelihood ratio is a member of the bounded Hölder class with . However, this estimator is restricted to true f-divergences and may not apply to the broader class of divergence functionals that we consider here (as an example, the divergence is not an f-divergence). Additionally, solving the convex problem of [56] has similar computational complexity to that of training a support vector machine (SVM) (between and ), which can be demanding when N is large. In contrast, the EnDive estimator that we propose requires only the construction of simple density plug-in estimates and the solution of an offline convex optimization problem. Therefore, the most computationally demanding step in the EnDive estimator is the calculation of the density estimates, which has a computational complexity no greater than .Singh and Póczos [58,59] provided an estimator for Rényi- divergences as well as general density functionals that use a “mirror image” kernel density estimator. They proved that these estimators obtain an MSE convergence rate of when for each of the densities. However their approach requires several computations at each boundary of the support of the densities which is difficult to implement as d gets large. Also, this computation requires knowledge of the support (specifically, the boundaries) of the densities which is unknown in most practical settings. In contrast, while our assumptions require the density support sets to be bounded and the boundaries to be smooth, knowledge of the support is not required to implement EnDive.The “linear” and “quadratic” estimators presented by Krishnamurthy et al. [57] estimate divergence functionals that include the form for given and where and are probability densities. These estimators achieve the parametric rate when and for the linear and quadratic estimators, respectively. However, the latter estimator is computationally infeasible for most functionals, and the former requires numerical integration for some divergence functionals which can be computationally difficult. Additionally, while a suitable - indexed sequence of divergence functionals of this form can be constructed that converge to the KL divergence, this does not guarantee convergence of the corresponding sequence of divergence estimators, as shown in reference [57]. In contrast, EnDive can be used to estimate the KL divergence directly. Other important f-divergence functionals are also excluded from this form including some that bound the Bayes error [2,4,6]. In contrast, our method applies to a large class of divergence functionals and avoids numerical integration.Finally, Kandasamy et al. [60] proposed influence function-based estimators of distributional functionals including divergences that achieve the parametric rate when . While this method can be applied to general functionals, the estimator requires numerical integration for some functionals. Additionally, the estimators in both Kandasamy et al. [60] and Krishnamurthy et al. [57] require an optimal kernel density estimator. This is difficult to construct when the density support is bounded as it requires difficult computations at the density support set boundary and therefore, knowledge of the density support set. In contrast, Endive does not require knowledge of the support boundary.In addition to the MSE convergence rates, the asymptotic distribution of divergence estimators is of interest. Asymptotic normality has been established for certain divergences between a specific density estimator and the true density [62,63,64]. This differs from the problem we consider where we assume that both densities are unknown. The asymptotic distributions of the estimators in references [56,57,58,59] are currently unknown. Thus, it is difficult to use these estimators for hypothesis testing which is crucial in many scientific applications. Kandasamy et al. [60] derived the asymptotic distribution of their data-splitting estimator but did not prove similar results for their leave-one-out estimator. We establish a central limit theorem for EnDive which greatly enhances its applicability in scientific settings.Our ensemble divergence estimator reduces to an ensemble entropy estimator as a special case when data from only one distribution is considered and the other density is set to a uniform measure (see reference [28] for more on the relationship between entropy and information divergence). The resultant entropy estimator differs from the ensemble entropy estimator proposed by Sricharan et al. [65] in several important ways. First, the density support set must be known for the estimator in reference [65] to perform the explicit boundary correction. In contrast, the EnDive estimator does not require any boundary correction. To show this requires a significantly different approach to prove the bias and variance rates of the EnDive estimator. Furthermore, the EnDive results apply under more general assumptions for the densities and the kernel used in the weak estimators. Finally, the central limit theorem applies to the EnDive estimator which is currently unknown for the estimator in reference [65].We also note that Berrett et al. [66] proposed a modification of the Kozachenko and Leonenko estimator of entropy [67] that takes a weighted ensemble estimation approach. While their results require stronger assumptions for the smoothness of the densities than ours do, they did obtain the asymptotic distribution of their weighted estimator and they also showed that the asymptotic variance of the estimator is not increased by taking a weighted average. This latter point is an important selling point of the ensemble framework—we can improve the asymptotic bias of an estimator without increasing the asymptotic variance.
1.2. Organization and Notation
The paper is organized as follows. We first derive the MSE convergence rates in Section 2 for a weak divergence estimator, which is a kernel density plug-in divergence estimator. We then generalize the theory of optimally weighted ensemble entropy estimation developed in reference [65] to obtain the ensemble divergence estimator EnDive from an ensemble of weak estimators in Section 3. A central limit theorem and uniform convergence rate for the ensemble estimator are also presented in Section 3. In Section 4, we provide guidelines for selecting the tuning parameters based on experiments and the theory derived in the previous sections. We then perform experiments in Section 4 that validate the theory and establish the robustness of the proposed estimators to the tuning parameters.Bold face type is used for random variables and random vectors. The conditional expectation given a random variable is denoted as . The variance of a random variable is denoted as , and the bias of an estimator is denoted as .
2. The Divergence Functional Weak Estimator
This paper focuses on estimating functionals of the form
where is a smooth functional, and and are smooth d-dimensional probability densities. If
g is convex, and , then defines the family of f-divergences. Some common divergences that belong to this family include the KL divergence () and the total variation distance (). In this work, we consider a broader class of functionals than the f-divergences, since g is allowed to be very general.To estimate , we first define a weak plug-in estimator based on kernel density estimators (KDEs), that is, a simple estimator that converges slowly to the true value in terms of MSE. We then derive the bias and variance expressions for this weak estimator as a function of sample size and bandwidth. We then use the resulting bias and variance expressions to derive an ensemble estimator that takes a weighted average of weak estimators with different bandwidths and achieves superior MSE performance.
2.1. The Kernel Density Plug-in Estimator
We use a kernel density plug-in estimator of the divergence functional in (1) as the weak estimator. Assume that i.i.d. realizations are available from and i.i.d. realizations are available from . Let be the kernel bandwidth for the density estimator of . For simplicity of presentation, assume that and . The results for the more general case of differing sample sizes and bandwidths are given in Appendix C. Let be a kernel function with and where is the norm of the kernel (K). The KDEs for and are, respectively,
where . is then approximated as
2.2. Convergence Rates
For many estimators, MSE convergence rates are typically provided in the form of upper (or sometimes lower) bounds on the bias and the variance. Therefore, only the slowest converging terms (as a function of the sample size (N)) are presented in these cases. However, to apply our generalized ensemble theory to obtain estimators that guarantee the parametric MSE rate, we required explicit expressions for the bias of the weak estimators in terms of the sample size (N) and the kernel bandwidth (h). Thus, an upper bound was insufficient for our work. Furthermore, to guarantee the parametric rate, we required explicit expressions of all bias terms that converge to zero slower than .To obtain bias expressions, we required multiple assumptions on the densities and , the functional g, and the kernel K. Similar to reference [7,54,65], the principal assumptions we make were that (1) , and g are smooth; (2) and have common bounded support sets ; and (3) and are strictly lower bounded on . We also assume (4) that the density support set is smooth with respect to the kernel (). The full technical assumptions and a discussion of them are contained in Appendix A. Given these assumptions, we have the following result on the bias of :For a general g, the bias of the plug-in estimatorTo apply our generalized ensemble theory to the KDE plug-in estimator (), we required only an upper bound on its variance. The following variance result required much less strict assumptions than the bias results in Theorem 1:Assume that the functional g in (From Theorems 1 and 2, we observe that and are required for to be unbiased, while the variance of the plug-in estimator depends primarily on the sample size (N). Note that the constants depend on the densities and and their derivatives which are often unknown.
2.3. Optimal MSE Rate
From Theorem 1, the dominating terms in the bias are observed to be and . If no bias correction is performed, the optimal choice of h that minimizes MSE isThis results in a dominant bias term of order . Note that this differs from the standard result for the optimal KDE bandwidth for minimum MSE density estimation which is for a symmetric uniform kernel when the boundary bias is ignored [68].Figure 1 shows a heatmap showing the leading bias term as a function of d and N when . The heatmap indicates that the bias of the plug-in estimator in (2) is small only for relatively small values of d. This is consistent with the empirical results in reference [69] which examined the MSE of multiple plug-in KDE and k-nn estimators. In the next section, we propose an ensemble estimator that achieves a superior convergence rate regardless of the dimensions (d) as long as the density is sufficiently smooth.
Figure 1
Heat map showing the predicted bias of the divergence functional plug-in estimator based on Theorem 1 as a function of the dimensions (d) and sample size (N) when . Note that the phase transition in the bias as the dimensions (d) increase for a fixed sample size (N); the bias remains small only for relatively small values of . The proposed weighted ensemble estimator EnDive eliminates this phase transition when the densities and the function g are sufficiently smooth.
2.4. Proof Sketches of Theorems 1 and 2
To prove the bias expressions in Theorem 1, the bias is first decomposed into two parts by adding and subtracting within the expectation creating a “bias” term and a “variance” term. Applying a Taylor series expansion on the bias and variance terms results in expressions that depend on powers of and , respectively. Within the interior of the support, moment bounds can be derived from properties of the KDEs and a Taylor series expansion of the densities. Near the boundary of the support, the smoothness assumption on the boundary is required to obtain an expression of the bias in terms of the KDE bandwidth (h) and the sample size (N). The full proof of Theorem 1 is given in Appendix E.The proof of the variance result takes a different approach. The proof uses the Efron–Stein inequality [70] which bounds the variance by analyzing the expected squared difference between the plug-in estimator when one sample is allowed to differ. This approach provides a bound on the variance under much less strict assumptions on the densities and the functional g than is required for Theorem 1. The full proof of Theorem 2 is given in Appendix F.
3. Weighted Ensemble Estimation
From Theorem 1 and Figure 1, we can observe that the bias of the MSE-optimal plug-in estimator decreases very slowly as a function of the sample size (N) when the data dimensions (d) are not small, resulting in a large MSE. However, by applying the theory of optimally weighted ensemble estimation, we can obtain an estimator with improved performance by taking a weighted sum of an ensemble of weak estimators where the weights are chosen to significantly reduce the bias.The ensemble of weak estimators is formed by choosing different values of the bandwidth parameter h as follows. Set to be real positive numbers that index . Thus, the parameter l indexes over different neighborhood sizes for the KDEs. Define the weight and That is, for each estimator there is a corresponding weight value (). The key to reducing the MSE is to choose the weight vector (w) to reduce the lower order terms in the bias while minimizing the impact of the weighted average on the variance.
3.1. Finding the Optimal Weight
The theory of optimally weighted ensemble estimation is a general theory that is applicable to any estimation problem as long as the bias and variance of the estimator can be expressed in a specific way. An early version of this theory was presented in reference [65]. We now generalize this theory so that it can be applied to a wider variety of estimation problems. Let N be the number of available samples and let be a set of index values. Given an indexed ensemble of estimators of some parameter (E), the weighted ensemble estimator with weights satisfying is defined as
is asymptotically unbiased as long as the estimators are asymptotically unbiased. Consider the following conditions on :The bias is expressible as
where are constants that depend on the underlying density and are independent of N and l, is a finite index set with , and are basis functions depending only on parameter l and not on the sample size (N).The variance is expressible asAssume conditionsThe weight vector (From condition , we can write the bias of the weighted estimator asThe variance of the weighted estimator is bounded asThe optimization problem in (4) zeroes out the lower-order bias terms and limits the norm of the weight vector (w) to prevent the variance from exploding. This results in an MSE rate of when the dimensions (d) are fixed and when L is fixed independently of the sample size (N). Furthermore, a solution to (4) is guaranteed to exist if and the vectors are linearly independent. This completes our sketch of the proof of Theorem 3. ☐
3.2. The EnDive Estimator
The parametric rate of in MSE convergence can be achieved without requiring . This can be accomplished by solving the following convex optimization problem in place of the optimization problem in Theorem 3:
where the parameter is chosen to achieve a trade-off between bias and variance. Instead of forcing , the relaxed optimization problem uses the weights to decrease the bias terms at a rate of , yielding an MSE convergence rate of In fact, it was shown in reference [71] that the optimization problem in (6) guarantees the parametric MSE rate as long as the conditions of Theorem 3 are satisfied and a solution to the optimization problem in (4) exists (the conditions for this existence are given in the proof of Theorem 3).We now construct a divergence ensemble estimator from an ensemble of plug-in KDE divergence estimators. Consider first the bias result in (3) where g is general, and assume that . In this case, the bias contains a term. To guarantee the parametric MSE rate, any remaining lower-order bias terms in the ensemble estimator must be no slower than . Let where . Then . We therefore obtain an ensemble of plug-in estimators and a weighted ensemble estimator . The bias of each estimator in the ensemble satisfies the condition with and for . To obtain a uniform bound on the bias with respect to w and , we also include the function with corresponding . The variance also satisfies the condition . The optimal weight () is found by using (6) to obtain an optimally weighted plug-in divergence functional estimator with an MSE convergence rate of as long as and . Otherwise, if , we can only guarantee the MSE rate up to . We refer to this estimator as the Ensemble Divergence (EnDive) estimator and denote it as .We note that for some functionals (g) (including the KL divergence and the Renyi- divergence integral), we can modify the EnDive estimator to obtain the parametric rate under the less strict assumption that . For details on this approach, see Appendix B.
3.3. Central Limit Theorem
The following theorem shows that an appropriately normalized ensemble estimator converges in distribution to a normal random variable under rather general conditions. Thus, the same result applies to the EnDive estimator . This enables us to perform hypothesis testing on the divergence functional which is very useful in many scientific applications. The proof is based on the Efron–Stein inequality and an application of Slutsky’s Theorem (Appendix G).Assume that the functional g is Lipschitz in both arguments with the Lipschitz constant
where
3.4. Uniform Convergence Rates
Here, we show that the optimally weighted ensemble estimators achieve the parametric MSE convergence rate uniformly. Denote the subset of with densities bounded between and as .Let
where p and q are d-dimensional probability densities. Additionally, let
where C is a constant.The proof decomposes the MSE into the variance plus the square of the bias. The variance is bounded easily by using Theorem 2. To bound the bias, we show that the constants in the bias terms are continuous with respect to the densities p and q under an appropriate norm. We then show that is compact with respect to this norm and then apply an extreme value theorem. Details are given in Appendix H.
4. Experimental Results
In this section, we discuss the choice of tuning parameters and validate the EnDive estimator’s convergence rates and the central limit theorem. We then use the EnDive estimator to estimate bounds on the Bayes error for a single-cell bone marrow data classification problem.
4.1. Tuning Parameter Selection
The optimization problem in (6) has parameters , L, and . By applying (6), and the resulting MSE of the ensemble estimator is
where each term in the sum comes from the bias and variance, respectively. From this expression and (6), we see that the parameter provides a tradeoff between bias and variance. Increasing enables the norm of the weight vector to be larger. This means the feasible region for the variable w increases in size as increases which can result in decreased bias. However, as contributes to the variance term, increasing may result in increased variance.If all of the constants in (3) and an exact expression for the variance of the ensemble estimator were known, then could be chosen to optimize this tradeoff in bias and variance and thus minimize the MSE. Since these constants are unknown, we can only choose based on the asymptotic results. From (8), this would suggest setting . In practice, we find that for finite sample sizes, the variance in the ensemble estimator is less than the upper bound of . Thus, setting is unnecessarily restrictive. We find that, in practice, setting works well.Upon first glance, it appears that for fixed L, the set that parameterizes the kernel widths can, in theory, be chosen by minimizing in (6) over in addition to w. However, adding this constraint results in a non-convex optimization problem since w does not lie in the non-negative orthant. A parameter search over possible values for is another possibility. However, this may not be practical as generally decreases as the size and spread of increases. In addition, for finite sample sizes, decreasing does not always directly correspond to a decrease in MSE, as very high or very low values of can lead to inaccurate density estimates, resulting in a larger MSE.Given these limitations, we provide the following recommendations for . Denote the value of the minimum value of l such that as and the diameter of the support as D. To ensure the KDEs are bounded away from zero, we require that . As demonstrated in Figure 2, the weights in are generally largest for the smallest values of . This indicates that should also be sufficiently larger than to render an adequate density estimate. Similarly, should be sufficiently smaller than the diamter (D) as high bandwidth values can lead to high bias in the KDEs. Once these values are chosen, all other values can then be chosen to be equally spaced between and .
Figure 2
The optimal weights from (6) when , , , and l are uniformly spaced between 1.5 and 3. The lowest values of l are given the highest weight. Thus, the minimum value of bandwidth parameters should be sufficiently large to render an adequate estimate.
An efficient way to choose and is to select the integers and and compute the and nearest neighbor distances of all the data points. The bandwidths and can then be chosen to be the maximums of these corresponding distances. The parameters and can then be computed from the expression . This choice ensures that a minimum of points are within the kernel bandwidth for the density estimates at all points and that a maximum of points are within the kernel bandwidth for the density estimates at one of the points.Once and have been chosen, the similarity of bandwidth values and basis functions increases as L increases, resulting in a negligible decrease in the bias. Hence, L should be chosen to be large enough for sufficient bias but small enough so that the bandwidth values are sufficiently distinct. In our experiments, we found to be sufficient.
To validate our theoretical convergence rate results, we estimated the Rényi- divergence integral between two truncated multivariate Gaussian distributions with varying dimension and sample sizes. The densities had means of , and covariance matrices of , where is a d-dimensional vector of ones, and is a identity matrix. We restricted the Gaussians to the unit cube and used .The left plots in Figure 3 show the MSE (200 trials) of the standard plug-in estimator implemented with a uniform kernel and the proposed optimally weighted estimator EnDive for various dimensions and sample sizes. The parameter set was selected based on a range of k-nearest neighbor distances. The bandwidth used for the standard plug-in estimator was selected by setting , where was chosen from to minimize the MSE of the plug-in estimator. For all dimensions and sample sizes, EnDive outperformed the plug-in estimator in terms of MSE. EnDive was also less biased than the plug-in estimator and even had lower variance at smaller sample sizes (e.g., ). This reflects the strength of ensemble estimators—the weighted sum of a set of relatively poor estimators can result in a very good estimator. Note also that for the larger values of N, the ensemble estimator MSE rates approached the theoretical rate based on the estimated log–log slope given in Table 1.
Figure 3
(Left) Log–log plot of MSE of the uniform kernel plug-in (“Kernel”) and the optimally weighted EnDive estimator for various dimensions and sample sizes. (Right) Plot of the true values being estimated compared to the average values of the same estimators with standard error bars. The proposed weighted ensemble estimator approaches the theoretical rate (see Table 1), performed better than the plug-in estimator in terms of MSE and was less biased.
Table 1
Negative log–log slope of the EnDive mean squared error (MSE) as a function of the sample size for various dimensions. The slope was calculated beginning at . The negative slope was closer to 1 with than for indicating that the asymptotic rate had not yet taken effect at .
Estimator
d=5
d=10
d=15
Nstart=102
0.85
0.84
0.80
Nstart=102.375
0.96
0.96
0.95
To illustrate the difference between the problems of density estimation and divergence functional estimation, we estimated the average pointwise squared error between the KDE and in the previous experiment. We used exactly the same bandwidth and kernel as the standard plug-in estimators in Figure 3 and calculated the pointwise error at 10,000 points sampled from . The results are shown in Figure 4. From these results, we see that the KDEs performed worse as the dimension of the densities increased. Additionally, we observe by comparing Figure 3 and Figure 4, the average pointwise squared error decreased at a much slower rate as a function of the sample size (N) than the MSE of the plug-in divergence estimators, especially for larger dimensions (d).
Figure 4
Log–log plot of the average pointwise squared error between the KDE and for various dimensions and sample sizes using the same bandwidth and kernel as the standard plug-in estimators in Figure 3. The KDE and the density were compared at 10,000 points sampled from .
Our experiments indicated that the proposed ensemble estimator is not sensitive to the tuning parameters. See reference [72] for more details.
4.3. Central Limit Theorem Validation: KL Divergence
To verify the central limit theorem of the EnDive estimator, we estimated the KL divergence between two truncated Gaussian densities, again restricted to the unit cube. We conducted two experiments where (1) the densities were different with means of , and covariances of matrices , , ; and where (2) the densities were the same with means of and covariance matrices of . For both experiments, we chose and four different sample sizes (N). We found that the correspondence between the quantiles of the standard normal distribution and the quantiles of the centered and scaled EnDive estimator was very high under all settings (see Table 2 and Figure 5) which validates Theorem 4.
Table 2
Comparison between quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the KL divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. The parameter gives the correlation coefficient between the quantiles, while is the estimated slope between the quantiles. The correspondence between quantiles was very high for all cases.
N
Same
Different
1−ρ
β
1−ρ
β
100
2.35×10−4
1.014
9.97×10−4
0.993
500
9.48×10−5
1.007
5.06×10−4
0.999
1000
8.27×10−5
0.996
4.30×10−4
0.988
5000
8.59×10−5
0.995
4.47×10−4
1.005
Figure 5
QQ-plots comparing the quantiles of a standard normal random variable and the quantiles of the centered and scaled EnDive estimator applied to the Kullback–Leibler (KL) divergence when the distributions were the same and different. Quantiles were computed from 10,000 trials. These plots correspond to the same experiments as in Table 2 when N = 100 and N = 1000. The correspondence between quantiles is high for all cases.
4.4. Bayes Error Rate Estimation on Single-Cell Data
Using the EnDive estimator, we estimated bounds on the Bayes error rate (BER) of a classification problem involving MARS-seq single-cell RNA-sequencing (scRNA-seq) data measured from developing mouse bone marrow cells enriched for the myeloid and erythroid lineages [73]. However, we first demonstrated the ability of EnDive to estimate the bounds on the BER of a simulated problem. In this simulation, the data were drawn from two classes where each class distribution was a dimensional Gaussian distribution with different means and the identity covariance matrix. We considered two cases, namely, the distance between the means was 1 or 3. The BER was calculated in both cases. We then estimated upper and lower bounds on the BER by estimating the Henze–Penrose (HP) divergence [4,6]. Figure 6 shows the average estimated upper and lower bounds on the BER with standard error bars for both cases. For all tested sample sizes, the BER was within one standard deviation of the estimated lower bound. The lower bound was also closer, on average, to the BER for most of the tested sample sizes (lower sample sizes with smaller distances between means were the exceptions). Generally, these resuls indicate that the true BER is relatively close to the estimated lower bound, on average.
Figure 6
Estimated upper (UB) and lower bounds (LB) on the Bayes error rate (BER) based on estimating the HP divergence between two 10-dimensional Gaussian distributions with identity covariance matrices and distances between means of 1 (left) and 3 (right), respectively. Estimates were calculated using EnDive, with error bars indicating the standard deviation from 400 trials. The upper bound was closer, on average, to the true BER when N was small (≈100–300) and the distance between the means was small. The lower bound was closer, on average, in all other cases.
We then estimated similar bounds on the scRNA-seq classification problem using EnDive. We considered the three most common cell types within the data: erythrocytes (eryth.), monocytes (mono.), and basophils (baso.) ( respectively). We estimated the upper and lower bounds on the pairwise BER between these classes using different combinations of genes selected from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the hematopoietic cell lineage [74,75,76]. Each collection of genes contained 11–14 genes. The upper and lower bounds on the BER were estimated using the Henze–Penrose divergence [4,6]. The standard deviations of the bounds for the KEGG-based genes were estimated via 1000 bootstrap iterations. The KEGG-based bounds were compared to BER bounds obtained from 1000 random selections of 12 genes. In all cases, we compared the bounds to the performance of a quadratic discriminant analysis classifier (QDA) with 10-fold cross validation. Note that to correct for undersampling in scRNA-seq data, we first imputed the undersampled data using MAGIC [77].All results are given in Table 3. From these results, we note that erythrocytes are relatively easy to distinguish from the other two cell types as the BER lower bounds were within nearly two standard deviations of zero when using genes associated with platelet, erythrocyte, and neutrophil development as well as a random selection of 12 genes. This is corroborated by the QDA cross-validated results which were all within two standard deviations of either the upper or lower bound for these gene sets. In contrast, the macrophage-associated genes seem to be less useful for distinguishing erythrocytes than the other gene sets.
Table 3
Misclassification rate of a quadratic discriminant analysis classifier (QDA) classifier and estimated upper bounds (UB) and lower bounds (LB) of the pairwise BER between mouse bone marrow cell types using the Henze–Penrose divergence applied to different combinations of genes selected from the KEGG pathways associated with the hematopoietic cell lineage. Results are presented as percentages in the form of mean ± standard deviation. Based on these results, erythrocytes are relatively easy to distinguish from the other two cell types using these gene sets.
Platelets
Erythrocytes
Neutrophils
Macrophages
Random
Eryth. vs. Mono., LB
2.8±1.5
1.2±0.6
0.6±0.6
8.5±1.2
14.4±8.4
Eryth. vs. Mono., UB
5.3±2.9
2.4±1.3
1.2±1.3
15.5±1.9
23.2±12.3
Eryth. vs. Mono., Prob. Error
0.9
0.4
1.3
3.4
7.2±5.4
Eryth. vs. Baso., LB
0.5±0.6
0.05±0.12
0.6±0.5
5.1±0.9
11.9±5.5
Eryth. vs. Baso., UB
1.0±1.1
0.1±0.2
1.1±0.9
9.6±1.6
20.3±8.8
Eryth. vs. Baso., Prob. Error
1.2
0.3
1.9
3.6
6.8±5.0
Baso. vs. Mono., LB
31.1±1.8
27.8±3.1
27.1±2.6
31.6±1.3
32.1±2.6
Baso. vs. Mono., UB
42.8±1.4
39.9±2.8
39.4±2.4
43.2±1.0
43.5±1.2
Baso. vs. Mono., Prob. Error
28.8
30.9
23.9
22.4
29.7±5.7
We also found that basophils are difficult to distinguish from monocytes using these gene sets. Assuming the relative abundance of each cell type is representative of the population, a trivial upper bound on the BER is which is between all of the estimated lower and upper bounds. The QDA results were also relatively high (and may be overfitting the data in some cases based on the estimated BER bounds), suggesting that different genes should be explored for this classification problem.
5. Conclusions
We derived the MSE convergence rates for a kernel density plug-in estimator for a large class of divergence functionals. We generalized the theory of optimally weighted ensemble estimation and derived an ensemble divergence estimator EnDive that achieves the parametric rate when the densities are more than d times differentiable. The estimator we derived can be applied to general bounded density support sets and can be implemented without knowledge of the support, which is a distinct advantage over other competing estimators. We also derived the asymptotic distribution of the estimator, provided some guidelines for tuning parameter selection, and experimentally validated the theoretical convergence rates for the case of empirical estimation of the Rényi- divergence integral. We then performed experiments to examine the estimator’s robustness to the choice of tuning parameters, validated the central limit theorem for KL divergence estimation, and estimated bounds on the Bayes error rate for a single cell classification problem.We note that based on the proof techniques employed in our work, our weighted ensemble estimators are easily extended beyond divergence estimation to more general distributional functionals which may be integral functionals of any number of probability distributions. We also show in Appendix B that EnDive can be easily modified to obtain an estimator that achieves the parametric rate when the densities are more than times differentiable and the functional g has a specific form that includes the Rényi and KL divergences. Future work includes extending this modification to functionals with more general forms. An important divergence of interest in this context is the Henze–Penrose divergence that we used to bound the Bayes error. Further future work will focus on extending this work on divergence estimation to k-nn based estimators where knowledge of the support is, again, not required. This will improve the computational burden, ask-nn estimators require fewer computations than standard KDEs.
Authors: David van Dijk; Roshan Sharma; Juozas Nainys; Kristina Yim; Pooja Kathail; Ambrose J Carr; Cassandra Burdziak; Kevin R Moon; Christine L Chaffer; Diwakar Pattabiraman; Brian Bierie; Linas Mazutis; Guy Wolf; Smita Krishnaswamy; Dana Pe'er Journal: Cell Date: 2018-06-28 Impact factor: 41.582