Salimeh Yasaei Sekeh and Alfred O. Hero
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA.
Abstract
This paper proposes a geometric estimator of dependency between a pair of multivariate random variables. The proposed estimator of dependency is based on a randomly permuted geometric graph (the minimal spanning tree) over the two multivariate samples. This estimator converges to a quantity that we call the geometric mutual information (GMI), which is equivalent to the Henze–Penrose divergence between the joint distribution of the multivariate samples and the product of the marginals. The GMI has many of the same properties as standard MI but can be estimated from empirical data without density estimation, making it scalable to large datasets. The proposed empirical estimator of GMI is simple to implement, involving the construction of a minimal spanning tree (MST) spanning both the original data and a randomly permuted version of these data. We establish asymptotic convergence of the estimator and convergence rates of the bias and variance for smooth multivariate density functions belonging to a Hölder class. We demonstrate the advantages of our proposed geometric dependency estimator in a series of experiments.
Keywords:
Friedman–Rafsky test statistic; Henze–Penrose mutual information; bias and variance tradeoff; convergence rates; geometric mutual information; minimal spanning trees; optimization
1. Introduction
Estimation of multivariate dependency has many applications in fields such as information theory, clustering, structure learning, data processing, feature selection, time series prediction, and reinforcement learning; see [1,2,3,4,5,6,7,8,9,10], respectively. It is difficult to accurately estimate mutual information in high-dimensional settings, especially when the data are multivariate with an absolutely continuous density with respect to Lebesgue measure, which is the setting considered in this paper. An important and widely used measure of dependency is the Shannon mutual information (MI), which has seen extensive use across many application domains. However, the estimation of mutual information can often be challenging. In this paper, we focus on a measure of MI that we call the geometric MI (GMI). This MI measure is defined as the asymptotic large-sample limit of a randomized minimal spanning tree (MST) statistic spanning the multivariate sample realizations. The GMI is related to a divergence measure called the Henze–Penrose divergence [11,12] and to the multivariate runs test [13]. In [14,15], it was shown that this divergence measure can be used to specify a tighter bound on the Bayes error rate for testing whether a random sample comes from one of two distributions; the bound in [14,15] is tighter than previous divergence-type bounds such as the Bhattacharyya bound [16]. Furthermore, the authors of [17] proposed a non-parametric bound on the multi-class classification Bayes error rate using a global MST graph.
Let $\mathbf{X}$ and $\mathbf{Y}$ be random variables with unknown joint density $f_{XY}$ and marginal densities $f_X$ and $f_Y$, respectively, and consider two hypotheses: $H_0$: $\mathbf{X}$ and $\mathbf{Y}$ are independent, and $H_1$: $\mathbf{X}$ and $\mathbf{Y}$ are dependent. The GMI is defined as the Henze–Penrose divergence between $f_{XY}$ and $f_X f_Y$, which can be used as a dependency measure. In this paper, we prove that for large sample size n the randomized MST statistic spanning the original multivariate sample realizations and a randomly shuffled data set converges almost surely to the GMI measure. A direct implication of [14,15] is that the GMI provides a tighter bound on the Bayes misclassification rate for the optimal test of independence. In this paper, we propose an estimator based on a random permutation modification of the Friedman–Rafsky multivariate test statistic and show that under certain conditions the GMI estimator achieves the parametric mean square error (MSE) rate when the joint density is bounded and smooth. Importantly, unlike other measures of MI, our proposed GMI estimator does not require explicit estimation of the joint and marginal densities.
Computational complexity is an important challenge in machine learning and data science. Most plug-in-based estimators, such as the kernel density estimator (KDE) or the K-nearest-neighbor (KNN) estimator with known convergence rate, require runtime complexity of $O(n^2)$, which is not suitable for large-scale applications. Noshad et al. proposed a graph-theoretic direct estimation method based on nearest-neighbor ratios (NNR) [18]. The NNR estimator is based on the k-NN graph and is computationally more tractable than other competing estimators. The construction of the minimal spanning tree lies at the heart of the GMI estimator proposed in this paper. Since the GMI estimator is based on the Euclidean MST, the dual-tree algorithm of March et al. [19] can be applied. This algorithm is based on the construction of Borůvka [20] and computes the Euclidean MST in approximately $O(n \log n)$ time. In this paper, we experimentally show that for large sample sizes the proposed GMI estimator has a faster runtime than the KDE plug-in method.
1.1. Related Work
Estimation of mutual information has a rich history. The most common estimators of MI are based on plug-in density estimation, e.g., using histogram, kernel density, or kNN density estimators [21,22]. Motivated by ensemble methods applied to divergence estimation [23,24], an ensemble method for combining multiple KDE bandwidths was proposed in [22] for estimating MI. Under certain smoothness conditions this ensemble MI estimator was shown to achieve parametric convergence rates.
Another class of estimators of multivariate dependency bypasses the difficult density estimation task. This class includes the statistically consistent estimators of Rényi and KL mutual information that are motivated by the asymptotic limit of the length of the KNN graph [25,26] when the joint density is smooth. The estimator of [27] builds on KNN methods for Rényi entropy estimation. The authors of [26] showed that when the MI is large the KNN and KDE approaches are ill-suited for estimating MI, since the joint density may be insufficiently smooth when there are strong dependencies. To overcome this issue an assumption on the smoothness of the density is required; see [28,29] and [23,24]. For all these methods, the optimal parametric rate of MSE convergence is achieved when the densities are sufficiently many times differentiable, on the order of the dimension d [30]. In this paper, we assume that the joint and marginal densities are smooth in the sense that they belong to Hölder continuous classes of densities $\Sigma_d(\eta, K)$, where $\eta$ is the smoothness parameter and $K$ is the Lipschitz constant.
An MI measure based on the Pearson chi-square divergence was considered in [31] that is computationally efficient and numerically stable. The authors of [27,32] used nearest-neighbor graph and minimal spanning tree approaches, respectively, to estimate Rényi mutual information. In [22], a non-parametric mutual information estimator was proposed using a weighted ensemble method with a parametric convergence rate. This estimator was based on plug-in density estimation, which is challenging in high dimension.
Our proposed dependency estimator differs from previous methods in the following ways. First, it estimates a different measure of mutual information, the GMI. Second, instead of using the KNN graph, the estimator of GMI uses a randomized minimal spanning tree that spans the multivariate realizations. The proposed GMI estimator is motivated by the multivariate runs test of Friedman and Rafsky (FR) [33], which is a multivariate generalization of the univariate Smirnov maximum deviation test [34] and the Wald–Wolfowitz runs test [35] in one dimension. We also emphasize that the proposed GMI estimator does not require boundary correction, in contrast to other graph-based estimators such as the NNR estimator [18], the scalable MI estimator [36], or the cross-match statistic [37].
1.2. Contribution
The contribution of this paper has three components. First, we propose a novel non-parametric multivariate dependency measure, referred to as the geometric mutual information (GMI), which is based on graph-based divergence estimation. The geometric mutual information is constructed using a minimal spanning tree and is a function of the Friedman–Rafsky multivariate test statistic. Second, we establish properties of the proposed dependency measure analogous to those of Shannon mutual information, such as convexity, concavity, a chain rule, and a type of data-processing inequality. Third, we derive a bound on the MSE rate for the proposed geometric estimator. An advantage of the estimator is that it achieves the optimal MSE rate without the need for boundary correction, which is required for most plug-in estimators.
1.3. Organization
The rest of the paper is organized as follows. In Section 2, we define the geometric mutual information and establish some of its mathematical properties. In Section 2.2 and Section 2.3, we introduce a statistically consistent GMI estimator and derive a bound on its mean square error convergence rate. In Section 3, we verify the theory through experiments.
Throughout the paper, we denote statistical expectation by $\mathbb{E}$ and variance by the abbreviation $\operatorname{Var}$. Boldface type indicates random vectors. All densities are assumed to be absolutely continuous with respect to non-atomic Lebesgue measure.
2. The Geometric Mutual Information (GMI)
In this section, we first review the definition of the Henze–Penrose (HP) divergence measure studied by Berisha and Hero in [13,14]. For a parameter $p \in (0,1)$, the Henze–Penrose divergence between densities $f$ and $g$ is defined as follows (see [13,14,15]):

$$D_p(f,g) = \frac{1}{4pq}\left[\int \frac{\big(p\,f(\mathbf{x}) - q\,g(\mathbf{x})\big)^2}{p\,f(\mathbf{x}) + q\,g(\mathbf{x})}\,d\mathbf{x} - (p-q)^2\right], \qquad (1)$$

where $q = 1 - p$. This functional is an f-divergence [38], or equivalently an Ali–Silvey distance [39]; i.e., it satisfies the properties of non-negativity, monotonicity, and joint convexity [15]. The measure (1) takes values in $[0,1]$, and $D_p(f,g) = 0$ if and only if $f = g$ almost surely.
The mutual information measure is defined as follows. Let $f_X$, $f_Y$, and $f_{XY}$ be the marginal and joint densities, respectively, of the random vectors $\mathbf{X} \in \mathbb{R}^{d_x}$ and $\mathbf{Y} \in \mathbb{R}^{d_y}$, where $d_x$ and $d_y$ are positive integers. Then, using (1), a Henze–Penrose generalization of the mutual information between $\mathbf{X}$ and $\mathbf{Y}$ is defined by

$$I_p(\mathbf{X};\mathbf{Y}) = D_p\big(f_{XY},\, f_X f_Y\big). \qquad (2)$$

We will show below that $I_p$ has a geometric interpretation in terms of the large-sample limit of a minimal spanning tree spanning the n sample realizations of the merged labeled samples. Thus, we call $I_p$ the GMI between $\mathbf{X}$ and $\mathbf{Y}$. The GMI satisfies properties similar to other definitions of mutual information, such as the Shannon and Rényi mutual information. Recalling (3) in [14], an alternative form of $I_p$ is given by

$$I_p(\mathbf{X};\mathbf{Y}) = 1 - \frac{A_p(\mathbf{X};\mathbf{Y})}{pq}, \qquad (3)$$

where

$$A_p(\mathbf{X};\mathbf{Y}) = pq \int \frac{f_{XY}(\mathbf{x},\mathbf{y})\, f_X(\mathbf{x})\, f_Y(\mathbf{y})}{p\, f_{XY}(\mathbf{x},\mathbf{y}) + q\, f_X(\mathbf{x})\, f_Y(\mathbf{y})}\, d\mathbf{x}\, d\mathbf{y}. \qquad (4)$$

The function $A_p$ was defined in [13] and is called the geometric affinity between $\mathbf{X}$ and $\mathbf{Y}$. The next subsection of the paper is dedicated to the basic inequalities and properties of the proposed GMI measure (2).
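Before proceeding, a quick sanity check consistent with (1)–(4): the GMI vanishes under independence. If $f_{XY} = f_X f_Y$, then

$$A_p(\mathbf{X};\mathbf{Y}) = pq \int \frac{\big(f_X(\mathbf{x}) f_Y(\mathbf{y})\big)^2}{(p+q)\, f_X(\mathbf{x}) f_Y(\mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} = pq \int f_X(\mathbf{x}) f_Y(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y} = pq,$$

so that $I_p(\mathbf{X};\mathbf{Y}) = 1 - pq/(pq) = 0$, mirroring the behavior of the Shannon MI under independence.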
2.1. Properties of the Geometric Mutual Information
In this subsection we establish basic inequalities and properties of the GMI $I_p(\mathbf{X};\mathbf{Y})$ given in (2). The following theorem (Theorem 1, proved in Appendix A.1) shows that $I_p(\mathbf{X};\mathbf{Y})$ is a concave function of the marginal density $f_X$ and a convex function of the conditional density $f_{Y|X}$; in each case the inequality is strict outside of degenerate cases.
The GMI $I_p$ satisfies properties analogous to the standard chain rule and the data-processing inequality [40]. For random variables $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ with conditional density $f_{XY|Z}$, we define the conditional GMI $I_p(\mathbf{X};\mathbf{Y} \mid \mathbf{Z})$ by applying (2) conditionally on $\mathbf{Z}$. The next theorem (Theorem 2, proved in Appendix A.2) establishes a chain-rule-type relation, inequality (8), between the joint and conditional GMI for a given d-dimensional random vector; in the degenerate case the inequality (8) holds trivially.
Theorem 2 is next applied to the case where $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ form a Markov chain. The proof of the following “leany” data-processing inequality (Proposition 1) is provided in Appendix A.3: under the Markov condition, inequality (10) relates $I_p(\mathbf{X};\mathbf{Z})$ to $I_p(\mathbf{X};\mathbf{Y})$, and when its two side conditions both hold, (10) becomes interpretable as the standard data-processing inequality $I_p(\mathbf{X};\mathbf{Z}) \le I_p(\mathbf{X};\mathbf{Y})$.
2.2. The Friedman–Rafsky Estimator
Let a random sample $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$ from $f_{XY}$ be available. Here we show that the GMI can be directly estimated without estimating the densities. The estimator is inspired by the MST construction of [33], which provides a consistent estimate of the Henze–Penrose divergence [14,15]. We denote by $\mathbf{z}_i = (\mathbf{x}_i, \mathbf{y}_i)$ the i-th joint sample and by $\mathcal{Z}_n = \{\mathbf{z}_1, \dots, \mathbf{z}_n\}$ the sample set. Divide the sample set into two subsets $\mathcal{Z}_{n'}$ and $\mathcal{Z}_{n''}$ with proportions $\alpha = n'/n$ and $1-\alpha = n''/n$, where $n = n' + n''$.
Denote by $\widetilde{\mathcal{Z}}_{n''}$ the set obtained from $\mathcal{Z}_{n''}$ by randomly permuting its second coordinates: for each given first element $\mathbf{x}_j$, the second element $\mathbf{y}_j$ is replaced by a randomly selected $\widetilde{\mathbf{y}}_j$ from the $\mathbf{y}$-samples of $\mathcal{Z}_{n''}$. This results in a random shuffling of the binary relation relating $\mathbf{x}$ and $\mathbf{y}$ in $\mathcal{Z}_{n''}$. The estimator of $I_p$ is derived from the Friedman–Rafsky (FR) multivariate runs test statistic [33] on the concatenated data set $\mathcal{Z}_{n'} \cup \widetilde{\mathcal{Z}}_{n''}$. The FR test statistic is defined as the number of edges in the MST spanning the merged data set that connect a point in $\mathcal{Z}_{n'}$ to a point in $\widetilde{\mathcal{Z}}_{n''}$; it is denoted by $\mathfrak{R}_{n',n''}$. Please note that since the MST is unique with probability one (under the assumption that all density functions are Lebesgue continuous), all inter-point distances between nodes are distinct. This estimator converges to $I_\alpha$ almost surely as $n \to \infty$. The procedure is summarized in Algorithm 1.
Theorem 3 shows that the output of Algorithm 1 estimates the GMI with parameter $p = \alpha$. The proof is provided in Appendix A.4. For a given proportionality parameter $\alpha \in (0,1)$,

$$1 - \mathfrak{R}_{n',n''}\,\frac{n}{2\,n'\,n''} \;\longrightarrow\; I_\alpha(\mathbf{X};\mathbf{Y}) \quad \text{almost surely, as } n \to \infty. \qquad (11)$$

Please note that the asymptotic limit in (11) depends on the proportionality parameter $\alpha$. Later, in Section 2.4, we discuss the choice of an optimal parameter $\alpha$. In Figure 1, we illustrate the MST constructed over merged independent and highly dependent data sets drawn from two-dimensional normal distributions with different correlation coefficients. Notice that the edges of the MST connecting samples with different colors, corresponding to independent and dependent samples, respectively, are indicated in green. The total number of green edges is the FR test statistic $\mathfrak{R}_{n',n''}$.
Figure 1
The MST spanning the merged set of normal points when $\mathbf{X}$ and $\mathbf{Y}$ are independent (blue points) and when $\mathbf{X}$ and $\mathbf{Y}$ are highly dependent (red points). The FR test statistic $\mathfrak{R}_{n',n''}$ is the number of edges in the MST that connect nodes of different colors (shown in green) and is used to estimate the GMI $I_\alpha$.
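The following Python sketch illustrates the procedure of Algorithm 1 under the assumptions above. The function name gmi_estimate and the clipping to [0, 1] are our own choices, and the output follows the normalization of the limit (11); the dense O(n²) MST construction is used only for clarity, since the dual-tree algorithm of March et al. [19] scales better.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def gmi_estimate(X, Y, alpha=0.5, rng=None):
    """Randomized-MST (Friedman-Rafsky) estimate of the GMI I_alpha.

    X: (n, dx) array, Y: (n, dy) array of paired samples.
    Returns 1 - R * n / (2 * n1 * n2), clipped to [0, 1], where R is the
    number of MST edges joining the original and the shuffled subsets.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    n1 = int(alpha * n)                     # size of the unshuffled subset
    idx = rng.permutation(n)
    a, b = idx[:n1], idx[n1:]

    # Subset 1 keeps the original (x, y) pairing; in subset 2 the
    # y-coordinates are randomly permuted, emulating draws from f_X * f_Y.
    Z1 = np.hstack([X[a], Y[a]])
    Z2 = np.hstack([X[b], Y[b][rng.permutation(len(b))]])
    Z = np.vstack([Z1, Z2])
    label = np.r_[np.zeros(len(Z1)), np.ones(len(Z2))]

    # Euclidean MST over the merged set (ties between inter-point distances
    # have probability zero for continuous densities, so the MST is unique
    # almost surely).
    mst = minimum_spanning_tree(cdist(Z, Z)).tocoo()

    # FR statistic: count MST edges whose endpoints carry different labels.
    R = int(np.sum(label[mst.row] != label[mst.col]))
    n2 = n - n1
    return float(np.clip(1.0 - R * n / (2.0 * n1 * n2), 0.0, 1.0))
```

By Theorem 3, for independent samples this estimate concentrates near zero as n grows, while for dependent samples it converges to the population GMI.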
2.3. Convergence Rates
In this subsection we characterize the MSE convergence rate of the GMI estimator of Section 2.2 in the form of upper bounds on the bias and the variance. The MSE bound is given in terms of the sample size n, the dimension d, and the proportionality parameter $\alpha$. Deriving convergence rates for mutual information estimators has been of interest in information theory and machine learning [22,27]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition [41]. Here we assume that the densities $f_{XY}$, $f_X$, and $f_Y$ have bounded support sets and are smooth in the sense that they belong to the Hölder continuous class of densities $\Sigma_d(\eta, K)$ [42,43]:
(Hölder class): A density g on a compact set $\mathcal{S} \subset \mathbb{R}^d$ belongs to $\Sigma_d(\eta, K)$ with smoothness parameter $\eta \in (0,1]$ and Lipschitz constant $K > 0$ if $|g(\mathbf{z}) - g(\mathbf{z}')| \le K \|\mathbf{z} - \mathbf{z}'\|^{\eta}$ for all $\mathbf{z}, \mathbf{z}' \in \mathcal{S}$.
To explore the optimal choice of the parameter $\alpha$ we require bounds on the bias and variance, provided in Appendix A.5. To obtain such bounds, we make several assumptions on the absolutely continuous densities and their supports:
(A.1) Each of the densities belongs to $\Sigma_d(\eta, K)$ with smoothness parameter $\eta$ and Lipschitz constant K.
(A.2) The volumes of the support sets are finite.
(A.3) All densities are bounded, i.e., there exist constants $C_L$ and $C_U$ such that $0 < C_L \le f_{XY},\, f_X f_Y \le C_U < \infty$ on their supports.
The following theorem on the bias (Theorem 4, proved in Appendix A.5) holds under assumptions (A.1) and (A.3); it bounds the bias of the estimator in (11) by a quantity, given in (13), that decays polynomially in n at a rate governed by the smoothness parameter $\eta$ and the dimension d, with constants depending on $\alpha$. Please note that according to Theorem 13 in [44], the leading constant is lower bounded in terms of the binary entropy $h(\alpha) = -\alpha \log \alpha - (1-\alpha)\log(1-\alpha)$.
The next theorem (Theorem 5) gives an upper bound, stated in (14), on the variance of the FR estimator, of order $O(n^{-1})$. The proof of the variance result requires a different approach than the bias bound (the Efron–Stein inequality [45]). It is similar to arguments in ([46], Appendix A.3) and is omitted. In Theorem 5 we assume only that the densities $f_{XY}$, $f_X$, and $f_Y$ are absolutely continuous and bounded (A.3).
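Although the bias and variance bounds are stated separately, the MSE rate used in Section 2.4 follows from the standard decomposition

$$\mathbb{E}\Big[\big(\widehat{I}_\alpha - I_\alpha\big)^2\Big] = \Big(\mathbb{E}\big[\widehat{I}_\alpha\big] - I_\alpha\Big)^2 + \operatorname{Var}\big(\widehat{I}_\alpha\big),$$

so the squared bias bound (13) and the variance bound (14) add directly to bound the MSE; here $\widehat{I}_\alpha$ denotes the estimator on the left-hand side of (11).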
2.4. Minimax Parameter
Recall assumptions (A.1), (A.2), and (A.3) in Section 2.3. The parameter $\alpha$ can be chosen to minimize the maximum of the MSE convergence rate, where the maximum is taken over the space of Hölder-smooth joint densities. Throughout this subsection we use shorthand notation built from the density bounds $C_L$ and $C_U$, the support volumes, and the smoothness parameter $\eta$ of Section 2.3.
The minimax parameter is the solution $\alpha^*$ of a saddle-point optimization problem (16), in which the MSE bound assembled from (13) and (14) is maximized over the density class, as expressed in (17), and then minimized over $\alpha$. In the bound (18) the coefficients are constants, one of which depends only on the dimension d; in (19) the remaining coefficients are also constants. The solution is confined to a feasible interval $[\alpha_L, \alpha_U]$ determined by the density bounds, and the saddle point $\alpha^*$ takes one of three forms depending on the relative magnitudes of those bounds: the lower endpoint $\alpha_L$, an interior value, or the upper endpoint $\alpha_U$. Further details are given in Appendix A.6.
3. Simulation Study
In this section, numerical simulations are presented that illustrate the theory in Section 2. We perform multiple experiments to demonstrate the utility of the proposed GMI estimator in terms of the dimension d and the sample size n. Our proposed MST-based estimator of the GMI is compared to density plug-in estimators of the GMI, in particular the standard KDE plug-in estimator of [22], and the convergence rates of Theorems 4 and 5 are validated. We use multivariate normal simulated data in the experiments. We also discuss the choice of the proportionality parameter $\alpha$ and compare the runtime of the proposed GMI estimator with that of the KDE method.
Here we perform four sets of experiments to illustrate the proposed GMI estimator. For the first set of experiments, the MSE of the GMI estimator in Algorithm 1 is shown in Figure 2 (left). The samples were drawn from a d-dimensional normal distribution, with various sample sizes n and dimensions d. We selected the proportionality parameter $\alpha$ and computed the MSE in terms of the sample size n, shown as a log–log plot as n varies. Please note that the empirically optimal proportion depends on n, so to avoid this computational complexity we fixed $\alpha$ for this experiment. The experimental result shown in Figure 2 (left) validates the theoretical MSE rates derived from (13) and (14), i.e., decreasing sub-linearly in n and increasing exponentially in d.
Figure 2
(left) Log–log plot of the theoretical and experimental MSE of the proposed MST-based GMI estimator as a function of the sample size n for several dimensions d and fixed smoothness parameter $\eta$. (right) The GMI estimator implemented using two approaches, Algorithm 1 and the KDE method, where the KDE-GMI uses KDE density estimates in formula (2). In this experiment, samples are generated from the two-dimensional normal distribution with zero mean and covariance matrix (21) for various values of the correlation coefficient.
In Figure 2 (right), we compare the proposed MST-based GMI estimator with the KDE-GMI estimator [22]. For the KDE approach, we estimated the joint and marginal densities and then plugged them into the expression (2). The bandwidth h used for the KDE plug-in estimator was chosen to minimize the bound on the MSE of the plug-in estimator. We generated data from the two-dimensional normal distribution with zero mean and covariance matrix (21), with the correlation coefficient varied over a range of values. The true GMI was computed by Monte Carlo approximation of the integral (2). Please note that as the correlation increases, the MST-GMI outperforms the KDE-GMI approach. The proportionality parameter $\alpha$ was held fixed in this set of experiments.
Figure 3 again compares the MST-GMI estimator with the KDE-GMI estimator. Samples are drawn from the multivariate standard normal distribution for two dimensions d, with the same proportionality parameter $\alpha$ in both cases. The left plots in Figure 3 show the MSE (100 trials) of the GMI estimator implemented with a KDE estimator (with the same bandwidth choice as in Figure 2) for these dimensions and various sample sizes. For all dimensions and sample sizes the MST-GMI estimator outperforms the plug-in KDE-GMI estimator, based on the estimated log–log MSE slopes given in Figure 3 (left plots). The right plots in Figure 3 compare the MST-GMI with the KDE-GMI; the error bars denote standard deviations over 100 trials. We observe that for higher dimension and larger sample size n, the KDE-GMI approaches the true GMI at a slower rate than the MST-GMI estimator. This reflects the power of the proposed graph-based approach to estimating the GMI.
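For concreteness, the following is a minimal sketch of a KDE-GMI baseline of the kind used in these comparisons, assuming Gaussian KDEs with scipy's default bandwidth rule rather than the MSE-optimized bandwidth described above; the function name kde_gmi is ours. It uses the identity $D_p(f,g) = 1 - \mathbb{E}_f\big[g/(pf+qg)\big]$, which follows from (1), with $f = f_{XY}$ and $g = f_X f_Y$, approximating the expectation by the empirical average over the joint samples.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_gmi(X, Y, p=0.5):
    """Plug-in KDE estimate of the GMI I_p = D_p(f_XY, f_X f_Y)."""
    q = 1.0 - p
    Z = np.hstack([X, Y])
    # gaussian_kde expects shape (dim, n); default bandwidth is Scott's rule.
    f_xy = gaussian_kde(Z.T)(Z.T)                         # joint density
    g = gaussian_kde(X.T)(X.T) * gaussian_kde(Y.T)(Y.T)   # product of marginals
    # Monte Carlo average of g / (p f + q g) over samples from f = f_XY.
    return float(np.clip(1.0 - np.mean(g / (p * f_xy + q * g)), 0.0, 1.0))
```

Sweeping the correlation coefficient and comparing kde_gmi against the MST-based gmi_estimate of Section 2.2 reproduces the kind of comparison shown in Figure 2 (right).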
Figure 3
MSE log–log plots as a function of the sample size n (left) for the proposed MST-GMI estimator (“Estimated GMI”) and the standard KDE-GMI plug-in estimator of the GMI. The right column of plots corresponds to the GMI estimated for a lower dimension (top) and a higher dimension (bottom). In both cases the proportionality parameter α is the same. The MST-GMI estimator in both plots outperforms the KDE-GMI estimator over the range of sample sizes n, especially for larger dimensions.
The comparison between MSEs for various dimensions d is shown in Figure 4 (left). This experiment highlights the impact of higher dimension on the GMI estimators. As expected, the MSE decreases for larger sample size n and increases for higher dimension. In this setting, we generated samples from the standard normal distribution for several sample sizes and dimensions. From Figure 4 (left) we observe that for larger sample sizes, the MSE curves are ordered according to their corresponding dimensions. The results in Section 2.4 strongly depend on the lower and upper density bounds and provide the optimal parameter within a range $[\alpha_L, \alpha_U]$; therefore, in this section we only analyze one case where the lower and upper bounds are known and the optimal α can be computed. Figure 4 (right) illustrates the MSE versus the proportion parameter α when samples are generated from a truncated normal distribution. First, following Section 2.4, we compute the range $[\alpha_L, \alpha_U]$ and then derive the optimal α in this range; each experiment with a different sample size and dimension therefore yields a different range. We observe that the MSE is not a monotonic function of α and that its behavior strongly depends on the sample size n, the dimension d, and the density bounds. Additional study of this dependency is described in Appendix A.6. In this set of experiments, following the results in Section 2.4, the optimal value of α is always the lower bound $\alpha_L$, as indicated in Figure 4 (right).
Figure 4
MSE log–log plots as a function of the sample size n for the proposed FR estimator. We compare the MSE of our proposed FR estimator for various dimensions (left). As d increases, the blue curve takes larger values than the green and orange curves, i.e., the MSE increases as d grows; this is more evident for large sample size n. The second experiment (right) focuses on the optimal proportion α within the feasible range; the optimal value coincides with the lower bound $\alpha_L$.
The parameter α is studied further in three scenarios where the lower and upper density bounds are assumed unknown, so the results in Section 2.4 are not applicable. In this set of experiments, we varied α over $\{0.2, 0.5, 0.8\}$ to divide our original sample. We generated samples from an isotropic multivariate standard normal distribution in all three scenarios (all features are independent). Therefore, the true GMI is zero, and in all scenarios the GMI column, corresponding to the MST-GMI, is compared with zero. In each scenario we fixed the dimension d and the sample size n and varied α. The dimensions and sample sizes in Scenarios 1, 2, and 3 are $(d,n) = (6, 1000)$, $(8, 1500)$, and $(10, 2000)$, respectively. In Table 1 the last column (α) stars the parameter value with the minimum MSE and GMI in each scenario. Table 1 shows that in these experiments the GMI estimator has lower MSE (i.e., is more accurate) when $\alpha = 0.5$ than when $\alpha = 0.2$ or $\alpha = 0.8$. This experimentally demonstrates that when splitting the training data, the proposed Algorithm 1 performs better with $\alpha = 0.5$; a simple grid-search procedure along these lines is sketched after Table 1.
Table 1
Comparison between different scenarios of various dimensions and sample sizes in terms of the parameter α. We applied the MST-GMI estimator to estimate the GMI ($I_\alpha$) with $\alpha \in \{0.2, 0.5, 0.8\}$. We varied the dimension and sample size in each scenario. We observe that the MST-GMI estimator provides the lowest MSE when $\alpha = 0.5$, indicated by a star (*).
Overview Table for Different d, n, and α

| Experiments  | Dimension (d) | Sample Size (n) | GMI (Iα) | MSE (×10⁻⁴) | Parameter (α) |
|--------------|---------------|-----------------|----------|-------------|---------------|
| Scenario 1–1 | 6             | 1000            | 0.0229   | 12          | 0.2           |
| Scenario 1–2 | 6             | 1000            | 0.0143   | 4.7944      | 0.5 *         |
| Scenario 1–3 | 6             | 1000            | 0.0176   | 6.3867      | 0.8           |
| Scenario 2–1 | 8             | 1500            | 0.0246   | 11          | 0.2           |
| Scenario 2–2 | 8             | 1500            | 0.0074   | 1.6053      | 0.5 *         |
| Scenario 2–3 | 8             | 1500            | 0.0137   | 5.3863      | 0.8           |
| Scenario 3–1 | 10            | 2000            | 0.0074   | 2.3604      | 0.2           |
| Scenario 3–2 | 10            | 2000            | 0.0029   | 0.54180     | 0.5 *         |
| Scenario 3–3 | 10            | 2000            | 0.0262   | 11          | 0.8           |
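The scenario design above suggests a simple empirical selection rule for α when the density bounds required by Section 2.4 are unavailable. The sketch below (our own illustration, reusing the hypothetical gmi_estimate function from Section 2.2) grid-searches α on synthetic independent features, for which the true GMI is exactly zero, and returns the α with the smallest empirical MSE:

```python
import numpy as np

def best_alpha(d=6, n=1000, alphas=(0.2, 0.5, 0.8), trials=100, seed=0):
    """Pick alpha minimizing empirical MSE against the known truth GMI = 0."""
    rng = np.random.default_rng(seed)
    mse = {}
    for a in alphas:
        # Independent X and Y blocks: the population GMI is exactly zero,
        # so the squared estimate is the squared error.
        est = [gmi_estimate(rng.standard_normal((n, d // 2)),
                            rng.standard_normal((n, d - d // 2)),
                            alpha=a, rng=rng)
               for _ in range(trials)]
        mse[a] = float(np.mean(np.square(est)))
    return min(mse, key=mse.get), mse
```

With the settings of Scenario 1 (d = 6, n = 1000), this procedure corresponds to reading off the starred row of Table 1.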
Finally, Figure 5 shows the runtime as a function of the sample size n, varied over a wide range. Observe that for smaller numbers of samples the KDE-GMI method is slightly faster, but as n becomes large we see a significant relative speedup of the proposed MST-GMI method.
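A minimal timing harness of the kind that produces such a comparison is sketched below, assuming the gmi_estimate and kde_gmi sketches from the earlier sections are in scope; the sample sizes and the correlated-pair construction are illustrative.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for n in (500, 1000, 2000, 4000):
    X = rng.standard_normal((n, 1))
    Y = 0.8 * X + 0.6 * rng.standard_normal((n, 1))  # correlated pair
    t0 = time.perf_counter(); gmi_estimate(X, Y, alpha=0.5)
    t1 = time.perf_counter(); kde_gmi(X, Y, p=0.5)
    t2 = time.perf_counter()
    print(f"n={n}: MST-GMI {t1 - t0:.3f}s, KDE-GMI {t2 - t1:.3f}s")
```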
Figure 5
Runtime of the KDE approach and the proposed MST-based estimator of the GMI versus the sample size. The proposed GMI estimator achieves a significant speedup for large sample sizes, while for small sample sizes the KDE method is slightly faster. Please note that in this experiment the samples are generated from a Gaussian distribution of fixed dimension.
4. Conclusions
In this paper, we have proposed a new measure of mutual information, called the geometric MI (GMI), which is related to the Henze–Penrose divergence. The GMI can be viewed as a dependency measure that is the limit of the Friedman–Rafsky test statistic, which depends on the MST over all data points. We established properties of the GMI, including convexity/concavity, a chain rule, and a type of data-processing inequality. A direct estimator of the GMI, called the MST-GMI, was introduced that uses random permutations of the observed relationships between variables in the multivariate samples. An explicit form for the MSE convergence rate bound was derived that depends on a free parameter called the proportionality parameter, and an asymptotically optimal form for this free parameter was given that minimizes the MSE convergence rate. Simulation studies were performed that illustrate and verify the theory.