| Literature DB >> 33267337 |
Damián G. Hernández, Inés Samengo.
Abstract
Determining the strength of nonlinear, statistical dependencies between two variables is a crucial matter in many research fields. The established measure for quantifying such relations is the mutual information. However, estimating mutual information from limited samples is a challenging task. Since the mutual information is the difference of two entropies, the existing Bayesian estimators of entropy may be used to estimate information. This procedure, however, is still biased in the severely under-sampled regime. Here, we propose an alternative estimator that is applicable to those cases in which the marginal distribution of one of the two variables (the one with minimal entropy) is well sampled. The other variable, as well as the joint and conditional distributions, can be severely under-sampled. We obtain a consistent estimator that exhibits very low bias, outperforming previous methods even when the sampled data contain few coincidences. As with other Bayesian estimators, our proposal focuses on the strength of the interaction between the two variables, without seeking to model the specific way in which they are related. A distinctive property of our method is that the main data statistic determining the amount of mutual information is the inhomogeneity of the conditional distribution of the low-entropy variable in those states in which the large-entropy variable registers coincidences.
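The abstract's starting point, that mutual information can be written as a difference of two entropies, is easy to make concrete. Below is a minimal Python sketch (ours, not the paper's) of the naive plug-in estimator that the Bayesian methods discussed here improve upon; it computes I(X;Y) = H(Y) - H(Y|X) from raw counts. Function names are illustrative.

```python
import numpy as np
from collections import Counter

def entropy_from_counts(counts, base=2.0):
    """Plug-in (maximum-likelihood) entropy of an empirical distribution."""
    p = np.array(list(counts)) / sum(counts)
    return float(-np.sum(p * np.log(p)) / np.log(base))

def plugin_mutual_information(xs, ys):
    """Naive plug-in estimate of I(X;Y) = H(Y) - H(Y|X), in bits.

    Known to be strongly biased when X has many under-sampled states,
    which is exactly the regime the paper's estimator targets.
    """
    n = len(xs)
    h_y = entropy_from_counts(Counter(ys).values())
    h_y_given_x = 0.0
    for x, n_x in Counter(xs).items():
        ys_at_x = [y for xi, y in zip(xs, ys) if xi == x]
        h_y_given_x += (n_x / n) * entropy_from_counts(Counter(ys_at_x).values())
    return h_y - h_y_given_x
```

When most x states are seen only once or twice, the per-state entropies in H(Y|X) collapse toward zero, so the plug-in estimate of I is inflated; this is the severe-undersampling bias that the Bayesian estimators mentioned in the abstract are designed to correct.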
Keywords: Bayesian estimation; bias; mutual information; sampling
Year: 2019 PMID: 33267337 PMCID: PMC7515115 DOI: 10.3390/e21060623
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. A scheme of our method to estimate the mutual information between two variables X and Y. (a) We collect a few samples of a variable x with a large number of effective states, each sample characterized by a binary variable y (the two values represented in white and gray). We consider different hypotheses about the strength with which the probability of each y-value varies with x; (b) one possibility is that the conditional probability of each of the two y-values hardly varies with x. This situation is modeled by assuming that the different conditionals q_x = p(y|x) are random variables governed by a Beta distribution with a large hyper-parameter β; (c) on the other hand, the conditional probability could vary strongly with x. This situation is modeled by a Beta distribution with a small hyper-parameter β; (d) as β varies, so does the prior mutual information ⟨I⟩_β (Equation (18)). This prior is obtained by averaging the values produced by the different sets of marginal distributions that can be generated when sampling the prior of Equation (17). The shaded area around the solid line illustrates such fluctuations in ⟨I⟩_β.
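As a companion to panel (d), the following sketch estimates the prior mean information as a function of β by Monte Carlo. It assumes, purely for illustration, a uniform marginal over k states of x; the paper instead averages over marginals sampled from the prior of Equation (17). The draw q_x ~ Beta(β, β) follows the caption; all names and default values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) variable; vectorized over arrays."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def prior_information(beta, k=1000, n_draws=200):
    """Monte Carlo estimate of the prior mean and spread of I versus beta,
    with each conditional q_x = p(y=1|x) drawn from Beta(beta, beta).
    Assumes a uniform marginal over the k states of x (illustrative only)."""
    info = np.empty(n_draws)
    for i in range(n_draws):
        q = rng.beta(beta, beta, size=k)   # conditionals p(y=1|x)
        # I = H(Y) - H(Y|X) under a uniform p(x)
        info[i] = binary_entropy(q.mean()) - binary_entropy(q).mean()
    return info.mean(), info.std()

# Small beta: conditionals pile up near 0 and 1, so y depends strongly on x;
# large beta: conditionals concentrate near 1/2, so the prior information -> 0.
for beta in (0.01, 0.1, 1.0, 10.0):
    mean_i, std_i = prior_information(beta)
    print(f"beta={beta:5.2f}  <I>_beta ~ {mean_i:.3f} +/- {std_i:.3f} bits")
```

The spread reported alongside the mean plays the role of the fluctuation band shaded around the solid line in panel (d).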
Figure 2. Comparison of the performance of four different estimators of the mutual information I: the plug-in estimator, the NSB estimator used in the limit of infinite states, the PYM estimator, and our estimator (Equation (20)) calculated with the β that maximizes the marginal likelihood (Equation (21)). The curves represent the average over 50 different data sets, with the standard deviation displayed as a colored area around the mean. (a) Estimates of mutual information as a function of the total number of samples N, when the conditionals q_x are generated under the hypothesis of our method (Equation (17)). We sample the marginal probabilities once (as described in [14]), as well as the conditionals q_x. The exact value of I is shown as a horizontal dashed line; (b) estimates of mutual information, for data sets where the conditional probabilities have spherical symmetry. X, a binary variable of dimension 12, corresponds to the presence of 12 delta functions equally spaced on a sphere. We generate the conditional probabilities such that they are invariant under rotations of the sphere, namely q_{Rx} = q_x, with R a rotation. To this aim, we set q_x as a sigmoid function of a combination of frequency components of the spherical spectrum [24]; (c) estimates of mutual information, for a conditional distribution far away from our hypotheses. The x states are generated as Bernoulli binary vectors, while the conditional probabilities depend on the parity of the sum of the components of the vector. When the sum is even, q_x is set to a fixed value; when it is odd, q_x is generated by sampling a mixture of two deltas of equal weight. The resulting distribution of q-values contains three peaks, and therefore cannot be described with a Dirichlet distribution; (d) bias in the estimation as a function of the value of the mutual information. Settings remain the same as in (a), but fixing N and changing β in the conditional; (e) bias in the estimation as a function of the value of the mutual information. Settings as in (b), but fixing N and changing the gain of the sigmoid in the conditional; (f) bias in the estimation as a function of the value of the mutual information. Settings as in (c), but fixing N and changing the parameters of the mixture in the conditional.
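To make the benchmark protocol of panel (a) concrete, here is a sketch of generating data under the Beta-prior hypothesis and tracking an estimate as N grows. It reuses binary_entropy and plugin_mutual_information from the sketches above and evaluates only the plug-in baseline; the NSB, PYM, and Equation (20) estimators require their own implementations. The uniform marginal, k = 1000, and β = 0.05 are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw of the true conditionals q_x = p(y=1|x), kept fixed across N,
# mirroring the caption's "we sample once ... as well as the conditionals".
q = rng.beta(0.05, 0.05, size=1000)
true_i = binary_entropy(q.mean()) - binary_entropy(q).mean()

for n in (100, 1000, 10000):
    xs = rng.integers(0, len(q), size=n)           # uniform marginal over x
    ys = (rng.random(n) < q[xs]).astype(int)       # binary y given x
    est = plugin_mutual_information(list(xs), list(ys))
    print(f"N={n:6d}  true I={true_i:.3f} bits  plug-in={est:.3f} bits")
```

Running this shows the plug-in estimate drifting above the true value at small N and converging from above as sampling improves, the qualitative behavior the figure quantifies for all four estimators.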
Figure 3. Verification of the accuracy of the analytically predicted mean posterior information (Equation (20)) and variance (Equation (A4)) in the severely under-sampled regime. A collection of 13,500 distributions is constructed by sampling the marginals and the conditionals q_x, with the hyper-parameter β varying within a fixed set of values, following Equation (19). Each distribution has an associated information I. From each distribution, we take five sets of just a few samples. (a) The values of I are grouped according to the multiplicities produced by the samples, averaged together, and depicted as the y component of each data point. The x component is the analytical result of Equation (20), based on the sampled multiplicities; (b) same analysis for the standard deviation of the information (the square root of the variance calculated in Equation (A4)).
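The grouping "according to the multiplicities" in panel (a) refers to the profile of repeat counts among the sampled x states. A small self-contained helper (names are ours, not the paper's) makes the statistic explicit:

```python
from collections import Counter

def multiplicity_profile(xs):
    """Map n -> number of distinct x states observed exactly n times.

    States with n >= 2 are the coincidences that, per the abstract, carry
    most of the evidence about the mutual information when x is severely
    under-sampled.
    """
    samples_per_state = Counter(xs)               # x -> occurrence count
    return dict(Counter(samples_per_state.values()))

# Example: three states seen once, one state seen three times.
print(multiplicity_profile([3, 7, 3, 9, 1, 3]))  # {1: 3, 3: 1}
```

Data sets sharing the same multiplicity profile are pooled before comparing the realized information against the analytical prediction of Equation (20).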