| Literature DB >> 33267337 |
Damián G. Hernández, Inés Samengo.
Abstract
Determining the strength of nonlinear, statistical dependencies between two variables is a crucial matter in many research fields. The established measure for quantifying such relations is the mutual information. However, estimating mutual information from limited samples is a challenging task. Since the mutual information is the difference of two entropies, the existing Bayesian estimators of entropy may be used to estimate information. This procedure, however, is still biased in the severely under-sampled regime. Here, we propose an alternative estimator that is applicable to those cases in which the marginal distribution of one of the two variables (the one with minimal entropy) is well sampled. The other variable, as well as the joint and conditional distributions, can be severely under-sampled. We obtain a consistent estimator that exhibits very low bias, outperforming previous methods even when the sampled data contain few coincidences. As with other Bayesian estimators, our proposal focuses on the strength of the interaction between the two variables, without seeking to model the specific way in which they are related. A distinctive property of our method is that the main data statistic determining the amount of mutual information is the inhomogeneity of the conditional distribution of the low-entropy variable in those states in which the large-entropy variable registers coincidences.
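The abstract's starting point, that mutual information can be written as a difference of two entropies, is easy to make concrete. Below is a minimal Python sketch (ours, not the paper's) of the naive plug-in estimator that the Bayesian methods discussed here improve upon; it computes I(X;Y) = H(Y) - H(Y|X) from raw counts. Function names are illustrative.

```python
import numpy as np
from collections import Counter

def entropy_from_counts(counts, base=2.0):
    """Plug-in (maximum-likelihood) entropy of an empirical distribution."""
    p = np.array(list(counts)) / sum(counts)
    return float(-np.sum(p * np.log(p)) / np.log(base))

def plugin_mutual_information(xs, ys):
    """Naive plug-in estimate of I(X;Y) = H(Y) - H(Y|X), in bits.

    Known to be strongly biased when X has many under-sampled states,
    which is exactly the regime the paper's estimator targets.
    """
    n = len(xs)
    h_y = entropy_from_counts(Counter(ys).values())
    h_y_given_x = 0.0
    for x, n_x in Counter(xs).items():
        ys_at_x = [y for xi, y in zip(xs, ys) if xi == x]
        h_y_given_x += (n_x / n) * entropy_from_counts(Counter(ys_at_x).values())
    return h_y - h_y_given_x
```

When most x states are seen only once or twice, the per-state entropies in H(Y|X) collapse toward zero, so the plug-in estimate of I is inflated; this is the severe-undersampling bias that the Bayesian estimators mentioned in the abstract are designed to correct.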
Keywords: Bayesian estimation; bias; mutual information; sampling
Year: 2019 PMID: 33267337 PMCID: PMC7515115 DOI: 10.3390/e21060623
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. A scheme of our method to estimate the mutual information between two variables X and Y. (a) We collect a few samples of a variable x with a large number of effective states, each sample characterized by a binary variable y (the two values represented in white and gray). We consider different hypotheses about the strength with which the probability of each y-value varies with x; (b) one possibility is that the conditional probability of each of the two y-values hardly varies with x. This situation is modeled by assuming that the different conditionals q_x = p(y|x) are random variables governed by a Beta distribution with a large hyper-parameter β; (c) on the other hand, the conditional probability could vary strongly with x. This situation is modeled by a Beta distribution with a small hyper-parameter β; (d) as β varies, so does the prior mutual information ⟨I⟩_β (Equation (18)). This prior is obtained by averaging the values produced by the different sets of marginal distributions that can be generated when sampling the prior of Equation (17). The shaded area around the solid line illustrates such fluctuations in ⟨I⟩_β.
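As a companion to panel (d), the following sketch estimates the prior mean information as a function of β by Monte Carlo. It assumes, purely for illustration, a uniform marginal over k states of x; the paper instead averages over marginals sampled from the prior of Equation (17). The draw q_x ~ Beta(β, β) follows the caption; all names and default values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) variable; vectorized over arrays."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def prior_information(beta, k=1000, n_draws=200):
    """Monte Carlo estimate of the prior mean and spread of I versus beta,
    with each conditional q_x = p(y=1|x) drawn from Beta(beta, beta).
    Assumes a uniform marginal over the k states of x (illustrative only)."""
    info = np.empty(n_draws)
    for i in range(n_draws):
        q = rng.beta(beta, beta, size=k)   # conditionals p(y=1|x)
        # I = H(Y) - H(Y|X) under a uniform p(x)
        info[i] = binary_entropy(q.mean()) - binary_entropy(q).mean()
    return info.mean(), info.std()

# Small beta: conditionals pile up near 0 and 1, so y depends strongly on x;
# large beta: conditionals concentrate near 1/2, so the prior information -> 0.
for beta in (0.01, 0.1, 1.0, 10.0):
    mean_i, std_i = prior_information(beta)
    print(f"beta={beta:5.2f}  <I>_beta ~ {mean_i:.3f} +/- {std_i:.3f} bits")
```

The spread reported alongside the mean plays the role of the fluctuation band shaded around the solid line in panel (d).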
Figure 2. Comparison of the performance of four different estimators of the mutual information I: the plug-in estimator, the NSB estimator used in the limit of infinite states, the PYM estimator, and our estimator (Equation (20)) calculated with the β that maximizes the marginal likelihood (Equation (21)). The curves represent the average over 50 different data sets, with the standard deviation displayed as a colored area around the mean. (a) Estimates of mutual information as a function of the total number of samples N, when the conditionals q_x are generated under the hypothesis of our method (Equation (17)). We sample the marginal probabilities once (as described in [14]), as well as the conditionals q_x. The exact value of I is shown as a horizontal dashed line; (b) estimates of mutual information, for data sets where the conditional probabilities have spherical symmetry. X, a binary variable of dimension 12, corresponds to the presence of 12 delta functions equally spaced on a sphere. We generate the conditional probabilities such that they are invariant under rotations of the sphere, namely q_{Rx} = q_x, with R a rotation. To this aim, we set q_x as a sigmoid function of a combination of frequency components of the spherical spectrum [24]; (c) estimates of mutual information, for a conditional distribution far away from our hypotheses. The x states are generated as Bernoulli binary vectors, while the conditional probabilities depend on the parity of the sum of the components of the vector. When the sum is even, q_x is set to a fixed value; when it is odd, q_x is generated by sampling a mixture of two deltas of equal weight. The resulting distribution of q-values contains three peaks, and therefore cannot be described with a Dirichlet distribution; (d) bias in the estimation as a function of the value of the mutual information. Settings remain the same as in (a), but fixing N and changing β in the conditional; (e) bias in the estimation as a function of the value of the mutual information. Settings as in (b), but fixing N and changing the gain of the sigmoid in the conditional; (f) bias in the estimation as a function of the value of the mutual information. Settings as in (c), but fixing N and changing the parameters of the mixture in the conditional.
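To make the benchmark protocol of panel (a) concrete, here is a sketch of generating data under the Beta-prior hypothesis and tracking an estimate as N grows. It reuses binary_entropy and plugin_mutual_information from the sketches above and evaluates only the plug-in baseline; the NSB, PYM, and Equation (20) estimators require their own implementations. The uniform marginal, k = 1000, and β = 0.05 are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw of the true conditionals q_x = p(y=1|x), kept fixed across N,
# mirroring the caption's "we sample once ... as well as the conditionals".
q = rng.beta(0.05, 0.05, size=1000)
true_i = binary_entropy(q.mean()) - binary_entropy(q).mean()

for n in (100, 1000, 10000):
    xs = rng.integers(0, len(q), size=n)           # uniform marginal over x
    ys = (rng.random(n) < q[xs]).astype(int)       # binary y given x
    est = plugin_mutual_information(list(xs), list(ys))
    print(f"N={n:6d}  true I={true_i:.3f} bits  plug-in={est:.3f} bits")
```

Running this shows the plug-in estimate drifting above the true value at small N and converging from above as sampling improves, the qualitative behavior the figure quantifies for all four estimators.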
Figure 3. Verification of the accuracy of the analytically predicted mean posterior information (Equation (20)) and variance (Equation (A4)) in the severely under-sampled regime. A collection of 13,500 distributions is constructed by sampling the marginals and the conditionals q_x, with the hyper-parameter β varying within a fixed set of values, following Equation (19). Each distribution has an associated information I. From each distribution, we take five sets of just a few samples. (a) The values of I are grouped according to the multiplicities produced by the samples, averaged together, and depicted as the y component of each data point. The x component is the analytical result of Equation (20), based on the sampled multiplicities; (b) same analysis for the standard deviation of the information (the square root of the variance calculated in Equation (A4)).
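The grouping "according to the multiplicities" in panel (a) refers to the profile of repeat counts among the sampled x states. A small self-contained helper (names are ours, not the paper's) makes the statistic explicit:

```python
from collections import Counter

def multiplicity_profile(xs):
    """Map n -> number of distinct x states observed exactly n times.

    States with n >= 2 are the coincidences that, per the abstract, carry
    most of the evidence about the mutual information when x is severely
    under-sampled.
    """
    samples_per_state = Counter(xs)               # x -> occurrence count
    return dict(Counter(samples_per_state.values()))

# Example: three states seen once, one state seen three times.
print(multiplicity_profile([3, 7, 3, 9, 1, 3]))  # {1: 3, 3: 1}
```

Data sets sharing the same multiplicity profile are pooled before comparing the realized information against the analytical prediction of Equation (20).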