Literature DB >> 33285906

On the Difference between the Information Bottleneck and the Deep Information Bottleneck.

Abstract

Combining the information bottleneck model with deep learning by replacing mutual information terms with deep neural nets has proven successful in areas ranging from generative modelling to interpreting deep neural networks. In this paper, we revisit the deep variational information bottleneck and the assumptions needed for its derivation. The two assumed properties of the data, X and Y, and their latent representation T, take the form of two Markov chains T - X - Y and X - T - Y . Requiring both to hold during the optimisation process can be limiting for the set of potential joint distributions P ( X , Y , T ) . We, therefore, show how to circumvent this limitation by optimising a lower bound for the mutual information between T and Y: I ( T ; Y ) , for which only the latter Markov chain has to be satisfied. The mutual information I ( T ; Y ) can be split into two non-negative parts. The first part is the lower bound for I ( T ; Y ) , which is optimised in deep variational information bottleneck (DVIB) and cognate models in practice. The second part consists of two terms that measure how much the former requirement T - X - Y is violated. Finally, we propose interpreting the family of information bottleneck models as directed graphical models, and show that in this framework, the original and deep information bottlenecks are special cases of a fundamental IB model.

Entities: Chemical Disease Gene Species

Keywords: Markov assumption; Markov chain; conditional independence; deep variational information bottleneck; information bottleneck; mutual information

Year: 2020 PMID： 33285906 PMCID： PMC7516540 DOI： 10.3390/e22020131

Source DB: PubMed Journal: Entropy (Basel) ISSN： 1099-4300 Impact factor: 2.524

1. Introduction

Deep latent variable models, such as generative adversarial networks [1] and the variational autoencoder (VAE) [2], have attracted much interest in the last few years. They have been used in many application and formed a conceptual basis for a number of extensions. One of popular deep latent variable models is the deep variational information bottleneck (DVIB) [3]. Its foundational idea is that of applying deep neural networks to the information bottleneck (IB) model [4], which finds a sufficient statistic T of a given variable X while retaining side information about a variable Y. The original IB model, as well as DVIB, assumes the Markov chain . Additionally, in the latter model, the Markov chain appears by construction. The relationship between the two assumptions and how it influences the set of potential solutions have been neglected so far. In this paper, we clarify this relationship by showing that it is possible to lift the original IB assumption in the context of the deep variational information bottleneck. It can be achieved by optimising a lower bound on the mutual information between T and Y, which follows naturally from the model’s construction. This explains why DVIB can optimise over a set of distributions which is not overly restrictive. This paper is structured as follows. In Section 2 we describe the information bottleneck and deep variational information bottleneck models, along with their extensions. Section 3 introduces the lower bound on the mutual information which makes it possible to lift the original IB assumption, and makes possible the interpretation of this bound. It also contains the specifications of IB as a directed graphical model. We provide concluding remarks in Section 4.

2. Related Work on the Deep Information Bottleneck Model

The information bottleneck was originally introduced in [4] as a compression technique in which a random variable X is compressed while preserving relevant information about another random variable Y. The problem was originally formulated using only information theory concepts. No analytical solution exists for the original formulation; however, an additional assumption that X and Y are jointly Gaussian distributed leads to a special case of the IB, the Gaussian information bottleneck, introduced in [5], where the optimal compression is also Gaussian distributed. The Gaussian information bottleneck has been further extended to sparse compression and to meta-Gaussian distributions (multivariate distributions with a Gaussian copula and arbitrary marginal densities) in [6]. The idea of applying deep neural networks to model the information common to X and T as well as Y and T has resulted in the formulation of the deep variational information bottleneck [3]. This model has been extended to account for invariance to monotonic transformations of the input variables in [7]. The information bottleneck method has also recently been applied to the analysis of deep neural networks in [8], by quantifying mutual information between the network layers and deriving an information theory limit on deep neural network efficiency. This has lead to attempts at explaining the behaviour of deep neural networks with the IB formalism [9,10]. We now proceed to formally define the IB and DVIB models. Throughout this paper, we adopt the following notation. Define the Kullback–Leibler divergence (KL divergence) between two (discrete or continuous) probability distributions P and Q as . Note that the KL divergence is always non-negative. The mutual information between X and Y is defined as Since the KL divergence is not symmetric, the divergence between the product of the marginals and the joint distribution has also been defined as the lautum information [11]:Both quantities have conditional counterparts: Let denote entropy for discrete and differential entropy for continuous X. Analogously, denotes conditional entropy for discrete and conditional differential entropy for continuous X and Y.

2.1. Information Bottleneck

Given two random vectors X and Y, the information bottleneck method [4] searches for a third random vector T, which, while compressing X, preserves information contained in Y. The resulting variational problem is defined as follows: where is a parameter defining the trade-off between compression of X and preservation of Y. The solution is the optimal conditional distribution of . No analytical solution exists for the general IB problem defined by Equation (4); however, for discrete X and Y, a numerical approximation of the optimal distribution T can be found with the Blahut–Arimoto algorithm for rate-distortion function calculation [4]. Note that the assumed property of the solution is used in the derivation of the model.

2.1.1. Gaussian Information Bottleneck

For Gaussian distributed , let the partitioning of the joint covariance matrix be denoted as follows: The assumption that X and Y are jointly Gaussian distributed leads to the Gaussian information bottleneck [5] where the solution T of Equation (4) is also Gaussian distributed. T is then a noisy linear projection of X; i.e., , where is independent of X. This means that . The IB optimisation problem defined in Equation (4) becomes an optimisation problem over the matrix A and noise covariance matrix : Recall that for n-dimensional Gaussian distributed random variables, entropy, and hence mutual information, have the following form: where and denote covariance matrices of X and , respectively. The notation is used for the determinant of a matrix M. The Gaussian information bottleneck problem has an analytical solution, given in [5]: for a fixed , Equation (6) is optimised by and A having an analytical form depending on and eigenpairs of . Here again, the assumption is used in the derivation of the solution.

2.1.2. Sparse Gaussian Information Bottleneck

Sparsity of the compression in the Gaussian IB can be ensured by requiring the projection matrix A to be diagonal; i.e., . It has been shown in [6] that since for any positive definite and symmetric A, the sparsity requirement simplifies Equation (6) to minimisation over diagonal matrices with positive entries . I.e.: with and independent of X.

2.2. Deep Variational Information Bottleneck

The deep variational information bottleneck [3] is a variational approach to the problem defined in Equation (4). The main idea is to parametrise the conditionals and with neural networks so that the two mutual informations in Equation (4) can be directly recovered from two deep neural nets. To this end, one can express the mutual informations as follows: where the last equality in Equation (9) follows from the Markov assumption in the information bottleneck model: . The conditional is computed by sampling from the latent representation T as in the variational autoencoder [2]. Note that this form of the DVIB makes sure that one is only required to sample from the data distribution , the variational decoder , and the stochastic encoder —implemented as deep neural networks parametrised by and , respectively. In the latter, T depends only on X because of the assumption.

Deep Copula Information Bottleneck

The authors of [3] argue that the entropy term in the last line of Equation (9) can be omitted, as Y is a constant. It has, however, been pointed out [7] that the IB solution should be invariant to monotonic transformations of both X and Y, since the problem is defined only in terms of mutual information which exhibits such invariance (i.e., for an invertible f). The term remaining in Equation (9) after leaving out does not have this property. Furthermore, problems limiting the DVIB when specifying marginal distributions of and in Equations (8) and (9) have been identified [7]. These considerations have lead to the formulation of the deep copula information bottleneck, where the data are subject to the following transformation , where and are the Gaussian and empirical cumulative distribution functions, respectively. This transformation makes them depend only on their copula and not on the marginals. This has also been shown to result in superior interpretability and disentanglement of the latent space T.

2.3. Bounds on Mutual Information in Deep Latent Variable Models

The deep information bottleneck model can be thought of as an extension of the VAE. Indeed, one can incorporate a variational approximation of the posterior to Equation (9) and by obtain [3]. A number of other bounds and approximations of mutual information have been considered in the literature. Many of them are motivated by obtaining a better representation of the latent space T. The article [12] considers different encoding distributions and derives a common bound for on the rate-distortion plane. The authors subsequently extend this bound to the case where it is independent of the sample, which makes it possible to compare VAE and generative adversarial networks [13]. The authors of [14] use a Gaussian relaxation of the mutual information terms in the information bottleneck to bound them from below. They then proceed to compare the resulting method to canonical correlation analysis [15]. Extensions of generative models with an explicit regularisation in the form of a mutual information term have been proposed [16,17]. In the latter, an explicit lower bound on the mutual information between the latent space T and the generator network is derived. Similarly, implicit regularisation of generative models in the form of dropout has been shown to be equivalent to the deep information bottleneck model [18,19]. The authors also mention that both Markov properties should hold in the IB solution, and note that is enforced by construction, while is only approximated by the optimal joint distribution of X, Y, and T. They do not, however, analyse the impact of both Markov assumptions and the relationship between them.

3. The Difference between Information Bottleneck Models

In this section, we focus on the difference between the original and deep IB models. First, we examine how the different Markov assumptions lead to different forms that the term admits. In Section 3.2, we consider both models and show that describing them as directed graphical models makes it possible to elucidate a fundamental property shared by all IB models. We then proceed to summarise the comparison in Section 3.3.

3.1. Clarifying the Discrepancy between the Assumptions in IB and DVIB

3.1.1. Motivation

The derivation of the deep variational information bottleneck model described in Section 2.2 uses the Markov assumption (last line of Equation (9), Figure 1a). At the same time, by construction, the model adheres to the data generating process described by the following structural equations ( are noise terms independent of X and T, respectively):

Figure 1

Markov assumptions for the information bottleneck and the deep information bottleneck.

This implies that the Markov chain is satisfied in the model, too (Figure 1b). Requiring that both Markov chains hold in the resulting joint distribution can be overly restrictive (note that no directed acyclic graph with three vertices to which such a distribution is faithful exists). Thus, the question of whether the property in DVIB can be lifted arises. In what follows, we show that it is indeed possible. Recall from Section 2.2 that the DVIB model relies on sampling only from the data , encoder , and decoder . Therefore, for optimising the latent IB, we want to avoid specifying the full conditional , since this would require us to explicitly model the joint influence of both X and Y on T (which might be a complex distribution). We now proceed to show how to bound in a way that only involves sampling from the encoder and circumvents modelling without using the assumption.

3.1.2. Bound Derivation

First, adopt the mutual information from the penultimate line of Equation (9) (i.e., without assuming the property): Now, rewrite Equation (11) using (i.e., : X and Y are conditionally independent given T): Focusing on in Equation (12), we obtain: Plugging Equation (13) into Equation (12), i.e., averaging over X, and using again, we arrive at:

3.1.3. Interpretation

According to Equation (14), the mutual information consists of three terms: its lower bound which is actually optimised in DVIB and its extensions, which are 0 when both Markov assumptions are satisfied, and the entropy term . Equation (14) shows how to bound the mutual information term in the IB model (Equation (4)) so that the value of the bound depends on the data and marginals , without using the Markov assumption . If we again implement the marginal distributions as deep neural nets and , (Equation (14) provides the lower bound which is actually optimised in DVIB (Equation (9)). By training the networks, we find parameters and in and such that both and are close to zero. The terms and can thus be interpreted as a measure of how much the original IB assumption is violated during the training of the model that implements by construction. The difference between the original IB and DVIB is that in the former, is used to derive the general form of the solution T, while is approximated as closely as possible by T (as noted in [18]. In the latter, is forced by construction, and is approximated by optimising the lower bound given by Equation (14). The "distance" to a distribution satisfying both assumptions is measured by the tightness of the bound.

3.2. The Original IB Assumption Revisited

3.2.1. Motivation for Conditional Independence Assumptions in Information Bottleneck Models

In the original formulation of the information bottleneck (Section 2.1 and Equation (4)), given by , one optimises over while disregarding any dependence on Y. This suggests that the defining feature of the IB model is the absence of a direct functional dependence of T on Y. This can be achieved, e.g., by the first structural equation in Equation (10): That means any influence of Y on T must go through X. Note that this is implied by the original IB assumption , but not the other way around. In particular, the model given by can also be parametrized such that there is no direct dependence of T on Y, as in, e.g., Equation (10). This means that DVIB, despite optimising a lower bound on the IB, implements the defining feature of IB as well.

3.2.2. Information Bottleneck as a Directed Graphical Model

The above discussion leads to the conclusion that the IB assumptions might also be described by directed graphical models. Such models encode conditional independence relations with d-separation (for the definition and examples of d-separation in directed acyclic graphs, see [20] or [21] (Chapters 1.2.3 and 11.1.2)). In particular, any pair of variables d-separated by Z is conditionally independent given Z. The arrows of the directed acyclic graph (DAG) are assumed to correspond to the data generating process described by a set of structural equation (as in Equation (10)). Therefore, the following probability factorisation and data generating process hold for a DAG model: where stands for the set of direct parents of and are exogenous noise variables. Let us now focus again on the motivation for the assumption in Equation (4). It prevents the model from choosing a degenerate solution of (in which case and ). Note, however, that while is a sufficient condition for such a solution to be excluded (which justifies the correctness of the original IB), the necessary condition is that T cannot depend directly on Y. This means that the IB Markov assumption can be indeed reduced to requiring the absence of a direct arrow from Y to T in the underlying DAG. Note that this can be achieved in the undirected model too. One thus wishes to avoid degenerate solutions which impair the bottleneck nature of T: it should contain information about both X and Y, the trade-off between them being steered by . It is therefore necessary to exclude DAG structures which encode independence of X and T as well as Y and T. Such independences are achieved by collider structures (with two different variables pointing towards a common child) in DAGs; i.e., and (they lead to degenerate solutions of and , respectively). To sum up, the goal of asserting the conditional independence assumption in Equation (4) is to avoid degenerate solutions which impair the bottleneck nature of the representation T. When modelling the information bottleneck with DAG structures, one has to exclude the arrow and collider structures. A simple enumeration of the possible DAG models for the information bottleneck results in 10 distinct models listed in Table 1.

Table 1

Directed graphical models of the information bottleneck.

Defining Markov Assumption			Other/None
Admissible DAG models	T→X→YT←X→YT←X←Y

As can be seen, considering the information bottleneck as a directed graphical model (DAG) makes room for a family of models which fall into three broad categories, satisfying one of the two undirected Markov assumptions or , as described in Section 3.1, or neither of them (see Table 1). The difference between particular models lies in the necessity to specify different conditional distributions and parametrise them, which might lead to situations in which no joint distribution exists (which is likely to be the case in the third category). Focusing on the two first categories, we see that the former corresponds to the standard parametrisations of the information bottleneck and the Gaussian information bottleneck (see Section 2.1). In the latter, we see the deep information bottleneck (Equation (10)) as the first DAG. Note also that the second DAG satisfying the assumption in Table 1 defines the probabilistic CCA model [22]. This is not surprising, since the solutions of CCA and the Gaussian information bottleneck use eigenvectors of the same matrix [5].

3.3. Comparing IB and DVIB Assumptions

The original and deep information bottleneck models differ by using different Markov assumptions (see Figure 1) in the derivation of the respective solutions. As demonstrated in Section 3.1, DVIB optimises a lower bound on the objective function of IB. The tightness of the bound measures to what extent the IB assumption (Figure 1a) is violated. As described in Section 3.2, characterising both models as directed graphical models results in two different DAGs for the IB and DVIB. Both models are summarised in Table 2.

Table 2

Comparison of the information bottleneck and deep variational information bottleneck.

	Information Bottleneck (IB)	Deep Information Bottleneck (DVIB)
Assumed Markov chain
Possible set of structural equations	T=fT(X,ηT),Y=fY(X,ηY)	T=fT(X,ηT),Y=fY(T,ηY)
Corresponding DAG
Optimised term corresponding to I(T;Y)	EP(X,Y)EP(T\|X)logP(Y\|T)+I(Y;T\|X)+L(Y;T\|X)+H(Y)	EP(X,Y)EP(T\|X)logP(Y\|T)+H(Y)

4. Conclusions

In this paper, we showed how to lift the information bottleneck’s Markov assumption in the context of the deep information bottleneck model, in which holds by construction. This result explains why standard implementations of the deep information bottleneck can optimise over a larger amount of joint distributions while only specifying the marginal . It is made possible by optimising the lower bound on the mutual information provided here, rather than the full mutual information. We also provided a description of the information bottleneck as a DAG model and showed that it is possible to identify a fundamental necessary feature of the IB in the language of directed graphical models. This property is satisfied for both the original and deep information bottlenecks.

1 in total

1. Information Dropout: Learning Optimal Representations Through Noisy Computation.

Authors: Alessandro Achille; Stefano Soatto
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2018-01-10 Impact factor: 6.226

1 in total