Abstract
The mutual information between the state of a neural network and the state of the external world represents the amount of information stored in the neural network that is associated with the external world. In contrast, the surprise of a sensory input indicates how unpredictable the current input is; in this sense, it measures the network's inference ability, and an upper bound on the surprise is known as the variational free energy. According to the free-energy principle (FEP), a neural network continuously minimizes the free energy to perceive the external world. For the survival of animals, inference ability is considered more important than the mere amount of memorized information. In this study, the free energy is shown to represent the gap between the amount of information stored in the neural network and the amount available for inference. This concept connects the FEP with the infomax principle and provides a useful measure for quantifying the amount of information available for inference.
Keywords: free-energy principle; independent component analysis; infomax principle; internal model hypothesis; principal component analysis; unconscious inference
Year: 2018 | PMID: 33265602 | PMCID: PMC7513032 | DOI: 10.3390/e20070512
Source DB: PubMed | Journal: Entropy (Basel) | ISSN: 1099-4300 | Impact factor: 2.524
Table 1. Glossary of expressions (symbols reconstructed following standard free-energy-principle notation; the original symbol column was not preserved in this record).
| Expression | Description |
|---|---|
| Generative process | A set of stochastic equations that generate the external-world dynamics |
| Recognition model | A model in the neural network that imitates the inverse of the generative process |
| Generative model | A model in the neural network that imitates the generative process |
| $s$ | Hidden sources |
| $x$ | Sensory inputs |
| $\theta$ | A set of parameters |
| $\lambda$ | A set of hyper-parameters |
| $\vartheta \equiv \{s, \theta, \lambda\}$ | A set of hidden states of the external world |
| $u$ | Neural outputs |
| $W$ | Synaptic strength matrices |
| $\gamma$ | State of neuromodulators |
| $\phi \equiv \{u, W, \gamma\}$ | A set of the internal states of the neural network |
| $\omega$ | Background noises |
| $\varepsilon$ | Reconstruction errors |
| $p(x)$ | The actual probability density of $x$ |
| $p(s \vert x), p(\theta \vert x), p(\lambda \vert x)$ | Actual probability densities (posterior densities) |
| $p(s), p(\theta), p(\lambda)$ | Prior densities |
| $p(x \vert \vartheta)$ | Likelihood function |
| $q(x), q(\vartheta), q(x, \vartheta)$ | Statistical models |
| $\sigma_x$ | Finite spatial resolution of $x$ |
| $\langle \cdot \rangle_p$ | Expectation of $\cdot$ over $p$ |
| $H[p]$ | Shannon entropy of $p$ |
| $H[p, q]$ | Cross entropy of $p$ and $q$ |
| $D_{\mathrm{KL}}[p \,\Vert\, q]$ | KLD between $p$ and $q$ |
| $I(x; \phi)$ | Mutual information between $x$ and $\phi$ |
| $-\ln q(x)$ | Surprise |
| $\langle -\ln q(x) \rangle_{p(x)}$ | Surprise expectation |
| $F(x)$ | Free energy |
| $\langle F(x) \rangle_{p(x)}$ | Free energy expectation |
| $I_u(x; \phi)$ | Utilizable information between $x$ and $\phi$ |
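With this notation in place, the variational free energy invoked in the abstract admits the standard FEP decomposition below. This is the generic identity, written in the glossary's (reconstructed) symbols rather than quoted from the paper; $q(\vartheta \vert \phi)$ denotes the recognition density encoded by the internal states, and $q(\vartheta \vert x) = q(x, \vartheta)/q(x)$ is the posterior under the statistical model.

$$
\begin{aligned}
F(x) &= \bigl\langle \ln q(\vartheta \vert \phi) - \ln q(x, \vartheta) \bigr\rangle_{q(\vartheta \vert \phi)} \\
&= \underbrace{-\ln q(x)}_{\text{surprise}} \;+\; \underbrace{D_{\mathrm{KL}}\bigl[\, q(\vartheta \vert \phi) \,\Vert\, q(\vartheta \vert x) \,\bigr]}_{\geq\, 0} \;\;\geq\; -\ln q(x).
\end{aligned}
$$

Because the KLD term is non-negative, $F(x)$ upper-bounds the surprise; minimizing $F$ therefore improves the recognition density while reducing the unpredictability of the inputs, which is the sense in which the FEP describes perception.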
Figure 1. Schematic images of the generative process of the environment (left) and the recognition and generative models of the neural network (right). Note that the neural network can access only the states on the right side of the dashed line, including x (see text in Section 2.2). Black arrows indicate causal relationships in the external world. Blue arrows indicate information flows in the neural network (i.e., actual causal relationships in the neural network), while red arrows indicate hypothesized causal relationships (intended to imitate the external world) under the generative model. See the main text and Table 1 for the meanings of variables and functions.
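As a toy, linear instance of the architecture in Figure 1, the sketch below wires a generative process, a recognition model, and a generative model together. It is a hypothetical illustration, not the paper's code: the dimensions, the Laplacian sources, the noise level, and the least-squares generative weights `W_gen` are all our choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Generative process (external world): x = A s + omega ----------------
n_src, n_obs, n_samp = 2, 4, 10_000
s = rng.laplace(size=(n_src, n_samp))            # hidden sources (non-Gaussian)
A = rng.normal(size=(n_obs, n_src))              # parameter (mixing) matrix
omega = 0.05 * rng.normal(size=(n_obs, n_samp))  # background noise
x = A @ s + omega                                # sensory inputs

# --- Recognition model (neural network): u = W x -------------------------
W = rng.normal(size=(n_src, n_obs))              # synaptic strength matrix
u = W @ x                                        # neural outputs

# --- Generative model: reconstruct x from u ------------------------------
# Least-squares generative weights W_gen (our stand-in for the network's
# generative model); eps holds the reconstruction errors.
W_gen = x @ u.T @ np.linalg.inv(u @ u.T)
eps = x - W_gen @ u
print("mean squared reconstruction error:", float(np.mean(eps**2)))
```

Here the recognition model (u = W x) carries the blue information flow of Figure 1, while `W_gen` plays the role of the red, hypothesized causal arrows mapping internal states back to inputs.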
Figure 2. Relationship between information measures. The mutual information between the inputs and the internal states of the neural network is less than or equal to the Shannon entropy of the inputs because of the information loss in the recognition model. The utilizable information is less than or equal to the mutual information, and the gap between them gives the expectation of the variational free energy, which quantifies the loss in the generative model. The sum of the principal component analysis (PCA) and independent component analysis (ICA) costs is equal to the gap between the Shannon entropy and the utilizable information, expressing the sum of the losses in the recognition and generative models.
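The caption's relationships can be written compactly in the glossary's (reconstructed) notation; this is our rendering of Figure 2, not an equation quoted from the paper:

$$
I_u \;\leq\; I(x; \phi) \;\leq\; H[p(x)], \qquad \langle F(x) \rangle_{p(x)} \;=\; I(x; \phi) - I_u,
$$

$$
\underbrace{H[p(x)] - I_u}_{\text{PCA + ICA costs}} \;=\; \underbrace{\bigl(H[p(x)] - I(x; \phi)\bigr)}_{\text{recognition-model loss}} \;+\; \underbrace{\bigl(I(x; \phi) - I_u\bigr)}_{\text{generative-model loss}}.
$$

Reading the chain left to right: the recognition model can lose information relative to the entropy of the inputs, and the generative model can fail to exploit what was kept; only the remainder, $I_u$, is available for inference.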
Figure 3. Difference between the infomax principle and the free-energy principle (FEP) when the sources follow a non-Gaussian distribution. Black, blue, and red circles indicate the results obtained when W is a random matrix, when W is optimized under the infomax principle (i.e., PCA), and when W is optimized under the FEP, respectively.
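The qualitative effect in Figure 3 can be reproduced with a small simulation. The sketch below is an illustrative stand-in rather than the paper's experiment: PCA realizes the infomax solution, FastICA stands in for the FEP-optimized W (which the paper relates to ICA), and all other settings are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(1)

# Non-Gaussian (Laplacian) sources, linearly mixed into sensory inputs.
n_src, n_samp = 3, 50_000
S = rng.laplace(size=(n_samp, n_src))   # true hidden sources
A = rng.normal(size=(n_src, n_src))     # mixing matrix
X = S @ A.T                             # sensory inputs (rows = samples)

def source_recovery(U, S):
    """Best absolute correlation of each recovered component with any source."""
    k = U.shape[1]
    C = np.corrcoef(U.T, S.T)[:k, k:]   # cross-correlation block
    return np.abs(C).max(axis=1)

U_rand = X @ rng.normal(size=(n_src, n_src))      # random W (black circles)
U_pca = PCA(n_components=n_src).fit_transform(X)  # infomax/PCA (blue circles)
U_ica = FastICA(n_components=n_src,               # FEP stand-in (red circles)
                random_state=0).fit_transform(X)

for name, U in [("random W", U_rand), ("PCA", U_pca), ("ICA", U_ica)]:
    print(f"{name:9s} |corr| with sources:", np.round(source_recovery(U, S), 2))
```

With super-Gaussian (Laplacian) sources, PCA merely decorrelates the mixtures, so its components remain mixtures of the sources, whereas the ICA stand-in recovers components that correlate strongly with the true sources, mirroring the advantage the figure attributes to the FEP.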