
Active Inference and Epistemic Value in Graphical Models.

Thijs van de Laar¹, Magnus Koudahl^1,2, Bart van Erp¹, Bert de Vries^1,3.

Abstract

The Free Energy Principle (FEP) postulates that biological agents perceive and interact with their environment in order to minimize a Variational Free Energy (VFE) with respect to a generative model of their environment. The inference of a policy (future control sequence) according to the FEP is known as Active Inference (AIF). The AIF literature describes multiple VFE objectives for policy planning that lead to epistemic (information-seeking) behavior. However, most objectives have limited modeling flexibility. This paper approaches epistemic behavior from a constrained Bethe Free Energy (CBFE) perspective. Crucially, variational optimization of the CBFE can be expressed in terms of message passing on free-form generative models. The key intuition behind the CBFE is that we impose a point-mass constraint on predicted outcomes, which explicitly encodes the assumption that the agent will make observations in the future. We interpret the CBFE objective in terms of its constituent behavioral drives. We then illustrate resulting behavior of the CBFE by planning and interacting with a simulated T-maze environment. Simulations for the T-maze task illustrate how the CBFE agent exhibits an epistemic drive, and actively plans ahead to account for the impact of predicted outcomes. Compared to an EFE agent, the CBFE agent incurs expected reward in significantly more environmental scenarios. We conclude that CBFE optimization by message passing suggests a general mechanism for epistemic-aware AIF in free-form generative models.


Keywords: active inference; constrained bethe free energy; free energy principle; message passing; variational optimization

Year: 2022 PMID: 35462780 PMCID: PMC9019474 DOI: 10.3389/frobt.2022.794464

Source DB: PubMed Journal: Front Robot AI ISSN: 2296-9144

1 Introduction

Free energy can be considered as a central concept in the natural sciences. Many natural laws can be derived through the principle of least action, which rests on variational methods to minimize a path integral of free energy over time (Caticha, 2012). In neuroscience, an application of the least action principle to biological behavior is formalized as the Free Energy Principle (Friston et al., 2006). The Free Energy Principle (FEP) postulates that biological agents perceive and interact with their environment in order to minimize a Variational Free Energy (VFE) that is defined with respect to a model of their environment. Under the FEP, perception relates to the process of hidden state estimation, where the agent tries to infer hidden causes of its sensory observations; and action (intervention) relates to a process where the agent actively tries to influence its (predicted) future observations by manipulating the external environment. Because the future is unobserved (by definition), the agent includes prior beliefs about desired outcomes in its model and infers a policy that prescribes a sequence of future controls. The corollary of the FEP that includes action is referred to as Active Inference (AIF) (Friston et al., 2010). The AIF literature describes multiple Free Energy (FE) objectives for policy planning, e.g., the Expected FE (Friston et al., 2015), Generalized FE (Parr and Friston, 2019) and Predicted (Bethe) FE (Schwöbel et al., 2018) (among others, see e.g. (Tschantz et al., 2020b; Hafner et al., 2020; Sajid et al., 2021)). Traditionally, the Expected Free Energy (EFE) is evaluated for a selection of policies, and a posterior distribution over policies is constructed from the corresponding EFEs. The EFE is designed to balance epistemic (knowledge seeking) and extrinsic (goal seeking) behavior. The active policy (the sequence of future controls to be executed in the environment) is then selected from this policy posterior (Friston et al., 2015). 
Several authors have attempted to formulate minimization of the EFE by message passing on factor graphs (de Vries and Friston, 2017; Parr et al., 2019; Parr and Friston, 2019; Champion et al., 2021). These formulations evaluate the EFE objective with the use of a message passing scheme. In this paper we revisit this problem and compare the EFE approach with the message passing interpretation of the variational optimization of a Bethe Free Energy (BFE) (Pearl, 1988; Caticha, 2004; Yedidia et al., 2005). However, the BFE is known to lack epistemic (information-seeking) qualities, and resulting BFE AIF agents therefore do not proactively seek informative states (Schwöbel et al., 2018). As a solution to the lack of epistemic qualities of the BFE, in this paper we approach epistemic behavior from a Constrained BFE (CBFE) perspective (Şenöz et al., 2021). We illustrate how optimization of a point-mass constrained BFE objective instigates self-evidencing behavior. Crucially, variational optimization of the CBFE can be expressed in terms of message passing on a graphical representation of the underlying generative model (GM) (Dauwels, 2007; Cox et al., 2019), without modification of the GM itself.

The contributions of this paper are as follows:

• We formulate the CBFE as an objective for epistemic-aware active inference (Section 2.6) that can be interpreted as message passing on a GM (Section 6);
• We interpret the constituent terms of the CBFE objective as drivers for behavior (Section 4);
• We illustrate our interpretation of the CBFE by planning and interacting with a simulated T-maze environment (Section 5);
• We show by simulation that the CBFE agent plans epistemic policies multiple time-steps ahead (Section 6.2), and accrues reward for a significantly larger set of scenarios than the EFE (Section 7).

The main advantage of AIF with the CBFE objective is that it allows inference to be fully automated by message passing, while retaining the epistemic qualities of the EFE. Automated message passing obviates the need for manual derivations and removes computational barriers in scaling AIF to more demanding settings (van de Laar, 2018).

2 Problem Statement

In this section we will introduce the free energy objectives as used throughout the paper. We start by introducing the Variational Free Energy (VFE), and explain how a VFE can be employed in an AIF context for perception and policy planning. We then introduce the Expected Free Energy (EFE) as a variational objective that is explicitly designed to yield epistemic behavior in AIF agents, but also note that the EFE definition limits itself to (hierarchical) state-space models. We then introduce the Bethe Free Energy (BFE), and argue that the BFE allows for convenient optimization on free-form models by message passing, but note that the BFE lacks information-seeking qualities. We conclude this section by introducing the Constrained Bethe Free Energy (CBFE), which equips the BFE with information-seeking qualities on free-form models through additional constraints on the variational distribution. Table 1 summarizes notational conventions throughout the paper.

TABLE 1

Summary of notational conventions throughout the paper.

Notation	Def	Explanation
$\mathbf{s}$		Collection of (arbitrary) model variables
$s_j$		Individual model variable with index $j \in \mathcal{S}$
$f(\mathbf{s})$	(Eq. 1)	Factorized model of variables $\mathbf{s}$
$f_a(\mathbf{s}_a)$	(Eq. 1)	Factor (conditional or prior probability distribution) with argument variables $\mathbf{s}_a$ and index $a \in \mathcal{F}$
$q(\mathbf{s})$	(Eq. 2)	Variational distribution of (latent) variables $\mathbf{s}$
$U_{q(\mathbf{s})}[f(\mathbf{s})]$	(Eq. 2)	Average energy
$H[q(\mathbf{s})]$	(Eq. 2)	Entropy
$F[q]$	(Eq. 2)	Variational Free Energy
$y_k, x_k, u_k$		Observation, state and control variable (at time $k$) respectively
$\hat{y}_k$		Specific realization for observation or unobserved future (predicted) outcome
$\hat{u}_k$		Specific control realization
$p(y_k, x_k \mid x_{k-1}, u_k)$	(Eq. 7)	Generative model engine
$p(y_k \mid x_k)$	(Eq. 7)	Observation model
$p(x_k \mid x_{k-1}, u_k)$	(Eq. 7)	Transition model
$p(x_{t-1})$	(Eq. 8)	State prior
$\mathbf{y}, \mathbf{x}, \mathbf{u}$		Sequence of future observation variables $y_{t:t+T-1}$, state variables $x_{t-1:t+T-1}$ and control variables $u_{t:t+T-1}$, respectively
$\hat{\mathbf{u}}_j$		Policy (sequence of specific future controls), $\hat{\mathbf{u}}_j \in \mathcal{C}$, where index $j$ is usually omitted
$F^*(\hat{\mathbf{u}}_j)$	(Eq. 12)	Optimized Variational Free Energy for policy $\hat{\mathbf{u}}_j$
$\hat{\mathbf{u}}^*$	(Eq. 13)	Optimal policy $\hat{\mathbf{u}}^* \in \mathcal{C}$
$G[q; \hat{\mathbf{u}}_j]$	(Eq. 14)	Expected Free Energy (EFE)
$p(\mathbf{y} \mid \mathbf{x})$	(Eq. 16a)	Aggregate observation model
$p(\mathbf{x} \mid \mathbf{u})$	(Eq. 16b)	Aggregate state transition model, including state prior
$\tilde{p}(\mathbf{y})$	(Eq. 18)	Goal prior for sequence of future observation variables
$\tilde{p}(y_k)$	(Eq. 18)	Goal prior for observation variable at time $k$
$f(\mathbf{y}, \mathbf{x} \mid \mathbf{u})$	(Eq. 19), (Eq. 25)	Factorized model of future variables (at time $t$), for EFE and (C)BFE respectively
$B[q]$	(Eq. 23)	Bethe Free Energy (BFE)
$H_B[q]$	(Eq. 24)	Bethe entropy
$B[q; \hat{\mathbf{u}}_j]$	(Eq. 26)	Bethe Free Energy of future model under policy $\hat{\mathbf{u}}_j$
$B[q; \hat{\mathbf{u}}_j, \hat{\mathbf{y}}]$	(Eq. 28)	Constrained Bethe Free Energy (CBFE) of future model under policy $\hat{\mathbf{u}}_j$ and predicted outcomes $\hat{\mathbf{y}}$


2.1 Variational Free Energy

The Variational Free Energy (VFE) is a principled metric in physics, where a time-integral over free energy is known as the action functional. Many natural laws can be derived from the principle of least action, where the action functional is minimized with the use of variational calculus (Caticha, 2012; Lanczos, 2012). The VFE is defined with respect to a factorized generative model (GM). We consider a GM $f(\mathbf{s})$ with factors $f_a$, $a \in \mathcal{F}$, and variables $s_j$, $j \in \mathcal{S}$, that factorizes according to

$$f(\mathbf{s}) = \prod_{a \in \mathcal{F}} f_a(\mathbf{s}_a), \tag{1}$$

where $\mathbf{s}_a$ collects the argument variables of the factor $f_a$. As a notational convention, we write collections and sequences in bold script. In the model factorization of (Eq. 1), the factors $f_a$ correspond with the prior and conditional probability distributions that define the GM. The VFE is then defined as a functional of an (approximate) posterior $q(\mathbf{s})$ over latent variables, as

$$F[q] = \sum_{\mathbf{s}} q(\mathbf{s}) \log \frac{q(\mathbf{s})}{f(\mathbf{s})} = U_{q(\mathbf{s})}[f(\mathbf{s})] - H[q(\mathbf{s})], \tag{2}$$

which consists of an average energy $U_{q(\mathbf{s})}[f(\mathbf{s})] = -\sum_{\mathbf{s}} q(\mathbf{s}) \log f(\mathbf{s})$ and an entropy $H[q(\mathbf{s})] = -\sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s})$. Because the VFE is (usually) optimized with respect to the posterior $q$ with the use of variational calculus (Yedidia et al., 2005), the posterior $q$ is also referred to as the variational distribution. In this paper, we will strictly reserve the $q$ notation for variational distributions. We can relate the exact posterior belief with the model definition through a normalizing constant $Z$, as

$$p(\mathbf{s}) = \frac{f(\mathbf{s})}{Z}, \tag{3}$$

where

$$Z = \sum_{\mathbf{s}} f(\mathbf{s}). \tag{4}$$

Throughout this paper, summation can be replaced by integration in the case of continuous variables. In a Bayesian context, the normalizer $Z$ is commonly referred to as the marginal likelihood or evidence for model $f$. However, exact summation (marginalization) of (Eq. 4) over all variable realizations is often prohibitively difficult in practice, so that the evidence and exact posterior become unobtainable. Substituting (3) in the VFE (2) expresses the VFE as an upper bound on the surprise, that is the negative log-model evidence, as

$$F[q] = \mathrm{KL}\big[q(\mathbf{s}) \,\|\, p(\mathbf{s})\big] - \log Z \;\geq\; -\log Z. \tag{5}$$

The marginalization problem of (Eq. 4) is thus converted to an optimization problem over $q$. After optimization, the VFE approximates the surprise, and the optimal variational distribution becomes an approximation to the true posterior, $p(\mathbf{s}) \approx q^*(\mathbf{s})$. Crucially, we are free to choose constraints on $q$ such that the optimization becomes practically feasible, at the cost of an increased posterior divergence. One such approximation is the Bethe assumption, as we will see in Section 2.5.
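The bound on the surprise can be checked numerically on a toy model. The following is a minimal sketch, assuming a hypothetical unnormalized model over a single three-valued variable (the numbers in `f` are illustrative and not taken from the paper): it evaluates the VFE of (Eq. 2) for a uniform variational distribution and for the exact posterior, and compares both with the negative log-evidence.

```python
import math

# Hypothetical unnormalized model f(s) over one three-valued variable (illustrative values).
f = [0.2, 0.1, 0.1]

def free_energy(q, f):
    """VFE of Eq. (2): sum_s q(s) log(q(s) / f(s)), with the convention 0 log 0 = 0."""
    return sum(qs * math.log(qs / fs) for qs, fs in zip(q, f) if qs > 0)

Z = sum(f)                       # evidence, Eq. (4)
p = [fs / Z for fs in f]         # exact posterior, Eq. (3)
surprise = -math.log(Z)          # negative log-evidence

q_uniform = [1 / 3, 1 / 3, 1 / 3]
vfe_uniform = free_energy(q_uniform, f)   # strictly above the surprise
vfe_exact = free_energy(p, f)             # coincides with the surprise

print(vfe_uniform >= surprise, abs(vfe_exact - surprise) < 1e-12)
```

Any other normalized `q` can be substituted for `q_uniform`; the gap to `surprise` is the posterior divergence that the text refers to.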

2.2 Inference for Perception

We now return to the model definition of (Eq. 1). In the context of AIF, a Generative Model (GM) comprises a probability distribution over states $x_k$, observations $y_k$ and controls $u_k$, at each time index $k$. We will use a hat to indicate specific variable realizations, i.e. $\hat{y}_k$ for a specific outcome and $\hat{u}_k$ for a specific control at time $k$. As a notational convention, we use $k$ as an arbitrary time index, often used in the context of iterations, and we use $t$ to indicate the current simulation time index. We define a state-space model (Koller and Friedman, 2009) for the generative model engine, which represents our belief about how observations follow from a given control and previous state, as

$$p(y_k, x_k \mid x_{k-1}, u_k) = p(y_k \mid x_k)\, p(x_k \mid x_{k-1}, u_k). \tag{7}$$

We use a prior belief about past states $p(x_{t-1})$ together with the generative model engine (Eq. 7) to define a generative model for perception

$$f(y_t, x_t, x_{t-1} \mid u_t) = p(x_{t-1})\, p(y_t, x_t \mid x_{t-1}, u_t), \tag{8}$$

which, after substitution in (Eq. 2), results in the VFE objective for perception,

$$F[q; \hat{y}_t, \hat{u}_t] = \sum_{x_{t-1}} \sum_{x_t} q(x_{t-1}, x_t) \log \frac{q(x_{t-1}, x_t)}{f(\hat{y}_t, x_t, x_{t-1} \mid \hat{u}_t)}.$$

At each time $t$, the process of perception then relates to inferring the optimal variational distribution $q^*(x_{t-1}, x_t)$ about latent states, given the current action $\hat{u}_t$ and resulting outcome $\hat{y}_t$. The resulting variational distribution can then be used as a prior for the next time-step, such that the optimized marginal $q^*(x_t)$ takes the role of the state prior at time $t + 1$.
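For a small discrete model, the perceptual update can be written out directly. The sketch below assumes that the variational distribution is left unconstrained, so that the optimum coincides with the exact Bayesian posterior over the current state; the function name `perceive` and all numbers are hypothetical and only serve as illustration.

```python
def perceive(state_prior, transition, likelihood, y_hat):
    """One perception step for a discrete state-space model under a fixed control.

    state_prior[i]   : prior belief p(x_{t-1} = i)
    transition[i][j] : p(x_t = j | x_{t-1} = i, u_t) for the executed control
    likelihood[j][y] : p(y_t = y | x_t = j)
    y_hat            : index of the observed outcome
    Returns the posterior marginal over x_t, reusable as the next state prior.
    """
    n = len(state_prior)
    # Predict: marginalize the previous state.
    predicted = [sum(state_prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weigh by the likelihood of the observed outcome and normalize.
    unnormalized = [predicted[j] * likelihood[j][y_hat] for j in range(n)]
    evidence = sum(unnormalized)
    return [value / evidence for value in unnormalized]

# Illustrative two-state example.
state_prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
likelihood = [[0.8, 0.2], [0.3, 0.7]]
posterior = perceive(state_prior, transition, likelihood, y_hat=0)
print(posterior)
```

Feeding `posterior` back in as `state_prior` at the next time-step mirrors the recursion described above.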

2.3 Inference for Planning

At each time $t$, planning is concerned with selecting optimal future controls by minimizing a Free Energy (FE) objective that is defined with respect to future variables. We write $\mathbf{y} = y_{t:t+T-1}$, $\mathbf{x} = x_{t-1:t+T-1}$, and $\mathbf{u} = u_{t:t+T-1}$ as the sequences of future observations, states and controls respectively, for a fixed-time horizon of $T$ time-steps ahead. We will refer to a specific future control sequence $\hat{\mathbf{u}}$ as a policy. The optimal policy $\hat{\mathbf{u}}^*$ is then referred to as the active policy, where (local) optimality is indicated by an asterisk. Inference for planning then aims to select the optimal policy (in terms of FE) from a collection of candidate policies $\hat{\mathbf{u}}_j \in \mathcal{C}$, where $\mathcal{C}$ represents the finite set of all (user-provided) candidate policies, and $j$ the selected policy index. When we view the candidate policy as a model selection variable, the problem of policy selection becomes equivalent to the problem of Bayesian model selection, where we wish to find a probabilistic model with the highest posterior probability among some given candidate models. When there is no prior preference about models, the optimal model is the one with the highest marginal likelihood (evidence). Given a model $f(\mathbf{y}, \mathbf{x} \mid \mathbf{u})$ of future observations and states given a future control sequence, we can express the marginal likelihood (evidence) for a specific policy choice, as

$$Z(\hat{\mathbf{u}}_j) = \sum_{\mathbf{y}} \sum_{\mathbf{x}} f(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}}_j).$$

Using (Eq. 3), we can then relate the exact posterior belief $p(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}}_j) = f(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}}_j) / Z(\hat{\mathbf{u}}_j)$ with the variational distribution and the policy evidence, as

$$F[q; \hat{\mathbf{u}}_j] = \mathrm{KL}\big[q(\mathbf{y}, \mathbf{x}) \,\|\, p(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}}_j)\big] - \log Z(\hat{\mathbf{u}}_j).$$

Under optimization of $q$, the minimal VFE then approximates the surprise (Eq. 5), as

$$F^*(\hat{\mathbf{u}}_j) = \min_q F[q; \hat{\mathbf{u}}_j] \approx -\log Z(\hat{\mathbf{u}}_j). \tag{12}$$

The optimal policy then minimizes the optimized VFE, as

$$\hat{\mathbf{u}}^* = \arg\min_{\hat{\mathbf{u}}_j \in \mathcal{C}} F^*(\hat{\mathbf{u}}_j). \tag{13}$$

In the following, we omit the explicit indexing of the policy on $j$ for notational convenience, and simply write $\hat{\mathbf{u}}$ to represent a specific policy choice. Because the VFE (2) involves expectations over the full joint variational distribution, it may become prohibitively expensive to compute for larger models. Therefore, additional assumptions and constraints on the VFE are often required. As a result, multiple free energy objectives for policy planning have been proposed in the literature, e.g., the Expected Free Energy (EFE) (Friston et al., 2015; Friston et al., 2021), the Free Energy of the Expected Future (Millidge et al., 2020b), the Generalized Free Energy (Parr and Friston, 2019), the Predicted (Bethe) Free Energy (Schwöbel et al., 2018), and marginal approximations (Parr et al., 2019).
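Policy selection as model selection can be illustrated with exact summation on a toy problem. The following is a minimal sketch, assuming a one-step horizon, a model that multiplies a policy-dependent state belief, an observation model and a goal prior, and a problem small enough that the evidence can be summed exactly (so that the optimized VFE equals the surprise). The policy names and all probabilities are hypothetical.

```python
import math

def policy_surprise(p_x_given_u, likelihood, goal_prior):
    """Negative log-evidence of a one-step future model for a single policy."""
    evidence = sum(
        p_x * likelihood[x][y] * goal_prior[y]
        for x, p_x in enumerate(p_x_given_u)
        for y in range(len(goal_prior))
    )
    return -math.log(evidence)

# Hypothetical candidate policies and the state beliefs they induce.
state_beliefs = {"stay": [0.9, 0.1], "move": [0.2, 0.8]}
likelihood = [[0.9, 0.1], [0.1, 0.9]]   # p(y | x)
goal_prior = [0.1, 0.9]                 # preference for outcome 1

surprises = {u: policy_surprise(p_x, likelihood, goal_prior)
             for u, p_x in state_beliefs.items()}
best_policy = min(surprises, key=surprises.get)
print(best_policy, surprises)
```

The policy with the lowest surprise is the one that makes the preferred outcome most probable; for larger models the exact sum is replaced by one of the approximate objectives listed above.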

2.4 Expected Free Energy

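The Expected Free Energy (EFE) is an FE objective for planning that is explicitly constructed to elicit information-seeking behavior (Friston et al., 2015). Because future observations are (by definition) unknown, the EFE is defined in terms of an expectation that includes observation variables, as

$$G[q; \hat{\mathbf{u}}] = \sum_{\mathbf{y}} \sum_{\mathbf{x}} q(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}}) \log \frac{q(\mathbf{x} \mid \hat{\mathbf{u}})}{f(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}})}. \tag{14}$$

Construction of the (Markovian) model for the EFE starts by stringing together a state prior with the generative model engine of (Eq. 7) for future times, as

$$p(\mathbf{y}, \mathbf{x} \mid \mathbf{u}) = p(x_{t-1}) \prod_{k=t}^{t+T-1} p(y_k, x_k \mid x_{k-1}, u_k), \tag{15}$$

where the state prior $p(x_{t-1})$ follows from the perceptual process (Section 2.2). For notational convenience, we often group the observation and state transition models (including the state prior), according to

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{k=t}^{t+T-1} p(y_k \mid x_k), \tag{16a}$$

$$p(\mathbf{x} \mid \mathbf{u}) = p(x_{t-1}) \prod_{k=t}^{t+T-1} p(x_k \mid x_{k-1}, u_k). \tag{16b}$$

From the future generative model engine (Eq. 15), the EFE defines a state posterior

$$p(\mathbf{x} \mid \mathbf{y}, \mathbf{u}) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x} \mid \mathbf{u})}{\sum_{\mathbf{x}} p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x} \mid \mathbf{u})}. \tag{17}$$

Note that our notation differs from (Friston et al., 2015), where posterior distributions are denoted by $q$. We strictly reserve the $q$ notation for variational distributions. We introduce goal priors $\tilde{p}(y_k)$ over observation variables. Goal priors encode prior beliefs about desired observations (or states) (Friston et al., 2013), and are annotated with a tilde to avoid confusion with the marginal distribution over the same variable. We then introduce a shorthand notation that aggregates (independent) goal priors for the future generative model, as

$$\tilde{p}(\mathbf{y}) = \prod_{k=t}^{t+T-1} \tilde{p}(y_k). \tag{18}$$

Together with the aggregated goal prior, the factorized model for the EFE is then constructed as

$$f(\mathbf{y}, \mathbf{x} \mid \mathbf{u}) = p(\mathbf{x} \mid \mathbf{y}, \mathbf{u})\, \tilde{p}(\mathbf{y}). \tag{19}$$

There are several things of note about the model of (Eq. 19):

• The model includes variables that pertain to future time-points, $t \leq k \leq t + T - 1$. As a result, the future observation variables are latent;
• The model includes a state prior that is a result of inference for perception;
• The (informative) goal priors introduce a bias in the model towards desired outcomes;
• Candidate policies will be given, as indicated by a conditioning on controls.

Upon substitution of (Eq. 19), the EFE (Eq. 14) factorizes into an epistemic and an extrinsic value term (Friston et al., 2015), as

$$G[q; \hat{\mathbf{u}}] = -\underbrace{\mathbb{E}_{q(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}})}\!\left[\log \frac{p(\mathbf{x} \mid \mathbf{y}, \hat{\mathbf{u}})}{q(\mathbf{x} \mid \hat{\mathbf{u}})}\right]}_{\text{epistemic value}} - \underbrace{\mathbb{E}_{q(\mathbf{y} \mid \hat{\mathbf{u}})}\big[\log \tilde{p}(\mathbf{y})\big]}_{\text{extrinsic value}}, \tag{20}$$

where the epistemic value relates to a mutual information between states and observations. This decomposition is often used to motivate the epistemic qualities of the EFE. An alternative decomposition, in terms of ambiguity and observation risk, can be obtained under the assumptions $q(\mathbf{y} \mid \mathbf{x}) \approx p(\mathbf{y} \mid \mathbf{x})$ (approximation of the observation model), and $q(\mathbf{x} \mid \mathbf{y}) \approx p(\mathbf{x} \mid \mathbf{y}, \hat{\mathbf{u}})$ (approximation of the exact posterior). These assumptions allow us to rewrite the exact relationship $q(\mathbf{y}, \mathbf{x}) = q(\mathbf{y} \mid \mathbf{x})\, q(\mathbf{x}) = q(\mathbf{x} \mid \mathbf{y})\, q(\mathbf{y})$ in terms of the approximations $q(\mathbf{y}, \mathbf{x}) \approx p(\mathbf{y} \mid \mathbf{x})\, q(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y}, \hat{\mathbf{u}})\, q(\mathbf{y})$. As a result, we obtain

$$G[q; \hat{\mathbf{u}}] \approx \underbrace{\mathbb{E}_{q(\mathbf{x})}\big[H[p(\mathbf{y} \mid \mathbf{x})]\big]}_{\text{ambiguity}} + \underbrace{\mathrm{KL}\big[q(\mathbf{y}) \,\|\, \tilde{p}(\mathbf{y})\big]}_{\text{observation risk}}, \tag{21}$$

where $q(\mathbf{x})$ and $q(\mathbf{y})$ on the r.h.s. are implicitly conditioned on $\hat{\mathbf{u}}$. This decomposition is often used to motivate the explorative (ambiguity reducing) qualities of the EFE. In the current paper we evaluate the EFE in accordance with (Friston et al., 2015; Da Costa et al., 2020a), for which the procedure is detailed in Supplementary Appendix SA. Although the EFE leads to epistemic behavior, it does not fit the general functional form of the VFE (Eq. 2), where the expectation and numerator define the same variational distribution. As a result, EFE minimization by message passing requires custom definitions, and limits itself to (hierarchical) state-space models. Furthermore, note that the EFE involves the state posterior $p(\mathbf{x} \mid \mathbf{y}, \hat{\mathbf{u}})$ as part of its definition, which is technically a quantity that needs to be inferred. The EFE thus conflates the definition of the planning objective with the inference procedure for planning itself.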
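The two EFE decompositions can be compared numerically for a single future time-step. This is a minimal sketch under stated assumptions: the joint is built as the observation model times the state belief, and the state posterior is computed exactly from that joint, in which case the epistemic/extrinsic form and the ambiguity/risk form coincide. The function name `efe_terms` and the numbers are hypothetical.

```python
import math

def efe_terms(q_x, likelihood, goal_prior):
    """Return (epistemic, extrinsic, ambiguity, risk) for one future time-step.

    q_x[x]           : state belief under the policy
    likelihood[x][y] : observation model p(y | x)
    goal_prior[y]    : goal prior over outcomes (all entries assumed positive)
    """
    n_x, n_y = len(q_x), len(goal_prior)
    # Joint q(y, x) = p(y | x) q(x) and its outcome marginal q(y).
    q_yx = [[likelihood[x][y] * q_x[x] for y in range(n_y)] for x in range(n_x)]
    q_y = [sum(q_yx[x][y] for x in range(n_x)) for y in range(n_y)]
    pairs = [(x, y) for x in range(n_x) for y in range(n_y) if q_yx[x][y] > 0]
    # Epistemic value: mutual information, using p(x | y) / q(x) = q(y, x) / (q(y) q(x)).
    epistemic = sum(q_yx[x][y] * math.log(q_yx[x][y] / (q_y[y] * q_x[x])) for x, y in pairs)
    extrinsic = sum(q * math.log(g) for q, g in zip(q_y, goal_prior) if q > 0)
    ambiguity = -sum(q_yx[x][y] * math.log(likelihood[x][y]) for x, y in pairs)
    risk = sum(q * math.log(q / g) for q, g in zip(q_y, goal_prior) if q > 0)
    return epistemic, extrinsic, ambiguity, risk

epistemic, extrinsic, ambiguity, risk = efe_terms(
    q_x=[0.5, 0.5], likelihood=[[0.9, 0.1], [0.2, 0.8]], goal_prior=[0.3, 0.7])
efe_value_form = -epistemic - extrinsic
efe_risk_form = ambiguity + risk
print(efe_value_form, efe_risk_form)
```

When the posterior is only approximated, the two forms differ, which is why the second decomposition is stated as an approximation in the text.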

2.5 Bethe Free Energy

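The Bethe Free Energy (BFE) defines a variational distribution that factorizes according to the Bethe assumption

$$q(\mathbf{s}) = \prod_{a \in \mathcal{F}} q_a(\mathbf{s}_a) \prod_{j \in \mathcal{S}} q_j(s_j)^{1 - d_j}, \tag{22}$$

where the degree $d_j$ counts how many $q_a$'s contain $s_j$ as an argument. After substituting the Bethe assumption (Eq. 22) in the VFE (Eq. 2), we obtain the BFE,

$$B[q] = \sum_{a \in \mathcal{F}} U_{q_a(\mathbf{s}_a)}[f_a(\mathbf{s}_a)] - \sum_{a \in \mathcal{F}} H[q_a(\mathbf{s}_a)] + \sum_{j \in \mathcal{S}} (d_j - 1)\, H[q_j(s_j)], \tag{23}$$

as a special case of the VFE. The entropy contributions are often summarized in the Bethe entropy, as

$$H_B[q] = \sum_{a \in \mathcal{F}} H[q_a(\mathbf{s}_a)] - \sum_{j \in \mathcal{S}} (d_j - 1)\, H[q_j(s_j)]. \tag{24}$$

Because the BFE fully factorizes into local contributions in $q_a$ and $q_j$, it can be optimized by message passing on the generative model (Pearl, 1988; Yedidia et al., 2005; Wainwright and Jordan, 2008; Şenöz et al., 2021). In the context of AIF, the BFE for a model over future states is also referred to as the Predicted Free Energy (Schwöbel et al., 2018). For a fixed time-horizon $T$, the factorized model for future states is constructed from the generative model engine and goal prior, as

$$f(\mathbf{y}, \mathbf{x} \mid \mathbf{u}) = p(\mathbf{y}, \mathbf{x} \mid \mathbf{u})\, \tilde{p}(\mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x} \mid \mathbf{u})\, \tilde{p}(\mathbf{y}). \tag{25}$$

Because the generative model engine and goal priors introduce a simultaneous constraint on future observations, the model of (Eq. 25) represents a scaled probability distribution. The BFE of the future model under policy $\hat{\mathbf{u}}$ then becomes

$$B[q; \hat{\mathbf{u}}] = \sum_{\mathbf{y}} \sum_{\mathbf{x}} q(\mathbf{y}, \mathbf{x}) \log \frac{q(\mathbf{y}, \mathbf{x})}{f(\mathbf{y}, \mathbf{x} \mid \hat{\mathbf{u}})}, \tag{26}$$

where $q(\mathbf{y}, \mathbf{x})$ factorizes according to (Eq. 22). A major advantage of the BFE over the EFE as an objective for AIF is that message passing implementations can be automatically derived on free-form models, thus greatly enhancing model flexibility. A drawback of the BFE, however, is that it lacks the epistemic qualities of the EFE (Schwöbel et al., 2018), see also Section 4.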
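The Bethe entropy is exact on tree-structured models, which can be verified on a three-variable Markov chain. The sketch below assumes a hypothetical chain with two pairwise factor beliefs, where only the middle variable is shared between factors (degree 2); it compares the Bethe entropy, built from factor and variable marginals, with the entropy of the full joint. All probabilities are illustrative.

```python
import math
from itertools import product

def entropy(probabilities):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Hypothetical binary Markov chain s1 -> s2 -> s3.
p1 = [0.6, 0.4]
p2_given_1 = [[0.7, 0.3], [0.2, 0.8]]
p3_given_2 = [[0.9, 0.1], [0.5, 0.5]]

joint = {(a, b, c): p1[a] * p2_given_1[a][b] * p3_given_2[b][c]
         for a, b, c in product(range(2), repeat=3)}

# Factor beliefs q_a(s1, s2), q_b(s2, s3) and the variable belief q(s2).
q_a = {(a, b): sum(joint[a, b, c] for c in range(2)) for a, b in product(range(2), repeat=2)}
q_b = {(b, c): sum(joint[a, b, c] for a in range(2)) for b, c in product(range(2), repeat=2)}
q_2 = [sum(q_a[a, b] for a in range(2)) for b in range(2)]

degree_2 = 2  # s2 appears in both factor beliefs; s1 and s3 have degree 1 and drop out
bethe_entropy = entropy(q_a.values()) + entropy(q_b.values()) - (degree_2 - 1) * entropy(q_2)
exact_entropy = entropy(joint.values())
print(bethe_entropy, exact_entropy)
```

On graphs with cycles the two quantities generally differ, and the Bethe entropy becomes an approximation.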

2.6 Constrained Bethe Free Energy

The Constrained Bethe Free Energy (CBFE) that we propose in this paper combines the epistemic qualities of the EFE with the computational ease and model flexibility of the BFE. The CBFE can be derived from the BFE by imposing additional constraints on the variational distribution

$$q(\mathbf{y}, \mathbf{x}) = q(\mathbf{x}) \prod_{k=t}^{t+T-1} \delta(y_k - \hat{y}_k), \tag{27}$$

where $\delta(\cdot)$ defines the appropriate (Kronecker or Dirac) delta function for the domain of the observation variable $y_k$ (discrete or continuous). The point-mass (delta) constraints of the CBFE are motivated by the following key insight: although the future is unknown, we know that we will observe something in the future. However, because future outcomes are by definition unobserved, the $\hat{y}_k$ encode potential outcomes that need to be optimized for. For the model of (Eq. 25), the CBFE then becomes

$$B[q; \hat{\mathbf{u}}, \hat{\mathbf{y}}] = \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{f(\hat{\mathbf{y}}, \mathbf{x} \mid \hat{\mathbf{u}})}. \tag{28}$$

The current paper investigates how point-mass constraints of the form (Eq. 27) affect epistemic behavior in AIF agents.
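For a small discrete model, the effect of the point-mass constraint can be explored by brute force. The following is a minimal sketch, assuming a one-step model that multiplies an observation model, a policy-dependent state belief and a goal prior; it replaces message passing with plain enumeration over candidate outcomes, which is a simplification and not the optimization scheme used in the paper. Function names and numbers are hypothetical.

```python
import math

def cbfe(q_x, p_x, likelihood, goal_prior, y_hat):
    """Constrained free energy sum_x q(x) log(q(x) / f(y_hat, x | u)) for a one-step model."""
    return sum(
        q * math.log(q / (likelihood[x][y_hat] * p_x[x] * goal_prior[y_hat]))
        for x, q in enumerate(q_x) if q > 0
    )

def optimize_cbfe(p_x, likelihood, goal_prior):
    """Enumerate candidate outcomes; for each, use the optimal q(x), proportional to f(y_hat, x | u)."""
    best = None
    for y_hat in range(len(goal_prior)):
        unnormalized = [likelihood[x][y_hat] * p_x[x] * goal_prior[y_hat] for x in range(len(p_x))]
        total = sum(unnormalized)
        if total == 0:
            continue  # outcome is impossible under the model
        q_star = [value / total for value in unnormalized]
        value = cbfe(q_star, p_x, likelihood, goal_prior, y_hat)
        if best is None or value < best[0]:
            best = (value, y_hat, q_star)
    return best

value, y_hat, q_star = optimize_cbfe(
    p_x=[0.5, 0.5], likelihood=[[0.9, 0.1], [0.2, 0.8]], goal_prior=[0.5, 0.5])
print(value, y_hat, q_star)
```

With a flat goal prior, the selected outcome is the one that the model predicts with the highest probability, and the optimized value equals its negative log-probability under the scaled model.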

3 Methods

To minimize the (Constrained) Bethe Free Energy, the current paper uses message passing on a Forney-style factor graph (FFG) representation (Forney, 2001) of the factorized model (Eq. 1). In an FFG, edges represent variables and nodes represent the functional relationships between variables (i.e. the prior and conditional probabilities). Especially in a signal processing and control context, the FFG paradigm leads to convenient message passing formulations (Korl, 2005; Loeliger et al., 2007). Namely, inference can be described in terms of messages that summarize and propagate information across the FFG. The BFE is well-known for being the fundamental objective of the celebrated sum-product algorithm (Pearl, 1988), which has been formulated in terms of message passing on FFGs (Loeliger, 2004). Extensions of the sum-product algorithm to hybrid formulations, such as variational message passing (VMP) (Dauwels, 2007) and expectation maximization (EM) (Dauwels et al., 2005) have also been formulated as message passing on FFGs. More recently, more general hybrid algorithms have been described in terms of message passing, see e.g. (Zhang et al., 2017; van de Laar et al., 2021). A comprehensive overview is provided in (Şenöz et al., 2021), where additional constraints, including point-mass constraints, are imposed on the BFE and optimized for by message passing on FFGs.

3.1 Forney-Style Factor Graph Example

Let us consider an example model (1) that factorizes according to (Eq. 29). The FFG representation of (Eq. 29) is depicted in Figure 1A.

FIGURE 1

Forney-style factor graph representation for the example model of (Eq. 29) (A) and message passing schedule for the Bethe Free Energy minimization of (Eq. 30) (B). Shaded messages indicate variational message updates, and the solid square node indicates given (clamped) values. The round node indicates a point-mass constraint for which the value is optimized.

Now suppose we observe $s_4$ and introduce a point-mass constraint on $s_3$. The variational distribution then factorizes as in (Eq. 30), where $s_4$ is excluded from the variational distribution because it is observed and therefore no longer a latent variable. Substituting (Eq. 29) and (Eq. 30) in (Eq. 28) yields the CBFE of (Eq. 31), where we directly substituted the observed value $\hat{s}_4$ into the factorized model. In this paper we adhere to the notation in (Şenöz et al., 2021), and indicate point-mass constraints by an unshaded round node with an annotated δ on the corresponding edge of the FFG. A solid square node indicates a given value (e.g., an action, observed outcome or given parameter), whereas an unshaded round node indicates a point-mass constraint that is optimized for (e.g., a potential outcome). Unshaded messages indicate sum-product messages (Loeliger, 2004) and shaded messages indicate variational messages, as scheduled and computed in accordance with (Dauwels, 2007). The ForneyLab probabilistic programming toolbox (Cox et al., 2019) implements an automated message passing scheduler and a lookup table of pre-derived message updates (Korl, 2005; van de Laar, 2019, Supplementary Appendix SA). Variational optimization of (Eq. 31) then yields the (iterative) message passing schedule of Figure 1B, where $\hat{s}_3$ is initialized. After computation of the messages, the mode of the product between message ❻ and ⑦ becomes the new value for $\hat{s}_3$, and the schedule is repeated until convergence. The resulting optimization procedure then resembles an expectation maximization (EM) scheme where ❻ acts as a likelihood and ⑦ as a prior (Dauwels et al., 2005), and where upon each iteration the value $\hat{s}_3$ is updated with the maximum a-posteriori (MAP) estimate.

4 Value Decompositions

In this section we further investigate the drivers for behavior of the (C)BFE. We assume that all variational distributions factorize according to the Bethe assumption (Eq. 22).

4.1 Confidence and Complexity

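We substitute the model of (Eq. 25) in the CBFE definition of (Eq. 28) and combine to identify three terms, as

$$B[q; \hat{\mathbf{u}}, \hat{\mathbf{y}}] = \underbrace{\mathrm{KL}\big[q(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \hat{\mathbf{u}})\big]}_{\text{complexity}} - \underbrace{\mathbb{E}_{q(\mathbf{x})}\big[\log p(\hat{\mathbf{y}} \mid \mathbf{x})\big]}_{\text{confidence}} - \underbrace{\log \tilde{p}(\hat{\mathbf{y}})}_{\text{extrinsic value}}. \tag{32}$$

Table 2 is provided as an overview, and summarizes the properties of the individual terms of (Eq. 32) under optimization.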

TABLE 2

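Optimize	Fix	Vary $\hat{\mathbf{y}}$	Vary $\hat{\mathbf{u}}$	Vary $q(\mathbf{x})$
max extrinsic value		$\hat{\mathbf{y}}$ that maximizes $\tilde{p}(\hat{\mathbf{y}})$		
min complexity	$q(\mathbf{x})$		$\hat{\mathbf{u}} \in \mathcal{C}$ that renders $p(\mathbf{x} \mid \hat{\mathbf{u}})$ closest to $q(\mathbf{x})$	
min complexity	$\hat{\mathbf{u}}$			$q(\mathbf{x}) = p(\mathbf{x} \mid \hat{\mathbf{u}})$
max confidence	$q(\mathbf{x})$	$\hat{\mathbf{y}}$ that maximizes expected outcomes, $\mathbb{E}_{q(\mathbf{x})}[\log p(\hat{\mathbf{y}} \mid \mathbf{x})]$		
max confidence	$\hat{\mathbf{y}}$			$q(\mathbf{x}) = \delta(\mathbf{x} - \hat{\mathbf{x}})$ where $\hat{\mathbf{x}}$ maximizes the likelihood $p(\hat{\mathbf{y}} \mid \mathbf{x})$
max intrinsic value	$\hat{\mathbf{u}}$	$\hat{\mathbf{y}}$ that maximizes the evidence $p(\hat{\mathbf{y}} \mid \hat{\mathbf{u}})$		
max intrinsic value	$\hat{\mathbf{y}}$		$\hat{\mathbf{u}} \in \mathcal{C}$ that renders $p(\mathbf{y} \mid \hat{\mathbf{u}})$ closest to $\delta(\mathbf{y} - \hat{\mathbf{y}})$	
min posterior divergence	$q(\mathbf{x})$, $\hat{\mathbf{u}}$	$\hat{\mathbf{y}}$ that renders $p(\mathbf{x} \mid \hat{\mathbf{y}}, \hat{\mathbf{u}})$ closest to $q(\mathbf{x})$		
min posterior divergence	$q(\mathbf{x})$, $\hat{\mathbf{y}}$		$\hat{\mathbf{u}} \in \mathcal{C}$ that renders $p(\mathbf{x} \mid \hat{\mathbf{y}}, \hat{\mathbf{u}})$ closest to $q(\mathbf{x})$	
min posterior divergence	$\hat{\mathbf{y}}$, $\hat{\mathbf{u}}$			$q(\mathbf{x}) = p(\mathbf{x} \mid \hat{\mathbf{y}}, \hat{\mathbf{u}})$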

Optima for the individual terms of the CBFE decompositions (Eq. 32, Eq. 34). Each row varies one quantity (variable or function) in its respective term while other quantities remain fixed. Shaded cells indicate that the term (row) is not a function(al) of that specific optimization quantity (column).

The extrinsic value induces a preference for extrinsically rewarding future outcomes. Minimizing complexity prefers policies that induce transitions that are in line with state beliefs, and (vice versa) prefers state beliefs that remain close to the policy-induced state transitions. The confidence expresses the expected difference (divergence) in information between the outcomes as predicted by the observation model and the most likely (expected) outcome. In other words, this term quantifies the information difference between predictions and absolute certainty about outcomes. While the negative confidence term could be interpreted as an ambiguity (deviation from certainty), we choose this alternative terminology to prevent confusion with the ambiguity as defined in (21). Specifically, the ambiguity (Eq. 21) and negative confidence (Eq. 32) are both of the form $-\mathbb{E}_{q(\mathbf{y}, \mathbf{x})}[\log p(\mathbf{y} \mid \mathbf{x})]$. However, where the ambiguity approximates $q(\mathbf{y}, \mathbf{x}) \approx p(\mathbf{y} \mid \mathbf{x})\, q(\mathbf{x})$ (Eq. 21), the confidence defines $q(\mathbf{y}, \mathbf{x}) = q(\mathbf{x})\, \delta(\mathbf{y} - \hat{\mathbf{y}})$ (Eq. 27). As a result, the ambiguity explicitly accounts for the full spread of $p(\mathbf{y} \mid \mathbf{x})$, whereas the confidence evaluates $p(\mathbf{y} \mid \mathbf{x})$ at the expected maximum $\hat{\mathbf{y}}$. Maximizing confidence prefers outcomes that are in line with predictions, and simultaneously tries to maximize the precision of state beliefs (Table 2), see also (Friston and Penny, 2011, p. 2093). Note that all terms act in unison: the precision of state beliefs is simultaneously influenced by the complexity, which prevents the collapse of the state belief to a point-mass.

4.2 Intrinsic and Extrinsic Value

A second decomposition of the CBFE objective follows when we rewrite the factorized model of (Eq. 25) using the product rule, as

$$f(\mathbf{y}, \mathbf{x} \mid \mathbf{u}) = p(\mathbf{x} \mid \mathbf{y}, \mathbf{u})\, p(\mathbf{y} \mid \mathbf{u})\, \tilde{p}(\mathbf{y}). \tag{33}$$

Substituting (Eq. 33) in the CBFE definition (Eq. 28) and combining terms then yields

$$B[q; \hat{\mathbf{u}}, \hat{\mathbf{y}}] = \underbrace{\mathrm{KL}\big[q(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \hat{\mathbf{y}}, \hat{\mathbf{u}})\big]}_{\text{posterior divergence}} - \underbrace{\log p(\hat{\mathbf{y}} \mid \hat{\mathbf{u}})}_{\text{intrinsic value}} - \underbrace{\log \tilde{p}(\hat{\mathbf{y}})}_{\text{extrinsic value}}. \tag{34}$$

Table 2 again summarizes the properties of the individual terms of (Eq. 34) under optimization. The second term of (Eq. 34) expresses the difference (divergence) in information between the predicted outcomes and the point-mass constrained (expected) outcome. In contrast to the extrinsic value (third term), this term quantifies a (negative) intrinsic value that purely depends upon the agent's intrinsic beliefs about the environment (state prior and generative model engine (Eq. 15)). Under optimization (Table 2), this term prefers policies that lead to precise predictions for the outcomes. The posterior divergence (first term) is always non-negative and will diminish under optimization, which allows us to combine (Eq. 32) and (Eq. 34) into (Eq. 35). Interestingly, (Eq. 35) tells us that the intrinsic value of (Eq. 34) relates to the model evidence as predicted under the policy and resulting expected outcomes. Inference for planning with the CBFE then attempts to make precise predictions for outcomes by maximizing predicted model evidence. In this view, the CBFE for planning can be considered (quite literally) as self-evidencing (Friston et al., 2010; Hohwy, 2016). As a result of selected actions, environmental outcomes may still be surprising under current generative model assumptions. Inference for perception then subsequently corrects the generative model priors (Section 2.2), and the action-perception loop repeats (see also Algorithm 1). Epistemic qualities then emerge from this continual pursuit of evidence (Hohwy, 2021). In short, we note a distinction in the interpretation of epistemic value between the EFE and the CBFE. In the EFE (Eq. 20), epistemic value is directly related with a mutual information term between states and outcomes. In the CBFE, the epistemic drive appears to result from a self-evidencing mechanism.
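The two value decompositions of the CBFE can be cross-checked numerically. The sketch below assumes a two-state model with a fixed candidate outcome and hypothetical numbers; it evaluates the complexity/confidence form and the posterior-divergence/intrinsic-value form for an arbitrary state belief, and then checks that, once the state belief equals the exact posterior, the intrinsic value equals confidence minus complexity.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical quantities for a fixed policy and a fixed candidate outcome y_hat.
q_x = [0.7, 0.3]            # arbitrary state belief q(x)
p_x = [0.5, 0.5]            # policy-induced state prediction p(x | u)
lik_y_hat = [0.9, 0.2]      # p(y_hat | x) for each state
goal_y_hat = 0.6            # goal prior evaluated at y_hat

evidence = sum(l * p for l, p in zip(lik_y_hat, p_x))            # p(y_hat | u)
posterior = [l * p / evidence for l, p in zip(lik_y_hat, p_x)]   # p(x | y_hat, u)

complexity = kl(q_x, p_x)
confidence = sum(q * math.log(l) for q, l in zip(q_x, lik_y_hat))
extrinsic_value = math.log(goal_y_hat)
intrinsic_value = math.log(evidence)
posterior_divergence = kl(q_x, posterior)

cbfe_first_form = complexity - confidence - extrinsic_value
cbfe_second_form = posterior_divergence - intrinsic_value - extrinsic_value

# At the optimum q(x) = p(x | y_hat, u) the posterior divergence vanishes.
confidence_opt = sum(q * math.log(l) for q, l in zip(posterior, lik_y_hat))
complexity_opt = kl(posterior, p_x)
print(cbfe_first_form, cbfe_second_form, confidence_opt - complexity_opt, intrinsic_value)
```

The final comparison is one way to read the self-evidencing interpretation: with an optimized state belief, maximizing confidence while limiting complexity amounts to maximizing the predicted evidence for the candidate outcome.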

4.3 Bethe Free Energy Value Decomposition

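The BFE does not permit an interpretation in terms of intrinsic value. When we substitute (Eq. 33) in the BFE definition of (Eq. 26) and combine terms, we obtain

$$B[q; \hat{\mathbf{u}}] = \underbrace{\mathbb{E}_{q(\mathbf{y})}\Big[\mathrm{KL}\big[q(\mathbf{x} \mid \mathbf{y}) \,\|\, p(\mathbf{x} \mid \mathbf{y}, \hat{\mathbf{u}})\big]\Big]}_{\text{posterior divergence}} + \underbrace{\mathrm{KL}\big[q(\mathbf{y}) \,\|\, p(\mathbf{y} \mid \hat{\mathbf{u}})\big]}_{\text{predictive divergence}} - \underbrace{\mathbb{E}_{q(\mathbf{y})}\big[\log \tilde{p}(\mathbf{y})\big]}_{\text{extrinsic value}}. \tag{36}$$

The intrinsic value term of (Eq. 34) has been replaced by a predictive divergence in (Eq. 36). This term expresses the difference (divergence) in information between the observations as predicted by the model under policy $\hat{\mathbf{u}}$, and the variational distribution about outcomes $q(\mathbf{y})$. Under optimization of $q(\mathbf{x} \mid \mathbf{y})$, the posterior divergence will vanish for all $\mathbf{y}$. Without the additional point-mass constraint of (Eq. 27), the predictive divergence then no longer quantifies the information difference between uncertainty (predictive distribution) and certainty (predicted outcomes). As a result, the BFE lacks the self-evidencing qualities and resulting epistemic drive of the CBFE, as we will further illustrate in Sections 4.4 and 6.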

4.4 Example Application

We illustrate our interpretation of (Eqs 34, 36) by a minimal example model. We consider a two-armed bandit, where an agent chooses between two levers, $u \in \{0, 1\}$. Each lever offers a distinct probability for observing an outcome $y \in \{0, 1\}$. Specifically, choosing $\hat{u} = 0$ will offer a 0.5 probability for observing each outcome (ignorant policy), whereas choosing $\hat{u} = 1$ will always lead to the same observation (informative policy). We do not equip the agent with any external preference (there is no extrinsic reward). The agent's factorized model is then specified by (Eq. 37), together with $a = (0, 1)$ and a conditional probability matrix. The BFE then follows from substituting this model in (Eq. 26). The CBFE additionally constrains $q(y) = \delta(y - \hat{y})$, and as a result corresponds directly with the negative intrinsic value term of (Eq. 34), $-\log p(\hat{y} \mid \hat{u})$. The FFG for the model definition of (Eq. 37) together with the resulting schedule for optimization of the (C)BFE is drawn in Figure 2.

FIGURE 2

Message passing schedule for the example model of (Eq. 37) for the BFE (A) and CBFE (B). The dashed box summarizes the observation model.

The results of Table 3 show that the BFE does not distinguish between policies. The CBFE however penalizes the ignorant policy $\hat{u} = 0$, which does not predict precise outcomes. This mechanism thus induces a preference for the informative policy $\hat{u} = 1$, which does predict precise outcomes. In the following, we will further investigate this behavior in a less trivial setting.

TABLE 3

Free energies (in bits) per policy for the example application.

Policy	BFE	CBFE
Ignorant (û = 0)	0	1
Informative (û = 1)	0	0
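
The values of Table 3 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions that are not taken from the equations of this paper: (i) the conditional probability matrix `A` holds p(y | u) per column, with the certain outcome of the informative policy arbitrarily placed at y = 1, and (ii) for this two-variable example the minimized BFE reduces to the negative log-evidence of a normalized model, while the minimized CBFE reduces to the surprisal of the most probable predicted outcome.

```python
import numpy as np

# Assumed conditional probability matrix: column u holds p(y | u).
# Column 0 (ignorant policy): both outcomes are equally likely.
# Column 1 (informative policy): one outcome is certain (y = 1 is an assumption).
A = np.array([[0.5, 0.0],
              [0.5, 1.0]])

def bfe_bits(p_y):
    """Unconstrained case: negative log-evidence (in bits) of a predictive distribution."""
    return float(np.log2(1.0 / p_y.sum()))

def cbfe_bits(p_y):
    """Point-mass constrained case: surprisal (in bits) of the best single outcome."""
    return float(np.log2(1.0 / p_y.max()))

for u, name in enumerate(["Ignorant", "Informative"]):
    print(name, bfe_bits(A[:, u]), cbfe_bits(A[:, u]))
```

Under these assumptions the unconstrained objective is blind to the spread of the predictive distribution, whereas the point-mass constrained objective charges one bit for the policy whose outcome cannot be predicted precisely.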


5 Experimental Setting

A classic experimental setting that investigates epistemic behavior is the T-maze task (Friston et al., 2015). The T-maze environment consists of four positions, as drawn in Figure 3. The agent starts in position 1, and aims to obtain a reward that resides in either arm 2 or arm 3. The position of the reward is unknown to the agent a priori, and once the agent enters one of the arms it remains there.

FIGURE 3

Layout of the T-maze. The agent starts at position 1. The reward is located at either position 2 or 3. Position 4 contains a cue which indicates the reward position.

In order to learn the position of the reward, the agent first needs to move to position 4, where a cue indicates the reward position. At each position, the agent may observe one of four reward-related outcomes: (1) the reward is indicated to reside at location two (left arm); (2) the reward is indicated to reside at location three (right arm); (3) the reward is obtained; (4) the reward is not obtained. The key insight is that an epistemic policy would first inspect the cue at position 4 and then move to the indicated arm, whereas a purely goal-directed agent would immediately move towards either of the potential goal positions instead of visiting the cue.

5.1 Generative Model Specification

We follow (Friston et al., 2015), and assume a generative model with discrete states x, observations y and controls u. The state at time k represents the agent position (four positions, Figure 3) combined with the reward position (two possibilities). The state vector thus comprises eight possible realizations. The transition between states is affected by the control, which encodes the agent’s attempted next position in the maze. The observation variables represent the agent position at time k (four positions) in combination with the reward-related outcome at that position (four possibilities). The model is defined by a state prior, an observation model, a transition model and goal priors, whose parameters are assigned in Section 5.2. The agent plans two steps ahead (T = 2), for which the FFG is drawn in Figure 4.

FIGURE 4

Forney-style factor graph of the generative model for the T-maze. The MUX nodes select the transition matrix as determined by the control variable. Dashed boxes summarize the indicated distributions.

5.2 Parameter Assignments

We start the simulation at t = 1, and assume that the initial position is known, namely we start at position 1 (Figure 3). However, the reward position is unknown a priori. This prior information is encoded by the initial state probability vector, where ⊗ denotes the Kronecker product.

The transition matrices encode the state transitions (from column-index to row-index). The control affects the agent position, but not the reward position. Therefore, Kronecker products with the two-dimensional unit matrix I_2 ensure that the transitions are duplicated for both possible reward positions. Note that positions 2 and 3 (the reward arms) are attracting states, since none of the transition matrices allow a transition away from these positions. This means that although it is possible to propose any control at any time, not all controls will move the agent to its attempted position. The collection of transition matrices comprises {B_1, B_2, B_3, B_4}.

The observed outcome depends on the position of the agent. The position-dependent observation matrices specify how observations follow, given the current position of the agent (subscripts) and the reward position (columns), with reward probability α. The columns of these position-dependent observation matrices represent the two possibilities for the reward position. The position-dependent observation matrices combine into the complete block-diagonal, 16-by-8 observation matrix, where ⊕ denotes the direct sum (i.e. block-diagonal concatenation).

The goal prior depends upon the future time, with reward utility c, and σ the soft-max function. The flat prior c_1 encodes a lack of external preference at t = 1, while c_k for k > 1 encodes a preference for observing rewards at subsequent times. This effectively removes the goal prior for the first move (t = 1), while in subsequent moves the agent is rewarded for extrinsically rewarding states (van de Laar, 2018).
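
The construction described above (Kronecker products for the prior and transitions, a direct sum for the observation matrix) can be sketched numerically. The following NumPy snippet is an illustration under stated assumptions, not the parameterization of this paper: the state ordering (position major, reward position minor), the individual entries of the observation matrices `A1` to `A4`, the example value of `alpha`, and the helper names `transition` and `direct_sum` are all hypothetical. Only the overall structure (8 states, 16 observations, attracting reward arms, block-diagonal observation matrix) follows the text.

```python
import numpy as np

alpha = 0.9  # reward probability (example value)

# Initial state prior: position 1 is known, the reward position is unknown.
# Assumed ordering: position (4 values) major, reward position (2 values) minor.
d = np.kron(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.5, 0.5]))

def transition(target):
    """4x4 position transition (column = from, row = to) for an attempted move to `target` (0-based)."""
    B = np.zeros((4, 4))
    for src in range(4):
        # Positions 2 and 3 (indices 1 and 2) are attracting: the agent stays put.
        dst = src if src in (1, 2) else target
        B[dst, src] = 1.0
    return B

# The Kronecker product with I_2 duplicates each transition for both reward positions.
B = [np.kron(transition(u), np.eye(2)) for u in range(4)]

# Assumed position-dependent observation matrices.
# Rows: four reward-related outcomes; columns: reward in arm 2 or arm 3.
A1 = np.array([[0.5, 0.5], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0]])          # start: uninformative
A2 = np.array([[0, 0], [0, 0], [alpha, 1 - alpha], [1 - alpha, alpha]])  # arm 2
A3 = np.array([[0, 0], [0, 0], [1 - alpha, alpha], [alpha, 1 - alpha]])  # arm 3
A4 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])          # cue: reveals reward arm

def direct_sum(blocks):
    """Block-diagonal concatenation of a list of 2-D arrays."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

A = direct_sum([A1, A2, A3, A4])  # 16-by-8 observation matrix
```

With these arrays, `A @ (B[3] @ d)` gives the predicted observation distribution after an attempted first move to the cue position, which is a convenient sanity check that every column of `A` and of each `B[u]` sums to one.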

6 Inference for Planning

In this simulation we compare the behavior of a CBFE agent to the behavior of a reference BFE agent (without point-mass constraints). We consider given policies, and the optimal CBFE as a function of those policies, where the point-mass constraints are included explicitly. The unconstrained BFE represents the objective where the future observation variables are not point-mass constrained by their potential outcomes. The unconstrained agent will therefore optimize the joint belief over state and future observation variables rather than potential outcomes. We will evaluate the BFE, CBFE and EFE for all sixteen (T = 2) possible candidate policies. We consider several scenarios with varying reward probabilities α (Eq. 41) and reward utilities c (Eq. 42). In the current section we do not (yet) consider the interaction of the agent with the environment. In other words, actions from the optimal policy are not (yet) executed; we are purely interested in the inference for planning itself, and the resulting free energy values as a function of the candidate policies (Eq. 43).
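
As a small illustration of the planning setup, the sixteen candidate policies for T = 2 can be enumerated and ranked by any free energy objective. In the sketch below, `optimal_policies` and `placeholder_objective` are hypothetical names; the placeholder values are not simulation results and only stand in for a minimized (C)BFE or EFE per policy.

```python
from itertools import product

positions = (1, 2, 3, 4)  # attempted next positions in the T-maze
T = 2                     # lookahead

# All candidate policies: one attempted position per future time step.
policies = list(product(positions, repeat=T))

def optimal_policies(free_energy, candidates, tol=1e-9):
    """Return every candidate whose free energy is within `tol` of the minimum (ties are kept)."""
    values = {pi: free_energy(pi) for pi in candidates}
    best = min(values.values())
    return [pi for pi, v in values.items() if v - best <= tol]

def placeholder_objective(pi):
    # Hypothetical stand-in, NOT a result from the paper: favors "cue first, then an arm".
    return 0.0 if pi[0] == 4 and pi[1] in (2, 3) else 1.0

print(len(policies))
print(optimal_policies(placeholder_objective, policies))
```

Keeping ties in the returned list matters here, because several policies can attain the same minimal free energy when the reward location is still unknown at planning time.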

6.1 Message Passing Schedule for Planning

The message passing schedules for planning are drawn in Figure 5 (BFE) and Figure 6 (CBFE), where light messages are computed by sum-product (SP) message passing updates (Loeliger, 2004), and dark messages by variational message passing updates (Dauwels, 2007). An overview of message passing updates for discrete nodes can be found in (van de Laar, 2019, Supplementary Appendix SA).

FIGURE 5

Message passing schedule for planning in the T-maze with the BFE.

FIGURE 6

Message passing schedule for planning in the T-maze with the CBFE.

For the CBFE, the posterior beliefs associated with the observation variables are constrained by point-mass (Dirac-delta) distributions, see (Eq. 27), and the corresponding potential outcomes are optimized for. The message passing optimization scheme is derived from first principles in (Şenöz et al., 2021). In order to obtain a new value for a potential outcome, messages ⓫ and ⑫ are multiplied. The mode of the product then becomes the new value, which is used to construct the corresponding point-mass belief. The updated belief is subsequently used in the next iteration to compute ❸. The resulting iterative expectation maximization (EM) procedure initializes values for all potential outcomes, and is performed using message passing according to (Dauwels et al., 2005). Interestingly, where optimization of the EFE is performed by a forward-only procedure (see Supplementary Appendix SA), optimization of the (C)BFE, as illustrated in Figures 5, 6, also includes a complete backward (smoothing) pass over the model.
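
The update of a potential outcome described above (multiply two colliding messages, take the mode, construct a point-mass belief) can be sketched for categorical messages as follows. This is a minimal NumPy illustration with made-up message values; the function name `update_point_mass` is hypothetical and the snippet is not the ForneyLab implementation.

```python
import numpy as np

def update_point_mass(msg_a, msg_b):
    """Multiply two categorical messages and return the mode with its one-hot (point-mass) belief."""
    product = np.asarray(msg_a, dtype=float) * np.asarray(msg_b, dtype=float)
    total = product.sum()
    if total <= 0.0:
        raise ValueError("messages have no overlapping support")
    product = product / total            # normalized product of the two messages
    mode = int(np.argmax(product))       # new value of the potential outcome
    belief = np.zeros_like(product)
    belief[mode] = 1.0                   # point-mass (Dirac-delta) belief at the mode
    return mode, belief

# Made-up messages over three outcomes, for illustration only.
mode, belief = update_point_mass([0.1, 0.6, 0.3], [0.5, 0.25, 0.25])
print(mode, belief)
```

In an iterative scheme, the returned belief would replace the previous point-mass belief before the messages are recomputed in the next iteration.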

6.2 Inference Results for Planning

Optimization of the (C)BFE by message passing is performed with ForneyLab, version 0.11.4 (Cox et al., 2019). Free energies for planning, for three different agents and T-maze scenarios, are plotted in Figure 7. The distinct agents optimize the CBFE, BFE and EFE, respectively. We summarize the most important observations below.

FIGURE 7

(Constrained) Bethe Free Energies ((C)BFE) and Expected Free Energies (EFE) (in bits) for the T-maze policies under varying parameter settings. Each diagram plots the minimized free energy values for all possible policies (lookahead T = 2), with the first move on the vertical axis and the second move on the horizontal axis. For example, the cell in row 4, column 3 represents the policy (4,3), which first moves to position 4 and then to position 3. The values for the optimal policies are annotated red with an asterisk.

The first column of diagrams in Figure 7 shows the results for the CBFE agent, for varying scenarios.
• The first scenario for the CBFE agent (upper left diagram) imposes a likely reward (α = 0.9) and positive reward utility (c = 2). In this scenario, the CBFE agent prefers the informative policies (4,2) and (4,3), where the agent seeks the cue in the first move and the reward in the second move. An epistemic (information seeking) agent would prefer these policies in this scenario.
• In the upper left diagram, note the lack of preference between position 2 and 3 in the second move. Because the policy is not yet executed (moves are only planned), the true reward location remains unknown. Therefore, both of these informative policies are on equal footing.

The second column of diagrams shows the results for the BFE agent.
• In every scenario, the BFE agent fails to distinguish between the majority of ignorant (first move to 1), informative (first move to 4) and greedy policies (first move to 2 or 3). These policy preferences do not correspond with the anticipated preferences of an epistemic agent.
• Comparing the BFE with the CBFE results, we observe that the point-mass constraint on potential outcomes induces a differentiation between ignorant, informative and greedy policies.
• More specifically, the third scenario (third row of diagrams) removes the extrinsic value of reward (c = 0). While the CBFE still differentiates between ignorant, informative and greedy policies, the BFE agent exhibits a total lack of preference.
• The second scenario (second row of diagrams) removes the value of information about the reward position (α = 0.5). This scenario thus renders the cue worthless. The BFE agent appears insusceptible to a change in the epistemic α parameter.

Taken together, these observations support the interpretation of the BFE as a purely extrinsically driven objective (Section 4).

The third column of diagrams shows the results for an EFE agent, as implemented in accordance with (Friston et al., 2015), see also Supplementary Appendix SA.
• In all scenarios, the EFE agent exhibits a consistent preference for the (4,4) policy. Compared to the CBFE agent, the EFE agent fails to plan ahead to obtain future reward after observing the cue.
• As we will see in Section 7, the EFE agent only infers a preference for a reward arm (position 2 or 3) after execution of the first move to the cue position. In contrast, the CBFE agent predicts the impact of information and plans accordingly.
• The second scenario (middle row) provides an informative cue, but removes the possibility to exploit that information (α = 0.5). Interestingly, the EFE agent still moves to the cue position (for the sake of getting information), whereas the CBFE agent expresses ambivalence under an inoperable cue.

6.3 Results for CBFE Value Decomposition

Simulated values for the CBFE decomposition (Eq. 32) in the T-maze application are shown in Figure 8, for four different T-maze scenarios. We summarize the most important observations below.

FIGURE 8

Confidence, complexity and extrinsic value contributions (in bits) to the Constrained Bethe Free Energy (Eq. 32) for the T-maze policies (lookahead T = 2) under varying parameter settings. Optimal values are indicated red with an asterisk.

The first column of diagrams in Figure 8 represents the confidence (Eq. 34) of the CBFE objective for all (planned) policies. The confidence prefers (or ties) the most informative policy (4,4) for all scenarios.
• In the first three scenarios (first three rows of diagrams), all policies other than (4,4) dismiss the opportunity to obtain full information about outcomes on two occasions (T = 2). This is reflected by a negative confidence value, which measures the average rejected information in bits. For example, the policy (1,1) rejects two possibilities to obtain 1 bit of information, leading to a confidence of −2.
• A change in the external value parameter c does not affect the confidence, which supports the interpretation of the confidence as an intrinsic quantity (Eq. 35).
• In the final scenario, the greedy policies (moving first to position 2 or 3) are on equal footing with the informative policies (moving first to 4). This is because in the final scenario, visiting position 2 or 3 offers the same amount of information (namely, complete certainty) about the reward position, as would visiting the cue position.

The complexity (second column of diagrams) opposes changes in state beliefs that are unwarranted by the policy-induced state transitions, and guards against premature convergence of the state precision (Table 2). As a result, the complexity prefers (or ties) the most conservative policy (1,1) for all scenarios.
• The complexity is unaffected by changes in utility (similar to the confidence), which supports the interpretation of the complexity as an intrinsic quantity (Eq. 35).
• In the third scenario (α = 0.5), the greedy policies become tied in complexity with (most of) the ignorant policies. Because neither visiting a reward arm nor remaining at the initial position offers any useful information about the reward position, the state belief remains unaltered, and these policies incur no complexity penalty.

The extrinsic value (third column of diagrams) represents the value of external reward, and leads the agent to pursue extrinsically rewarding states.
• The extrinsic value is unaffected by changes in the epistemic reward probability parameter α, which supports the interpretation of the extrinsic value as an externally determined quantity.
• In the second scenario the reward utility vanishes (c = 0), and the extrinsic value becomes indifferent about policies.

7 Interactive Simulation

In this section we compare the resulting behavior of the CBFE agent with a traditional EFE agent, in interaction with a simulated environment.

7.1 Experimental Protocol

The experimental protocol governs how the agent interacts with its environment. In our protocol, the action and outcome at time t are the only quantities that are exchanged between the agent and the environment (generative process). The task of the agent is then to plan for actions that lead the agent to desired states. We adapt the experimental protocol of (van de Laar and de Vries, 2019) for the purpose of the current simulation. We write the model f with a time-subscript to indicate the time-dependent statistics of the state prior as a result of the perceptual process (Section 2.2). The experimental protocol (Algorithm 1) then consists of five steps per time t: plan, act, execute, observe and slide.

The plan step solves the inference for planning (Section 2.3), and returns the active policy that represents the (believed) optimal sequence of future controls. In the act step, the first action is picked from the policy. The execute step subsequently executes this action in the simulated environment. Execution will alter the state of the environment. In the observe step, the environment responds with a new observation. Given the action and resulting observation, the slide step then solves the inference for perception (Section 2.2) and prepares the model for the next step.

Inference for the slide step is illustrated in Figure 9, where message ③ propagates an observed outcome, and where message ⑤ summarizes the information contained within the dashed box. Only the dashed sub-model is relevant to the slide step, that is, beliefs about the future do not influence ⑤. After computation, message ⑤ is normalized, and the resulting state posterior is subsequently used as a prior to construct the model f for the next time-step, see also (van de Laar and de Vries, 2019).
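
The five steps of the protocol can be summarized as a simple loop. The sketch below is a generic illustration: `run_protocol` and the four callables it receives are hypothetical names, and the toy callables at the bottom are placeholders that only exercise the control flow (they do not perform any inference).

```python
def run_protocol(model, plan, execute, observe, slide, n_steps):
    """Run the plan-act-execute-observe-slide loop for `n_steps` time steps."""
    trajectory = []
    for t in range(1, n_steps + 1):
        policy = plan(model)                   # plan: infer a sequence of future controls
        action = policy[0]                     # act: pick the first action from the policy
        execute(action)                        # execute: apply the action to the environment
        outcome = observe()                    # observe: the environment returns an outcome
        model = slide(model, action, outcome)  # slide: update the state prior for the next step
        trajectory.append((t, action, outcome))
    return model, trajectory

# Toy usage with placeholder callables (not the T-maze agent or environment).
env = {"position": 1}

def toy_plan(model):
    return (4, 3) if model["t"] == 1 else (3, 3)

def toy_execute(action):
    env["position"] = action

def toy_observe():
    return env["position"]

def toy_slide(model, action, outcome):
    return {"t": model["t"] + 1, "last_outcome": outcome}

final_model, trajectory = run_protocol({"t": 1}, toy_plan, toy_execute,
                                       toy_observe, toy_slide, n_steps=2)
print(trajectory)
```

Note that only the action and the outcome cross the boundary between agent and environment in this loop, which mirrors the exchange described in the protocol.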

FIGURE 9

Message passing schedule for the slide step.

7.2 Results for Interactive Simulation

We initialize an environment with the reward in the right arm (position 3). We then execute the experimental protocol of Algorithm 1, with lookahead T = 2, for N = 2 moves, on a dense landscape of varying reward probabilities α and utilities c (scenarios). After the first move, the environment returns an observation to the agent, which informs the agent about the second move. After the second move, the expected reward that is associated with the resulting position is reported. We perform 10 simulations per scenario, and compute the average reward probability. The results of Figure 10 compare the average rewards of the CBFE agent and the EFE agent.

FIGURE 10

Average reward landscapes for the Constrained Bethe Free Energy (CBFE) agent and the Expected Free Energy (EFE) agent.


From the results of Figure 10 it can be seen that the region of zero average reward (dark region in lower left corner) is significantly smaller for the CBFE agent than for the EFE agent. This indicates that the CBFE agent accrues reward in a significantly larger portion of the scenario landscape than the EFE agent.

In the lower left corner, the resulting CBFE agent trajectory becomes (4, 4), whereas the EFE agent trajectory becomes (4, 1). Although both agents observe the cue after their first move, they do not visit the indicated reward position in the second move, which leads to zero average reward. Note that neither objective is explicitly designed to optimize for average reward; both define a free energy instead, where multiple simultaneous forces are at play.

In the upper right regions, with high reward probability and utility, both agents consistently execute (4, 3). With this trajectory, the cue is observed after the first move, and the indicated (correct) reward position is visited in the second move, leading to an average reward of α.

For reward probabilities close to α = 1, however, the performance of both agents deteriorates. In this upper region, the informative policies become tied with the greedy policies (see Figure 8), and there is no single dominant trajectory. In some trajectories the agent enters the wrong arm on the first move, from which the agent cannot escape, and the average reward deteriorates.

Greedy behavior is also observed for the CBFE agent when informative priors c (Eq. 42a, Eq. 42b) are set for all k (including k = 1), conforming with the configuration of (Friston et al., 2015). With this configuration, expected reward for the CBFE agent deteriorates to 0.5 in the otherwise rewarding region. Interestingly, this change of priors does not affect results for the EFE agent. The resulting change in behavior suggests that the CBFE agent is more susceptible to temporal aspects of the goal prior configuration. While this effect may be considered a nuisance in some cases, it also allows for increased flexibility when assigning explicit temporal requirements about goals. For example, assigning an informative versus a flat prior for k = 1 respectively encodes an urgency in obtaining immediate reward versus a freedom to explore.

8 Discussion

In this paper, we focused on epistemic drivers for behavior. We noted that the nature of the epistemic drive differs between an EFE and CBFE agent. Namely, the epistemic drive for the EFE agent stems directly from maximizing a mutual information term between states and observations (Eq. 20), while the epistemic drive for the CBFE agent stems from a self-evidencing mechanism (Section 4.2). In order to better understand the strengths and limitations of the driving forces for the CBFE, it would be interesting to investigate its behavior in more challenging setups, including continuous variables, inference for control (van de Laar, 2018), and the effects of point-mass constraints on other model variables.

Recent work by (Da Costa et al., 2020b; Millidge et al., 2021) shows that epistemic behavior does not occur when the goal prior goes to a point-mass. The work of (Millidge et al., 2021) points to the entropy of the observed variables as a pivotal quantity for epistemic behavior. The CBFE however does not include an entropy over observations, and still exhibits epistemic qualities. The difference in methods lies with the constrained quantity; namely, (Da Costa et al., 2020a; Millidge et al., 2021) constrain the goal prior, while the current paper constrains the variational distribution instead. While both constraints remove the entropy over observations from the resulting FE objective, optimization of the potential outcomes in the CBFE still induces an epistemic drive (Section 4). Our results thus show that epistemic drives for AIF prove to be more subtle than initially anticipated.

Our presented approach is uniquely scalable, because it employs off-the-shelf message passing algorithms. All message computations are local, which makes our approach naturally amenable to both parallel and on-line processing (Bagaev and de Vries, 2021). Especially AIF in deep hierarchical models might benefit from the improved computational properties of the CBFE. It will be interesting to investigate how the presented approach generalizes to more demanding (practical) settings.

As a generic variational inference procedure, the CBFE approach applies to arbitrary models. This allows researchers to investigate epistemics in a much wider class of models than previously available. One immediate avenue for further research is the integration of CBFE with predictive coding schemes (Friston and Kiebel, 2009; Bogacz, 2017; Millidge et al., 2020a). Predictive coding has so far been driven mainly by minimizing free energy in hierarchical models under the Laplace approximation. Here, the CBFE approach readily applies as well (Şenöz et al., 2021), allowing researchers to explore the effects of augmenting existing predictive coding models with epistemic components.

The derivation of alternative functionals that preserve the desirable epistemic behavior of EFE optimization is an active research area (Tschantz et al., 2020a; Sajid et al., 2021). There have been several interesting proposals such as the Free Energy of the Expected Future (Millidge et al., 2020b; Tschantz et al., 2020b; Hafner et al., 2020) or Generalized Free Energy (Parr and Friston, 2019), as well as amortization strategies (Ueltzhöffer, 2018; Millidge, 2019) and sophisticated schemes (Friston et al., 2021). Comparing behavior between the CBFE and other free energy objectives might therefore prove an interesting avenue for future research.

In the original description of active inference, a policy precision is optimized during policy planning, and the policy for execution is sampled from a distribution of precision-weighted policies (Friston et al., 2015). The present paper does not consider precision optimization, and effectively assumes a large, fixed precision instead. In practice, this procedure consistently selects the policy with minimal free energy; see also maximum selection (in terms of value) as described by (Schwöbel et al., 2018). To accommodate precision optimization, the CBFE objective might be extended with a temperature parameter, mimicking thermodynamic descriptions of free energy (Ortega and Braun, 2013). Optimization of the temperature parameter might then relate to optimization of the policy precision, as often seen in biologically plausible formulations of AIF (FitzGerald et al., 2015).

Another interesting avenue for further research would be the design of a meta-agent that determines the statistics and temporal configuration of the goal priors. In our experiments we design the goal priors (Eq. 42a, Eq. 42b) ourselves, such that the agent is free to explore in the first move and seeks reward on the second move. The challenge then becomes to design a synthetic meta-agent that automatically generates an effective lower-level goal sequence from a single higher-level goal definition.
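
The relation between precision-weighted policy sampling and minimum selection, as discussed above, can be illustrated numerically. The sketch below assumes one common form of precision weighting, namely a soft-max over negative free energies scaled by a precision parameter; this form, the function name `policy_distribution` and the free energy values are assumptions for illustration and are not taken from this paper.

```python
import numpy as np

def policy_distribution(free_energies, precision):
    """Soft-max over negative free energies, scaled by a (fixed) precision."""
    g = np.asarray(free_energies, dtype=float)
    logits = -precision * (g - g.min())  # shift by the minimum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical free energies (in arbitrary units) for three candidate policies.
G = [3.0, 1.0, 2.0]

p_low = policy_distribution(G, precision=1.0)     # moderate precision: graded preferences
p_high = policy_distribution(G, precision=100.0)  # large precision: near-deterministic choice
print(p_low, p_high)
```

For a large fixed precision, nearly all probability mass concentrates on the policy with minimal free energy, so sampling from the distribution becomes practically equivalent to selecting the minimizer.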

9 Conclusion

In this paper we presented mathematical arguments and simulations that show how inclusion of point-mass constraints on the Bethe Free Energy (BFE) leads to epistemic behavior. The thus obtained Constrained Bethe Free Energy (CBFE) has direct connections with formulations of the principle of least action in physics (Caticha, 2012), and can be conveniently optimized by message passing on a graphical representation of the generative model (GM).

Simulations for the T-maze task illustrate how a CBFE agent exhibits an epistemic drive, whereas the BFE agent lacks epistemic qualities. The key intuition behind the working mechanism of the CBFE is that point-mass constraints on observation variables explicitly encode the assumption that the agent will observe in the future. Although the actual values of these observations remain unknown, the agent “knows” that it will observe in the future, and it “knows” (through the GM) how these (potential) outcomes will influence inferences about states.

We dissected the CBFE objective in terms of its constituent drivers for behavior. In the CBFE framework, in addition to being functionals of the state beliefs, the confidence and complexity are viewed as functions of the potential outcomes and policy respectively. Simultaneous optimization of variational distributions and potential outcomes then leads the agent to prefer epistemic policies. Interactive simulations for the T-maze showed that, compared to an EFE agent, the CBFE agent incurs expected reward in a significantly larger portion of the scenario landscape.

We performed our simulations by message passing on a Forney-style factor graph representation of the generative model. The modularity of the graphical representation allows for flexible model search, and message passing allows for distributed computations that scale well to bigger models. Constraining the BFE and optimizing the CBFE objective by message passing thus suggests a simple and general mechanism for epistemic-aware AIF in free-form generative models.
