Karl J Friston, Jean Daunizeau, Stefan J Kiebel.
Abstract
This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.
Year: 2009 PMID: 19641614 PMCID: PMC2713351 DOI: 10.1371/journal.pone.0006421
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Glossary of mathematical symbols.
| Variable | Short description |
| --- | --- |
| | Environmental causes of sensory input |
| | Generalised hidden-states of an agent. These are time-varying quantities that include all high-order temporal derivatives. |
| | Generalised forces or causal states that act on hidden states |
| | Generalised sensory states sampled by an agent |
| | Parameters of |
| | Parameters of the precisions of random fluctuations |
| | Generalised random fluctuations of the motion of hidden states |
| | Generalised random fluctuations of sensory states |
| | Generalised random fluctuations of causal states |
| | Precisions or inverse covariances of generalised random fluctuations |
| | Sensory mapping and equations of motion generating sensory states |
| | Sensory mapping and equations of motion used to model sensory states |
| | Action: a policy function of generalised sensory states; a hidden state that the agent can change |
| | Equilibrium ensemble density; the density of an ensemble of agents at equilibrium with their environment. It is the principal eigensolution of the Fokker-Planck operator |
| | Fokker-Planck operator that is a function of fixed causes |
| | Kullback-Leibler divergence; also known as relative-entropy, cross-entropy or information gain |
| | Expectation or mean under the density |
| | Model or agent; entailing the form of a generative model |
| | Entropy of generalised hidden states |
| | Entropy of generalised sensory states |
| | Surprise or self-information of generalised sensory states |
| | Free-energy bound on surprise |
| | |
| | Conditional or posterior expectation of the causes |
| | Prior expectation of generalised causal states |
| | Desired equilibrium density |
| | Generalised prediction error on sensory states |
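The glossary entries for surprise, the Kullback-Leibler divergence and the free-energy bound are related by a standard identity: free-energy equals surprise plus the divergence between the agent's density over causes and the true posterior, so it can never fall below surprise. The following is a minimal sketch of that identity for a toy discrete model; the three-cause prior, the likelihood values and all function names are illustrative assumptions and do not come from the paper.

```python
import math

def kl_divergence(q, p):
    """Kullback-Leibler divergence D(q || p) between discrete densities, in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(q):
    """Shannon entropy of a discrete density, in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def free_energy(q, joint):
    """Expected energy under q minus the entropy of q.

    `joint[i]` is p(s, cause_i) for the single observed sensory state s.
    """
    energy = -sum(qi * math.log(ji) for qi, ji in zip(q, joint) if qi > 0)
    return energy - entropy(q)

# Hypothetical toy model: three possible causes and one observed sensory state.
prior = [0.5, 0.3, 0.2]          # p(cause), assumed for illustration
likelihood = [0.9, 0.2, 0.1]     # p(s | cause), assumed for illustration
joint = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(joint)            # p(s)
surprise = -math.log(evidence)   # self-information of s
posterior = [j / evidence for j in joint]

q = [0.6, 0.3, 0.1]              # an arbitrary approximate density over causes
F = free_energy(q, joint)
gap = kl_divergence(q, posterior)

print(f"surprise           = {surprise:.4f}")
print(f"free energy        = {F:.4f}")
print(f"surprise + KL gap  = {surprise + gap:.4f}")
```

Because the divergence is non-negative, lowering free-energy with respect to `q` tightens the bound, and the bound is exact when `q` equals the posterior.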
Figure 1. An agent that thinks it is a Lorenz attractor.
This figure illustrates the behaviour of an agent whose trajectories are drawn to a Lorenz attractor. However, this is no ordinary attractor; the trajectories are driven purely by action (displayed as a function of time in the right panels). Action tries to suppress prediction errors on motion through this three-dimensional state-space (blue lines in the left panels). These prediction errors are the difference between sensed motion and the motion expected under the agent's generative model (red arrows: evaluated at ). These prior expectations are based on a Lorenz attractor. The ensuing behaviour can be regarded as a form of chaos control. Critically, this autonomous behaviour is very resistant to random forces on the agent. This can be seen by comparing the top row (with no perturbations) with the middle row, where the first state has been perturbed with a smooth exogenous force (broken line). Note that action counters this perturbation and the ensuing trajectories are essentially unaffected. The bottom row shows exactly the same simulation but with action turned off. Here, the environmental forces cause the agents to precess randomly around the fixed point attractor of . These simulations used a log-precision on the random fluctuations of 16.
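The idea that action suppresses prediction errors on motion, and thereby cancels exogenous forces, can be illustrated with a deliberately simplified one-dimensional sketch. This is not the generalised-coordinate scheme used in the paper: the linear expected flow (standing in for the Lorenz dynamics), the sinusoidal perturbation, the gain and the Euler step are all assumptions chosen for illustration.

```python
import math

def expected_flow(x):
    """Prior expectation about motion (hypothetical): decay towards x = 0."""
    return -x

def perturbation(t):
    """Smooth exogenous force acting on the state (hypothetical)."""
    return math.sin(t)

def simulate(use_action, gain=20.0, dt=0.01, steps=1000, x0=1.0):
    """Euler integration of a state whose motion is action plus perturbation."""
    x, action = x0, 0.0
    for step in range(steps):
        t = step * dt
        sensed_motion = action + perturbation(t)
        # Prediction error: sensed motion minus the motion the agent expects.
        error = sensed_motion - expected_flow(x)
        if use_action:
            # Gradient descent on 0.5 * error**2; d(error)/d(action) = 1 here.
            action -= gain * error * dt
        x += sensed_motion * dt
    return x

x_with_action = simulate(use_action=True)
x_without_action = simulate(use_action=False)
print(f"final state with action:    {x_with_action:.3f}")
print(f"final state without action: {x_without_action:.3f}")
```

With action enabled, the state should settle close to the expected fixed point despite the perturbation; with action turned off, the state simply follows the exogenous force.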
Figure 2. The mountain car problem.
This is a schematic representation of the mountain car problem. Left: The landscape or potential energy function that defines the motion of the car. This has a minimum at . The mountain-car is shown at its uncontrolled stable position (transparent) and the desired parking position at the top of the hill on the right . Right: Forces experienced by the mountain-car at different positions due to the slope of the hill (blue). Critically, at the force is minus one and cannot be exceeded by the car's engine, due to the squashing function applied to action.
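The constraint described in this caption can be checked numerically with a small sketch. The landscape below is a hypothetical stand-in (a cosine valley whose slope force reaches minus one at its steepest point), and `tanh` is assumed as the squashing function; neither is taken from the paper. The point is only that a squashed action stays strictly inside (-1, 1), so the engine cannot overcome a slope force of minus one.

```python
import math

def landscape(x):
    """Hypothetical potential energy: valley at x = 0, hilltop at x = pi."""
    return -math.cos(x)

def slope_force(x):
    """Force due to the slope: minus the gradient of the landscape."""
    return -math.sin(x)

def engine_force(action):
    """Squashed action (tanh assumed): bounded however large the action is."""
    return math.tanh(action)

def net_force(x, action):
    """Total force on the car at position x under a given action."""
    return slope_force(x) + engine_force(action)

steepest = math.pi / 2  # where the hypothetical slope force reaches minus one
for action in (0.5, 1.0, 2.0, 5.0):
    print(f"action = {action:4.1f}  net force at steepest point = "
          f"{net_force(steepest, action):+.4f}")
```

In this toy setting the net force at the steepest point stays negative for every action tried, so a car at rest there cannot hold its position, let alone advance, on engine power alone.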
Figure 3. Equilibria in the state-space of the mountain car problem.
Left panels: Flow-fields and associated equilibria for an uncontrolled environment (top), a controlled or optimised environment (middle) and under prior expectations after learning (bottom). Notice how the flow of states in the controlled environment enforces trajectories that start by moving away from the desired location (green dot at ). The arrows denote the flow of states (position and velocity) prescribed by the parameters. The equilibrium density in each row is the principal eigenfunction of the Fokker-Planck operator associated with the parameters. For the controlled and expected environments, these are low-entropy equilibria, centred on the desired location. Right panels: These panels show the flow fields in terms of their nullclines. Nullclines correspond to lines in state-space where the rate of change of one variable is zero. Here the nullcline for position is along the x-axis, where velocity is zero. The nullcline for velocity is where the change in velocity goes from positive (grey) to negative (white). Fixed points correspond to the intersections of these nullclines. It can be seen that under an uncontrolled environment (top) there is a stable fixed point, where the velocity nullcline intersects the position nullcline with negative slope. Under controlled (middle) and expected (bottom) dynamics there are three fixed points. The rightmost fixed point is under the desired equilibrium density and is stable. The middle fixed point is halfway up the hill and the final fixed point is at the bottom. Both of these are unstable and repel trajectories so that they are ultimately attracted to the desired location. The red lines depict exemplar trajectories, under deterministic flow, from . In a controlled environment, this shows the optimum behaviour of moving up the opposite hill to gain momentum so that the desired location can be reached.
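The rule used in this caption, that a fixed point is stable where the velocity nullcline crosses the position nullcline with negative slope, can be sketched for a generic position-velocity flow. The flow below (velocity changes by a position-dependent force minus friction), the sinusoidal force and the friction value are assumptions for illustration and are not the flows shown in the figure. Fixed points lie where velocity is zero and the force vanishes; stability is read from the eigenvalues of the Jacobian.

```python
import cmath
import math

FRICTION = 0.25  # hypothetical friction coefficient

def force(x):
    """Hypothetical position-dependent force (valley at 0, hilltop at pi)."""
    return -math.sin(x)

def force_slope(x, h=1e-6):
    """Central-difference estimate of the derivative of the force."""
    return (force(x + h) - force(x - h)) / (2 * h)

def find_fixed_points(lo, hi, steps=999):
    """Positions where both nullclines intersect: velocity = 0 and force = 0."""
    roots = []
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    for a, b in zip(grid, grid[1:]):
        if force(a) == 0.0:
            roots.append(a)
        elif force(a) * force(b) < 0:
            for _ in range(60):  # bisection on the bracketed sign change
                mid = 0.5 * (a + b)
                if force(a) * force(mid) <= 0:
                    b = mid
                else:
                    a = mid
            roots.append(0.5 * (a + b))
    return roots

def classify(x):
    """Stability from the Jacobian [[0, 1], [force'(x), -FRICTION]]."""
    disc = cmath.sqrt(FRICTION ** 2 + 4 * force_slope(x))
    eigenvalues = [(-FRICTION + disc) / 2, (-FRICTION - disc) / 2]
    return "stable" if all(ev.real < 0 for ev in eigenvalues) else "unstable"

fixed_points = find_fixed_points(-1.0, 4.0)
for x in fixed_points:
    print(f"fixed point at x = {x:.4f}: slope of force = "
          f"{force_slope(x):+.2f}, {classify(x)}")
```

Since the velocity nullcline here is the force divided by friction, a negative slope of the force at the crossing gives eigenvalues with negative real parts (stable), while a positive slope gives a saddle (unstable).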
Figure 4. Inferred motion and action of a mountain-car agent.
Top row: The left panel shows the predicted sensory states (position in blue and velocity in green). The red lines correspond to the prediction error based upon conditional expectations of the states (right panel). These expectations are optimised using Equation 9. This is a variational scheme that optimises the free-energy in generalised coordinates of motion. The associated conditional covariance is displayed as 90% confidence intervals (thin grey areas). Middle row: The nullclines and implicit fixed points associated with the parameters learnt by the agent, after exposure to a controlled environment (left). The actual trajectory through state-space is shown in blue (the red line is the equivalent trajectory under deterministic flow). The action causing this trajectory is shown on the right and shows a poly-phasic response until the desired position is reached, after which a small amount of force is required to stop the car sliding back down the hill (see Figure 2). Bottom row: As for the middle row but now in the context of a smoothly varying perturbation (broken line in the right panel). Note that this exogenous force has very little effect on behaviour because it is unexpected and countered by action. These simulations used expected log-precisions of: .
Figure 5. The effect of precision (dopamine) on behaviour.
Inferred states (top row) and trajectories through state-space (bottom row) under different levels of conditional uncertainty or expected precision. As in previous figures, the inferred sensory states (position in blue and velocity in green) are shown with their 90% confidence intervals, and the trajectories are superimposed on nullclines. As the expected precision falls, the inferred dynamics are less accountable to prior expectations, which become less potent in generating prediction errors and action. It is interesting to see that uncertainty about the states (grey area) increases as precision falls and confidence is lost.
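The link between expected precision and the width of the 90% confidence intervals can be made concrete with a scalar Gaussian sketch. This is a simplification, not the paper's scheme: it assumes a single state whose conditional precision is the sum of a prior precision and a sensory precision, and the log-precision values are chosen only for illustration.

```python
import math
from statistics import NormalDist

LOG_SENSORY_PRECISION = 4.0  # hypothetical, held fixed

def conditional_std(log_prior_precision, log_sensory_precision):
    """Conditional standard deviation of a scalar Gaussian state.

    For a linear Gaussian model the precisions (inverse variances) add.
    """
    precision = math.exp(log_prior_precision) + math.exp(log_sensory_precision)
    return math.sqrt(1.0 / precision)

def confidence_half_width(std, level=0.90):
    """Half-width of a central confidence interval for a Gaussian density."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * std

log_prior_precisions = (8.0, 4.0, 2.0, 0.0)  # hypothetical, falling precision
widths = [
    confidence_half_width(conditional_std(lp, LOG_SENSORY_PRECISION))
    for lp in log_prior_precisions
]
for lp, width in zip(log_prior_precisions, widths):
    print(f"log-precision {lp:3.1f}: 90% interval half-width = {width:.4f}")
```

In this toy setting the interval widens monotonically as the expected precision falls, which is the qualitative pattern the grey areas in the figure are described as showing.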