Judit Zsuga, Klara Biro, Gabor Tajti, Magdolna Emma Szilasi, Csaba Papp, Bela Juhasz, Rudolf Gesztelyi.
Abstract
BACKGROUND: Reinforcement learning is a fundamental form of learning that may be formalized using the Bellman equation. Accordingly, an agent determines the value of a state as the sum of the immediate reward and the discounted value of future states. Thus, the value of a state is determined by agent-related attributes (action set, policy, discount factor) and by the agent's knowledge of the environment, embodied by the reward function, together with hidden environmental factors given by the transition probability. The central objective of reinforcement learning is to solve these two functions, which lie outside the agent's control, either with or without a model.
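The Bellman relation described in the abstract can be illustrated with a minimal sketch of iterative policy evaluation. Everything named here (`evaluate_policy`, the two-state toy environment, the discount factor of 0.5) is a hypothetical illustration and does not come from the paper; it only shows how a state value arises from the immediate reward plus the discounted value of successor states, given a policy, a reward function and transition probabilities.

```python
def evaluate_policy(states, actions, policy, transition, reward,
                    gamma=0.9, tol=1e-8, max_sweeps=10_000):
    """Iteratively solve the Bellman expectation equation for a fixed policy.

    V(s) = sum_a policy[s][a]
           * sum_s' transition[s][a][s'] * (reward[s][a][s'] + gamma * V(s'))
    """
    values = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        largest_change = 0.0
        for s in states:
            new_value = 0.0
            for a in actions:
                for s_next, prob in transition[s][a].items():
                    # Immediate reward plus discounted value of the next state.
                    target = reward[s][a][s_next] + gamma * values[s_next]
                    new_value += policy[s][a] * prob * target
            largest_change = max(largest_change, abs(new_value - values[s]))
            values[s] = new_value
        if largest_change < tol:
            break
    return values


# Hypothetical toy environment: "A" leads to "B" with no reward,
# and "B" loops onto itself with a reward of 1 on every step.
toy_states = ["A", "B"]
toy_actions = ["go"]
toy_policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}
toy_transition = {"A": {"go": {"B": 1.0}}, "B": {"go": {"B": 1.0}}}
toy_reward = {"A": {"go": {"B": 0.0}}, "B": {"go": {"B": 1.0}}}

toy_values = evaluate_policy(toy_states, toy_actions, toy_policy,
                             toy_transition, toy_reward, gamma=0.5)
```

In this toy case the value of "B" approaches 1 / (1 - 0.5) = 2 and the value of "A" approaches 0.5 × 2 = 1. The policy and discount factor are the agent-related attributes, while `toy_reward` and `toy_transition` stand in for the two environmental functions the abstract refers to.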
Keywords: Bellman equation; Cue-context congruence; Model-based reinforcement learning; Policy function; Proactive brain; Reward function
Year: 2016 PMID: 27793098 PMCID: PMC5086043 DOI: 10.1186/s12868-016-0302-7
Source DB: PubMed Journal: BMC Neurosci ISSN: 1471-2202 Impact factor: 3.288
Fig. 1 ‘Proactive’ use of cue-context congruence for building reinforcement learning’s reward function. Left panel: A salient stimulus, conceptualized as a cue, and its context are processed by parallel but richly interconnected systems that center on the amygdala and hippocampus for cue-based and context-based learning, respectively. By means of Pavlovian learning, a set of relevant context frames is formed for each cue (hence, the uniform subscript of cues indicates that a cue may be associated with distinct contexts and, accordingly, with distinct rewards). These context frames encompass permanent features of the context. Based on computational models of others and theoretical considerations, we presume that context frames also include reward-related information. According to the concept of the proactive brain [23], when an unexpected stimulus is encountered, cue- and context-based gist information is rapidly extracted, which activates the most relevant context frame based on prior experience. Building on this, we propose that the reward function attribute of the world model is compiled by the OFC, which, by determining cue-context congruence, is able to identify the most relevant context frame. Using this context frame as a starting point (e.g. state), forward-looking simulations may be performed to estimate expected reward and optimize policy (dark blue line). Right panel: Upon activation of the most relevant context frame, predictions related to the expected reward are made in the OFC. This information encompasses substantial environmental input and is forwarded by glutamatergic neurons to the ventral striatum, VTA and PPTgN. The VTA emits the reward prediction error signal, inherent to the model-free reinforcement learning system, by integrating actual reward and predicted reward information.
In line with observations of others, we suggest that OFC-derived expected reward information is incorporated into the reward prediction error signal (dotted green line). Furthermore, we propose that the scalar value of reward is updated by the reward prediction error signal, contributing to the update of the world model. Abbreviations: action (a), context frame (CFx), model-based reinforcement learning (MB-RL), model-free reinforcement learning (MF-RL), Pavlovian learning (PL), reward (Rx), reward prediction error (RPE), transition (t), ventral striatum (VS), orbitofrontal cortex (OFC), ventral tegmental area (VTA), pedunculo-pontine-tegmental nucleus (PPTgN). Symbols: black dot, transitory state; black arrow, glutamatergic modulation; green arrow, dopaminergic modulation.
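The caption's proposal that the scalar reward value is updated by the reward prediction error can be sketched with a simple delta rule. This is an illustrative assumption, not the authors' stated computation: the function name `update_expected_reward`, the `learning_rate` parameter, the default prediction of 0 for an unseen context frame and the label "CF1" are all hypothetical.

```python
def update_expected_reward(expected_reward, context_frame, actual_reward,
                           learning_rate=0.1):
    """Update the stored reward estimate of one context frame.

    The reward prediction error (RPE) is the actual reward minus the
    predicted reward; the stored estimate moves a fraction of the way
    toward the actual reward. Returns the RPE.
    """
    predicted = expected_reward.get(context_frame, 0.0)  # assumed default
    rpe = actual_reward - predicted
    expected_reward[context_frame] = predicted + learning_rate * rpe
    return rpe


# Hypothetical usage: the same context frame is rewarded twice.
reward_model = {"CF1": 0.0}
first_rpe = update_expected_reward(reward_model, "CF1", 1.0, learning_rate=0.5)
second_rpe = update_expected_reward(reward_model, "CF1", 1.0, learning_rate=0.5)
```

With these assumed numbers the prediction error shrinks from 1.0 to 0.5 as the stored estimate rises from 0.0 to 0.75, showing how repeated experience makes the predicted reward for a context frame converge on the reward actually received.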