The mesostriatal dopamine system is prominently implicated in model-free reinforcement learning, with fMRI BOLD signals in ventral striatum notably covarying with model-free prediction errors. However, latent learning and devaluation studies show that behavior also shows hallmarks of model-based planning, and the interaction between model-based and model-free values, prediction errors, and preferences is underexplored. We designed a multistep decision task in which model-based and model-free influences on human choice behavior could be distinguished. By showing that choices reflected both influences we could then test the purity of the ventral striatal BOLD signal as a model-free report. Contrary to expectations, the signal reflected both model-free and model-based predictions in proportions matching those that best explained choice behavior. These results challenge the notion of a separate model-free learner and suggest a more integrated computational architecture for high-level human decision-making.
A ubiquitous idea in psychology, neuroscience, and behavioral economics is
that the brain contains multiple, distinct systems for decision-making (Daw et al., 2005; Kahneman, 2003; Loewenstein and O’Donoghue, 2004; Rangel et al., 2008; Redish et al., 2008; Sloman,
1996). One prominent contender dates back to Thorndike’s (1911) “law of effect,” which states
that an action followed by reinforcement is more likely to be repeated in the
future. This habit principle is also at the heart of temporal-difference (TD)
learning accounts of the dopaminergic system and its action in striatum (Barto, 1995; Schultz et al., 1997). In the actor-critic, for instance, a dopaminergic
“reward prediction error” (RPE) signal plays the role of
Thorndike’s reinforcer, increasing the propensity to take actions that are
followed by positive RPEs (Maia, 2010; Suri and Schultz, 1999).

However, it has long been known that the reinforcement principle offers at
best an incomplete account of learned action choice. Evidence from reward
devaluation studies suggests that animals can also make
“goal-directed” choices, putatively controlled by representations of
the likely outcomes of their actions (Dickinson and
Balleine, 2002). This realizes a suggestion, dating back at least to
Tolman (1948), that animals are not
condemned merely to repeat previously reinforced actions.

From the perspective of neuroscience, habits and goal-directed action systems
appear to coexist in different corticostriatal circuits. While these systems learn
concurrently, they control behavior differentially under alternative circumstances
(Balleine and O’Doherty, 2010;
Dickinson, 1985; Killcross and Coutureau, 2003). Computational treatments
(Balleine et al., 2008; Daw et al., 2005; Doya,
1999; Niv et al., 2006; Redish et al., 2008) interpret these as two
complementary mechanisms for reinforcement learning (RL). The TD mechanism is
associated with dopamine and RPEs, and is “model-free” in the sense
of eschewing the representation of task structure and instead working directly by
reinforcing successful actions. The goal-directed mechanism is a separate
“model-based” RL system, which works by using a learned
“internal model” of the task to evaluate candidate actions (e.g., by
mental simulation; Hassabis and Maguire,
2007; Schacter et al., 2007; perhaps
implemented by some form of preplay; Foster and
Wilson, 2006; Johnson and Redish,
2007).

Bar one recent exception (Gläscher et al., 2010), which focused on the different issue of the neural
substrates of learning the internal model, previous studies investigating the neural
substrates of model-free and model-based control have not attempted to detect
simultaneous correlates of both as these systems learn concurrently. Thus, the way
the controllers interact is unclear, and the prevailing supposition that neural RPEs
originate from a distinct model-free system remains untested. Here we exploited the
difference between their two types of action evaluation to investigate the
interaction of the controllers in humans quantitatively, using functional MRI.
Model-free evaluation is retrospective, chaining RPEs backward across a sequence of
actions. By contrast, model-based evaluation is prospective, directly assessing
available future possibilities. Thus, it is possible to distinguish the two using a
sequential choice task.

In theory, the choices recommended by model-based and model-free strategies
depend on their own, separate, valuation computations. Thus, if behavior reflects
contributions from each strategy, then we can make the clear, testable prediction
that neural signals reflecting either valuation should dissociate from behavior
(Kable and Glimcher, 2007). Correlates of
reward prediction have most repeatedly been demonstrated in fMRI in two areas: the
ventromedial prefrontal cortex (vmPFC) and the ventral striatum (ventral putamen and
nucleus accumbens) (Delgado et al., 2000;
Hare et al., 2008; Knutson et al., 2007; Knutson et al., 2000; Lohrenz et al.,
2007; O’Doherty, 2004;
Peters and Buchel, 2009; Plassmann et al., 2007; Preuschoff et al., 2006; Tanaka et al.,
2004; Tom et al., 2007). Of these,
value-related signals in medial prefrontal cortex are sensitive to task
contingencies, and are thus good candidates for involvement in model-based
evaluation (Hampton et al., 2006, 2008; Valentin
et al., 2007). Conversely, the ventral striatal signal correlates with an
RPE (McClure et al., 2003a; O’Doherty et al., 2003; Seymour et al., 2004), and on standard accounts, is
presumed to be associated with dopamine and with a model-free TD system. If so,
these signals should reflect ignorance of task structure and
instead be driven by past reinforcement, even though subjects’ behavior, if
it is partly under the control of a separate model-based system, may be better
informed.

Contrary to this hitherto untested prediction, our results demonstrate that
reinforcement-based and model-based value predictions are combined in both brain
areas, and more particularly, that RPEs in ventral striatum do not reflect pure
model-free TD. These results suggest a more integrated computational account of the
neural substrates of valuation.
Results
Behavior
Subjects (n=17) completed a two-stage Markov decision task
(Figure 1) in which, on each trial, an
initial choice between two options labeled by (semantically irrelevant) Tibetan
characters led probabilistically to either of two, second-stage
“states,” represented by different colors. In turn, these both
demanded another two-option choice, each of which was associated with a
different chance of delivering a monetary reward. The choice of one first-stage
option led predominantly (70% of the time) to an associated one of the
two second-stage states, and this relationship was fixed throughout the
experiment. However, to incentivize subjects to continue learning throughout the
task, the chances of payoff associated with the four second-stage options were
changed slowly and independently, according to Gaussian random walks. Theory
(Daw et al., 2005; Dickinson, 1985) predicts that such change should
tend to favor the ongoing contribution of model-based evaluation.
Figure 1
(a) Timeline of events in trial. A first-stage choice between two options
(green boxes) leads to a second-stage choice (here, between two pink
options), which is reinforced with money. (b) State transition structure.
Each first-stage choice is predominantly associated with one or the other of
the second-stage states, and leads there 70% of the time.
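To make this generative structure concrete, the following is a minimal Python sketch of the task environment, using the parameters reported in Experimental Procedures (fixed 70/30 transitions; reward probabilities diffused by a Gaussian random walk of SD .025 with reflecting boundaries at .25 and .75). It is an illustration only, not the authors' task code, and the uniform-random choice policies are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TRIALS = 201
P_COMMON = 0.7       # fixed first-stage transition probability (Figure 1b)
SD_WALK = 0.025      # SD of the Gaussian random walk on reward probabilities
LO, HI = 0.25, 0.75  # reflecting boundaries on the walks

# Reward probabilities for the four second-stage options (2 states x 2 options).
p_reward = rng.uniform(LO, HI, size=(2, 2))

def step_transition(choice):
    """First-stage choice 0/1 leads to its 'common' state 70% of the time."""
    common = rng.random() < P_COMMON
    state = choice if common else 1 - choice
    return state, common

def diffuse(p):
    """Gaussian random walk with reflecting boundaries at .25 and .75."""
    p = p + rng.normal(0.0, SD_WALK, size=p.shape)
    p = np.where(p > HI, 2 * HI - p, p)  # reflect off the upper boundary
    p = np.where(p < LO, 2 * LO - p, p)  # reflect off the lower boundary
    return p

for t in range(N_TRIALS):
    c1 = rng.integers(2)                   # placeholder first-stage policy
    state, common = step_transition(c1)
    c2 = rng.integers(2)                   # placeholder second-stage policy
    reward = rng.random() < p_reward[state, c2]
    p_reward = diffuse(p_reward)
```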
Each subject undertook 201 trials, of which 2 ± 2 (mean
± 1 SD) trials were not completed due to failure to enter a response
within the two-second limit. These trials were omitted from analysis.

The logic of the task was that model-based and model-free strategies for
RL predict different patterns by which reward obtained in the second stage
should impact first-stage choices on subsequent trials. For illustration,
consider a trial in which a first-stage choice, uncharacteristically, led to the
second-stage state with which it is not usually associated, and in which the
choice then made at the second stage was rewarded. The principle of
reinforcement would predict that this experience should increase the probability
of repeating the first-stage choice since it was ultimately rewarded. However, a
subject choosing instead using an internal model of the task’s
transition structure, that evaluates actions prospectively, would be expected
instead to decrease the probability of choosing that same
option. This is because any increase in the value of the rewarded second-stage
option will more greatly increase the expected value of the first-stage option
that is more likely to lead there. This is actually the first-stage option that
was not originally chosen.

Given previous work suggesting the coexistence of multiple valuation
processes in the brain (Balleine et al.,
2008; Dickinson, 1985), we
hypothesized that subjects might exhibit a mixture of both strategies. First, to
see learning effects of this sort in a relatively theory-neutral manner, we
directly assessed the effect of events on the previous trial (trial
n) on the choice on the current trial (trial
n+1). The two key events on trial
n are whether or not reward was received, and whether the
second-stage state presented was the common or the rare one, given the first-stage
choice on trial n. We evaluated the impact of these events on
the chance of repeating the same first-stage choice on trial
n+1. For reasons outlined above, a
simple reinforcement strategy (simulated in Figure
2a using the TD algorithm SARSA(λ) for λ=1)
predicts only a main effect of reward: an ultimately rewarded choice is more
likely to be repeated, regardless of whether that reward followed a common or
rare transition. Conversely, a model-based strategy (simulated in Figure 2b) predicts a crossover interaction between
the two factors, because a rare transition inverts the effect of the subsequent
reward.
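This factorial prediction can be checked directly against the raw data. Below is a minimal sketch (assuming 1-D NumPy arrays of per-trial first-stage choices, common-transition flags, and rewards; the paper's actual inference used hierarchical logistic regression) that computes the stay proportions of Figure 2c. A purely model-free learner predicts only a row difference (a reward main effect); a model-based learner predicts that the reward effect reverses across columns (the crossover interaction).

```python
import numpy as np

def stay_table(choice1, common, reward):
    """2 x 2 table of P(stay): rows = rewarded/unrewarded on trial n,
    columns = common/rare transition on trial n (compare Figure 2c)."""
    stay = choice1[1:] == choice1[:-1]        # repeat first-stage choice?
    rew = reward[:-1].astype(bool)
    com = common[:-1].astype(bool)
    table = np.empty((2, 2))
    for i, r in enumerate([rew, ~rew]):       # rewarded, unrewarded
        for j, c in enumerate([com, ~com]):   # common, rare
            table[i, j] = stay[r & c].mean()
    return table
```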
Figure 2
Factorial analysis of choice behavior. (a) Simple reinforcement predicts that
a first-stage choice resulting in reward is more likely to be repeated on
the subsequent trial, regardless of whether that reward occurred after a
common or rare transition. (b) Model-based prospective evaluation instead
predicts that a rare transition should affect the value of the other
first-stage option, leading to a predicted interaction between the factors
of reward and transition probability. (c) Actual stay proportions, averaged
across subjects, display hallmarks of both strategies. Error bars: 1
SEM.
Figure 2c plots the observed choice
proportions as a function of these two factors, in the average across subjects.
In order to study effects that were statistically reliable at the level of the
population, we quantified the effects using hierarchical logistic regression
with all coefficients taken as random effects across subjects. At the population
level, the main effect of reward was significantly different from zero
(p<1e-8, two-tailed), demonstrating a reinforcement effect. However, the
interaction between reward and the transition probability was also significant
(p<5e-5), rejecting a pure reinforcement account and suggesting that subjects
take the transition model into account in making their choices. As both theories
predict, there was no significant main effect of transition likelihood
(p=.5). Finally, the constant term was significantly positive
(p<5e-12), suggesting an overall tendency to stick with the same option from
trial to trial, notwithstanding reward (Ito and
Doya, 2009; Kim et al., 2009;
Lau and Glimcher, 2005). We also
considered estimates of the effect sizes for each individual within this
analysis (conditional on the group level parameter estimates); the effect of
reward was positive (within the 95% confidence interval) for 14/17
subjects, and the interaction was positive for 10/17 individuals, including 7
for whom the main effect of reward was also positive. Together these data
suggest that hallmarks of both strategies are seen significantly at the
population level and within many individuals, but that there may be
between-subject variability in their deployment.

Motivated by these results, we considered the fit of full model-based
and model-free (SARSA(λ) TD; Rummery and
Niranjan, 1994) RL algorithms to the choice sequences. The former
evaluates actions by prospective simulation in a learned model; the latter uses
a generalized principle of reinforcement. The generalization, controlled by the
reinforcement eligibility parameter λ, is that the estimated value of
the second-stage state should act as the same sort of model-free reinforcer for
the first-stage choice as the final reward actually received after the
second-stage choice. The parameter λ governs the relative importance of
these two reinforcers, with λ=1 being the special case of Figure 2a in which only the final reward is
important, and λ=0 being the purest case of the TD algorithm in
which only the second-stage value plays a role.

We also considered a hybrid theory (Gläscher et al., 2010) in which subjects could run both
algorithms in parallel, and make choices according to the weighted combination
of the action values that they produce (see Experimental
Procedures). We took the relative weight of the two
algorithms’ values in determining the choices to be a free parameter,
which we allowed to vary across subjects but assumed to be constant throughout
the experiment. Thus, this algorithm contains both the model-based and TD
algorithms as special cases, where one or the other gets all weight. We first
verified that the model fit significantly better than chance; it did so, at
p<.05, for all 17 subjects (likelihood ratio tests).

We estimated the theory’s free parameters individually for each
subject by maximum likelihood (Table 1).
Such an analysis treats each subject as occupying a point on a continuum trading
off the two strategies; tests of the parameter estimates across subjects seek
effects that are generalizable to other members of the population (analogous to
the random effects level in fMRI; Holmes and
Friston, 1998). Due to non-Gaussian statistics (since the parameters
are expected to lie in the unit range), we analyzed the estimated
parameters’ medians using nonparametric tests. Across subjects, the
median weighting for model-free RL values was 61% (with model-based RL
at 39%), which was significantly different from both 0 and 100%
(sign tests, Ps<.005), again suggesting both strategies were mixed in the
population. The second important parameter is the reinforcement eligibility
parameter λ, which controls the two reinforcement effects in TD, i.e.,
the relative influence of the estimated value of the second-stage state and the
ultimate reward on the model-free value of the first-stage choice. Across
subjects, the median estimate for λ was 0.57 (significantly different
from 0 and 1; sign tests, Ps<.05), suggesting that at the population level
reinforcement occurred in part according to TD-like value chaining
(λ<1) and in part according to direct reinforcement
(λ>0).
Table 1
Best-fitting parameter estimates, shown as median and quartiles across
subjects. Also shown are medians and quartiles for the negative
log-likelihood (−LL) of the data at the best-fitting parameters, and a
pseudo-r2 statistic (p-r2), a normalized measure of the degree to which the
model explained the choice data.

             β1     β2     α1     α2     λ      p      w      −LL      p-r2
25th pctile  2.76   2.69   0.46   0.21   0.41   0.02   0.29   167.74   .17
median       5.19   3.69   0.54   0.42   0.57   0.11   0.39   200.55   .26
75th pctile  7.45   5.16   0.87   0.71   0.94   0.22   0.59   228.22   .40
Since analyzing estimates of the free parameters does not speak to their
necessity for explaining data, we used both classical and Bayesian model
comparison to test whether these free parameters of the full model were
justified by data, relative to four simplifications. We tested the special cases
of TD(λ) and model-based RL alone, plus the hybrid model using only
direct reinforcement or value chaining (i.e., with λ restricted to 0 or
1). The results in Table 2 show the
superiority of the hybrid model both in the aggregate over subjects and also, in
most tests, for the majority of subjects considered individually. Finally, we
fit the hierarchical model of Stephan et al.
(2009) to treat the identity of the best fitting model as a random
effect that itself could vary across subjects. The exceedance probabilities from
this analysis, shown in Table 2, indicate that the hybrid model had the highest
chance (with probability 92%) of being the most common model in the
population. The same analysis estimated the expected proportion of each sort of
learner in the population; here the hybrid model was dominant (at 48%),
followed by TD at 18%.
Table 2
Model comparisons between the full (hybrid) model and its special cases. Shown
for each model are the raw negative log likelihood (−LL); the number of
subjects favoring the hybrid model on a likelihood ratio test (p<.05); the
test statistic and p value for a likelihood ratio test against the hybrid
model, aggregated across subjects; the negative log model evidence
−log(P(M|D)); the number of subjects favoring the hybrid model
according to the model evidence; the log Bayes factor favoring the hybrid
model, in the aggregate over subjects; and the Bayesian exceedance
probability (Stephan et al., 2009), or probability that each model is the
most common among the five over the population. The first three data columns
are classical, the next three Bayesian.

                   −LL    # fav.   agg. LRT favoring hybrid    −log(P(M|D))   # fav.   agg. log BF   exceedance
                          hybrid                                              hybrid                 prob.
hybrid             3364   -        -                           3564           -        -             .92
TD only            3418   5        χ2(17) = 108, p < 5e-15     3594           11       30.0          .031
model-based only   3501   14       χ2(51) = 273, p < 5e-16     3646           15       82.4          .0019
λ=0                3452   14       χ2(17) = 176, p < 5e-16     3627           16       62.9          .0012
λ=1                3392   4        χ2(17) = 54.5, p < 1e-5     3573           8        8.87          .049
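As a worked check of the aggregate classical comparisons in Table 2: the test statistic is twice the difference in summed negative log likelihood, referred to a χ2 distribution whose degrees of freedom count the restricted parameters over all 17 subjects (one parameter, w, per subject for the TD-only restriction, hence 17; three per subject, hence 51, for the model-based-only restriction). A sketch using the table's numbers:

```python
from scipy.stats import chi2

# Aggregate likelihood-ratio test, hybrid vs. TD-only (values from Table 2).
nll_hybrid, nll_td = 3364, 3418
stat = 2 * (nll_td - nll_hybrid)  # = 108, as reported
p = chi2.sf(stat, df=17)          # on the order of 3e-15, i.e., p < 5e-15
print(stat, p)
```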
Together, these analyses provided compelling support for the proposition
that the task exercised both model-free and model-based learning strategies,
albeit with evidence for individual variability in the degree to which subjects
deploy each of them. Next, armed with the trial-by-trial estimates of the values
learned by each putative process, from the hybrid algorithm (refit using a
mixed-effects model for more stable fMRI estimates; Table 3), we sought neural signals related to these
valuation processes.
Table 3
Mixed-effects parameter estimates used for fMRI regressors.

β1     β2     α1     α2     λ      p      w                    −LL    p-r2
4.23   2.95   0.70   0.40   0.63   0.17   mean 0.51, SD 0.31   3702   .22
Neuroimaging
BOLD responses in a number of regions – notably the striatum and
the ventromedial prefrontal cortex (vmPFC) – have repeatedly been shown
to covary with subjects’ value expectations (Berns et al., 2001; Hare et al., 2008; O’Doherty
et al., 2007). The ventral striatum has been closely associated with
model-free RL, and so a prime question is whether BOLD signals in this structure
indeed reflect model-free knowledge alone, even for subjects whose actual
behavior shows model-based influences.

To investigate this question, we sought voxels where BOLD activity
correlated with two candidate timeseries. The first timeseries was the standard
RPE based on model-free TD, using just the timepoints of the transition to the
second-stage and the delivery of the outcome in order to avoid uncertainty about
the appropriate baseline against which to measure the first-stage prediction
(see Supplemental Experimental
Procedures). The second timeseries involved subtracting these TD
prediction errors from the RPEs that would arise if the predictions had been
model-based rather than model free (Daw, in
press; Friston et al., 1998;
Wittmann et al., 2008).

We adopted this approach (rather than simply including both model-free
and model-based RPEs as explanatory variables) to reduce the correlation between
the regressors of interest, and also because it encompassed the test of the null
hypothesis that RPE signaling in striatum was purely model-free. If so, then the
signal would be accounted for entirely by the model-free regressor, and the
difference timeseries should not correlate significantly. If, however, the BOLD
signal reflected pure model-based values, or any combination of both, then it
would be best described by some weighted combination of the two regressors; that
is, the difference regressor would account for residual BOLD activity in
addition to that accounted for by the model-free RPE. We tested the conjunction
of the two regressors to verify whether BOLD activity in a voxel was indeed
significantly correlated with the weighted sum of both (Nichols et al., 2005).

Figure 3a shows that BOLD activity
correlated significantly with the model-free RPE timeseries in left and right
ventral striatum (both p<.001; except where noted, all reported statistics
are corrected at the cluster level for familywise error due to whole-brain
multiple comparisons). Moreover, this activity was better characterized, on
average, as including some model-based valuation: the model-based difference
regressor loaded significantly (right, p<.005, left, p<.05; Figure 3b) in the same area (conjunction; right,
p<.01 whole-brain corrected; left, p<.01 small-volume corrected within an
anatomically defined mask of the bilateral nucleus accumbens; Figure 3c). Similar results, though less strong, were
also observed in medial/ventromedial prefrontal cortex (vmPFC), where both
model-free RPE (p<.001; Figure 4a) and
the difference regressor indicating model-based valuation (p<.01; Figure 4b) correlated significantly with BOLD
activity. However, although the conjunction between these two maps showed voxels
significant at p<.001 uncorrected, it survived whole-brain multiple
comparison correction for cluster size (at p<.005 corrected; Figure 4c) only when the threshold on the conjunction
map was relaxed to p<.005 uncorrected. (Note that cluster size correction is
valid independent of the threshold on the underlying uncorrected map, though
examining additional thresholds implies additional multiple comparisons; Friston et al., 1993.)
Figure 3
Neural correlates of model-free and model-based valuations in RPE in
striatum. All maps thresholded at p<.001 uncorrected for display. (a)
Correlates of model-free RPE in bilateral striatum (left peak: −12
10 4, right: 10 12 −4). (b) RPE signaling in ventral striatum is
better explained by including some model-based predictions: correlations
with the difference between model-based and model-free RPE signals (left:
−10 6 12, right: 12 16 −8). (c) Conjunction of contrasts
from a and b (left: −12 10 −10, right: 12 16 −6).
(d) Region of right ventral striatum where the weight given to model-based
valuations in explaining the BOLD response correlated, across subjects, with
that derived from explaining their choice behavior (14 20 −6). (e)
Conjunction of contrasts from a and d (14 20 −6). (f) Scatterplot of
the correlation from d, from average activity over an anatomically defined
mask of right ventral striatum. (r2 =.28,
p=.027).
Figure 4
Neural correlates of model-free and model-based valuations in RPE in medial
PFC. Thresholded at p<.001 uncorrected (a and b) or p<.005 uncorrected
(c) for display. (a) Correlates of model-free RPE in medial PFC (−4
66 14). (b) RPE signaling in medial PFC is better explained by including
some model-based predictions: correlations with the difference between the
two RPE signals (−4 56 14). (c) Conjunction of contrasts from a and
b (−4 62 12).
These results suggested that RPE-related BOLD signals in ventral
striatum, and also vmPFC, reflected valuations computed at least in part by
model-based methods rather than pure TD. To investigate this activity further,
we compared neural and behavioral estimates of the degree of reliance on
model-based valuation, across subjects. The neural and behavioral estimates
should correlate if, though computed using different observables, they were
measuring the same phenomenon, and if RPE activity in striatum was related to a
behaviorally relevant mixture of model-based and model-free values, rather than
to one or the other. We measured the degree of model-based valuation in the
neural signal by the effect size estimated for the model-based difference
regressor (with a larger weighting indicating that the net signal represented an
RPE more heavily weighted toward model-based values). Behaviorally, we assessed
the degree of model-based influence on choices by the fit of the weighting
parameter w in the hybrid algorithm. Significant correlation
between these two estimates was indeed detected in right ventral striatum
(p<.01 small volume corrected within an anatomical mask of bilateral nucleus
accumbens; Figure 3d); and the site of this
correlation overlapped the basic RPE signal there (p<.01 small volume
corrected; Figure 3e). Figure 3f illustrates a scatterplot of the effect,
here independently re-estimated from BOLD activity averaged over an anatomically
defined mask of right nucleus accumbens. The finding of consistency between both
these estimates helps to rule out unanticipated confounds specific to either
analysis.

Altogether, these results suggested that BOLD activity in striatum
reflected a mixture of model-free and model-based evaluations, in proportions
matching those that determine choice behavior. Finally, in order more directly
to characterize this activity and to interrogate this conclusion via an analysis
using different data-points and weaker theoretical assumptions, we subjected
BOLD activity in ventral striatum to a factorial analysis of its dependence on
the previous trial’s events, analogous to that used for choice behavior
in Figure 2. In particular, the TD RPE when
a trial starts reflects the value expected during the trial (as in the
anticipatory activity of Schultz et al.,
1997), which can be quantified as the predicted value of the
top-level action chosen (Morris et al.,
2006). For reasons analogous to those discussed above for choice
behavior, learning by reinforcement as in TD(λ) (for λ>0)
predicts that this value should reflect the reward received following the same
action on the previous trial. However, a model-based valuation strategy instead
predicts that this previous reward effect should interact with whether the
previous choice was followed by a common or rare transition.

We therefore examined BOLD activity at the start of trials in right
ventral striatum (defined anatomically), as a function of the reward and
transition on the previous trial. For reasons mentioned above, these signals did
not form part of the previously described parametric RPE analyses. In order to
isolate activity specifically related to the same action that had been learned
about on the previous trial, we restricted our assessment to those trials in
which the same action was chosen twice in a row (Morris et al., 2006). As seen in Figure 5a, there was a main effect of reward (p<.005), consistent
with TD-like valuation. This, to our knowledge, is the first time that RPEs in
BOLD have been directly shown to exhibit learning through an explicit dependence
on previous-trial outcomes (Bayer and Glimcher,
2005). Across subjects, the interaction with the transition
probability – the marker for model-based evaluation – was not
significant (p>.4), but the size of the interaction per-subject (taken as
another neural index of the per-subject model-based effect) correlated with the
behavioral index of model-based valuation (p<.02; Figure 5b). This last result further confirmed that
striatal BOLD reflected model-based valuation to the extent that choice behavior
did. Indeed, speaking to the consistency of the results, although the two neural
estimates reported here for the extent of model-based valuation in the striatal
BOLD signal (Figures 3f and 5b) were generated from different analytical
approaches, and based on activity modeled at different timepoints within each
trial, they significantly correlated with one another
(r2=0.37; p<.01).
Figure 5
Factorial analysis of BOLD signal at start of trial, from average activity
over an anatomical mask of right nucleus accumbens. (a) Signal change
(relative to mean) as a function of whether the choice on the previous trial
was rewarded or unrewarded, and whether that occurred after a
common or rare transition (compare Figure
2c). Error bars: 1 SEM. (b) Scatterplot of the correlation, across
subjects, between the contrast measuring the size of the interaction between
reward and transition probability (an index of model-based valuation), and
the weight given to model-based vs model-free valuations in explaining
choice behavior. (r2=0.32, p=.017).
Discussion
We studied human choice behavior and BOLD activity in a two-stage decision
task that allowed us to disambiguate model-based and model-free valuation strategies
through their different claims about the effect of second-stage reinforcement on
first-stage choices and BOLD signals. Here, ongoing adjustments in the values of
second-stage actions extended the one-shot reward devaluation challenge often used
in animal conditioning studies (Dickinson,
1985) and also the introduction of novel goals as in latent learning
(Gläscher et al., 2010): they
continually tested whether subjects prospectively adjusted their preferences for
actions leading to a subsequent incentive (here, the second-stage state), when its
value changed. Following Daw et al., (2005),
we see such reasoning via sequential task structure as the defining feature that
distinguishes model-based from model-free approaches to RL (although Hampton et al., 2006, and Bromberg-Martin et al., 2010 hold a somewhat different view: they
associate model-based computation with learning nonsequential task structure as
well).

We recently used a similar task in a complementary study (Gläscher et al., 2010), which minimized learning
about the rewards (by reporting them explicitly and keeping them stable), to isolate
learning about the state transition contingencies. Here by contrast, we minimized
transition learning (by partly instructing subjects) and introduced dynamic rewards
to allow us to study the learning rules by which neural signals tracked them. This,
in turn, allowed us to test an uninvestigated assumption of the analysis in the
previous paper, i.e. the isolation of model-free value learning as expressed in the
striatal RPE.

Our previous computational theory of multiple RL systems in the brain (Daw et al., 2005) focused on a dynamic
mechanism for trading off the reliance on model-based and model-free valuations
based on their relative uncertainties. In the current task, the ever-changing
rewards should keep the trade-off roughly constant over time, allowing us to focus
on the broader two-system structure of this theory. Rather than confronting the many
(unknown) factors that determine the uncertainties of each system within each
subject, we treated the balance between the two processes as exogenous, controlled
by a constant free parameter (w), whose value we could estimate.
Indeed, consistent with our intent, there was no significant trend (analyses not
presented) towards progressive habit formation (Adams, 1982; Gläscher et al.,
2010).

Nevertheless, consistent with findings from animal learning (Balleine and O’Doherty, 2010; Balleine et al., 2008; Dickinson, 1985; Dickinson and Balleine,
2002), we found clear evidence for both TD-like and model-based valuations,
suggesting that the brain employs a combination of both strategies. The standard
view is that the two putative systems work separately and in parallel, a view
reinforced by the strong association of the mesostriatal dopamine system with
model-free RL and the fact that, in animal studies, each system appears to operate
relatively independently when brain areas associated with the other are lesioned
(Killcross and Coutureau, 2003; Yin et al., 2004; Yin et al., 2005). Also consistent with this idea,
previous work (Hampton et al., 2006, 2008) suggested model-based influences on the
vmPFC expected value signal, but did not test for additional model-free influences
there, nor conversely whether model-based influences also affected striatal RPEs.
Here we found that even the signal most associated with model-free RL, the striatal
RPE, reflects both types of valuation, combined in a way that matches their observed
contribution to choice behavior. The finding that a similar result in vmPFC was
weaker may reflect the fact that neural signaling there is, in some studies, better
explained by a correlated variable, expected future value, and not RPE per se (Hare et al., 2008); residual error due to such
a discrepancy could suppress effects there. However, in a sequential task these two
quantities are closely related; thus, unlike Hare’s, the present study was
not designed to dissociate them.

Our ventral striatal finding invites a re-evaluation of the standard account
of RPE signaling in the brain, since it suggests that even a putative TD system does
not exist in isolation from model-based valuation. One possibility about what might
replace this account is suggested by contemplating an infelicity of the algorithm
used here for data analysis. In order to reject the null hypothesis of purely
model-free RPE signaling, we defined a generalized RPE with respect to model-based
predictions as well. However, this augmented signal was nugatory, in the sense that
model-based RPEs played no role in our account of choice behavior. Indeed,
model-based learners do not rely on model-based RPEs: the learning problem they face
– tracking state transition probabilities and immediate rewards rather than
cumulative future rewards – demands different training signals (Gläscher et al., 2010).

This apparent mismatch encourages consideration of a hybrid of a different
sort. We have so far examined theories in which model-based and model-free
predictions compete directly to select actions (Daw
et al., 2005). However, model-based and model-free RPEs could also
usefully be integrated for training. For instance, consider the standard
actor-critic account (Barto et al., 1983;
Barto, 1995). This uses RPEs derived from
model-free predictions (the critic) to reinforce action selection policies (the
actor). Errors in model-based predictions, if available, could serve the same
purpose. A model-free actor trained, in part, by such a
“model-based” critic would, in effect, cache (Daw et al., 2005) or memorize the recommendations of a
model-based planner, and could execute them subsequently without additional
planning.

The computational literature on RL includes some related ideas, in
algorithms such as prioritized sweeping (Moore and
Atkeson, 1993), which caches the results of model-based evaluation
(albeit without a model-free component), and Dyna (Johnson and Redish, 2005; Sutton,
1990) which trains a model-free algorithm (though offline) using
simulated experiences generated from a world model. In neuroscience, various
theories have been proposed in which a world model impacts the input to the
model-free system (Bertin et al., 2007; Daw et al., 2006a; Doya, 1999; Doya et al.,
2002). The architecture suggested here more closely resembles the
“biased” learning hypothesized by Doll et al. (2009), according to which top-down information (there
provided by experimenter instructions rather than a learned world model) modifies
the target of model-free RL. Outside the domain of learning, striatal BOLD responses
are indeed affected by values communicated by instruction rather than experience
(Fitzgerald et al., 2010; Tom et al., 2007) and also by emotional self-regulation
(Delgado et al., 2008).

Further theoretical work is needed to characterize the different algorithms
suggested by this general architecture. However, in general, by preserving the
overall structure of parallel model-based and model-free systems – albeit
exchanging information at an earlier level – the proposal of a model-based
critic would appear to remain consistent with the lesion data suggesting that the
systems can function in isolation (Killcross and
Coutureau, 2003; Yin et al., 2004;
Yin et al., 2005) and with behavioral
data demonstrating that distinct decision systems may have different properties and
can be differentially engaged in different circumstances (Doeller and Burgess, 2008; Frank et al., 2007; Fu
and Anderson, 2008). It also remains consistent with other fMRI studies
(Doeller et al., 2008; Poldrack et al., 2001; Venkatraman et al., 2009) suggesting that overall activity in different
brain systems associated with either system can modulate with time or circumstances,
presumably in relation to the extent that either process is engaged.

Apart from training, a different use for model-based RPEs would be for
online action evaluation and selection. In particular, Doya (1999) proposed that a world model could be used to
predict the next state following a candidate action, and that a dopaminergic RPE
with respect to that projected state could then be used to evaluate whether the
action was worth taking (in a scheme related to that suggested by McClure et al., 2003b; Montague et al., 1995; Montague et al.,
1996). RPEs for planning would appear to be categorically different in
timing and content from RPEs for learning, in that the former are triggered by
hypothetical state transitions and the latter by actual ones, as in the effects
reported here. The Doya (1999) circuit also
differs from a full model-based planner in that it envisions only a single step of
model-based state lookahead; however, to test this limitation would require a task
with longer sequences.

In the present study, as in most fMRI studies of RPEs, our effects focused
on ventral striatum, and we did not see any correlates of the organization of
striatum into components associated with different learning strategies as suggested
by the rodent literature (Yin et al., 2004;
Yin et al., 2005). Further, although
there is evidence suggesting that RPE effects in the ventral striatal BOLD signal
reflect, at least in part, dopaminergic action there (Knutson and Gibbs, 2007; Pessiglione et al., 2006; Schonberg et
al., 2010), the BOLD signal in striatum likely conflates multiple causes
including also cortical input and local activity, and it is thus not possible to
identify it uniquely with dopamine. Indeed, it is possible, even if the effects
attributed to our model-free RPE regressor are dopaminergic in origin, that the
residual effects captured by the model-based difference regressor in the same voxels
arise from other sources. The questions raised by the present study thus invite
resolution by testing a similar multistep task in animals using dopamine unit
electrophysiology or voltammetry. In this respect, recent results by Bromberg-Martin et al. (2010) showing that in a serial
reversal task (albeit nonsequential) a dopaminergic RPE response is more
sophisticated than a basic TD theory would predict, provide a tantalising clue that
our results might hold true of dopaminergic spiking as well.

Overall, by demonstrating that it is feasible to detect neural and
behavioral signatures of both learning strategies, the present study opens the door
to future within-subject studies targeted at manipulating and tracking the tradeoff
dynamically, and thence, at uncovering the computational mechanisms and neural
substrates for controlling it. Such meta-control of decision systems is of
particular practical importance, for instance because the compulsive nature of drug
abuse has been proposed to result from aberrant expression of habitual control
(Everitt and Robbins, 2005), and similar
mechanisms have also, plausibly, been linked to other serious issues of
self-control, including undersaving and overeating (Loewenstein and O’Donoghue, 2004).
Experimental Procedures
Participants and behavioral task
Seventeen healthy adults (five female; mean age 25.8 years) participated
in this study. All participants gave written informed consent, and the study was
conducted in accordance with the guidelines of the local ethics committee.

The task consisted of 201 trials, in three blocks of 67, separated by
breaks. The events in the trial are sketched in Figure 1a. Each trial consisted of two stages. In the first stage,
subjects used an MR compatible button box to choose between two options,
represented by Tibetan characters in colored boxes. If subjects failed to enter
a choice within 2 seconds, the trial was aborted. The chosen option rose to the
top of the screen, while the option not chosen faded and disappeared. At the
second stage, subjects were presented with one of two further two-option
choices (“states”), and entered another choice. The second
choice was rewarded with money (depicted by a pound coin, though subjects were
paid 20% of this amount), or not (depicted by a zero). Trials were
separated by an inter-trial interval of randomized length, on average about
1 TR.

Which second-stage state was presented depended, probabilistically, on
the first-stage choice, according to the transition scheme shown in Figure 1b. The assignment of colors to states
was counterbalanced across subjects, and the two options at each state were
permuted pseudorandomly between left and right from trial to trial. Each
bottom-stage option was rewarded according to a probability associated with that
option. In order to encourage ongoing learning, these reward probabilities were
diffused at each trial by adding independent Gaussian noise (mean 0, SD .025),
with reflecting boundaries at .25 and .75.

In a computerized training session prior to the fMRI task, subjects were
instructed that the reward probabilities would change, but those controlling the
transitions from the first to the second stage would remain fixed. They were
also instructed about the overall structure of the transition matrix,
specifically, that each first-stage option was primarily associated with one or
the other of the second-stage states, but not which one. Prior to the scanning
session, to familiarize themselves with the structure of the task, subjects
played 50 trials on a practice task using a different stimulus set.
Behavioral analyses
We first conducted a logistic regression in which the dependent variable
was the first-stage choice (coded as stay vs. switch), and the explanatory
variables were the reward received on the previous trial, an indicator
variable for whether the previous trial’s transition was common
or rare, and the interaction of the two. We took all coefficients as random
effects across subjects, and estimated this multilevel regression using the lme4
linear mixed effects package (Bates and Maechler,
2010) in the R statistical language (R Development Core Team, 2010).
We also extracted posterior effect size estimates (conditional on the estimated
population-level prior) and confidence intervals from the posterior covariance
for each of the individuals from this fit. The predictions in Figures 2a,b are derived from simulations of SARSA(1)
and model-based algorithms (below), using the parameters best fit to the
subjects’ data within each class of algorithm.
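For illustration, a simplified sketch of this regression design follows. The published analysis fit a single hierarchical (random-effects) model in lme4; here, as a hypothetical stand-in, each subject would be fit separately by ordinary logistic regression (statsmodels), with reward and transition effect-coded so that the interaction coefficient carries the model-based signature.

```python
import numpy as np
import statsmodels.api as sm

def lagged_design(choice1, common, reward):
    """Design matrix predicting stay (trial n+1) from trial n's events."""
    stay = (choice1[1:] == choice1[:-1]).astype(float)
    rew = np.where(reward[:-1] > 0, 1.0, -1.0)   # effect coding: +1/-1
    com = np.where(common[:-1], 1.0, -1.0)
    X = np.column_stack([np.ones_like(rew), rew, com, rew * com])
    return X, stay

# Per-subject fit; coefficients: intercept (stay bias), reward main effect,
# transition main effect, and the reward x transition interaction.
# X, y = lagged_design(choice1, common, reward)
# fit = sm.Logit(y, X).fit(disp=0)
# print(fit.params)
```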
Computational model of behavior
In a second set of analyses, we fit choice behavior to an algorithm that
is similar to the hybrid algorithm of Gläscher et al. (2010). In particular, it learned action
values via both model-based RL (explicit computation of Bellman’s
equation) and by model-free SARSA(λ) TD learning (Rummery and Niranjan, 1994), and assumed choices were
driven by the weighted combination of these two valuations. The relative
weighting was controlled by a free parameter w, which we
assumed to be constant across trials. We also computed TD RPEs with respect to
both the model-free and model-based valuations, and, for fMRI analysis, defined
a “difference regressor” as the difference between them. Full
equations are given in Supplemental Experimental Procedures.
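Full equations are in the supplement; as an unofficial sketch consistent with the description above, one trial of the hybrid learner might look as follows. Parameter names follow Table 1; `env` is an assumed task interface like the simulation sketched under Behavior, and details such as the exact placement of the perseveration bonus follow common implementations rather than the authors' code.

```python
import numpy as np

def softmax(q, beta):
    e = np.exp(beta * (q - q.max()))  # subtract max for numerical stability
    return e / e.sum()

def hybrid_trial(Qmf1, Qmf2, last_c1, params, rng, env):
    """One trial. Qmf1: (2,) stage-1 model-free values; Qmf2: (2, 2) stage-2
    values (shared by both systems, which coincide at the second stage)."""
    b1, b2, a1, a2, lam, p, w = params

    # Model-based stage-1 values: Bellman backup through the known 70/30
    # transition structure to the best second-stage option in each state.
    Qmb1 = np.array([0.7 * Qmf2[0].max() + 0.3 * Qmf2[1].max(),
                     0.3 * Qmf2[0].max() + 0.7 * Qmf2[1].max()])

    # Weighted combination, plus a perseveration bonus for repeating.
    Qnet = w * Qmb1 + (1 - w) * Qmf1
    rep = np.zeros(2)
    if last_c1 is not None:
        rep[last_c1] = 1.0
    c1 = rng.choice(2, p=softmax(Qnet + p * rep, b1))

    state, _ = env.step_transition(c1)
    c2 = rng.choice(2, p=softmax(Qmf2[state], b2))
    r = float(env.reward(state, c2))

    # SARSA(lambda): the stage-1 value is reinforced both by the second-stage
    # value (via d1) and, scaled by lambda, by the final reward (via d2).
    d1 = Qmf2[state, c2] - Qmf1[c1]
    d2 = r - Qmf2[state, c2]
    Qmf1[c1] += a1 * (d1 + lam * d2)
    Qmf2[state, c2] += a2 * d2
    return c1, state, c2, r
```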
Behavioral estimation
For behavioral analysis, we estimated the free parameters of the
algorithm separately for each subject, to maximize the log-likelihood of the
data (from the log of Equation 2 summed over all trials), for the choices
actually made conditioned on the states and rewards previously encountered. We
constrained the learning rates to lie between zero and one, but allowed
λ and w (which also nominally range between zero and
one) to float arbitrarily beyond these boundaries, so as to make meaningful the
tests whether the median estimates were different from the nominal boundaries
across the population.

For classical model comparison, we repeated this procedure for the
nested subcases, and tested the null hypothesis of the parametric restriction
(either individually per subject, or for likelihoods aggregated over the
population) using likelihood ratio tests. For Bayesian model comparison, we
computed a Laplace approximation to the model evidence (MacKay, 2003) integrating out the free parameters;
this analysis requires a prior over the parameters, which we took to be
Beta(1.1,1.1) for the learning rates, λ and w,
Normal(0,1) for p, and Gamma(1.2,5) for the softmax
temperatures, selected so as to be uninformative over the parameter ranges we
have seen in previous studies, and to roll off smoothly at parametric
boundaries. We also fit the model of Stephan et
al. (2009), which takes model identity as a random effect, by
submitting the Laplace-approximated log model evidences to the spm_BMS routine
from spm8.

Thus, we performed all behavioral analyses assuming the parameters (and
in some cases the model identity) as random effects across subjects. However, to
generate regressors for neural analyses on a common scale, we refit the
algorithm to the choices taking only w as a random effect,
instantiated once per subject, and assuming common values for the other
parameters. This is because in these sorts of algorithms, noise and variation in
parameter estimates from subject to subject result, effectively, in a rescaling
of regressors between subjects, which suppresses the significance of neural
effects in a subsequent second-level fMRI analysis, producing poor results
(Daw, in press; Daw et al., 2006b; Gershman et al., 2009; Schonberg et
al., 2007; Schonberg et al.,
2010).
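A minimal sketch of the per-subject estimation, assuming a hypothetical neg_log_lik(theta) implementing the model's choice likelihood (the actual objective is Equation 2 of the supplement, summed over trials). Learning rates are box-constrained to [0, 1] while λ, p, and w are left unbounded, as described; the bounds on the softmax temperatures are an arbitrary illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_subject(neg_log_lik, n_starts=10, seed=1):
    """Maximum-likelihood fit of theta = (b1, b2, a1, a2, lam, p, w).
    Learning rates are constrained to [0, 1]; lambda and w are left free so
    that tests against the nominal 0/1 boundaries remain meaningful."""
    rng = np.random.default_rng(seed)
    bounds = [(0, None), (0, None),    # softmax temperatures (assumed >= 0)
              (0, 1), (0, 1),          # learning rates
              (None, None), (None, None), (None, None)]  # lam, p, w
    best = None
    for _ in range(n_starts):          # multistart to dodge local optima
        x0 = np.concatenate([rng.uniform(0, 5, size=2),
                             rng.uniform(0, 1, size=5)])
        res = minimize(neg_log_lik, x0, method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best
```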
fMRI procedures
Functional imaging was conducted using a 1.5T Siemens Sonata MRI scanner
to acquire gradient echo T2*-weighted echo-planar images (EPI) with
blood oxygenation level dependent (BOLD) contrast. Standard preprocessing was
performed; see Supplemental
Experimental Procedures for full details of preprocessing and
acquisition.
fMRI analysis
The fMRI analysis was based around the timeseries of model-free and
model-based RPEs as generated from the simulation of the model over each
subject’s experiences. We defined two parametric regressors –
the model-free RPE, and the difference between the model-free and model-based
RPEs. The latter regressor characterizes how net BOLD activity would differ if
it were correlated with model-based RPEs or any weighted mixture of both. For
each trial, the RPE timeseries were entered as parametric regressors modulating
impulse events at the second-stage onset and reward receipt. To test the
correspondence between behavioral and neural estimates of the model-based
effect, we also included the per-subject estimate of the model-based effect
(w, above) from the behavioral fits as a second-level
covariate for the difference regressor. A full description of the analysis is
given in Supplemental
Experimental Procedures.

For display purposes, we render activations at an uncorrected threshold
of p<.001 (except relaxing this in one case to p<.005), overlaid on the
average of subjects’ normalized structural images. For all reported
statistics, we subjected these uncorrected maps to cluster-level correction for
family-wise error due to multiple comparisons over the whole brain, or, in a few
cases (noted specifically) over a small volume defined by an anatomical mask of
bilateral nucleus accumbens. This mask was hand-drawn on the subject-averaged
structural image, according to the guidelines of Breiter et al. (Ballmaier et al., 2004; Breiter et al., 1997; Schonberg et al., 2010), notably, defining the nucleus’
superior border by a line connecting the most ventral point of the lateral
ventricle to the most ventral point of the internal capsule at the level of the
putamen. Conjunction inference was by the minimum t-statistic
(Nichols et al., 2005) using the
conjunction null hypothesis. The difference regressor was orthogonalized against
the RPE regressor, so that up to minor correlation that can be reintroduced by
whitening and filtering, it captured only residual variation in BOLD activity
not otherwise explained by the model-free RPE. However, note that conjunction
inference via the minimum t-statistic is valid even when the
conjoined contrasts are not independent (Nichols
et al., 2005).
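Stripped of the HRF convolution, whitening, and filtering handled within SPM, the construction of the two parametric regressors amounts to the following sketch (illustration only, with the difference regressor orthogonalized against the RPE as described):

```python
import numpy as np

def build_regressors(rpe_mf, rpe_mb):
    """Model-free RPE regressor plus the (model-based minus model-free)
    difference, orthogonalized so it captures only residual variance."""
    x = rpe_mf - rpe_mf.mean()
    d = rpe_mb - rpe_mf
    d = d - d.mean()
    d_orth = d - ((d @ x) / (x @ x)) * x   # project out the RPE regressor
    return x, d_orth
```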
ROI analyses
We also used the right-hemisphere portion of the mask of nucleus
accumbens (right being the side on which we have previously observed stronger
RPE activity; e.g. Daw et al., 2006b;
Wittmann et al., 2008) to define the
region of interest for two analyses conducted with the MarsBaR ROI toolbox
(Brett et al., 2002). First, average
activity from the region was extracted and subjected to the same analysis as
described above, to produce Figure 3f.
Second, the activity from the region was subjected to a second regression analysis
using a different design, which tagged the first-stage onset of each trial with
an impulse regressor of one of five types: switches (trials on which the
opposite first-stage choice was made than on the previous trial), and, for
stays, four types of events modelling all combinations of the factors reward vs.
nonreward and common vs. rare transition in the previous trial. An additional
nuisance regressor was included at the time of outcomes. Per-subject effect
sizes for the four “stay” regressors were subjected to a
2×2 repeated-measures ANOVA, and, additionally, the value for each
subject of the contrast measuring the interaction of the two factors
([reward/common minus nonreward/common] minus
[reward/rare minus nonreward/rare]) was correlated with the
weight given to model-based values (the estimated parameter w)
from the behavioral fit.
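Finally, as a sketch of that last step (variable names hypothetical; the per-subject effect sizes would come from the MarsBaR fits), the interaction contrast and its across-subject correlation with the behavioral weight:

```python
import numpy as np
from scipy.stats import pearsonr

def interaction_contrast(betas):
    """betas: per-subject effect sizes for the four 'stay' regressors, ordered
    [reward/common, reward/rare, nonreward/common, nonreward/rare]."""
    rc, rr, nc, nr = betas
    return (rc - nc) - (rr - nr)

# contrasts = np.array([interaction_contrast(b) for b in subject_betas])
# r, p = pearsonr(contrasts, w_behavioral)   # cf. Figure 5b: r2=.32, p=.017
```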