Jaron T Colas, Wolfgang M Pauli, Tobias Larsen, J Michael Tyszka, John P O'Doherty.
Abstract
Prediction-error signals consistent with formal models of "reinforcement learning" (RL) have repeatedly been found within dopaminergic nuclei of the midbrain and dopaminoceptive areas of the striatum. However, the precise form of the RL algorithms implemented in the human brain is not yet well determined. Here, we created a novel paradigm optimized to dissociate the subtypes of reward-prediction errors that function as the key computational signatures of two distinct classes of RL models, namely "actor/critic" models and action-value-learning models (e.g., the Q-learning model). The state-value-prediction error (SVPE), which is independent of actions, is a hallmark of the actor/critic architecture, whereas the action-value-prediction error (AVPE) is the distinguishing feature of action-value-learning algorithms. To test for the presence of these prediction-error signals in the brain, we scanned human participants with a high-resolution functional magnetic-resonance imaging (fMRI) protocol optimized to enable measurement of neural activity in the dopaminergic midbrain as well as the striatal areas to which it projects. In keeping with the actor/critic model, the SVPE signal was detected in the substantia nigra. The SVPE was also clearly present in both the ventral striatum and the dorsal striatum. However, alongside these purely state-value-based computations we also found evidence for AVPE signals throughout the striatum. These high-resolution fMRI findings suggest that model-free aspects of reward learning in humans can be explained algorithmically with RL in terms of an actor/critic mechanism operating in parallel with a system for more direct action-value learning.
Year: 2017 PMID: 29049406 PMCID: PMC5673235 DOI: 10.1371/journal.pcbi.1005810
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
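The two prediction errors contrasted in the abstract can be illustrated with their textbook forms. The sketch below is a minimal illustration, not the paper's fitted model: the function names, the discount factor `gamma`, and the toy values in `V` and `Q` are all hypothetical and chosen only to show that the SVPE is the same whichever action was taken, while the AVPE depends on the chosen action.

```python
def state_value_pe(reward, v_next, v_current, gamma=1.0):
    """SVPE (textbook temporal-difference form): outcome vs. the state's value.

    The chosen action does not enter the computation.
    """
    return reward + gamma * v_next - v_current


def action_value_pe(reward, q_next_max, q_current, gamma=1.0):
    """AVPE (textbook Q-learning form): outcome vs. the chosen action's value."""
    return reward + gamma * q_next_max - q_current


# Hypothetical one-step trial: one state, two actions, then termination,
# so the value of whatever follows is taken to be 0.
V = {"s": 0.5}
Q = {("s", "left"): 0.2, ("s", "right"): 0.8}
reward = 1.0

svpe = state_value_pe(reward, v_next=0.0, v_current=V["s"])
avpe_left = action_value_pe(reward, q_next_max=0.0, q_current=Q[("s", "left")])
avpe_right = action_value_pe(reward, q_next_max=0.0, q_current=Q[("s", "right")])
```

With these toy numbers the same reward yields one SVPE but two different AVPEs, which is the kind of dissociation the paradigm is described as targeting.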
Subject groups.
Subjects were first objectively divided into two groups a priori according to their performance on the task as represented by the accuracy score listed here. Of 39 total subjects, 20 were classified as “Good-learner” subjects for whom choice accuracy was significantly greater than the chance score of 50% at the level of an individual subject (p < 0.05). Of the remaining 19 “Poor-learner” subjects, 4 were subsequently reclassified as “Nonperformer” subjects in cases of complete insensitivity to outcomes, which was verified with computational modeling. There were no significant differences between the two main groups when considering possible confounds in reaction time (RT), the total number of missed trials following errors, or age and gender (p > 0.05). Standard deviations are listed in parentheses by the corresponding means within groups.
| | Good learner | Poor learner | Nonperformer | Performer | Aggregate |
|---|---|---|---|---|---|
| N | 20 | 15 | 4 | 35 | 39 |
| Accuracy (%) | 70.9 (7.1) | 53.1 (5.4) | 43.5 (6.1) | 63.3 (10.9) | 61.2 (12.1) |
| RT (ms) | 755 (107) | 779 (137) | 712 (170) | 765 (120) | 760 (124) |
| Missed trials | 6.0 (5.2) | 5.5 (5.3) | 12.8 (13.1) | 5.8 (5.2) | 6.5 (6.5) |
| Age (y) | 23.5 (3.8) | 25.8 (5.2) | 27.3 (8.3) | 24.5 (4.6) | 24.7 (5.0) |
| M:F (%) | 50 | 40 | 100 | 45.7 | 51.3 |
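The caption describes classifying a subject as a Good learner when individual choice accuracy is significantly above the 50% chance level (p < 0.05). One way such a criterion could be checked is a one-sided binomial test, sketched below. This is an assumption for illustration: the excerpt does not state which test was used or how many trials each subject completed, so the function names and the trial count of 100 are hypothetical.

```python
from math import comb


def binomial_p_above_chance(n_correct, n_trials, chance=0.5):
    """One-sided P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def classify(n_correct, n_trials, alpha=0.05):
    """Label a subject by whether accuracy significantly exceeds chance."""
    p_value = binomial_p_above_chance(n_correct, n_trials)
    return "Good learner" if p_value < alpha else "Poor learner"


# Hypothetical trial count (100); accuracies near the two group means above.
label_high = classify(71, 100)
label_low = classify(53, 100)
```

Under this assumed trial count, an accuracy near the Good-learner mean clears the threshold while one near the Poor-learner mean does not; the Nonperformer reclassification relied on computational modeling and is not covered by this sketch.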
Model parameters.
The means and standard deviations of the ACQ model’s fitted parameters are listed separately for each group. These include, from the hysteresis model, the (arbitrarily rightward) constant choice bias β and the initial magnitude β coupled with inverse decay rate λ for exponential decay of the perseveration bias. The fits reveal a tendency for Good learners to have lower temperature than Poor learners (M = 0.987, t = 2.88, p = 0.004). The logarithm of the ratio between the eligibility-adjusted learning rate and the temperature provides a more precise metric for the sensitivity dictated by the model’s fitted parameters than the temperature alone, especially given the correlation between the eligibility-adjusted learning rate and the temperature [57] exhibited within the Poor-learner group (r = 0.547, t = 2.36, p = 0.035) and the lack of such a correlation among Good learners (r = 0.121, t = 0.52, p = 0.611). Model sensitivity, which was significantly positive across the Good-learner group (M = 0.440, t = 5.59, p < 10^-4) but not the Poor-learner group (M = 0.020, t = 0.18, p = 0.428), was not only greater for Good learners than for Poor learners (M = 0.420, t = 3.23, p = 10^-3) but also significantly correlated with the objective metric for choice accuracy (r = 0.409, t = 2.57, p = 0.007). The residual deviance D (with degrees of freedom in the subscript) corresponds to the ACQ model’s improvement in fit relative to either a null intercept model or the hysteresis model.
| | Good learner | Poor learner |
|---|---|---|
| N | 20 | 15 |
| Accuracy (%) | 70.9 (7.1) | 53.1 (5.4) |
| Sensitivity | 0.440 (0.352) | 0.020 (0.417) |
| Learning rate | 0.588 (0.237) | 0.551 (0.308) |
| Eligibility | 0.682 (0.323) | 0.687 (0.431) |
| Action-value weight | 0.661 (0.315) | 0.626 (0.418) |
| Softmax temperature | 0.404 (0.262) | 1.390 (1.512) |
| Perseveration bias: magnitude | 0.093 (0.366) | -0.088 (0.521) |
| Perseveration bias: rate | 0.621 (0.375) | 0.751 (0.281) |
| Rightward bias | 0.230 (0.425) | 0.128 (0.673) |
| Null: residual deviance | 45.60 (20.31) | 21.59 (20.15) |
| Hysteresis: residual deviance | 20.18 (13.32) | 9.41 (9.13) |
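To give a feel for what the difference in softmax temperature between groups implies, the sketch below applies a plain softmax to a pair of hypothetical option values at the two group-mean temperatures from the table. It is an illustration under stated assumptions only: the ACQ model also includes bias terms and a weighting of action values that are omitted here, the option values are invented, and a group-mean temperature does not represent any individual subject.

```python
import math


def softmax_choice_probs(values, temperature):
    """Softmax over option values; lower temperature gives sharper choices."""
    scaled = [v / temperature for v in values]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical learned values for two options (second option is better).
values = [0.0, 0.5]

# Group-mean temperatures taken from the table above.
p_good = softmax_choice_probs(values, temperature=0.404)
p_poor = softmax_choice_probs(values, temperature=1.390)
```

For the same value difference, the lower temperature assigns a higher probability to the better option, which is consistent with the direction of the group difference reported in the caption.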