Yu Bai, Kentaro Katahira, Hideki Ohira.
Abstract
Humans are capable of correcting their actions based on the outcomes of actions performed in the past, and this ability enables them to adapt to a changing environment. The computational field of reinforcement learning (RL) has provided a powerful explanation for understanding such processes. Recently, a dual learning system, modeled as a hybrid model that combines value updating based on the reward-prediction error with learning-rate modulation based on a surprise signal, has gained attention as a model for explaining various neural signals. However, the functional significance of the hybrid model has not been established. In the present study, we used computer simulation to address the functional significance of the hybrid model in a probabilistic reversal learning task. The hybrid model was found to perform better than the standard RL model across a wide range of parameter settings. These results suggest that the hybrid model is more robust against the mistuning of parameters than the standard RL model when decision-makers must continue to learn stimulus-reward contingencies that can change abruptly. The parameter-fitting results also indicated that the hybrid model fit better than the standard RL model for more than 50% of the participants, which suggests that the hybrid model has greater explanatory power for the behavioral data than the standard RL model.
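The record does not include the model equations. As a point of reference, below is a minimal sketch of one common formulation of such a hybrid model, in which the absolute reward-prediction error (a surprise signal) drives the learning rate (a Pearce-Hall-style rule). The parameter names α0 (initial learning rate), β (exploration/exploitation parameter), and η (surprise weight) follow the figure captions; the exact functional form is an assumption, not taken from the paper.

```python
import numpy as np

def hybrid_update(q, choice, reward, alpha, eta):
    """One trial of the hybrid model (sketch; assumed form).

    q      : array of action values
    alpha  : current learning rate
    eta    : surprise weight; eta = 0 recovers standard Q-learning
    """
    delta = reward - q[choice]                 # reward-prediction error
    q[choice] += alpha * delta                 # value update
    # Surprise (|delta|) modulates the next trial's learning rate
    # (Pearce-Hall-style rule; an assumption, not from the paper).
    alpha = eta * abs(delta) + (1 - eta) * alpha
    return q, alpha

def softmax_choice(q, beta, rng):
    """Softmax action selection with inverse temperature beta."""
    z = beta * np.asarray(q, dtype=float)
    p = np.exp(z - z.max())                    # subtract max for stability
    p = p / p.sum()
    return rng.choice(len(q), p=p)
```

With η = 0 the learning rate never moves from its initial value α0, which is why the hybrid model reduces to standard Q-learning in that limit, as noted in the Figure 1 caption.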
Keywords: decision making; learning rate; reinforcement learning model; reversal learning; value
Year: 2014 PMID: 25161635 PMCID: PMC4129443 DOI: 10.3389/fpsyg.2014.00871
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. Simulation results of RL models. We used computer simulations to compare the Q-learning model and the hybrid model on a reversal learning task with 160 trials. To examine model performance in various learning settings, simulations were repeated for varying initial learning rates α0 (0–1) and exploration/exploitation parameters β (0–50) at different η levels (0–1). (A) The rate of advantageous choice across all combinations of α0 and β at seven typical η levels (0, 0.05, 0.1, 0.3, 0.5, 0.7, 1); when η = 0, the model corresponds to the Q-learning model. Each cell depicts the proportion of advantageous choices, computed by simulating the learning task 1000 times for each model. The resulting 2-dimensional plots were sufficiently smooth, which suggests that the estimated average values are reliable. The initial learning rate α0 varies along the y-axis, and the exploration/exploitation parameter β varies along the x-axis. Region (i) indicates a mistuned-parameter situation (α0 = 0.15; β = 45), region (ii) indicates a well-tuned situation (α0 = 0.65; β = 25), and region (iii) indicates another mistuned-parameter situation (α0 = 0.95; β = 45). (B) Typical time courses of the likelihood of choosing option 1 (the good choice before the reversal occurs) and of the learning rate, for the same combinations of α0 and β as in the left panel. The learning-rate curves illustrate that the learning rate fluctuates more as η increases. The choice-likelihood curves indicate that the agent detects the reversal of the good option faster as η increases.
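A hedged sketch of how one cell of the grid in panel (A) could be reproduced, reusing hybrid_update and softmax_choice from the sketch above: a two-option task of 160 trials whose contingencies reverse once, with the rate of advantageous choice averaged over 1000 runs per (α0, β) combination. The reversal point (trial 80), the ±1 reward coding, and the 80:20 reward probability are illustrative assumptions; only the trial count, the parameter ranges, and the 1000 repetitions come from the caption.

```python
def run_task(alpha0, beta, eta, n_trials=160, p_good=0.8,
             reversal=80, seed=None):
    """Simulate one run of a two-option reversal task and return the
    rate of advantageous choice. Reversal at trial 80, 80:20 reward
    frequencies, and +/-1 reward coding are assumptions."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                    # action values
    alpha = alpha0                     # learning rate starts at alpha0
    good = 0                           # option 1 is good before reversal
    n_adv = 0
    for t in range(n_trials):
        if t == reversal:
            good = 1 - good            # stimulus-reward contingency flips
        choice = softmax_choice(q, beta, rng)
        n_adv += int(choice == good)
        p_reward = p_good if choice == good else 1.0 - p_good
        reward = 1.0 if rng.random() < p_reward else -1.0
        q, alpha = hybrid_update(q, choice, reward, alpha, eta)
    return n_adv / n_trials

# Mean over 1000 runs for one grid cell, e.g. the well-tuned region (ii):
# np.mean([run_task(0.65, 25, 0.3) for _ in range(1000)])
```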
Figure 2. Proportion of parameter regions in which performance exceeds the median of standard Q-learning, for different task difficulties. The proportion is defined as the fraction of (α0, β) parameter combinations whose rate of advantageous choice is larger than the median (across combinations of α0 and β) of the standard Q-learning model (η = 0). Task difficulty is a measure of how hard it is to distinguish the good choice from the bad choice; three levels were used (reward/loss frequency ratios of 80:20, 70:30, and 60:40; easier tasks have a higher ratio).
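The proportion reported in Figure 2 is a summary statistic over the performance grid. A short sketch of its computation, assuming grid_eta and grid_q are NumPy arrays of advantageous-choice rates over all (α0, β) cells (e.g., built by averaging run_task above) for the hybrid model and for standard Q-learning (η = 0), respectively:

```python
def proportion_above_baseline(grid_eta, grid_q):
    """Fraction of (alpha0, beta) cells in which a model's rate of
    advantageous choice exceeds the median cell of standard
    Q-learning (eta = 0)."""
    baseline = np.median(grid_q)
    return float(np.mean(grid_eta > baseline))
```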
Model fit results (group fit).

| Model | LL | AIC | BIC | α0 | β | η |
| --- | --- | --- | --- | --- | --- | --- |
| Q-learning | −1379 | 2762 | 2773 | 0.114 | 2.584 | – |
| Hybrid | −1366 | 2739 | 2756 | 0.018 | 2.601 | 0.004 |

Shown are the log-likelihood (LL), AIC, BIC, and maximum-likelihood parameter estimates for the standard Q-learning model and the hybrid model, with parameters fit to the entire group.
Figure 3. Goodness-of-fit of the standard Q-learning model and the hybrid model, with parameters fit to the entire group (A,B) and to individual participants (C,D). (A,C) AIC scores of the two models. (B,D) BIC scores of the two models. Error bars indicate SD.
Model fit results (individual fit).

| Model | LL | AIC | BIC | α0 | β | η |
| --- | --- | --- | --- | --- | --- | --- |
| Q-learning | −84 ± 23 | 173 ± 47 | 179 ± 47 | 0.08 ± 0.04 | 6.98 ± 12.12 | – |
| Hybrid | −78 ± 28 | 162 ± 56 | 172 ± 56 | 0.18 ± 0.35 | 7.2 ± 12.18 | 0.07 ± 0.12 |

Shown are the log-likelihood (LL), AIC, BIC, and maximum-likelihood parameter estimates (mean ± SD across participants) for the standard Q-learning model and the hybrid model, with parameters fit to individual subjects.
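For reference, the AIC and BIC values in both tables follow the standard definitions, computable from a model's log-likelihood (LL), its number of free parameters k (2 for Q-learning: α0, β; 3 for the hybrid model: α0, β, η), and the number of observations n (here, presumably the number of fitted choices):

```python
import math

def aic(ll, k):
    """Akaike information criterion: 2k - 2*LL."""
    return 2 * k - 2 * ll

def bic(ll, k, n):
    """Bayesian information criterion: k*ln(n) - 2*LL."""
    return k * math.log(n) - 2 * ll

# Group-fit Q-learning row: LL = -1379 with k = 2 free parameters
# gives aic(-1379, 2) -> 2762, matching the table above.
```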