Corey Yishan Zhou, Dalin Guo, Angela J. Yu.
Abstract
Humans frequently overestimate the likelihood of desirable events while underestimating the likelihood of undesirable ones: a phenomenon known as unrealistic optimism. Previously, it was suggested that unrealistic optimism arises from asymmetric belief updating, with a relatively reduced coding of undesirable information. Prior studies have shown that a reinforcement learning (RL) model with asymmetric learning rates (greater for a positive prediction error than a negative prediction error) could account for unrealistic optimism in a bandit task, in particular the tendency of human subjects to persistently choosing a single option when there are multiple equally good options. Here, we propose an alternative explanation of such persistent behavior, by modeling human behavior using a Bayesian hidden Markov model, the Dynamic Belief Model (DBM). We find that DBM captures human choice behavior better than the previously proposed asymmetric RL model. Whereas asymmetric RL attains a measure of optimism by giving better-than-expected outcomes higher learning weights compared to worse-than-expected outcomes, DBM does so by progressively devaluing the unchosen options, thus placing a greater emphasis on choice history independent of reward outcome (e.g. an oft-chosen option might continue to be preferred even if it has not been particularly rewarding), which has broadly been shown to underlie sequential effects in a variety of behavioral settings. Moreover, previous work showed that the devaluation of unchosen options in DBM helps to compensate for a default assumption of environmental non-stationarity, thus allowing the decision-maker to both be more adaptive in changing environments and still obtain near-optimal performance in stationary environments. Thus, the current work suggests both a novel rationale and mechanism for persistent behavior in bandit tasks.Entities:
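To make the contrast between the two accounts concrete, below is a minimal Python sketch of both update rules as the abstract describes them: an RW± step that weights positive and negative prediction errors by different learning rates, and a DBM step that mixes each arm's belief toward the prior before conditioning only the chosen arm on the outcome, so unchosen arms drift back toward the prior (the reward-independent devaluation). The function names, learning rates, change probability `alpha`, and the discretized belief grid are illustrative assumptions, not the paper's fitted parameterization.

```python
import numpy as np

def rw_asymmetric_update(q, reward, eta_pos=0.3, eta_neg=0.1):
    """One RW+/- update of the chosen arm's Q-value.

    A positive prediction error is weighted by eta_pos and a negative
    one by eta_neg; eta_pos > eta_neg yields optimistic (asymmetric)
    learning. Values here are illustrative, not fitted.
    """
    delta = reward - q                        # prediction error
    eta = eta_pos if delta > 0 else eta_neg   # asymmetric learning rate
    return q + eta * delta

def dbm_update(belief, chosen, reward, prior, alpha=0.8,
               theta=np.linspace(0.01, 0.99, 99)):
    """One Dynamic Belief Model update over a discretized grid of
    reward rates theta.

    belief: (n_arms, n_grid) array of per-arm distributions over theta.
    prior:  (n_grid,) prior distribution over theta.

    Every arm's belief is first mixed with the prior (the environment
    is assumed to change with probability 1 - alpha); then only the
    chosen arm is conditioned on the observed Bernoulli outcome.
    Unchosen arms therefore drift toward the prior: the devaluation
    described in the abstract.
    """
    new_belief = alpha * belief + (1 - alpha) * prior   # predictive step
    lik = theta if reward else (1.0 - theta)            # Bernoulli likelihood
    new_belief[chosen] *= lik                           # condition chosen arm
    new_belief[chosen] /= new_belief[chosen].sum()      # renormalize
    return new_belief
```

For example, with a uniform prior (`prior = np.ones(99) / 99`) and `belief = np.tile(prior, (2, 1))`, repeatedly calling `dbm_update` for one arm leaves the other arm's expected reward rate (`belief[i] @ theta`) decaying toward the prior mean regardless of outcomes, whereas `rw_asymmetric_update` changes an estimate only through the reward it receives.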
Keywords: Bayesian modeling; decision making; multi-armed bandit; reinforcement learning; unrealistic optimism
Year: 2020 PMID: 34355220 PMCID: PMC8336429
Source DB: PubMed Journal: CogSci
Figure 1: Model comparison. (A) BIC of DBM versus RW±, for both softmax and ε-greedy decision policies. Error bars: s.e.m. of BIC after subtracting each subject's DBM BIC, so the DBM error bar is 0 by construction. (B) Average predictive accuracy of DBM versus RW±, for both softmax and ε-greedy decision policies. Chance predictive accuracy is 0.5. Error bars: s.e.m. of individually, subtractively normalized predictive accuracy, analogous to (A). (C) Predictive accuracy of DBM versus RW± at the individual level.
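The within-subject normalization in panel (A) can be written compactly; the sketch below assumes a hypothetical `bic` array of shape (n_subjects, n_models) with DBM in column 0, which is not data from the paper.

```python
import numpy as np

def normalized_sem(bic, ref_col=0):
    """S.e.m. of per-subject BIC after subtracting the reference model's
    (here DBM's) BIC for each subject, as in Figure 1A; the reference
    column's s.e.m. is 0 by construction."""
    centered = bic - bic[:, [ref_col]]   # subtract DBM BIC per subject
    return centered.std(axis=0, ddof=1) / np.sqrt(bic.shape[0])
```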
Figure 2: Evolution of the differential Q-value (left − right) as a function of trials for an example subject (subject 24). Circles indicate the subject's actual choice (.5 = left, −.5 = right). A more positive Q-value difference means a greater model-predicted probability of choosing the left arm. Filled circles correspond to reward, and hollow circles correspond to no reward. Top left: 25/25%; Top right: 75/25%; Bottom left: 25/75%; Bottom right: 75/75%. The four pairs were randomly interleaved in their presentation.
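Under the softmax policy, the mapping from the plotted Q-value difference to the predicted choice probability is a logistic function; a minimal sketch follows, with the inverse temperature `beta` as an illustrative value rather than a fitted parameter.

```python
import numpy as np

def p_choose_left(q_left, q_right, beta=5.0):
    """Softmax (logistic) choice rule for two arms: a larger
    (left - right) Q-value difference raises the predicted
    probability of choosing left."""
    return 1.0 / (1.0 + np.exp(-beta * (q_left - q_right)))
```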