Sinem Balta Beylergil1, Anne Beck2, Lorenz Deserno3, Robert C Lorenz4, Michael A Rapp5, Florian Schlagenhauf6, Andreas Heinz7, Klaus Obermayer8. 1. Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587 Berlin, Germany; Bernstein Center for Computational Neuroscience Berlin, 10115 Berlin, Germany. Electronic address: sinembalta@gmail.com. 2. Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany. 3. Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany; Max Planck Institute for Human Cognitive and Brain Sciences, 04103 Leipzig, Germany; Department of Neurology, Otto von Guericke University, 39118 Magdeburg, Germany. 4. Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany; Center for Adaptive Rationality, Max Planck Institute for Human Development, 14195 Berlin, Germany. 5. Social and Preventive Medicine, University of Potsdam, 14469 Potsdam, Germany. 6. Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany; Max Planck Institute for Human Cognitive and Brain Sciences, 04103 Leipzig, Germany. 7. Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany; Cluster of Excellence NeuroCure, Charité-Universitätsmedizin Berlin, 10117 Berlin, Germany. 8. Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587 Berlin, Germany; Bernstein Center for Computational Neuroscience Berlin, 10115 Berlin, Germany.
Abstract
Substance-dependent individuals often lack the ability to adjust decisions flexibly in response to changes in reward contingencies. Prediction errors (PEs) are thought to mediate flexible decision-making by updating the reward values associated with available actions. In this study, we explored whether the neurobiological correlates of PEs are altered in alcohol dependence. Behavioral and functional magnetic resonance imaging (fMRI) data were simultaneously acquired from 34 abstinent alcohol-dependent patients (ADP) and 26 healthy controls (HC) during a probabilistic reward-guided decision-making task with dynamically changing reinforcement contingencies. A hierarchical Bayesian inference method was used to fit and compare learning models with different assumptions about the amount of task-related information subjects may have inferred during the experiment. Here, we observed that the best-fitting model was a modified Rescorla-Wagner type model, the "double-update" model, which assumes that subjects infer the knowledge that reward contingencies are anti-correlated, and integrate both actual and hypothetical outcomes into their decisions. Moreover, comparison of the best-fitting model's parameters showed that ADP were less sensitive to punishments compared to HC. Hence, decisions of ADP after punishments were loosely coupled with the expected reward values assigned to them. A correlation analysis between the model-generated PEs and the fMRI data revealed a reduced association between these PEs and the BOLD activity in the dorsolateral prefrontal cortex (DLPFC) of ADP. A hemispheric asymmetry was observed in the DLPFC when positive and negative PE signals were analyzed separately. The right DLPFC activity in ADP showed a reduced correlation with positive PEs. On the other hand, ADP, particularly the patients with high dependence severity, recruited the left DLPFC to a lesser extent than HC for processing negative PE signals.
These results suggest that the DLPFC, which has been linked to adaptive control of action selection, may play an important role in cognitive inflexibility observed in alcohol dependence when reinforcement contingencies change. Particularly, the left DLPFC may contribute to this impaired behavioral adaptation, possibly by impeding the extinction of the actions that no longer lead to a reward.
Alcohol has been considered the most harmful psychoactive substance when physical, psychological, and social effects are taken together (McGinnis and Foege, 1993, Nutt et al., 2010). It can cause structural and functional changes in a network of cortical and subcortical structures (Beck et al., 2012, Makris et al., 2008, Moriyama et al., 2002, Ratti et al., 2002). These alterations, which partly seem to persist during abstinence (Ratti et al., 2002, Zinn et al., 2004), gradually reduce cognitive control, impairing an individual's ability to inhibit perseverative responses and adapt to changes in environmental contingencies. Indeed, alcohol use disorder itself can be seen as an inability to adjust responses to stimuli formerly coupled with alcohol, leading to habitual, perseverative consumption patterns (Stalnaker et al., 2009).
The probabilistic reversal learning task (PRLT) has traditionally been used to assess cognitive flexibility in addiction (Izquierdo and Jentsch, 2012, Swainson et al., 2000). Experiments using PRLTs have demonstrated that various substance-dependent groups including alcohol, cocaine, and stimulant-dependent patients have difficulties adapting to reversals, i.e. abrupt changes in reward contingencies (Deserno et al., 2014, Ersche et al., 2008, Ersche et al., 2011, Park et al., 2010). Recently, there has been an increasing interest in understanding the computational mechanisms underlying these impairments in substance use disorder through reinforcement learning (RL) models (Deserno et al., 2014, Park et al., 2010, Patzelt et al., 2014, Tanabe et al., 2013). These models are based on the idea that individuals tend to repeat the actions that lead to rewards and to cease the actions that lead to punishments (Sutton and Barto, 1998). They rely on a teaching signal called “prediction error,” which quantifies the discrepancy between the estimated reward value of an action and the actual reward obtained by selecting that action.
Learning takes place as PE updates the selected action's reward value, which then guides action selection on the next trial. More recently, it has been suggested that healthy human subjects in value-based decision-making tasks not only learn the expected reward value of the action they select, but also consider what they would have obtained if they selected the alternative action (Boorman et al., 2009, Boorman et al., 2013, Li and Daw, 2011, Lohrenz et al., 2007, Tobia et al., 2014). The RL models accommodate this counterfactual learning via an additional fictive update rule that updates the reward expectancy of the unselected option, assuming that subjects consider the anti-correlated reward structure of the PRLT such that if one choice is likely to be rewarded, the alternative is likely to be punished (Hampton et al., 2007). In this study, using a reward-guided decision-making task with anti-correlated action-outcome contingencies that abruptly change throughout the experiment, we hypothesized that subjects would infer and incorporate this latent feature of the task structure in decision-making and integrate both actual and fictive outcomes into their decisions. Based on the recent reports showing superior model-fitting performance of these “double-update” (DU) learning models (Glaescher et al., 2009, Hampton et al., 2007, Schlagenhauf et al., 2014), we hypothesized that the DU model would fit to the behavioral data better than the standard RL models that only update the value of the selected action. Previous reports with subsets of our subjects (Deserno et al., 2014, Park et al., 2010) used the basic RL model to test their hypotheses related to the blood oxygen level dependent (BOLD) activity in the ventral striatum (VS) that varies with PEs, which has been shown to be reliably predicted by this model (Pagnoni et al., 2002). 
Alternatively, our approach was to compare various learning models with different assumptions about the amount of task-related information subjects may have extracted during the experiment. Our aim was to find the model that best explains the underlying computations carried out by the subjects while adjusting their responses to abruptly changing reinforcement contingencies of the task.
The combination of RL modeling and fMRI holds promise for testing various hypotheses concerning the brain mechanisms responsible for computational processes in reward-based decision-making (Glaescher and O'Doherty, 2010). Recently, by adopting this “model-based fMRI” approach (Montague et al., 2012), neural representations of learning have been compared between substance-dependent and control groups to gain insight into the cognitive rigidity of addictive behavior. Research to date has mainly focused on the striatal impairments in reward-based decision-making (Deserno et al., 2014, Park et al., 2010) because addictive drugs seem to “hijack” the reward-related processes governed by striatal structures and evoke a pattern of behavior similar to those evoked by natural rewards (Dayan, 2009, Hyman, 2005). However, drugs also cause structural and functional changes in the PFC, especially in the DLPFC, which possibly contribute to the decline of cognitive control (Charlet et al., 2014, Goldstein et al., 2004, Loeber et al., 2009, Sullivan et al., 2000). Furthermore, reduced neural recruitment in the DLPFC has been extensively reported in various drug-dependent groups performing other tasks that require cognitive flexibility (Bolla et al., 2004, Eldreth et al., 2004, Paulus et al., 2008, Salo et al., 2009; see Goldstein and Volkow, 2011 for a review). Moreover, a previous study with a subgroup of our subjects reported abnormal signal propagation between VS and DLPFC, possibly leading to impairments in modifying and controlling behavior following reinforcement (Park et al., 2010).
Based on these findings, and recent evidence on the involvement of DLPFC in PRLT (Budhani et al., 2007, Cools et al., 2002, Greening et al., 2011, Mitchell et al., 2009), the focus of this study was to elaborate on DLPFC's contribution to the inability of ADP in making flexible decisions in response to the reversals of reward contingencies. We sought to capture the neural substrates of decision-making in the PFC via a model that assumes subjects infer the unobservable (latent) reward structure of the task, which is then used to choose actions that maximize reward attained. We hypothesized that the BOLD activity in the DLPFC of ADP failing to track the PE signal derived from this model would contribute to impaired behavioral adaptation in alcohol dependence.
There is growing evidence that the human brain has distinct neural mechanisms for processing rewards and punishments (Bischoff-Grethe et al., 2009, Frank et al., 2004, Liu et al., 2007, Wrase et al., 2007, Yacubian, 2006). Furthermore, recent research suggests that these mechanisms may act differently in the case of substance use disorder (Parvaz et al., 2015, Paulus et al., 2008, Rossiter et al., 2012). The mechanisms responsible for processing punishments may be of particular interest in understanding maladaptive decision-making in alcohol use disorder because aversive consequences of alcohol use seem to be often consciously acknowledged but behaviorally ignored by abusers. Previous studies using behavioral modeling showed that the actions of substance-dependent individuals usually fail to match with the punishment expectancies attached to them (Bishara et al., 2009, Fridberg et al., 2010, Stout et al., 2004, Tanabe et al., 2013). Therefore, we hypothesized that punishments received in the current experiment would have weaker effects on the decisions of ADP compared to the decisions of HC.
Negative PEs play a pivotal role in the current task as they mediate the extinction of learned actions that no longer lead to a reward when reinforcement contingencies change. In alcohol dependence, abnormal representation of these signals may contribute to the difficulties in ceasing drug-related behavior hindering the maintenance of abstinence. Therefore, one of the aims of the present study was to investigate the neural correlates of abnormal encoding of negative PEs. It has been shown that the severity of dependence symptoms (Doyle and Donovan, 2009) and craving for alcohol (Bottlender and Soyka, 2004) are significantly related to the ability of an alcohol-dependent individual to stay abstinent in high-risk relapse situations in which the individual should override the action “to consume” alcohol. Therefore, we reasoned that impaired negative PE signaling in ADP would be related to high dependence severity and high craving for alcohol.
Materials and methods
Subjects
34 abstinent ADP and 26 HC (all male) participated in the current study (see Table 1 for sample characteristics). Subjects had no other neurological or psychiatric disorder and no current drug abuse other than nicotine. All ADP were diagnosed according to the International Classification of Diseases and Related Health Problems 10th edition (World Health Organization, 2004) and Diagnostic and Statistical Manual of Mental Disorders 4th edition (American Psychiatric Association, 1994). The severity of dependence and the mean craving for alcohol were assessed with the Alcohol Dependence Scale (ADS, Skinner and Horn, 1984) and the average craving subscale of Obsessive Compulsive Drinking Scale (OCDS) (Anton, 2000). The amount of alcohol intake in the past year was evaluated with the Lifetime Drinking History (LDH) questionnaire (Skinner and Sheu, 1982). The smoking severity of the subjects was also assessed with the Fagerström Test for Nicotine Dependence (Heatherton et al., 1991). During fMRI sessions, ADP had been abstinent and were free of benzodiazepine or chlormethiazole medication for at least 1 week (> 4 half-lives). Groups did not differ on age, handedness (Oldfield, 1971) or verbal intelligence as assessed with a German vocabulary test (Schmidt and Metzler, 1992). However, there were significantly more chronic cigarette smokers in ADP than HC. All statistical analyses (including the fMRI analyses) were therefore controlled for smoking status.
Table 1
Sample characteristics. ADP: alcohol-dependent patients, HC: healthy controls, FTND: Fagerström Test for Nicotine Dependence, EDI: the Edinburgh Handedness Inventory, LDH: Lifetime Drinking History, OCDS: Obsessive Compulsive Drinking Scale, ADS: Alcohol Dependence Scale.
| | ADP (n = 34) | HC (n = 26) | Statistics | p |
|---|---|---|---|---|
| Age | 44.73 ± 8.27, 23–60 years | 41.92 ± 9.59, 28–61 years | t(58) = 1.21 | 0.220 |
| Sex | All male | All male | | |
| Smoking | 25 smokers | 11 smokers | χ² = 4.75 | 0.020 |
| FTND | 5 ± 2.73, 1–10 | 3.36 ± 2.37, 0–7 | t(34) = −1.71 | 0.100 |
| EDI | Right-handed | Right-handed | | |
| Verbal IQ | 102.85 ± 8.92, 85–125 | 103.80 ± 8.93, 90–125 | t(58) = 0.41 | 0.680 |
| LDH last year (kg) | 89.10 ± 166.04, 2.10–999 | 5.69 ± 13.27, 0.12–68.88 | t(58) = −2.55 | 0.010 |
| OCDS sum | 17.48 ± 7.09, 4–33 | 2.53 ± 2.56, 0–11 | t(58) = −7.51 | 0.001 |
| OCDS craving | 8.23 ± 10.06, 0–40 | 28.32 ± 35.60, 0–100 | t(58) = −2.78 | 0.007 |
| ADS | 15.48 ± 7.73, 1–36 | – | | |
| Days of abstinence | 17.55 ± 7.92, 7–46 days | – | | |
The study was approved by the Ethics Committee of Charité - Universitätsmedizin Berlin and all subjects signed a written consent after all procedures were explained thoroughly.
Task description
During fMRI acquisition, subjects performed a reward-guided decision-making task with dynamically changing action-outcome contingencies (Deserno et al., 2014, Park et al., 2010, Schlagenhauf et al., 2013, Schlagenhauf et al., 2014). On each trial, subjects had to choose one of the two abstract visual stimuli presented on a computer screen for 2 s. Following the action, the selected stimulus and its outcome (either a green smiley for reward or a red frowny for punishment) stayed on the screen for 1 s. The experiment included two runs of 100 trials separated by a short break. Trial timings were jittered by an interval of 1–6.5 s.
There were three block types with the following reward contingencies: (1) 20% left- and 80% right-, (2) 80% left- and 20% right-, and (3) 50% left- and 50% right-hand choices leading to a reward, otherwise to a punishment. Reward contingencies on the two options were fully anti-correlated, so that, for instance, when one option resulted in a reward on 80% of occasions, the other option led to a punishment 80% of the time. Subjects started the experiment with either block type (1) or (2). The block type shifted abruptly and unpredictably to any of the randomly chosen block types after ten trials (minimum block length) when subjects chose the most highly rewarding option on 70% (50% for the 3rd block type) of the trials of an entire block. Regardless of whether this learning criterion was fulfilled, reward contingencies automatically changed after the maximum block length of 16 trials.
Subjects were instructed that the aim of the task was to learn by trial and error which of the two stimuli is better than the other, i.e. has a higher chance of winning. They were asked to adapt their behavior to possible changes in reward contingencies and win as often as possible. However, they were not informed about the exact timing of contingency changes or the reward probabilities (see Supplementary material for task instructions).
Before entering the fMRI scanner, subjects were asked to perform a short version without the changes in reward contingencies to become familiar with the probabilistic nature of the task.
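The block-switching rule described above can be sketched as a small predicate. This is a minimal illustration, not the task code used in the study: the function name, the assumption that "70%" means "at least 70%", and the way correct choices are counted in the 50/50 block are all hypothetical.

```python
MIN_BLOCK_LEN = 10  # minimum block length stated in the task description
MAX_BLOCK_LEN = 16  # contingencies always change after this many trials


def should_switch_block(n_trials: int, n_correct: int, block_type: int) -> bool:
    """Return True if the reward contingencies should change now.

    n_trials   -- trials completed in the current block
    n_correct  -- trials on which the more rewarding option was chosen
                  (how this is scored in the 50/50 block is an assumption)
    block_type -- 1 or 2 for the 20/80 and 80/20 blocks, 3 for the 50/50 block
    """
    if n_trials >= MAX_BLOCK_LEN:
        return True   # automatic change at the maximum block length
    if n_trials < MIN_BLOCK_LEN:
        return False  # never switch before the minimum block length
    criterion_pct = 50 if block_type == 3 else 70
    # Integer arithmetic avoids floating-point edge cases at exactly 70%.
    return 100 * n_correct >= criterion_pct * n_trials
```

For example, `should_switch_block(10, 7, 1)` is true under these assumptions, whereas `should_switch_block(10, 6, 1)` is not.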
Statistical analysis of the behavior
The total number of correct choices and the number of blocks for which the reversal criterion was met were compared between the two groups using two-sample t-tests. Response times after rewards and punishments were compared using a 2 × 2 ANCOVA, with a between-subject factor group, a within-subject factor outcome valence, and a covariate for smoking status. We also measured the extent to which the outcome information gathered by the subjects during the previous four trials was integrated into the decisions to stay on the same option (win-stay behavior) or shift to the other option (lose-shift behavior). We then tested for between-group differences in win-stay and lose-shift behavior, which were assessed by a logistic regression analysis as explained elsewhere (den Ouden et al., 2013) and in the Supplementary material.
All standard tests in this study were performed in R 3.0.2 (R Core Team, 2013). Greenhouse–Geisser correction was used whenever the sphericity assumption was violated.
Computational modeling of the behavioral data
Models
We adopted a behavioral modeling approach to understand the computational processes underlying the reward-based decisions of the subjects and to explore the differences between ADP and HC in these processes. We considered three groups of computational learning models with different assumptions about the amount of task-related information subjects may have inferred during the experiment (Schlagenhauf et al., 2014). The first group of learning models consisted of the standard Rescorla-Wagner type models (Rescorla and Wagner, 1972) in which learning is based on an error measure called “prediction error”. Learning takes place as the PE (denoted as δ in Eq. (1)), which quantifies the discrepancy between the received outcome R and the expected outcome Q(a), updates the expected value Q(a) of the selected action at the end of each trial when R information is revealed (Eqs. (1), (2)). It reinforces an action or facilitates its extinction depending on whether the obtained R is better (positive PE) or worse (negative PE) than the expected outcome Q(a).
The effect of the reinforcement on subject's decision is represented by a free parameter called reinforcement sensitivity, as denoted by ρ in Eq. (1). Higher values of ρ magnify the differences between the option values and increase the probability of the selection of the action with higher expected value. On the other hand, lower values lead to explorative decisions that are inconsistent and independent of the reward expectancies (Stout et al., 2004). This definition of sensitivity should be distinguished from the more traditional definition as the ability to derive pleasure or displeasure from the reinforcers in the experiment.
The extent to which PE updates the expected value is determined by another free parameter called learning rate α (Eq. (2)).
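The single-trial update described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes Eq. (1) has the form δ = ρR − Q(a), Eq. (2) the form Q(a) ← Q(a) + αδ, and that outcomes are coded +1 for reward and −1 for punishment. The function names are hypothetical, and the choice rule uses the sigmoid of the value difference with β = 1 and c = 0, the settings reported later in this section.

```python
import math


def prediction_error(q_selected: float, outcome: int, rho: float) -> float:
    """Assumed form of Eq. (1): sensitivity-scaled outcome minus expected value."""
    return rho * outcome - q_selected


def su_update(q: dict, action: str, outcome: int, alpha: float, rho: float) -> float:
    """Update only the selected action's value in place (assumed Eq. (2)); return the PE."""
    delta = prediction_error(q[action], outcome, rho)
    q[action] += alpha * delta
    return delta


def choice_probability(q_a: float, q_other: float) -> float:
    """Probability of choosing a: sigmoid of the value difference (beta = 1, c = 0)."""
    return 1.0 / (1.0 + math.exp(-(q_a - q_other)))


# Example: one rewarded choice of "left", starting from neutral values.
q_values = {"left": 0.0, "right": 0.0}
delta = su_update(q_values, "left", outcome=+1, alpha=0.5, rho=2.0)
p_left = choice_probability(q_values["left"], q_values["right"])
```

With a larger ρ the same reward produces a larger value difference and therefore a choice probability further from 0.5, which is the sense in which ρ governs how tightly choices follow the expected values.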
To study the dependence-related behavioral patterns that are specific to processing reward and punishment, we allowed the reinforcement sensitivity parameter to take two distinct values, reward sensitivity (ρ_r) or punishment sensitivity (ρ_p), according to the valence of the outcome (Ito and Doya, 2009, Schlagenhauf et al., 2013). The learning rate was also allowed to take two different values depending on whether the received outcome R is a reward (reward learning rate α_r) or a punishment (punishment learning rate α_p).
The first group of learning models tested in this study assumes that learning can only take place through experience. Therefore, they only update the expected value of the selected action Q(a), while leaving the expected value of the unselected action Q(a′) unchanged (Eq. (3)) (hence called “single-update” models). The second class of models, called “double-update” models, extends the single-update (SU) models by also taking into account the counterfactual outcome that could have been received from the unselected action (Boorman et al., 2009, Boorman et al., 2013, Li and Daw, 2011, Lohrenz et al., 2007, Tobia et al., 2014). Based on the idea that subjects infer and utilize the knowledge that reward contingencies on the two options are fully anti-correlated, double-update (DU) models update the expected values of both the selected action a (Eq. (2)) and the unselected action a′ (Eq. (4)). It is important to note that after a reward, the update rule for the unchosen option does not assume a certain punishment (or vice versa), but a lower probability of receiving a reward and therefore a higher probability of receiving a punishment, as there is no other type of feedback in the experiment.
We generated three versions of SU and DU models with different combinations of free parameters (SU1–3, DU1–3, see Table 2).
With a further DU model (DU4), we tested the hypothesis that fictive learning signals would not be utilized in updating the action values as effectively as actual learning signals (Matsumoto et al., 2007). This model uses a fictive learning rate, which is calculated by weighting the learning rate with an additional parameter ξ. This parameter is a fractional step size, which takes a value between 0 and 1 (Eq. (5)).
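The double update and the fictive weight can be sketched in the same style. The form of the fictive update is an assumption: here the unchosen action is updated toward the opposite outcome, δ′ = −ρR − Q(a′), with a fictive learning rate ξα. The published Eqs. (4) and (5) may differ in detail, and the function name is hypothetical. Outcomes are again assumed to be coded +1 / −1.

```python
def du_update(q: dict, action: str, other: str, outcome: int,
              alpha: float, rho: float, xi: float = 1.0) -> tuple:
    """Double update of both action values in place; returns (actual PE, fictive PE).

    xi = 1 gives a plain double update, 0 < xi < 1 down-weights the
    fictive signal, and xi = 0 reduces to the single-update rule.
    """
    delta = rho * outcome - q[action]          # actual PE for the chosen action
    delta_fictive = -rho * outcome - q[other]  # assumed fictive PE: opposite outcome
    q[action] += alpha * delta
    q[other] += xi * alpha * delta_fictive     # xi * alpha is the fictive learning rate
    return delta, delta_fictive


# Example: a reward for "left" raises Q(left) and lowers Q(right).
q_full = {"left": 0.0, "right": 0.0}
du_update(q_full, "left", "right", outcome=+1, alpha=0.5, rho=2.0)

q_weighted = {"left": 0.0, "right": 0.0}
du_update(q_weighted, "left", "right", outcome=+1, alpha=0.5, rho=2.0, xi=0.5)
```

Because the unchosen value moves only a fraction of the way toward the opposite outcome, a reward lowers the expected value of the alternative without implying a certain punishment.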
Table 2
Computational learning models. Single-update (SU), double-update (DU) models, and Hidden Markov Models (HMMs) use various combinations of free parameters. The potential scale reduction factor (PSRF) values inform about the convergence of the Markov Chain Monte Carlo (MCMC) algorithm. The minimum deviance information criterion (DIC) value (written in bold) designates the most parsimonious model. DIC values are reported for all behavioral data including (DIC_ALL) and excluding the poorly-fitted subjects (DIC_Fit > Chance). α: learning rate, ρ: reinforcement sensitivity, ξ: fictive weight, τ: transition probability, φ: outcome probability. The parameters, which take different values according to the valence of the outcome, are marked with subscripts r for reward and p for punishment.
| Model | Free parameters | PSRF | DIC_ALL | DIC_Fit > Chance |
|---|---|---|---|---|
| SU1 | α, ρ | 1.03 | 11,178 | 8260 |
| SU2 | α, ρ_r, ρ_p | 1.01 | 10,783 | 7888 |
| SU3 | α_r, α_p, ρ | 1.26 | 10,819 | 7910 |
| DU1 | α, ρ | 1.05 | 10,493 | 7588 |
| DU2 | α, ρ_r, ρ_p | 1.04 | **10,015** | **7141** |
| DU3 | α_r, α_p, ρ | 1.02 | 10,067 | 7184 |
| DU4 | α, ξ, ρ | 1.01 | 10,515 | 7608 |
| DU5 | α, ξ, ρ_r, ρ_p | 1.01 | 10,025 | 7150 |
| HMM1 | τ, φ | 1.04 | 10,331 | 7461 |
| HMM2 | τ, φ_r, φ_p | 1.01 | 10,049 | 7212 |
In all of the SU and DU models, action probabilities p(a) were calculated from the expected reward values of the options using a sigmoid action selection rule (Eq. (6)), where σ(z) = 1 / (1 + exp(−z)) is the sigmoid function. The noise temperature parameter β in Eq. (6) controls the level of stochasticity in action selection. Adjusting the reward and punishment sensitivity parameters in SU and DU models is an alternative way to modify β, which was therefore set to 1 to avoid overparameterization. The indecision point c in Eq. (6) determines the point on the sigmoid function at which both choices are equally likely to be selected. No bias was found in the choices when reward values of options were equal (paired-sample t-test, t(59) = −0.944, p = 0.348). Hence, c was fixed to 0.
The third group of learning models implemented in this study was Hidden Markov Models (HMMs), which assume that subjects construct a state-based representation of the task via probabilities that determine contingency changes and the outcome that would arise from selecting an action (Hampton et al., 2006, Schlagenhauf et al., 2014).
HMMs assign a prior belief probability to each action b(a), which indicates the subjective belief that an action a is correct, i.e. associated with the higher reward contingency. At the end of each trial, upon each new outcome, prior probabilities are updated to a posterior belief probability via Bayes' rule (Jordan, 1998) (see the Supplementary material for the implementation details of the HMM). An important feature of the HMM is that updating of the posterior belief probabilities does not involve computations of PEs. In contrast to SU and DU models, the HMM uses the outcome information as evidence to simultaneously update the belief probabilities of all possible actions. The amount of change in the prior belief made by an outcome is called “Bayesian surprise” (Itti and Baldi, 2005), which can only be computed after belief updating. On the other hand, PEs in RL models are calculated at the time of the outcome presentation and directly used in learning (Barto et al., 2013).
The HMM captures the reversal nature of the task with a free parameter called transition probability (τ), which governs the transitions among the belief states. The probability with which an outcome can be obtained in a particular belief state is represented by a free parameter called outcome probability (φ). Analogous to the distinct reward and punishment sensitivities defined in the SU and DU models, the outcome probability parameter was also allowed to take two different values according to the valence of the outcome. Reward probability (φ_r) represents the likelihood of getting a reward given that the subject is in a correct belief state, whereas punishment probability (φ_p) accounts for receiving a punishment given that the subject is in an incorrect belief state. We tested two versions of the HMM. The first version assumes that the chance of getting a reward from a ‘correct’ belief state equals the chance of receiving a punishment from an ‘incorrect’ belief state.
The second version allows outcome probabilities to take different values according to the valence of the outcome (see HMM1 and HMM2 in Table 2).
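The belief update of the HMMs can be illustrated with a generic two-state Bayesian filter. This is a sketch under assumptions, not the implementation from the Supplementary material: the two hidden states are taken to be "option A is correct" and "option B is correct", the outcome likelihoods are built from φ_r and φ_p as described above, and the transition step is applied after the outcome update. The function and argument names are hypothetical.

```python
def hmm_belief_update(belief_a: float, chose_a: bool, rewarded: bool,
                      tau: float, phi_r: float, phi_p: float) -> tuple:
    """One trial of belief updating; returns (posterior, prior for the next trial).

    belief_a -- prior belief that option A is the correct option
    tau      -- probability that the correct option switches between trials
    phi_r    -- P(reward | chosen option is correct)
    phi_p    -- P(punishment | chosen option is incorrect)
    Assumes 0 < phi_r, phi_p < 1 so the normalizing constant is nonzero.
    """
    def likelihood(choice_is_correct: bool) -> float:
        if choice_is_correct:
            return phi_r if rewarded else 1.0 - phi_r
        return 1.0 - phi_p if rewarded else phi_p

    like_a = likelihood(chose_a)      # outcome likelihood if A is correct
    like_b = likelihood(not chose_a)  # outcome likelihood if B is correct

    # Bayes' rule: the outcome updates the beliefs about both options at once.
    evidence = like_a * belief_a + like_b * (1.0 - belief_a)
    posterior_a = like_a * belief_a / evidence

    # Allow for a possible reversal before the next trial.
    next_prior_a = (1.0 - tau) * posterior_a + tau * (1.0 - posterior_a)
    return posterior_a, next_prior_a


# Example: a reward after choosing A from a flat prior (HMM1-style, phi_r = phi_p).
posterior, next_prior = hmm_belief_update(0.5, True, True, tau=0.1, phi_r=0.8, phi_p=0.8)
```

Note that no prediction error appears anywhere in this update: the outcome enters only as a likelihood, which is the contrast with the SU and DU models drawn in the text.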
Model fitting and model comparison
Individual and group parameters were simultaneously estimated in terms of probability distributions using a hierarchical Bayesian inference method. We adopted this method because it provides a principled approach for tackling optimization problems such as numerical stability and estimations at parameter boundaries, which often occur in the modeling of choice data (Daw, 2011, Wagenmakers et al., 2008, Wetzels et al., 2010).
A Bayesian graphical model was created for each candidate learning model in JAGS (Plummer, 2003). A Markov Chain Monte Carlo (MCMC) algorithm called Gibbs sampling was used to sample from the parameter distributions of the models. Three chains of 100,000 samples were generated. To reduce autocorrelation between the MCMC samples, only every 5th sample was retained. The first 5000 samples from each chain were discarded for burn-in, leaving 19,000 samples per chain. Group prior distributions were only weakly informed to keep the estimated parameters in a reasonable range [Uniform(0,1) for learning rate, fictive weight parameter and all parameters of the HMMs; Uniform(0,20) for reward and punishment sensitivities]. To assess convergence, we visually inspected the MCMC chains of each parameter to check whether they stabilized in the same region of the sample space (Gelman and Rubin, 1992a, Gelman and Rubin, 1992b). We also calculated the Gelman-Rubin convergence diagnostic and reported the potential scale reduction factor (PSRF) of each model (Gelman and Rubin, 1992a).
To select the best-fitting model, we scored each model with the deviance information criterion (DIC) (Spiegelhalter et al., 2002, see Supplementary material for details). The model with the smallest value was selected as the best-fitting model.
Furthermore, to assess the level of improvement provided by the best-fitting model over the null model in predicting subjects' choices, we calculated the pseudo-R2 values for each subject, as described elsewhere (Camerer and Ho, 1999, Daw, 2011), using the posterior means of individual parameter distributions. We used pseudo-R2 values to single out the subjects whose behavior could not be predicted by the best-fitting model significantly better than near chance level. Based on our previous reports (Schlagenhauf et al., 2014), the threshold for near chance level was set to p ≤ 0.55, which corresponds to pseudo-R2 ≤ 0.1375. This threshold value was selected to make sure that our results were not confounded by poor model fits, because we believe that the behavioral data of subjects whose model fits are close to chance level should be treated with caution due to the probabilistic nature of model fitting. Although not reported in this article because of space limitations, we also used two other threshold values, 0.50 and 0.52, to confirm that our results were not sensitive to the specific threshold value selected.
After exclusion of these poorly-fitted subjects, we performed the model selection analysis once more to confirm that the results were not confounded by poor model fit. Model comparison was also applied separately to the choice data of HC and ADP.
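The correspondence between p = 0.55 and pseudo-R2 = 0.1375 can be checked with a few lines of Python. The sketch assumes the common definition pseudo-R2 = 1 − L/R, where L is the log-likelihood of the observed choices under the model and R is the log-likelihood under a null model that picks either of the two options with probability 0.5; the function name is hypothetical.

```python
import math


def pseudo_r2(choice_probs):
    """Pseudo-R2 of a model relative to a chance-level null model.

    choice_probs: probability the model assigned to each choice the subject
                  actually made (two-option task, so chance is 0.5).
    """
    ll_model = sum(math.log(p) for p in choice_probs)
    ll_null = len(choice_probs) * math.log(0.5)
    return 1.0 - ll_model / ll_null


# A model that predicts every observed choice with probability 0.55.
print(round(pseudo_r2([0.55] * 160), 4))
# A model at exactly chance level gives a pseudo-R2 of 0.
print(pseudo_r2([0.5] * 160))
```

Because both log-likelihoods scale with the number of trials, the value for a constant per-trial probability does not depend on how many trials are entered.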
Comparison of the model parameters between the two groups
For each model parameter, we computed the differences between the samples of the two groups (HC > ADP) at each step of the MCMC chains. We then plotted these sample differences in a histogram. The null hypothesis (H0) was rejected when the value zero, which indicates no group difference, fell outside the 95% high-density interval (HDI), i.e. the interval spanning 95% of the histogram (Kruschke, 2010).
We also examined the relationship between the parameters of the best-fitting model and clinical questionnaire scores. We used the scores on the ADS, the average craving subscale of the OCDS, and the LDH to divide ADP into two subgroups at the median values. In the first analysis, we estimated and compared the model parameters of the “severely affected” (18 subjects, ADS range: 15–36, i.e. ≥ 15) and the “less severe” (16 subjects, ADS range: 1–14, i.e. < 15) ADP. We used the same parameter comparison technique described above. Similarly, we categorized ADP into “high craving” (19 ADP, OCDScraving range: 10–100, i.e. ≥ 10) and “low craving” (15 subjects, OCDScraving range: 0–5, i.e. < 10) groups using the median score on the OCDScraving. Finally, taking the same approach, we compared the group parameters of the “high consumers” (17 ADP, range: 58.56–999 l, i.e. ≥ 57 l) and the “low consumers” (17 ADP, range: 2.10–55.44 l, i.e. < 57 l), which were specified according to the median score on the LDH questionnaire. We also verified the results by testing the correlation between the posterior means of individual parameter distributions of ADP and their clinical scores.
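The comparison procedure can be sketched as follows: take the sample-wise difference of two MCMC chains, find the narrowest interval containing 95% of the differences, and check whether zero lies outside it. This is a simplified stand-in for the R code by Kruschke (2010) that the study adapted; the function names and the toy sample values are hypothetical.

```python
import statistics


def hdi(samples, mass=0.95):
    """Narrowest interval that contains the given share of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, round(mass * n))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def compare_groups(hc_samples, adp_samples, mass=0.95):
    """Return (mean difference, HDI of HC - ADP, whether zero is excluded)."""
    diffs = [h - a for h, a in zip(hc_samples, adp_samples)]
    low, high = hdi(diffs, mass)
    zero_excluded = not (low <= 0.0 <= high)
    return statistics.fmean(diffs), (low, high), zero_excluded


# Toy posterior samples (made-up values, not the study's chains).
hc = [1.0 + 0.01 * i for i in range(100)]
adp = [0.0] * 100
mean_diff, interval, reject_h0 = compare_groups(hc, adp)
print(mean_diff, interval, reject_h0)
```

Unlike an equal-tailed interval, the HDI shifts toward the denser side of a skewed difference distribution, which is why it is searched for rather than read off fixed percentiles.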
Learning curves
A learning curve visualizes the adaptation of choice behavior to the reversals of reinforcement contingencies. Average learning curves of HC, ADP, and the poorly-fitted subjects were constructed by plotting the mean correct responses as a function of trial number. Choosing the stimulus with the higher reward probability was considered a correct response. The number of trials was limited to ten because blocks consisted of a minimum of ten trials. We performed a 3 × 10 ANCOVA to compare the learning curves of the groups. The mean correct responses of the subjects at each trial after contingency reversals were defined as the dependent variable. The between-subjects factor group had three levels for HC, ADP, and the poorly-fitted subjects, whereas the within-subjects factor trial had ten levels, one for each trial after the reversals. Smoking status was included as a nuisance variable.
A successful learning model should capture and replicate the characteristics of behavioral data. To test if the best-fitting learning model fulfilled this criterion, we examined whether surrogate learning curves matched the actual learning curves of the subjects. Surrogate learning curves were constructed by plotting the mean performance of simulated data generated by letting the best-fitting model, with parameters fitted to the individual subjects, perform the task 100 times. Poorly-fitted data were excluded from this analysis. Surrogate data were then compared using a 2 × 10 group × trial ANCOVA.
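A minimal sketch of how such a reversal-aligned learning curve can be computed is given below. It pools blocks directly and ignores the per-subject averaging and the covariate used in the actual analysis; the function name and the toy data are hypothetical.

```python
def learning_curve(blocks, n_trials=10):
    """Mean proportion of correct responses at each trial after a reversal.

    blocks: list of blocks; each block is a list of 0/1 flags (1 = the option
            with the higher reward probability was chosen), ordered from the
            first trial after the reversal.
    """
    # Only the first n_trials of each block are used, so that every block
    # contributes to every point of the curve.
    usable = [b for b in blocks if len(b) >= n_trials]
    return [sum(b[t] for b in usable) / len(usable) for t in range(n_trials)]


# Toy data: three blocks of different lengths (made-up responses).
blocks = [
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
]
print(learning_curve(blocks))
```

The same function can be applied to simulated choices to obtain surrogate curves for comparison with the observed ones.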
FMRI data acquisition and preprocessing
Imaging was performed using a 3 Tesla GE Signa scanner with a T2*-weighted sequence (29 slices with 4 mm thickness; repetition time, 2.3 s; echo time, 27 ms; flip, 90°; matrix size, 128 × 128; field of view, 256 × 256 mm²; in-plane voxel resolution of 2 × 2 mm²) and a T1-weighted structural scan (repetition time, 7.8 ms; echo time, 3.2 ms; flip, 20°; matrix size 256 × 256; 1 mm slice thickness; voxel size of 1 mm³).
Functional imaging data were analyzed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). The first three volumes of each session were discarded. Volumes were corrected for the delay of slice time acquisition and motion. They were spatially normalized into MNI (Montreal Neurological Institute) space and were spatially filtered with a Gaussian kernel (8 mm full width at half maximum). Imaging data of 6 subjects (3 ADP due to motion artifacts and 3 HC due to susceptibility artifacts) were discarded. The region of interest (ROI) analyses were performed using the Marsbar toolbox in SPM (Brett et al., 2002).
FMRI data analysis
FMRI data were analyzed in an event-related manner using a general linear model approach with two levels. At the first level, reward and punishment events were modeled by stick functions at the onset of the outcome. Trial-by-trial PE time-series were computed using the best-fitting learning model. Similar to outcome events, PE signals were also grouped into positive and negative PEs and included in the GLM as parametric modulators for reward and punishment conditions. Trials without a response were modeled separately. All regressors were convolved with the canonical hemodynamic response function as provided by SPM. The movement parameters from the realignment process were included as regressors of no interest.
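The construction of the PE regressors can be illustrated with a small Python sketch. The update equations of the double-update model are defined outside this section, so the rule used here is an assumption: a single learning rate, outcomes coded as +1 (reward) or −1 (punishment) and scaled by the reward or punishment sensitivity, and a mirrored fictive update for the unchosen option. All names are hypothetical.

```python
def double_update_pes(choices, outcomes, alpha, rho_r, rho_p):
    """Trial-by-trial PEs of the chosen option under an assumed DU rule.

    choices:  sequence of 0/1 (index of the chosen option)
    outcomes: sequence of +1 (reward) or -1 (punishment)
    alpha:    learning rate
    rho_r:    reward sensitivity
    rho_p:    punishment sensitivity
    """
    q = [0.0, 0.0]  # expected values of the two options
    pes = []
    for choice, outcome in zip(choices, outcomes):
        # Outcome scaled by the valence-specific sensitivity (assumption).
        scaled = (rho_r if outcome > 0 else rho_p) * outcome
        pe = scaled - q[choice]
        q[choice] += alpha * pe
        # Fictive update: the unchosen option is assumed to have yielded
        # the opposite outcome (anti-correlated contingencies).
        other = 1 - choice
        q[other] += alpha * (-scaled - q[other])
        pes.append(pe)
    return pes


def split_by_sign(pes):
    """Split PEs into (trial index, value) lists for [+]PE and [-]PE."""
    positive = [(t, pe) for t, pe in enumerate(pes) if pe > 0]
    negative = [(t, pe) for t, pe in enumerate(pes) if pe < 0]
    return positive, negative


pes = double_update_pes([0, 0, 1], [1, -1, 1], alpha=0.5, rho_r=2.0, rho_p=1.0)
pos_pe, neg_pe = split_by_sign(pes)
print(pes, pos_pe, neg_pe)
```

In an SPM-style design, the two lists would serve as parametric modulators of the reward and punishment onsets respectively, before convolution with the hemodynamic response function.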
Reward and punishment representations in the brain
“Reward vs. punishment” and “punishment vs. reward” contrast images were generated for each subject and taken to the second level. For each contrast image, we performed a random-effects group-level analysis with a one-sample t-test across the entire sample. We also compared groups with a two-sample t-test. FMRI results were reported as significant at p ≤ 0.05 family wise error (FWE) whole-brain corrected at the voxel level.
Neural correlates of the reward and punishment sensitivities
To investigate how the reinforcement sensitivity parameters estimated in behavioral modeling were correlated with reward- and punishment-related BOLD responses across all subjects, we used two independent linear regression models at the second level of the fMRI analysis. The first regression model included the “reward vs. punishment” contrast images as the dependent variable and the reward sensitivities of the subjects as the covariate of interest. Likewise, the second regression model included the “punishment vs. reward” contrast images as the dependent variable, and the punishment sensitivities of the subjects as the covariate of interest. Results were reported as significant at p < 0.05, small-volume corrected (SVC), within the brain regions that showed significant “reward vs. punishment” activity (for reward sensitivity) or “punishment vs. reward” activity (for punishment sensitivity) across all subjects at p < 0.05 FWE whole-brain corrected.
Neural correlates of prediction errors
We performed a parametric model-based fMRI analysis to examine the differences between HC and ADP in the neural correlates of PEs. PEs were calculated using the mean values of the posterior parameter distributions estimated for each individual. Single-subject contrast images of the parametric modulators positive PE ([+]PE) and negative PE ([−]PE) were taken to a 2 × 2 repeated measures ANOVA (flexible factorial design in SPM) with a between-subjects factor group (HC vs. ADP) and a within-subjects factor PE type (positive vs. negative). A subject factor was also included in SPM to model subject constants. We tested the following contrasts: (1) the PE-related activity across all subjects, (2) the between-group difference in the PE-related activity, (3) the between-group difference in the [+]PE-related activity, (4) the between-group difference in the [−]PE-related activity. FMRI results were reported as significant at p ≤ 0.05 FWE whole-brain corrected at the voxel level.
Results
ADP met the learning criterion (70% correct responses during a maximum block length of 16 trials) less often than HC (t58 = 1.99, p = 0.05; M = 11.15, SD = 3.86; M = 9.176, SD = 3.76), completing the task with a significantly lower number of correct choices (t58 = 2.586, p = 0.012, M = 136.15, SD = 6.14; M = 131.35, SD = 7.78). A 2 × 2 group × outcome valence ANCOVA with response times as the dependent variable showed no significant main effect of group (F(1,58) = 0.002, p = 0.960) or outcome valence (F(1,58) = 2.986, p = 0.089), and no significant group × outcome valence interaction (F(1,58) = 1.031, p = 0.314).
We also tested for between-group differences in win-stay and lose-shift behavior. A logistic regression analysis estimated the beta parameters of the four win-stay regressors and the four lose-shift regressors. These regressors model the extent to which subjects integrated the outcome information from the previous four trials (lag) into their decisions to stay on the same option or shift to the other option. The first 2 × 4 group × lag ANOVA with the parameter estimates of the win-stay regressors revealed no significant difference between the groups (F(1, 58) = 1.479, p = 0.229, see Fig. 1A); however, the main effect of the factor lag was significant (F(3, 174) = 7.679, p < 0.0001). No significant interaction effect was found between the factors group and lag (F(3, 174) = 0.629, p = 0.597). The second 2 × 4 ANOVA with the parameter estimates of the four lose-shift regressors indicated a significant difference between HC and ADP in lose-shift behavior (F(1, 58) = 5.971, p = 0.017, see Fig. 1B). The main effect of the factor lag (F(3, 174) = 2.694, p < 0.047), as well as the interaction effect between the factors group and lag (F(3, 174) = 0.966, p = 0.410), were not significant.
Fig. 1
Win-stay/lose-shift analysis. Parameter estimates of the (A) win-stay and (B) lose-shift regressors of the 4 trials into the past. “t” represents the time of choice. Bars denote standard errors. Asterisk denotes statistical significance (p ≤ 0.05).
Potential scale reduction factors (PSRF) of the candidate models indicated that the MCMC algorithm converged for each model (see the PSRFs calculated for each model in Table 2). The model comparison analysis based on the DIC scores of the models showed that compared to the SU models, the DU models and HMMs provided superior fits to behavioral data, supporting the assumption that subjects inferred and utilized the knowledge that reward contingencies on two options are fully anti-correlated. The DU2 model (the DU model with equal reward and punishment learning rates, but distinct reward and punishment sensitivities) was selected as the best model as it fitted the behavioral data of all subjects better than the other candidate models (Fig. 2A and Table 2). The DU5 model was another candidate model with a similar DIC score. However, this model can easily be reduced to the DU2 model, because the estimated fictive weight parameter was found to be approximately equal to 1 (equal learning rates for actual and fictive outcomes).
Fig. 2
Model comparison. (A) The DIC scores of the candidate learning models for all subjects. The most parsimonious model, the DU2 model (plotted with a patterned bar) has the lowest DIC score. (B) The pseudo-R2 values of the subjects show the relative improvement in model-fitting provided by the DU2 model over the null model. The DU2 model was not able to predict the behavioral data of 11 subjects (5 HC and 6 ADP; marked by asterisks) better than near chance level which is marked with a horizontal dotted line at the pseudo-R2 = 0.1375 (corresponding to p = 0.55). (C) The DIC scores of the candidate learning models for all subjects fitted above the near chance level. α: learning rate, ρ: reinforcement sensitivity, ξ: fictive weight, τ: transition probability, φ: outcome probability. Parameters, which take different values according to the valence of the outcome, are marked with subscripts r for reward and p for punishment.
The pseudo-R2 values computed for each subject revealed that the DU2 model was not able to predict the behavioral data of 5 HC and 6 ADP better than near chance level (Fig. 2B). To make sure that these poorly-fitted data did not confound the model comparison results, we repeated the model selection analysis for the well-fitted subjects only. We found that the DU2 model once again explained the behavioral data significantly better than the other models (Fig. 2C and Table 2). None of the candidate models were able to predict the behavioral data of the poorly-fitted subjects better than near chance level.
Additionally, when we repeated the analysis separately for each subject group, the DU2 model provided a parsimonious fit for HC; whereas, when only ADP were considered, the HMM2 provided a slightly better fit than the DU2 model (see Supplementary Fig. 1).
The HMM requires a complete model of the environment, which may be seen as a strong assumption about learning given that subjects were not given the chance to practice the task beforehand (the orientation version did not involve reversals). On the other hand, the DU model can handle the stochastic transitions and rewards of this task without constructing a model of the environment. As a matter of fact, a comparison of the surrogate learning curves generated by these models revealed that both models were able to predict the behavioral data of both groups statistically alike (see Supplementary material for a comparative analysis). This similarity is also consistent with recent studies which did not perform any model comparison analysis and used the DU model based on the assumption that this model provides a good approximation of the HMM for this task design while being more parsimonious (Glaescher et al., 2009, Hampton et al., 2007).
Also, our model selection was motivated by our study's hypotheses. In this study, we were particularly interested in addiction-related changes in reward-based learning guided by PEs. However, learning in HMMs does not involve computations of PE signals. Therefore, we selected the DU2 model as the best-fitting model for all subjects and used this model to derive the PEs for the subsequent model-based fMRI analysis.
The posterior group parameter distributions of HC and ADP (see Table 3) were approximated using the DU2 model with converged MCMC samples (PSRF = 1.04, see Table 2). For each parameter of the DU2 model, parameter comparison between HC and ADP was performed by computing the differences between the samples of the two groups (HC > ADP) and plotting these differences as histograms (Fig. 3A). The null hypothesis H0 of “no group difference (HC − ADP = 0)” was rejected only for the punishment sensitivity parameter, as the value zero fell outside the 95% HDI (0.09–1.35) of the histogram.
The positive value of the mean difference (HC − ADP = 0.705) indicates that ADP had significantly lower punishment sensitivities compared to HC. The result remained unchanged when the analysis was repeated only for the well-fitted subjects (mean difference = 0.906, 95% HDI = 0.24–1.58). On the other hand, neither the learning rates, nor the reward sensitivities showed differences between groups (learning rate: mean difference = 0.003, 95% HDI = − 0.14–0.15; reward sensitivity: mean difference = 0.265, 95% HDI = − 0.61–1.17).
Fig. 3
Histograms of parameter differences. A. Between-group comparisons in DU2 model parameters indicate lower punishment sensitivity in ADP. B. Parameter comparison between less severe (LO) and severely affected (HI) ADP indicates greater reward sensitivity in severely affected ADP. Mean values of the histograms are shown with solid black lines. The point of no group difference is marked with a red dashed line. 95% of the distributions are found within arrows. HDI: High-density interval. μα: group parameter distribution for learning rate, μρR: group parameter distribution for reward sensitivity, μρP: group parameter distribution for punishment sensitivity. The figure is generated by adapting the R code originally created by Kruschke (2010).
Table 3
Summary table of the DU2 model's estimated parameters (mean ± SD). N: sample size, α: learning rate, ρr: reward sensitivity, ρp: punishment sensitivity.
To examine the relationship between the parameters of the DU2 model and the clinical questionnaire scores, we ran additional model fitting analyses within ADP. First, we sought to determine whether the severity of alcohol dependence (assessed with ADS) was related to the parameters of the DU2 model. The posterior parameter distributions of the less severe (LO) and the severely affected (HI) ADP were approximated using MCMC samples. Difference distributions, which were computed by subtracting the parameter distributions of the severely affected ADP from those of the less severe ADP, were plotted as difference histograms (Fig. 3B). The reward sensitivity parameter was found to differ significantly between these subgroups, as the value zero indicating no difference was outside the 95% HDI (−2.31 to −0.147) of the histogram. The negative mean difference (less severe − severely affected = −1.21) indicated that the severely affected ADP had significantly higher reward sensitivities relative to the less severe ADP (MLO = 1.419, SDLO = 0.828; MHI = 2.633, SDHI = 1.672).
Also, a significant positive correlation was found between the posterior means of individual reward sensitivity distributions of ADP and their ADS scores (Pearson's r = 0.482, p = 0.005).
There was no significant difference between the low craving and the high craving ADP, or between the low consumers and the high consumers.
We constructed the average learning curves of HC, ADP, and the poorly-fitted subjects by plotting the mean correct responses as a function of trial number (bold curves in Fig. 4). Learning curves were then compared using a 3 × 10 group × trial ANCOVA, which showed a significant main effect of group (F(2, 57) = 6.27, p = 0.003), a significant main effect of trial (F(3.90, 222.64) = 46.46, p < 0.001, Greenhouse–Geisser corrected) and a significant group × trial interaction (F(7.81, 222.64) = 6.351, p < 0.001, Greenhouse–Geisser corrected). When the analysis was repeated for the well-fitted HC and ADP only, the main effects of group (F(1, 47) = 5.378, p = 0.002) and trial (F(3.21, 151.01) = 82.707, p < 0.001, Greenhouse–Geisser corrected) remained significant, whereas the significant group × trial interaction effect disappeared (F(3.21, 151.01) = 1.042, p = 0.404, Greenhouse–Geisser corrected). Post hoc two-sample t-tests revealed a significant difference between the mean correct responses of HC and ADP at the 5th trial after reversal, at which both groups reached their highest performance (t47 = 2.894, p = 0.028, Holm–Bonferroni corrected, M = 92.09%, SD = 9.27%; M = 81.53%, SD = 14.64%).
Fig. 4
Learning curves of HC, ADP, and poorly-fitted subjects. Correct responses (selection of the stimulus with higher reward probability) were averaged over blocks of 10 trials for actual (solid lines) and simulated data (dashed lines). Individually estimated parameters of the DU2 model were used for simulations. Shaded regions denote standard errors.
Next, we tested whether surrogate learning curves generated by the DU2 model followed the actual learning curves of the subjects. Specifically, we were interested in whether the difference in the punishment sensitivities of the DU2 model (when fitted to the ADP and HC) translated into a difference in learning curves. First, we generated surrogate choice data. DU2 models with parameters fitted to the individual subjects performed the task (100 times per model). Second, we averaged the correct responses and constructed the surrogate learning curves for HC, ADP and poorly-fitted subjects (dashed curves in Fig. 4). Finally, we compared the surrogate learning curves using a 2 × 10 group × trial ANCOVA. Poorly-fitted data, as well as the data recorded during blocks with L: 50%–R: 50% reward contingencies, were excluded from this analysis. The ANCOVA showed a significant main effect of group (F(1, 47) = 6.95, p = 0.011) and a significant main effect of trial (F(2.26, 106.59) = 139.19, p < 0.001, Greenhouse–Geisser corrected). The group × trial interaction was not significant (F(2.26, 106.59) = 1.81, p = 0.064). Post hoc t-tests revealed a significant group difference in the mean correct responses at the 4th trial after the reversal (t47 = 3.244, p = 0.01, Holm–Bonferroni corrected, M = 87.62%, SD = 7.99%; M = 78.37%, SD = 11.06%) in addition to the 5th trial after the reversal (t47 = 3.273, p = 0.01, Holm–Bonferroni corrected, M = 91.43%, SD = 5.88%; M = 83.18%, SD = 10.35%).
Hence, replication of the between-group difference in learning curves using the simulated data confirmed the significant association found between the decrease in the punishment sensitivity and the impaired behavioral adaptation of ADP.
Learning curve analysis was not affected by the selection of the near chance threshold, as both values yielded comparable results.
FMRI analysis
Across all subjects, compared to punishments, rewards elicited a significant BOLD response in the bilateral posterior cingulate cortex, the bilateral precuneus, and the medial orbitofrontal cortex. Additionally, the left middle/superior PFC and the right putamen displayed an increased activity for reward vs. punishment (see Supplementary Fig. 2A and Supplementary Table 1). On the other hand, a significant activation in response to punishments relative to rewards was observed bilaterally in the anterior insula/inferior PFC, the dorsal anterior cingulate cortex (ACC), and the pre-SMA (see Supplementary Fig. 2B and Supplementary Table 1). Two-sample t-tests revealed no significant between-group difference in the reward vs. punishment or punishment vs. reward activity (p ≥ 0.001 uncorrected).
We also sought to probe whether there are neural correlates of reward and punishment sensitivity parameters. A linear regression performed at the second level of the fMRI analysis, which examined the correlation between “punishment vs. reward” activity and punishment sensitivity parameter, revealed a significant positive correlation across all subjects in the right insula/inferior PFC (MNI [x y z] = [32 21 5]; k = 11; t52 = 3.80; pFWE voxel (SVC) = 0.024; Fig. 5). On the other hand, no significant correlation was found between “reward vs. punishment” activity and reward sensitivity parameter (p ≥ 0.001 uncorrected).
Fig. 5
Neural correlates of punishment sensitivity. The “punishment > reward” activity in the right insula is positively correlated with punishment sensitivity parameter of the best-fitting learning model. A scatter plot of the log-transformed punishment sensitivities vs. the mean parameter estimates of the punishment-related activity in the R insula (circled area) is also shown.
Across all subjects, neural correlations of model-derived PE were found bilaterally in the VS, the middle, superior and inferior prefrontal cortices, the ACC, the midbrain, the globus pallidi, the middle temporal lobules, as well as in the left insula, the left supramarginal gyrus, the right inferior parietal lobule, the right precuneus and the right cerebellum (see Supplementary Fig. 3 and Supplementary Table 2).
Among these regions, the contrast HC > ADP showed a significant between-group difference in the PE-related activity in the bilateral DLPFC (right: MNI [x y z], [40 33 43], t52 = 5.831, pFWE peak voxel (whole-brain) = 0.005; left: [−41 18 53], t52 = 5.488, pFWE peak voxel (whole-brain) = 0.014), the bilateral dorsal premotor areas (right: [25 8 63], t52 = 6.081, pFWE peak voxel (whole-brain) = 0.002; left: [−41 11 53], t52 = 5.23, pFWE peak voxel (whole-brain) = 0.032), and the right intraparietal sulcus (IPS) ([42 −62 43], t52 = 6.112, pFWE peak voxel (whole-brain) = 0.002) (Fig. 6A and Table 4). Striatal activity related to PE did not differ between the two groups (p ≥ 0.001 uncorrected). Furthermore, the reverse contrast, ADP > HC showed no significant difference (p ≥ 0.001 uncorrected). In order to address the concern that group differences observed in the DLPFC might be confounded by the individual differences in the model-fits, we repeated the 2nd level analysis only for the well-fitted subjects.
The differences between HC and ADP in the PE-related activity remained significant in the left and the right DLPFCs (left DLPFC: [− 23 6 43], t43 = 5.15, pFWE peak voxel (whole-brain) = 0.05; right DLPFC: [35 38 23], t43 = 5.16, pFWE peak voxel (whole-brain) = 0.05).
Fig. 6
Impaired PE-related activity in ADP. Group differences (HC > ADP) in the neural correlations of (A) total prediction error (PE) (both positive and negative), (B) negative prediction error ([−]PE), (C) positive prediction error ([+]PE). A threshold of p = 0.001 uncorrected with an extent threshold of 20 voxels is used for visualization (corresponds to t > 3.31). The color bar represents t-values. Bar plots show the beta estimates of the parametric modulators (D) [−]PE and (E) [+]PE extracted from the peak coordinates [− 33 8 50] and [42 36 35] showing significant group × PE type interaction effect. Asterisks denote statistical significance. Error bars indicate standard errors.
Table 4
Model-based fMRI analysis results. Between-group differences (HC > ADP) in the neural correlates of the prediction error (PE), the positive PE and the negative PE. BA: Brodmann Area, k: cluster size at p < 0.001 uncorrected, FWE (whole-brain): FWE whole-brain corrected at the voxel level, MNI: Montreal Neurological Institute, HC: healthy controls, ADP: alcohol-dependent patients, PFC: prefrontal cortex, R: right, L: left.
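| Region | BA | k | pFWE voxel (whole-brain) | t | MNI x | MNI y | MNI z |
|---|---|---|---|---|---|---|---|
| PE (positive & negative), HC > ADP | | | | | | | |
| R Superior PFC | 6 | 529 | 0.002 | 6.081 | 25 | 8 | 63 |
| R Middle PFC | 46 | | 0.005 | 5.831 | 40 | 33 | 43 |
| | 9 | | 0.028 | 5.280 | 27 | 23 | 45 |
| L Middle PFC | 9 | 251 | 0.014 | 5.488 | −41 | 18 | 53 |
| | 9 | | 0.032 | 5.230 | −41 | 11 | 53 |
| | 9 | | 0.070 | 4.950 | −33 | 13 | 53 |
| R Angular gyrus | 39 | 530 | 0.002 | 6.112 | 42 | −62 | 43 |
| | 7 | | 0.073 | 4.934 | 27 | −80 | 48 |
| | 7 | | 0.095 | 4.840 | 17 | −72 | 50 |
| Positive PE, HC > ADP | | | | | | | |
| R Middle PFC | 46 | 176 | 0.032 | 5.218 | 40 | 33 | 40 |
| Negative PE, HC > ADP | | | | | | | |
| L Middle PFC | 9 | 339 | 0.025 | 5.298 | −38 | 11 | 53 |
| | 8 | | 0.050 | 5.067 | −28 | 11 | 50 |
| R Superior PFC | 6 | 34 | 0.041 | 5.135 | 25 | 8 | 65 |
| R Angular gyrus | 39 | 162 | 0.072 | 4.941 | 42 | −65 | 45 |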
We also analyzed the effect of PE type (positive vs. negative) on the neural correlates of PE. PE was grouped into [+]PE and [−]PE according to whether the obtained outcome is better ([+]PE) or worse ([−]PE) than the expected outcome. The contrast “HC > ADP” showed a hemispheric asymmetry in the DLPFC activation for the between-group differences such that a significant decrease in the [−]PE-related activity ([−38 11 53], t52 = 5.298, pFWE peak voxel (whole-brain) = 0.026) was observed in the left DLPFC (Fig. 6B and Table 4). On the other hand, reduced [+]PE-related activity in ADP was found in the right DLPFC ([40 33 40], t52 = 5.218, pFWE peak voxel (whole-brain) = 0.033) (Fig. 6C and Table 4).
This unanticipated asymmetry in the DLPFC for negative and positive PEs prompted us to perform a post hoc ANOVA.
The interaction effect between group and PE type was tested using the contrasts “(HC vs. ADP) × ([−]PE vs. [+]PE)” and “(HC vs. ADP) × ([+]PE vs. [−]PE)”. Results were reported as significant at p < 0.05 FWE corrected for the multiple comparisons within a volume in Brodmann area 9 and 46 that shows significant PE-related activity across all subjects. The contrast “(HC vs. ADP) × ([−]PE vs. [+]PE)” revealed a significant activation in the left DLPFC ([− 33 8 50], t52 = 3.459, pFWE voxel (SVC) = 0.043, Fig. 6D and Table 5); whereas the contrast “(HC vs. ADP) × ([+]PE vs. [−]PE)” showed a significant activation in the right DLPFC ([42 36 35], t52 = 4.359, pFWE voxel (SVC) = 0.016, Fig. 6D and Table 5).
Table 5
Group × type of the prediction error (PE) interaction effects in the left and the right dorsolateral prefrontal cortices. BA: Brodmann Area, k: cluster size at p < 0.001 uncorrected, FWE voxel (SVC): FWE small volume corrected at the voxel level, MNI: Montreal Neurological Institute, HC: healthy controls, ADP: alcohol-dependent patients, [+]PE: positive prediction error, [−]PE: negative prediction error, PFC: prefrontal cortex, R: right, L: left.
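| Region | BA | k | pFWE voxel (SVC) | t | MNI (x, y, z) |
|---|---|---|---|---|---|
| Group × PE type interactions | | | | | |
| (HC vs. ADP) × ([+]PE vs. [−]PE) | | | | | |
| R Middle PFC | 46 | 13 | 0.013 | 4.359 | 42, 36, 35 |
| (HC vs. ADP) × ([−]PE vs. [+]PE) | | | | | |
| L Middle PFC | 9 | 9 | 0.034 | 3.459 | −33, 8, 50 |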
Finally, we tested whether the impairments in the [−]PE- and [+]PE-related activities in the left and right DLPFC were correlated with the clinical severity of dependence and the mean craving for alcohol as assessed with ADS and OCDScraving, respectively. We extracted the mean parameter estimates (beta estimates) of the [−]PE- and [+]PE-related activities from the clusters showing significant group differences (cluster centers at [−38 11 53] and [40 33 40]). We found that [−]PE-related activity in the left DLPFC was significantly correlated with the ADS scores of ADP (Pearson's r = −0.347, p = 0.032). This result remained significant when the poorly fitted ADP were excluded from the analysis (r = −0.494, p = 0.006). No correlation was found between the [−]PE-related activity difference in the left DLPFC and OCDScraving scores (r = −0.001, p = 0.498). Additionally, we found that [+]PE-related activity was correlated neither with ADS (r = −0.022, p = 0.454) nor with OCDScraving scores (r = −0.079, p = 0.346). None of the other subscales or the total score of OCDS were correlated with the PE-related activity in the DLPFC.
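As a rough illustration of the brain-behavior correlations reported above, the following sketch computes Pearson's r between per-subject beta estimates and severity scores, and converts r into a t statistic with n − 2 degrees of freedom. All values and variable names are hypothetical; the p-value (which would be read from the t distribution) is deliberately not computed, and nothing here reproduces the reported statistics.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical data: severity scores and [-]PE-related beta estimates
severity = [8.0, 12.0, 15.0, 19.0, 23.0, 27.0]
neg_pe_beta = [0.9, 0.7, 0.8, 0.4, 0.5, 0.1]

r = pearson_r(severity, neg_pe_beta)
t = r_to_t(r, len(severity))
```

With these made-up values, a negative r would indicate that higher severity scores go with weaker [−]PE-related activity, which is the direction of the association described in the text.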
Discussion
In this study, using a reward-guided decision-making task and a so-called “double-update” RL model, we report a relation in alcohol dependence between impaired adaptation to changes in reinforcement contingencies and decreased sensitivity to punishments. We also report a reduced correlation between the PEs derived from this DU model and the BOLD activity in the DLPFC of ADP. Moreover, we report an association between the severity of alcohol dependence and the decrease in the DLPFC activity related to negative PE signals, which play a critical role in adaptation to contingency changes by mediating the extinction of behavior that is no longer associated with reward.
ADP had difficulty adapting their responses to the changing reward contingencies of the reward-guided decision-making task, a finding consistent with the results of previous studies with subsets of our sample (13 ADP and 14 HC in Deserno et al., 2014; 20 ADP and 16 HC in Park et al., 2010). Statistical analysis of win-stay and lose-shift behavior revealed that this adaptation difficulty was related to the weakened influence of punishments on decisions to shift the response. To understand the underlying computational mechanisms of this impairment, we modeled the choice behavior of our subjects using computational learning models with different assumptions about the amount of task-related information subjects may have inferred during the experiment. In line with our expectations, we found that the DU model achieved the highest accuracy in predicting the choices of all subjects. Between-group comparisons of the free parameters of this best-fitting model revealed that ADP had significantly lower punishment sensitivity. This finding is congruent with our hypothesis and with previous reports of reduced loss sensitivity and lower decision consistency in drug abuse (Ahn et al., 2014; Bishara et al., 2009; Fridberg et al., 2010; Stout et al., 2004; Tanabe et al., 2013; Vassileva et al., 2013).
A computer simulation of behavioral data using the fitted parameters of the DU model reproduced the maladaptive behavior of ADP, further verifying the association between decreased punishment sensitivity and impaired behavioral adaptation. On the other hand, no significant group difference was found in the other parameters of the DU model, i.e. learning rate and reward sensitivity. We argue that the difference observed between HC and ADP in adapting to changes in contingencies may not be related to learning speed or to the implemented learning strategy. Slower adaptation to reversals may rather be due to the fact that ADP's choices just after reversals (when subjects receive the majority of consecutive punishments) were less affected by the action values. Therefore, our results suggest that when faced with punishment, decisions of ADP are more often replaced by random guesses, which are possibly reached in the absence of deliberation. Finally, when ADP were divided into two groups at the median ADS score, we found that relative to the “less severe” group, the “severely affected” ADP had greater reward sensitivity, showing a behavioral pattern suggestive of an increased tendency to respond actively to stimuli leading to the pursuit of rewards (Hyman, 2005).
Across all subjects, we discovered a positive correlation between the model-estimated punishment sensitivity and right anterior insula/inferior PFC activity in response to “punishment vs. reward”. In previous neuroimaging studies featuring tasks with reversals, anterior insula and inferior PFC responses have been shown to signal decreases in the expected values of selected actions and to predict the consecutive behavioral shifts (Cools et al., 2002; Ghahremani et al., 2010; Glaescher et al., 2009; Hampton et al., 2006; O'Doherty et al., 2003; Schlagenhauf et al., 2014).
Thus, our finding can be interpreted as evidence that the right anterior insula may be involved in the reduced ability of ADP to adjust choice behavior according to negative outcome experiences. As a part of the salience network, the anterior insula plays a crucial role in detecting salient events and engaging the central executive network for high-level cognitive control and attentional processing (see reviews by Menon and Uddin, 2010; Uddin, 2015). In our experiment, high performance partly depends on detecting the saliency of punishing stimuli and channeling the brain's top-down control resources via other cortical regions such as the DLPFC (Johnston et al., 2007). The significant correlation between punishment sensitivity and the activity in the right anterior insula therefore suggests that reduced punishment sensitivity in ADP may be related to a compromised detection of punishment events as salient by the right anterior insula, which may fail to trigger appropriate cognitive control signals in alcohol dependence. However, it is pertinent to point out that punishment vs. reward activity in the right anterior insula/inferior PFC did not differ between HC and ADP despite the significant difference found in punishment sensitivity. The reason for this might be that the event-related fMRI analysis per se was not able to differentiate alterations in neural activity with respect to learning or decision-making in our patient group, which motivated us to combine model-derived PEs and the fMRI data in a model-based fMRI analysis.
Model-based fMRI analysis revealed significantly lower PE-related activities in the bilateral DLPFC, the bilateral dorsal premotor areas, and the right IPS of ADP, indicating that these regions were less responsive to teaching signals that putatively facilitate behavioral adaptation.
This result accords with our hypothesis that the DLPFC is implicated in the maladaptive reward-based decision-making of ADP, given that the adaptive processes taking place in the PFC were captured by a computational learning model that incorporates task-related information into decisions. PEs, which form the basis for learning (Schultz and Dickinson, 2000), seem to evoke BOLD responses in the DLPFC of healthy subjects when they learn the associations between cues and affectively neutral outcomes in an associative learning task (Fletcher et al., 2001). Furthermore, Fletcher et al. demonstrated that the DLPFC activity was also able to predict the subsequent decisions of these subjects. Indeed, the tendency to take corrective action upon receiving an error seems to be weakened by DLPFC damage (Gehring and Knight, 2000). Similarly, transient disruption of the DLPFC activity with transcranial magnetic stimulation impairs flexible decision-making in healthy individuals (Smittenaar et al., 2013). Therefore, it is possible to interpret the observed attenuation in the PE-related DLPFC activity as a decline in the ability of ADP to select the corrective action in an environment requiring adaptive responses.
To our knowledge, this is the first fMRI study with substance-dependent patients showing reduced PE-related activity in the DLPFC. Although the DLPFC has been regarded as an important neural substrate of maladaptive decision-making in substance dependence (Eldreth et al., 2004; Ersche et al., 2005; Monterosso et al., 2007; Paulus et al., 2002), a decrease in the neural tracking of PEs in this brain region has not yet been reported by other studies with substance-dependent subjects (Chiu et al., 2008; Deserno et al., 2014; Park et al., 2010; Tanabe et al., 2013).
The primary reason might be that the PEs used in our model-based fMRI analysis were derived from a model that was selected from a pool of candidate models according to its performance in predicting behavioral data. In contrast, the previous studies cited above defined the standard Rescorla–Wagner model a priori, based on their hypotheses related to striatal PE signaling, which has been shown to be reliably predicted by this model (Pagnoni et al., 2002). To test this interpretation, we repeated the model-based fMRI analysis with the PEs derived from the standard Rescorla–Wagner model (denoted as “SU1” in the model set). Consistent with the studies mentioned above, we also observed significant PE-related signals in the bilateral VS (see Supplementary material). However, the between-group difference we found in the PE-related DLPFC activity disappeared. It is probable that the improvement provided by the DU model in explaining the computational processes underlying the choice behavior increased the capability of the model-based fMRI analysis to capture group differences in the neural correlates of these processes. Therefore, we conclude that selecting the learning model based on its performance in predicting behavioral data also improved the sensitivity of the subsequent model-based fMRI analysis.
Consistent with two previous studies with subsets of our subjects (Deserno et al., 2014; Park et al., 2010), we found intact striatal PE signaling in ADP, which suggests that action selection in ADP is inadequately informed by otherwise properly computed reward-learning signals in the reward/valuation network. It has been suggested that the DLPFC potentiates adaptive decisions by incorporating reward expectancies into decision representations (Barraclough et al., 2004; Christakou et al., 2009; Gold and Shadlen, 2001; Kim and Shadlen, 1999; Sugrue et al., 2005; Wallis and Miller, 2003).
Consistent with this idea, simultaneous recordings from the caudate nucleus (a limbic brain structure known to encode PEs) and the lateral PFC of monkeys during a reversal learning task showed that, in addition to encoding PE, the lateral PFC activity also predicts the forthcoming responses (Asaad and Eskandar, 2011). Therefore, intact striatal but reduced DLPFC activity correlated with PE suggests an ineffective integration of reward-related information in the DLPFC of ADP, which may result in the selection of choices that are loosely coupled with the recently updated contingencies of the environment (Park et al., 2010; Sakagami and Watanabe, 2007).
Another way to interpret our data is that motivational signals may not be effectively embedded into cognitive processing in alcohol dependence. It has recently been proposed that, for reward maximization, cognitive control interacts with motivation (Botvinick and Braver, 2015). For instance, cognitive tasks offering monetary gains have shown that motivation can enhance executive processes to achieve efficient goal-directed behavior (e.g. Engelmann et al., 2009). Experimental data suggest that this interplay between motivation and cognition requires robust interactions between the reward/valuation network and the fronto-parietal attentional network (Pessoa, 2008; Pessoa and Engelmann, 2010). In particular, the DLPFC in the latter network appears to bridge cognitive control and value processing by representing both cognitive and motivational (value-based) information (Dixon and Christoff, 2014). A previous report with a subset of our subject group demonstrated an abnormal functional connectivity between these two networks, specifically between the VS and the DLPFC (Park et al., 2010). Therefore, the reduced PE-related activity in the DLPFC of ADP, together with the findings of Park et al. 
(2010), suggests an impaired integration of motivational signals with executive control, with a possible consequence of a decrease in the engagement of cognitive control mechanisms in alcohol dependence.
The left DLPFC activity in ADP showed a decreased neural tracking of negative PEs, which, according to RL theory, facilitate the extinction of a learned response (Schultz, 1998). This attenuated activity in the left DLPFC may contribute to the cognitive rigidity of ADP by delaying the extinction of an action that is no longer paired with a reward when reinforcement contingencies change. Diminished activity in the left DLPFC has also been demonstrated in ADP performing stop signal (Li et al., 2009) and Stroop tasks (Dao-Castellana et al., 1998), which involve extinction of “old” and reconfiguration of “new” stimulus-response associations. Moreover, transcranial magnetic stimulation of the left but not the right DLPFC disrupted the cognitive flexibility of healthy participants (Ko et al., 2008; Smittenaar et al., 2013). Here, it is important to note that the ranges of the negative PEs used in this study were determined by the punishment sensitivity parameter of the DU model estimated for each subject. Therefore, the reduced tracking of these signals suggests a neural substrate in the left DLPFC for the diminished influence of adverse consequences over the actions of ADP. An additional finding was that the right DLPFC of ADP showed a reduced tracking of positive PEs, which may reflect an impairment in initiating the selection of an action that was formerly punishing and became rewarding after a contingency reversal.
Correlating the PE-related activations in the DLPFC with a severity index of alcohol dependence (Alcohol Dependence Scale, ADS) revealed that the diminished encoding of negative PEs in the left DLPFC was more prominent in ADP with high severity scores.
This finding suggests that functional abnormalities in the left DLPFC may contribute to the difficulties that severely affected ADP commonly experience in overriding drug-related behavior and maintaining abstinence. On the other hand, neither the PE-related activity in the DLPFC nor the PE-related activity in the VS of ADP was found to be correlated with the craving scores of ADP (as measured using OCDScraving). This finding is discordant with Deserno et al. (2014), who showed an association between the striatal PE signals and OCDScraving. This discrepancy, which may be due to sample-to-sample variation between the two studies or to methodological differences in behavioral modeling, needs to be clarified by future studies.
One limitation of this study was the magnetic susceptibility artifacts leading to a loss of signal intensity in the orbitofrontal cortex, as this region is located in the vicinity of the sinonasal areas. Future studies should tackle this problem with a more sensitive scanning method. Also, only male participants were recruited to avoid the confounding effects of gender. Future studies with female subjects are of interest, as differences between gender groups in addictive behavior have been noted in several studies (Brady and Randall, 1999; Kosten et al., 1985; Nolen-Hoeksema, 2004). Finally, it is important to bear in mind that our correlational design limits causal inferences. Therefore, longitudinal studies are required to determine whether the alterations in the PE-related DLPFC activity reflect changes in cognitive flexibility due to alcohol dependence or result from preexisting vulnerabilities.
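The role that punishment sensitivity plays in the double-update account discussed in this section can be made concrete with a minimal sketch. It assumes a two-option task with outcomes coded +1 (reward) and −1 (punishment), sensitivity parameters that scale the outcome, a hypothetical outcome for the unchosen option equal to the negated actual outcome, and a softmax with unit temperature. The names `alpha`, `rho_rew` and `rho_pun` and the exact functional form are illustrative assumptions, not the fitted model of this study.

```python
import math


def du_update(q, choice, outcome, alpha, rho_rew, rho_pun):
    """One double-update step for a two-option task.

    q       : [value of option 0, value of option 1]
    choice  : index of the chosen option (0 or 1)
    outcome : +1 for reward, -1 for punishment (assumed coding)
    Returns the updated values and the PE of the chosen option.
    """
    # Assumed form: the outcome is scaled by the sensitivity matching its sign
    r = rho_rew * outcome if outcome > 0 else rho_pun * outcome
    other = 1 - choice
    pe_chosen = r - q[choice]
    # Hypothetical outcome of the unchosen option, assumed anti-correlated
    pe_unchosen = -r - q[other]
    new_q = list(q)
    new_q[choice] += alpha * pe_chosen
    new_q[other] += alpha * pe_unchosen
    return new_q, pe_chosen


def prob_choose_first(q):
    """Softmax probability of choosing option 0 (unit temperature assumed)."""
    return 1.0 / (1.0 + math.exp(-(q[0] - q[1])))


# Same punishment, two hypothetical punishment sensitivities
q_high, pe_high = du_update([0.0, 0.0], 0, -1, alpha=0.5, rho_rew=2.0, rho_pun=1.0)
q_low, pe_low = du_update([0.0, 0.0], 0, -1, alpha=0.5, rho_rew=2.0, rho_pun=0.2)
p_stay_high = prob_choose_first(q_high)
p_stay_low = prob_choose_first(q_low)
```

Under these assumptions, the lower punishment sensitivity yields a smaller negative PE and a post-punishment choice probability closer to 0.5, which corresponds to the looser coupling between punishments and subsequent choices described for ADP.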
Conclusions
In conclusion, our results may contribute to the elucidation of the behavioral mechanisms and their neural correlates involved in impaired decision-making in substance dependence. They may, in particular, help us to understand the cognitive processes underlying the difficulties in overriding previously rewarded, but currently punishing, drug-related actions with non-drug-related ones. There is some evidence that computer-aided cognitive training can treat impaired cognitive processes, improving information processing, verbal and non-verbal memory, attention, and problem-solving (Vinogradov et al., 2012). Moreover, it has been shown with alcohol-dependent individuals that cognitive training can support rehabilitation as part of the traditional treatment (e.g. Fals-Stewart and Lam, 2010; Houben et al., 2011; Rupp et al., 2012). Therefore, it is possible that focused training of adaptation to reversing reinforcement contingencies might be a valuable treatment module for improving clinical outcomes in alcohol dependence, especially in severe cases.