
Asymmetric and adaptive reward coding via normalized reinforcement learning.

Kenway Louie

Abstract

Learning is widely modeled in psychology, neuroscience, and computer science by prediction error-guided reinforcement learning (RL) algorithms. While standard RL assumes linear reward functions, reward-related neural activity is a saturating, nonlinear function of reward; however, the computational and behavioral implications of nonlinear RL are unknown. Here, we show that nonlinear RL incorporating the canonical divisive normalization computation introduces an intrinsic and tunable asymmetry in prediction error coding. At the behavioral level, this asymmetry explains empirical variability in risk preferences typically attributed to asymmetric learning rates. At the neural level, diversity in asymmetries provides a computational mechanism for recently proposed theories of distributional RL, allowing the brain to learn the full probability distribution of future rewards. This behavioral and computational flexibility argues for an incorporation of biologically valid value functions in computational models of learning and decision-making.


Year:  2022        PMID: 35862443      PMCID: PMC9345478          DOI: 10.1371/journal.pcbi.1010350

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.779


Introduction

Reinforcement learning (RL) provides a theoretical framework for how an agent learns about its environment and adopts actions to maximize its cumulative long-term reward [1]. In standard RL models, the values of stimuli or actions are learned via a reward prediction error (RPE), defined as the difference between actual and expected outcomes. Using RPE signals over repeated samples from the environment, learners progressively update their estimates to obtain a measure of the average value (e.g. of an action, a state, or a state-action pair). For example, in its simplest form an RL model updates its value estimate V as:

Vt+1 = Vt + η·δt (Eq 1)

where the incremental update signal comprises a learning rate η and an RPE term δt. The canonical RPE term is the difference between actual reward R and expected reward V, which produces value signals that are a linear function of reward. RL algorithms accurately capture various types of learning in animal and human subjects [2,3], and have been employed in increasingly complex ways to produce powerful artificial intelligence agents [4-6]. Furthermore, the activity of midbrain dopamine neurons closely matches theoretical RPE signals [7-9], suggesting that RL serves as a computational model of neurobiological learning.

Despite their diversity and ubiquity, almost all RL approaches utilize a linear reward function, at odds with evidence for nonlinear reward coding in the brain. A nonlinear relationship between internal subjective value and external objective reward is a longstanding assumption in psychology and economics. For example, risk preferences in choice under uncertainty are consistent with choosers employing a utility function that is a nonlinear function of reward [10,11]. This behavioral nonlinearity is reflected in underlying neural responses, with activity in reward-related brain areas correlating with subjective values rather than objective reward amounts [12-14].
Notably, in contrast to standard RL assumptions, the activity of dopamine neurons also exhibits a nonlinear response consistent with a subtraction between actual and expected reward terms, both of which are sublinear functions of reward amount [15,16]. Together, these results suggest that neurobiological RL mechanisms operate on nonlinear reward representations, but the behavioral and computational implications of nonlinear RL are not well understood.

Here, we develop and characterize an RL algorithm that learns a nonlinear reward function implemented via the divisive normalization computation [17]. Originally proposed to describe nonlinear responses in early visual cortex, normalization has been observed in multiple brain regions, sensory modalities, and species, suggesting that it is a canonical neural computation [18]. In addition to sensory processing, normalization explains neural responses in higher-order cognitive processes including attention, multisensory integration, and decision-making [19-22]. In particular, its role in reward coding makes it an attractive candidate mechanism for nonlinear reward representations in reinforcement learning.

Results

Normalized reinforcement learning model

In contrast to standard RL, we propose a normalized RL algorithm (NRL) that learns a value function nonlinearly related to objective reward. Specifically, NRL (Fig 1A) assumes that objective rewards R are represented by a normalized subjective value function U:

U(Rt) = Rt^n / (σ^n + Rt^n) (Eq 2)

and that learning employs the corresponding normalized prediction error term:

δt = U(Rt) − Vt (Eq 3)

where the exponent and semisaturation parameters n and σ govern the precise form of the nonlinear transformation (see below). Normalization produces a value function that is a saturating function of reward: at small reward magnitudes (R << σ) value grows with reward, while at large reward magnitudes (R >> σ) value approaches an asymptote. Consistent with this intuition, in simulation NRL algorithms learn values that are a saturating nonlinear function of reward (Fig 1B).
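The normalized update scheme can be sketched in a few lines of Python (an illustrative sketch, not the paper's code; parameter values follow the Fig 1 example, σ = 50 and n = 2):

```python
def normalized_value(R, sigma=50.0, n=2.0):
    """Divisively normalized subjective value U(R) = R^n / (sigma^n + R^n)."""
    return R ** n / (sigma ** n + R ** n)

def nrl_step(V, R, eta=0.1, sigma=50.0, n=2.0):
    """One NRL update: the prediction error is taken on the normalized reward."""
    delta = normalized_value(R, sigma, n) - V  # normalized RPE (Eq 3)
    return V + eta * delta

# Learn the value of a repeatedly delivered reward R = 50 A.U.
V = 0.0
for _ in range(500):
    V = nrl_step(V, 50.0)
# V converges to U(50) = 50^2 / (50^2 + 50^2) = 0.5
```

Because the error term is computed on U(R) rather than R, the learned steady-state value is the normalized, saturating function of reward rather than the reward itself.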
Fig 1

Normalized reinforcement learning model.

(a) Comparison of standard reinforcement learning (RL) and normalized reinforcement learning (NRL) models. RL and NRL differ in how external rewards are transformed by the reward coding function f(Rt) prior to learning internal value estimates. Standard RL uses a linear reward function, while NRL uses a divisively normalized representation. (b) Learned value functions under RL and NRL. Left, dynamic value estimates during learning. Right, steady state value estimates. Simulations were performed for each of seven rewards, with additive zero-mean Gaussian noise (learning rate η = 0.1). In contrast to RL, NRL algorithms learn values that are a nonlinear function of external rewards (example NRL simulation parameters: σ = 50, n = 2). (c) Convexity and concavity in NRL value functions. Top, NRL value functions with different exponents (fixed σ = 50 A.U.). Bottom, second derivative of value functions. Dots show inflection points between convex and concave value regimes. (d) Parametric control of NRL value curvature. NRL value function (top) and second derivative (bottom) for different σ values (fixed n = 2).

Unlike standard nonlinear utility functions, normalized value functions can exhibit an inflection point in reward responses that introduces an intrinsic, magnitude-dependent asymmetry in RPEs. The direction and extent of this asymmetry depends on the specific parameterization of the normalized value function. The exponent n in normalization models governs amplification of inputs, and is generally considered a fixed property of the model [17,18]. For n ≤ 1, normalized value functions are always concave; however, for n > 1, normalized value functions are convex at lower rewards and concave at higher rewards (Fig 1C); convex (concave) regimes are evident as positive (negative) regions of the second derivative of the normalized value function.
In the convex regime, increases in R generate larger changes in U(R) than equivalent decreases in R; this predicts asymmetric RPE responses biased towards outcomes better than expected. In contrast, in the concave regime, decreases in R generate larger U(R) changes and RPEs are biased towards outcomes worse than expected. Notably, both theoretical and empirical considerations support normalization exponents consistent with inflection points: input squaring (n = 2) was used to model threshold linear responses in spiking activity in the original normalization equation [17], and fits of n to neural data typically yield values between 1.0 and 3.5 (average value of 2) [18,23]. Thus, asymmetric RPE responses are likely to arise if reward learning relies on normalized value coding. Critically, the type of NRL RPE asymmetry is parametrically tunable and depends on the relationship between rewards and the semisaturation term σ. Value function convexity and resulting positive-biased RPE asymmetry arise for rewards less than σ, while value concavity and negative-biased RPE asymmetry arise for rewards greater than σ. When σ is varied, the reward amounts generating value concavity and convexity shift accordingly (Fig 1D; see S1 Appendix). As a result, asymmetric prediction errors around a given reward magnitude can be either negatively or positively biased, depending on the magnitude of the semisaturation term (Fig 2). Thus, the degree and direction of RPE asymmetry is parametrically controlled in the NRL algorithm; we next examine how variability in this asymmetry can generate variability in risk preferences (across individuals) and in reward learning (across information processing channels).
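The sign of this asymmetry can be checked directly from the curvature of U (an illustrative sketch; the ±5 A.U. probe size and the σ = 50, n = 2 parameters are assumptions chosen to echo the Fig 1 example):

```python
def U(R, sigma=50.0, n=2.0):
    """Normalized value function U(R) = R^n / (sigma^n + R^n)."""
    return R ** n / (sigma ** n + R ** n)

def rpe_asymmetry(R, d=5.0, sigma=50.0, n=2.0):
    """Positive-minus-negative RPE magnitude for equal-sized reward deviations d."""
    gain = U(R + d, sigma, n) - U(R, sigma, n)  # outcome better than expected
    loss = U(R, sigma, n) - U(R - d, sigma, n)  # outcome worse than expected
    return gain - loss

# Convex regime (R below the inflection point): positive RPEs dominate
print(rpe_asymmetry(20.0) > 0)   # True
# Concave regime (R above the inflection point): negative RPEs dominate
print(rpe_asymmetry(80.0) < 0)   # True
```

For n = 2 the inflection point sits at R = σ/√3, so equal-sized gains and losses around a reward below that point produce a net positive bias, and a net negative bias above it.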
Fig 2

Parametric control of prediction error asymmetry.

(a) Examples of variable reward prediction error (RPE) asymmetry. Each panel shows NRL responses for reward inputs (R = 50 A.U.) with uniformly distributed noise. Lines show piecewise linear regression fits for negative (red) and positive (green) reward errors. (b) The NRL semisaturation term governs the degree and direction of RPE asymmetry. NRL RPE asymmetry is biased towards negative RPEs at low σ and positive RPEs at high σ.


Variability in risk preference

RPE asymmetry predicts that reinforcement learners will differentially weight outcomes that are better or worse than expected, consistent with behavior in multiple empirical studies [24-27]. However, previous studies, assuming linear reward coding, attributed this asymmetry to different learning rates for positive versus negative RPEs. For example, variable risk preferences in choice under uncertainty can be captured by standard RL models with valence-dependent learning rates [28]:

Vt+1 = Vt + η+·δt if δt ≥ 0; Vt+1 = Vt + η−·δt if δt < 0 (Eq 4)

For choices between a certain option and an uncertain option with the same mean outcome, these different learning rates implement a risk-sensitive RL process. If η− > η+, the model will learn an uncertain option value lower than its mean nominal outcome and exhibit risk aversion; in contrast, if η− < η+, the model will overestimate uncertain options and exhibit risk-seeking behavior. RL models with differential learning rates have been increasingly influential and examined in the context of their adaptive properties [29,30], role in cognitive biases [27,31], and potential neurobiological substrates [24,25,32].

Here, we show that the NRL model can generate variable risk preferences without requiring different learning rates. We simulated NRL behavior in a dynamic learning and choice task, in which the values of certain and risky options had to be learned via outcomes [26]. Given the relationship between σ and RPE asymmetry, the NRL model can generate either risk-averse or risk-seeking behavior (Fig 3A, blue lines; example risk averse σ = 20, example risk seeking σ = 60). Across a population of simulated subjects (N = 50), NRL agents generated a range of behavioral risk aversion levels that are strongly correlated with σ (r = -0.889; p = 7.08×10^-18).
However, if a linear reward function is wrongly assumed, RL models with differential learning rates (Eq 4) can accurately fit NRL-generated behavior (Fig 3A, black lines); furthermore, these data will appear to support differential learning rates linked to risk preference (Fig 3A, right). Intuitively, these apparent learning rates arise because fitting with the standard RL model (Eq 4) and two valence-dependent learning rates approximates bipartite linear regression on the nonlinear value function. In other words, the different curvature of the NRL value function for negative and positive RPEs (as seen in examples in Fig 2A) will be captured (incorrectly) as differential linear modulation by negative and positive learning rates. As a result, NRL-generated data will demonstrate a consistent relationship between apparent learning rate asymmetry and behavioral risk aversion (Fig 3B). However, this relationship is driven by the strong relationship between risk preference and the generating NRL model σ (Fig 3C). Rather than the specific relationship between NRL parameterization and risk behavior, we emphasize that these results highlight the general finding that variability in risk aversion and implied learning rates can be generated by changes in the nonlinear NRL value function.
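Because the learned value of an option converges to the mean normalized value of its outcomes, the risk preference implied by a given σ can be read off in closed form (a sketch using the certain/risky options of the Fig 3 task; the steady-state shortcut replaces trial-by-trial simulation and is an assumption of this illustration):

```python
def U(R, sigma, n=2.0):
    """Normalized value function."""
    return R ** n / (sigma ** n + R ** n)

def steady_state_values(sigma):
    """Certain option: 20 A.U. for sure. Risky option: 50% chance of 0 or 40 A.U."""
    v_certain = U(20.0, sigma)
    v_risky = 0.5 * U(0.0, sigma) + 0.5 * U(40.0, sigma)  # mean of U over outcomes
    return v_certain, v_risky

vc, vr = steady_state_values(20.0)
print(vc > vr)  # True: sigma = 20 agent undervalues the gamble (risk averse)
vc, vr = steady_state_values(60.0)
print(vc < vr)  # True: sigma = 60 agent overvalues the gamble (risk seeking)
```

With σ = 20 the reward range falls in the concave regime of U, so the gamble's normalized mean (0.4) lies below the certain option's value (0.5); with σ = 60 the range is convex and the ordering reverses, with no change in learning rates.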
Fig 3

NRL RPE asymmetry governs the degree of risk preference in reward learning.

(a) Examples of risk averse and risk seeking NRL agent behavior. Left, behavior in a task involving choices between a certain (100% chance of 20 A.U.) and a risky (50% chance of 0 or 40 A.U.) option. Blue lines, behavior of the generative NRL agent. Black lines, behavior of best fitting linear RL model with asymmetric learning rates. Right, apparent learning rates for negative and positive RPEs in linear RL model. Example risk averse σ = 20 (top); example risk seeking σ = 60 (bottom). (b) Apparent relationship between risk preference and asymmetric learning rates under assumption of linear reward coding. Behavioral risk aversion (percent choice of certain option) and learning rate asymmetry (η+ − η−)/(η+ + η−) defined as in previous work [26]. (c) Risk preference depends on RPE asymmetry in generative NRL model. Degree of behavioral risk aversion controlled by NRL semisaturation parameter.


A computational mechanism for distributional RL

Asymmetric NRL prediction errors also provide a computational mechanism for recent theories of distributional reinforcement learning. In contrast to standard RL approaches, where models learn a single scalar quantity representing the mean reward, distributional RL models learn about the full distribution of possible rewards [33,34]. The critical difference in distributional approaches is a diversity of RPE channels, with differing degrees of optimism about outcomes, which learn varying predictions about future reward. While most theoretical work relies on learning rate differences to produce RPE asymmetries, a recent report shows that dopamine neurons themselves exhibit sufficient characteristics to support distributional RL [35]: (1) dopamine neurons show a diversity of reversal points (reward magnitude where prediction errors switch from negative to positive), (2) dopamine neurons differ in their relative weighting of positive and negative RPEs, (3) this asymmetry in RPE weighting correlates with reversal point across neurons, and (4) the diversity in reversal points and asymmetries support a decoding of reward distributions. Notably, these findings show that distributional learning arises from asymmetries in RPE coding rather than in downstream learning rates, but do not address how such asymmetry might arise. Here, we show that—given parametric diversity in individual dopamine neurons—NRL provides a computational mechanism for the neurophysiological characteristics supporting distributional RL. In contrast to standard RL, distributional RL posits that different RPE channels learn different value predictions in the identical reward environment. In a variable outcome environment [15,16], this predicts that individual dopamine neurons will exhibit different reversal points. Recent work shows that empirical dopamine responses exhibit this hypothesized variability in reversal points, driven by asymmetric scaling of negative and positive RPE responses [35]. 
Given that such asymmetric RPE responses are intrinsic to NRL, we examined the relationship between different NRL agents with varying σ parameters and their reversal points in a stochastic outcome environment replicating the previously reported experiment [15,16]. Specifically, we simulated NRL responses to single rewards drawn from a distribution of seven different reward magnitudes (0.1, 0.3, 1.2, 2.5, 5, 10, or 20 A.U.); each possible reward outcome was drawn with an equal probability, and we examined steady state NRL RPE responses across outcomes. We find that NRL agents simulated in such a stochastic reward environment exhibit a diversity of reversal points (examples, Fig 4A) directly related to the individual NRL channel σ parameter (Fig 4B).
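Because the steady-state value is the mean of U over the sampled rewards, the reversal point of each NRL channel (the reward where the RPE changes sign) has a closed form, R* = σ·(V/(1−V))^(1/n). A sketch with the seven-reward environment above (equal outcome probabilities and n = 2, as in the simulations):

```python
REWARDS = (0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0)  # A.U., equiprobable outcomes

def U(R, sigma, n=2.0):
    """Normalized value function."""
    return R ** n / (sigma ** n + R ** n)

def reversal_point(sigma, rewards=REWARDS, n=2.0):
    """Reward at which the steady-state NRL prediction error U(R) - V crosses zero."""
    V = sum(U(r, sigma, n) for r in rewards) / len(rewards)  # learned mean value
    return sigma * (V / (1.0 - V)) ** (1.0 / n)              # invert U at V

# Channels with larger sigma learn higher ('more optimistic') reversal points
print(reversal_point(2.0) < reversal_point(20.0))  # True
```

Sweeping σ across a population of channels therefore yields the diversity of reversal points shown in Fig 4B.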
Fig 4

NRL RPE asymmetry provides a computational basis for distributional reinforcement learning.

(a) Variable NRL RPE response asymmetries in a probabilistic reward environment. Examples show NRL agents with stronger negative (blue) and positive (red) RPE asymmetry. Note that these two agents exhibit different reversal points in the same reward environment (rewards = {0.1, 0.3, 1.2, 2.5, 5, 10, 20 μl}, as in previous work [35]). Triangles denote the true average reward (black) and estimated average reward learned by pessimistic (blue) and optimistic (red) NRL agents. (b) Learned reversal points vary systematically with NRL parameterization. RPE responses and reversal points quantified for varying σ parameters. (c) Reversal points depend on NRL RPE asymmetry. Plots show NRL responses normalized by negative RPE slope and aligned to individual reversal points. As in empirical dopamine data, low (high) reversal points arise from stronger negative (positive) RPE asymmetry. (d) NRL asymmetry and learning match empirical dopamine data. Blue, dopamine neurons recorded in stochastic reward environment [35]; black, heterogeneous NRL agents in identical reward environment. Asymmetry is defined as in previous work as a function of positive (α+) and negative (α-) RPE coding slopes. (e) A population of NRL agents learns the distribution of experienced rewards. 40 NRL agents were simulated in four different reward environments: symmetric, right-skewed, left-skewed, and multimodal. Each panel plots the ground truth (gray) and decoded (blue) probability densities, with samples smoothed by kernel density estimation. Distribution decoding was performed via an imputation strategy, treating the NRL reversal points and response asymmetries as expectiles.

Moreover, the intrinsic RPE asymmetry of NRL model units replicates key aspects of recorded dopamine neuron activity. First, NRL model units display a diversity of RPE biases, ranging from pessimistic (stronger negative RPE responses) to optimistic (stronger positive RPE responses) (Fig 4C).
Second, the degree of RPE asymmetry is directly related to the average expected value learned in the simulated environment (as quantified by reversal points; Fig 4D); this relationship mirrors that reported for recorded dopamine neurons. Finally, as a test of the ability of NRL to support distributional decoding, we examined whether a population of diverse NRL agents carries the necessary information to decode the distribution of previously experienced rewards (Fig 4E). Specifically, as done in recent work on empirical dopaminergic responses, we assumed that NRL reversal points and response asymmetries, learned in response to different reward environments, define a set of expectiles, and we transformed these expectiles into a probability density [35]. Following reward learning in a small population of NRL agents (n = 40, matched to empirical data), the probability density of experienced rewards in different environments, including symmetric, asymmetric, or multimodal reward distributions, can be decoded from NRL responses (see Materials and Methods and S1 Appendix). Thus, NRL agents replicate the key empirical features seen in dopamine neurons: a diversity of reversal points, variable RPE asymmetries, a strong reversal point-asymmetry relationship, and an encoding of the statistical distribution of experienced rewards. Together, these findings suggest that the NRL algorithm provides a robust computational mechanism for distributional RL.
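One common asymmetry statistic, τ = α+/(α+ + α−), can be approximated by regressing steady-state RPEs through the reversal point separately for outcomes above and below it (an illustrative sketch; the least-squares slopes through the origin are an assumption of this example, not the paper's exact fitting procedure):

```python
REWARDS = (0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0)  # A.U., equiprobable outcomes

def U(R, sigma, n=2.0):
    """Normalized value function."""
    return R ** n / (sigma ** n + R ** n)

def asymmetry_tau(sigma, rewards=REWARDS, n=2.0):
    """tau = alpha+ / (alpha+ + alpha-) from piecewise RPE slopes around reversal."""
    V = sum(U(r, sigma, n) for r in rewards) / len(rewards)
    r_star = sigma * (V / (1.0 - V)) ** (1.0 / n)  # reversal point

    def slope(points):  # least-squares slope through the origin
        return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

    pos = [(r - r_star, U(r, sigma, n) - V) for r in rewards if r > r_star]
    neg = [(r - r_star, U(r, sigma, n) - V) for r in rewards if r < r_star]
    return slope(pos) / (slope(pos) + slope(neg))

# More optimistic channels (higher sigma, higher reversal point) have larger tau
print(asymmetry_tau(2.0) < asymmetry_tau(20.0))  # True
```

The positive correlation between τ and reversal point across channels is the reversal point-asymmetry relationship that the empirical comparison (Fig 4D) relies on.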

Alternative parameterizations and biological plausibility

While for simplicity we parameterize NRL heterogeneity above via the semisaturation term σ, equivalent variability in RPE asymmetry can be generated by differential weighting of reward inputs:

U(Rt) = (w·Rt)^n / (σ^n + (w·Rt)^n) (Eq 5)

Because an increase in input weighting is equivalent to a decrease in effective σ (S1 Appendix), the w term also controls RPE asymmetry: positive (negative) RPE biases occur for the same reward R given sufficiently small (large) w weights. In this alternative formulation, neural population diversity in RPE asymmetry arises simply from variable weighting of reward inputs. Such variable input weighting is consistent with evidence for heterogeneity in both synaptic physiology and inputs to dopaminergic neurons [36]. Furthermore, differences in input weighting offer a more biologically plausible source of heterogeneity than the semisaturation term, which is typically considered a shared network property (i.e. baseline activity in a normalization pool) in circuit models of normalization [17,18]. For example, diversity in RPE asymmetry can be implemented simply by heterogeneity in the strength or number of reward-coding inputs to midbrain dopaminergic neurons.

Importantly, this alternative formulation also preserves the ability of the NRL model to capture adaptation, analogous to a reference point in history-dependent models of context-dependent decision-making. In sensory neuroscience, adaptation in neural responses is typically implemented in normalization approaches via a history-dependent σ term that accounts for recent past stimuli. For example, normalization models with a dynamic σ term that averages past contrast levels explain adapting responses in visual cortical neurons [17] and have been shown to improve efficient coding via redundancy reduction [37]. Furthermore, the adaptation of human valuation behavior is captured by a normalization mechanism with an equivalent σ averaging past rewards [38].
Thus, by separating how the NRL algorithm models RPE asymmetry and reward history, this alternative parameterization can be used to examine neural and behavioral adaptation effects during reward learning.
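The equivalence between input weighting and an effective semisaturation term is easy to verify numerically (a sketch of the identity noted above: scaling inputs by w is the same as dividing σ by w):

```python
def U(R, sigma, n=2.0):
    """Standard normalized value function."""
    return R ** n / (sigma ** n + R ** n)

def U_weighted(R, w, sigma, n=2.0):
    """Input-weighted variant: (wR)^n / (sigma^n + (wR)^n)."""
    return (w * R) ** n / (sigma ** n + (w * R) ** n)

# Weighting rewards by w = 2 with sigma = 50 matches an effective sigma of 25
print(abs(U_weighted(30.0, 2.0, 50.0) - U(30.0, 25.0)) < 1e-12)  # True
```

This is why heterogeneity in w across channels reproduces the same diversity of RPE asymmetries as heterogeneity in σ, while leaving σ free to track reward history.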

Discussion

Standard RL algorithms assume linear reward representations, but the brain represents objective rewards in a subjectively nonlinear manner; here we show that a nonlinear RL algorithm, employing the canonical divisive normalization computation, captures diverse neural and behavioral features of reward learning and decision-making. While the use of nonlinear reward transformations in RL is not novel, normalization generates both convex and concave regimes and, as a result, asymmetries in RPE responses for negative and positive prediction errors. In addition to matching empirical observations of saturating, nonlinear reward representations, the NRL model explains aspects of observed behavioral and neural data including variable risk preferences and asymmetric RPE responses thought to support distributional RL.

In its fullest form (Eq 5), the NRL algorithm is governed by three parameters with specific neurobiological interpretations and different functional implications. The exponent n governs input amplification, and is generally viewed as a fixed property of normalization computations. The input weighting term w implements parametric diversity in RPE asymmetries, which underlies the behavioral and neural diversity generated by the NRL model. Though a novel feature of the model, input weighting diversity may be simply realized as heterogeneity in synaptic strength or number of afferents. Finally, as in standard models of divisive normalization, a dynamic semisaturation term σ can implement reference-dependent valuation by adjusting the normalized value response to recent rewards. The precise mechanism by which σ adapts is not known, but possibilities include changes in pooled inputs contributing to normalization [17], long timescale nonlinear dynamics in divisive gain control [39,40], and RL-like computations learning average reward.
Regardless of mechanism, an adaptive σ allows the NRL model to flexibly encode value in changing reward environments, adjusting reward coding relative to a reference point implemented as the σ parameter.

Our results show that intrinsic asymmetries in reward learning can arise from subtleties in RPE coding rather than in downstream learning rates. However, these results do not preclude coexisting asymmetries in both systems. Both variability in dopaminergic genes and pharmacologic dopamine modulation affect reward learning in a valence-dependent manner, consistent with differential responses in striatal D1 and D2 receptors to positive versus negative RPEs [24,25]. Such downstream differences would drive differential weighting of prediction errors carried by dopaminergic inputs. On the other hand, increasing evidence shows valence-dependent differences in dopamine neuron responses [35,41], arguing for asymmetries in RPE coding itself. We suggest that NRL provides a computational mechanism to explain such asymmetries at the level of dopamine RPE representation, but valence-dependent biases in both prediction error coding and downstream processing likely play a role in biological learning.

While distributional RL has been proposed and implemented in computational algorithms, only recently has evidence arisen that dopamine neurons exhibit the necessary reward asymmetries [35]. Specifically, individual dopamine neurons differentially weight positive versus negative RPEs, and this relative valence-dependent weighting varies across neurons. However, how valence-dependent RPE coding arises is unknown. We show here that an RL system with a biologically inspired normalized value function reproduces heterogeneous RPE asymmetries. In the NRL algorithm, asymmetry arises from the intrinsic curvature changes in the normalized reward function, and structured diversity in this asymmetry arises from parametric differences in normalized reward coding.
Importantly, both asymmetry and structured diversity have plausible biological sources: normalization is produced by a number of mechanisms including feedforward inhibition, feedback inhibition, and synaptic depression [18], and asymmetry diversity requires only heterogeneity in input weighting (e.g. synaptic strength). Thus, RPE asymmetry in NRL arises solely from the reward processing circuit, and does not require differential weighting of separate sources of negative and positive RPE information. Variability in RPE asymmetries is crucial to theories of distributional learning based on expectile regression, and empirical dopamine RPE asymmetries carry sufficient information to decode distributional information about experienced rewards via expectile-based decoding methods. Our results show that a population of NRL agents can learn and encode sufficient information to allow decoding of experienced reward distributions, at a level comparable to decoding from empirical responses [35]. However, while this shows that distributional information exists in both empirical and NRL responses, it is unclear whether expectile-based decoding is biologically plausible or employed by the brain. Alternative distributional codes have been suggested to be more computationally straightforward [42]; interestingly, these codes rely on variable nonlinear reward functions closely related to NRL responses, suggesting that the role of NRL in distributional learning may extend beyond expectile-based approaches. Further work is needed to verify distributional reward coding in downstream brain areas, identify the biological decoding algorithm, and test the contribution of NRL asymmetries to distributional learning.

Beyond capturing variability in risk preference and RPE asymmetry, the NRL model makes a number of further predictions about reward-guided behavior and neural activity.
At the behavioral level, in addition to the across-subject variability shown here, the sigmoidal shape of the NRL value function predicts within-subject changes in risk preference. Under normalized value coding, the local curvature of the reward function depends on the relationship between rewards and the semisaturation term σ. Specifically, NRL predicts that individual risk preference should be magnitude dependent, with increasing risk aversion at larger outcomes; such outcome-dependent changes are consistent with some behavioral evidence [43,44], but remain to be tested in strict reinforcement learning scenarios.

At the neural level, NRL predicts an adaptive flexibility in individual (and population) dopamine neuron asymmetries. Unlike in other theories [35], RPE bias in a given NRL agent depends on the reward magnitude (relative to σ). Thus, while a population of NRL agents should retain their relative ranking of asymmetries in different environments, absolute asymmetries will change depending on the experienced rewards, an effect that may confer an advantageous adaptability. More broadly, when RPE asymmetry diversity is parameterized by input weights (Eq 4), the NRL algorithm can incorporate past reward information via a history-dependent σ term. This suggests that NRL responses should capture contextual phenomena such as adaptive coding of reward values [45] and adaptation in risk preferences [46].

Finally, context-dependent valuation has been a recent area of interest in psychology, economics, and neuroscience, and it is important to consider how the NRL model relates to existing models of context-dependent valuation and choice. In its most detailed implementation (Eq 5), NRL provides parametric control over both contextual adaptation (via σ) and RPE asymmetry (via w), implementing a reference-dependent S-shaped value function similar to several previously proposed models.
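The predicted magnitude dependence follows from the curvature of the normalized value function. As a brief sketch (an illustration assuming Eq 2 takes the standard divisive normalization form V(R) = R^n/(σ^n + R^n), with n = 2 as in the simulations):

```latex
% Assumed form of the normalized reward function (Eq 2), with n = 2:
\[
  V(R) = \frac{R^2}{\sigma^2 + R^2},
  \qquad
  V''(R) = \frac{2\sigma^2\,\bigl(\sigma^2 - 3R^2\bigr)}{(\sigma^2 + R^2)^3}.
\]
% V'' changes sign at R = sigma / sqrt(3): the function is convex
% (favoring risk seeking) below this point and concave (favoring risk
% aversion) above it, so risk attitude depends on reward magnitude
% relative to sigma.
```

Under this form, larger outcomes relative to σ push an agent onto the concave limb, consistent with the prediction of increasing risk aversion at larger outcomes.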
Most broadly, prospect theory and related models propose an asymmetric value function that is convex in the region of losses and concave in the region of gains, with gains and losses determined relative to a reference point [47,48]. More recently, similar value functions arise in models derived from efficient coding principles in response to processing constraints [49] or designed to capture adaptation effects in choice behavior [50]. While the NRL value function also exhibits an S-shaped reference-dependence, it extends previous work in a number of ways. First, it applies nonlinear valuation to reinforcement learning, where linear reward functions are widely and conventionally employed; in contrast to prior behavioral theories, the NRL framework makes direct and testable predictions about both behavior and dopaminergic neural activity. Second, the normalized value function has a clear biological grounding: it is defined from neurophysiological reward responses [15,16,21,51], based on a canonical neural computation [18], and implementable in simple neural circuits [39,40,52]. Third, the NRL algorithm incorporates a novel parametric control of the degree and direction of asymmetric responding to negative and positive RPEs; diversity in NRL RPE asymmetry provides a unitary explanation for diversity in both behavior and neural responses. Finally, the combination of adaptation and distributional coding suggests the possibility of rich contextual reward learning, offering a single biological mechanism for diverse phenomena currently explained by separate reference-point-centering [53], range-adaptation [54], and range-frequency [55] models.

In summary, we present a model of reinforcement learning that incorporates a biologically relevant nonlinear reward function implemented by divisive normalization.
Normalized value coding introduces a parametrically tunable valence-based bias in prediction errors, and structured diversity in this bias captures both variable risk preferences across individuals and variable prediction error weighting across neurons. Together, these findings reconcile empirical and theoretical aspects of reinforcement learning, support the robustness of normalization-based value coding, and argue for the incorporation of biologically valid value representations into computational models of reward learning and choice behavior.

Materials and methods

NRL model

The NRL model applies a divisive normalization transform to experienced rewards (Eq 2), parameterized by an exponent n and a semisaturation term σ. An analytical characterization of NRL curvature is provided in the S1 Appendix. For all simulations (other than Fig 1c), we fixed n = 2. For simplicity, we implemented single-state RL models that update value estimates with the product of the RPE and the learning rate η (Eq 3); however, the NRL framework can be applied in more complex models that incorporate features such as action policy updating and temporal difference learning [1]. To examine RPE asymmetry, we quantified the responses of different NRL agents (parameterized by varying σ; η = 0.1) to a fixed reward signal (R = 50 A.U.) corrupted with uniform noise (-40 to 40 A.U.). The degree of RPE asymmetry was quantified by piecewise linear regression of responses to negative and positive reward errors. Changing the parameters of the reward noise did not affect the qualitative finding of variable RPE asymmetries (S1 Appendix). All simulations and analyses were performed in MATLAB (R2015b).
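The single-state simulation described above can be sketched in a few lines. This is a minimal Python illustration, not the published MATLAB code; it assumes Eq 2 takes the standard divisive normalization form R^n/(σ^n + R^n) and summarizes the two piecewise regression slopes as a single asymmetry index τ = slope+/(slope+ + slope−):

```python
import numpy as np

def nrl_value(r, sigma, n=2):
    """Normalized reward (assumed form of Eq 2): r^n / (sigma^n + r^n)."""
    return r**n / (sigma**n + r**n)

def rpe_asymmetry(sigma, eta=0.1, n_trials=20000, seed=0):
    """Delta-rule learning (Eq 3) on a noisy reward (50 +/- 40 A.U.),
    then piecewise-linear slopes of the RPE for negative vs. positive
    reward errors, summarized as tau = slope+ / (slope+ + slope-)."""
    rng = np.random.default_rng(seed)
    rewards = 50.0 + rng.uniform(-40.0, 40.0, n_trials)
    v, errs, rpes = 0.0, [], []
    for r in rewards:
        rpe = nrl_value(r, sigma) - v   # prediction error on normalized reward
        v += eta * rpe                  # incremental value update
        errs.append(r - 50.0)
        rpes.append(rpe)
    errs = np.array(errs[1000:])        # drop transient before V converges
    rpes = np.array(rpes[1000:])
    slope_pos = np.polyfit(errs[errs > 0], rpes[errs > 0], 1)[0]
    slope_neg = np.polyfit(errs[errs < 0], rpes[errs < 0], 1)[0]
    return slope_pos / (slope_pos + slope_neg)

# Rewards on the concave limb (small sigma) weight negative errors more
# (tau < 0.5); rewards on the convex limb (large sigma) weight positive
# errors more (tau > 0.5).
print(rpe_asymmetry(sigma=20.0), rpe_asymmetry(sigma=200.0))
```

Varying σ across agents in this sketch reproduces the qualitative result: the asymmetry index shifts from pessimistic (τ < 0.5) to optimistic (τ > 0.5) as σ grows relative to the experienced rewards.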

Risk-dependent decision-making

To examine risk preferences in the NRL model, we simulated agent behavior in a learning and choice task previously used to examine human subject behavior [26]. NRL agents were generated with random σ parameters in the range [10, 80]. In this task, each NRL agent chose between a certain option (100% chance of 20 A.U. reward) and a risky option (50% chance of 0 or 40 A.U. reward); initial values for both options were set to 0. In a given trial, choice was determined via a softmax function of estimated option values; to achieve a comparable level of choice stochasticity across agents, the softmax temperature was inversely scaled with the σ parameter. When chosen, the value of an option was updated based on the received outcome according to Eq 3 (η = 0.1). For each NRL agent, we simulated behavior for 1000 trials. To examine how NRL agent behavior appears if linear reward functions are assumed, we fit NRL-generated behavior with an RL model with valence-dependent learning rates (η+ and η−; Eq 4) and quantified the apparent learning rate asymmetry (η+ − η−)/(η+ + η−).
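To illustrate why σ controls apparent risk attitude in this task, the steady-state NRL values of the two options can be compared directly, and the trial-by-trial task simulated with a softmax choice rule. This is a hedged Python sketch rather than the published code: the Eq 2 form and the specific inverse-temperature scaling (β = σ) are assumptions for illustration only:

```python
import numpy as np

def nrl_value(r, sigma, n=2):
    # Assumed Eq 2 form: r^n / (sigma^n + r^n)
    return r**n / (sigma**n + r**n)

def risk_premium(sigma):
    """Steady-state NRL values: certain 20 A.U. vs. a 50/50 gamble over
    0 and 40 A.U.  Positive -> the agent should look risk averse."""
    v_safe = nrl_value(20.0, sigma)
    v_risky = 0.5 * (nrl_value(0.0, sigma) + nrl_value(40.0, sigma))
    return v_safe - v_risky

def simulate_task(sigma, n_trials=1000, eta=0.1, seed=0):
    """Trial-by-trial simulation with softmax choice.  beta = sigma is a
    hypothetical stand-in for the paper's 'temperature inversely scaled
    with sigma'; returns the fraction of risky choices."""
    rng = np.random.default_rng(seed)
    v = np.zeros(2)                       # [certain, risky] value estimates
    n_risky = 0
    for _ in range(n_trials):
        p_risky = 1.0 / (1.0 + np.exp(-sigma * (v[1] - v[0])))
        c = int(rng.random() < p_risky)   # 1 = risky option chosen
        r = 20.0 if c == 0 else rng.choice([0.0, 40.0])
        v[c] += eta * (nrl_value(r, sigma) - v[c])   # Eq 3 update
        n_risky += c
    return n_risky / n_trials

# Concave coding (small sigma) favors the certain option; convex coding
# (large sigma) favors the gamble.
print(risk_premium(10.0) > 0, risk_premium(80.0) < 0)
```

With both rewards on the concave limb (small σ), the certain option's normalized value exceeds the gamble's average normalized value, so a standard RL fit attributes the behavior to a pessimistic learning-rate asymmetry; the ordering reverses on the convex limb.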

Distributional RL

To examine whether information about experienced reward distributions was encoded in learned NRL responses, we quantified NRL agent behavior in different reward environments. Rewards were drawn from symmetric, left-skewed, and right-skewed distributions; in addition, we examined an environment with seven equiprobable rewards (0.1, 0.3, 1.2, 2.5, 5, 10, or 20 A.U.), replicating the conditions under which empirical dopamine neurons exhibit RPE asymmetries [15,16,35]. To facilitate comparison to empirical decoding performance, we examined 40 NRL agents with a diversity of semisaturation parameters (0.5 to 48 A.U.). For each environment, analytical steady-state NRL functions and reversal points were calculated for each NRL agent (S1 Appendix); similar results were obtained when NRL agents learned via sampling. Following identification of the reversal point, we estimated the RPE response asymmetry of each agent via separate linear regressions for negative and positive RPE responses around the reversal point. Given a reversal point and RPE asymmetry τ for each NRL agent denoted by index n (see S1 Appendix), distribution decoding was performed as previously described for empirical dopamine data [35]. Briefly, these data were interpreted as expectiles, with the reversal point of agent n taken as the value of the τ-th expectile. Decoding consisted of an imputation method to find a probability density that best matched the set of expectiles. As previously described, the density was parameterized as a set of 100 reward samples and optimization was performed to minimize a loss function relating reversal points, asymmetries, and reward sample locations; see S1 Appendix for details.
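The expectile interpretation underlying the decoding step can be illustrated with a short sketch. This Python fragment (an illustration, not the published decoding code) computes the τ-th expectile of a sample by iterating the first-order condition of the asymmetric least-squares loss, and shows that a skewed reward distribution yields asymmetrically spread expectiles, which is the distributional information the imputation procedure exploits:

```python
import numpy as np

def expectile(x, tau, n_iter=200):
    """tau-th expectile of samples x, via fixed-point iteration on the
    first-order condition of the asymmetric least-squares loss
    sum_i w_i (x_i - e)^2, where w_i = tau if x_i > e else (1 - tau)."""
    e = np.mean(x)
    for _ in range(n_iter):
        w = np.where(x > e, tau, 1.0 - tau)
        e = np.sum(w * x) / np.sum(w)   # weighted mean = stationary point
    return e

rng = np.random.default_rng(0)
sym = rng.normal(1.0, 1.0, 100_000)       # symmetric reward samples
skew = rng.exponential(1.0, 100_000)      # right-skewed reward samples

# tau = 0.5 recovers the mean; for a right-skewed distribution the upper
# expectiles sit farther from the mean than the lower ones do, whereas a
# symmetric distribution yields a symmetric spread.
for x in (sym, skew):
    m = x.mean()
    print(expectile(x, 0.5) - m,
          (expectile(x, 0.9) - m) - (m - expectile(x, 0.1)))
```

In the decoding procedure, each agent contributes one (τ, reversal point) pair, and a set of candidate reward samples is optimized so that its expectiles match those pairs.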

Code availability

MATLAB code used for simulation and analysis of the NRL model is available at https://osf.io/e6t5z/.

S1 Appendix: Analytical derivations and decoding methods.

(DOCX)

Peer review history

Decision letter (26 Mar 2022)

Dear Dr. Louie,

Thank you very much for submitting your manuscript "Asymmetric and adaptive reward coding via normalized reinforcement learning" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments.

I enjoyed reading the paper, and the reviewer comments are pretty straightforward. One of the main questions is comparison to other models. You don't need to go on a wild goose chase exploring many different non-linear models, but it's important to convey to the reader more clearly what the space of possibilities is and why this particular model might be a good choice over other possibilities.

Sincerely,
Samuel J. Gershman
Deputy Editor, PLOS Computational Biology

Reviewer #1 (Francesco Rigoli):

Thanks for the opportunity to review the paper titled "Asymmetric and adaptive reward coding via normalized reinforcement learning". The paper proposes a divisive normalization model of subjective value to explain important processes characterizing reinforcement learning. Simulations aim at demonstrating that the nonlinear value function proposed by the model can explain (i) asymmetries between positive and negative learning rates and (ii) data about dopaminergic neurons. Overall, the paper is clearly written, and its contribution is timely and valuable.

Major points:

1. The author should clarify the specific contribution of the paper to the literature. The idea of proposing a reference-dependent sigmoid function (albeit in a different form) to describe subjective value is not novel (e.g., Woodford, 2012; Rigoli, 2019). The author should discuss this literature and clarify what the present paper adds to it. In my opinion, while previous literature adopting similar sigmoid models focuses on reference effects, the novel contribution of the paper is to explore the implications of a sigmoid value function in the context of RL, showing how this can explain (i) reported asymmetries in learning rate and (ii) recent data about dopamine function. Does the author agree with this interpretation? In any case, it would be helpful to clarify the precise contribution of the paper.

2. Although the paper hints at reference effects in shaping the value function (also consistent with the literature on normalization mentioned in the introduction), the nature of these effects is not explained. Specifically, the paper is silent about where the parameters σ and n come from. Do they depend on properties of the distribution of rewards (e.g., σ being an average and n an index of variability)? An in-depth analysis of this aspect is not necessary here, as the main points of the paper emerge clearly enough as it is, but a short clarification of this important aspect of the model would be beneficial. Relatedly, given that the focus is on RL, the discussion might briefly speculate on how the parameters are themselves acquired via learning.

3. The simulation described in Fig 3 appears crucial, as it reflects one of the main contributions of the paper (explaining asymmetries in learning rates as arising from nonlinear value functions). However, I am not sure how robust the result that σ inversely correlates with risk aversion is. Does this also depend on the range of rewards simulated? I suggest varying the reward distribution (keeping the parameters fixed) in different simulations and showing what happens. In general, a few other simulations on this point (e.g., based on empirical studies conducted by Palminteri's lab or Frank's lab) would make the argument more convincing. Relatedly, what is the role of the parameter n in risk sensitivity?

4. Regarding the paragraph starting with "Recent work shows that empirical dopamine responses exhibit…", I found the description of the simulation unclear. Please provide all the relevant information about this simulation in this paragraph.

Minor points:

- Fig 1b: please report the values of σ and n adopted in the simulation.
- Line 100: I think it should be "asymmetric prediction errors".
- "Intuitively, these apparent learning rates arise because fitting with the standard RL model (Eqn. 4) approximates bipartite linear regression on the nonlinear value function": this sentence is not clear to me.
- Regarding the paragraph "Moreover, the intrinsic RPE asymmetry of NRL model…", the author nicely draws a parallel between the simulation's results and the main four findings reported by Dabney et al. (2020) outlined previously. It would help to make the parallel more explicit.

References: Rigoli, F. (2019). Reference effects on decision-making elicited by previous rewards. Cognition, 192, 104034. Woodford, M. (2012). Prospect theory as efficient perceptual distortion. American Economic Review, 102(3), 41-46.

Reviewer #2:

In the present study, the author describes behavioral and computational implications of using nonlinear reinforcement learning (RL) models. The author characterizes an RL model implementing a nonlinear value function of reward via a divisive normalization computation, and shows that this nonlinear specification can produce asymmetric prediction error coding and thereby explain both behavioral phenomena such as attitudes toward risk and computational mechanisms underlying distributional learning in the brain. The paper is well written, clear, and concise, and the question raised by the author is of prime interest. The link made between nonlinear value functions and updating asymmetry in linear models with multiple learning rates is particularly interesting and promising, and in making that link and describing its computational implications, the study fulfills its goal perfectly. The paper is very informative, and I enjoyed reading it.

On a side note, it is not clear how the model, in its current version, will be useful for analyzing behavior in future studies. The model can produce relevant effects observed in humans, but the lack of a clear view of what the semisaturation parameter could represent (cognitively) or be determined by is problematic. It is not obvious either whether the model can reliably fit human behavior and make actual predictions about it.

Major comments:

1. Alternative nonlinear models. The nonlinear valuation in the model can produce asymmetric prediction error coding, but so would other nonlinear, convex or concave, utility functions. For instance, another function that is convex then concave (like the present function for n ≥ 2) is the prospect theory value function, which is convex below the reference point and concave above. Moving the reference point would then produce similar effects as moving the semisaturation parameter in the present model. Beyond its explanatory power at the neural level, could the author explain the advantages of using the present model over other nonlinear ones? I understand why divisive normalization is used here, but is it, in the present version, a good behavioral model? Regarding the prospect theory function, moving the reference point arbitrarily to account for various behaviors, without a clear assumption about how it is set, would not make much sense, and the same could be said of moving the semisaturation parameter in the present model, which furthermore has a less clear definition. Could the author comment on this point?

2. Interpretation of the semisaturation parameter. Related to point 1, I find it hard to grasp the meaning of the semisaturation parameter at the behavioral level. In the simulations presented in the paper, it seems that its only role is to place the concave or convex portion of the valuation curve adequately on the range of objective values, in order to generate a positive or a negative asymmetry in prediction error coding respectively. As an example, does it make sense to define the semisaturation parameter as twice the upper bound of the range of objective values in the considered environment (e.g., the choice task described here)?

3. Use of the model in behavioral studies. It is not clear from the paper how well the model could fit behavioral data and how well its parameters could be recovered from simulated data. For instance, in the behavioral part of the paper, the model is used to generate data in a choice task; could the simulated data be reasonably well fitted by the generative model, and the parameters of the latter well recovered?

Minor comments:

- Figure 1b: it would be useful for readers to make explicit the parameters used.
- Figure 3a: similarly, it would be useful to know which parameters were used to generate the risk-seeking and risk-averse agents; we know the range of values used across all agents, but not the values of these two particular cases.

Both reviewers confirmed that all data and code underlying the findings were made fully available. Reviewer #1 chose to disclose his identity (Francesco Rigoli); Reviewer #2 did not.

Revision submitted 14 Jun 2022.

Decision letter (1 Jul 2022)

Dear Dr. Louie,

We are pleased to inform you that your manuscript "Asymmetric and adaptive reward coding via normalized reinforcement learning" has been provisionally accepted for publication in PLOS Computational Biology. Also please note a couple of minor suggestions from Reviewer #1.

Best regards,
Samuel Gershman
Deputy Editor, PLOS Computational Biology

Reviewer #1: Thanks again for the opportunity to review the paper. The author has addressed all the points I highlighted, and the paper appears a valuable contribution to the literature. I have two further minor suggestions. First, the sentence "the adaptation of human valuation behavior is captured by a normalization mechanism with an equivalent σ averaging past rewards (38)" sounds a bit overstated; I suggest adding something like "at least in some circumstances". Second, in the response to reviewers, the author mentions the empirical literature about asymmetric RL and argues that an analysis of this is beyond the scope of the manuscript, something left for future work. This reply sounds reasonable; however, I think it is preferable to briefly acknowledge this in the discussion (and potentially briefly speculate about how the model could capture some of the main findings).

Reviewer #2: I thank the author for their pertinent and precise answers to my comments. The author performed new analyses and added new paragraphs to the manuscript, improving further an already very interesting study.

Formally accepted 11 Jul 2022 (PCOMPBIOL-D-22-00336R1).
