
A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback.

Robert Legenstein, Dejan Pecevski, Wolfgang Maass.

Abstract

Reward-modulated spike-timing-dependent plasticity (STDP) has recently emerged as a candidate for a learning rule that could explain how behaviorally relevant adaptive changes in complex networks of spiking neurons could be achieved in a self-organizing manner through local synaptic plasticity. However, the capabilities and limitations of this learning rule could so far only be tested through computer simulations. This article provides tools for an analytic treatment of reward-modulated STDP, which allows us to predict under which conditions reward-modulated STDP will achieve a desired learning effect. These analytical results imply that neurons can learn through reward-modulated STDP to classify not only spatial but also temporal firing patterns of presynaptic neurons. They also can learn to respond to specific presynaptic firing patterns with particular spike patterns. Finally, the resulting learning theory predicts that even difficult credit-assignment problems, where it is very hard to tell which synaptic weights should be modified in order to increase the global reward for the system, can be solved in a self-organizing manner through reward-modulated STDP. This yields an explanation for a fundamental experimental result on biofeedback in monkeys by Fetz and Baker. In this experiment monkeys were rewarded for increasing the firing rate of a particular neuron in the cortex and were able to solve this extremely difficult credit assignment problem. Our model for this experiment relies on a combination of reward-modulated STDP with variable spontaneous firing activity. Hence it also provides a possible functional explanation for trial-to-trial variability, which is characteristic for cortical networks of neurons but has no analogue in currently existing artificial computing systems. 
In addition our model demonstrates that reward-modulated STDP can be applied to all synapses in a large recurrent neural network without endangering the stability of the network dynamics.


Year: 2008 PMID: 18846203 PMCID: PMC2543108 DOI: 10.1371/journal.pcbi.1000180

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

Numerous experimental studies (see [1] for a review; [2] discusses more recent in-vivo results) have shown that the efficacy of synapses changes in dependence on the time difference $\Delta t = t_{\text{post}} - t_{\text{pre}}$ between the firing times $t_{\text{pre}}$ and $t_{\text{post}}$ of the pre- and postsynaptic neurons. This effect is called spike-timing-dependent plasticity (STDP). But a major puzzle for understanding learning in biological organisms is the relationship between experimentally well-established rules for STDP on the microscopic level, and adaptive changes of the behavior of biological organisms on the macroscopic level. Neuromodulatory systems, which send diffuse signals related to reinforcements (rewards) and behavioral state to several large networks of neurons in the brain, have been identified as likely intermediaries that relate these two levels of plasticity. It is well-known that the consolidation of changes of synaptic weights in response to pre- and postsynaptic neuronal activity requires the presence of such third signals [3],[4]. In particular, it has been demonstrated that dopamine (which is behaviorally related to novelty and reward prediction [5]) gates plasticity at corticostriatal synapses [6],[7] and within the cortex [8]. It has also been shown that acetylcholine gates synaptic plasticity in the cortex (see for example [9] and [10]; [11] contains a nice review of the literature). Corresponding spike-based rules for synaptic plasticity of the form

$$\frac{d}{dt} w_{ji}(t) = c_{ji}(t)\, d(t) \tag{1}$$

have been proposed in [12] and [13] (see Figure 1 for an illustration of this learning rule), where $w_{ji}$ is the weight of a synapse from neuron i to neuron j, $c_{ji}(t)$ is an eligibility trace of this synapse which collects weight changes proposed by STDP, and $d(t) = h(t) - \bar{h}$ results from a neuromodulatory signal h(t) with mean value $\bar{h}$. It was shown in [12] that a number of interesting learning tasks in large networks of neurons can be accomplished with this simple rule in Equation 1.
It has recently been shown that quite similar learning rules for spiking neurons arise when one applies the general framework of distributed reinforcement learning from [14] to networks of spiking neurons [13],[15], or if one maximizes the likelihood of postsynaptic firing at desired firing times [16]. However no analytical tools have been available, which make it possible to predict for what learning tasks, and under which parameter settings, reward-modulated STDP will be successful. This article provides such analytical tools, and demonstrates their applicability and significance through a variety of computer simulations. In particular, we identify conditions under which neurons can learn through reward-modulated STDP to classify temporal presynaptic firing patterns, and to respond with particular spike patterns.

Figure 1

Scheme of reward-modulated STDP according to Equations 1–4.


(A) Eligibility function f(t), which scales the contribution of a pre/post spike pair (with the second spike at time 0) to the eligibility trace c(t) at time t. (B) Contribution of a pre-before-post spike pair (in red) and a post-before-pre spike pair (in green) to the eligibility trace c(t) (in black), which is the sum of the red and green curves. According to Equation 1 the change of the synaptic weight w is proportional to the product of c(t) with a reward signal d(t).

We also provide a model for the remarkable operant conditioning experiments of [17] (see also [18],[19]). In the simpler of these experiments, the spiking activity of single neurons (in area 4 of the precentral gyrus of monkey cortex) was recorded, and the deviation of the current firing rate of an arbitrarily selected neuron from its average firing rate was made visible to the monkey through the displacement of an illuminated meter arm, whose rightward position corresponded to the threshold for the feeder discharge. The monkey received food rewards for increasing (or, in alternating trials, for decreasing) the firing rate of this neuron. The monkeys learnt quite reliably (within a few minutes) to change the firing rate of this neuron in the currently rewarded direction. Adjacent neurons tended to change their firing rate in the same direction, but differential changes of directions of firing rates of pairs of neurons are also reported in [17] (when these differential changes were rewarded). For example, it was shown in Figure 9 of [17] (see also Figure 1 in [19]) that pairs of neurons that were separated by no more than a few hundred microns could be independently trained to increase or decrease their firing rates. The existence of learning mechanisms in the brain which are able to solve this extremely difficult credit assignment problem provides an important clue for understanding the organization of learning in the brain.
We examine in this article analytically under what conditions reward-modulated STDP is able to solve such learning problems. We test the correctness of analytically derived predictions through computer simulations of biologically quite realistic recurrently connected networks of neurons, where an increase of the firing rate of one arbitrarily selected neuron within a network of 4000 neurons is reinforced through rewards (which are sent to all 142813 synapses between excitatory neurons in this recurrent network). We also provide a model for the more complex operant conditioning experiments of [17] by showing that pairs of neurons can be differentially trained through reward-modulated STDP, where one neuron is rewarded for increasing its firing rate, and simultaneously another neuron is rewarded for decreasing its firing rate. More precisely, we increased the reward signal d(t) which is transmitted to all synapses between excitatory neurons in the network whenever the first neuron fired, and decreased this reward signal whenever the second neuron fired (the resulting composed reward corresponds to the displacement of the meter arm that was shown to the monkey in these more complex operant conditioning experiments). Our theory and computer simulations also show that reward-modulated STDP can be applied to all synapses within a large network of neurons for long time periods, without endangering the stability of the network. In particular this synaptic plasticity rule keeps the network within the asynchronous irregular firing regime, which had been described in [20] as a dynamic regime that resembles spontaneous activity in the cortex. Another interesting aspect of learning with reward-modulated STDP is that it requires spontaneous firing and trial-to-trial variability within the networks of neurons where learning takes place.
Hence our learning theory for this synaptic plasticity rule provides a foundation for a functional explanation of these characteristic features of cortical networks of neurons that are undesirable from the perspective of most computational theories.

Results

We first give a precise definition of the learning rule in Equation 1 for reward-modulated STDP. The standard rule for STDP, which specifies the change W(Δt) of the synaptic weight of an excitatory synapse in dependence on the time difference $\Delta t = t_{\text{post}} - t_{\text{pre}}$ between the firing times $t_{\text{pre}}$ and $t_{\text{post}}$ of the pre- and postsynaptic neuron, is based on numerous experimental data (see [1]). It is commonly modeled by a so-called learning curve of the form

$$W(\Delta t) = \begin{cases} A_+ \, e^{-\Delta t/\tau_+}, & \Delta t > 0 \\ -A_- \, e^{\Delta t/\tau_-}, & \Delta t \le 0 \end{cases} \tag{2}$$

where the positive constants $A_+$ and $A_-$ scale the strength of potentiation and depression respectively, and $\tau_+$ and $\tau_-$ are positive time constants defining the width of the positive and negative learning window. The resulting weight change at time t of synapse ji for a presynaptic spike train $S_i^{\text{pre}}$ and a postsynaptic spike train $S_j^{\text{post}}$ is usually modeled [21] by the instantaneous application of this learning rule to all spike pairings with the second spike at time t:

$$\frac{d}{dt} w_{ji}(t) = \int_0^\infty dr\, W(r)\, S_j^{\text{post}}(t)\, S_i^{\text{pre}}(t-r) + \int_0^\infty dr\, W(-r)\, S_j^{\text{post}}(t-r)\, S_i^{\text{pre}}(t) \tag{3}$$

The spike train of a neuron i which fires action potentials at times $t_i^{(1)}, t_i^{(2)}, t_i^{(3)}, \ldots$ is formalized here by a sum of Dirac delta functions $S_i(t) = \sum_n \delta(t - t_i^{(n)})$. The model analyzed in this article is based on the assumption that positive and negative weight changes suggested by STDP for all pairs of pre- and postsynaptic spikes at synapse ji (according to the two integrals in Equation 3) are collected in an eligibility trace $c_{ji}(t)$ at the site of the synapse. The contribution to $c_{ji}(t)$ of all spike pairings with the second spike at time t−s is modeled for s>0 by a function f(s) (see Figure 1A); the time scale of the eligibility trace is assumed in this article to be on the order of seconds. Hence the value of the eligibility trace of synapse ji at time t is given by

$$c_{ji}(t) = \int_0^\infty ds\, f(s) \left[ \int_0^\infty dr\, W(r)\, S_j^{\text{post}}(t-s)\, S_i^{\text{pre}}(t-s-r) + \int_0^\infty dr\, W(-r)\, S_j^{\text{post}}(t-s-r)\, S_i^{\text{pre}}(t-s) \right] \tag{4}$$

(see Figure 1B). The actual weight change at time t for reward-modulated STDP is the product $c_{ji}(t) \cdot d(t)$ of the eligibility trace with the reward signal d(t) as defined by Equation 1. Since this simple model can in principle lead to unbounded growth of weights, we assume that weights are clipped at the lower boundary value 0 and an upper boundary $w_{\max}$.
The network dynamics of a simulated recurrent network of spiking neurons where all connections between excitatory neurons are subject to STDP is quite sensitive to the particular STDP-rule that is used. Therefore we have carried out our network simulations not only with the additive STDP-rule in Equation 3, whose effect can be analyzed theoretically, but also with the more complex rule proposed in [22] (which was fitted to experimental data from hippocampal neurons in culture [23]), where the magnitude of the weight change depends on the current value of the weight. An implementation of this STDP-rule (with the parameters proposed in [22]) produced in our network simulations of the biofeedback experiment (computer simulation 1) as well as for learning pattern classification (computer simulation 4) qualitatively the same result as the rule in Equation 3.
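The additive rule can be sketched in a few lines of code. The following is a minimal illustration and not the simulation code used in the article: the parameter values, the alpha-shaped eligibility function, and the names `stdp_window`, `eligibility_fn`, `eligibility_trace`, and `run` are all assumptions made for this example. It evaluates the eligibility trace directly from the spike pairs (each pair contributes the learning-window value for its time difference, scaled by f of the time elapsed since the second spike) and integrates the product c(t)·d(t) with a forward Euler step, clipping the weight to the interval from 0 to the upper bound.

```python
import math

# All numerical values below are illustrative assumptions.
A_PLUS, A_MINUS = 1.0, 1.05        # amplitudes of potentiation / depression
TAU_PLUS, TAU_MINUS = 0.02, 0.02   # widths of the learning window (s)
TAU_E = 0.5                        # time scale of the eligibility function (s)
W_MAX = 1.0                        # upper weight bound


def stdp_window(dt):
    """Exponential learning curve W(dt), with dt = t_post - t_pre."""
    if dt > 0:
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    return -A_MINUS * math.exp(dt / TAU_MINUS)


def eligibility_fn(s):
    """Assumed eligibility function f(s): alpha function peaking at s = TAU_E."""
    return (s / TAU_E) * math.exp(1.0 - s / TAU_E) if s > 0 else 0.0


def eligibility_trace(t, pre_spikes, post_spikes):
    """c(t): sum over spike pairs of W(dt) * f(t - time of the second spike)."""
    c = 0.0
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            t_second = max(t_pre, t_post)
            if t_second < t:
                c += stdp_window(t_post - t_pre) * eligibility_fn(t - t_second)
    return c


def run(pre_spikes, post_spikes, reward, w0=0.5, t_end=3.0, dt=0.001, lr=1.0):
    """Euler integration of dw/dt = lr * c(t) * d(t), clipped to [0, W_MAX]."""
    w = w0
    for k in range(int(round(t_end / dt))):
        t = k * dt
        w += lr * dt * eligibility_trace(t, pre_spikes, post_spikes) * reward(t)
        w = min(max(w, 0.0), W_MAX)
    return w
```

With a single pre-before-post pair, a constant positive reward raises the weight, a constant negative reward lowers it, and a zero reward leaves it unchanged, which is the gating behavior that the product c(t)·d(t) is meant to express.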

Theoretical Analysis of the Resulting Weight Changes

In this section, we derive a learning equation for reward-modulated STDP. This learning equation relates the change of a synaptic weight $w_{ji}$ over some sufficiently long time interval T to statistical properties of the joint distribution of the reward signal d(t) and pre- and postsynaptic firing times, under the assumption that the weight and correlations between pre- and postsynaptic spike times are slowly varying in time. We treat spike times as well as the reward signal d(t) as stochastic variables. This mathematical framework allows us to derive the expected weight change over some time interval T (see [21]), with the expectation taken over realizations of the stochastic input- and output spike trains as well as stochastic realizations of the reward signal, denoted by the ensemble average $\langle\cdot\rangle_E$:

$$\frac{\langle w_{ji}(t+T) - w_{ji}(t) \rangle_E}{T} = \left\langle \left\langle \frac{d}{dt} w_{ji}(t) \right\rangle_T \right\rangle_E \tag{5}$$

where we used the abbreviation $\langle f(t) \rangle_T = T^{-1} \int_t^{t+T} f(t')\, dt'$. If synaptic plasticity is sufficiently slow, synaptic weights integrate a large number of small changes. In this case, the weight $w_{ji}$ can be approximated by its average $\langle w_{ji} \rangle_E$ (it is “self-averaging”, see [21]). We can thus drop the expectation on the left hand side of Equation 5 and write it as $\frac{d}{dt} \langle w_{ji}(t) \rangle_T$. Using Equation 1, this yields (see Methods)

$$\frac{d}{dt} \langle w_{ji}(t) \rangle_T = \int_0^\infty dr\, W(r) \int_0^\infty ds\, f(s)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T + \int_{-\infty}^0 dr\, W(r) \int_{|r|}^\infty ds\, f(s+r)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T \tag{6}$$

This formula contains the reward correlation for synapse ji

$$D_{ji}(t,s,r) = \langle d(t) \mid \text{neuron } j \text{ spikes at } t-s \text{ and neuron } i \text{ spikes at } t-s-r \rangle_E \tag{7}$$

which is the average reward at time t given a presynaptic spike at time t−s−r and a postsynaptic spike at time t−s. The joint firing rate $\nu_{ji}(t,r) = \langle S_j^{\text{post}}(t)\, S_i^{\text{pre}}(t-r) \rangle_E$ describes correlations between spike timings of neurons j and i, i.e., it is the probability density for the event that neuron i fires an action potential at time t−r and neuron j fires an action potential at time t. For synapses subject to reward-modulated STDP, changes in efficacy are obviously driven by co-occurrences of spike pairings and rewards within the time scale of the eligibility trace. Equation 6 clarifies how the expected weight change depends on how the correlations between the pre- and postsynaptic neurons correlate with the reward signal.

If one assumes for simplicity that the impact of a spike pair on the eligibility trace is always triggered by the postsynaptic spike, one gets a simpler equation (see Methods)

$$\frac{d}{dt} \langle w_{ji}(t) \rangle_T = \int_0^\infty ds\, f(s) \int_{-\infty}^{\infty} dr\, W(r)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T \tag{8}$$

The assumption introduces a small error for post-before-pre spike pairs, because for a reward signal that arrives at some time $d_r$ after the pairing, the weight update will be proportional to $f(d_r)$ instead of $f(d_r + r)$. The approximation is justified if the temporal average is performed on a much longer time scale than the time scale of the learning window, the effect of each pre-post spike pair on the reward signal is delayed by an amount greater than the time scale of the learning window, and f changes slowly compared to the time scale of the learning window (see Methods for details). For the analyses presented in this article, the simplified Equation 8 is a good approximation for the learning dynamics. Equation 8 is a generalized version of the STDP learning equation in [21] that includes the impact of the reward correlation weighted by the eligibility function.

To see the relation between standard STDP and reward-modulated STDP, consider a constant reward signal $d(t) = d_0$. Then the reward correlation is also constant and given by $D_{ji}(t,s,r) = d_0$. We recover the standard STDP learning equation scaled by $d_0$ if the eligibility function is an instantaneous delta-pulse f(s) = δ(s). Furthermore, if the statistics of the reward signal d(t) are time-independent and independent of the pre- and postsynaptic spike statistics of some synapse ji, then the reward correlation is given by $D_{ji}(t,s,r) = \langle d(t) \rangle_E = d_0$ for some constant $d_0$. Then, the weight change for synapse ji is $\frac{d}{dt} \langle w_{ji}(t) \rangle_T = d_0 \int_0^\infty ds\, f(s) \int_{-\infty}^{\infty} dr\, W(r)\, \langle \nu_{ji}(t-s,r) \rangle_T$. The temporal average of the joint firing rate $\langle \nu_{ji}(t-s,r) \rangle_T$ is thus filtered by the eligibility trace. We assumed in the preceding analysis that the temporal average is taken over some long time interval T. If the time scale of the eligibility trace is much smaller than this time interval T, then the weight change is approximately $\frac{d}{dt} \langle w_{ji}(t) \rangle_T \approx d_0 \left( \int_0^\infty f(s)\, ds \right) \int_{-\infty}^{\infty} dr\, W(r)\, \langle \nu_{ji}(t,r) \rangle_T$, and the weight $w_{ji}$ will change according to standard STDP scaled by a constant proportional to the mean reward and the integral over the eligibility function.

In the remainder of this article, we will always use the smooth time-averaged weight change $\frac{d}{dt} \langle w_{ji}(t) \rangle_T$, but for brevity, we will drop the angular brackets and simply write $\frac{d}{dt} w_{ji}(t)$. The learning Equation 8 provides the mathematical basis for our following analyses. It allows us to determine synaptic weight changes if we can describe a learning situation in terms of reward correlations and correlations between pre- and postsynaptic spikes.
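As a rough numerical illustration of the constant-reward limit, the sketch below computes the factor that multiplies standard STDP when the reward is independent of the spiking at the synapse: the mean reward times the integral of the eligibility function. It additionally assumes independent pre- and postsynaptic Poisson spike trains, so that the joint firing rate factorizes into the product of the two rates and the learning window contributes only through its integral. The alpha-shaped eligibility function, all parameter values, and the function names are illustrative assumptions, not taken from the article.

```python
import math

# All numerical values below are illustrative assumptions.
TAU_E = 0.5                        # time scale of the eligibility function (s)
A_PLUS, A_MINUS = 1.0, 1.05        # STDP amplitudes
TAU_PLUS, TAU_MINUS = 0.02, 0.02   # STDP time constants (s)


def eligibility_fn(s):
    """Assumed alpha-shaped eligibility function f(s)."""
    return (s / TAU_E) * math.exp(1.0 - s / TAU_E)


def integrate(fn, a, b, n=20000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (fn(a) + fn(b))
    for k in range(1, n):
        total += fn(a + k * h)
    return total * h


def stdp_window_integral():
    """Integral of the exponential learning window over all time differences."""
    return A_PLUS * TAU_PLUS - A_MINUS * TAU_MINUS


def expected_drift(d0, rate_pre, rate_post):
    """Approximate dw/dt for a constant reward correlation d0 and independent
    spike trains: d0 * (integral of f) * (integral of W) * rate_pre * rate_post."""
    f_mass = integrate(eligibility_fn, 0.0, 20.0 * TAU_E)
    return d0 * f_mass * stdp_window_integral() * rate_pre * rate_post
```

With these assumed parameters the learning window has a slightly negative integral, so a positive mean reward produces a slow downward drift for uncorrelated inputs, and the drift scales linearly with the mean reward.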

Application to Models for Biofeedback Experiments

We now apply the preceding analysis to the biofeedback experiments of [17] that were described in the introduction. These experiments pose the challenge to explain how learning mechanisms in the brain can detect and exploit correlations between rewards and the firing activity of one or a few neurons within a large recurrent network of neurons (the credit assignment problem), without changing the overall function or dynamics of the circuit. We show that this phenomenon can in principle be explained by reward-modulated STDP. In order to do that, we define a model for the experiment which allows us to formulate an equation for the reward signal d(t). This enables us to calculate synaptic weight changes for this particular scenario. We consider as a model a recurrent neural circuit where the spiking activity of one neuron k is recorded by the experimenter. (Experiments where two neurons are recorded and reinforced were also reported in [17]. We tested this case in computer simulations (see Figure 2) but did not treat it explicitly in our theoretical analysis.) We assume that in the monkey brain a reward signal d(t) is produced which depends on the visual feedback (through an illuminated meter, whose pointer deflection was dependent on the current firing rate of the randomly selected neuron k) as well as previously received liquid rewards, and that this signal d(t) is delivered to all synapses in large areas of the brain. We can formalize this scenario by defining a reward signal which depends on the spike rate of the arbitrarily selected neuron k (see Figure 3A and 3B). More precisely, a reward pulse of shape ε(r) (the reward kernel) is produced with some delay $d_r$ every time the neuron k produces an action potential:

$$d(t) = \int_0^\infty dr\, S_k(t - d_r - r)\, \varepsilon(r) \tag{9}$$

Note that d(t) = h(t)−h̅ is defined in Equation 1 as a signal with zero mean. In order to satisfy this constraint, we assume that the reward kernel ε has zero mass, i.e., $\int_0^\infty \varepsilon(r)\, dr = 0$. For the analysis, we use the linear Poisson neuron model described in Methods.

The mean weight change for synapses to the reinforced neuron k is then approximately (see Methods)

$$\frac{d}{dt} w_{ki}(t) \approx \int_0^\infty ds\, f(s + d_r)\, \varepsilon(s) \cdot \int_{-\infty}^{\infty} dr\, W(r)\, \langle \nu_{ki}(t,r) \rangle_T \tag{10}$$

This equation describes STDP with a learning rate proportional to $\int_0^\infty ds\, f(s + d_r)\, \varepsilon(s)$. The outcome of the learning session will strongly depend on this integral and thus on the form of the reward kernel ε. In order to reinforce high firing rates of the reinforced neuron we have chosen a reward kernel with a positive bump in the first few hundred milliseconds, and a long negative tail afterwards. Figure 3C shows the functions f and ε that were used in our computer model, as well as the product of these two functions. One sees that the integral over the product is positive and according to Equation 10 the synapses to the reinforced neuron are subject to STDP. This does not guarantee an increase of the firing rate of the reinforced neuron. Instead, the changes of neuronal firing will depend on the statistics of the inputs. In particular, the weights of synapses to neuron k will not increase if that neuron does not fire spontaneously. For uncorrelated Poisson input spike trains of equal rate, the firing rate of a neuron trained by STDP stabilizes at some value which depends on the input rate (see [24],[25]). However, in comparison to the low spontaneous firing rates observed in the biofeedback experiment [17], the stable firing rate under STDP can be much higher, allowing for a significant rate increase.

It was shown in [17] that low firing rates of a single neuron can also be reinforced. In order to model this, we have chosen a reward kernel with a negative bump in the first few hundred milliseconds, and a long positive tail afterwards, i.e. we inverted the kernel used above to obtain a negative integral $\int_0^\infty ds\, f(s + d_r)\, \varepsilon(s)$. According to Equation 10 this leads to anti-STDP where not only inputs to the reinforced neuron which have low correlations with the output are depressed (because of the negative integral of the learning window), but also those which are causally correlated with the output. This leads to a quick firing rate decrease at the reinforced neuron.
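The dependence of the learning rate on the shape of the reward kernel can be checked numerically. The sketch below is only an illustration under stated assumptions: the kernel is built as the difference of two alpha functions (a fast positive bump minus a slow negative tail, scaled so that the total mass is zero), the eligibility function is also an alpha function, and the delay, time constants, and function names are hypothetical choices rather than the values used in the article. It evaluates the integral of f(s + delay)·ε(s) for this kernel and for its sign-inverted version.

```python
import math

# All numerical values below are illustrative assumptions.
TAU_E = 0.4      # time scale of the eligibility function (s)
DELAY = 0.2      # reward delay (s)
TAU_POS = 0.2    # time scale of the early bump of the kernel (s)
TAU_NEG = 1.0    # time scale of the late tail of the kernel (s)


def alpha(s, tau):
    """Alpha function with peak value 1 at s = tau; zero for s <= 0."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0 else 0.0


def eligibility_fn(s):
    return alpha(s, TAU_E)


def reward_kernel(s):
    """Early positive bump minus a slower negative tail, scaled to zero mass."""
    return alpha(s, TAU_POS) - (TAU_POS / TAU_NEG) * alpha(s, TAU_NEG)


def inverted_kernel(s):
    return -reward_kernel(s)


def integrate(fn, a, b, n=20000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (fn(a) + fn(b))
    for k in range(1, n):
        total += fn(a + k * h)
    return total * h


def kernel_mass(kernel):
    """Integral of the kernel over s >= 0 (truncated at 30 s)."""
    return integrate(kernel, 0.0, 30.0)


def learning_rate_factor(kernel):
    """Integral of f(s + DELAY) * kernel(s) over s >= 0 (truncated at 30 s)."""
    return integrate(lambda s: eligibility_fn(s + DELAY) * kernel(s), 0.0, 30.0)
```

For these assumed shapes the kernel has zero mass, yet the factor is positive, because the eligibility function weights the early positive bump more strongly than the late negative tail. Inverting the kernel flips the sign, which corresponds to the anti-STDP case.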

Figure 2

Differential reinforcement of two neurons (within a simulated network of 4000 neurons, the two rewarded neurons are denoted as A and B), corresponding to the experimental results shown in Figure 9 of [17] and Figure 1 of [19].

(A) The spike response of 100 randomly chosen neurons at the beginning of the simulation (20 sec–23 sec, left plot), and at the middle of simulation just before the switching of the reward policy (597 sec–600 sec, right plot). The firing times of the first reinforced neuron A are marked by blue crosses and those of the second reinforced neuron B are marked by green crosses. (B) The dashed vertical line marks the switch of the reinforcements at t = 10 min. The firing rate of neuron A (blue line) increases while it is positively reinforced in the first half of the simulation and decreases in the second half when its spiking is negatively reinforced. The firing rate of the neuron B (green line) decreases during the negative reinforcement in the first half and increases during the positive reinforcement in the second half of the simulation. The average firing rate of 20 other randomly chosen neurons (dashed line) remains unchanged. (C) Evolution of the average weight of excitatory synapses to the rewarded neurons A and B (blue and green lines, respectively), and of the average weight of 1744 randomly chosen excitatory synapses to other neurons in the circuit (dashed line).

Figure 3

Setup of the model for the experiment by Fetz and Baker [17].


(A) Schema of the model: The activity of a single neuron in the circuit determines the amount of reward delivered to all synapses between excitatory neurons in the circuit. (B) The reward signal d(t) in response to a spike train (shown at the top) of the arbitrarily selected neuron (which was selected from a recurrently connected circuit consisting of 4000 neurons). The level of the reward signal d(t) follows the firing rate of the spike train. (C) The eligibility function f(s) (black curve, left axis), the reward kernel ε(s) delayed by 200 ms (red curve, right axis), and the product of these two functions (blue curve, right axis) as used in our computer experiment. The integral of $f(s + d_r)\, \varepsilon(s)$ is positive, as required according to Equation 10 in order to achieve a positive learning rate for the synapses to the selected neuron.

The mean weight change of synapses to non-reinforced neurons j≠k is given by

$$\frac{d}{dt} w_{ji}(t) \approx \int_0^\infty ds\, f(s) \int_{-\infty}^{\infty} dr\, W(r) \left\langle \nu_{ji}(t-s,r) \int_0^\infty dr'\, \varepsilon(r')\, \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r')}{\nu_j(t-s)} \right\rangle_T \tag{11}$$

where $\nu_j(t) = \langle S_j(t) \rangle_E$ is the instantaneous firing rate of neuron j at time t. This equation indicates that a non-reinforced neuron is trained by STDP with a learning rate proportional to its correlation with the reinforced neuron given by $\nu_{kj}(t - d_r - r',\, s - d_r - r') / \nu_j(t-s)$. In fact, it was noted in [17] that neurons near the reinforced neuron tended to change their firing rate in the same direction. This observation might be explained by putative correlations of the recorded neuron with nearby neurons. On the other hand, if a neuron j is uncorrelated with the reinforced neuron k, we can decompose the joint firing rate into $\nu_{kj}(t - d_r - r',\, s - d_r - r') = \nu_k(t - d_r - r')\, \nu_j(t-s)$. In this case, the learning rate for synapse ji is approximately zero (see Methods). This ensures that most neurons in the circuit keep a constant firing rate, in spite of continuous weight changes according to reward-modulated STDP.

Altogether we see that the weights of synapses to the reinforced neuron k can only change if there is spontaneous activity in the network, so that in particular this neuron k also fires spontaneously. On the other hand the spontaneous network activity should not consist of repeating large-scale spatio-temporal firing patterns, since that would entail correlations between the firing of neuron k and other neurons j, and would lead to similar changes of synapses to these other neurons j. Apart from these requirements on the spontaneous network activity, the preceding theoretical results predict that stability of the circuit is preserved, while the neuron which is causally related to the reward signal is trained by STDP, if $\int_0^\infty ds\, f(s + d_r)\, \varepsilon(s)$ is positive.

Computer Simulation 1: Model for Biofeedback Experiment

We tested these theoretical predictions through computer simulations of a generic cortical microcircuit receiving a reward signal which depends on the firing of one arbitrarily chosen neuron k from the circuit (reinforced neuron). The circuit was composed of 4000 LIF neurons, with 3200 being excitatory and 800 inhibitory, interconnected randomly by 228954 conductance based synapses with short term dynamics (All computer simulations were also carried out as a control with static current based synapses, see Methods and Suppl.). In addition to the explicitly modeled synaptic connections, conductance noise (generated by an Ornstein-Uhlenbeck process) was injected into each neuron according to data from [26], in order to model synaptic background activity of neocortical neurons in-vivo (More precisely, for 50% of the excitatory neurons the amplitude of the noise injection was reduced to 20%, and instead their connection probabilities from other excitatory neurons were chosen to be larger, see Methods and Figure S1 and Figure S2 for details. The reinforced neuron had to be chosen from the latter population, since reward-modulated STDP does not work properly if the postsynaptic neuron fires too often because of directly injected noise). This background noise elicited spontaneous firing in the circuit at about 4.6 Hz. Reward-modulated STDP was applied continuously to all synapses which had excitatory presynaptic and postsynaptic neurons, and all these synapses received the same reward signal. The reward signal was modeled according to Equation 9. Figure 3C shows one reward pulse caused by a single postsynaptic spike at time t = 0 with the parameters used in the experiment. For several postsynaptic spikes, the amplitude of the reward signal follows the firing rate of the reinforced neuron, see Figure 3B. This model was simulated for 20 minutes of biological time. 
Figure 4A, 4B, and 4D show that the firing rate of the reinforced neuron increases within a few minutes (like in the experiment of [17]), while the firing rates of the other neurons remain largely unchanged. The increase of weights to the reinforced neuron shown in Figure 4C can be explained by the correlations between its presynaptic and postsynaptic spikes shown in panel E. This panel shows that pre-before-post spike pairings (black curve) are in general more frequent than post-before-pre spike pairings. The reinforced neuron increases its rate from around 4 Hz to 12 Hz, which is comparable to the measured firing rates in [15] before and after learning.

Figure 4

Simulation of the experiment by Fetz and Baker [17] for the case where an arbitrarily selected neuron triggers global rewards when it increases its firing rate.


(A) Spike response of 100 randomly chosen neurons within the recurrent network of 4000 neurons at the beginning of the simulation (20 sec–23 sec, left plot), and at the end of the simulation (the last 3 seconds, right plot). The firing times of the reinforced neuron are marked by blue crosses. (B) The firing rate of the positively rewarded neuron (blue line) increases, while the average firing rate of 20 other randomly chosen neurons (dashed line) remains unchanged. (C) Evolution of the average weight of excitatory synapses to the reinforced neuron (blue line), and of the average weight of 1663 randomly chosen excitatory synapses to other neurons in the circuit (dashed line). (D) Spike trains of the reinforced neuron before and after learning. (E) Histogram of the time-differences between presynaptic and postsynaptic spikes (bin size 0.5 ms), averaged over all excitatory synapses to the reinforced neuron. The black curve represents the histogram values for positive time differences (when the presynaptic spike precedes the postsynaptic spike), and the red curve represents the histogram for negative time differences.

In Figure 9 of [17] and Figure 1 of [19] the results of another experiment were reported where the activity of two adjacent neurons was recorded, and high firing rates of the first neuron and low firing rates of the second neuron were reinforced simultaneously. This kind of differential reinforcement resulted in an increase and decrease of the firing rates of the two neurons correspondingly. We implemented this type of reinforcement by letting the reward signal in our model depend on the spikes of the two randomly chosen neurons (we refer to these neurons as neuron A and neuron B), i.e. $d(t) = d_A^+(t) + d_B^-(t)$, where $d_A^+(t)$ is the component that positively rewards spikes of neuron A, and $d_B^-(t)$ is the component that negatively rewards spikes of neuron B. Both parts of the reward signal, $d_A^+(t)$ and $d_B^-(t)$, were defined as in Equation 9 for the corresponding neuron. For $d_A^+(t)$ we used the reward kernel ε as defined in Equation 29, whereas for $d_B^-(t)$ we used $\varepsilon^- = -\varepsilon$ (note that the integral over $\varepsilon^-$ is still zero). At the middle of the simulation (simulation time t = 10 min), we changed the direction of the reinforcements by negatively rewarding the firing of neuron A and positively rewarding the firing of neuron B (i.e., $d(t) = d_A^-(t) + d_B^+(t)$). The results are summarized in Figure 2. With a reward signal modeled in this way, we were able to independently increase and decrease the firing rates of the two neurons according to the reinforcements, while the firing rates of the other neurons remained unchanged. Changing the type of reinforcement during the simulation from positive to negative for neuron A and from negative to positive for neuron B resulted in a corresponding shift in their firing rate change in the direction of the reinforcement.

The dynamics of a network where STDP is applied to all synapses between excitatory neurons is quite sensitive to the specific choice of the STDP-rule. The preceding theoretical analysis (see Equations 10 and 11) predicts that reward-modulated STDP affects in the long run only those excitatory synapses where the firing of the postsynaptic neuron is correlated with the reward signal. In other words: the reward signal gates the effect of STDP in a recurrent network, and thereby can keep the network within a given dynamic regime. This prediction is confirmed qualitatively by the two panels of Figure 4A, which show that even after all excitatory synapses in the recurrent network have been subject to 20 minutes (in simulated biological time) of reward-modulated STDP, the network stays within the asynchronous irregular firing regime. It is also confirmed quantitatively through Figure 5. These figures show results for the simple additive version of STDP (according to Equation 3). Very similar results (see Figure S3 and Figure S4) arise from an application of the more complex STDP-rule proposed in [22] where the weight-change depends on the current weight value.
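The construction of a differential reward signal from two spike trains can be sketched as follows. This is a minimal illustration, assuming that each spike of a rewarded neuron launches a delayed copy of a zero-mass reward kernel, and that the second neuron uses the sign-inverted kernel. The kernel shape (a difference of two alpha functions), the delay, and the function names are assumptions made for this example and are not the Equation 29 kernel of the article.

```python
import math

# All numerical values below are illustrative assumptions.
DELAY = 0.2      # reward delay (s)
TAU_POS = 0.2    # time scale of the early bump of the kernel (s)
TAU_NEG = 1.0    # time scale of the late tail of the kernel (s)


def alpha(s, tau):
    """Alpha function with peak value 1 at s = tau; zero for s <= 0."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0 else 0.0


def reward_kernel(s):
    """Assumed zero-mass kernel: early positive bump, slow negative tail."""
    return alpha(s, TAU_POS) - (TAU_POS / TAU_NEG) * alpha(s, TAU_NEG)


def reward_component(t, spikes, sign):
    """sign * sum over spikes of kernel(t - DELAY - t_spike)."""
    return sign * sum(reward_kernel(t - DELAY - ts) for ts in spikes)


def differential_reward(t, spikes_a, spikes_b, a_rewarded=True):
    """Reward at time t when one neuron is rewarded and the other punished.

    With a_rewarded=True, spikes of neuron A use the kernel and spikes of
    neuron B use the inverted kernel; a_rewarded=False swaps the two roles.
    """
    sign_a = 1.0 if a_rewarded else -1.0
    return (reward_component(t, spikes_a, sign_a)
            + reward_component(t, spikes_b, -sign_a))
```

Switching `a_rewarded` from `True` to `False` halfway through a run mimics the reversal of the reinforcement direction. Identical spike trains for the two neurons cancel exactly, so only firing that differs between them drives the reward.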

Figure 5

Evolution of the dynamics of a recurrent network of 4000 LIF neurons during application of reward-modulated STDP.

(A) Distribution of the synaptic weights of excitatory synapses to 50 randomly chosen non-reinforced neurons, plotted for 4 different periods of simulated biological time during the simulation. The weights are averaged over 10 samples within these periods. The colors of the curves and the corresponding intervals are as follows: red (300–360 sec), green (600–660 sec), blue (900–960 sec), magenta (1140–1200 sec). (B) The distribution of average firing rates of the non-reinforced excitatory neurons in the circuit, plotted for the same time periods as in (A). The colors of the curves are the same as in (A). The distribution of the firing rates of the neurons in the circuit remains unchanged during the simulation, which covers 20 minutes of biological time. (C) Cross-correlogram of the spiking activity in the circuit, averaged over 200 pairs of non-reinforced neurons and over 60 s, with a bin size of 0.2 ms, for the period between 300 and 360 seconds of simulated biological time. It is calculated as the cross-covariance divided by the square root of the product of variances. (D) As in (C), but between seconds 1140 and 1200. (Separate plots of (B), (C), and (D) for two types of excitatory neurons that received different amounts of noise currents are given in Figure S1 and Figure S2.)
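The normalization described for panel (C), cross-covariance divided by the square root of the product of the variances, can be sketched as follows. This is a minimal illustration and not the analysis code used for the figure: the function name `cross_correlogram`, the assumption that spike trains are already binned into equal-length count vectors, and the use of NumPy are all choices made here for the example.

```python
import numpy as np

def cross_correlogram(x, y, max_lag):
    """Normalized cross-covariance of two binned spike trains.

    x, y    : equal-length arrays of spike counts per bin (assumed input format)
    max_lag : largest lag, in bins, on either side of zero
    Returns (lags, cc), where cc[k] is the cross-covariance at lags[k]
    divided by sqrt(var(x) * var(y)).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    norm = np.sqrt(x.var() * y.var())
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    for k, lag in enumerate(lags):
        if lag >= 0:
            # pair x[i] with y[i + lag]
            cov = np.dot(xc[: n - lag], yc[lag:]) / n
        else:
            # pair x[i] with y[i - |lag|]
            cov = np.dot(xc[-lag:], yc[: n + lag]) / n
        cc[k] = cov / norm
    return lags, cc
```

With this normalization a spike train correlated with itself gives the value 1 at lag zero, and independent trains give values close to zero at all lags.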


Rewarding Spike-Times

The preceding model for the biofeedback experiment of Fetz and Baker focused on learning of firing rates. In order to explore the capabilities and limitations of reward-modulated STDP in contexts where the temporal structure of spike trains matters, we investigated another reinforcement learning scenario where a neuron should learn to respond with particular temporal spike patterns. We first apply analytical methods to derive conditions under which a neuron subject to reward-modulated STDP can achieve this. In this model, the reward signal d(t) is given in dependence on how well the output spike train of a neuron j matches some rather arbitrary spike train S* (which might for example represent spike output from some other brain structure during a developmental phase). S* is produced by a neuron μ* that receives the same n input spike trains S 1,…,S as the trained neuron j, with some arbitrarily chosen weights , . But in addition the neuron μ* receives n′−n further spike trains S +1,…,S ′ with weights . The setup is illustrated in Figure 6A. It provides a generic reinforcement learning scenario, when a quite arbitrary (and not perfectly realizable) spike output is reinforced, but simultaneously the performance of the learner can be evaluated clearly according to how well its weights w 1,…,w match those of the neuron μ* for those n input spike trains which both of them have in common. The reward d(t) at time t depends in this task on both the timing of action potentials of the trained neuron and spike times in the target spike train S*where the function κ(r) with describes how the reward signal depends on the time difference r between a postsynaptic spike and a target spike, and d>0 is the delay of the reward.

Figure 6

Setup for reinforcement learning of spike times.

(A) Architecture. The trained neuron receives n input spike trains. The neuron μ* receives the same inputs plus additional inputs not accessible to the trained neuron. The reward is determined by the timing differences between the action potentials of the trained neuron and the neuron μ*. (B) A reward kernel with optimal offset from the origin of t = −6.6 ms. The optimal offset for this kernel was calculated with respect to the parameters from computer simulation 1 in Table 1. Reward is positive if the neuron spikes around the target spike or somewhat later, and negative if the neuron spikes much too early.


Table 1

Parameter values used for computer simulation 3 (see Figure 8).

Ex.	τ_ε [ms]	w_max	ν^post_min [Hz]	A₊ · 10⁶	A₋/A₊	τ₊ [ms]	A^κ₊, A^κ₋	τ^κ_2 [ms]	t_sim [h]
1	10	0.012	10	16.62	1.05	20	3.34, −3.12	20	5
2	7	0.020	5	11.08	1.02	15	4.58, −4.17	16	10
3	20	0.010	6	5.54	1.10	25	1.50, −1.39	40	19
4	7	0.020	5	11.08	1.07	25	4.67, −4.17	16	13
5	10	0.015	6	20.77	1.10	25	3.75, −3.12	20	2
6	25	0.005	3	13.85	1.01	25	3.34, −3.12	20	18

Our theoretical analysis (see Methods) predicts that under the assumption of constant-rate uncorrelated Poisson input statistics this reinforcement learning task can be solved by reward-modulated STDP for arbitrary initial weights if three constraints are fulfilled: The following parameters occur in these equations: ν* is the output rate of neuron μ*, is the minimal output rate, is the maximal output rate of the trained neuron, is the integral over the eligibility trace, is the integral over the STDP learning curve (see Equation 2), is the convolution of the reward kernel with the shape of the postsynaptic potential (PSP) ε(s), and is the integral over the PSP weighted by the learning window. If these inequalities are fulfilled and input rates are larger than zero, then the weight vector of the trained neuron converges on average from any initial weight vector to w* (i.e., it mimics the weight distribution of neuron μ* for those n inputs which both have in common). To get an intuitive understanding of these inequalities, we first examine the idea behind Constraint 13. This constraint assures that weights of synapses i with decay to zero in expectation. First note that input spikes from a spike train S with have no influence on the target spike train S*. In the linear Poisson neuron model, this leads to weight changes similar to STDP which can be described by two terms. First, all synapses are subject to depression stemming from the negative part of the learning curve W and random pre-post spike pairs. This weight change is bounded from below by for some positive constant α. On the other hand, the positive influence of input spikes on postsynaptic firing leads to potentiation of the synapse bounded from above by . Hence the weight decays to zero if , leading to Inequality 13. For synapses i with , there is an additional drive, since each presynaptic spike increases the probability of a closely following spike in the target spike train S*. 
Therefore, the probability of a delayed reward signal after a presynaptic spike is larger. This additional drive leads to positive weight changes if Inequalities 14 and 15 are fulfilled (see Methods). Note that also for the learning of spike times spontaneous spikes (which might be regarded as “noise”) are important, since they may lead to reward signals that can be exploited by the learning rule. It is obvious that in reward-modulated STDP, a silent neuron cannot recover from its silent state, since there will be no spikes which can drive STDP. But in addition, Condition 13 shows that in this learning scenario, the minimal output rate —which increases with increasing noise—has to be larger than some positive constant, such that depression is strong enough to weaken synapses if needed. On the other hand, if the noise is too strong also synapses i with w = w will be depressed and may not converge correctly. This can happen when the increased noise leads to a maximal postsynaptic rate such that Constraints 14 and 15 are not satisfied anymore. Conditions 13–15 also reveal how parameters of the model influence the applicability of this setup. For example, the eligibility trace enters the equations only in the form of its integral and its value at the reward delay in Equation 15. In fact, the exact shape of the eligibility trace is not important. The important property of an ideal eligibility trace is that it is high at the reward delay and low at other times as expressed by the fraction in Condition 15. Interestingly, the formulas also show that one has quite some freedom in choosing the form of the STDP window, as long as the reward kernel ε is adjusted accordingly. For example, instead of a standard STDP learning window W with W(r)≥0 for r>0 and W(r)≤0 for r<0 and a corresponding reward kernel κ, one can use a reversed learning window W′ defined by W′(r)≡W(−r) and a reward kernel κ′ such that ε ′(r) = ε(−r). 
If Condition 15 is satisfied for W and κ, then it is also satisfied for W′ and κ′ (and in most cases also Condition 14 will be satisfied). This reflects the fact that in reward modulated STDP the learning window defines the weight changes in combination with the reward signal. For a given STDP learning window, the analysis reveals what reward kernels κ are suitable for this learning setup. From Condition 15, we can deduce that the integral over κ should be small (but positive), whereas the integral should be large. Hence, for a standard STDP learning window W with W(r)≥0 for r>0 and W(r)≤0 for r<0, the convolution ε(r) of the reward kernel with the PSP should be positive for r>0 and negative for r<0. In the computer simulation we used a simple kernel depicted in Figure 6B, which satisfies the aforementioned constraints. It consists of two double-exponential functions, one positive and one negative, with a zero crossing at some offset t from the origin. The optimal offset t is always negative and in the order of several milliseconds for usual PSP-shapes ε. We conclude that for successful learning in this scenario, a positive reward should be produced if the neuron spikes around the target spike or somewhat later, and a negative reward should be produced if the neuron spikes much too early.
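The qualitative shape of such a reward kernel (a positive and a negative double-exponential lobe meeting at a negative offset) can be sketched as follows. The exact functional form used in the simulations is not reproduced here, so the expression below is an assumption. The amplitudes 3.34 and −3.12 and the time constant of 20 ms are borrowed from row 1 of Table 1, and the offset of −6.6 ms from the caption of Figure 6B, purely for illustration. The second time constant `tau_1` = 30 ms is a hypothetical value, as is the convention that r is the postsynaptic spike time minus the target spike time.

```python
import numpy as np

def reward_kernel(r, a_pos=3.34, a_neg=-3.12, tau_1=30.0, tau_2=20.0, offset=-6.6):
    """Illustrative reward kernel kappa(r); r, tau_1, tau_2, offset in ms.

    Assumed form: a double-exponential lobe on each side of the zero
    crossing at r = offset, positive for r > offset (amplitude a_pos > 0)
    and negative for r < offset (amplitude a_neg < 0). Requires tau_1 > tau_2.
    """
    x = np.asarray(r, dtype=float) - offset
    lobe = np.exp(-np.abs(x) / tau_1) - np.exp(-np.abs(x) / tau_2)  # >= 0, zero at x = 0
    return np.where(x >= 0, a_pos * lobe, a_neg * lobe)
```

With these illustrative values the integral over the kernel is (3.34 − 3.12) · (30 − 20) = 2.2, that is, small but positive, and the kernel is positive for postsynaptic spikes around or somewhat after the target spike and negative for spikes that come much too early.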

Computer Simulation 2: Learning Spike Times

In order to explore this learning scenario in a biologically more realistic setting, we trained a LIF neuron with conductance based synapses exhibiting short term facilitation and depression. The trained neuron and the neuron μ* which produced the target spike train S* both received inputs from 100 input neurons emitting spikes from a constant rate Poisson process of 15 Hz. The synapses to the trained neuron were subject to reward-modulated STDP. The weights of neuron μ* were set to for 0≤i<50 and for 50≤i<100. In order to simulate a non-realizable target response, neuron μ* received 10 additional synaptic inputs (with weights set to w/2). During the simulations we observed a firing rate of 18.2 Hz for the trained neuron, and 25.2 Hz for the neuron μ*. The simulations were run for 2 hours simulated biological time. We performed 5 repetitions of the experiment, each time with different randomly generated inputs and different initial weight values for the trained neuron. In each of the 5 runs, the average synaptic weights of synapses with and approached their target values, as shown in Figure 7A. In order to test how closely the trained neuron reproduces the target spike train S* after learning, we performed additional simulations where the same spike input was applied to the trained neuron before and after the learning. Then we compared the output of the trained neuron before and after learning with the output S* of neuron μ*. Figure 7B shows that the trained neuron approximates the part of S* which is accessible to it quite well. Figure 7C–F provide more detailed analyses of the evolution of weights during learning. The computer simulations confirmed the theoretical prediction that the neuron can learn well through reward-modulated STDP only if a certain level of noise is injected into the neuron (see preceding discussion and Figure S6).

Figure 7

Results for reinforcement learning of exact spike times through reward-modulated STDP.

(A) Synaptic weight changes of the trained LIF neuron, for 5 different runs of the experiment. The curves show the average of the synaptic weights that should converge to (dashed lines), and the average of the synaptic weights that should converge to (solid lines) with different colors for each simulation run. (B) Comparison of the output of the trained neuron before (top trace) and after learning (bottom trace). The same input spike trains and the same noise inputs were used before and after training for 2 hours. The second trace from above shows those spike times S* which are rewarded, the third trace shows the realizable part of S* (i.e. those spikes which the trained neuron could potentially learn to reproduce, since the neuron μ* produces them without its 10 extra spike inputs). The close match between the third and fourth trace shows that the trained neuron performs very well. (C) Evolution of the spike correlation between the spike train of the trained neuron and the realizable part of the target spike train S*. (D) The angle between the weight vector w of the trained neuron and the weight vector w* of the neuron μ* during the simulation, in radians. (E) Synaptic weights at the beginning of the simulation are marked with ×, and at the end of the simulation with •, for each plastic synapse of the trained neuron. (F) Evolution of the synaptic weights w/w during the simulation (we had chosen for i<50, for i≥50).
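The angle reported in panel (D) is the standard angle between two vectors. A minimal sketch of how it can be computed is given below; the function name `weight_angle` is hypothetical and the code is not taken from the original simulations.

```python
import numpy as np

def weight_angle(w, w_star):
    """Angle in radians between weight vectors w and w_star."""
    w = np.asarray(w, dtype=float)
    w_star = np.asarray(w_star, dtype=float)
    cos = np.dot(w, w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

The angle is zero when the trained weights are a positive multiple of the target weights, so it measures agreement in the weight distribution independently of overall scale.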


Computer Simulation 3: Testing the Analytically Derived Conditions

Equations 13–15 predict for which relationships between the parameters involved the learning of particular spike responses through reward-modulated STDP will be successful. We tested these predictions by selecting 6 arbitrary settings of these parameters, which are listed in Table 1. In 4 cases (marked by light gray shading in Figure 8) these conditions were not met, either for the learning of weights with target value w_max or for the learning of weights with target value 0. Figure 8 shows that the derived learning result is not achieved in exactly these 4 cases. On the other hand, the theoretically predicted weight changes (black bars) match the actual weight changes (gray bars) that occur for the chosen simulation times (listed in the last column of Table 1) remarkably well in all cases.

Figure 8

Test of the validity of the analytically derived conditions 13–15 on the relationship between parameters for successful learning with reward-modulated STDP.

Predicted average weight changes (black bars) calculated from Equation 22 match in sign and magnitude the actual average weight changes (gray bars) in computer simulations, for 6 different experiments with different parameter settings (see Table 1). (A) Weight changes for synapses with . (B) Weight changes for synapses with . Four cases where constraints 13–15 are not fulfilled are shaded in light gray. In all of these four cases the weights move into the opposite direction, i.e., a direction that decreases rewards.


Pattern Discrimination with Reward-Modulated STDP

We examine here the question whether a neuron can learn through reward-modulated STDP to discriminate between two spike patterns P and N of its presynaptic neurons, by responding with more spikes to pattern P than to pattern N. Our analysis is based on the assumption that there exist internal rewards d(t) that could guide such pattern discrimination. This reward based learning architecture is biologically more plausible than an architecture with a supervisor which provides for each input pattern a target output and thereby directly produces the desired firing behavior of the neuron (since the question becomes then how the supervisor has learnt to produce the desired spike outputs). We consider a neuron that receives input from n presynaptic neurons. A pattern X consists of n spike trains, each of time length T, one for each presynaptic neuron. There are two patterns, P and N, which are presented in alternation to the neuron, with some reset time between presentations. For notational simplicity, we assume that each of the n presynaptic spike trains consists of exactly one spike. Hence, each pattern can be defined by a list of spike times: , , where is the time when presynaptic neuron i spikes for pattern X∈{P,N}. A generalization to the easier case of learning to discriminate spatio-temporal presynaptic firing patterns (where some presynaptic neurons produce different numbers of spikes in different patterns) is straightforward, however the main characteristics of the learning dynamics are better accessible in this conceptually simpler setup. It had already been shown in [12] that neurons can learn through reward-modulated STDP to discriminate between different spatial presynaptic firing patterns. But in the light of the analysis of [27] it is still open whether neurons can learn with simple forms of reward-modulated STDP, such as the one considered in this article, to discriminate temporal presynaptic firing patterns. 
We assume that the reward signal d(t) rewards—after some delay d—action potentials of the trained neuron if pattern P was presented, and punishes action potentials of the neuron if pattern N was presented. More precisely, we assume thatwith some reward kernel ε and constants α<0<α. The goal of this learning task is to produce many output spikes for pattern P, and few or no spikes for pattern N. The main result of our analysis is an estimate of the expected weight change of synapse i of the trained neuron for the presentation of pattern P, followed after a sufficiently long time T′ by a presentation of pattern N where 〈·〉 | is the expectation over the ensemble given that pattern X was presented. This weight change can be estimated as (see Methods)where ν(t) is the postsynaptic rate at time t for pattern X, and the constants for X∈{P,N} are given byAs we will see shortly, an interesting learning effect is achieved if is positive and is negative. Since f(r) is non-negative, a natural way to achieve this is to choose a positive reward kernel ε(r)≥0 for r>0 and ε(r) = 0 for r<0 (also, f(r) and ε(r) must not be identical to zero for all r). We use Equation 17 to provide insight on when and how the classification of temporal spike patterns can be learnt with reward-modulated STDP. Assume for the moment that . We first note that it is impossible to achieve through any synaptic plasticity rule that the time integral over the membrane potential of the trained neuron has after training a larger value for input pattern P than for input pattern N. The reason is that each presynaptic neuron emits the same number of spikes in both patterns (namely one spike). This simple fact implies that it is impossible to train a linear Poisson neuron (with any learning method) to respond to pattern P with more spikes than to pattern N. But Equation 17 implies that reward-modulated STDP increases the variance of the membrane potential for pattern P, and reduces the variance for pattern N. 
This can be seen as follows. Because of the specific form of the STDP learning curve W(r), which is positive for (small) positive r, negative for (small) negative r, and zero for large r, has a potentiating effect on synapse i if the postsynaptic rate for pattern P is larger (because of a higher membrane potential) shortly after the presynaptic spike at this synapse i than before that spike. This tends to further increase the membrane potential after that spike. On the other hand, since is negative, the same situation for pattern N has a depressing effect on synapse i, which counteracts the increased membrane potential after the presynaptic spike. Dually, if the postsynaptic rate shortly after the presynaptic spike at synapse i is lower than shortly before that spike, the effect on synapse i is depressing for pattern P. This leads to a further decrease of the membrane potential after that spike. In the same situation for pattern N, the effect is potentiating, again counteracting the variation of the membrane potential. The total effect on the postsynaptic membrane potential is that the fluctuations for pattern P are increased, while the membrane potential for pattern N is flattened. For the LIF neuron model, and most reasonable other non-linear spiking neuron models, as well as for biological neurons in-vivo and in-vitro [28]–[30], larger fluctuations of the membrane potential lead to more action potentials. As a result, reward-modulated STDP tends to increase the number of spikes for pattern P for these neuron models, while it tends to decrease the number of spikes for pattern N, thereby enabling a discrimination of these purely temporal presynaptic spike patterns.
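The distinction between the time integral and the variance of the membrane potential can be illustrated numerically. The sketch below is not taken from the simulations in this article. It assumes a linear superposition of PSPs with constant weights, an exponential PSP kernel with unit integral, and two hypothetical patterns (`spread` and `clustered`) in which each of 50 presynaptic neurons fires exactly once.

```python
import numpy as np

def psp(s, tau=10.0):
    """Assumed exponential PSP kernel with unit integral; s and tau in ms."""
    return np.where(s >= 0, np.exp(-np.maximum(s, 0.0) / tau) / tau, 0.0)

def membrane_potential(spike_times, weights, t):
    """Linear superposition of PSPs: sum over i of w_i * psp(t - t_i)."""
    v = np.zeros_like(t)
    for w, ts in zip(weights, spike_times):
        v += w * psp(t - ts)
    return v

t = np.arange(0.0, 1000.0, 1.0)             # time grid in ms, 1 ms steps
weights = np.full(50, 0.02)                 # same weights for both patterns
spread = np.arange(0.0, 500.0, 10.0)        # one spike per neuron, evenly spaced
clustered = np.repeat([100.0, 300.0], 25)   # one spike per neuron, two volleys

v_spread = membrane_potential(spread, weights, t)
v_clustered = membrane_potential(clustered, weights, t)

# Time integrals (Riemann sums with dt = 1 ms) and variances over the trial.
integral_spread = v_spread.sum()
integral_clustered = v_clustered.sum()
var_spread = v_spread.var()
var_clustered = v_clustered.var()
```

Because every neuron contributes one PSP in each pattern, the two integrals coincide, whereas the variance is larger for the pattern whose spikes arrive in volleys. Only the variance can therefore be shaped by changing how the weights relate to the spike times of a pattern.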

Computer Simulation 4: Learning Pattern Classification

We tested these theoretical predictions through computer simulations of a LIF neuron with conductance-based synapses exhibiting short-term depression and facilitation. Both patterns, P and N, had 200 input channels, with 1 spike per channel (hence this is the extreme case where all information lies in the timing of presynaptic spikes). The spike times were drawn from a uniform distribution over a time interval of 500 ms, which was the duration of the patterns. We performed 1000 training trials in which the patterns P and N were presented to the neuron in alternation. To introduce exploration for this reinforcement learning task, 20% of the Ornstein-Uhlenbeck process conductance noise was injected into the neuron (see Methods for further details). The theoretical analysis predicted that after learning the membrane potential will have a higher variance for pattern P and a lower variance for pattern N. When the firing of the LIF neuron was switched off in our simulation (by setting the firing threshold potential too high to be reached), we could observe the membrane potential fluctuations undisturbed by the reset mechanism after each spike (see Figure 9C and 9D). The variance of the membrane potential did in fact increase for pattern P, from 2.49 (mV)² to 5.43 (mV)² (Figure 9C), and decrease for pattern N, from 2.34 (mV)² to 1.33 (mV)² (Figure 9D). The corresponding plots with the firing threshold included are given in panels E and F, showing an increased number of spikes of the LIF neuron for pattern P and a decreased number of spikes for pattern N. Furthermore, as Figure 9A and 9B show, the increased variance of the membrane potential for the positively reinforced pattern P led to a stable temporal firing pattern in response to pattern P.

Figure 9

Training a LIF neuron to classify purely temporal presynaptic firing patterns: a positive reward is given for firing of the neuron in response to a temporal presynaptic firing pattern P, and a negative reward for firing in response to another temporal pattern N.

(A) The spike response of the neuron for individual trials, during 500 training trials when pattern P is presented. Only the spikes from every 4-th trial are plotted. (B) As in (A), but in response to pattern N. (C) The membrane potential V(t) of the neuron during a trial where pattern P is presented, before (blue curve) and after training (red curve), with the firing threshold removed. The variance of the membrane potential increases during learning, as predicted by the theory. (D) As in (C), but for pattern N. The variance of the membrane potential for pattern N decreases during learning, as predicted by the theory. (E) The membrane potential V(t) of the neuron (including action potentials) during a trial where pattern P is presented before (blue curve) and after training (red curve). The number of spikes increases. (F) As in (E), but for trials where pattern N is given as input. The number of spikes decreases. (G) Average number of output spikes per trial before learning, in response to pattern P (gray bars) and pattern N (black bars), for 6 experiments with different randomly generated patterns P and N, and different random initial synaptic weights of the neuron. (H) As in (G), for the same experiments, but after learning. The average number of spikes per trial increases after training for pattern P, and decreases for pattern N.


Computer Simulation 5: Training a Readout Neuron with Reward-Modulated STDP To Recognize Isolated Spoken Digits

A longstanding open problem is how a biologically realistic neuron model can be trained in a biologically plausible manner to extract information from a generic cortical microcircuit. Previous work [31]–[35] has shown that quite a bit of salient information about recent and past inputs to the microcircuit can be extracted by a non-spiking linear readout neuron (i.e., a perceptron) that is trained by linear regression or margin maximization methods. Here we examine to what extent a LIF readout neuron with conductance based synapses (subject to biologically realistic short term synaptic plasticity) can learn through reward-modulated STDP to extract from the response of a simulated cortical microcircuit (consisting of 540 LIF neurons), see Figure 10A, the information which spoken digit (transformed into spike trains by a standard cochlea model) is injected into the circuit. In comparison with the preceding task in simulation 4, this task is easier because the presynaptic firing patterns that need to be discriminated differ in temporal and spatial aspects (see Figure 10B; Figure S10 and S11 show the spike trains that were injected into the circuit). But this task is on the other hand more difficult, because the circuit response (which creates the presynaptic firing pattern for the readout neuron) differs also significantly for two utterances of the same digit (Figure 10C), and even for two trials for the same utterance (Figure 10D) because of the intrinsic noise in the circuit (which was modeled according to [26] to reflect in-vivo conditions during cortical UP-states). The results shown in Figure 10E–H demonstrate that nevertheless this learning experiment was successful. On the other hand we were not able to achieve in this way speaker-independent word recognition, which had been achieved in [31] with a linear readout. 
Hence further work will be needed in order to clarify whether biologically more realistic models for readout neurons can be trained through reinforcement learning to reach the classification capabilities of perceptrons that are trained through supervised learning.

Figure 10

A LIF neuron is trained through reward-modulated STDP to discriminate as a “readout neuron” responses of generic cortical microcircuits to utterances of different spoken digits.

(A) Circuit response to an utterance of digit “one” (spike trains of 200 out of 540 neurons in the circuit are shown). The response within the time period from 100 to 200 ms (marked in gray) is used as a reference in the subsequent 3 panels. (B) The circuit response from (A) (black) for the period between 100 and 200 ms, and the circuit response to an utterance of digit “two” (red). (C) The circuit spike response from (A) (black) and a circuit response for another utterance of digit “one” (red), also shown for the period between 100 and 200 ms. (D) The circuit spike response from (A) (black), and another circuit response to the same utterance in another trial (red). The responses differ due to the presence of noise in the circuit. (E) Spike response of the LIF readout neuron for different trials during learning, for trials where utterances of digit “two” (left plot) and digit “one” (right plot) are presented as circuit inputs. The spikes from each 4th trial are plotted. (F) Average number of spikes in the response of the readout during training, in response to digit “one” (blue) and digit “two” (green). The number of spikes were averaged over 40 trials. (G) The membrane potential V(t) of the neuron during a trial where an input pattern corresponding to an utterance of digit “two” is presented, before (blue curve) and after training (red curve), with the firing threshold removed. (H) As in (G), but for an input pattern corresponding to an utterance of digit “one”. The variance of the membrane potential increases during learning for utterances of the rewarded digit, and decreases for the non-rewarded digit.


Methods

We first describe the simple neuron model that we used for the theoretical analysis, and then provide derivations of the equations that were discussed in the preceding section. After that we describe the models for neurons, synapses, and synaptic background activity (“noise”) that we used in the computer simulations. Finally we provide technical details for each of the 5 computer simulations that we discussed in the preceding section.

Linear Poisson Neuron Model

In our theoretical analysis, we use a linear Poisson neuron model whose output spike train is a realization of a Poisson process with the underlying instantaneous firing rate R(t). The effect of a spike of presynaptic neuron i at time t′ on the membrane potential of neuron j is modeled by an increase in the instantaneous firing rate by an amount w_ji(t′)ε(t−t′), where ε is a response kernel which models the time course of a postsynaptic potential (PSP) elicited by an input spike. Since STDP according to [12] has been experimentally confirmed only for excitatory synapses, we consider plasticity only for excitatory connections and assume that w_ji≥0 for all i and ε(s)≥0 for all s. Because the synaptic response is scaled by the synaptic weights, we can assume without loss of generality that the response kernel is normalized to ∫_0^∞ ε(s) ds = 1. In this linear model, the contributions of all inputs are summed up linearly: R(t) = Σ_{i=1}^{n} ∫_0^∞ w_ji(t−s) ε(s) S_i(t−s) ds, where S_1,…,S_n are the n presynaptic spike trains. Since the instantaneous firing rate R(t) is analogous to the membrane potential of other neuron models, we occasionally refer to R(t) as the “membrane potential” of the neuron.
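A discrete-time sketch of such a linear Poisson neuron is given below. It is meant only to make the model concrete and is not the code used for the analysis. It assumes constant weights (the model above allows time-dependent weights), an exponential PSP kernel with unit integral, and a time step small enough that the spike probability per bin, rate · dt, stays below 1. The names `psp_kernel` and `simulate_linear_poisson` are hypothetical.

```python
import numpy as np

def psp_kernel(s, tau=10.0):
    """Assumed exponential PSP kernel with unit integral; s >= 0 and tau in ms."""
    return np.exp(-s / tau) / tau

def simulate_linear_poisson(input_spikes, weights, dt=1.0, rng=None):
    """Linear Poisson neuron in discrete time.

    input_spikes : array (n_inputs, n_steps) of spike counts per bin
    weights      : array (n_inputs,) of non-negative, constant weights
    dt           : bin width in ms
    Returns (rate, out): the instantaneous rate R(t) in spikes per ms
    and a boolean output spike train.
    """
    rng = np.random.default_rng() if rng is None else rng
    input_spikes = np.asarray(input_spikes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n_steps = input_spikes.shape[1]
    kernel = psp_kernel(np.arange(n_steps) * dt)
    drive = weights @ input_spikes                  # weighted input spikes per bin
    rate = np.convolve(drive, kernel)[:n_steps]     # each spike adds w_i * eps(t - t')
    out = rng.random(n_steps) < rate * dt           # Poisson spiking, one draw per bin
    return rate, out
```

The rate is a purely linear function of the weighted inputs, which is what makes the expected weight changes analytically tractable. The output spikes do not feed back into the rate, since the model has no reset or refractoriness.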

Learning Equations

In the following, we denote by the ensemble average of a random variable x given that neuron k spikes at time t and neuron i spikes at time t′. We will also sometimes indicate the variables Y 1,Y 2,… over which the average of x is taken by writing .

Derivation of Equation 6

Using Equations 5, 1, and 4, we obtain the expected weight change between time t and t+T with D(t,s,r) = 〈d(t)|Neuron j spikes at t−s, and neuron i spikes at t−s−r〉, and the joint firing rate ν(t,r) = 〈S(t)S(t−r)〉 describes correlations between spike timings of neurons j and i. The joint firing rate ν(t−s,r) depends on the weight at time t−s. If the learning rate defined by the magnitude of W(r) is small, the synaptic weights can be assumed constant on the time scale of T. Thus, the time scales of neuronal dynamics are separated from the slow time scale of learning. For slow learning, synaptic weights integrate a large number of small changes. We can then expect that averaged quantities enter the learning dynamics. In this case, we can argue that fluctuations of a weight w about its mean are negligible and it can well be approximated by its average 〈w〉 (it is “self-averaging”, see [21],[36]). To ensure that average quantities enter the learning dynamics, many presynaptic and postsynaptic spikes as well as many independently delivered rewards at varying delays have to occur within T. Hence, in general, the time scale of single spike occurrences and the time scale of the eligibility trace is required to be much smaller than the time scale of learning. If time scales can be separated, we can drop the expectation on the left hand side of the last equation and writeWe thus obtain Equation 6:

Simplification of Equation 6

In order to simplify this equation, we first observe that W(r) is vanishing for large |r|. Hence we can approximate the integral over the learning window by a bounded integral for some T>0 and T≪T. In the analyzes of this article, we consider the case where reward is delivered with a relatively large temporal delay. To be more precise, we assume that a pre-post spike pair has an effect on the reward signal only after some minimal delay d and that we can write for some baseline reward d 0 and a part which depends on the timing of pre-post spike pairs with for sT. We can then approximate the second term of Equation 6:because 〈ν(t−s−r,r)〉≈〈ν(t−s,r)〉 for r∈[−T,T] and T≪T. Since for s≤T, the second term in the brackets is equivalent to which in turn is approximately given by if we assume that f(s+r)≈f(s) for s≥d and |r|single integral to obtain Equation 8.

Derivations for the Biofeedback Experiment

We assume that a reward with the functional form ε_r is delivered for each postsynaptic spike of the reinforced neuron k with a delay d_r. The reward at time t is therefore

$$d(t) = \int_0^\infty S_k(t - d_r - r')\,\varepsilon_r(r')\,dr'$$

Weight change for the reinforced neuron (derivation of Equation 10)

Consider the reward correlation for a synapse ki afferent to the reinforced neuron k. If we assume that the output firing rate is constant on the time scale of the reward function, its first term vanishes. Rewriting the remainder gives the mean weight change for weights to the reinforced neuron (Equation 20), which contains two terms in brackets. We show that the second term in the brackets is very small compared to the first term. The approximation used for this is based on the assumption that f(s)≈f(s−r′) and 〈ν(t−r′,r)〉≈〈ν(t,r)〉 for r′∈[−T_W−T_ε,T_W]. Here, T_W is the time scale of the learning window (see above), and T_ε is the time scale of the PSP, i.e., we have ε(s)≈0 for s≥T_ε. The resulting expression is the first term in the brackets of Equation 20 scaled by w. For neurons with many input synapses we have w≪1. Thus the second term in the brackets of Equation 20 is small compared to the first term, and we obtain Equation 10.

Weight change for non-reinforced neurons (derivation of Equation 11)

The reward correlation of a synapse ji to a non-reinforced neuron j is derived in the same way. In analogy to the previous derivation, we assume here that the firing rate ν(t−s) in the denominator results from many PSPs. Hence, the contribution of a single PSP is small compared to ν(t−s). Similarly, we assume that with weights w_ji, w_ki≪1, the second term in the numerator is small compared to the joint firing rate ν(t−d−r′,s−d−r′). We approximate the reward correlation accordingly. Hence, the reward correlation of a non-reinforced neuron depends on the correlation of this neuron with the reinforced neuron. The mean weight change for a non-reinforced neuron j≠k follows from this approximation (Equation 11). This equation deserves a remark for the case that ν(t−s) is zero, since it appears in the denominator of the fraction. Note that in this case, both ν(t−d−r′,s−d−r′) and ν(t−s,r) are zero. In fact, if we take the limit ν(t−s)→0, then both of these factors approach zero at least as fast. Hence, in the limit of ν(t−s)→0, the term in the angular brackets evaluates to zero. This reflects the fact that since STDP is driven by pre- and postsynaptic spikes, there is no weight change if no postsynaptic spikes occur.

For uncorrelated neurons, Equation 11 evaluates to zero

For uncorrelated neurons k, j, the joint firing rate ν_kj(t−d−r′,s−d−r′) can be factorized into ν_k(t−d−r′)ν_j(t−s). The resulting expression evaluates approximately to zero if the mean output rate of neuron k is constant on the time scale of the reward kernel.
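The role of the constancy assumption can be made explicit with a short worked step. This is only a sketch: it assumes the zero-integral reward kernel specified in the details to computer simulation 1, and the symbols ε_r (reward kernel), d_r (reward delay), and ν_k (rate of the reinforced neuron) are notation chosen here for illustration. If ν_k varies slowly over the support of ε_r, it can be pulled out of the integral over the kernel:

```latex
\int_0^\infty \varepsilon_r(r')\,\nu_k(t - d_r - r')\,dr'
  \;\approx\; \nu_k(t - d_r)\int_0^\infty \varepsilon_r(r')\,dr'
  \;=\; 0 .
```

Under these assumptions the factorized term contributes no systematic weight change, which is the statement made above for uncorrelated neurons.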

Analysis of Spike-Timing-Dependent Rewards (Derivation of Conditions 13–15)

Below, we indicate the variables Y_1, Y_2, … over which the average of x is taken by writing them as a subscript to the angular brackets, 〈x〉_{Y_1,Y_2,…}. From Equation 12, we can determine the reward correlation for synapse i (Equation 21), which involves the instantaneous firing rate ν_j(t) = 〈S_j(t)〉 of the trained neuron at time t and the instantaneous rate ν*(t) = 〈S*(t)〉 of the target spike train at time t. Since weights are changing very slowly, we have w(t−s−r)≈w(t). In the following, we drop the dependence of w on t for brevity. For simplicity, we assume that input rates are stationary and uncorrelated. In this case (since the weights are changing slowly), the correlations between inputs and outputs can also be assumed stationary, ν(t,r) = ν(r). With constant input rates, Equation 21 can be rewritten accordingly, and we use this result to obtain the temporally smoothed weight change for synapse ji. We assume that the eligibility function satisfies f(d)≈f(d+r) if |r| is on the time scale of a PSP, the learning window, or the reward kernel, and that d is large compared to these time scales. The resulting expression contains the convolution of the reward kernel with the PSP. With these simplifications, we obtain the weight change at synapse ji. For uncorrelated Poisson input spike trains and the linear Poisson neuron model, the input-output correlations can be written in closed form, and inserting them gives the weight change at synapse ji (Equation 22). We will now bound the expected weight change for synapses ji with w_i* = w_max and for synapses jk with w_k* = 0. In this way we can derive conditions under which the expected weight change for the former synapses is positive and that for the latter is negative. First, we assume that the integral over the reward kernel is positive. In this case, the weight change given by Equation 22 is negative for synapses i with w_i* = 0 if and only if a corresponding inequality is fulfilled. In the worst case, the weight is at w_max and the output rate is small. We have to guarantee some minimal output rate such that even if w = w_max, this inequality is fulfilled. This could be guaranteed by some noise current. Given such a minimal output rate, we can state the first inequality, which guarantees convergence of weights with w_i* = 0. For synapses ji with w_i* = w_max, we obtain two more conditions. In the approximate weight change for these synapses, the last term is positive and small, so we can ignore it in our sufficient condition. The second-to-last term is negative. We include in our condition that the third-to-last term compensates for this negative term. This gives the second condition, which should be satisfied in most setups. If we assume that it holds, the remaining expression has to be positive, which gives the third inequality. All three inequalities are summarized in Conditions 13–15, which involve the maximal output rate. If these inequalities are fulfilled and input rates are positive, then the weight vector converges on average from any initial weight vector to w*. The second condition is less severe and should be easily fulfilled in most setups. If this is the case, the first Condition 13 ensures that weights with w* = 0 are depressed, while the third Condition 15 ensures that weights with w* = w_max are potentiated.

Analysis of the Pattern Discrimination Task (Derivation of Equation 17)

We assume that a trial consists of the presentation of a single pattern starting at time t = 0. With the help of Equations 1, 3, and 4, we compute the weight change for a single trial given that pattern X∈{P,N} was presented, and from it the average weight change given that pattern X was presented. If we assume that f is approximately constant on the time scale of the learning window W, this expression simplifies. For the linear Poisson neuron, the auto-correlation function can be written in terms of ν^X(t) = 〈S(t) | X〉, the ensemble average rate at time t given that pattern X was presented. If an experiment for a single pattern runs over the time interval [0,T′], we can compute the total average weight change of a trial given that pattern X was presented (Equation 23). We assume that eligibility traces and reward signals have settled to zero before a new pattern is presented. The expected weight change for the successive presentation of both patterns then follows from the expected weight changes for the individual patterns (Equation 17). The equations can easily be generalized to the case where multiple input spikes per synapse are allowed and where jitter on the templates is allowed. However, the main effect of the rule can be read off the equations given here.

Common Models and Parameters of the Computer Simulations

We describe here the models and parameter values that were used in all our computer simulations. Parameters that had to be chosen differently for individual computer simulations, depending on their setups and requirements, are specified in a subsequent section.

LIF Neuron Model

For the computer simulations, LIF neurons with conductance-based synapses were used. The membrane potential V_m(t) of this neuron model is given by:

$$C_m \frac{dV_m}{dt} = -\frac{V_m - V_{resting}}{R_m} - \sum_{j=1}^{K_e} g_{e,j}(t)\,(V_m - E_e) - \sum_{j=1}^{K_i} g_{i,j}(t)\,(V_m - E_i) - I_{noise}(t)$$

where C_m is the membrane capacitance, R_m is the membrane resistance, V_resting is the resting potential, and g_{e,j}(t) and g_{i,j}(t) are the K_e and K_i synaptic conductances from the excitatory and inhibitory synapses, respectively. The constants E_e and E_i are the reversal potentials of excitatory and inhibitory synapses. I_noise represents the synaptic background current which the neuron receives (see below for details). Whenever the membrane potential reaches a threshold value V_thresh, the neuron produces a spike, and its membrane potential is reset to the value of the reset potential V_reset. After a spike, there is a refractory period of length T_refract, during which the membrane potential of the neuron remains equal to V_m(t) = V_reset. After the refractory period, V_m(t) continues to change according to Equation 24. For a given synapse, the dynamics of the synaptic conductance g(t) is defined by

$$\frac{dg(t)}{dt} = -\frac{g(t)}{\tau_{syn}} + \sum_k A(t^{(k)} + t_{delay})\,\delta\big(t - (t^{(k)} + t_{delay})\big)$$

where A(t) is the amplitude of the postsynaptic response (PSR) to a single presynaptic spike, which varies over time due to the inherent short-term dynamics of the synapse, and {t^{(k)}} are the spike times of the presynaptic neuron. The conductance of the synapse decreases exponentially with time constant τ_syn and increases instantaneously by the amount A(t) whenever a presynaptic spike arrives. In all computer simulations we used the following values for the neuron and synapse parameters. The membrane resistance of the neurons was R_m = 100 MΩ, the membrane capacitance was C_m = 0.3 nF, the resting potential, reset potential, and initial value of the membrane potential had the same value V_resting = V_reset = V_m(0) = −70 mV, the threshold potential was V_thresh = −59 mV, and the refractory period was T_refract = 5 ms. For the synapses we used a time constant of τ_syn = 5 ms and reversal potentials of E_e = 0 mV for the excitatory synapses and E_i = −75 mV for the inhibitory synapses. All synapses had a synaptic delay of t_delay = 1 ms.
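The neuron model described above can be illustrated with a minimal forward-Euler sketch. It is not the PCSIM implementation used for the simulations: the function names, the event format, and the choice of Euler integration are assumptions made for illustration, the background noise current is omitted, and spike arrival times are assumed to already include the synaptic delay. Only the parameter values are taken from the text.

```python
import math

# Values quoted in the text, converted to SI units.
R_M = 100e6          # membrane resistance [Ohm]
C_M = 0.3e-9         # membrane capacitance [F]
V_REST = -70e-3      # resting, reset and initial potential [V]
V_THRESH = -59e-3    # firing threshold [V]
T_REFRACT = 5e-3     # refractory period [s]
TAU_SYN = 5e-3       # synaptic time constant [s]
E_EXC = 0.0          # excitatory reversal potential [V]
E_INH = -75e-3       # inhibitory reversal potential [V]
DT = 0.1e-3          # simulation time step [s]


def _bin_events(events, n_steps, dt):
    """Sum conductance jumps per time step; events are (arrival_time, amplitude)."""
    jumps = [0.0] * n_steps
    for arrival_time, amplitude in events:
        step = int(round(arrival_time / dt))
        if 0 <= step < n_steps:
            jumps[step] += amplitude
    return jumps


def simulate_lif(duration, exc_events=(), inh_events=(), dt=DT):
    """Hypothetical helper: Euler simulation of one LIF neuron without noise.

    Returns (spike_times, voltage_trace). Amplitudes are in siemens.
    """
    n_steps = int(round(duration / dt))
    exc_jumps = _bin_events(exc_events, n_steps, dt)
    inh_jumps = _bin_events(inh_events, n_steps, dt)
    decay = math.exp(-dt / TAU_SYN)           # exponential conductance decay
    refract_steps = int(round(T_REFRACT / dt))

    v, g_exc, g_inh = V_REST, 0.0, 0.0
    refract_left = 0
    spikes, trace = [], []
    for step in range(n_steps):
        g_exc += exc_jumps[step]              # instantaneous jump on spike arrival
        g_inh += inh_jumps[step]
        if refract_left > 0:
            v = V_REST                        # clamp to reset value while refractory
            refract_left -= 1
        else:
            current = (-(v - V_REST) / R_M
                       - g_exc * (v - E_EXC)
                       - g_inh * (v - E_INH))
            v += dt * current / C_M
            if v >= V_THRESH:
                spikes.append((step + 1) * dt)
                v = V_REST
                refract_left = refract_steps
        g_exc *= decay
        g_inh *= decay
        trace.append(v)
    return spikes, trace
```

For example, `simulate_lif(0.2, exc_events=[(k * 1e-3, 20e-9) for k in range(200)])` drives the neuron with a regular 1 kHz excitatory input of 20 nS per event. The amplitude is an arbitrary illustrative value, not one used in the article.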

Short-Term Dynamics of Synapses

We modeled the short-term dynamics of synapses according to the phenomenological model proposed in [37], where the amplitude A_k = A(t_k + t_delay) of the postsynaptic response for the kth spike in a spike train with inter-spike intervals Δ_1, Δ_2, …, Δ_{k−1} is calculated with the following equations

$$A_k = w \cdot u_k \cdot R_k$$
$$u_k = U + u_{k-1}\,(1 - U)\,\exp(-\Delta_{k-1}/F)$$
$$R_k = 1 + (R_{k-1} - u_{k-1} R_{k-1} - 1)\,\exp(-\Delta_{k-1}/D)$$

with hidden dynamic variables u∈[0,1] and R∈[0,1] whose initial values for the first spike are u_1 = U and R_1 = 1 (see [38] for a justification of this version of the equations, which corrects a small error in [37]). The variable w is the synaptic weight, which scales the amplitudes of postsynaptic responses. If long-term plasticity is introduced, this variable is a function of time. In the simulations, the values of the U, D, and F parameters for the neurons in the circuits were drawn from Gaussian distributions whose mean values depended on whether the presynaptic and postsynaptic neurons of the synapse were excitatory or inhibitory, and were chosen according to the data reported in [37] and [39]. The mean values of the Gaussian distributions are given in Table 2, and the standard deviation was chosen to be 50% of the mean. Negative values were replaced with values drawn from a uniform distribution with a range between 0 and twice the mean value. For the simulations involving individual trained neurons, the U, D, and F parameters of these neurons were set to the values from Table 2.
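The recursion for the hidden variables u and R and the parameter sampling described above can be sketched as follows. The explicit form of the recursion is the commonly used formulation of this model and is an assumption here, as are the function names. Time is in seconds, and the example values U = 0.5, D = 1.1, F = 0.02 are the first entry of Table 2.

```python
import math
import random


def psr_amplitudes(intervals, w, U, D, F):
    """Amplitudes A_1..A_k for a spike train with the given inter-spike intervals.

    Assumed recursion: A_k = w * u_k * R_k with u_1 = U and R_1 = 1.
    """
    u, r = U, 1.0
    amplitudes = [w * u * r]
    for delta in intervals:
        # Both updates use the values of the previous spike.
        u_next = U + u * (1.0 - U) * math.exp(-delta / F)
        r_next = 1.0 + (r - u * r - 1.0) * math.exp(-delta / D)
        u, r = u_next, r_next
        amplitudes.append(w * u * r)
    return amplitudes


def sample_parameter(mean, rng):
    """Draw a parameter from a Gaussian with standard deviation 50% of the mean.

    Negative draws are replaced by a uniform draw from [0, 2 * mean].
    """
    value = rng.gauss(mean, 0.5 * mean)
    if value < 0.0:
        value = rng.uniform(0.0, 2.0 * mean)
    return value
```

With the example parameters, a regular train with 50 ms intervals produces a monotonically decreasing amplitude sequence (a depressing synapse), whereas after a very long pause the amplitude recovers to w·U.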

Table 2

Mean values of the U, D, and F parameters in the model from [37] for the short-term dynamics of synapses, depending on the type of the presynaptic and postsynaptic neuron (excitatory or inhibitory).

Source/Dest.	Exc. (U, D, F)	Inh. (U, D, F)
Exc.	0.5, 1.1, 0.02	0.25, 0.7, 0.02
Inh.	0.05, 0.125, 1.2	0.32, 0.144, 0.06

These mean values, based on experimental data from [37],[39], were used in all computer simulations.

We have carried out control experiments with current-based synapses that were not subject to short-term plasticity (see Figure S5, Figure S8, and Figure S9; successful control experiments with static current-based synapses were also carried out for computer simulation 1, results not shown). We found that the results of all our computer simulations also hold for static current-based synapses.

Model of Background Synaptic Activity

To reproduce the background synaptic input that cortical neurons receive in vivo, the neurons in our models received an additional noise process as conductance input. The noise process we used is the point-conductance approximation model described in [26]. According to [26], this noise process models the effect of a bombardment by a large number of synaptic inputs in vivo, which causes a membrane potential depolarization referred to as the “high conductance” state. Furthermore, it was shown that it captures the spectral and amplitude characteristics of the input conductances of a detailed biophysical model of a neocortical pyramidal cell that was matched to intracellular recordings in cat parietal cortex in vivo. The ratio of average contributions of excitatory and inhibitory background conductances was chosen to be 5, in accordance with experimental studies during sensory responses (see [40]–[42]). In this model, the noisy synaptic current I_noise in Equation 24 is a sum of two currents:

$$I_{noise}(t) = g_e(t)\,(V_m(t) - E_e) + g_i(t)\,(V_m(t) - E_i)$$

where g_e(t) and g_i(t) are time-dependent excitatory and inhibitory conductances. The values of the respective reversal potentials were E_e = 0 mV and E_i = −75 mV. The conductances g_e(t) and g_i(t) were modeled according to [26] as one-variable stochastic processes similar to an Ornstein-Uhlenbeck process:

$$\frac{dg_e(t)}{dt} = -\frac{1}{\tau_e}\,[g_e(t) - g_{e0}] + \sqrt{D_e}\,\chi_1(t)$$
$$\frac{dg_i(t)}{dt} = -\frac{1}{\tau_i}\,[g_i(t) - g_{i0}] + \sqrt{D_i}\,\chi_2(t)$$

with mean g_e0 = 0.012 µS, noise-diffusion constant D_e = 0.003 µS, and time constant τ_e = 2.7 ms for the excitatory conductance, and mean g_i0 = 0.057 µS, noise-diffusion constant D_i = 0.0066 µS, and time constant τ_i = 10.5 ms for the inhibitory conductance. χ_1(t) and χ_2(t) are Gaussian white noise of zero mean and unit standard deviation. Since these processes are Gaussian stochastic processes, they can be numerically integrated by an exact update rule:

$$g_e(t + \Delta t) = g_{e0} + [g_e(t) - g_{e0}]\,\exp(-\Delta t/\tau_e) + A_e N_1(0,1)$$
$$g_i(t + \Delta t) = g_{i0} + [g_i(t) - g_{i0}]\,\exp(-\Delta t/\tau_i) + A_i N_2(0,1)$$

where N_1(0,1) and N_2(0,1) are normal random numbers (zero mean, unit standard deviation) and A_e, A_i are amplitude coefficients given by:

$$A_e = \sqrt{\frac{D_e \tau_e}{2}\left[1 - \exp\!\left(-\frac{2\Delta t}{\tau_e}\right)\right]}, \qquad A_i = \sqrt{\frac{D_i \tau_i}{2}\left[1 - \exp\!\left(-\frac{2\Delta t}{\tau_i}\right)\right]}$$
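The exact update rule for such an Ornstein-Uhlenbeck-like conductance can be sketched in a few lines. The sketch is parameterized by the stationary standard deviation sigma of the process, which for an Ornstein-Uhlenbeck process with diffusion constant D and time constant τ satisfies sigma² = Dτ/2. Treating the quoted value of 0.003 µS as sigma in the usage example is an assumption made here because of its units, and the function name is hypothetical.

```python
import math
import random


def ou_conductance(n_steps, g0, sigma, tau, h, rng):
    """Simulate one background conductance with the exact OU update.

    g0: mean, sigma: stationary standard deviation (same unit as g0),
    tau: time constant [s], h: time step [s], rng: random.Random instance.
    """
    decay = math.exp(-h / tau)
    # Amplitude coefficient; equals sqrt(D * tau / 2 * (1 - exp(-2h/tau)))
    # when sigma**2 = D * tau / 2.
    amplitude = sigma * math.sqrt(1.0 - decay * decay)
    g = g0
    trace = []
    for _ in range(n_steps):
        g = g0 + (g - g0) * decay + amplitude * rng.gauss(0.0, 1.0)
        trace.append(g)
    return trace
```

A call such as `ou_conductance(200000, 0.012, 0.003, 2.7e-3, 0.1e-3, random.Random(0))` generates 20 s of excitatory background conductance in µS at the 0.1 ms time step used in the simulations. Because the update is exact, the statistics of the trace do not depend on the step size.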

Reward-Modulated STDP

For the computer simulations we used the following parameters for the STDP window function W(r): A_+ = 0.01 w_max, A_−/A_+ = 1.05, τ_+ = τ_− = 30 ms. Here w_max denotes the hard upper bound of the synaptic weight of the particular plastic synapse. Note that the parameter A_+ can be given an arbitrary value in this plasticity rule, since it can be scaled together with the reward signal: multiplying the reward signal by some constant and dividing A_+ by the same constant results in an identical time evolution of the weight changes. We set A_+ to 1% of the maximum synaptic weight. We used an α-function to model the eligibility trace kernel f(t), with a time constant of 0.4 s in all computer simulations. For computer simulations 1 and 4 we performed control experiments (see Figure S3, Figure S4, and Figure S7) with the weight-dependent synaptic update rule proposed in [22] instead of the purely additive rule in Equation 3. We used the parameters proposed in [22], i.e., μ = 0.4, α = 0.11, τ_+ = τ_− = 20 ms. The w_0 parameter was calculated from w_max, the maximum synaptic weight of the synapse, which is tied to the initial synaptic weight for the circuit neurons and to the mean of the distribution of the initial weights for the trained neurons.
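The scaling argument for A_+ can be checked numerically with a small sketch. The exponential window shape, the normalization of the α-function (peak value 1 at s = τ_e), the convention r = t_post − t_pre, and the simplification that the eligibility trace starts at the postsynaptic spike are all assumptions made for illustration. `weight_change` is a hypothetical helper, not the rule as implemented in the simulator.

```python
import math

TAU_PLUS = 30e-3    # time constant of the positive window part [s]
TAU_MINUS = 30e-3   # time constant of the negative window part [s]
RATIO = 1.05        # A_- / A_+
TAU_E = 0.4         # time constant of the eligibility kernel [s]


def stdp_window(r, a_plus):
    """Additive STDP window W(r) for r = t_post - t_pre (assumed convention)."""
    if r > 0:
        return a_plus * math.exp(-r / TAU_PLUS)
    if r < 0:
        return -RATIO * a_plus * math.exp(r / TAU_MINUS)
    return 0.0


def eligibility_kernel(s):
    """Alpha-function kernel f(s), normalized to peak 1 at s = TAU_E (assumption)."""
    return (s / TAU_E) * math.exp(1.0 - s / TAU_E) if s > 0 else 0.0


def weight_change(pairs, reward, a_plus, horizon=4.0, ds=1e-3):
    """Sum of W(r) * integral of f(s) * d(t_post + s) ds over all spike pairs.

    pairs: list of (t_post, r); reward: function mapping time to reward d(t).
    """
    n = int(round(horizon / ds))
    total = 0.0
    for t_post, r in pairs:
        w_r = stdp_window(r, a_plus)
        for k in range(n):
            s = k * ds
            total += w_r * eligibility_kernel(s) * reward(t_post + s) * ds
    return total
```

Calling `weight_change` once with a reward function and `a_plus=0.01`, and once with the reward multiplied by 5 and `a_plus=0.002`, gives the same result up to rounding. With equal time constants and A_−/A_+ = 1.05, the integral of the window is negative, so uncorrelated pre- and postsynaptic spikes lead to net depression under positive reward.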

Initial Weights of Trained Neurons

The synaptic weights of excitatory synapses to the trained neurons in experiments 2–5 were initialized from a Gaussian distribution with mean w_max/2 and standard deviation w_max/10, bounded within the range [3w_max/10, 7w_max/10].

Software

All computer simulations were carried out with the PCSIM software package (http://www.lsm.tugraz.at/pcsim). PCSIM is a parallel simulator for biologically realistic neural networks with a fast C++ simulation core and a Python interface. It has been developed by Thomas Natschläger and Dejan Pecevski. The simulation time step was set to 0.1 ms.

Details to Individual Computer Simulations

For all computer simulations, both for the cortical microcircuits and the readout neurons, the same parameter values for the neuron and synapse models and the reward-modulated STDP rule were used, as specified in the previous section (except in computer simulation 3, where the goal was to test the theoretical predictions for different values of the parameters). Each of the computer simulations in this article modeled a specific task or experimental finding. Consequently, the dependence of the reward signal on the behavior of the system had to be modeled in a specific way for each simulation (a more detailed discussion of the reward signal can be found in the Discussion section). The corresponding parameters are given below in separate subsections which address the individual simulations. Furthermore, some of the remaining parameters in the experiments, i.e., the values of the synaptic weights, the number of synapses of a neuron, the number of neurons in the circuit, and the Ornstein-Uhlenbeck (OU) noise levels, were chosen to achieve different goals depending on the particular experiment. Briefly stated, these values were tuned to achieve a certain level of firing activity in the neurons, a suitable dynamical regime of the activity in the circuits, and a specific ratio between the amount of input the neurons receive from the input synapses and the input generated by the noise process. We carried out two types of simulations: simulations of cortical microcircuits in computer simulations 1 and 5, and training of readout neurons in computer simulations 2, 3, 4, and 5. In the following we discuss these two types of simulations in more detail.

Cortical Microcircuits

The values of the initial weights of the excitatory and inhibitory synapses for the cortical microcircuits are given in Table 3. All synaptic weights were bounded in the range between 0 and twice the initial synaptic weight of the synapse.

Table 3

Specific parameter values for the cortical microcircuits in computer simulations 1 and 5.

Simulation No.	Neurons	p_ee, p_ei, p_ie, p_ii	w_exc(0) [nS]	w_inh [nS]	C_OU
1	4000	0.02,0.02,0.024,0.016	10.7	211.6	1.0, 0.2
5	540	0.1	0.784	5.1	0.4

p_ee, p_ei, p_ie, and p_ii are the connection probabilities, w_exc(0) and w_inh(0) are the initial synaptic weights of the excitatory and inhibitory synapses, respectively, and C_OU is the scaling factor for the Ornstein-Uhlenbeck noise injected into the neurons.

The cortical microcircuit for computer simulation 1 was composed of 4000 neurons connected randomly with the connection probabilities described in Details to Computer Simulation 1. The initial synaptic weights of the synapses and the levels of OU noise were tuned to achieve a spontaneous firing rate of about 4.6 Hz, while maintaining an asynchronous irregular firing activity in the circuit. 50% of all neurons (randomly chosen, 50% excitatory and 50% inhibitory) received downscaled OU noise (by a factor of 0.2 from the model reported in [26]), with the subtracted part substituted by additional synaptic input from the circuit. The input connection probabilities of these neurons were scaled up so that the firing rates remained in the same range as for the other neurons. This was done in order to observe how the learning mechanisms work when most of the input conductance in the neuron comes from a larger number of input synapses which are plastic, rather than from a static noise process. The reinforced neurons were randomly chosen from this group of neurons. We chose a smaller microcircuit, composed of 540 neurons, for computer simulation 5 in order to be able to perform a large number of training trials. The synaptic weights in this smaller circuit were chosen (see Table 3) to achieve an appropriate level of firing activity in the circuit that is modulated by the external input. OU noise scaled by a factor of 0.4 was injected into the circuit neurons in order to emulate the background synaptic activity in neocortical neurons in vivo and to test the learning in a more biologically realistic setting. This produced significant trial-to-trial variability in the circuit response (see Figure 10D). A lower noise level could also be used without affecting the learning, whereas increasing the amount of injected noise would gradually degrade the information that the circuit activity maintains about the injected inputs, resulting in a decline of the learning performance.

Readout Neurons

The maximum values of the synaptic weights of readout neurons for computer simulations 2, 4, and 5, together with the number of synapses of the neurons, are given in Table 4.

Table 4

Specific parameter values for the trained (readout) neurons in computer simulations 2, 4, and 5.

Simulation No.	Num. Synapses	w_max [nS]	C_OU
2	100	11.9	1.0
4	200	5.73	0.2
5	432	2.02	0.2

w is the upper hard bound of the synaptic weights of the synapses. C is the scaling factor for the Ornstein-Uhlenbeck noise injected in the neurons.

The neuron in computer simulation 2 had 100 synapses. We chose 200 synapses for the neuron in computer simulation 4 in order to improve the learning performance. Such improvement of the learning performance for larger numbers of synapses is in accordance with our theoretical analysis (see Equation 17), since for learning the classification of temporal patterns the temporal variation of the voltage of the postsynaptic membrane turns out to be of critical importance (see the discussion after Equation 17). This temporal variation depends less on the shape of a single EPSP and more on the temporal pattern of presynaptic firing when the number of synapses is increased. In computer simulation 5 the readout neuron received inputs from all 432 excitatory neurons in the circuit. The synaptic weights were chosen in accordance with the number of synapses in order to achieve a firing rate suitable for the particular task, and to balance the synaptic input and the noise injections in the neurons. For the pattern discrimination task (computer simulation 4) and the speech recognition task (computer simulation 5), the amount of noise had to be high enough to achieve sufficient variation of the membrane potential from trial to trial near the firing threshold, and low enough that it would not dominate the fluctuations of the membrane potential. In the experiment where the exact spike times were rewarded (computer simulation 2), the noise had a different role. As described in the Results section, there the noise effectively controls the amount of depression. If the noise (and therefore the depression) is too weak, synapses with w* = 0 do not converge to 0. If the noise is too strong, synapses with w* = w_max do not converge to w_max. To achieve the desired learning result, the noise level should be in a range where it reduces the correlations of the synapses with w* = 0 so that the depression of STDP prevails, but at the same time is not strong enough to do the same for the other group of synapses with w* = w_max, since they have stronger pre-before-post correlations. For our simulations, we set the noise level to the full amount of OU noise.

Details to Computer Simulation 1: Model for Biofeedback Experiment

The cortical microcircuit model consisted of 4000 neurons, with twenty percent of the neurons randomly chosen to be inhibitory and the others excitatory. The connections between the neurons were created randomly, with different connection probabilities depending on whether the postsynaptic neuron received the full amount of OU noise, or downscaled OU noise with an additional compensatory synaptic input from the circuit. For neurons in the latter sub-population, the connection probabilities were p_ee = 0.02, p_ei = 0.02, p_ie = 0.024, and p_ii = 0.016, where the ee, ei, ie, ii indices designate the type of the presynaptic and postsynaptic neurons (e = excitatory, i = inhibitory). For the other neurons the corresponding connection probabilities were downscaled by 0.4. The resulting firing rates and correlations for both types of excitatory neurons are plotted in Figure S1 and Figure S2. The shape of the reward kernel ε(t) was chosen as a difference of two α-functions: one positive α-pulse with a peak at 0.4 s after the corresponding spike, and one long-tailed negative α-pulse which ensures that the integral over the reward kernel is zero. With the parameters used for the reward kernel, including d = 0.2 s, the reward pulse reached its peak value 0.4 s after the spike that caused it.

Details to Computer Simulation 2: Learning Spike Times

We used a reward kernel κ(r) composed of two double-exponential functions, each with a positive scaling constant and time constants that define its shape, where t defines the offset of the zero-crossing from the origin. In our simulations the offset was t = −1 ms. The reward delay was equal to d = 0.4 s.

Details to Computer Simulation 3: Testing the Analytically Derived Conditions

We used a linear Poisson neuron model as in the theoretical analysis, with static synapses and exponentially decaying postsynaptic responses. The neuron had 100 excitatory synapses, except in experiment #6, where we used 200 synapses. In all experiments the target neuron received an additional 10 excitatory synapses with weights set to w_max. The input spike trains were Poisson processes with a constant rate of r = 6 Hz, except in experiment #6, where the rate was r = 3 Hz. The weights of the target neuron were set separately for the two groups 0≤i<50 and 50≤i<100. Some time constants of the reward kernel were fixed, whereas others had different values in different experiments (reported in Table 1). The value of t was always set to an optimal value. The time constant τ_− of the negative part of the STDP window function W(r) was set to τ_+. The reward signal was delayed by 0.4 s. The simulations were performed for varying durations of simulated biological time (see the t-column in Table 1).

Details to Computer Simulation 4: Learning Pattern Classification

We used the reward signal from Equation 16, with an α-function for the reward kernel and the reward delay d set to 300 ms. The amplitudes of the positive and negative pulses were +1.435 and −1.435, respectively, and the time constant of the reward kernel was τ = 100 ms.

Details to Computer Simulation 5: Training a Readout Neuron with Reward-Modulated STDP To Recognize Isolated Spoken Digits

Spike representations of speech utterances

The speech utterances were preprocessed by the cochlea model described in [43], which captures the filtering properties of the cochlea and hair cells in the human inner ear. The resulting analog signals were encoded by spikes with the BSA spike encoding algorithm described in [44]. We used the same preprocessing to generate the spikes as in [45]. The spike representations had a duration of about 400 ms and 20 input channels. The input channels were connected topographically to the cortical microcircuit model. The neurons in the circuit were split into 20 disjoint subsets of 27 neurons, and each input channel was connected to the 27 neurons in its corresponding subset. The readout neuron was trained with 20 different spike inputs to the circuit, where 10 of them resulted from utterances of digit “one”, and the other 10 resulted from utterances of digit “two” by the same speaker.

Training procedure

We performed 2000 training trials, where for each trial a spike representation of a randomly chosen utterance out of the 10 utterances for one digit was injected into the circuit. The digit changed from trial to trial. Whenever the readout neuron spiked during the presentation of an utterance of digit “two”, a positive pulse was generated in the reward signal; accordingly, for utterances of digit “one”, a negative pulse was generated. We used the reward signal from Equation 16. The amplitudes of the positive and negative pulses were +0.883 and −0.883, respectively. The time constant of the reward kernel ε(r) was τ = 100 ms. The pulses in the reward were delayed by d = 300 ms relative to the spikes that caused them.
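The way readout spikes are turned into delayed reward pulses can be sketched as follows. Equation 16 is not reproduced here, so the functional form is an assumption: each readout spike contributes one α-shaped pulse, normalized to peak value 1 at τ after its onset, scaled by the pulse amplitude and shifted by the reward delay. The function names and the string labels for the digits are hypothetical.

```python
import math


def alpha_kernel(t, tau=0.1):
    """Alpha-function pulse, normalized to peak 1 at t = tau (assumed normalization)."""
    return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0


def reward_signal(t, readout_spikes, digit, amplitude=0.883, delay=0.3, tau=0.1):
    """Reward at time t [s] caused by readout spikes during one utterance.

    Spikes during digit "two" produce positive pulses, spikes during
    digit "one" produce negative pulses of the same magnitude.
    """
    sign = 1.0 if digit == "two" else -1.0
    pulses = sum(alpha_kernel(t - t_spike - delay, tau) for t_spike in readout_spikes)
    return sign * amplitude * pulses
```

With these assumptions a single readout spike at time 0 yields no reward before 0.3 s and a pulse that peaks at 0.4 s, so the reward arrives well after the utterance (about 400 ms long) has started and after the spike that caused it.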

Cortical microcircuit details

The cortical microcircuit model consisted of 540 neurons, with twenty percent of the neurons randomly chosen to be inhibitory and the others excitatory. The recurrent connections in the circuit were created randomly with a connection probability of 0.1. Long-term plasticity was not modeled in the circuit synapses. The synapses for the connections from the input neurons to the circuit neurons were static and current-based, with an axonal conduction delay of 1 ms and an exponentially decaying PSR with time constant τ = 3 ms and amplitude w = 0.715 nA.

Discussion

We have presented in this article analytical tools which make it possible to predict under which conditions reward-modulated STDP will achieve a given learning goal in a network of neurons. These conditions specify relationships between parameters and auxiliary functions (learning curves for STDP, eligibility traces, reward signals etc.) that are involved in the specification of the reward-modulated STDP learning rule. Although our analytical results are based on some simplifying assumptions, we have shown that they predict quite well the outcomes of computer simulations of quite complex models for cortical networks of neurons. We have applied this learning theory for reward-modulated STDP to a number of biologically relevant learning tasks. We have shown that the biofeedback result of Fetz and Baker [17] can in principle be explained on the basis of reward-modulated STDP. The underlying credit assignment problem was extremely difficult, since the monkey brain had no direct information about the identity of the neuron whose firing rate was relevant for receiving rewards. This credit assignment problem is even more difficult from the perspective of a single synapse, and hence for the application of a local synaptic plasticity rule such as reward-modulated STDP. However our theoretical analysis (see Equations 10 and 11) has shown that the longterm evolution of synaptic weights depended only on the correlation of pairs of pre- and postsynaptic spikes with the reward signal. Therefore the firing rate of the rewarded neuron increased (for a computer simulation of a recurrent network consisting of 4000 conductance based LIF neurons with realistic background noise typical for in-vivo conditions, and 228954 synapses that exhibited data-based short term synaptic plasticity) within a few minutes of simulated biological time, like in the experimental data of [17], whereas the firing rates of the other neurons remained invariant (see Figure 4B). 
We were also able to model differential reinforcement of two neurons in this way (Figure 2). These computer simulations demonstrated a remarkable stability of the network dynamics (see Figures 2A, 4A, and 5) in spite of the fact that all excitatory synapses were continuously subjected to reward-modulated STDP. In particular, the circuit remained in the asynchronous irregular firing regime that resembles spontaneous firing activity in the cortex [9]. Other STDP rules (without reward modulation) that maintain this firing regime have previously been exhibited in [22]. It was also reported in [17], and further examined in [46], that bursts of the reinforced neurons were often accompanied by activations of specific muscles in the biofeedback experiment by Fetz and Baker. But the relationship between bursts of the recorded neurons in precentral motor cortex and muscle activations was reported to be quite complex and often dropped out after continued reinforcement of the neuron alone. Furthermore, in [46] it was shown that all neurons tested in that study could be dissociated from their correlated muscle activity by differentially reinforcing simultaneous suppression of EMG activity. These results suggest that the solution of the credit assignment problem by the monkeys (to activate more strongly the one reinforced neuron out of billions of neurons in their precentral gyrus) may have been supported by large-scale exploration strategies that were associated with muscle activations. But the previously mentioned results on differential reinforcement of two nearby neurons suggest that this large-scale exploration strategy had to be complemented by exploration on a finer spatial scale that is difficult to explain on the basis of muscle activations (see [19] for a detailed discussion). 
Whereas this learning task focused on firing rates, we have also shown (see Figure 7) that neurons can learn via reward-modulated STDP to respond to inputs with particular spike trains, i.e., particular temporal output patterns. It has been pointed out in [27] that this is a particularly difficult learning task for reward-modulated STDP, and it was shown there that it can be accomplished with a modified STDP rule and more complex reward prediction signals without delays. We have complemented the results of [27] by deriving specific conditions (Equations 13–15) under which this learning task can be solved by the standard version of reward-modulated STDP. Extensive computer simulations have shown that these conditions, although derived analytically for a simpler neuron model, also predict whether a LIF neuron with conductance-based synapses is able to solve this learning task. Figure 8 shows that this learning theory for reward-modulated STDP is also able to predict quite well how fast a neuron can learn to produce a desired temporal output pattern. An interesting aspect of [27] is that it also explored the utility of third signals that provide information about changes in the expectation of reward. We have considered in this article only learning scenarios where reward prediction is not possible. A logical next step will be to extend our learning theory for reward-modulated STDP to scenarios from classical reinforcement learning theory that include reward prediction. We have also addressed the question to what extent neurons can learn via reward-modulated STDP to respond with different firing rates to different spatio-temporal presynaptic firing patterns. It had already been shown in [12] that this learning rule enables neurons to classify spatial firing patterns.
We have complemented this work by deriving an analytic expression for the expected weight change in this learning scenario (see Equation 17), which clarifies to what extent a neuron can learn by reward-modulated STDP to distinguish differences in the temporal structure of presynaptic firing patterns. This theoretical analysis showed that in the extreme case, where all incoming information is encoded in the relative timing of presynaptic spikes, reward-modulated STDP is not able to produce a higher average membrane potential for selected presynaptic firing patterns, even if that would be rewarded. But it is able to increase the variance of the membrane potential, and thereby also the number of spikes of any neuron model that has (unlike the simple linear Poisson neuron) a firing threshold. The simulation results in Figure 9 confirm that in this way a LIF neuron can learn with the standard version of reward-modulated STDP to discriminate even purely temporal presynaptic firing patterns, by producing more spikes in response to one of these patterns. A surprising feature is that, although the neuron was rewarded here only for responding with a higher firing rate to one presynaptic firing pattern P, it automatically started to respond to this pattern P with a specific temporal spike pattern that advanced in time during training (see Figure 9A). Finally, we have shown that a spiking neuron can be trained by reward-modulated STDP to read out information from a simulated cortical microcircuit (see Figure 10). This is of interest because previous work [31],[34],[47] had shown that models of generic cortical microcircuits have inherent capabilities to serve as preprocessors for such readout neurons, by combining in diverse linear and nonlinear ways information that was contained in different time segments of spike inputs to the circuit (“liquid computing model”).
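The variance argument made above for purely temporal patterns can be illustrated with a short calculation. The sketch below treats the membrane potential as a stationary Gaussian variable and computes the probability that it lies above a firing threshold. The Gaussian approximation, the mean of −58 mV, and the threshold of −55 mV are assumptions for illustration only, and this probability is a crude proxy for spiking rather than a firing rate; the variance values are the ones reported for patterns P and N in the supporting caption on the weight-dependent STDP rule.

```python
import math

def prob_above_threshold(mean_mv, var_mv2, threshold_mv):
    """P(V > threshold) for a Gaussian membrane potential V ~ N(mean, var)."""
    z = (threshold_mv - mean_mv) / math.sqrt(var_mv2)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Assumed mean and threshold (hypothetical values, not from the article).
MEAN_MV = -58.0
THRESHOLD_MV = -55.0

# Variances in (mV)^2 before and after learning, as reported for P and N.
for label, var_before, var_after in [("P", 2.35, 3.66), ("N", 2.27, 1.54)]:
    p_before = prob_above_threshold(MEAN_MV, var_before, THRESHOLD_MV)
    p_after = prob_above_threshold(MEAN_MV, var_after, THRESHOLD_MV)
    print(f"pattern {label}: {p_before:.4f} -> {p_after:.4f}")
```

With the mean held fixed below threshold, a larger variance raises the probability of exceeding the threshold and a smaller variance lowers it. This is why a threshold neuron can discriminate patterns that a linear Poisson neuron cannot.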
The classification of spoken words (that were first transformed into spike trains) had been introduced as a common benchmark task for the evaluation of different approaches towards computing with spiking neurons [31]–[33],[45],[48]. But so far all approaches that were based on learning (rather than on clever constructions) had to rely on supervised training of a simple linear readout. This gave rise to the question whether biologically more realistic models for readout neurons can also be trained through a biologically more plausible learning scenario to classify spoken words. The results of Figure 10 may be interpreted as a tentative positive answer to this question. We have demonstrated that LIF neurons with conductance-based synapses (that are subject to biologically realistic short-term plasticity) can learn without a supervisor through reward-modulated STDP to classify spoken digits. In contrast to the result of Figure 9, the output code that emerged here was a rate code. This can be explained through the significant in-class variance of circuit responses to different utterances of the same word (see Figure 10C and 10D). Although the LIF neuron learnt here without a supervisor to respond with different firing rates to utterances of different words by the same speaker (whereas the rate output was very similar for both words at the beginning of learning, see Figure 10E), the classification capability of these neurons has not yet reached the level of linear readouts that are trained by a supervisor (for example, speaker-independent word classification could not yet be achieved in this way).
Further work is needed to test whether the classification capability of LIF readout neurons can be improved through additional preprocessing in the cortical microcircuit model, through a suitable variation of the reward-modulated STDP rule, or through a different learning scenario (mimicking for example preceding developmental learning that also modifies the presynaptic circuit). The new learning theory for reward-modulated STDP will also be useful for biological experiments that aim to clarify details of the biological implementation of synaptic plasticity in different parts of the brain, since it allows one to predict which types and time courses of signals would be optimal for a particular range of learning tasks. For each of the previously discussed learning tasks, the theoretical analysis provided conditions on the structure of the reward signal d(t) which guaranteed successful learning. For example, in the biofeedback learning scenario (Figure 4), every action potential of the reinforced neuron led, after some delay, to a change of the reward signal d(t). The shape of this change was defined by the reward kernel ε(r). Our analysis revealed that this reward kernel can be chosen rather arbitrarily as long as the integral over the kernel is zero, and the integral over the product of the kernel and the eligibility function is positive. For another learning scenario, where the goal was that the output spike train of some neuron j approximates the spike timings of some target spike train S* (Figure 7), the reward signal has to depend on both the output spike train and S*. The dependence of the reward signal on these spike timings was defined by a reward kernel κ(r).
Our analysis showed that the reward kernel has to be chosen for this task so that the synapses receive positive rewards if the postsynaptic neuron fires close to the time of a spike in the target spike train S* or somewhat later, and negative rewards when an output spike occurs on the order of ten milliseconds too early. In the pattern discrimination task of Figure 9, each postsynaptic action potential was followed, after some delay, by a change of the reward signal which depended on the pattern presented. Our theoretical analysis predicted that this learning task can be solved if the integrals and defined by Equation 18 are such that and . Again, these constraints are fulfilled for a large class of reward kernels, and a natural choice is to use a non-negative reward kernel ε. There are currently no data available on the shape of reward kernels in biological neural systems. The theoretical analysis sketched above makes specific predictions for the shape of reward kernels (depending on the type of learning task in which a biological neural system is involved), which can potentially be tested through biological experiments. An interesting general aspect of the learning theory that we have presented in this article is that it requires substantial trial-to-trial variability in the neural circuit, which is often viewed as “noise” of imperfect biological implementations of theoretically ideal circuits of neurons. This learning theory for reward-modulated STDP suggests that the main functional role of noise is to maintain a suitable level of spontaneous firing (since if a neuron does not fire, it cannot find out whether this will be rewarded), which should vary from trial to trial in order to explore which firing patterns are rewarded. (It had been shown in [31],[34],[47] that such highly variable circuit activity is compatible with a stable performance of linear readouts.)
On the other hand, if a neuron fires primarily on the basis of a noise current that is directly injected into that neuron, and not on the basis of presynaptic activity, then STDP does not have the required effect on the synaptic connections to this neuron (see Figure S6). This perspective opens the door for subsequent studies that compare, for concrete biological learning tasks, the theoretically derived optimal amount and distribution of trial-to-trial variability with corresponding experimental data.
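The two conditions stated in the discussion for the reward kernel ε(r) of the biofeedback scenario (zero integral, and a positive integral of its product with the eligibility function) can be checked numerically for any candidate kernel. The following Python sketch does this with a plain Riemann sum. The biphasic kernel built from two unit-area alpha functions, the alpha-shaped eligibility function, their time constants, and the helper names are hypothetical choices for illustration, and the reward delay is ignored; they are not the functions used in the article.

```python
import numpy as np

def unit_alpha(r, tau):
    """Alpha function (r / tau^2) * exp(-r / tau); its integral over r >= 0 is 1."""
    return (r / tau**2) * np.exp(-r / tau)

def biphasic_kernel(r):
    """Hypothetical reward kernel: early positive lobe, later negative lobe."""
    return unit_alpha(r, 0.2) - unit_alpha(r, 1.0)

def eligibility(r, tau_e=0.4):
    """Hypothetical eligibility function with time constant tau_e (seconds)."""
    return (r / tau_e) * np.exp(-r / tau_e)

def check_reward_kernel(kernel, elig, r_max=20.0, dr=1e-3, tol=1e-3):
    """Return (integral of kernel, integral of kernel * elig, conditions met?)."""
    r = np.arange(0.0, r_max, dr)
    k = kernel(r)
    integral_kernel = np.sum(k) * dr
    integral_product = np.sum(k * elig(r)) * dr
    ok = bool(abs(integral_kernel) < tol and integral_product > 0.0)
    return integral_kernel, integral_product, ok

if __name__ == "__main__":
    print(check_reward_kernel(biphasic_kernel, eligibility))
```

Under these assumptions the biphasic kernel satisfies both conditions, whereas a purely positive kernel fails the zero-integral condition and a sign-flipped kernel fails the positivity condition.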

Related Work

The theoretical analysis of this model is directly applicable to the learning rule considered in [12]. There, the network behavior of reward-modulated STDP was also studied in some situations different from the ones in this article. The computer simulations of [12] apparently operate in a different dynamic regime, where LTD dominates LTP in the STDP rule, and most weights (except those that are actively increased through reward-modulated STDP) have values close to 0 (see Figure 1b and 1d in [12], and compare with Figure 5 in this article). For successful learning, this setup is likely to require a larger dominance of pre-before-post over post-before-pre pairs than the one shown in Figure 4E. Furthermore, whereas a very low spontaneous firing rate of 1 Hz was required in [12], computer simulation 1 shows that reinforcement learning is also feasible at spontaneous firing rates which correspond to those reported in [17] (the preceding theoretical analysis had already suggested that the success of the model does not depend on particularly low firing rates). The articles [15] and [13] investigate variations of reward-modulated STDP rules that do not employ learning curves for STDP that are based on experimental data, but modified curves that arise in the context of a very interesting top-down theoretical approach (distributed reinforcement learning [14]). The authors of [16] arrive at similar learning rules in a supervised scenario which can be reinterpreted in the context of reinforcement learning. We expect that a theory similar to the one we have presented in this article for the more commonly discussed version of STDP can also be applied to their modified STDP rules, thereby making it possible to predict under which conditions their learning rules will succeed. Another reward-based learning rule for spiking neurons was recently presented in [49].
This rule exploits correlations of a reward signal with noisy perturbations of the neuronal membrane conductance in order to optimize some objective function. One crucial assumption of this approach is that the synaptic plasticity mechanism “knows” which contributions to the membrane potential arise from synaptic inputs, and which contributions are due to internal noise. Such explicit knowledge of the noise signal is not needed in the reward-modulated STDP rule of [12], which we have considered in this article. The price one has to pay for this potential gain in biological realism is a reduced generality of the learning capabilities. While the learning rule in [49] approximates gradient ascent on the objective function, this cannot be stated for reward-modulated STDP at present. Timing-based pattern discrimination with a spiking neuron, as discussed in the section “Pattern discrimination with reward-modulated STDP” of this article, was recently tackled in [50]. The authors proposed the tempotron learning rule, which increases the peak membrane voltage for one class of input patterns (if no spike occurred in response to the input pattern) while decreasing the peak membrane voltage for another class of input patterns (if a spike occurred in response to the pattern). The main difference between this learning rule and reward-modulated STDP is that the tempotron learning rule is sensitive to the peak membrane voltage, whereas reward-modulated STDP is sensitive to local fluctuations of the membrane voltage. Since the time of the maximal membrane voltage has to be determined for each pattern by the synaptic plasticity mechanism, the basic tempotron rule is perhaps not biologically realistic. 
Therefore, an approximate and potentially biologically more realistic learning rule was proposed in [50], where plasticity following error trials is induced at synapse i only if the voltage within the postsynaptic integration time after their activation exceeds a plasticity threshold κ. One potential problem of this rule is the plasticity threshold κ, since a good choice of this parameter strongly depends on the mean membrane voltage after input spikes. This problem is circumvented by reward-modulated STDP, which considers instead the local change in the membrane voltage. Further work is needed to compare the advantages and disadvantages of these different approaches.

Conclusion

Reward-modulated STDP is a very promising candidate for a synaptic plasticity rule that is able to orchestrate local synaptic modifications in such a way that particular functional properties of larger networks of neurons can be achieved and maintained (we refer to [12] and [27] for discussion of potential biological implementations of this plasticity rule). We have provided in this article analytical tools which make it possible to evaluate this rule and variations of this rule not just through computer simulations, but through theoretical analysis. In particular, we have shown that successful learning is only possible if certain relationships hold between the parameters that are involved. Some of these predicted relationships can be tested through biological experiments. Provided that these relationships are satisfied, reward-modulated STDP turns out to be a powerful rule that can achieve self-organization of synaptic weights in large recurrent networks of neurons. In particular, it enables us to explain seemingly inexplicable experimental data on biofeedback in monkeys. In addition, reward-modulated STDP enables neurons to distinguish complex firing patterns of presynaptic neurons, even for data-based standard forms of STDP, and without the need for a supervisor that tells the neuron when it should spike. Furthermore, reward-modulated STDP requires substantial spontaneous activity and trial-to-trial variability in order to support successful learning, thereby providing a functional explanation for these ubiquitous features of cortical networks of neurons. In fact, not only spontaneous activity but also STDP itself may be seen in this context as a mechanism that supports the exploration of different firing chains within a recurrent network, until a solution is found that is rewarded because it supports a successful computational function of the network.

Supporting Information

Figure S1. Variations of Figure 5B–D for those excitatory neurons which receive the full amount of Ornstein-Uhlenbeck noise. (B) The distribution of the firing rates of these neurons remains unchanged during the simulation. The colors of the curves and the corresponding intervals are as follows: red (300–360 sec), green (600–660 sec), blue (900–960 sec), magenta (1140–1200 sec). (C) Cross-correlogram of the spiking activity of these neurons, averaged over 200 pairs of neurons and over 60 s, with a bin size of 0.2 ms, for the period between 300 and 360 seconds of simulation time. It is calculated as the cross-covariance divided by the square root of the product of variances. (D) As in (C), but for the last 60 seconds of the simulation. The correlation statistics in the circuit are stable during learning. (0.06 MB PDF)

Figure S2. Variations of Figure 5B–D for those excitatory neurons which receive a reduced amount of Ornstein-Uhlenbeck noise, but receive more synaptic inputs from other neurons. (B) The distribution of the firing rates of these neurons remains unchanged during the simulation. The colors of the curves and the corresponding intervals are as follows: red (300–360 sec), green (600–660 sec), blue (900–960 sec), magenta (1140–1200 sec). (C) Cross-correlogram of the spiking activity in the circuit, averaged over 200 pairs of these neurons and over 60 s, with a bin size of 0.2 ms, for the period between 300 and 360 seconds of simulation time. It is calculated as the cross-covariance divided by the square root of the product of variances. (D) As in (C), but for the last 60 seconds of the simulation. The correlation statistics in the circuit are stable during learning. (0.06 MB PDF)

Figure S3. Variation of Figure 4 from computer simulation 1 with results from a simulation where the weight-dependent version of STDP proposed in [22] was used. This STDP rule is defined by the following equations: and . We used the parameters proposed in [36], i.e. μ = 0.4, α = 0.11, τ+ = τ− = 20 ms, λ = 0.1 and w0 = 272.6 pS. The w0 parameter was calculated according to the formula: where w is the maximum synaptic weight of the synapse. The amplitude parameters , for the reward kernel were set to and . All other parameter values were the same as in computer simulation 1. (0.09 MB PDF)

Figure S4. Variation of Figure 5 for the weight-dependent STDP rule from [22] (as in Figure S3). (0.06 MB PDF)

Figure S5. Variation of Figure 7 (i.e., of computer simulation 2) for a simulation where we used current-based synapses without short-term plasticity. The post-synaptic response had an exponentially decaying form, with τ = 5 ms. The value of the maximum synaptic weight was w = 32.9 pA. All other parameter values were the same as in computer simulation 2. (0.17 MB PDF)

Figure S6. Dependence of the learning performance on the noise level in computer simulation 2. The angular error (defined as the angle between the weight vector w of the trained neuron at the end of the simulation and the weight vector w* of the neuron μ*) is taken as a measure for the learning performance, and plotted for 9 simulations with different noise levels that are given on the X axis (in terms of multiples of the noise level chosen for Figure 7). All other parameter values were the same as in computer simulation 2. The figure shows that the learning performance declines both for too little and for too much noise. (0.02 MB PDF)

Figure S7. Variation of Figure 9 (i.e., of computer simulation 4) with the weight-dependent STDP rule proposed in [22]. This rule is defined by the following equations: and . We used the parameters proposed in [22], i.e. μ = 0.4, α = 0.11, τ+ = τ− = 20 ms, λ = 0.1 and w0 = 72.4 pS. The w0 parameter was calculated according to the formula: where w is the maximum synaptic weight of the synapse. The amplitude parameters of the reward kernel were set to α = −α = 1.401. All other parameter values were the same as in computer simulation 4. The variance of the membrane potential increased for pattern P from 2.35 (mV)² to 3.66 (mV)² (C), and decreased for pattern N (D), from 2.27 (mV)² to 1.54 (mV)². (0.31 MB PDF)

Figure S8. Variation of Figure 9 for a simulation where we used current-based synapses without short-term plasticity. The post-synaptic response had an exponentially decaying form, with τ = 5 ms. The value of the maximum synaptic weight was w = 106.2 pA. All other parameter values were the same as in computer simulation 4. The variance of the membrane potential increased for pattern P from 2.84 (mV)² to 5.89 (mV)² (C), and decreased for pattern N (D), from 2.57 (mV)² to 1.22 (mV)². (0.31 MB PDF)

Figure S9. Variation of Figure 10 (i.e., of computer simulation 5) for a simulation where we used current-based synapses without short-term plasticity. The post-synaptic response had an exponentially decaying form, with τ = 5 ms. The synaptic weights of the excitatory and inhibitory synapses in the cortical microcircuit were set to w = 65.4 pA and w = 238 pA respectively. The maximum synaptic weight of the synapses to the readout neuron was w = 54.3 pA. All other parameter values were the same as in computer simulation 5. (0.27 MB PDF)

Figure S10. Spike encodings of 10 utterances of digit “one” by one speaker with the Lyon cochlea model [43], which were used as circuit inputs for computer simulation 5. (0.05 MB PDF)

Figure S11. Spike encodings of 10 utterances of digit “two” by one speaker with the Lyon cochlea model [43], which were used as circuit inputs for computer simulation 5. (0.05 MB PDF)
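Two of the measures defined in the supporting figure captions above lend themselves to a short sketch: the cross-correlogram computed as the cross-covariance divided by the square root of the product of variances, and the angular error between two weight vectors. The Python functions below are a minimal illustration for a single pair of binned spike trains. The function names, the per-lag averaging over the overlapping bins, and the requirement that both trains have nonzero variance are assumptions; the averaging over 200 neuron pairs described in the captions is omitted.

```python
import numpy as np

def normalized_cross_correlogram(x, y, max_lag):
    """Cross-covariance of two binned spike trains over sqrt(var(x) * var(y)).

    x, y: 1-D arrays of spike counts per bin (equal length, nonzero variance).
    max_lag: largest lag in bins; lags run from -max_lag to +max_lag.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    norm = np.sqrt(xc.var() * yc.var())
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    for i, lag in enumerate(lags):
        # Pair x[t] with y[t + lag], using only the overlapping bins.
        if lag >= 0:
            prod = xc[:n - lag] * yc[lag:]
        else:
            prod = xc[-lag:] * yc[:n + lag]
        cc[i] = prod.mean() / norm
    return lags, cc

def angular_error(w, w_target):
    """Angle in radians between a learned weight vector and a target vector."""
    w = np.asarray(w, dtype=float)
    w_target = np.asarray(w_target, dtype=float)
    cos = np.dot(w, w_target) / (np.linalg.norm(w) * np.linalg.norm(w_target))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

For a spike train correlated with itself, the value at lag zero equals one, and an angular error of zero means that the learned weight vector points in the same direction as the target vector.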

43 in total

Review 1. Synaptic plasticity: taming the beast.

Authors: L F Abbott; S B Nelson
Journal: Nat Neurosci Date: 2000-11 Impact factor: 24.884

2. Stimulus dependence of two-state fluctuations of membrane potential in cat visual cortex.

Authors: J Anderson; I Lampl; I Reichova; M Carandini; D Ferster
Journal: Nat Neurosci Date: 2000-06 Impact factor: 24.884

3. Input synchrony and the irregular firing of cortical neurons.

Authors: C F Stevens; A M Zador
Journal: Nat Neurosci Date: 1998-07 Impact factor: 24.884

4. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity.

Authors: S Song; K D Miller; L F Abbott
Journal: Nat Neurosci Date: 2000-09 Impact factor: 24.884

5. Dynamics of networks of randomly connected excitatory and inhibitory spiking neurons.

Authors: N Brunel
Journal: J Physiol Paris Date: 2000 Sep-Dec

6. Fading memory and kernel properties of generic cortical microcircuit models.

Authors: Wolfgang Maass; Thomas Natschläger; Henry Markram
Journal: J Physiol Paris Date: 2005-11-28

7. Solving the distal reward problem through linkage of STDP and dopamine signaling.

Authors: Eugene M Izhikevich
Journal: Cereb Cortex Date: 2007-01-13 Impact factor: 5.357

8. The tempotron: a neuron that learns spike timing-based decisions.

Authors: Robert Gütig; Haim Sompolinsky
Journal: Nat Neurosci Date: 2006-02-12 Impact factor: 24.884

9. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type.

Authors: G Q Bi; M M Poo
Journal: J Neurosci Date: 1998-12-15 Impact factor: 6.167

10. What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration.

Authors: J J Hopfield; C D Brody
Journal: Proc Natl Acad Sci U S A Date: 2001-01-23 Impact factor: 11.205
