
Error-based or target-based? A unified framework for learning in recurrent spiking networks.

Cristiano Capone¹, Paolo Muratore², Pier Stanislao Paolucci¹.

Abstract

The field of recurrent neural networks is over-populated by a variety of proposed learning rules and protocols. The scope of this work is to define a generalized framework, as a step towards unifying this fragmented scenario. In the field of supervised learning, two opposite approaches stand out: error-based and target-based. This duality gave rise to a scientific debate on which learning framework is the most likely to be implemented in biological networks of neurons. Moreover, the existence of spikes raises the question of whether the coding of information is rate-based or spike-based. To face these questions, we propose a learning model with two main parameters: the rank R of the feedback learning matrix and the tolerance to spike timing τ⋆. We demonstrate that a low (high) rank R accounts for an error-based (target-based) learning rule, while high (low) tolerance to spike timing promotes rate-based (spike-based) coding. We show that in a store and recall task, high ranks allow for lower MSE values, while low ranks enable a faster convergence. Our framework naturally lends itself to Behavioral Cloning and allows for efficiently solving relevant closed-loop tasks, investigating which parameters (R, τ⋆) are optimal to solve a specific task. We found that a high R is essential for tasks that require retaining memory for a long time (Button and Food). On the other hand, this is not relevant for a motor task (the 2D Bipedal Walker). In this case, we find that precise spike-based coding enables optimal performance. Finally, we show that our theoretical formulation allows for defining protocols to estimate the rank of the feedback error in biological networks. We release a PyTorch implementation of our model supporting GPU parallelization.


Year: 2022 PMID: 35727852 PMCID: PMC9249234 DOI: 10.1371/journal.pcbi.1010221

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779

Introduction

When confronted with reality, humans learn with high sample efficiency, benefiting from the fabric of society and its abundance of experts in relevant domains. A conceptually simple and effective strategy for learning in this social context is Imitation Learning. One can conceptualize this learning strategy in the Behavioral Cloning framework, where an agent observes a near-optimal behavior (expert demonstration) and progressively improves its mimicking performance by minimizing the differences between its own and the expert's behavior. Behavioral Cloning can be directly implemented in a supervised learning framework. In recent years, a competition between two opposite interpretations of supervised learning has emerged: error-based approaches [1-5], where the error information computed at the environment level is injected into the network and used to improve later performance, and target-based approaches [6-13], where a target for the internal activity is selected and learned. In this work, we provide a general framework, which we call GOAL (Generalized Optimization of Apprenticeship Learning), where these different approaches are reconciled and can be retrieved via a proper definition of the error propagation structure the agent receives from the environment. Target-based and error-based approaches are particular cases of our comprehensive framework. This novel formulation, being more general, offers new insights on the importance of the feedback structure for network learning dynamics, a still under-explored degree of freedom. Moreover, we remark that spike-timing-based neural codes are experimentally suggested to be important in several brain systems [14-17]. This evidence led us to investigate the role of coding with specific patterns of spikes by introducing a parameter that defines the tolerance to precise spike timing during learning.
Although many studies have approached learning in feedforward [9, 18–22] and recurrent spiking networks [2, 3, 8, 10, 12, 23, 24], very few of them have successfully faced real-world problems and reinforcement learning tasks [3, 25]. In this work, we apply our framework to the problem of behavioral cloning in recurrent spiking networks and show how it produces valid solutions for relevant tasks (button-and-food and the 2D Bipedal Walker). From a biological point of view, we focus on a novel route opened by such a framework: the exploration of what feedback strategy is actually implemented by biological networks and in the different brain areas. We propose an experimental measure that can help elucidate the error propagation structure of biological agents, offering an initial step in a potentially fruitful insight-cloning of naturally evolved learning expertise.

Models

The spiking network

In our formalism, neurons are modeled as real-valued variables v_j^t, where the label j ∈ {1, …, N} identifies the neuron and t ∈ {1, …, T} is a discrete time variable. Each neuron exposes an observable state s_j^t ∈ {0, 1}, which represents the occurrence of a spike from neuron j at time t. The dynamics of the model are defined in Eqs (1)–(3), where Δt is the discrete time-integration step, while τs and τm are respectively the spike-filtering time constant and the membrane time constant. Each neuron is a leaky integrator with a recurrent filtered input obtained via the synaptic matrix w and an external signal. wres = −20 accounts for the reset of the membrane potential after the emission of a spike. vth = 0 and vrest = −4 are the threshold and the rest membrane potential.
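The description above can be turned into a small simulation. The following is a minimal sketch, not the authors' implementation: the exact update equations were not preserved in this text, so a standard discrete-time leaky-integrator form is assumed, and the time constants tau_s = 2 and tau_m = 8 as well as the function name simulate are illustrative choices. Only wres, vth and vrest are taken from the text.

```python
import math

def simulate(w, inputs, dt=1.0, tau_s=2.0, tau_m=8.0,
             v_th=0.0, v_rest=-4.0, w_res=-20.0):
    """Simulate N leaky-integrator neurons for T steps (assumed update form).

    w      : N x N recurrent weights (list of lists)
    inputs : T x N external signal
    Returns (spikes, filtered), each a T x N list of lists.
    """
    n = len(w)
    decay_s = math.exp(-dt / tau_s)
    v = [v_rest] * n          # membrane potentials start at rest
    s = [0] * n               # no spikes before the first step
    s_hat = [0.0] * n         # filtered spikes
    spikes, filtered = [], []
    for ext in inputs:
        # exponential filtering of the spikes emitted at the previous step
        s_hat = [decay_s * sh + (1.0 - decay_s) * sj for sh, sj in zip(s_hat, s)]
        new_v = []
        for j in range(n):
            rec = sum(w[j][k] * s_hat[k] for k in range(n))
            # leaky integration plus reset current after a spike
            vj = ((1.0 - dt / tau_m) * v[j]
                  + (dt / tau_m) * (rec + ext[j] + v_rest)
                  + w_res * s[j])
            new_v.append(vj)
        v = new_v
        s = [1 if vj > v_th else 0 for vj in v]   # threshold crossing
        spikes.append(s)
        filtered.append(s_hat)
    return spikes, filtered
```

With zero input every neuron stays at vrest and never spikes, while a sufficiently strong constant input produces periodic spiking followed by a reset.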

The supervised learning rule

We aim at training the recurrent spiking network to reproduce a desired output. In the framework of behavioral cloning, this output is the behavior of an expert agent (a human or a pre-trained artificial intelligence) that already knows an almost optimal solution of a task (see details in the section Application to closed-loop tasks: Behavioral cloning). In order to train the network to reproduce the desired output vector at each time step, it is necessary to minimize a loss function that measures the difference between the target output and the network output, where the network output is a linear readout of the filtered spiking activity of the network. The filtered activity is defined in Eq (5) as a temporal filtering of the spikes, where Δt is the temporal bin of the simulation and τ⋆ the timescale of the filtering. It is possible to derive the learning rule by differentiating the previous error function, i.e., by following its gradient, similarly to what was done in [3]. The rule involves a pseudo-derivative (similarly to [3]) and the spike response function, which can be computed iteratively. The purpose of the pseudo-derivative is to replace the derivative of the spike-generation function f(⋅), which is non-differentiable (see Eq (3)); it is a peaked function of the membrane potential, and δv is a parameter defining its width. For the complete derivation, we refer to Section A in S1 Text (where we also discuss the ≃ in Eq (6)).
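To make the role of the pseudo-derivative concrete, here is a minimal sketch. The exact expression is not preserved in this text, so the derivative of a logistic sigmoid centred on the threshold is assumed; this is a common surrogate choice and may differ from the published one. The function name and default values are illustrative.

```python
import math

def pseudo_derivative(v, v_th=0.0, delta_v=1.0):
    """Smooth surrogate for the derivative of the spike function f(v).

    Assumed form: derivative of a logistic sigmoid centred on v_th
    with width delta_v.
    """
    x = (v - v_th) / delta_v
    # the function is even in x, so exp(-|x|) gives a numerically stable form
    e = math.exp(-abs(x))
    return e / (delta_v * (1.0 + e) ** 2)
```

Under this assumption the surrogate is largest at the threshold and decays symmetrically on both sides, with delta_v controlling how quickly it falls off.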

Results

In the following sections, we define a generalized learning framework by identifying two sensitive parameters: the number of constraints R and the sensitivity to precise temporal coding τ⋆. We analytically show how different learning rules presented in the literature can be accounted for as specific cases of our framework. We provide a geometrical interpretation, and strengthen our statement through numerical experiments. Finally, we test the performances of our learning rule as a function of the two main parameters (R, τ⋆) on different tasks: a store and recall task (of a target 3D trajectory), a navigation task (button and food) and a motor task (2D bipedal walker).

Theoretical results

Generalization of the learning framework

As discussed above, in Eq (6) the target is the desired output (e.g., the target behavior). However, it is possible to imagine that in both biological and artificial systems there are many more constraints, not directly related to the behavior, to be satisfied. One example is the following: it might be necessary for the network to encode an internal state that is useful to produce the behavior and to solve the task (e.g., an internal representation of the position of the agent, contextual information and so on). The encoding of this information can emerge automatically during training; however, suggesting it directly to the network might significantly facilitate the learning process. This signal is referred to as a hint in the literature [23]. For this reason, we introduce a further set of output targets, indexed by k = O + 1, …, R, and define the generalized target, indexed by k = 1, …, R, as the collection of the original output targets and the additional ones. The generalized output is the signal decoded from the network activity through a linear readout, and it is constrained to be as similar as possible to the generalized target (see Fig 1A). By definition, the generalized readout matrix is the same as the original readout matrix but with extra rows. See the section Definition of the additional constraints for details on how these are chosen.

The gradient-based minimization of the error results in the learning rule of Eq (8).

Fig 1

Framework schematics.

We propose a general framework where the dimensionality of the error feedback R and the sensitivity to temporal coding τ⋆ can be varied arbitrarily. (A) Graphical depiction of a general supervised learning framework. R errors (differences between the output and the target output) are projected to the network to evaluate the recurrent weight updates. (B) Graphical depiction of the role of the temporal sensitivity parameter τ⋆. (C) Several learning rules present in the literature (e-prop, full-FORCE, LTTS) can be accounted for as specific cases of our general framework. (D) Target-based approaches define a specific internal solution for network dynamics (red point), while error-based solutions are distributed in a subspace of possible internal states (green line). However, not all the points on the green line are accessible (given the discrete nature of the spiking activity), and the error-based solution can be sub-optimal. (E) When τ⋆ is large, the accessible states are denser, and it is easier to find a good solution with an error-based approach. (F) The MSE for target-based and error-based solutions becomes comparable for large τ⋆ values. Points are averages over five realizations of the minimum MSE (after 10^4 training epochs), with error bars denoting the corresponding standard deviations.

The possibility of broadcasting specific local errors in biological networks has been debated for a long time [26, 27]. On the other hand, the propagation of a target appears to be more coherent with biological observations [28-31]. For this reason, we propose an alternative formulation that propagates targets rather than errors [6, 27]. This can be easily done by writing the target output as the linear readout of a filtered target spike pattern s⋆. We stress that, due to the discreteness of spikes, this equality cannot be strictly achieved, and it is only an approximation. One can simply consider s⋆ to be the solution of the corresponding optimization problem.
The optimal encoding of a continuous trajectory through a pattern of spikes has been broadly discussed in [32]. However, the pattern s⋆ might describe an impossible dynamics (for example, activity that follows periods of complete network silence). For this reason, here we make a different choice. The target s⋆ is the pattern of spikes expressed by the untrained network (where recurrent connections are all set to zero) when the target output is randomly projected as an input (similarly to [8, 9]). It has been demonstrated that this choice allows for fast convergence and encodes detailed information about the target output. With these additional considerations, we can now rewrite our expression for the weight update in terms of the network activity (Eq (10)), which introduces a novel feedback matrix that acts recurrently on the network. The two core new terms are this matrix and the target spike pattern s⋆. The former defines the dynamics in the space of the internal network activities during learning. The latter provides a specific pattern of spikes, which is directly suggested to the network as the internal solution of the task. We interpret the parameter τ⋆ (the time scale of the spike filtering, see Eq (5)) as the tolerance to spike timing of the proposed internal solution s⋆. This clarifies the use of the subscript ⋆ for the timescale τ⋆, since it concerns the target quantities. In Fig 1B we sketch how, for the same spike displacement between the spontaneous and the target activity, the error is higher when τ⋆ is lower. However, as demonstrated in the following sections, the network dynamics only converges to s⋆ for a range of the parameters R and τ⋆.
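The interpretation of τ⋆ as a tolerance to spike timing can be checked numerically. The sketch below assumes a normalized exponential filter (the exact form of Eq (5) is not preserved in this text) and compares the squared error produced by the same 3-step spike displacement for a short and a long τ⋆. The function names and the chosen values are illustrative.

```python
import math

def filter_spikes(spikes, tau_star, dt=1.0):
    """Exponentially filter a binary spike train with time scale tau_star."""
    decay = math.exp(-dt / tau_star)
    out, s_hat = [], 0.0
    for s in spikes:
        s_hat = decay * s_hat + (1.0 - decay) * s
        out.append(s_hat)
    return out

def timing_error(target, actual, tau_star):
    """Summed squared difference between the two filtered trains."""
    a = filter_spikes(target, tau_star)
    b = filter_spikes(actual, tau_star)
    return sum((x - y) ** 2 for x, y in zip(a, b))

T = 200
target = [1 if t == 50 else 0 for t in range(T)]
shifted = [1 if t == 53 else 0 for t in range(T)]   # same spike, 3 steps late
err_sharp = timing_error(target, shifted, tau_star=1.0)
err_broad = timing_error(target, shifted, tau_star=20.0)
```

Under this assumed filter, err_sharp is much larger than err_broad: a short τ⋆ penalizes the displaced spike heavily, while a long τ⋆ tolerates it.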

Definition of the additional constraints

As discussed above, there are many possible choices for the additional constraints (contextual signals, firing-rate regularization and so on), and we make the following one. We first compute the target spiking pattern s⋆ (as described in the section Generalization of the learning framework) and train the network readout (via standard output-error gradient descent with the Adam optimizer) to reconstruct the target output from it. We remark that s⋆ only serves to define the readout weights, and it is not used explicitly when applying the learning rule in Eq (8). Then, we define the expanded readout matrix. Its first O rows are taken from the trained readout matrix, so that the two matrices coincide for k ≤ O. The other rows are drawn randomly from a Gaussian distribution (with the same variance as the trained readout matrix). In summary, the additional constraints of the generalized target are a random linear combination of a hypothetical internal solution s⋆. However, we demonstrate that the internal dynamics converges to s⋆ only when the rank is high.
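The construction of the additional constraints can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, the readout is stored with one row per constraint, and the random rows use the standard deviation of the entries of the trained readout as their scale.

```python
import random
import statistics

def expand_readout(B, R, rng=None):
    """Return an R x N matrix whose first O rows are the rows of B.

    The remaining R - O rows are drawn from a zero-mean Gaussian whose
    standard deviation matches that of the entries of B (assumption).
    """
    rng = rng or random.Random(0)
    O, N = len(B), len(B[0])
    if R < O:
        raise ValueError("R must be at least the output dimension O")
    sigma = statistics.pstdev([x for row in B for x in row])
    extra = [[rng.gauss(0.0, sigma) for _ in range(N)] for _ in range(R - O)]
    return [list(row) for row in B] + extra

def generalized_target(B_ext, s_hat_star):
    """Project filtered target spikes (T x N) through the R x N readout."""
    return [[sum(b * s for b, s in zip(row, s_t)) for row in B_ext]
            for s_t in s_hat_star]
```

The first O components of the generalized target are then the original target readout, and the remaining R − O components are random linear combinations of the filtered target spikes.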

Training protocol

We recap in this section the network training procedure used for rank-feedback control (Figs 1 and 2). First, we construct the target spike pattern from the target output signal: the output signal is randomly projected into the untrained network, alongside the regular input. The spike activity expressed by the untrained network with such input is selected as the internal target s⋆. Then, we train the linear readout to reproduce the target signal from the newly constructed target spike pattern. Note that, at this stage, the readout has only O rows, one per output dimension. The output connectivity matrix is then expanded to include new constraints, resulting in a matrix with R rows. In practice, we extract the R − O new vectors from a Gaussian distribution with zero mean and variance equal to 2 std(B). Finally, the expanded output connectivity matrix is adopted in the recurrent synaptic updates and Eq (8) is used for training.

Fig 2

Parameter exploration.

The target-based limit

We remark that the formulation described above is equivalent to an error-based approach whose output target is the generalized target. When the rank of the feedback matrix is comparable to the number of neurons, the matrix is almost diagonal and the learning rule reduces to a form in which each neuron is driven only by its own error with respect to the target spike pattern. In this case, the training of recurrent weights reduces to learning a specific pattern of spikes [33-37]. In this limit, the model Learning Through Target Spikes (LTTS) [9] is recovered (see Fig 1C), the only difference being the presence of the pseudo-derivative. These two limits are investigated numerically in the section Dimensionality of the solution space. We remark that in the formulation described in Eq (10) it is possible to change the rank of the feedback by directly changing the number of rows in the expanded readout matrix. Numerical results confirming this theoretical prediction are reported in the following section. A major advantage of the target-based limit lies in the implementability and plausibility of the error propagation. In the general error-based case, the performance or activity of neurons has to be read, compared with the target output and then broadcast to all neurons. This process requires time, which is reflected in the non-locality of Eq (10) due to the feedback matrix. To update the weights between neuron i and neuron j it is necessary to know the local errors of all other neurons. On the contrary, the rule derived in the target-based limit only requires the local error of the post-synaptic neuron.
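The contrast between the non-local general rule and the local target-based limit can be written schematically as below. This is a sketch consistent with the verbal description, not a transcription of Eqs (10) and (11), which are not preserved in this text. The symbols are assumptions: D_{ih} denotes the feedback matrix, \hat{s} the filtered spikes, p the pseudo-derivative and e the spike response function.

```latex
\begin{align}
\Delta w_{ij} &\propto \sum_t \Big[ \sum_{h} D_{ih}\,
    \big(\hat{s}^{\star t}_h - \hat{s}^{t}_h\big) \Big]\, p^t_i\, e^t_j
  && \text{(general feedback: errors of all neurons } h \text{ are needed)} \\
\Delta w_{ij} &\propto \sum_t \big(\hat{s}^{\star t}_i - \hat{s}^{t}_i\big)\, p^t_i\, e^t_j
  && \text{(target-based limit: } D \text{ nearly diagonal, local error only)}
\end{align}
```

In the second line the update of w_{ij} depends only on quantities available at the post-synaptic neuron i and the pre-synaptic neuron j, which is the locality property discussed above.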

Interpretation of the framework

The major strength of our formulation is its capability to encompass very different learning approaches in the same framework. We have already noted in the previous section how, in the high-rank and small-τ⋆ limit, one recovers the LTTS model, where a specific pattern of spikes is learned. In the large-τ⋆ regime the precise temporal coding of spikes is blurred out (see Fig 1B), preventing the learning of a specific spike pattern. However, with a high-rank configuration a target rate-based internal solution is still identified: this is the known full-FORCE solution [8], where the learnt input currents induce an internal target activity that is suited for the task. Loosening the learning constraints, i.e. reducing the rank R of the feedback matrix, progressively enlarges the space of internal solutions. When R matches the output dimensionality, we recover the known error-based approach (e.g. e-prop [3]): no internal dynamics is imposed on the system, which is only guided by the projection of the output error. Fig 1C visually represents where these models are located in the (R, τ⋆) plane. Our general description of different learning approaches offers a new tool to better investigate their relationships. Here we provide a geometrical interpretation of our model that intuitively explains the role of the two parameters R and τ⋆ in defining the network's internal solution to a task. Fig 1D–1E represent the error space, i.e. the difference between the network activity and the target activity at a specific time t (two sample neurons are represented). As discussed in the previous section, the target-based learning rule uniquely defines one solution for every neuron i and time t (zero error, represented by the red point in Fig 1D–1E). Thus, the high-rank learning rule (and the target-based one) tries to make the network dynamics converge to the red point (as represented by the black arrow in Fig 1D–1E).
On the other hand, the error-based rule defines a set of possible solutions, specified by one condition for every output k and time t (this can be represented as a line in the space in Fig 1D–1E, green line), in which the MSE is low. In other words, using the low-rank learning rule is equivalent to looking for the solution closest to the green line in Fig 1D–1E (as represented by the black arrow). However, not all the points on the green line are accessible to the network (given the discrete nature of the spiking activity, see Fig 1D, black crosses), and the error-based solution achieved during the learning procedure can be sub-optimal (see Fig 1D, green point, which is far from the green line). Indeed, when τ⋆ is small, the difference between the target and the network activity is essentially unfiltered, and it can assume only the values −1, 0, or 1. However, when τ⋆ is large, the signals produced by the network are filtered, and the possible values of this difference are no longer restricted to −1, 0, or 1. As a result, the accessible states are denser in the space of possible internal solutions (see Fig 1E, black crosses), and it is easier to find a good solution with an error-based approach (see Fig 1E, green point, which is now closer to the green line). This theoretical prediction is confirmed in numerical experiments (see the following section).
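The claim that a larger τ⋆ makes the accessible states denser can be illustrated by brute-force enumeration for a single neuron. The sketch below assumes a normalized exponential filter and short trains of T = 6 steps; the function name and parameter values are illustrative.

```python
import math
from itertools import product

def accessible_errors(tau_star, T=6, dt=1.0, digits=9):
    """Distinct values of (filtered target - filtered activity) for one neuron.

    Enumerates every pair of binary spike trains of length T and collects the
    difference of the filtered traces at the final time step.
    """
    decay = math.exp(-dt / tau_star)

    def last_filtered(train):
        s_hat = 0.0
        for s in train:
            s_hat = decay * s_hat + (1.0 - decay) * s
        return s_hat

    finals = {round(last_filtered(tr), digits) for tr in product((0, 1), repeat=T)}
    return sorted({round(a - b, digits) for a in finals for b in finals})

sharp = accessible_errors(tau_star=0.01)   # essentially unfiltered spikes
broad = accessible_errors(tau_star=5.0)    # strongly filtered spikes
```

With a very short τ⋆ the error takes only the three values −1, 0 and 1, whereas with τ⋆ = 5 the list broad contains many intermediate values, which corresponds to the denser set of black crosses in Fig 1E.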

Numerical results

Store and recall

To investigate the role of the R and τ⋆ parameters in our learning rule (see Eqs (8) and (10)), we considered a store and recall task. The network is asked to reproduce the target trajectory when prompted with a clock-like input signal, which is projected into the network through a random Gaussian matrix with zero mean and variance σinp, with C = 5 (see Fig 2A for a graphical representation of the task).
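The target trajectory for this task is specified in this section as O = 3 signals, each a superposition of the frequencies 1, 2, 3 and 5 Hz with amplitudes drawn uniformly from [0.5, 2.0] and phases from [0, 2π]. A minimal generator is sketched below; the use of cosines, the time step dt (in seconds) and the function name are assumptions.

```python
import math
import random

FREQS_HZ = (1.0, 2.0, 3.0, 5.0)

def make_target(n_steps, dt, n_outputs=3, rng=None):
    """Return an n_steps x n_outputs target built from four sinusoids per output."""
    rng = rng or random.Random(0)
    # one (amplitude, frequency, phase) triple per frequency and per output
    comps = [[(rng.uniform(0.5, 2.0), f, rng.uniform(0.0, 2.0 * math.pi))
              for f in FREQS_HZ] for _ in range(n_outputs)]
    return [[sum(a * math.cos(2.0 * math.pi * f * t * dt + ph)
                 for a, f, ph in comps[k])
             for k in range(n_outputs)] for t in range(n_steps)]
```

Because all frequencies are integer multiples of 1 Hz, each signal generated this way repeats with a period of one second.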

Parameter exploration.

(A) We benchmarked our framework on a store and recall task. The network has to autonomously reproduce a target 3D trajectory, given a clock-like input (bottom). Dashed line: target output. Solid line: network output at the end of the training for low and high rank conditions (green and blue respectively). (B) MSE (between the target output and the network output) as a function of training epochs, in the store and recall task of a 3D trajectory. Low rank (R = 3, green) and high rank (R = N, red) performances are compared. (C) Spike error as a function of training epochs, for the high and low rank learning rules. (D) The MSE (color-coded) as a function of the rank R and the timescale τ⋆. (E) The spike error (color-coded) as a function of the rank R and the timescale τ⋆. (F) Convergence time (color-coded) as a function of the rank R and the timescale τ⋆.

The target output is a temporal pattern composed of O = 3 independent continuous signals. Each target signal is specified as the superposition of the four frequencies f ∈ {1, 2, 3, 5} Hz with uniformly extracted random amplitude A ∈ [0.5, 2.0] and phase ϕ ∈ [0, 2π]. We defined the additional constraints as described in the section Definition of the additional constraints and used the learning rule in Eq (8). Given this formulation, we can arbitrarily modulate the parameters R and τ⋆. First, we validated the intuition (see Fig 1D and 1E) that a larger τ⋆ time constant, with the consequent enrichment of available network states, should progressively erase the difference between an error-based (low-rank) and a target-based (high-rank) learning approach. We considered the two scenarios of low rank (error-based) and high rank (target-based) and trained the network until convergence for increasing values of τ⋆. The results, collected in Fig 1F, clearly illustrate how the difference between the two approaches vanishes for increasing τ⋆. In Fig 2B and 2C we report the output error (measured as the MSE) and the spike error as a function of training epochs for a particular choice of the τ⋆ parameter.
Fig 2C shows that for a high rank, the internal activity of the network converges exactly to the proposed internal target s⋆. This confirms that learning with a high rank is equivalent to a target-based approach. On the other hand, when the rank is low, this does not happen, and the network autonomously finds an alternative internal dynamics that is different from s⋆ but still produces an output similar to the target output (as shown in Fig 2B). Both methods achieve low output errors (Fig 2B), with the high-rank approach eventually scoring a lower MSE (the readout limit, i.e. the lowest achievable error given the pre-trained readout matrix), while the low-rank approach allows for a faster convergence. To better grasp the interplay between the rank and the τ⋆ parameters, we trained several instances of the same task to explore the model behavior in the full (R, τ⋆) plane. We measured the output error and the spike error (Fig 2D and 2E), plus an estimate of the convergence time Tconv (Fig 2F), quantified as the number of epochs needed to halve the initial output error. Only high-rank feedback achieved low spike errors (Fig 2E), with a non-trivial dependence on the optimal τ⋆. The LTTS algorithm was found to be the most robust in this sense, reliably achieving low spike errors for a broad range of τ⋆. Interestingly, the MSE metric (Fig 2D) highlighted two regions of low output error, corresponding either to pure error-based (R = O) or high-rank learning, each with a different optimal τ⋆. Finally, the convergence time highlights how a low rank systematically allows for a faster convergence (Fig 2F, light region in the bottom-right part of the panel). A possible explanation for this is that, while the target-based (high-rank) solution is unique, there are many possible error-based (low-rank) solutions. For this reason, it is easier to find a nearby error-based solution starting from a random initial condition in the network activity space (see Fig 1D), while it requires more time to reach the target-based solution. However, training slows for low rank and low τ⋆, with magenta-colored conditions signaling failure to reduce the output error by half.
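The convergence-time measure used above (the number of epochs needed to halve the initial output error) can be computed from a recorded error curve as in the following sketch. The function name is hypothetical, and returning None for curves that never halve is a choice made here to represent the failure cases.

```python
def convergence_time(errors):
    """First epoch at which the error is at most half of its initial value.

    Returns None when the error is never halved.
    """
    if not errors:
        raise ValueError("need at least one epoch")
    threshold = errors[0] / 2.0
    for epoch, err in enumerate(errors):
        if err <= threshold:
            return epoch
    return None
```

For example, the curve [1.0, 0.8, 0.5, 0.2] gives a convergence time of 2 epochs, while [1.0, 0.9, 0.8] gives None.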

Dimensionality of the solution space

The learning formulation of Eq (10) offers major insight into the role played by the feedback matrix. Consider the learning problem (with fixed input and target output) where the synaptic matrix w is refined to minimize the output error (by converging to the proper internal dynamics). The learning dynamics can be easily pictured as a trajectory in which a single point s_n is a complete history of the network activity at epoch n, with n = 1, …, E, where E is the total number of learning epochs. Upon initialization, a network is located at a point s0 marking its untrained spontaneous dynamics. The following point s1 is the activity produced by the network after applying the learning rule defined in Eq (10), and so on. By inspecting Eq (10) one notes that a sufficient condition for halting the learning is that the error fed back through the feedback matrix falls below ϵ in magnitude, where ϵ is an arbitrarily small positive number. If ϵ is small enough, this condition can be written as Eq (13). In the limit of a full-rank matrix (for example, the LTTS limit where the feedback matrix is diagonal) the only solution to Eq (13) is the one in which the network activity equals the target, and the learning halts only when the target is cloned. When the rank is lower, the solution to Eq (13) is not unique, and the dimensionality of possible solutions is defined by the kernel of the feedback matrix (the collection of vectors λ that the matrix maps to zero). By the rank-nullity theorem, the dimension of this kernel is N − R. We ran the store and recall experiment in order to confirm our theoretical predictions. We repeated the experiment for different values of the rank R. The readout matrix is defined through the Kronecker delta δ (i = 1, …, N); the analysis for the case of a random matrix provides analogous results and is reported in Fig C in S1 Text. When the rank is N, different replicas of the learning (different initializations of the recurrent weights) converge almost to the same internal dynamics s⋆. This is reported in Fig 3A (left), where a single trajectory represents the first 2 principal components (PC) of the vector of differences between the target and the network activity. The convergence to the point (0, 0) represents the convergence of the dynamics to s⋆.
When the rank is lower (see Fig 3A, right), different realizations of the learning converge to different points, distributed on a line in the PC space. This can be generalized by investigating the dimension of the convergence space as a function of the rank. The dimension of this vector evaluated in the trained network is estimated from the principal component variances λ, normalized to one (∑ λ = 1). We found a monotonic relation between the dimension of the convergence space and the rank (see Fig 3B; more information on the PC analysis and the estimation of the dimensionality is given in Section B in S1 Text). This observation confirms that when the rank is very high, the solution is strongly constrained, while when the rank becomes lower, the internal solution is free to move in a subspace of possible solutions. We suggest that this measure can be used in biological data to estimate the dimensionality of the learning constraints in biological neural networks from the dimensionality of the solution space.
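A dimensionality estimate based on normalized principal-component variances can be sketched as follows. The exact estimator is not preserved in this text; the participation ratio, 1 / ∑ λ², is assumed here because it is a common choice that uses exactly the normalized variances described above. The function name is hypothetical.

```python
def participation_ratio(variances):
    """Effective dimensionality from principal-component variances."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("variances must have a positive sum")
    lam = [v / total for v in variances]      # normalise so that sum(lam) == 1
    return 1.0 / sum(l * l for l in lam)
```

With this estimator, variance spread evenly over n components gives a dimension of n, while variance concentrated in a single component gives a dimension of 1.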

Fig 3

Error propagation and dimensionality of the internal solution.

(A) Dynamics along training epochs of the difference between the target and the network activity in the first two principal components, for different repetitions of the training with variable initial conditions. The error propagation matrix has maximum rank (R = N, target-based limit). (B) Same as in (A), but with an error propagation matrix of lower rank. (C) Dimensionality of the solution space as a function of the rank of the error propagation matrix. (D) Dynamics along training epochs of the same difference in the first two principal components, when white noise is included in the synaptic dynamics. (E) Estimation of the dimensionality of the solution space, sampled thanks to fluctuations induced on the synaptic weights, as a function of the rank of the error propagation matrix.


Dimensionality estimation on single trial

The dimensionality estimation described above requires knowledge of s⋆ and the repetition of several realizations of the training procedure. However, this information is not available in an experimental setup. For this reason, we propose an alternative approach that could be directly applied to experimental data. We perform only one realization of the training, but we assume the presence of noise on the synaptic dynamics, modeled by adding to Eq (10) a white-noise term given by a normal random variable scaled by ϵ = 0.1. In Fig 3C we report the dynamics along training epochs of the difference between the target and the network activity in the first two principal components. Since s⋆ is not experimentally accessible, we replaced it with the internal dynamics of the network at the end of the training. We observe that a first phase is dominated by the learning dynamics, while the second phase is dominated by the synaptic noise. This second phase allows exploring the space of possible internal solutions, even without running the training experiment several times. By estimating the dimensionality of this sampled space (in the same way as described in the previous section) we observe a monotonic dependence between the rank of the feedback matrix and the dimensionality of such a space (see Fig 3D). This methodology could be directly applied to data, providing an estimate of the dimensionality of the space of possible internal solutions to the same problem. We suggest that this could be directly related to the structure of the feedback during training, as demonstrated in our model.
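The noisy synaptic dynamics can be sketched as a weight update with an additive Gaussian term on every synapse. The exact noisy equation is not preserved in this text, so the form below (noise added directly to the update, not scaled by the learning rate) is an assumption, and the function name is hypothetical.

```python
import random

def noisy_update(w, dw, eps=0.1, rng=None):
    """Apply a weight update with additive white noise on every synapse.

    w, dw : N x N lists of lists (current weights and deterministic update)
    eps   : noise amplitude multiplying a standard normal variable
    """
    rng = rng or random.Random(0)
    return [[w_ij + dw_ij + eps * rng.gauss(0.0, 1.0)
             for w_ij, dw_ij in zip(w_row, dw_row)]
            for w_row, dw_row in zip(w, dw)]
```

Once the deterministic update has become small, repeated application of such a step makes the weights fluctuate around the learned solution, which is the second phase described above.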

Application to closed-loop tasks: Behavioral cloning

We face the general problem of an agent interacting with an environment with the purpose of solving a specific task. This is in general formulated in terms of an association, at each time t, between a state vector and an action vector. The agent evaluates its current state and decides an action through a policy. Two possible and opposite strategies to approach the problem of learning an optimal policy are Reinforcement Learning and Imitation Learning. In the former, the agent proceeds by trial and error and the most successful behaviors are potentiated. In the latter, the optimal policy is learned by observing an expert that already knows a solution to the problem. Behavioral Cloning belongs to the category of Imitation Learning, and its scope is to learn to reproduce a set of expert behaviors (actions), indexed by k = 1, …, O (where O is the output dimension), given a set of states, indexed by h = 1, …, I (where I is the input dimension). Our approach is to explore the implementation of Behavioral Cloning in recurrent spiking networks. In what follows, we assume that the action of the agent at time t is evaluated by a recurrent spiking network and can be decoded through a linear readout of the filtered spikes, where the filtering is analogous to that in Eq (1) but with time scale τ⋆. The network is trained to reproduce the target behavior of the expert.

Button-and-food task

To investigate the effects of the rank of feedback matrix, we design a button-and-food task (see Fig 4A for a graphical representation), which requires for a precise trajectory and to retain the memory of the past states. In this task, the agent starts at the center of the scene, which features also a button and an initially locked target (the food). The agent task is to first push the button so to unlock the food and then reach for it. We stress that to change its spatial target from the button to the food, the agents has to remember that it already pressed the button (the button state is not provided as an input to the network during the task). In our experiment we kept the position of the button (expressed in polar coordinates) fixed at rbtn = 0.2, θbtn = −90° for all conditions, while food position had rfood = 0.7 and variable θfood ∈ [30°, 150°]. The agent learns via observations of a collection of experts behaviours, which we indicate via the food positions . The expert behavior is a trajectory which reaches the button and then the food in straight lines (T = 80). The network receives as input (I = 80 input units) the vertical and horizontal differences of both the button’s and food’s positions with respect to agent location ( respectively). These quantities are encoded through a set of tuning curves. Each of the Δ values are encoded by 20 input units with different Gaussian activation functions. Agent output is the velocity vector v (O = 2 output units). We used η = ηRO = 0.01 (with Adam optimizer), moreover τRO = 10ms. Agent performances are measured by defining a reward function r that considers the importance to push the button before taking the food: where is the button-state indicator variable that is zero when the button is locked and one otherwise, the are the agent and target position vectors and d(⋅, ⋅) is the standard euclidean distance. 
We repeated training for different values of the rank of the feedback matrix, which is defined through the Kronecker delta δ (the analysis for the random case provides analogous results and is reported in Section C.2 in S1 Text), in a network of N = 300 neurons, and compared the overall performances (more information in Section C.2 in S1 Text). Fig 4B and 4C report the rastergram for 100 random neurons and the dynamics of the membrane potential for 3 random neurons during a task episode. Fig 4D shows an example of the action trajectories (the two components of the velocity v, in red and green respectively): the target ones (dashed lines) and the ones reproduced by the network (solid lines). In Fig 4E we report the agent training trajectories, color-coded for the final reward, and the rewards obtained by the network after the behavioral cloning. All the training conditions show good convergence. In Fig 4F the final reward is reported as a function of the target angle θfood for different ranks (ranks are color-coded using the same scheme as Fig 4G, and purple arrows indicate the training conditions). As expected, the reward is highest at the training conditions. Moreover, it can be readily seen that high-rank feedback structures allow for superior performance in this task. Finally, in Fig 4G the average reward across all target conditions is reported as a function of the rank, further highlighting the benefits of a high-rank feedback structure for this task.

Fig 4

Button-and-food task.

(A) Sketch of the task. An agent starts at the center of the environment domain (left) and is asked to reach a target. The target is initially “locked”. The agent must unlock the target by pushing a button (middle) placed behind and then reach for the target (right). (B) Rasterplot of the activity of a random sample of 100 neurons across 80 time units of a task episode. (C) Temporal dynamics of the membrane potential of three example units. (D) Target v (dashed lines), the velocity direction in the bi-dimensional plane, and the one reproduced by the network after the behavioral cloning (solid lines). (E) Example trajectories produced by a trained agent for different target locations. Purple arrows depict the positions of the food for the observed expert behaviors. (F) Final reward obtained by a trained agent as a function of the target position (measured by the angle θ with a fixed radius of r = 0.7 as measured from the agent starting position). Individual lines are average values over 10 repetitions. Colors code for different ranks of the error propagation matrix. (G) Average over all the target positions of the final reward as a function of the rank. Error bars represent the standard deviation of the mean.


2D Bipedal Walker

We benchmarked our behavioral cloning learning protocol on the 2D Bipedal Walker standard task provided through the OpenAI gym (https://gym.openai.com [38], MIT License). The environment and the task are sketched in Fig 5A: a bipedal agent has to learn to walk and to travel as long a distance as possible. The expert behavior is obtained by training a standard feed-forward network with PPO (Proximal Policy Optimization [39]; in particular, we used the code provided in [40], MIT License). The sequence of state-action pairs is collected in vectors indexed by k = 1, …, O for the actions and h = 1, …, I for the states, for t = 1, …, T, with T = 400, O = 4, I = 14 (we excluded the LIDAR inputs; see Fig 5C for an example of the state-action trajectories). The average reward obtained by the expert is 〈r〉 ≃ 180, while a random agent achieves 〈r〉 ≃ −120. We performed behavioral cloning by using the learning rule in Eq (10) in a network of N = 500 neurons. We chose the maximum rank and evaluated the performance for different values of τ⋆ (more information in Section D in S1 Text). Fig 5B and 5C (bottom) report the rastergram for 100 random neurons and the dynamics of the membrane potential for 3 random neurons during a task episode. For each value of τ⋆ we performed 10 independent realizations of the experiment. For each realization, the [Formula: see text] is computed, and the recurrent weights are trained by using Eq (10). The optimization is performed using gradient ascent and a learning rate η = 1.0. In Fig 5D we report the spike error at the end of the training. The internal dynamics almost perfectly reproduces the target pattern of spikes for τ⋆ < 0.5ms, while the error increases for larger values. The readout time-scale is fixed to τRO = 5ms while the readout weights are initialized to zero and the learning rate is set to ηRO = 0.01. Every 75 training iterations of the readout we test the network and evaluate the average reward 〈r〉 over 50 repetitions of the task.
We then evaluated the maximum 〈r〉 obtained for each episode (and averaged it over the 10 realizations). Fig 5E reports the average of the maximum reward as a function of τ⋆. The monotonically decreasing trend suggests that learning a specific pattern of spikes (τ⋆ → 0) enables optimal performance in this walking task. We stress that in this experiment we used a clamped version of the learning rule. In other words, we substituted [Formula: see text] for [Formula: see text] in the evaluation of [Formula: see text] in Eq (7). This choice, which is only possible when the maximum rank is considered, allows for faster convergence and better performance. The results for the non-clamped version of the learning rule are reported in Section D.2 in S1 Text.
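The evaluation protocol described above (periodic tests, averaging over repetitions, taking the best test, then averaging across realizations) can be summarized with the following sketch. The function names and the toy reward values are hypothetical and only serve to make the aggregation explicit; they are not results from the paper.

```python
from statistics import mean

def best_average_reward(test_rewards):
    """Best test of one realization.

    test_rewards: list over test points (one every 75 readout iterations in
    the text) of lists of per-repetition rewards (50 repetitions in the text).
    Returns the maximum over test points of the repetition-averaged reward.
    """
    return max(mean(reps) for reps in test_rewards)

def summarize(realizations):
    """Average the best reward across independent realizations (10 in the text)."""
    return mean(best_average_reward(r) for r in realizations)

# Toy numbers: 2 realizations, 3 test points each, 2 repetitions per test.
realizations = [
    [[-100.0, -80.0], [50.0, 70.0], [40.0, 20.0]],
    [[-120.0, -100.0], [10.0, 30.0], [90.0, 110.0]],
]
summary = summarize(realizations)
```

Repeating this summary for each value of τ⋆ would give one point per value of the curve described for Fig 5E.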

Fig 5

2D Bipedal Walker.

(A) Representation of the 2D Bipedal Walker environment. The task is to successfully control the bipedal locomotion of the agent; reward is measured as the travelled distance along the horizontal direction. The agent receives a state vector containing several measurements, such as joint positions, velocities and LIDAR for environment sensing, and outputs the torque vector for the four leg joints. (B) Rasterplot of the activity of a random sample of 100 neurons across T = 100 time units of a task episode. (C) Temporal dynamics of a subset of the core input state variables, action vector and spike dynamics. Top panels report respectively: v, the velocity vector in the bi-dimensional plane; the angles of the two leg joints, with colors matching the scheme of panel A; and the action vector a containing the torque τhip,knee for the two joints of the left leg. (Bottom) Temporal dynamics of membrane potential for three randomly sampled neurons. (D) Average spike error ΔS as a function of τ⋆. Error bars represent the standard deviation of the mean. (E) Average final episode reward as a function of τ⋆. Error bars represent the standard error.


Discussion

Despite experimental, theoretical, and computational progress, neuroscience is still a relatively young field of study. A sign of this is the fragmented panorama of theories and models proposed in the literature. In recent years, theoretical neuroscientists have formulated new frameworks attempting to provide more general explanations of aspects of intelligence and learning [41, 42]. In this work we contribute to this generalization effort by providing a general framework that can account for different learning approaches by modulating two relevant parameters: the rank of the feedback error propagation and the tolerance to precise spike timing τ⋆ (see Fig 1C). We argue that many proposed learning rules can be seen as specific cases of our general framework (e-prop, LTTS, full-FORCE). In particular, the generalization on the rank of the feedback matrix allowed us to understand target-based approaches as emerging from error-based ones when the number of independent constraints is high. Moreover, we found that different values of the rank lead to different dimensionality of the solution space. If we see learning as a trajectory in the space of internal dynamics, when the rank is maximum, every training run converges to the same point in this space. On the other hand, when the rank is lower, the solution is not unique, and the possible solutions are distributed in a subspace whose dimensionality is inversely proportional to the rank of the feedback matrix. We suggest that this finding can be used to produce experimental observables from which to deduce the actual structure of error propagation in different regions of the brain. On a technological level, our approach offers a strategy to clone on a (spiking) chip an expert behavior either previously learned via standard reinforcement learning algorithms or acquired from a human agent.
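The qualitative link between the number of independent constraints and the size of the solution space can be illustrated with a toy linear analogy. This is not the dimensionality measure used in the paper: it assumes purely linear constraints on a small parameter vector, where the number of unconstrained directions is simply the number of parameters minus the rank of the constraint matrix. The names `matrix_rank` and `free_directions` are hypothetical helpers written for this sketch.

```python
import random

def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) via Gaussian elimination with pivoting."""
    m = [list(r) for r in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Choose the row with the largest entry in this column as pivot.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank

def free_directions(n_params, n_constraints, rng):
    """Unconstrained directions left by random linear constraints."""
    constraints = [[rng.gauss(0.0, 1.0) for _ in range(n_params)]
                   for _ in range(n_constraints)]
    return n_params - matrix_rank(constraints)

rng = random.Random(0)
N = 6  # toy number of parameters
free = [free_directions(N, D, rng) for D in range(1, N + 1)]
```

In this toy setting the solution becomes unique only when the number of independent constraints reaches the number of parameters, which mirrors the statement that every training converges to the same point at maximum rank.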
Our formalism can be directly applied to train an agent to solve closed-loop tasks through a behavioral cloning approach. This allowed us to solve tasks that are relevant in the reinforcement learning framework by using a recurrent spiking network, a problem that has been successfully tackled by only a very small number of studies [3]. Moreover, our general framework, encompassing different learning formulations, allowed us to investigate what learning method is optimal to solve a specific task. We demonstrated that a high number of constraints can be exploited to obtain better performance in a task that requires retaining a memory of the internal state for a long time (the state of the button in the button-and-food task). On the other hand, we found that a typical motor task (the 2D Bipedal Walker) strongly benefits from precise timing coding, which is probably due to the need to master fine movement control to achieve optimal performance. In this case, a high rank in the error propagation matrix is not particularly relevant. From the biological point of view, we conjecture that different brain areas might be located in different positions in the plane presented in Fig 1C.

Limitations of the study

We chose relevant but very simple tasks in order to test the performance of our model and understand its properties. However, it is very important to demonstrate whether this approach can be successfully applied to more complex tasks, e.g. tasks requiring both long-term memory and fine motor skills. It would be of interest to measure the optimal values of both the rank of the feedback matrix and τ⋆ in a more demanding task. Finally, we suggested that our framework allows for inferring the error propagation structure. However, we observe that the measure we proposed is indirect, since it is necessary to estimate the dimensionality of the solution space first and then deduce the dimensionality of the learning constraints. A future development of the theory might be to formulate a method that directly infers from the data the laws of the dynamics in the solution space induced by learning.

Supporting Information document.

(PDF)

6 Jan 2022

Dear Mr. Muratore,

Thank you very much for submitting your manuscript "Error-based or target-based? A unifying framework for learning in recurrent spiking networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Michele Migliore
Associate Editor
PLOS Computational Biology

Daniele Marinazzo
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors propose an interesting model to unify target and error-based learning rules. The proposed framework defines a spectrum of learning regimes thanks to the manipulation of the number of constraints imposed during training. Such a number of constraints is directly defined thanks to the rank of the error propagation matrix. The idea underlying the work has the potential to constitute a relevant contribution to the field. However, the manuscript should be restructured to be clearer and more simulations are needed.

Major:

- The first main concern regards the tasks adopted to demonstrate the advantages of the framework. For the first task, a high rank of the error propagation matrix seems to be beneficial. For the second application, the importance of the rank is not shown and the authors state that it is almost irrelevant. Thus, the work demonstrates the advantage of a target-based learning rule, and a successful example of the framework when the system is closer to an error-based regime is missing. The lack of this demonstration negatively impacts the applicability of the framework and the understanding of the reader. At the end of the paper, the reader is left wondering why the model, whose aim is to provide a unification between the two learning paradigms, has been exploited only in the target-based regime. Thus, I suggest demonstrating some advantages of the system when the rank of the matrix does not correspond to the number of nodes.
If the latter is not achievable, the authors should clearly explain why their model can still offer advantages in comparison to other published alternatives. For instance, this could be accomplished by comparing the performance of the model proposed with e-prop, LTTS, or full-FORCE. The convergence time of the model could be also reported for the different cases explored, providing an additional metric to evaluate the performance in the different regimes.

- Relative to the above, it would also be important to show what is the impact of the additional constraints Y_{k, init}. A number of targets equal to N can be set by simply using the untrained network, and it would be beneficial to know what is the value of O (Eq. 10) in the different cases.

- The paper is difficult to read and should be structured differently. In general, the paper feels fragmented and multiple sections could be merged. The authors should introduce the neuronal model and the network architecture used, and then introduce the learning rule adopted. The additional constraints, which are fundamental for the work, are briefly explained in 2.1.5, while other targets are introduced in Section 2.1.4 by using an untrained network with the desired output as input similarly to the full-FORCE algorithm. The complete set of constraints is unclear. I would suggest explaining the overall model in one section and drawing an explicit diagram that depicts also the untrained network, the different targets, and where these are coming from. After such an overall definition, the authors could describe the different learning regimes. Also, it is important to describe how the system is initialized since the untrained network is used for learning.

Minor:

- Fig.1 D should be explained in more detail since it provides information on how the model fits in the literature.

- The notation needs to be unified across the paper.
For instance, before the constraints are called q* and then later the authors have introduced Y_{k, init}.

- The statement "The first..." at the end of Section 2.1.5 is unclear.

- Section 2.2.2 and 2.2.1 could be merged, and the title of section 2.2.2 removed. I suggest explaining the second "noisy" methodology to estimate the dimensionality of the solution as an alternative, and to give minor emphasis to biological data. In the end, biological data have not been used in the paper. Instead, the authors could introduce this alternative approach and its advantages (since repetitions are not needed) and conclude with a note on biological plausibility.

- Fig. 3 is almost obvious and could go to Supplementary Material. In contrast, Fig. 1 D shows how different models adopt diverse characteristic times. The authors could comment on this and provide some understanding of the relation between timescale and model. Are the target-based models more flexible? Is the framework proposed more flexible to changes in the characteristic time?

Reviewer #2: This is an interesting study that makes a substantial contribution to the topic of learning in the context of network dynamics. The authors study the differences between error-based and target-based approaches, both analytically and numerically, also applying their approach to two ‘real-world’ tasks. I believe that this work can be of interest to a relatively broad readership, such as the network science community, or researchers studying learning, to name a few. Nevertheless, I still have a few small comments for small improvements that I go over below. Also, although the text is understandable, I would advise to have the manuscript proofread by a native English speaker before publication. Likewise, abbreviations used should be initially defined (e.g. the use of ‘LTTS’).
Introduction: As this paper is aimed at a relatively broad and not specific audience, I think it would be helpful for the reader if the topic was introduced a bit better. For instance, I don’t understand what this sentence comes to say: “When first confronted with reality, humans learn with high sample efficiency, benefiting from the fabric of society and its abundance of experts in all relevant domains.” Are you referring to babies in this case? If so, do they benefit from experts in different domains? Or do you simply refer to their parents?

“Moreover, we observe that spike-timing-based neural codes are experimentally suggested to be important in several brain systems” You mention ‘we observe’ - perhaps this is a minor detail, but you go on to cite other papers, so this isn’t your observation. Also, I think it would be helpful to explain what ‘spike-timing-based neural codes’ are and why they are important. The connection between this part and the rest of the intro isn’t clear and it comes a bit out of nowhere. I think it should be a paragraph on its own, with an explanation of the motivation for this section of the paper.

Lastly, before we move on to the results section, it might be valuable to prepare the reader for what comes up next. You will explore these questions theoretically and then numerically (each with several separate steps). A clarification of the logical steps throughout could be useful.

Results: As mentioned, there are a few small language issues. For instance in section 2.1.3 “propose an alternative formulations allowing to evaluate target” should be corrected. I think that the paper could either be re-structured, or, at least, that the authors would make better connections between the different sections. For instance, in section 2.2.3 you start by saying: “As we discussed above the τ⋆” Where is this discussed above? It isn’t anywhere in section 2.2.2. Why is this here now? In the experiment section, many choices are made by the authors.
Although this is of course necessary, no explanation for the reasoning behind these choices is given (e.g.: “The readout time-scale is fixed to τRO = 5ms while the readout weights are initialized to zero and the learning rate is set to ηRO = 0.01.”). Why are these values chosen? I think there should be some kind of ‘sensitivity analysis’, showing that similar results would be gained with slight variations of these parameters, or, at least, giving justification for them.

Discussion: The discussion is a bit disappointing, as it is very brief and doesn’t really engage with the literature. If this work isn’t placed adequately within the existing studies, it can lose much of its relevance. Also, there is no discussion with regards to the limitations of this study. On a more specific note (and perhaps this is a bit outside the usual scope of the field), I think that it would be interesting to consider, and discuss, the consequence of non-experts in the vicinity of the agent. In this paper, you assume that the agent learns from experts that surround it. This is a fair starting assumption, but in reality we usually have heterogeneous populations where actually there is a chance of learning undesirable traits (by imitation).

Figure 3: I think that in subplots A and B it would be better to name the x-axis ‘number of iterations’. The caption of Panel C reads “Scatter plot of mse vs ΔS for different values of τ⋆.” But on the plot itself we have the absolute value of ΔS. This should be resolved. Also, this isn’t a scatter plot. This is also color-coded, but compared with plots A and B, this isn’t mentioned. As the color coding is equivalent in all three, this can be mentioned once for the whole figure.

Figure 4: Subplot C isn’t helpful in its current form. Either show fewer examples, colour-code them differently (keeping everything blue makes it harder to distinguish), or expand the size of the plot. For subplot D please write explicitly what the dashed and solid lines are.
In Panel G - what do the error bars show? Please state this explicitly. Also, what is the color gradient? Same as in F? If so it should be placed on the right of G, or G should have its own.

Figure 5: Subplot B should be split into two (as it has two subplots). So the bottom one should have its own letter referring to it. Subplot D has the same problem as Fig4C.

Reviewer's conclusion: In conclusion, I think this is a nice and insightful piece of work, and I would therefore recommend it for publication after minor revision. As the changes suggested are minor, the editor may decide to accept them without my further involvement. But I am also happy to check the revised version of this manuscript.

Reviewer #3: The authors introduced a general framework for supervised learning in recurrent spiking networks, with two main parameters, the rank of the feedback error propagation and the tolerance to precise spike timing. And they analyzed the learning performance for different parameters. The research topic is very important for not only constructing efficient machine learning methods but also understanding biological neural networks. However, I have some concerns below.

Major comments

- While error-based learning is familiar, target-based learning is a bit unfamiliar to many readers. Therefore, I recommend that the authors add more explanations about target-based learning in the Introduction. Also, they should describe the critical differences in the characteristics and learning performances between the two. Moreover, LTTS and full-Force, which appear in Results, should be explained briefly.

- The authors state that ‘Moreover, our general framework, encompassing different learning formulations, allowed us to investigate what learning method is optimal to solve a specific task.’ in the Discussion. However, only two tasks, the button-and-food task and the 2D Bipedal Walker, were used for checking the performance in the manuscript.
It seems that the best performance was achieved at a higher rank of error propagation matrices. What kind of task favors a lower rank of error propagation matrices? After all, the higher the rank of D, the better it is? If so, the claim ‘different brain areas might be located in different positions in the plane presented in Fig.1C’ is invalid for the rank of D (error-based or target-based learning).

Minor comments

- Clarify the definition of \delta_t and \tau_star in Eq. (2).

- Why is the star mark in \tau of Eq.(2)? It’s confusing to me because it looks like a desired time scale parameter.

- Fig. 4F: We cannot see the error bars in F, but there is a description of error bars in the caption.

- Fig. 4G: We can see the messy graph around 300.

- In p.9, Fig.5D, Fig.5G -> Fig.4D, Fig.4G

- What’s the reward? The performance r?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

11 Mar 2022

Submitted filename: PCB - Reviewers Reply.docx

4 Apr 2022

Dear Mr. Muratore,

Thank you very much for submitting your manuscript "Error-based or target-based? A unifying framework for learning in recurrent spiking networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the recommendations of Rev.1.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Michele Migliore
Associate Editor
PLOS Computational Biology

Daniele Marinazzo
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments.
If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The manuscript has been improved considerably. However, we feel that it is still necessary to improve the clarity of exposition. Practically, the authors propose an interesting analysis and methodology to expand the target dimensionality to control the number of constraints that the network is subjected to. The expansion of the target is random and accomplished by adding random values in the definition of the 'read-out' matrix. The fact that this expansion can help the learning process to discover a better solution is interesting per se. We feel that this result should be clearly introduced as one of the main contributions of the work. The idea is clearly fascinating and the analysis is novel, but it is still not simple to understand the complete training process. We advise the authors to describe, point by point, how learning is accomplished. After careful reading:

- First, the target signal has been processed by the untrained network.
- Then, the read-out has been trained to reproduce the target signal.
- The output connectivity matrix is expanded to include possible constraints.
- Finally, such an output connectivity matrix is adopted in the learning process to adapt the recurrency.

We think it would be better to describe this methodology clearly in one section and to justify the different choices. Since the core of the work lies in the possibility to add constraints, thanks to which it is possible to move in the target/error-based spectrum, we suggest that the authors should introduce the different tasks by saying explicitly what the number of constraints vs the dimensionality of the target is. This can be inferred and it was explicit in your previous response, but saying it also in the text would help the reader.
We have also some comments on the figures:

- Fig. 1D-E. How is this figure generated? The authors should clarify if this is an intuitive explanation of the different regimes of learning or if it was generated quantitatively. The regime where tau/delta_t << 1 would correspond to a simulation where the discretization step is greater than the characteristic time, and I am unsure how this could be practically achieved with numerical methods. Can you clarify this point? Also, the ideal space represented has \delta s_1 and \delta s_2 as x and y-axis, but the line corresponding to the error-based regime is increasing until it reaches the target-based limit for a maximum rank. In that limit, \delta s \approx 0. I would have imagined the target-based limit on the left approximately at zero, and the error-based limit on the right. Increasing the rank would correspond to going from right to left in this case, etc. Probably, the intended ideal x-axis is the rank. Also, what are the circles? Do they simply reflect the density of the possible solutions? I would avoid writing 'good' or 'bad'. I feel that some clarifications are needed, but these panels are surely an interesting addition to the paper.
- Fig. 2C. I can understand that the target-based limit recovers the correct succession of spikes more accurately than the error-based regime, but I am not sure that in this way the result is meaningful. Simply, the error-based regime is characterized by a low-rank matrix and cannot find the correct multi-dimensional succession of spikes. If I understood correctly, the result is an immediate consequence of the methodology adopted. Perhaps, it would be interesting to see if the error-based regime can recover the succession of spikes from the 'original' target, neglecting the additional constraints.

Minor:

- The text has typos, please do extensive proofreading.
- Fig. 2A, please consider using dots for the target to make it visible.
- Fig. 2D-E-F, please fix the labels on the x-axis.
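The reviewer's question about the tau/delta_t << 1 regime in the Fig. 1D-E comment can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the tolerance to spike timing acts as the time constant of an exponential filter applied to the spike train; the function names and the toy spike train are hypothetical. With the exact decay factor exp(-dt/tau) the update stays well defined for any step size and simply returns the unfiltered spikes when tau is much smaller than dt, whereas a forward-Euler update of the same filter diverges once dt exceeds 2 tau.

```python
import math

def exact_filter(spikes, tau, dt):
    """Exponential filter using the exact decay factor exp(-dt / tau)."""
    beta = math.exp(-dt / tau)
    trace, out = 0.0, []
    for s in spikes:
        trace = beta * trace + (1.0 - beta) * s
        out.append(trace)
    return out

def euler_filter(spikes, tau, dt):
    """Forward-Euler integration of tau * d(trace)/dt = s - trace."""
    a = dt / tau
    trace, out = 0.0, []
    for s in spikes:
        trace = trace + a * (s - trace)
        out.append(trace)
    return out

spikes = [0, 1, 0, 0, 1, 0, 0, 0]   # toy spike train (hypothetical)
dt = 1.0

smooth = exact_filter(spikes, tau=5.0, dt=dt)     # tau/dt = 5: spikes are smeared in time
sharp = exact_filter(spikes, tau=0.05, dt=dt)     # tau/dt = 0.05: trace follows the raw spikes
unstable = euler_filter(spikes, tau=0.05, dt=dt)  # same regime with Euler: values blow up

print([round(v, 3) for v in smooth])
print([round(v, 3) for v in sharp])
print(unstable)
```

Read this way, tau/delta_t << 1 is not a badly resolved simulation but the limit in which the filtered trace carries no memory of earlier time steps, so only exact spike times are compared.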
Reviewer #2: The authors have replied to my comments and I therefore recommend the paper for publication.

Reviewer #3: The authors have done a great job in addressing my concerns in the revision of the manuscript.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

5 May 2022

Submitted filename: PCB - Reviewers Reply.pdf

17 May 2022

Dear Mr. Muratore,

We are pleased to inform you that your manuscript 'Error-based or target-based? A unifying framework for learning in recurrent spiking networks' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Michele Migliore
Associate Editor
PLOS Computational Biology

Daniele Marinazzo
Deputy Editor
PLOS Computational Biology

***********************************************************

15 Jun 2022

PCOMPBIOL-D-21-02197R2

Error-based or target-based? A unified framework for learning in recurrent spiking networks

Dear Dr Muratore,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
