
Neurons learn by predicting future activity.

Artur Luczak¹, Bruce L McNaughton¹,², Yoshimasa Kubo¹.

Abstract

Understanding how the brain learns may lead to machines with human-like intellectual capacities. It was previously proposed that the brain may operate on the principle of predictive coding. However, it is still not well understood how a predictive system could be implemented in the brain. Here we demonstrate that the ability of a single neuron to predict its future activity may provide an effective learning mechanism. Interestingly, this predictive learning rule can be derived from a metabolic principle, where neurons need to minimize their own synaptic activity (cost), while maximizing their impact on local blood supply by recruiting other neurons. We show how this mathematically derived learning rule can provide a theoretical connection between diverse types of brain-inspired algorithms, thus, offering a step toward development of a general theory of neuronal learning. We tested this predictive learning rule in neural network simulations and in data recorded from awake animals. Our results also suggest that spontaneous brain activity provides "training data" for neurons to learn to predict cortical dynamics. Thus, the ability of a single neuron to minimize surprise: i.e. the difference between actual and expected activity, could be an important missing element to understand computation in the brain.


Year: 2022 PMID: 35814496 PMCID: PMC9262088 DOI: 10.1038/s42256-021-00430-y

Source DB: PubMed Journal: Nat Mach Intell ISSN: 2522-5839

Introduction

Neuroscience is at the stage biology was at before Darwin. It has a myriad of detailed observations but no single theory explaining the connections between all of those observations. We do not even know if such a brain theory should be at the molecular level, or at the level of brain regions, or at any scale between. However, looking at Deep Neural Networks, which have achieved remarkable results in tasks ranging from cancer detection to self-driving cars, may provide useful insights. Although such networks may have different inputs and architectures, most of their impressive behavior can be understood in terms of the underlying common learning algorithm, called backpropagation [1]. Therefore, a better understanding of the learning algorithm(s) used by the brain could be central to develop a unifying theory of the brain function. There are two main approaches to investigate learning mechanisms in the brain: 1) experimental, where persistent changes in neuronal activity are induced by a specific intervention [2]; and 2) computational, where algorithms are developed to achieve specific computational objectives while still satisfying selected biological constraints [3,4]. Here, we explore an additional option: 3) theoretical derivation, where a learning rule is derived from basic cellular principles, i.e. from maximizing metabolic energy of a cell. Using this approach, we found that maximizing energy balance by a neuron leads to a predictive learning rule, where a neuron adjusts its synaptic weights to minimize surprise: i.e. the difference between actual and predicted activity. Interestingly, this derived learning rule has a direct relation to some of the most promising biologically inspired learning algorithms, like predictive coding and temporal difference learning (see below), and Hebbian based rules can be seen as a special case of our predictive learning rule (see Discussion). 
Thus, our approach may provide a theoretical connection between multiple brain-inspired algorithms, and may offer a step toward development of a unified theory of neuronal learning. There are multiple lines of evidence suggesting that the brain operates as a predictive system [5-10]. However, it remains controversial how exactly predictive coding could be implemented in the brain [4]. Most of the proposed mechanisms involve specially designed neuronal circuits with “error units” to allow for comparing expected and actual activity [11-14]. Whereas those models assume a predictive circuit, we propose an alternative, where there is an internal predictive model within a neuron. Because many basic properties of neurons are highly conserved throughout evolution [15-17], we suggest that a single neuron using a predictive learning rule could provide an elementary unit from which a variety of predictive brains may be built. Interestingly, our predictive learning rule can also be obtained by modifying a temporal difference learning algorithm to be more biologically plausible. Temporal difference learning is one of the most promising ideas of how backpropagation-like algorithms could be implemented in the brain. It is based on using differences in neuronal activity to approximate top-down error signals [4,18-24]. A typical example of such algorithms is Contrastive Hebbian Learning [25-27], which was proven to be equivalent to backpropagation under certain assumptions [28]. Contrastive Hebbian Learning requires networks to have reciprocal connections between hidden and output layers, which allows activity to propagate in both directions (Fig. 1a). The learning consists of two separate phases. First, in the ‘free phase’, a sample stimulus is continuously presented to the input layer and the activity propagates through the network until the dynamics converge to an equilibrium (activity of each neuron achieves steady-state level).
In the second ‘clamped phase’, in addition to presenting the stimulus to the input, the output neurons are also held clamped at values representing the stimulus category (e.g. 0 or 1), and the network is again allowed to converge to an equilibrium. For each neuron, the difference between activity in the clamped ($\hat{x}$) and free ($\check{x}$) phases is used to modify synaptic weights (w) according to the equation:

$$\Delta w_{ij} = \alpha\,(\hat{x}_i \hat{x}_j - \check{x}_i \check{x}_j) \qquad (1)$$

where i and j are indices of pre- and post-synaptic neurons respectively, and α is a small number representing the learning rate. Intuitively, this can be seen as adjusting weights to push each neuron’s activity in the free phase closer to the desired activity represented by the clamped phase. The obvious biological plausibility issue with this algorithm is that it requires the neuron to experience exactly the same stimulus twice in two separate phases, and that the neuron needs to ‘remember’ its activity from the previous phase. Our predictive learning rule provides a solution to this problem by predicting free phase steady-state activity, thus eliminating the requirement for two separate stimulus presentations.
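The two-phase update described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `chl_update`, the learning rate and the toy activity values are all assumptions made for the example.

```python
def chl_update(w, x_free, x_clamped, alpha=0.1):
    """Return new weights after one Contrastive Hebbian Learning step.

    w[i][j] is the weight between units i and j; x_free and x_clamped
    hold the steady-state activity of every unit in the free and clamped
    phases. For brevity this toy version also updates the diagonal
    (i == j), which a real network without self-connections would skip.
    """
    n = len(x_free)
    return [
        [
            w[i][j] + alpha * (x_clamped[i] * x_clamped[j] - x_free[i] * x_free[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy network of three units (illustrative values only).
w = [[0.0] * 3 for _ in range(3)]
x_free = [1.0, 0.5, 0.2]      # equilibrium without a teaching signal
x_clamped = [1.0, 0.6, 1.0]   # equilibrium with the last unit clamped to 1
w_new = chl_update(w, x_free, x_clamped)
```

Note that the update needs both equilibria for the same stimulus, which is exactly the requirement the predictive rule is meant to remove.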

Fig. 1.

Basics of the algorithm.

a. Schematic of the network. Note that activity propagates back-and-forth between hidden and output layers. b. Sample neuron activity in the free phase in response to different stimuli (marked with shades of blue). The free phase responses were used to train a linear model to predict a steady-state activity from the activity at earlier time steps (marked by shaded area; see main text for details). Bottom traces show duration of inputs, and dots represent predicted activity. c. Activity of a neuron in response to a new stimulus with the network output clamped. Initially the network receives only input signal (free phase), but after a few steps the output signal is also presented (clamped phase; bottom black trace). The red dot represents steady-state free phase activity predicted from initial activity (shaded region). For comparison, the dashed line shows a neuron’s activity in the free phase if the output is not clamped. Synaptic weights (w) are adjusted in proportion to the difference between steady-state activity in the clamped phase ($\hat{x}$) and predicted free phase activity ($\tilde{x}$).

For manuscript clarity, first we will describe how our predictive learning rule can be obtained by modifying the Contrastive Hebbian Learning algorithm. Next, we will validate the predictive learning rule in simulation and in data recorded from awake animals, and we will show how our results shed new light on the function of spontaneous activity. The details of the derivation of the learning rule by maximizing neuron energy balance will be presented at the end.

Results

Predictive learning rule and Contrastive Hebbian Learning.

As mentioned earlier, the Contrastive Hebbian Learning algorithm requires the network to converge to steady-state equilibrium in two separate learning phases, thus exactly the same stimulus has to be presented twice. However, this is unlikely to be the case in the actual brain. Here, we propose to solve this problem by combining both activity phases into one, which is inspired by sensory processing in the cortex. For example, in visual areas, when presented with a new picture, there is initially bottom-up driven activity containing mostly visual attributes of the stimulus (e.g. contours). This is then followed by top-down modulation containing more abstract information, e.g., “this object is a member of category x”, or “this object is novel” (Supplementary Fig. 1). Accordingly, our algorithm first runs only the initial part of the free phase, which represents bottom-up stimulus driven activity, and then, after a few steps the network output is clamped, corresponding to top-down modulation. The novel insight here is that the initial bottom-up activity is enough to allow neurons to predict the steady-state part of the free phase activity, and the mismatch between predicted free phase and clamped phase can then be used as a teaching signal. To implement this idea in our model, for each neuron, activity during twelve initial time steps of the free phase ($\check{x}^{(1)},\ldots,\check{x}^{(12)}$) was used to predict its steady-state activity at time step 120: $\check{x}^{(120)}$ (Fig. 1b). Specifically, first, we presented sample stimuli in free phase to train a linear model, such that: $\tilde{x}^{(120)} = \lambda^{(1)}\check{x}^{(1)} + \ldots + \lambda^{(12)}\check{x}^{(12)} + b$, where $\tilde{x}$ denotes predicted activity, λ and b correspond to coefficients and offset terms of the least-squares model, and terms in brackets correspond to time steps. Next, a new set of stimuli was used for which the free phase was run only for the first 12 steps, and from step 13 the network output was clamped (Fig. 1c).
The above least-squares model was then applied to predict free phase steady-state activity for each neuron, and the weights were updated based on the difference between predicted and clamped activity (Methods). Thus, to modify synaptic weights, in Eq. (1) we replaced activity in the free phase $\check{x}$ with predicted activity $\tilde{x}$:

$$\Delta w_{ij} = \alpha\,(\hat{x}_i \hat{x}_j - \tilde{x}_i \tilde{x}_j) \qquad (2)$$

However, the problem is that this equation implies that a neuron also needs to know the predicted activity of all its presynaptic neurons $\tilde{x}_i$, which may not be realistic. To solve this problem, we replaced $\tilde{x}_i$ by the actual presynaptic activity in the clamped phase $\hat{x}_i$, which we validated in network simulations (see the next section). This change leads to the following simplified synaptic plasticity rule:

$$\Delta w_{ij} = \alpha\,\hat{x}_i(\hat{x}_j - \tilde{x}_j) \qquad (3)$$

Thus, to modify synaptic weights, a neuron only compares its actual activity $\hat{x}_j$ with its predicted activity $\tilde{x}_j$, and applies this difference in proportion to each input contribution $\hat{x}_i$.
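The procedure in this section (fit a least-squares model on the first 12 free-phase steps, predict step 120 for unseen stimuli, then update weights from the mismatch between clamped and predicted activity) might look as follows. This is only a sketch under stated assumptions: the function names, the learning rate and the synthetic exponential trajectories are hypothetical and stand in for the network dynamics described in the Methods.

```python
import numpy as np


def fit_predictor(early, steady):
    """Least-squares fit of steady-state activity from early activity.

    early:  (n_stimuli, 12) free-phase activity at time steps 1..12
    steady: (n_stimuli,)    free-phase activity at time step 120
    Returns 13 coefficients: twelve lambdas followed by the offset b.
    """
    design = np.column_stack([early, np.ones(len(early))])
    coef, *_ = np.linalg.lstsq(design, steady, rcond=None)
    return coef


def predict_steady_state(coef, early):
    """Apply the fitted linear model to new early-phase activity."""
    return np.column_stack([early, np.ones(len(early))]) @ coef


def predictive_update(w, x_pre, x_post, x_post_pred, alpha=0.1):
    """Eq. 3: dw[i, j] = alpha * x_pre[i] * (x_post[j] - x_post_pred[j])."""
    return w + alpha * np.outer(x_pre, x_post - x_post_pred)


# Toy free-phase trajectories for one neuron (an assumption, not real data):
# exponential relaxation toward a stimulus-specific level.
rng = np.random.default_rng(0)
t = np.arange(1, 121)
levels = rng.uniform(0.0, 1.0, size=60)
traces = levels[:, None] * (1.0 - np.exp(-t / 20.0))[None, :]

train, test = traces[:40], traces[40:]
coef = fit_predictor(train[:, :12], train[:, 119])
predicted = predict_steady_state(coef, test[:, :12])  # unseen stimuli

# Weight update for 2 presynaptic and 2 postsynaptic units.
w = np.zeros((2, 2))
w_new = predictive_update(
    w,
    x_pre=np.array([1.0, 0.5]),        # clamped presynaptic activity
    x_post=np.array([1.0, 0.0]),       # clamped postsynaptic activity
    x_post_pred=np.array([0.8, 0.1]),  # predicted free-phase activity
)
```

The predictor is fitted on one set of stimuli and applied to another, mirroring the cross-validation described in the text; only the postsynaptic neuron's own prediction enters the update.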

Learning rule validation in neural network simulations

To test if predictive learning rule can be used to solve standard machine learning tasks, we created the following simulation. The neural network had 784 input units, 1000 hidden units, and 10 output units, and it was trained on a hand-written digit recognition task (MNIST [29]; Supplementary Fig. 2; Methods). This network achieved 1.9% error rate, which is similar to neural networks with comparable architecture trained with the backpropagation algorithm [29]. This demonstrates that the network with predictive learning rule can solve challenging non-linear classification tasks. To verify that neurons could correctly predict future free phase activity, we took a closer look at sample neurons. Figure 2a illustrates the activity of all 10 output neurons in response to an image of a sample digit after the first epoch of training. During time steps 1–12, only the input signal was presented and the network was running in free phase. At time step 13, the output neurons were clamped, with activity of 9 neurons set to 0 and the activity of one neuron representing correct image class set to 1. For comparison, this figure also shows the activity of the same neurons without clamped outputs (free phase). It illustrates that, after about 50 steps in free phase, the network achieves steady-state, with predicted activity closely matching. When the network is fully trained, it still takes about 50 steps for the network dynamics in free phase to converge to steady-state (Fig. 2b). Note that, although all units initially increase activity at the beginning of the free phase, they later converge close to 0, except the one unit representing the correct category. Again, predictions made from the first 12 steps during free phase closely matched the actual steady-state activity. The hidden units also converged to steady-state after about 50 steps. Figure 2c illustrates the response of one representative hidden neuron to 5 sample stimuli. 
Because hidden units experience the clamped signal only indirectly, through synapses from output neurons, their steady-state activity is not bound to converge only to 0 or 1, as in the case of output neurons. Actual and predicted steady-state activity for hidden neurons is presented in Figure 2d. The average correlation coefficient between predicted and actual free phase activity was R = 1 ± 0.0001 SD (averaged across 1000 hidden neurons in response to 200 randomly selected test images). Note that for the predicted activity ($\tilde{x}$) we always used a cross-validation approach, where we trained a predictive model for each neuron on a subset of the data and applied that model to new examples, which were then used for updating weights (Methods). Thus, neurons were able to generalize their predictions successfully to new unseen stimuli. The network error rate for the training and test dataset is shown in Fig. 2e. This demonstrates that the predictive learning rule worked well, and each neuron accurately predicted its future activity.

Fig. 2.

Neuron prediction of expected activity.

a. Activity of 10 output neurons in response to a sample stimulus at the beginning of the network training. Gray area indicates the extent of the free phase (time steps 1–12). Solid red lines show activity of the neurons clamped at step 13. For comparison, dashed lines represent free phase activity if output neurons had not been clamped. Dots show predicted steady-state activity in free phase based on initial activity (from steps 1–12). b. Activity of the same neurons after network training. Note that free phase and predicted activity converged to desired clamped activity. c. Activity of a representative neuron in the hidden layer in response to 5 different stimuli after network training. Solid and dashed lines represent clamped and free phase respectively, and dots show predicted activity. d. Predicted vs. actual free phase activity. For visualization clarity only every tenth hidden neuron out of 1000 is shown, in response to 20 sample images. Different colors represent different neurons, but some neurons may share the same color due to the limited number of colors. Distribution of points along the diagonal shows that predictions are accurate. e. Decrease in error rate across training epochs. Yellow and green lines denote learning curves for training and test data set respectively. Note that in each epoch we only used 2% out of 60,000 training examples.

Biologically motivated network architectures

We also tested the predictive learning rule in multiple other network architectures, which were designed to reflect additional aspects of biological neuronal networks. First, we introduced a constraint that 80% of the hidden neurons were excitatory, and the remaining 20% had only inhibitory outputs. This follows observations that biological neurons release either excitatory or inhibitory neurotransmitters, not both (Dale’s law [30]), and that about 80% of cortical neurons are excitatory. The network with this architecture achieved an error rate of 2.66% (Supplementary Fig. 3a). We also tested our algorithm in a network without symmetric weights, which resulted in similar performance as the original network (1.96%, Supplementary Fig. 3b). Moreover, we implemented the predictive learning rule in a network with spiking neurons, which again achieved a similar error rate of 2.46% (Supplementary Fig. 4). Our predictive learning rule was also tested in a deep convolutional network (Fig. 3a), which architecture was shown to resemble neuronal processing in the visual system [31,32]. Using this convolutional network, we tested our algorithm on a more challenging dataset for biologically inspired algorithms: CIFAR-10 [33]. This dataset consists of color images representing 10 different classes like: airplanes, cars, birds, cats, etc. We achieved 20.03% error rate, which was comparable with training the same network using backpropagation through time algorithm (Fig. 3b; see Methods for details; code to reproduce those results is available at: https://github.com/ykubo82/bioCHL/tree/master/conv). Altogether, this shows that our predictive learning rule performs well in a variety of biologically motivated network architectures.
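One simple way to impose the excitatory/inhibitory constraint described above is to assign each hidden unit a fixed sign and zero out any outgoing weight that violates it after every update. The sketch below assumes this projection approach; it is not taken from the authors' code, and the helper names `dale_signs` and `enforce_dale` are hypothetical.

```python
def dale_signs(n_hidden, frac_excitatory=0.8):
    """Assign +1 (excitatory) or -1 (inhibitory) to each hidden unit."""
    n_exc = round(n_hidden * frac_excitatory)
    return [1.0] * n_exc + [-1.0] * (n_hidden - n_exc)


def enforce_dale(w_out, signs):
    """Zero any outgoing weight whose sign disagrees with its unit type.

    w_out[i][j] is the weight from hidden unit i to target unit j.
    """
    return [
        [w if w * s >= 0 else 0.0 for w in row]
        for row, s in zip(w_out, signs)
    ]


signs = dale_signs(1000)                      # 800 excitatory, 200 inhibitory
w_out = [[0.5, -0.2], [0.3, -0.4]]            # toy outgoing weights
w_constrained = enforce_dale(w_out, [1.0, -1.0])
```

Applying `enforce_dale` after each weight update keeps every unit's outputs of a single sign while leaving the learning rule itself unchanged.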

Fig. 3.

Implementation of predictive learning rule in a multilayer convolutional neuronal network.

a. Depiction of our convolutional network architecture (see Methods). b. Learning curves for convolutional network trained using a predictive learning rule (green) and for comparison the same network trained using backpropagation through time (BPTT). The red line shows a learning curve for BPTT using the same learning rates as in our predictive model (LR: 0.4, 0.028, 0.025); yellow line: BPTT with learning rate of 0.1 for all layers, and violet line: BPTT with learning rate of 0.2 for all layers. It illustrates that on CIFAR-10, performance of the deep network using our predictive learning rule was comparable with BPTT.

Predictive learning rule validation in awake animals

To test whether real neurons could also predict their future activity, we analyzed neuronal recordings from the auditory cortex in awake rats (Methods). As stimuli we presented 6 tones, each 1s long and interspersed by 1s of silence, repeated continuously for over 20 minutes (Supplementary Information). For each of the 6 tones we calculated separately average onset and offset response, giving us 12 different activity profiles for each neuron (Fig. 4a). For each stimulus, the activity in the time window 15–25ms was used to predict average future activity within the 30–40ms window. We used 12-fold cross-validation, where responses from 11 stimuli were used to train the least-square model, which was then applied to predict neuron activity for the 1 remaining stimulus. This procedure was repeated 12 times for each neuron. The average correlation coefficient between actual and predicted activity was R = 0.36±0.05 SEM (averaged across 55 cells from 4 animals, Fig. 4b). Distribution of correlations coefficients for individual neurons were significantly different from 0 (t-test p<0.0001; all tests were two-sided; insert in Fig. 4b). This shows that neurons have predictable dynamics, and from an initial neuronal response its future activity could be estimated.
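The 12-fold cross-validation used here can be illustrated with a small leave-one-stimulus-out example. The sketch makes several simplifying assumptions: each response is summarized by a single early-window value (the actual analysis may use more predictors), the data are synthetic, and all function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + offset."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(early, late):
    """Predict each late response from a model fitted on all other stimuli."""
    predictions = []
    for k in range(len(early)):
        slope, offset = fit_line(early[:k] + early[k + 1:], late[:k] + late[k + 1:])
        predictions.append(slope * early[k] + offset)
    return predictions


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic firing rates for 12 stimuli (illustrative, not recorded data).
random.seed(1)
early = [random.uniform(0.0, 20.0) for _ in range(12)]           # 15-25 ms window
late = [0.8 * x + 2.0 + random.gauss(0.0, 1.0) for x in early]   # 30-40 ms window
predicted = leave_one_out(early, late)
r = pearson_r(late, predicted)
```

Because each prediction comes from a model that never saw the held-out stimulus, the resulting correlation measures generalization rather than fit quality.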

Fig. 4.

Predicting future activity of cortical neurons.

a. Response of a representative neuron to different stimuli. For visualization only 5 out of 12 responses are shown. Gray area indicates the time window which was used to predict future activity. Dots show predicted average activity in the 30–40ms time window. Colors correspond to different stimuli. b. Actual vs. predicted activity for 55 cells from 4 animals in response to 12 stimuli. Different colors represent different neurons, but some neurons may share the same color due to the limited number of colors. Insert: histogram of correlation coefficients for individual neurons. Skewness of the distribution to the right shows that for most neurons the correlation between actual and predicted response was positive.

However, much stronger evidence supporting our learning rule is provided by predicting long-term changes in cortical activity. Specifically, repeated presentation of stimuli over tens of minutes induces long-term changes in neuronal firing rates [34], similar as in perceptual learning. Importantly, based on our model, it was possible to infer which individual neurons will increase, and which neurons will decrease their firing rate. To explain it, first let’s look at neural network simulations in Fig. 5a. It shows that, for a neuron, the average change in activity from one learning epoch to the next, depends on the difference between clamped (actual) activity and predicted (expected) activity, in the previous learning epoch (Fig. 5a; correlation coefficient R = 0.35, p<0.0001; Supplementary Information). Similarly, for cortical neurons, we found that the change in firing rate from the 1st to the 2nd half of the experiment was positively correlated with differences between evoked and predicted activity during the 1st half of experiment (R = 0.58, p<0.0001; Fig. 5b, Supplementary Information). Those changes in activity patterns were blocked by an NMDA receptor antagonist, as we showed using this data in [34], which provides strong support that this phenomenon depends on synaptic plasticity. Results presented in Fig. 5 could be understood in terms of Eq. 3, where if actual activity is higher than predicted, then synaptic weights are increased, thus leading to higher activity of that neuron in the next epoch. Therefore, similar behavior of artificial and cortical neurons, where firing rate changes to minimize ‘surprise’: difference between actual and predicted activity, provides a strong evidence in support of the predictive learning rule presented here.

Fig. 5.

Long-term changes in neuronal activity in our model and in cortical neurons.

a, Average change in clamped steady-state activity between 2 consecutive learning epochs in our network model. This change relates to ‘surprise’: the difference between actual (clamped) and predicted activity, in the earlier epoch (n=7; Supplementary Information). Each dot represents one neuron. Regression line is shown in yellow. b, Average change in firing rate between 1st and 2nd half of our experiment with repetitive auditory stimulation. This firing rate change correlates with the difference between stimulus evoked and predicted activity during the 1st half of the experiment (Supplementary Information). Each dot represents the activity of one neuron averaged across stimuli. The similar behavior of cortical and artificial neurons suggests that both may be using essentially the same learning rule. Thus, this evidence that a neuronal change in firing rate relates to ‘surprise’, provides a novel insight about neuronal plasticity.

Deriving predictive model from spontaneous activity

Next, we tested whether spontaneous brain activity could also be used to predict neuronal dynamics during stimulus presentation. Spontaneous activity, such as during sleep, is defined as an activity not directly caused by any external stimuli. However, there are many similarities between spontaneous and stimulus evoked activity [35-38]. For example, spontaneous activity is composed of ~50–300 ms long population bursts called packets, which resemble stimulus evoked patterns [39]. This is illustrated in Figure 6a, where spontaneous activity packets in the auditory cortex are visible before sound presentation [40,41]. In our experiments, each 1s long tone presentation was interspersed with 1s of silence, and the activity during 200–1000 ms after each tone was considered as spontaneous (animals were in soundproof chamber; Supplementary Information). The individual spontaneous packets were extracted to estimate neuronal dynamics (Methods). Then the spontaneous packets were divided into 10 groups based on similarity in PCA space (Supplementary Information), and for each neuron we calculated its average activity in each group (Fig. 6b). As in previous analyses in Fig. 4a, the initial activity in time window 5–25ms was used to derive the least-square model to predict future spontaneous activity in the 30–40ms time window (Supplementary Information). This least-square model was then applied to predict future evoked responses from initial evoked activity for all 12 stimuli. Figure 6c shows actual vs predicted evoked activity for all neurons and stimuli (correlation coefficient R = 0.2±0.05 SEM, averaged over 40 cells from 4 animals; the insert shows the distribution of correlation coefficients of individual neurons; p=0.0008, t-test). Spontaneous brain activity is estimated to account for over 90% of brain energy consumption [42], however the function of this activity still is a mystery. 
The foregoing results offer a new insight: because neuronal dynamics during spontaneous activity are similar to those during evoked activity [35-38], spontaneous activity can provide “training data” for neurons to build a predictive model.

Fig. 6.

Predicting stimulus evoked responses from spontaneous activity dynamics.

a, Sample spiking activity in the auditory cortex before and during tone presentation. Note that spontaneous activity is not continuous, but rather composed of bursts called packets which are similar to tone evoked packets. The red trace shows smoothed multiunit activity: summed activity of all neurons (adopted with permission from [41]). b, Spontaneous packets were divided into 10 groups based on population activity patterns. Activity of a single neuron in 5 different spontaneous packet groups is shown. Gray area indicates the time window used for predicting future average activity within the 30–40ms time window (marked by arrow). This predictive model derived from spontaneous activity was then applied to predict future evoked activity based on initial evoked response. c, Actual vs. predicted tone evoked activity. Plot convention is the same as in Fig. 4. Skewness of the histogram to the right shows that for most neurons the evoked dynamics can be estimated based on spontaneous neuron’s activity.

Learning rule derivation by maximizing neuron energy.

Interestingly, the predictive learning rule in Eq. 3, $\Delta w_{ij} = \alpha\,\hat{x}_i(\hat{x}_j - \tilde{x}_j)$, is not an ad hoc algorithm devised to solve a computational problem; rather, this form of learning rule arises naturally as a consequence of minimizing a metabolic cost by a neuron. Most of the energy consumed by a neuron is for electrical activity, with synaptic potentials accounting for ~50% and action potentials for ~20% of ATP used [43]. Using a simplified linear model of neuronal activity, this energy consumption for a neuron j can be expressed as $-b_1\left(\sum_i w_{ij} x_i\right)^{\beta_1}$, where $x_i$ represents the activity of pre-synaptic neuron i, w represents synaptic weights, $b_1$ is a constant to match energy units, and $\beta_1$ describes a non-linear relation between neuron activity and energy usage, which is estimated to be between 1.7 and 4.8 [44]. The remaining ~30% of neuron energy is consumed on housekeeping functions, which could be represented by a constant $-\varepsilon$. On the other hand, the increase in neuronal population activity also increases local blood flow, leading to more glucose and oxygen entering a neuron (see review on neurovascular coupling: [45]). This activity dependent energy supply can be expressed as $+b_2\left(\sum_k x_k\right)^{\beta_2}$, where $x_k$ represents spiking activity of neuron k from a local population of K neurons ($k \in \{1,\ldots,K\}$); $b_2$ is a constant and $\beta_2$ reflects the exponential relation between activity and blood volume increase, which is estimated to be in the range 1.7–2.7 [44]. Note that the sum of local population activity, $\sum_k x_k$, also includes the activity of neuron j ($x_j$), as all local neurons contribute to local neurovascular coupling. Putting all the above terms together, the energy balance of a neuron j could be expressed as:

$$E_j = -\varepsilon - b_1\left(\sum_i w_{ij} x_i\right)^{\beta_1} + b_2\left(\sum_k x_k\right)^{\beta_2} \qquad (4)$$

This formulation shows that to maximize energy balance, a neuron has to minimize its electrical activity (be active as little as possible), but at the same time, it should maximize its impact on other neurons’ activities to increase blood supply (be active as much as possible).
Thus, weights have to be adjusted to strike a balance between two opposing demands: maximizing the neuron’s downstream impact and minimizing its own activity (cost). This energy objective of a cell could be paraphrased as the “lazy neuron principle”: maximum impact with minimum activity. We can calculate the changes in synaptic weights ∆w that will maximize the neuron’s energy E by using the gradient ascent method. For that, we need to calculate the derivative of E with respect to $w_{ij}$:

$$\frac{\partial E_j}{\partial w_{ij}} = -\beta_1 b_1\left(\sum_i w_{ij} x_i\right)^{\beta_1 - 1} x_i + \beta_2 b_2\left(\sum_k x_k\right)^{\beta_2 - 1} x_i \qquad (5)$$

The appearance of $x_i$ in the last term in (Eq. 5) comes from the fact that $\sum_k x_k$ includes $x_j$, which is a function of $w_{ij}$, as explained above. Thus, if we denote population activity as $\bar{x}_j = \sum_k x_k$, and considering that $x_j = \sum_i w_{ij} x_i$, then after moving $x_i$ in front of brackets and after switching the order of terms we obtain:

$$\Delta w_{ij} \propto \frac{\partial E_j}{\partial w_{ij}} = x_i\left(\beta_2 b_2\,\bar{x}_j^{\,\beta_2 - 1} - \beta_1 b_1\,x_j^{\,\beta_1 - 1}\right) \qquad (6)$$

In case that $\beta_1 = 2$ and $\beta_2 = 2$, this formula simplifies from exponential to linear. However, even if $\beta_1$ and $\beta_2$ are anywhere in the range $1.7 < \beta_1 < 4.8$ and $1.7 < \beta_2 < 2.7$ [44], the expression still is well approximated by its linearized version, $x_i(2 b_2 \bar{x}_j - 2 b_1 x_j)$, for typical values of x in range 0–1 (Supplementary Fig. 5). After also denoting that $\alpha_1 = 2 b_1$ and $\alpha_2 = b_2 / b_1$, and after taking $\alpha_1$ in front of brackets, we obtain:

$$\Delta w_{ij} \propto \alpha_1 x_i(\alpha_2 \bar{x}_j - x_j) \qquad (7)$$

Although in this derivation we used a linear model of a neuron, including a non-linear neural model like ReLU, $f(x) = x^{+} = \max(0, x)$, leads to a similar expression (Supplementary Information). Moreover, if we use the same derivation steps but to maximize neuron energy balance in the future, then Eq. 7 changes to (Eq. S7; see Supplementary Information for details of derivation):

$$\Delta w_{ij} \propto \alpha_1 x_i(\alpha_2 \bar{x}_j - \tilde{x}_j)$$

Note that Eq. S7 has the same form as the predictive learning rule in Eq. 3: $\Delta w_{ij} = \alpha\,\hat{x}_i(\hat{x}_j - \tilde{x}_j)$. Here, $\bar{x}_j$ represents population recurrent activity, which can be thought of as top-down modulation, similar to $\hat{x}_j$. Also note that the activity of neuron j, $x_j$ from Eq. 7, became here future predicted activity: $\tilde{x}_j$. Thus, this derivation shows that the best strategy for a neuron to maximize future energy resources requires predicting its future activity.
Altogether, this reveals an unexpected connection, that learning in neural networks could result from simply maximizing energy balance by each neuron.
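The derivation can be checked numerically. The sketch below assumes the energy balance has the form described in the text: a constant housekeeping cost, an activity cost that grows as a power of the neuron's own linear activity, and a supply term that grows as a power of the summed local population activity (which includes the neuron itself). The parameter names `b1`, `b2`, `beta1`, `beta2`, `eps` and all numeric values are assumptions chosen for illustration.

```python
def energy(w, x_pre, x_others, b1=1.0, b2=0.5, beta1=2.0, beta2=2.0, eps=0.1):
    """Energy balance of neuron j: -eps - cost of own activity + blood supply."""
    x_j = sum(wi * xi for wi, xi in zip(w, x_pre))   # linear neuron model
    population = x_j + sum(x_others)                 # includes neuron j itself
    return -eps - b1 * x_j ** beta1 + b2 * population ** beta2


def energy_gradient(w, x_pre, x_others, b1=1.0, b2=0.5, beta1=2.0, beta2=2.0):
    """Analytic dE/dw_i = x_i * (beta2*b2*pop^(beta2-1) - beta1*b1*x_j^(beta1-1))."""
    x_j = sum(wi * xi for wi, xi in zip(w, x_pre))
    population = x_j + sum(x_others)
    bracket = (beta2 * b2 * population ** (beta2 - 1)
               - beta1 * b1 * x_j ** (beta1 - 1))
    return [xi * bracket for xi in x_pre]


def numerical_gradient(w, x_pre, x_others, h=1e-6):
    """Central finite differences, used only to verify the analytic form."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grads.append((energy(up, x_pre, x_others)
                      - energy(down, x_pre, x_others)) / (2 * h))
    return grads


w = [0.2, 0.4]          # synaptic weights onto neuron j
x_pre = [1.0, 0.5]      # presynaptic activity
x_others = [0.3, 0.6]   # activity of the other local neurons

analytic = energy_gradient(w, x_pre, x_others)
numeric = numerical_gradient(w, x_pre, x_others)

# One gradient-ascent step should increase the energy balance.
w_new = [wi + 0.1 * gi for wi, gi in zip(w, analytic)]
```

With both exponents set to 2, the bracket is linear in the population activity and the neuron's own activity, which is the case that reduces to the simplified rule in the text.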

Discussion

Here we present theoretical, computational and biological evidence that the basic principle underlying single neuron learning may rely on minimizing future surprise: a difference between actual and predicted activity. Thus, a single neuron is not only performing summation of its inputs, but it also predicts the expected future, which we propose is a crucial component of the brain’s learning mechanism. Note that a single neuron has complexity similar to single cell organisms, which were shown to have ‘intelligent’ adaptive behaviors, including predicting consequences of its action in order to navigate toward food and away from danger [46-48]. This suggests that typical neuronal models used in machine learning may be too simplistic to account for the essential computational properties of biological neurons. Our work suggests that a predictive mechanism may be an important computational element within neurons, which could be crucial to understand learning mechanisms in the brain. This is supported by a theoretical derivation showing that the predictive learning rule provides an optimal strategy for maximizing metabolic energy of a neuron. To our knowledge, this is the first time where a synaptic learning rule has been derived from basic cellular principles, i.e. from maximizing energy of a cell. This provides a more solid theoretical basis over previous biologically-inspired algorithms, which were developed ad hoc to solve specific computational tasks while still satisfying selected biological constraints. However, it should be emphasized that many of those previous algorithms provided novel and insightful ideas which enabled the development of our model. Importantly, our derived learning rule provides a theoretical connection between those diverse types of brain-inspired algorithms, as discussed below. One of the most influential ideas about brain’s learning algorithm was proposed by Donald Hebb, which is based on correlated firing: a.k.a. 
‘cells that fire together wire together’ [49]. This could be written as: $\Delta w_{ij} \propto x_i x_j$, where $\Delta w_{ij}$ is the change in synaptic weight between neurons i and j, ∝ denotes proportionality, and $x_i$ and $x_j$ represent pre- and post-synaptic activity, respectively. Note that this is a special case of our predictive learning rule: when $\tilde{x}_j = 0$, i.e. when a neuron does not make any prediction (note that here $x_i$ and $x_j$ represent actual activity, as is the case in the clamped phase (i.e. $\hat{x}_i$ and $\hat{x}_j$ in Eq. 3); thus, for clarity of comparison, the hat symbol ^ can be omitted here). Despite its influential role, the original Hebb’s rule was shown to be unstable, as the synaptic weights will tend to increase or decrease exponentially. To solve this problem, the BCM theory was proposed [50], which can be expressed in a simplified form as: $\Delta w_{ij} \propto x_i x_j (x_j - \theta_j)$, where $\theta_j$ can be considered as the average activity of neuron j across all input patterns. Note that if in our equation: $\Delta w_{ij} \propto x_i (x_j - \tilde{x}_j)$, we would use the simplest predictive model: always predicting the average activity, then $\tilde{x}_j = \theta_j$ and our predictive rule becomes equivalent to the core part of the BCM rule and could be seen as a linearized version of the full BCM rule. However, it was noted that networks trained using the BCM rule do not achieve the same level of accuracy as other learning rules [51]. This is consistent with our experience that the performance of our algorithm deteriorated when we used the average activity of each neuron for predictions. We infer from this that dynamically adjusting predictions based on the most recent activity allows for more precise weight adjustments. Moreover, we described in the Results section how our predictive learning rule directly relates to Contrastive Hebbian Learning, which belongs to the class of temporal difference learning algorithms. Our algorithm is also similar to other predictive algorithms (see Introduction). The main difference is that we propose that neurons can internally calculate their predictions, rather than relying on specialized neuronal circuits, as pointed out by the Reviewer. 
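The relation between the Hebbian update and the predictive update described above can be checked numerically. The following is only an illustrative sketch: it assumes the predictive update has the form pre-synaptic activity × (actual − predicted post-synaptic activity), as suggested by the description of Eq. 3, and the function names, learning rate and activity values are hypothetical.

```python
def hebbian_update(pre, post, lr=0.1):
    """Plain Hebbian change: proportional to pre * post activity."""
    return lr * pre * post


def predictive_update(pre, post, predicted_post, lr=0.1):
    """Assumed predictive change: pre * (actual - predicted) post activity."""
    return lr * pre * (post - predicted_post)


def mean_activity(history):
    """Simplest predictor: always predict the average past activity."""
    return sum(history) / len(history)


# Hypothetical activity values, chosen only for illustration.
pre, post = 0.8, 0.6
post_history = [0.2, 0.4, 0.6]  # past activity of the post-synaptic neuron

dw_hebb = hebbian_update(pre, post)
# With no prediction (predicted activity = 0) the predictive rule reduces to Hebb.
dw_no_prediction = predictive_update(pre, post, predicted_post=0.0)
# Predicting the mean gives a BCM-like threshold: the sign depends on whether
# the actual activity is above or below the average.
dw_mean_prediction = predictive_update(pre, post, mean_activity(post_history))
dw_below_mean = predictive_update(pre, 0.2, mean_activity(post_history))

print(dw_hebb, dw_no_prediction, dw_mean_prediction, dw_below_mean)
```

In this toy setting, activity above the running mean gives a positive weight change and activity below it gives a negative one, which is the qualitative behavior attributed to the BCM-like case in the text.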
We already mentioned earlier that organisms with simpler neuronal systems may not have predictive circuits as proposed to exist in the cortex [12,14]. Thus, a predictive learning rule at the level of a single neuron may provide a more basic description of the learning process across different brains. However, our model should not be taken as precluding the possibility that in more complex brains, in addition to intracellular predictions, neurons may form predictive circuits to enhance the predictive abilities of an organism. Our model is also closely related to the work of [52-54], where depolarization of basal dendrites serves as a prediction of top-down signals from apical dendrites in pyramidal neurons. Again, our derived model could be seen as a generalization of those ideas, as it is not constrained to any specific cell type. The other interesting aspect of our model is that it belongs to the category of energy-based models, for which it was shown that synaptic update rules are consistent with spike-timing-dependent plasticity [55]. Considering all the above, we suggest that our plasticity rule derived from basic metabolic principles could serve as a common denominator for diverse types of biologically inspired learning algorithms, and as such, it may offer a step toward development of a unified neuronal learning theory. Biological neurons have a variety of cellular mechanisms which operate on time scales of 1–100 ms, suitable for implementing predictions [56-60]. The most likely mechanism appears to be calcium signaling. For example, when a neuron is activated, it leads to a corresponding elevation of somatic calcium for tens of ms [61]. This time period with elevated calcium could indicate that a certain level of new input is expected to arrive in that time window. 
For instance, if a bottom-up visual stimulus triggers multiple spikes in a neuron, then the resulting proportional increase in calcium concentration may signal that a higher level of follow-up activity is expected, which could correspond to predicting a higher level of e.g. top-down modulation. This would be consistent with our experimental data, where higher activity at stimulus onset is correlated with higher activity ~20 ms later (Fig. 4; see Suppl. Information for more details on the plausibility of the predictive mechanism implementation and on proposed experiments to test it more directly). Interestingly, the core prediction of BCM and our model, that synaptic weights should increase/decrease if a neuron is stimulated above/below expected activity, is supported by experimental evidence where applying strong/weak electrical stimulation to cells in CA1 induced LTP/LTD, respectively [62], which also involves calcium-dependent mechanisms [63]. There are also other possible cellular properties which could support predictive mechanisms. For instance, it was shown that neurons can preferentially respond to inputs arriving at specific resonance frequencies (range: ~1–50 Hz) [64,65]. This provides another example that neurons do have cellular mechanisms to ‘remember’ and to ‘act’ accordingly based on their past activity tens of ms earlier [58]. Therefore, the cellular mechanisms listed above, together with the consistency of our model with the experimental data presented in Figs 4–6, show that neurons are at least capable of implementing the predictive learning rule. Our work also suggests that packets could be basic units of information processing in the brain. It is well established that sensory stimuli evoke coordinated bursts (packets) of neuronal activity lasting from tens to hundreds of ms. We call such population bursts packets, because they have stereotypical structure, with neurons active at the beginning conveying bottom-up sensory information (e.g. 
this is a face), and later in the packet neurons represent additional higher order information (e.g. this is a happy face of that particular friend) [66]. Also, the later part of the packet can encode whether there is a discrepancy with expectation (e.g. this is a novel stimulus [67,68]; Supplementary Fig. 1). This is likely because only the latter part of the packet can receive top-down modulation after information about that stimulus is exchanged between other brain areas, which is the case even during passive stimulus presentation [69,70]. Thus, our work suggests that the initial part of the packet can be used to infer what the rest of the brain may ‘think’ about this stimulus, and the difference from this expectation can be used as a learning mechanism to modify synaptic connections. This could be the reason why, for example, we cannot process visual information faster than ~20 frames/s, as only after evaluating whether a given image is consistent with expectation can the next image be processed by the next packet, which takes ~50 ms. Our predictive learning rule thus implies that sensory information is processed in discrete units and each packet may represent an elementary unit of perception. When recording neuronal activity in the cortex, the slowest oscillations (<10 Hz) are by far the most dominant [41,71], and one of the biggest questions in neuroscience is what the function of those oscillations is [72]. Therefore, it is worth noting how a learning rule derived from basic cellular principles may relate to packets, which are the main part of slow oscillations [39,73,74]. As described above, dividing information into discrete packets could provide an effective mechanism to improve neuronal predictions. It could allow for easier differentiation of feed-forward signals arriving during the initial wave of a packet from predicted top-down information arriving later. Another big question in neuroscience is about the function of spontaneous brain activity [42]. 
For example, why would the brain spend so much energy to generate packets even during e.g. sleep? Interestingly, just as in the brain, where most energy is consumed by spontaneous activity [42], in our model most energy (i.e. computational time) is used for free phase network activity, which allows the intracellular predictive model to learn network dynamics in an unsupervised way. Thus, free phase activity in our model suggests that the function of spontaneous packets could be to provide neurons with diverse training data to improve the robustness of the predictive model, as supported by results presented in Fig. 6. Moreover, note that free phase activity may also be used for unsupervised learning. For example, if new input is present in the free phase, neurons can still calculate whether such evoked activity is consistent with internal model predictions. If not, then weights can be modified to bring the free phase activity evoked by new stimuli closer to the prediction (it is the same mechanism as we use in the clamped phase during supervised learning). This is a similar idea to unsupervised pre-training [75]; however, more work is needed to investigate it.

Limitations

While the present study proposes a novel theoretical perspective on neuronal learning, this also comes with caveats that should be taken into account. Due to limits in current technology, parts of our model could not yet be properly validated experimentally. The major caveat in our model is the assumption of a cellular mechanism for predicting future activity. Although neurons do have activity-dependent calcium signaling [61], there is no direct evidence that neurons use it to predict expected activity. The data that we present in Figs 4 & 6 show that neurons have predictable dynamics, and this should be interpreted as only demonstrating that the main prerequisites for the predictive learning rule have been met, but it does not prove that neurons use it to make predictions. Also, for computational simplicity, in our model we present only one stimulus at a time to the network. In contrast, brains receive a constant stream of sensory stimuli, and new sensory inputs can arrive at the same time as top-down signals, which is not the case in our model. However, new sensory stimuli arriving during neuronal packets already in progress were shown to be suppressed [76], which could serve to largely reduce interference between stimuli, as assumed in our model. The biological validity of this model assumption should be more directly tested. It is also important for our model that all data presented in the free phase to train the predictive model have the same statistical distribution as data presented in the clamped phase. If only noise inputs were presented to the network in the free phase, then performance of our model would likely deteriorate. As mentioned earlier, numerous studies showed that spontaneous brain activity is not like random noise, but rather it has similar statistical properties to stimulus-evoked patterns [35-38]. That, together with experimental results presented in Fig. 
6, provides a rationale for our network to use data with similar distributions during the free and clamped phases. Moreover, there are other open questions about this model. For instance, consistent with our model, individual neurons can respond to novel stimuli with a higher or lower firing rate as compared to familiar stimuli [67,77]. However, on average, neurons recorded in cortex typically show a higher firing rate to novel stimuli [67,77], which is not explained by our model. This discrepancy could be due to an inherent sampling bias in electrophysiological recordings toward the most active cells [78]. It also may suggest additional network-level predictive mechanisms which could explain the elevated response to novel stimuli, as proposed in [13,14]. More work is needed to answer those questions. It should also be noted that although our analytical derivation of the synaptic learning rule provides an important first step to link predictive learning models to metabolic activity, it required greatly simplifying the description of metabolic processes to only the few most important variables. The biological accuracy of this simplified description still needs to be investigated. Future work should also explore whether implementing a non-linear predictive model within neurons could further improve the performance of our network. Nevertheless, considering that the presented model provides a theoretical connection between diverse types of brain-inspired algorithms, this work could lead to a better understanding of neuronal principles [79].

Methods

Neural Network (MNIST dataset)

The code for our network with the predictive learning rule, which we used to produce the results presented in Fig. 2, is available at https://github.com/ykubo82/bioCHL, which contains all implementation details. Briefly, the base network has the architecture 784–1000–10 with sigmoidal units and symmetric connections (see Supplementary Figs 3–4 for more biologically plausible network architectures which we also tested). Neuron activity dynamics in the hidden layer are described as in a standard network with Contrastive Hebbian Learning [80]: $x_j(t) = x_j(t-1) + h\left[-x_j(t-1) + S\left(\sum_p w_{pj}\, x_p(t-1) + \sum_q w_{qj}\, x_q(t-1) + b_j\right)\right]$, where $w_{pj}$ denotes the weight from neuron p in the input layer to neuron j in the hidden layer, $w_{qj}$ denotes the weight from output layer neuron q to hidden layer neuron j, $b_j$ is a bias, t is a time step and S is a sigmoid activation function. Parameter h=0.1 is the Euler method’s time-step commonly used to improve computational stability. However, changing h to 0.2 or 1 resulted in similar network performance here. In the standard implementation of Contrastive Hebbian Learning, all top-down connections are also multiplied by a small number γ (~0.1) [80]. This different treatment of feed-forward and feedback connections could be biologically questionable as many brain circuits are highly recurrent and e.g. granule cells do not seem to have specific dendrites for receiving feedback signals. Therefore, to make our network more biologically plausible we set this feedback gain factor γ to 1, thus allowing our network to learn by itself what should be the contribution of each input. For the output layer, the top-down term $\sum_q w_{qj}\, x_q(t-1)$ is set to 0 as there are no top-down connections to that layer. Neurons in the input layer do not have any dynamics as their activity is set to a value corresponding to pixel intensity in the presented image. To accelerate training, we used AdaGrad [81], and we applied a learning rate of 0.03 to the hidden layer and 0.02 to the output layer. Synaptic weights for neurons in hidden and output layers were modified as described in Eq. 3.
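The activity dynamics described above can be sketched in a few lines of NumPy. This is not the authors' implementation (which is in the linked repository); it is a minimal illustration that assumes the usual Euler-discretized form x ← x + h·(−x + S(bottom-up + γ·top-down + bias)). The function names, the zero initial activity, the random weights and the choice of 120 relaxation steps are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hidden_step(x_hid, x_in, x_out, w_in_hid, w_out_hid, b_hid, h=0.1, gamma=1.0):
    """One Euler step of hidden-layer activity (assumed form, see lead-in)."""
    drive = x_in @ w_in_hid + gamma * (x_out @ w_out_hid) + b_hid
    return x_hid + h * (-x_hid + sigmoid(drive))


def output_step(x_out, x_hid, w_hid_out, b_out, h=0.1):
    """Output layer: same update, but with no top-down term."""
    drive = x_hid @ w_hid_out + b_out
    return x_out + h * (-x_out + sigmoid(drive))


def run_free_phase(x_in, w_in_hid, w_hid_out, b_hid, b_out, n_steps=120, h=0.1):
    """Relax the network with the input held fixed.

    Connections are symmetric, so feedback weights are the transpose of the
    hidden-to-output weights.
    """
    x_hid = np.zeros(w_in_hid.shape[1])   # assumed initial state
    x_out = np.zeros(w_hid_out.shape[1])  # assumed initial state
    for _ in range(n_steps):
        new_hid = hidden_step(x_hid, x_in, x_out, w_in_hid, w_hid_out.T, b_hid, h)
        new_out = output_step(x_out, x_hid, w_hid_out, b_out, h)
        x_hid, x_out = new_hid, new_out
    return x_hid, x_out


rng = np.random.default_rng(0)
n_in, n_hid, n_out = 784, 1000, 10  # layer sizes from the text
w_in_hid = rng.normal(0.0, 0.01, size=(n_in, n_hid))    # hypothetical weights
w_hid_out = rng.normal(0.0, 0.01, size=(n_hid, n_out))  # hypothetical weights
x_in = rng.random(n_in)  # stand-in for pixel intensities
x_hid, x_out = run_free_phase(x_in, w_in_hid, w_hid_out,
                              np.zeros(n_hid), np.zeros(n_out))
print(x_hid.shape, x_out.shape)
```

With h = 1 the update reduces to x ← S(input), which is one way to see why the step size mainly affects how smoothly the network relaxes rather than where it settles.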

Future activity prediction

For all the predictions we used a cross-validation approach. Specifically, in each training cycle we ran the free phase on 490 examples, which were used to derive a least-squares model for each neuron to predict its future activity at time step 120, from its initial activity at steps 1–12 ($\check{x}(1), \ldots, \check{x}(12)$). This can be expressed as: $\tilde{x}(120) = \lambda_1\, \check{x}(1) + \ldots + \lambda_{12}\, \check{x}(12) + b$, where terms in brackets correspond to time steps, and λ and b correspond to coefficients and offset terms found by the least-squares method. Next, 10 new examples were taken, for which the free phase was run only for 12 steps; then the above derived least-squares model was applied to predict the free phase steady-state activity for each of the 10 examples. From step 13 the network output was clamped. The weights were updated based on the difference between predicted and clamped activity calculated only from those 10 new examples. This process was repeated 120 times in each training epoch. Moreover, the MNIST dataset has 60,000 examples which we used for the above described training, and 10,000 additional examples which were only used for testing. For all plots in Figures 2 & 3 we only used test examples which the network never saw during training. This demonstrates that each neuron can accurately predict its future activity even for novel stimuli which were never presented before.
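The fit-then-predict cycle described above can be illustrated for a single neuron as follows. This is a sketch under stated assumptions, not the authors' code: the data are synthetic stand-ins for free-phase trajectories, the helper names `fit_predictor` and `predict` are hypothetical, and `numpy.linalg.lstsq` is used as one possible least-squares solver.

```python
import numpy as np


def fit_predictor(early, late):
    """Least-squares fit of late ~ early @ coeffs + offset for one neuron.

    early: (n_examples, 12) activity at steps 1-12
    late:  (n_examples,)    activity at step 120
    """
    design = np.column_stack([early, np.ones(len(early))])  # last column = offset b
    solution, *_ = np.linalg.lstsq(design, late, rcond=None)
    return solution[:-1], solution[-1]


def predict(early, coeffs, offset):
    """Apply the fitted linear model to new early activity."""
    return early @ coeffs + offset


# Synthetic, noise-free stand-in for one neuron's free-phase activity.
rng = np.random.default_rng(0)
n_fit, n_new, n_early = 490, 10, 12  # counts taken from the text
true_coeffs = rng.normal(size=n_early)
early_fit = rng.random((n_fit, n_early))
late_fit = early_fit @ true_coeffs + 0.3

coeffs, offset = fit_predictor(early_fit, late_fit)

# Ten new examples: only the first 12 steps are run, the rest is predicted.
early_new = rng.random((n_new, n_early))
predicted = predict(early_new, coeffs, offset)
clamped = rng.random(n_new)      # placeholder for clamped-phase activity
surprise = clamped - predicted   # actual minus predicted activity

# 490 fitting examples + 10 update examples per cycle over 60,000 images.
cycles_per_epoch = 60_000 // (n_fit + n_new)
print(coeffs.shape, cycles_per_epoch)
```

The last line reproduces the arithmetic behind the 120 repetitions per epoch: each cycle consumes 500 of the 60,000 training examples.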

Convolutional Neural Network (CIFAR-10 dataset)

The convolutional network has an input layer of size 32×32×3, corresponding to the size of a single image with 3 color channels in the CIFAR-10 dataset (this dataset consists of 5000 training and 1000 test images for each of 10 classes [33]). The network has two convolutional and pooling layers followed by one fully connected output layer (Fig. 3a). The filter size for all the convolutional layers is 3×3 with stride 1, and the number of filters is 256 and 512 for the first and second convolutional layers, respectively. We did not use zero-padding. For pooling, we used max pooling with 2×2 filters and stride 2. The activation function for the convolutional and the fully connected layers was the hard-sigmoid activation function, S(x) = (1+hardtanh(x-1))*0.5, as implemented in [24]. The learning rates were 0.4, 0.028, and 0.025 for the first convolutional layer, the second convolutional layer, and the fully connected output layer, respectively. The Euler method’s time-step h was set to 1. Considering that clamping output neurons at only two extreme values, 0 or 1, may not be the most accurate model of top-down signals in the brain, here we implemented weak clamping as proposed in [23]. Briefly, instead of setting the value of the output neuron to 0 or 1 during the clamped phase, output neurons were only slightly nudged toward the required values. For example, if an output neuron should have a value of 1, then it was clamped at value $\check{x}_o + \varepsilon$, where $\check{x}_o$ is the free phase steady-state activity of that output neuron, and ε is a small nudging factor toward 1. To calculate nudging for each neuron we used a clamping factor of 0.01 as described in [23]. This network with our predictive learning rule achieved a 20.03% error rate on the CIFAR-10 dataset. Using the original ‘hard’ clamping, changing h to 0.1 or increasing the number of neurons to 326 in the first layer gave similar results. We also directly compared the predictive learning rule with backpropagation through time (BPTT) on the same convolutional network (Fig. 3). 
We selected BPTT as it uses a roll-out through time, which is more comparable to our model. To ensure generality of the presented results, we repeated the training with BPTT three times using different learning rates for each simulation. Using BPTT with the same learning rates as in our predictive model (0.4, 0.028, 0.025), the error rate was 20.88%. For BPTT with a learning rate of 0.1 for all layers, the error rate was 21.23%, and 22.77% for a learning rate of 0.2 (Fig. 3b). Code for the convolutional network was adapted from [82], which we modified to include our predictive learning rule. To reproduce our results, our code for the convolutional network with all implementation details is available at: https://github.com/ykubo82/bioCHL/tree/master/conv. Altogether, those results show that our predictive learning rule can also be successfully implemented in deeper networks and on more challenging tasks.
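Three details of the convolutional setup described in this section lend themselves to a short worked example: the hard-sigmoid activation, weak clamping, and the feature-map sizes implied by unpadded 3×3 convolutions and 2×2 pooling. The sketch below is illustrative only. The hard-sigmoid follows the formula given in the text; the linear form of the nudge in `weak_clamp` and the floor rounding in `pool_size` are assumptions, and all function names are hypothetical.

```python
def hardtanh(x):
    """Clip x to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))


def hard_sigmoid(x):
    """S(x) = (1 + hardtanh(x - 1)) * 0.5, as given in the text."""
    return (1.0 + hardtanh(x - 1.0)) * 0.5


def weak_clamp(free_activity, target, clamping_factor=0.01):
    """Nudge an output slightly toward its target (assumed linear nudge)."""
    return free_activity + clamping_factor * (target - free_activity)


def valid_conv_size(size, kernel=3, stride=1):
    """Output width of a convolution without zero-padding."""
    return (size - kernel) // stride + 1


def pool_size(size, kernel=2, stride=2):
    """Output width of max pooling, assuming floor rounding."""
    return (size - kernel) // stride + 1


# Hard-sigmoid is 0 below x = 0, linear (x / 2) on [0, 2], and 1 above x = 2.
samples = [hard_sigmoid(x) for x in (-1.0, 0.0, 1.0, 2.0, 3.0)]

# An output at 0.4 in the free phase, nudged toward a target of 1.
nudged = weak_clamp(0.4, 1.0)

# 32 -> conv 30 -> pool 15 -> conv 13 -> pool 6
size = 32
for _ in range(2):  # two convolution + pooling stages
    size = pool_size(valid_conv_size(size))
flat_features = size * size * 512  # 512 filters in the second conv layer

print(samples, nudged)
print(size, flat_features)  # prints: 6 18432
```

Under these assumptions the fully connected output layer would receive 6×6×512 = 18,432 inputs; a different pooling rounding convention would change that number.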

Surgery, recording and neuronal data

The experimental procedures for the awake, head-fixed experiment have been previously described [40,41] and were approved by the Rutgers University Animal Care and Use Committee, and conformed to NIH Guidelines on the Care and Use of Laboratory Animals. Briefly, a headpost was implanted on the skull of four Sprague-Dawley male rats (300–500 g) under ketamine-xylazine anesthesia, and a craniotomy was performed above the auditory cortex and covered with wax and dental acrylic. After recovery the animal was trained for 6–8 days to remain motionless in the restraining apparatus. On the day of the surgery, the animal was briefly anesthetized with isoflurane, the dura was resected, and after a recovery period, recording began. For recording we used silicon microelectrodes (NeuroNexus Technologies, Ann Arbor, MI) consisting of 8 or 4 shanks spaced by 200 µm, with a tetrode recording configuration on each shank. Electrodes were inserted in layer V in the primary auditory cortex. Units were isolated by a semiautomatic algorithm (klustakwik.sourceforge.net) followed by manual clustering (klusters.sourceforge.net) [83]. Only neurons with average stimulus-evoked firing rates higher than 3 SD above pre-stimulus baseline were used in analysis, resulting in 9, 12, 12, and 22 neurons from each rat. For predicting evoked activity from spontaneous activity, we also required that neurons have a mean firing rate during spontaneous packets above said threshold, which reduced the number of neurons to 40. The spontaneous packet onsets were identified from the spiking activity of all recorded cells as the time of the first spike marking a transition from a period of global silence (30 ms with at most one spike from any cell) to a period of activity (60 ms with at least 15 spikes from any cells), as described before in [40,73].
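The packet-onset criterion described above (a silent 30 ms window followed by an active 60 ms window) can be sketched as a simple scan over pooled spike times. This is one possible reading of the criterion, not the authors' analysis code: the function names are hypothetical, the silence window is taken to exclude the candidate spike itself, and spikes falling inside an already detected packet are skipped so that only the first spike is reported.

```python
from bisect import bisect_left


def count_in_window(spike_times, start, stop):
    """Number of spikes with start <= t < stop in a sorted list."""
    return bisect_left(spike_times, stop) - bisect_left(spike_times, start)


def packet_onsets(spike_times, silence_ms=30.0, max_silence_spikes=1,
                  active_ms=60.0, min_active_spikes=15):
    """Return times (ms) of spikes that mark the start of a packet.

    spike_times: sorted spike times in ms, pooled over all recorded cells.
    """
    onsets = []
    for t in spike_times:
        if onsets and t < onsets[-1] + active_ms:
            continue  # assumed: still inside the packet that was just detected
        quiet = count_in_window(spike_times, t - silence_ms, t) <= max_silence_spikes
        busy = count_in_window(spike_times, t, t + active_ms) >= min_active_spikes
        if quiet and busy:
            onsets.append(t)
    return onsets


# Hypothetical pooled spike train: one stray spike, then a burst from 100 ms.
spikes = [50.0] + [100.0 + 3.0 * k for k in range(20)]
print(packet_onsets(spikes))  # prints: [100.0]
```

The stray spike at 50 ms is preceded by silence but is not followed by enough activity, so only the first spike of the burst is reported as an onset.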
