Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation
Benjamin Scellier, Yoshua Bengio.
Abstract
We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well-defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point or stationary distribution) toward a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged toward their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal "back-propagated" during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not. We also show experimentally that multi-layer recurrently connected networks with 1, 2, and 3 hidden layers can be trained by Equilibrium Propagation on the permutation-invariant MNIST task.
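For concreteness, the two-phase procedure described in the abstract can be sketched in a few lines of numpy. Everything below (network sizes, initialization, step sizes, the fully connected layout, the toy single-pattern task) is an illustrative choice of ours, not the paper's experimental setup; only the overall structure, a free phase, a weakly clamped phase, and a contrastive Hebbian-style weight update scaled by 1/β, follows the algorithm described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(s):                       # hard-sigmoid nonlinearity, as in the paper
    return np.clip(s, 0.0, 1.0)

# Toy sizes and initialization -- illustrative choices, not the paper's setup.
n_in, n_hid, n_out = 4, 8, 2
n_free = n_hid + n_out            # hidden and output units evolve freely
W = 0.1 * rng.standard_normal((n_in + n_free, n_in + n_free))
W = 0.5 * (W + W.T)               # symmetric (Hopfield-style) connections
np.fill_diagonal(W, 0.0)

def relax(W, v, s, beta=0.0, d=None, eps=0.5, n_iter=50):
    """Iterative inference: relax the free units toward a fixed point; when
    beta > 0, the output units are also nudged toward the target d."""
    for _ in range(n_iter):
        u = np.concatenate([v, s])        # input units are always clamped
        g = (W[n_in:] @ rho(u)) - s       # leaky-integrator style dynamics
        if beta:
            g[n_hid:] += beta * (d - s[n_hid:])
        s = np.clip(s + eps * g, 0.0, 1.0)
    return s

def eqprop_step(W, v, d, beta=1.0, alpha=0.05):
    """One two-phase update; returns new weights and the free prediction."""
    s0 = relax(W, v, 0.5 * np.ones(n_free))        # phase 1: free phase
    sb = relax(W, v, s0.copy(), beta=beta, d=d)    # phase 2: weak clamping
    r0, rb = rho(np.concatenate([v, s0])), rho(np.concatenate([v, sb]))
    # Contrastive update: difference of co-activities, scaled by 1/beta.
    dW = (np.outer(rb, rb) - np.outer(r0, r0)) / beta
    np.fill_diagonal(dW, 0.0)
    return W + alpha * dW, s0[n_hid:]

# Fit a single toy input/target pair and watch the free-phase error drop.
v, d = rng.random(n_in), np.array([1.0, 0.0])
W, y = eqprop_step(W, v, d)
err0 = float(np.sum((d - y) ** 2))      # error of the untrained prediction
for _ in range(200):
    W, y = eqprop_step(W, v, d)
err1 = float(np.sum((d - y) ** 2))      # error after repeated updates
print(err0, err1)
```

Note that the same relaxation routine runs in both phases; as the abstract emphasizes, the only difference between the phases is the β-weighted nudge at the output units and whether the synapses are updated.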
Keywords: Hopfield networks; artificial neural network; backpropagation algorithm; biologically plausible learning rule; contrastive Hebbian learning; deep learning; fixed point; spike-timing dependent plasticity
Year: 2017 PMID: 28522969 PMCID: PMC5415673 DOI: 10.3389/fncom.2017.00024
Source DB: PubMed Journal: Front Comput Neurosci ISSN: 1662-5188 Impact factor: 2.380
Figure 1. The input units v are clamped. The state variable s includes the hidden units h and output units y. The targets are denoted by d. The network is recurrently connected with symmetric connections. Left. Equilibrium Propagation applies to any architecture, even a fully connected network. Right. The connection with Backpropagation is more obvious when the network has a layered architecture.
Figure 2. Comparison between the traditional framework for Deep Learning and our framework. Left. In the traditional framework, the state of the network fθ(v) and the objective function J(θ, v) are explicit functions of θ and v and are computed analytically. The gradient of the objective function is also computed analytically thanks to the Backpropagation algorithm (a.k.a. automatic differentiation). Right. In our framework, the free fixed point is an implicit function of θ and v and is computed numerically. The nudged fixed point and the gradient of the objective function are also computed numerically, following our learning algorithm: Equilibrium Propagation.
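The "numerically computed gradient" of the right panel can be made concrete with a small check: the Equilibrium Propagation estimate of a weight derivative (difference of co-activities between the nudged and free fixed points, divided by β) should approach the true derivative of the objective, which we can also obtain by finite differences. The sketch below is our own toy construction, not code from the paper: it uses a smooth sigmoid instead of the paper's hard sigmoid so the fixed points are well behaved, and all sizes and constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def rho(s):                                # smooth sigmoid nonlinearity
    return 1.0 / (1.0 + np.exp(-s))

# Toy network: sizes, weight scale, input and target are arbitrary choices.
n_in, n_hid, n_out = 3, 5, 2
n = n_in + n_hid + n_out
W = 0.2 * rng.standard_normal((n, n))
W = 0.5 * (W + W.T)                        # symmetric connections
np.fill_diagonal(W, 0.0)
v, d = rng.random(n_in), rng.random(n_out)
out = slice(n_in + n_hid, n)               # indices of the output units

def fixed_point(W, beta=0.0, eps=0.2, n_iter=3000):
    """Iterative inference on F = E + beta*C,
    with E = 0.5||s||^2 - 0.5 rho(s)^T W rho(s) and C = 0.5||d - y||^2."""
    s = np.zeros(n)
    s[:n_in] = v                           # input units are clamped
    for _ in range(n_iter):
        r = rho(s)
        g = r * (1.0 - r) * (W @ r) - s    # -dE/ds
        g[:n_in] = 0.0
        if beta:
            g[out] += beta * (d - s[out])  # nudge outputs toward the target
        s = s + eps * g
    return s

# EqProp estimate of dJ/dW for one symmetric weight pair (i, j).
i, j, beta, delta = n_in, n_in + n_hid, 0.01, 1e-4
r0 = rho(fixed_point(W))                   # free fixed point
rb = rho(fixed_point(W, beta=beta))        # nudged fixed point
ep_grad = -(rb[i] * rb[j] - r0[i] * r0[j]) / beta

# Finite-difference derivative of the objective J = C at the free fixed point.
def J(Wp):
    s = fixed_point(Wp)
    return 0.5 * float(np.sum((d - s[out]) ** 2))

Wp, Wm = W.copy(), W.copy()
Wp[i, j] += delta; Wp[j, i] += delta
Wm[i, j] -= delta; Wm[j, i] -= delta
fd_grad = (J(Wp) - J(Wm)) / (2 * delta)
print(ep_grad, fd_grad)   # should nearly coincide for small beta
```

The agreement improves as β shrinks, which is exactly the sense in which the nudged-minus-free difference "is" the gradient of the objective function.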
Correspondence of the phases for different learning algorithms: Back-propagation, Equilibrium Propagation (our algorithm), Contrastive Hebbian Learning (and Boltzmann Machine Learning), and Almeida-Pineda's Recurrent Back-Propagation.
| | Back-propagation | Equilibrium Propagation | Contrastive Hebbian Learning | Recurrent Back-Propagation |
| First Phase | Forward Pass | Free Phase | Free Phase (or Negative Phase) | Free Phase |
| Second Phase | Backward Pass | Weakly Clamped Phase | Clamped Phase (or Positive Phase) | Recurrent Backprop |
Figure 3. Training and validation error for neural networks with one hidden layer of 500 units (top left), two hidden layers of 500 units (top right), and three hidden layers of 500 units (bottom). The training error eventually decreases to 0.00% in all three cases.
Hyperparameters.
| Architecture | Iterations (first phase) | Iterations (second phase) | ϵ | β | α1 | α2 | α3 | α4 |
| 784-500-10 | 20 | 4 | 0.5 | 1.0 | 0.1 | 0.05 | | |
| 784-500-500-10 | 100 | 6 | 0.5 | 1.0 | 0.4 | 0.1 | 0.01 | |
| 784-500-500-500-10 | 500 | 8 | 0.5 | 1.0 | 0.128 | 0.032 | 0.008 | 0.002 |
The learning rate ϵ is used for iterative inference (Equation 43). β is the value of the clamping factor in the second phase. αk is the learning rate for updating the parameters of layer k.