Combining backpropagation with Equilibrium Propagation to improve an Actor-Critic reinforcement learning framework
Yoshimasa Kubo, Eric Chalmers, Artur Luczak.
Abstract
Backpropagation (BP) has been used to train neural networks for many years, allowing them to solve a wide variety of tasks such as image classification, speech recognition, and reinforcement learning. However, the biological plausibility of BP as a mechanism of neural learning has been questioned. Equilibrium Propagation (EP) has been proposed as a more biologically plausible alternative and achieves comparable accuracy on the CIFAR-10 image classification task. This study proposes the first EP-based reinforcement learning architecture: an Actor-Critic architecture with the actor network trained by EP. We show that this model can solve the basic control tasks often used as benchmarks for BP-based models. Interestingly, our trained model demonstrates more consistent high-reward behavior than a comparable model trained exclusively by BP.
Keywords: Actor-Critic (AC); Equilibrium Propagation; backpropagation; biologically plausible; reinforcement learning
Year: 2022 PMID: 36082305 PMCID: PMC9446087 DOI: 10.3389/fncom.2022.980613
Source DB: PubMed Journal: Front Comput Neurosci ISSN: 1662-5188 Impact factor: 3.387
Train Actor-Critic by EP and BP.

for episode = 1, 2, …, E do
    for j = 1, 2, …, J do
        Compute x_{j,f} with Eqs 1, 2    // index f means free phase
        Select action a_j based on the probability of x_{j,f}
        Execute action a_j in the emulator and observe reward r_j and state s_{j+1}
        Store transition (a_j, r_j, s_j, s_{j+1}) in D
        Set the current state s ← s_{j+1}
        if D contains enough transitions then
            Sample a random minibatch of transitions (a_k, r_k, s_k, s_{k+1}) from D
            Compute the TD target y_k = r_k + γV(s_{k+1}) (y_k = r_k if s_{k+1} is terminal)
            // Update actor weights
            Compute the free-phase state x_{k,f} with Eqs 1, 2
            Compute the weakly clamped state x_{k,β}, nudging the output toward the training target with strength β
            Compute the EP update Δw ∝ (1/β)[ρ(x_{k,β})ρ(x_{k,β})^T − ρ(x_{k,f})ρ(x_{k,f})^T] and apply it
            // Update critic weights
            Perform a gradient step on (y_k − V(s_k))² with respect to the critic weights
        end if
    end for
end for
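To make the two-phase actor update concrete, below is a minimal NumPy sketch of one EP update for a single-hidden-layer actor, under standard EP assumptions (hard-sigmoid units, symmetric feedback weights, biases omitted). The relaxation step size `dt`, the zero-state initialization, and the function names `relax` and `ep_actor_update` are our own illustrative choices, not the authors' exact implementation; the nudging `target` is taken as given here, whereas in the full algorithm it is derived from the critic's TD signal.

```python
import numpy as np

def rho(x):
    """Hard-sigmoid activation commonly used in EP."""
    return np.clip(x, 0.0, 1.0)

def rho_prime(x):
    """Derivative of the hard sigmoid (1 inside the linear region)."""
    return ((x > 0.0) & (x < 1.0)).astype(float)

def relax(s, h, y, W1, W2, n_iters, beta=0.0, target=None, dt=0.5):
    """Let the hidden (h) and output (y) states settle toward an energy
    minimum. The input s stays clamped; beta > 0 adds the weak nudging
    force that pulls the output toward `target` (second phase only)."""
    for _ in range(n_iters):
        dh = rho_prime(h) * (rho(s) @ W1 + rho(y) @ W2.T) - h
        dy = rho_prime(y) * (rho(h) @ W2) - y
        if beta > 0.0:
            dy += beta * (target - y)   # weak clamping toward the target
        h = h + dt * dh
        y = y + dt * dy
    return h, y

def ep_actor_update(s, target, W1, W2, alpha1, alpha2, beta,
                    n_free=150, n_clamped=25):
    """One two-phase EP update: free phase, weakly clamped phase, then
    the contrastive Hebbian-style weight change."""
    h0 = np.zeros(W1.shape[1])
    y0 = np.zeros(W2.shape[1])
    h_f, y_f = relax(s, h0, y0, W1, W2, n_free)                    # free phase
    h_c, y_c = relax(s, h_f, y_f, W1, W2, n_clamped, beta, target) # clamped phase
    # Delta w_ij ~ (1/beta) [rho(x_i^beta) rho(x_j^beta) - rho(x_i^0) rho(x_j^0)]
    W1 += (alpha1 / beta) * (np.outer(rho(s), rho(h_c)) - np.outer(rho(s), rho(h_f)))
    W2 += (alpha2 / beta) * (np.outer(rho(h_c), rho(y_c)) - np.outer(rho(h_f), rho(y_f)))
    return W1, W2, y_f   # free-phase output y_f yields the action probabilities
```

In the loop above, y_f would be passed through a softmax to sample a_j, while the critic separately takes an ordinary BP gradient step on (y_k − V(s_k))².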
FIGURE 1. Environments used as tasks for our model: CartPole-v0 (left), Acrobot-v1 (center), and LunarLander-v2 (right). CartPole-v0: an unstable pole is mounted on a cart; the goal is to keep the pole balanced by moving the cart left or right. Acrobot-v1: a robot arm composed of two joints; the goal is to swing the arm up to reach the black horizontal line. LunarLander-v2: a spaceship is descending; the goal is to land it smoothly between the flags by controlling its movement.
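All three tasks are standard OpenAI Gym environments. A minimal interaction loop, assuming the classic Gym API (pre-0.26 reset/step signatures) and a random placeholder policy, might look like this:

```python
import gym  # LunarLander-v2 additionally requires the box2d extra

# The three benchmark environments from Figure 1.
for env_id in ["CartPole-v0", "Acrobot-v1", "LunarLander-v2"]:
    env = gym.make(env_id)
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # random policy as a stand-in for the actor
        state, reward, done, info = env.step(action)
        total_reward += reward
    print(f"{env_id}: obs dim {env.observation_space.shape[0]}, "
          f"{env.action_space.n} actions, episode reward {total_reward:.1f}")
    env.close()
```

The observation and action dimensions reported by this loop (4/2, 6/3, and 8/4) match the input and output layer sizes in the parameter table below.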
Parameters for our models on each task.
| Task | NN Actor | NN Critic | α1 for Actor | α2 for Actor | β for Actor | α for Critic | Iterations for Actor (1st phase) | Iterations for Actor (2nd phase) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CartPole | 4-256-2 | 4-256-1 | 0.0001 | 0.0001 | 0.02 | 0.001 | 150 | 25 |
| Acrobot | 6-256-3 | 6-256-1 | 0.001 | 0.001 | 0.02 | 0.001 | 150 | 25 |
| LunarLander | 8-512-4 | 8-512-1 | 0.0001 | 0.002 | 0.03 | 0.0003 | 180 | 25 |
NN gives the number of neurons in each layer, α1 is the learning rate for the weights between the input and hidden layers, α2 is the learning rate for the weights between the hidden and output layers, and the 1st and 2nd phases give the durations (in iterations) of the free and weakly clamped phases, respectively. Results for additional learning rates may be found in our Supplementary Section “The other learning rates for EP-BP.”
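The table translates directly into per-task settings; a hypothetical configuration dictionary (the key names are ours, chosen for illustration) could be written as:

```python
# Per-task hyperparameters taken from the table above (key names are our own).
CONFIGS = {
    "CartPole-v0":    dict(actor_sizes=(4, 256, 2), critic_sizes=(4, 256, 1),
                           alpha1=1e-4, alpha2=1e-4, beta=0.02, critic_lr=1e-3,
                           free_iters=150, clamped_iters=25),
    "Acrobot-v1":     dict(actor_sizes=(6, 256, 3), critic_sizes=(6, 256, 1),
                           alpha1=1e-3, alpha2=1e-3, beta=0.02, critic_lr=1e-3,
                           free_iters=150, clamped_iters=25),
    "LunarLander-v2": dict(actor_sizes=(8, 512, 4), critic_sizes=(8, 512, 1),
                           alpha1=1e-4, alpha2=2e-3, beta=0.03, critic_lr=3e-4,
                           free_iters=180, clamped_iters=25),
}
```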
FIGURE 2. Reward vs. episode for CartPole-v0 (left), Acrobot-v1 (center), and LunarLander-v2 (right) for both backpropagation (BP) and EP-BP. Solid lines show the mean across 8 runs and shaded areas denote the standard deviation. Note that for Acrobot-v1, the agent receives a reward of −1 at each step until it reaches the target.
FIGURE 3. Average reward and standard error of the mean (SEM) over the last 25% of episodes for backpropagation (BP) and EP-BP on CartPole-v0, Acrobot-v1, and LunarLander-v2.
FIGURE 4. Average variability and standard error of the mean (SEM) over the last 25% of episodes for backpropagation (BP) and EP-BP on CartPole-v0, Acrobot-v1, and LunarLander-v2.
FIGURE 5. Mean and standard error of the mean (SEM; shaded area) of the probabilities of the actions that the EP-BP and backpropagation (BP) models take on CartPole-v0.