Maxence Ernoult, Julie Grollier, Damien Querlioz.
Abstract
One of the biggest stakes in nanoelectronics today is to meet the needs of Artificial Intelligence by designing hardware neural networks which, by fusing computation and memory, process and learn from data with limited energy. For this purpose, memristive devices are excellent candidates to emulate synapses. A challenge, however, is to map existing learning algorithms onto a chip: for a physical implementation, a learning rule should ideally be local and tolerant to the typical intrinsic imperfections of such memristive devices. Restricted Boltzmann Machines (RBMs), thanks to their local learning rule and inherent tolerance to stochasticity, comply with both of these constraints and constitute a highly attractive algorithm for achieving memristor-based Deep Learning. Based on simulations, this work gives insights into designing simple programming protocols for memristive devices to train Boltzmann Machines on chip. Among RBM-based neural networks, we advocate using a Discriminative RBM, with two hardware-oriented adaptations. We propose a pulse width selection scheme based on the sign of two successive weight updates, and show that it removes the need to precisely tune the initial programming pulse width as a hyperparameter. We also propose to evaluate the weight update requested by the algorithm across several samples and stochastic realizations. We show that this strategy brings partial immunity against the most severe memristive device imperfections, such as the non-linearity and stochasticity of the conductance updates, as well as device-to-device variability.
Year: 2019 PMID: 30755662 PMCID: PMC6372620 DOI: 10.1038/s41598-018-38181-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1 (a) Each weight is implemented with two memristors, i.e. W = G+ − G−; conductance updates follow the memristor characteristic given by Eq. (1). (b–d) Illustration of the memristive imperfections taken into account, from left to right: (b) non-linearity (conductance-dependent update), (c) cycle-to-cycle variability, (d) device-to-device variability.
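The differential-pair weight and the three imperfections listed in the caption can be illustrated with a small simulation. This is only a sketch: Eq. (1) is not reproduced in this record, so the exponentially saturating update, the additive Gaussian noise, the normalized bounds `G_MIN`/`G_MAX`, and the names `pulse`, `Synapse` and `make_synapses` are all assumptions made for illustration, not the paper's device model.

```python
import math
import random

G_MIN, G_MAX = 0.0, 1.0  # assumed normalized conductance bounds


def pulse(g, dt, beta, sigma_c2c=0.0, rng=random):
    """Apply one potentiating pulse of relative width dt to conductance g.

    Assumed model (not the paper's Eq. (1)): the increment shrinks
    exponentially as g approaches G_MAX. beta sets the non-linearity and
    sigma_c2c is the std of additive cycle-to-cycle Gaussian noise.
    """
    x = (g - G_MIN) / (G_MAX - G_MIN)
    dg = dt * (G_MAX - G_MIN) * math.exp(-beta * x)
    dg += rng.gauss(0.0, sigma_c2c)
    return min(G_MAX, max(G_MIN, g + dg))


class Synapse:
    """Weight stored as a differential pair of conductances: W = G+ - G-."""

    def __init__(self, beta, sigma_c2c=0.0):
        self.g_plus = G_MIN
        self.g_minus = G_MIN
        self.beta = beta
        self.sigma_c2c = sigma_c2c

    @property
    def weight(self):
        return self.g_plus - self.g_minus

    def update(self, sign, dt, rng=random):
        # A positive update potentiates G+, a negative one potentiates G-.
        if sign > 0:
            self.g_plus = pulse(self.g_plus, dt, self.beta, self.sigma_c2c, rng)
        elif sign < 0:
            self.g_minus = pulse(self.g_minus, dt, self.beta, self.sigma_c2c, rng)


def make_synapses(n, beta_mean, sigma_d2d, rng=random):
    """Device-to-device variability: each device draws its own beta (assumed Gaussian)."""
    return [Synapse(max(0.0, rng.gauss(beta_mean, sigma_d2d * beta_mean)))
            for _ in range(n)]
```

With `beta` close to zero, successive pulses of equal width produce nearly equal increments. With a larger `beta`, each pulse moves the conductance less than the previous one, which is the conductance-dependent update shown in panel (b).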
Test error rate achieved by the three architectures under study on MNIST with typical values of the device parameters in terms of non-linearity, cycle-to-cycle and device-to-device variability.
| | | RBM + softmax | Discriminative RBM | Deep Belief Net |
|---|---|---|---|---|
| Network topology | | 785-301-10 | 795-301 | 785-501-511-2001 |
| Software-based | Test error | 7.0 ± 0.5% | 6.6 ± 0.3% | 5.7 ± 0.1% |
| Near-linear device | Test error | 8.3 ± 0.1% | 6.4 ± 0.2% | 6.6 ± 0.2% |
| Non-linear device | Test error | 17.3 ± 0.3% | 15 ± 0.1% | 28.4 ± 0.3% |
| Cycle-to-cycle variability | Test error | 15.1 ± 0.4% | 11.9 ± 0.5% | 9.2 ± 0.3% |
| Device-to-device variability | Test error | 20.3 ± 0.3% | 13.9 ± 0.5% | 22.6 ± 0.5% |
Each topology includes the bias. Each simulation was performed over 30 epochs with a mini-batch size of 100; we report the mean error rate and the variance over five trials.
Figure 2 Schematics of the three architectures under study: (a) a Restricted Boltzmann Machine topped by a softmax classifier (“RBM + softmax”), (b) a Discriminative Restricted Boltzmann Machine (“Discriminative RBM”), (c) a Deep Belief Net. Blue, grey and green filled circles stand for visible, hidden and label neurons respectively.
Figure 3 Examples of hidden features extracted by, from left to right: a standard RBM (trained with a learning rate of 0.05), a memristive Discriminative RBM (trained under Cst with β = 0.005, Δt/Δt = 1/1000), and another memristive Discriminative RBM (trained under Cst with β = 3, Δt/Δt = 1/5000). Each gray-scale picture represents the values of the 784 weights connecting the visible layer to a given hidden unit.
Figure 4 (a) Test error rate achieved by the Discriminative RBM as a function of the programming pulse width, for different mini-batch sizes and different numbers of parallel Gibbs chains used to evaluate the Contrastive Divergence term (‘# CD’), in the near-linear case (β = 0.005). (b) Same as (a) in the non-linear case (β = 3). (c) Optimal test error rate achieved by the Discriminative RBM for different values of β, with different mini-batch sizes and different numbers of parallel Gibbs chains used to evaluate the Contrastive Divergence term (‘# CD’). For mini-batches of size 100, each simulation was run over 30 epochs, 5 times per value of pulse width; error bars indicate median, first quartile and third quartile. For mini-batches of size 1, each simulation was run over 50 epochs, 5 times per value of pulse width, to ensure convergence.
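The two knobs varied in Figure 4, the mini-batch size and the number of parallel Gibbs chains, both control how many stochastic estimates are averaged before a weight update is requested. The sketch below shows that averaging for one-step Contrastive Divergence (CD-1) on a plain, bias-free binary RBM. It is a simplified stand-in, not the Discriminative RBM used in the paper, and the helper names (`sample_layer`, `cd1_update`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_layer(inputs, weights, rng):
    """Sample binary units of one layer given the other layer.

    weights[i][j] links input unit i to output unit j.
    """
    n_out = len(weights[0])
    out = []
    for j in range(n_out):
        act = sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        out.append(1 if rng.random() < sigmoid(act) else 0)
    return out


def transpose(w):
    return [list(col) for col in zip(*w)]


def cd1_update(batch, weights, n_chains, rng):
    """Average the CD-1 statistic v0*h0 - v1*h1 over every sample of the
    mini-batch and over n_chains independent Gibbs chains per sample."""
    n_v, n_h = len(weights), len(weights[0])
    w_t = transpose(weights)
    grad = [[0.0] * n_h for _ in range(n_v)]
    for v0 in batch:
        for _ in range(n_chains):
            h0 = sample_layer(v0, weights, rng)   # positive phase
            v1 = sample_layer(h0, w_t, rng)       # reconstruction
            h1 = sample_layer(v1, weights, rng)   # negative phase
            for i in range(n_v):
                for j in range(n_h):
                    grad[i][j] += v0[i] * h0[j] - v1[i] * h1[j]
    scale = 1.0 / (len(batch) * n_chains)
    return [[g * scale for g in row] for row in grad]


# Usage: 2 samples, 20 chains each, on a 3-visible / 2-hidden RBM.
rng = random.Random(0)
weights = [[0.0, 0.0] for _ in range(3)]
update = cd1_update([[1, 0, 1], [0, 1, 1]], weights, n_chains=20, rng=rng)
```

Each entry of `update` is a mean over `len(batch) * n_chains` terms taking values in {−1, 0, 1}, so increasing either factor reduces the variance of the requested update.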
Figure 5 Typical time trace of the Δt/Δt statistics in terms of the mean value (solid blue line) and standard deviation (shaded blue region around the line).
RProp truth table for any mini-batch size and number of parallel Gibbs chains. The notation is defined in the body text.
[Truth table entries not recoverable from the extracted text; only the labels "Case" and "5 other cases" survive.]
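The abstract describes a pulse width selection scheme driven by the sign of two successive weight updates. A minimal sketch of that idea, following the textbook RProp convention, is given below. The truth table above did not survive extraction, so the growth and shrink factors (`eta_plus`, `eta_minus`), the bounds (`dt_min`, `dt_max`) and the treatment of a zero update are assumptions, and the paper's exact rule may differ.

```python
def rprop_pulse_width(dt, prev_sign, cur_sign,
                      eta_plus=1.2, eta_minus=0.5,
                      dt_min=1e-6, dt_max=1e-1):
    """Return the next programming pulse width for one synapse.

    Textbook RProp convention (assumed, not taken from the paper):
    - same sign twice in a row: the update direction is stable, widen the pulse;
    - sign flip: the previous step overshot, shorten the pulse;
    - either update is zero: leave the pulse width unchanged.
    """
    agreement = prev_sign * cur_sign
    if agreement > 0:
        dt *= eta_plus
    elif agreement < 0:
        dt *= eta_minus
    # Keep the pulse width inside the range the hardware is assumed to allow.
    return min(dt_max, max(dt_min, dt))


# Usage: a synapse whose requested update keeps the same sign, then flips.
dt = 1e-3
for prev, cur in [(+1, +1), (+1, +1), (+1, -1)]:
    dt = rprop_pulse_width(dt, prev, cur)
```

Because the pulse width adapts per synapse during training, the initial value mostly affects how long the adaptation takes, which is consistent with the claim that it no longer needs precise tuning as a hyperparameter.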
Figure 6 (a) Test error rate achieved by the Discriminative RBM as a function of the programming pulse width when trained with Cst and RProp-driven pulse widths for β = 0.005. (b) Same as (a) with β = 3. Grey dashed lines indicate 10% and 20% on the left and right panels respectively. Each simulation was run over 30 epochs with a mini-batch size of 100, 5 times per value of pulse width; error bars indicate median, first quartile and third quartile.
Figure 7 (a) Test error rate achieved by the Discriminative RBM as a function of cycle-to-cycle variability for every combination of the pulse width programming scheme (Cst, RProp) and number of parallel Gibbs chains used to evaluate Contrastive Divergence (# CD). (b) Optimal conductance increment-to-noise ratio as a function of cycle-to-cycle variability associated with each curve of the left panel. When using 20 Gibbs chains (blue curves), from σ/(G − G) = 6 × 10⁻³ onwards (vertical gray dashed line), the conductance update overcomes the noise increase, accounting for the improved performance compared to the use of a single Gibbs chain (orange curves), regardless of the programming scheme. Each simulation was run over 30 epochs with a mini-batch size of 100, 5 times per value of pulse width; error bars indicate median, first quartile and third quartile.
Figure 8 Test error rate achieved by the Discriminative RBM as a function of device-to-device variability for every combination of the pulse width programming scheme (Cst, RProp) and the number of parallel Gibbs chains (# CD). Each simulation was run over 30 epochs with a mini-batch size of 100, 5 times per value of pulse width; error bars indicate median, first quartile and third quartile.
Summary of the results obtained on the Discriminative RBM.
| | | 1 CD, Cst | 1 CD, RProp | 20 CD, Cst | 20 CD, RProp |
|---|---|---|---|---|---|
| Near-linear device | Test error | 6.4 ± 0.1% | 6 ± 0.2% | 6.7 ± 0.1% | 6.7 ± 0.2% |
| | | [10⁻⁴, 2 × 10⁻²] | [10⁻⁴, 10⁻¹] | [5 × 10⁻⁵, 10⁻²] | [5 × 10⁻⁵, 5 × 10⁻²] |
| Non-linear device | Test error | 14.7 ± 0.2% | 14.9 ± 0.2% | 10.7 ± 0.2% | 10.5 ± 0.2% |
| | | [5 × 10⁻⁵, 3 × 10⁻³] | [5 × 10⁻⁵, 7 × 10⁻²] | [10⁻⁵, 4 × 10⁻³] | [10⁻⁵, 2 × 10⁻²] |
| Cycle-to-cycle variability | Test error | 16.8 ± 0.7% | 18.8 ± 0.3% | 10.2 ± 0.2% | 11.2 ± 0.1% |
| Device-to-device variability | Test error | 13.9 ± 0.6% | 14.9 ± 0.4% | 9.6 ± 0.3% | 10.8 ± 0.3% |
Each simulation was run over 30 epochs with a mini-batch size of 100, 5 times per value of pulse width; error bars indicate median, first quartile and third quartile.
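For reference, the median and quartile error bars mentioned in the captions can be computed from the five runs at a given pulse width as sketched below. The function name `error_bar` is hypothetical and the input values in the usage line are placeholders, not results from the paper. The choice of the "inclusive" quantile method is also an assumption, since the record does not state which convention was used.

```python
import statistics


def error_bar(test_errors):
    """Return (median, first quartile, third quartile) of the test errors
    measured over repeated runs at one pulse width."""
    q1, median, q3 = statistics.quantiles(test_errors, n=4, method="inclusive")
    return median, q1, q3


# Usage with placeholder error rates (in %) from five hypothetical runs.
median, q1, q3 = error_bar([6.1, 6.4, 6.2, 6.6, 6.3])
```

With only five runs, the quartiles are sensitive to the interpolation convention, so the same convention should be used for every curve being compared.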