Title: Mixed-Precision Deep Learning Based on Computational Memory
Authors: S. R. Nandakumar, Manuel Le Gallo, Christophe Piveteau, Vinay Joshi, Giovanni Mariani, Irem Boybat, Geethan Karunaratne, Riduan Khaddam-Aljameh, Urs Egger, Anastasios Petropoulos, Theodore Antonakopoulos, Bipin Rajendran, Abu Sebastian, Evangelos Eleftheriou.
Abstract
Deep neural networks (DNNs) have revolutionized the field of artificial intelligence and have achieved unprecedented success in cognitive tasks such as image and speech recognition. Training of large DNNs, however, is computationally intensive and this has motivated the search for novel computing architectures targeting this application. A computational memory unit with nanoscale resistive memory devices organized in crossbar arrays could store the synaptic weights in their conductance states and perform the expensive weighted summations in place in a non-von Neumann manner. However, updating the conductance states in a reliable manner during the weight update process is a fundamental challenge that limits the training accuracy of such an implementation. Here, we propose a mixed-precision architecture that combines a computational memory unit performing the weighted summations and imprecise conductance updates with a digital processing unit that accumulates the weight updates in high precision. A combined hardware/software training experiment of a multilayer perceptron based on the proposed architecture using a phase-change memory (PCM) array achieves 97.73% test accuracy on the task of classifying handwritten digits (based on the MNIST dataset), within 0.6% of the software baseline. The architecture is further evaluated using accurate behavioral models of PCM on a wide class of networks, namely convolutional neural networks, long short-term memory (LSTM) networks, and generative adversarial networks (GANs). Accuracies comparable to those of floating-point implementations are achieved without being constrained by the non-idealities associated with the PCM devices. A system-level study demonstrates a 172× improvement in the energy efficiency of the architecture when used for training a multilayer perceptron, compared with a dedicated fully digital 32-bit implementation.
Keywords: deep learning; in-memory computing; memristive devices; mixed-signal design; phase-change memory
Year: 2020 PMID: 32477047 PMCID: PMC7235420 DOI: 10.3389/fnins.2020.00406
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
Figure 1. Mixed-precision computational memory architecture for deep learning. (A) A neural network consisting of layers of neurons with weighted interconnects. During forward propagation, the neuron response, x_{l−1}, is weighted according to the connection strengths, W_l, and summed. Subsequently, a non-linear function, f, is applied to determine the response of the next neuron layer, x_l. During backward propagation, the error, δ_l, is back-propagated through the weight layer connections, W_l, to determine the error, δ_{l−1}, of the preceding layer. (B) The mixed-precision architecture consists of a computational memory unit and a high-precision digital unit. The computational memory unit has several crossbar arrays whose device conductance values, G, represent the weights, W, of the DNN layers. The crossbar arrays perform the weighted summations during the forward and backward propagations. The resulting x and δ values are used to determine the weight updates, ΔW, in the digital unit. The ΔW values are accumulated in a variable, χ. The conductance values are updated by applying p = ⌊χ/ϵ⌋ pulses to the corresponding devices in the computational memory unit, where ϵ represents the device update granularity.
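The accumulate-and-program rule in (B) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the function name is hypothetical, and the interpretation of ⌊χ/ϵ⌋ as a signed, round-toward-zero pulse count is an assumption:

```python
import numpy as np

def mixed_precision_weight_update(chi, delta_w, eps):
    """Accumulate weight updates in high precision and emit device pulses.

    chi     : high-precision accumulator (same shape as the weight matrix)
    delta_w : weight updates computed from x and delta in the digital unit
    eps     : device update granularity epsilon
    Returns the signed pulse counts p for the crossbar devices and the
    updated accumulator.
    """
    chi = chi + delta_w
    p = np.trunc(chi / eps)   # pulses fire only once |chi| exceeds eps
    chi = chi - p * eps       # retain the sub-granularity residue in chi
    return p.astype(int), chi
```

Because most accumulated updates stay below ϵ between iterations, p is zero for the vast majority of devices, consistent with the sparse device programming shown in Figure 3C.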
Figure 2. Phase-change memory characterization experiments and model response. (A) The mean and standard deviation of device conductance values (and the corresponding model response) as a function of the number of SET pulses of 90 μA amplitude and 50 ns duration. The conductance was read 38.6 s after the application of each SET pulse. The 10,000 PCM devices used for the measurement were initialized to a distribution around 0.06 μS. (B) The distribution of conductance values compared to that predicted by the model after the application of 15 SET pulses. (C) The average conductance drift of the states programmed after each SET pulse. The corresponding model fit is based on the relation G(t) = G(t₀)(t/t₀)^(−ν), which relates the conductance G after time t from programming to the conductance G(t₀) measured at time t₀ and the drift exponent ν. Each color corresponds to the conductance read after the application of a certain number of SET pulses, ranging from 1 to 20. (D) Experimentally measured conductance evolution from 5 devices upon application of successive SET pulses compared to that predicted by the model. These measurements are based on 50 reads that follow each of the 20 programming instances. Each line with a different color shade corresponds to a different device.
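For reference, the drift relation from (C) is straightforward to evaluate. A small sketch follows, where the reference time t₀ = 38.6 s matches the read delay in (A); the default drift-exponent value is an assumed placeholder, since ν is state-dependent and fitted per state in the paper:

```python
def drifted_conductance(g_t0, t, t0=38.6, nu=0.05):
    """PCM conductance drift: G(t) = G(t0) * (t / t0)**(-nu).

    g_t0 : conductance (e.g., in uS) measured at time t0 after programming
    t    : elapsed time since programming, in the same units as t0
    nu   : drift exponent (state-dependent; 0.05 is an assumed value)
    """
    return g_t0 * (t / t0) ** (-nu)
```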
Figure 3. MCA training experiment using on-chip PCM devices for handwritten digit classification. (A) Network structure used for the on-chip mixed-precision training experiment for MNIST data classification. Each weight, W, in the network is realized as the difference in conductance values of two PCM devices, G⁺ and G⁻. (B) Stochastic conductance evolution during training of the G⁺ and G⁻ values corresponding to 5 arbitrarily chosen synaptic weights from the second layer. Each color corresponds to a different synaptic weight. (C) The number of device updates per epoch from the two weight layers in the mixed-precision training experiment and in high-precision software training (FP64), showing the highly sparse nature of weight updates in the MCA. (D) Classification accuracies on the training (dark-shaded curves) and test (blue-shaded curves) sets from the mixed-precision training experiment. The maximum experimental test set accuracy, 97.73%, is within 0.57% of that obtained in the FP64 training. The experimental behavior is closely matched by the training simulation using the PCM model. The shaded areas in the PCM model curves represent one standard deviation over 5 training simulations. (E) Inference performed on-chip using the trained PCM weights on the training and test datasets as a function of time elapsed after training, showing negligible accuracy drop over a period of 1 month.
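A sketch of the differential-pair programming implied by (A) and (B): positive updates are applied as SET pulses to G⁺ and negative updates to G⁻, since SET pulses can only increase PCM conductance. The constant per-pulse increment used here is an assumption for illustration; the characterization in Figure 2 shows the real increment is stochastic and state-dependent:

```python
def program_weight(g_pos, g_neg, p, dg_per_pulse=1.0):
    """Apply p signed SET pulses to a differential PCM pair (W ∝ G⁺ − G⁻).

    g_pos, g_neg : conductances of the two devices (e.g., in uS)
    p            : signed pulse count from the high-precision accumulator
    dg_per_pulse : assumed mean conductance increase per SET pulse
    """
    if p > 0:
        g_pos += p * dg_per_pulse    # potentiate the weight via G⁺
    elif p < 0:
        g_neg += -p * dg_per_pulse   # depress the weight by increasing G⁻
    return g_pos, g_neg
```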
Figure 4. MCA training validation on complex networks. (A) Convolutional neural network for image classification on the CIFAR-10 dataset used for MCA training simulations. The block of two convolution layers followed by max-pooling and dropout is repeated three times and is followed by three fully-connected layers. (B) Classification performance on the training and test datasets during training. The test accuracy corresponding to MCA-based training eventually exceeds that from the high-precision (FP32) training. (C) The maximum training and test accuracies obtained as a function of the dropout rate. In the absence of dropout, MCA-based training significantly outperforms FP32-based training. (D) The LSTM network used for MCA training simulations. Two LSTM cells with 512 hidden units each, followed by a fully-connected (FC) layer, are used. (E) The bits-per-character (BPC) measure as a function of the epoch number on the training and validation sets shows that after 100 epochs, the validation BPC is comparable between the MCA and FP32 approaches. The network uses a dropout rate of 0.15 on the non-recurrent connections. (F) The best BPC obtained after training as a function of the dropout rate indicates that without any dropout, MCA-based training delivers better test performance (lower BPC) than FP32 training. (G) The GAN used for MCA training simulations. The generator and discriminator networks are fully-connected. The discriminator and the generator are trained alternately using real images from the MNIST dataset and images produced by the generator. (H) The Fréchet distance and the generated images obtained from MCA training are compared with those obtained from FP32 training. (I) Performance, measured in terms of the Fréchet distance, as a function of the number of epochs for different mini-batch sizes and optimizers in FP32 training. A mini-batch size greater than 1 and the use of momentum are necessary for the GAN training to converge.
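The Fréchet distance used in (H) and (I) is commonly computed between Gaussian fits of real and generated image statistics. The following is a standard sketch of that computation; the exact feature space and estimator used in the paper are not specified here:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, cov_r, mu_g, cov_g):
    """Frechet distance between N(mu_r, cov_r) and N(mu_g, cov_g):
    d^2 = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 (cov_r cov_g)^(1/2)).
    """
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical imaginary residue
    d2 = np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean)
    return float(np.sqrt(max(d2, 0.0)))
```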
Energy and time, estimated from application-specific integrated circuit (ASIC) designs, for processing one training image in the MCA and in corresponding fully digital 32-bit and mixed-precision designs.
| Design | Metric | Forward propagation | Backward propagation | Weight update | Total |
| --- | --- | --- | --- | --- | --- |
| 32-bit design | Energy | 5.62 μJ | 0.09 μJ | 8.64 μJ | 14.35 μJ |
| 32-bit design | Time | 7.31 μs | 0.59 μs | 15.36 μs | 23.27 μs |
| Fully digital mixed-precision design (4-bit weights, 8-bit activations/errors) | Energy | 1.78 μJ | 0.016 μJ | 0.076 μJ | 1.87 μJ |
| Fully digital mixed-precision design | Time | 6.41 μs | 0.13 μs | 0.79 μs | 7.33 μs |
| MCA: computational memory | Energy | 7.29 nJ | 2.15 nJ | 0.05 nJ | |
| MCA: computational memory | Time | 0.27 μs | 0.13 μs | – | |
| MCA: digital unit | Energy | 8.97 nJ | 2.76 nJ | 61.98 nJ | |
| MCA: digital unit | Time | 0.34 μs | 0.09 μs | 1.19 μs | |
| MCA: total | Energy | 16.3 nJ | 4.91 nJ | 62.03 nJ | 83.2 nJ |
| MCA: total | Time | 0.61 μs | 0.22 μs | 1.19 μs | 2.02 μs |
The numbers are for a specific two-layer perceptron with 785 input neurons (28 × 28 = 784 pixels plus a bias), 250 hidden neurons, and 10 output neurons.
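The 172× energy-efficiency figure quoted in the abstract follows directly from the totals above: 14.35 μJ / 83.2 nJ = 14,350 nJ / 83.2 nJ ≈ 172.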