| Literature DB >> 31447628 |
Streaming Batch Eigenupdates for Hardware Neural Networks
Brian D. Hoskins, Matthew W. Daniels, Siyuan Huang, Advait Madhavan, Gina C. Adam, Nikolai Zhitenev, Jabez J. McClelland, Mark D. Stiles.
Abstract
Neural networks based on nanodevices, such as metal oxide memristors, phase change memories, and flash memory cells, have generated considerable interest for their increased energy efficiency and density in comparison to graphics processing units (GPUs) and central processing units (CPUs). Though immense acceleration of the training process can be achieved by leveraging the fact that the time complexity of training does not scale with the network size, it is limited by the space complexity of stochastic gradient descent, which grows quadratically. The main objective of this work is to reduce this space complexity by using low-rank approximations of stochastic gradient descent. This low spatial complexity, combined with streaming methods, allows for significant reductions in memory and compute overhead, opening the door to improvements in the area, time, and energy efficiency of training. We refer to this algorithm, and the architecture that implements it, as the streaming batch eigenupdate (SBE) approach.
Keywords: back propagation; memristor; network training; neuromorphic; singular value decomposition; stochastic gradient descent
Year: 2019 PMID: 31447628 PMCID: PMC6691093 DOI: 10.3389/fnins.2019.00793
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
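The low-rank premise of the abstract can be made concrete with a short NumPy sketch (illustrative shapes and synthetic data, not the authors' code): a minibatch gradient is a sum of n outer products x_i δ_i^T, so truncating it to rank k via an SVD lets it be applied to a crossbar as k outer-product updates while storing only k(a + b) values.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n, k = 256, 128, 64, 4    # layer dimensions, minibatch size, target rank

# A minibatch gradient for one weight layer is a sum of n outer products
# x_i delta_i^T, so its rank is at most n and is often effectively much lower.
X = rng.standard_normal((a, n))   # forward activations, one column per sample
D = rng.standard_normal((b, n))   # backpropagated errors, one column per sample
G = X @ D.T                       # full a x b batch gradient, O(ab) memory

# Rank-k truncation: keep the k largest singular triplets (the Eckart-Young
# optimal approximation). The truncated update can be applied to a crossbar
# as k outer-product programming passes, storing only k(a + b) numbers.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
G_k = (U[:, :k] * s[:k]) @ Vt[:k]

rel_err = np.linalg.norm(G - G_k) / np.linalg.norm(G)
print(f"rank-{k} relative error: {rel_err:.3f}")
```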
Comparison of asymptotic scaling for common methods of a × b crossbar training.
| Method | Update type | Memory |
| SGD | Outer product | 0 |
| MBGD1 | Outer product | 0 |
| MBGD2 | Sequential or column-wise | O(ab) |
| Rank-k SVD | Outer product | O(k(a + b)) |
| Rank-k SBE | Outer product | O(k(a + b)) |
We define m as the total number of minibatches, n as the number of training samples per minibatch, and k as the decomposition rank where appropriate. For the SVD method, we take the scaling laws for the R-SVD algorithm, choosing μ = max(a, b) and η = min(a, b) to get the best scaling (Golub and Van Loan).
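The μ = max(a, b), η = min(a, b) choice above amounts to doing the expensive linear algebra in the smaller matrix dimension. As a hedged illustration of that economy (a Gram-matrix shortcut with the same O(μη²) scaling, not the actual R-SVD of Golub and Van Loan):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2048, 64                  # mu = max(a, b) = 2048, eta = min(a, b) = 64
G = rng.standard_normal((a, b))

# Work in the smaller dimension eta: forming the eta x eta Gram matrix costs
# O(mu * eta^2) and its eigendecomposition costs O(eta^3), instead of paying
# the full SVD cost in the long dimension.
C = G.T @ G
evals, V = np.linalg.eigh(C)                   # ascending eigenvalues
order = np.argsort(evals)[::-1]
s = np.sqrt(np.clip(evals[order], 0.0, None))  # singular values of G
V = V[:, order]                                # right singular vectors
U = G @ (V / s)                                # left singular vectors, O(mu * eta^2)

print(np.allclose(s, np.linalg.svd(G, compute_uv=False)))  # True
```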
Figure 1. (A) Example of the contribution of the normalized singular values to the batch update for the middle layer of a 784 × 256 × 128 × 10 network trained on MNIST with ReLU and sigmoidal activations. The batch size is 10,000. (B) Cumulative sum of the contributions of the first k singular values. The sum of the first few vectors approaches the total sum, one, showing that they contain most of the batch information.
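The analysis in Figure 1 is easy to reproduce on synthetic data; a minimal sketch, assuming a low-rank correlated batch rather than the paper's trained MNIST layer:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, n, r = 256, 128, 10_000, 8   # layer shape, batch size, latent rank

# Correlated samples give the batch gradient the low-rank structure the
# figure relies on; i.i.d. noise alone would flatten the spectrum.
coeffs = rng.standard_normal((r, n))
X = rng.standard_normal((a, r)) @ coeffs      # activations
D = rng.standard_normal((b, r)) @ coeffs      # errors
G = X @ D.T / n                               # batch-averaged gradient

s = np.linalg.svd(G, compute_uv=False)
s_norm = s / s.sum()              # panel (A): normalized singular values
cumulative = np.cumsum(s_norm)    # panel (B): partial sums approaching one
print(cumulative[:r])             # a few values carry most of the batch
```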
Figure 2. Simplified comparison of the training algorithms for (A) stochastic gradient descent (SGD), (B) mini-batch gradient descent (MBGD), (C) the singular value decomposition (SVD) approximation of the batch, and (D) streaming batch eigenupdates (SBE). Both SGD and SBE are rank 1 and calculated on the fly, achieving the highest degree of acceleration.
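A sketch of the on-the-fly, rank-1 flavor shared by SGD and SBE: the update direction is estimated from the stream of per-sample outer products without ever forming the full gradient. This uses plain streaming power iteration on synthetic data as a stand-in, not the paper's exact SBE update rule:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, n = 256, 128, 512

# Synthetic stream with a dominant rank-1 component p q^T, so the implicit
# batch gradient G = sum_i x_i delta_i^T has one large singular value.
p = rng.standard_normal(a); p /= np.linalg.norm(p)
q = rng.standard_normal(b); q /= np.linalg.norm(q)
c = rng.standard_normal(n)
xs = np.outer(c, p) + 0.1 * rng.standard_normal((n, a))
ds = np.outer(c, q) + 0.1 * rng.standard_normal((n, b))

# Two streaming passes of power iteration: only O(a + b) state is kept and
# the a x b gradient is never materialized.
u = rng.standard_normal(a); u /= np.linalg.norm(u)
v = np.zeros(b)
for x, d in zip(xs, ds):          # pass 1: v accumulates G^T u
    v += (x @ u) * d
v /= np.linalg.norm(v)
u = np.zeros(a)
for x, d in zip(xs, ds):          # pass 2: u accumulates G v
    u += (d @ v) * x
sigma = np.linalg.norm(u); u /= sigma

# Compare with the leading singular triplet of the explicit gradient.
U, s, Vt = np.linalg.svd(xs.T @ ds, full_matrices=False)
print(abs(u @ U[:, 0]), abs(v @ Vt[0]), sigma / s[0])
```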
Figure 3. Difference between the SBE values and the full SVD values for (A,C) the singular vectors X and δ and (B,D) the singular values. Batch sizes are 32 (A,B) and 1024 (C,D). The larger batches show greater fidelity with more iterations. The sharp increases in the difference correspond to updates of the weight matrix and the subsequent change in the gradient.
Figure 4. Test set error rate vs. the number of (A) epochs and (B) matrix updates. Training set loss under different SGD and batch learning rules (batch size 32) vs. the number of (C) epochs and (D) matrix updates. The SVD and SBE algorithms required more epochs to train but fewer matrix updates.
Figure 5. Summary of the impact of the different training rules vs. batch size, including (A) the number of epochs needed to bring the training set loss function down to 0.1 (dashed lines) and 0.01 (solid lines), and (B) the number of matrix updates needed to bring the loss function to 0.1 (dashed lines) and 0.01 (solid lines). The SVD and SBE training rules increase the update efficiency, though not as much as a full batch update.