Recent years have seen a rapid rise of artificial neural networks being employed in a number of cognitive tasks. The ever-increasing computing requirements of these structures have contributed to a desire for novel technologies and paradigms, including memristor-based hardware accelerators. Solutions based on memristive crossbars and analog data processing promise to improve the overall energy efficiency. However, memristor nonidealities can lead to the degradation of neural network accuracy, while attempts to mitigate these negative effects often introduce design trade-offs, such as those between power and reliability. In this work, the authors design nonideality-aware training of memristor-based neural networks capable of dealing with the most common device nonidealities. The feasibility of using high-resistance devices that exhibit high I-V nonlinearity is demonstrated: by analyzing experimental data and employing nonideality-aware training, it is estimated that the energy efficiency of memristive vector-matrix multipliers is improved by almost three orders of magnitude (0.715 TOP s⁻¹ W⁻¹ to 381 TOP s⁻¹ W⁻¹) while maintaining similar accuracy. It is shown that associating the parameters of neural networks with individual memristors makes it possible to bias these devices toward less conductive states through regularization of the corresponding optimization problem, while modifying the validation procedure leads to more reliable estimates of performance. The authors demonstrate the universality and robustness of this approach when dealing with a wide range of nonidealities.
Artificial neural networks (ANNs) are now routinely used in machine learning tasks ranging from text generation[
] to autonomous driving.[
] However, the rapidly increasing number of parameters in modern ANNs makes them time‐ and power‐consuming during both training[
] and inference[
] stages. This makes it challenging to apply machine learning approaches in environments where resources are tightly constrained.[
,
]One of the proposed solutions to improve the efficiency of ANNs has been to adopt different computer architectures. The need to transfer data between memory and computing units in the von Neumann architecture is the main bottleneck in modern computers[
]; this is especially evident in machine learning where large amounts of data are utilized. In this specific case, an alternative can be memristor‐based ANNs, or memristive neural networks (MNNs). With this approach, memristive crossbar arrays are used to physically compute vector‐matrix products, which are one of the most fundamental operations in ANNs.[
,
] This is done without the need to constantly move large amounts of data: matrix entries are encoded into memristor conductances, vector entries—into applied voltages, and the result of the operation is extracted from the output currents produced according to Ohm's law and Kirchhoff's current law.[
] Memristors—when used and programmed as analog devices—can encode values at much higher density, but at the cost of lower precision. A number of nonidealities may prevent device conductances from being programmed accurately or may cause deviations from the intended electrical behavior. Such nonidealities include stuck devices, device‐to‐device (D2D) and cycle‐to‐cycle variability, drift in resistance states, line resistance, I‐V nonlinearity, and others.[
]Potential solutions do exist but many of them introduce a number of trade‐offs. For example, to ensure more linear I‐V characteristics, one may use low‐resistance devices[
]; however, this results in higher power consumption. Alternatively, the effects of I‐V nonlinearities may be minimized by utilizing pulse‐width modulation,[
] but that comes at a cost of increased clock cycles for each encoded input.[
Other techniques of dealing with memristor nonidealities include the following:

- in situ (re)training of weights (or just a subset of them[]) to recover from the effects of nonidealities,[] including in convolutional neural networks (CNNs),[] recurrent structures,[] and in neural networks used for reinforcement learning[]
- modifying device structure, including inserting a buffer layer,[] inserting an electro‐thermal modulation layer,[] and adopting a bilayer structure[]
- using additional circuitry[] to ensure more stable behavior

However, many of these approaches are technology‐specific and thus difficult to apply to different types of devices.

When optimizing the performance of memristive systems (as opposed to individual devices), software approaches may be preferable because they are usually technology‐agnostic. For general applications, mapping or redundancy schemes can be used to mitigate the effects of faulty devices[] or line resistance.[] In the specific context of MNNs, multiple smaller nonideal networks may replace a large one and increase the accuracy in this way.[] Alternatively, modifying ex situ training has been proposed: altering the cost function[] or injecting noise into the synaptic weights[] can make MNNs more robust to the effects of nonidealities.

Memristor‐oriented ex situ training is indeed a very promising method of making MNNs feasible. However, it has been applied by considering only a limited number of nonidealities, while the robustness of this technique is not well understood. In this work, we propose a number of improvements to memristive ex situ training, which are summarized in Figure 1.
Figure 1
Overview of the novel ex situ training technique designed for memristive neural networks. Redefining the output operation of synaptic layers makes it possible to take non‐ohmic device behavior into account, which makes working with power‐efficient high‐resistance (and high‐nonlinearity) devices feasible. Relating weight parameters in artificial neural networks to individual conductances in crossbar arrays makes it possible to further decrease power consumption through regularization, as well as to adapt to nonidealities that depend on the conductance values. Using an aggregate validation metric provides a more reliable way of assessing memristive neural network performance.
First, our novel training technique addresses the problem of nonlinearity. Deviations from the linear relationship between inputs and outputs in crossbar arrays are a major obstacle; however, none of the aforementioned methods directly address this issue. Although existing works often take conductance deviations into account during ex situ training, crossbar arrays are still usually modeled as structures computing a product of a vector of voltages and a matrix of conductances. To reflect non‐ohmic behavior of memristive devices (illustrated in Figure 1) during training, we propose incorporating nonlinearities into the node functions of MNNs. Our aim is to embrace memristor nonlinearity so that the network can learn to be robust toward this nonideality—or even take advantage of it—during training based on stochastic gradient descent. Existing works attempting to take device variations into account during ex situ training often result in potentially lower training accuracy.[
] Depending on the nature of the nonideality that MNNs are trained on, our method could potentially perform even better than conventional ANNs. For example, if the nonlinear response of the device is sufficiently consistent, nonlinearities may increase the network capacity.[
] We also show how to improve nonideality‐aware training by exposing conductances to the training process in a more direct way. By constraining MNN weights to be nonnegative, they can be related to conductances in a linear way. This allows us to

- minimize the effects of nonidealities in cases where their severity depends on the conductance values of the memristors
- employ regularization as a high‐level tool for controlling the power consumption of MNNs

Finally, we propose improving the validation techniques applied during the training of MNNs. Most memristive nonidealities are stochastic in nature, therefore computing validation metrics only once at specified checkpoints may provide unreliable estimates of neural network performance, as illustrated in Figure 1. We propose computing validation metrics multiple times and using an aggregate value (like the median) to determine which version of the MNN to save for inference.

In this work, we explore how to make nonideality‐aware training more effective. We employ experimental data from a SiOx memristor device and a 128 × 64 Ta/HfO2 memristor crossbar array, and consider multiple nonidealities: I‐V nonlinearity, stuck devices, and D2D variability. We give special focus to I‐V nonlinearity because existing works often prefer low‐resistance devices as they exhibit more linear I‐V behavior. We show that our proposed training function enables the use of the more power‐efficient high‐resistance memristors by minimizing the accuracy loss due to high nonlinearity and variability, which typically manifest themselves in high‐resistance devices.[
] We demonstrate how the proposed mapping between nonnegative weights and memristor conductances enables further power savings through ℓ1 regularization. We additionally show how MNN validation during training may be improved by taking the stochastic nature of many of the nonidealities into account.

Importantly, we demonstrate the feasibility of our novel training technique. Networks produced using nonideality‐aware ex situ training are better adapted to nonidealities and can even handle significant uncertainty in device behavior or in designers' understanding of what that behavior is. Weights learned once on a digital system can be transferred onto multiple crossbar‐based physical implementations, even if the nonidealities manifest themselves differently in each system—our methodology does not require retraining once the weights are mapped onto crossbars. To assess the robustness of nonideality‐aware training, we even expose MNNs to different setups from those they were trained for—we find that the networks employing this technique are much more robust than networks trained using the previous, more conventional approaches.
Design
Nonlinearity‐Aware Training
The main focus of this work is the design of a novel ex situ MNN training scheme that can handle nonidealities characterized by a nonlinear relationship between inputs and outputs. Nonidealities like stuck devices, programming inaccuracies, or random telegraph noise can typically be represented by a change in memristor conductance alone. They are thus more straightforward to take into consideration during ex situ training because one can inject noise into the conductance array to represent their effect. We refer to these nonidealities as linearity preserving. On the other hand, nonidealities like I‐V nonlinearity or line resistance cannot be represented by simply disturbing the conductances of crossbar devices. We refer to such nonidealities as linearity nonpreserving. In this work, we redefine the output operations of the synaptic layers to take the effect of linearity‐nonpreserving nonidealities into account. To the best of our knowledge, this is the first time these nonidealities are addressed during ex situ training.

Existing works usually model memristor crossbars as structures computing linear dot products, while only activation functions are assumed to introduce nonlinearities. Specifically, outputs y are calculated using inputs x, weights W, and a nonlinear activation function f, as shown in Equation (1).

y = f(xW)    (1)

During inference, x, W, and y are then mapped onto and from voltages, conductances, and currents, respectively. However, this creates a discrepancy between the linear dot‐product training nodes and memristor nonidealities that deviate from ohmic device behavior.

Therefore, we suggest modifying the output operation of the synaptic layers to reflect the nature of linearity‐nonpreserving nonidealities. Specifically, in cases where the nonlinearity is limited to individual devices (i.e., where devices experience I‐V nonlinearity), we propose replacing the approach in Equation (1) with the approach in Equation (2).

y = f(g(x, W))    (2)

That is, the activation function is unchanged but the product xW is replaced with a nonlinear function g that captures memristors' non‐ohmic I‐V behavior. The exact form of g will depend on

- the mapping scheme, examples of which are explored in Section 2.2
- the non‐ohmic behavior model, which might typically be determined by the devices used; in Section 2.4, we present one possibility motivated by experimental device data
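To make the distinction between the two output operations concrete, they can be sketched in plain Python. The supralinear device model used for g here is purely illustrative (it is not the experimentally fitted model presented later in this work):

```python
import math

def forward_linear(x, W, f):
    # Equation (1): y = f(x . W), crossbar modeled as an ideal dot product
    return [f(sum(xi * wij for xi, wij in zip(x, col))) for col in zip(*W)]

def forward_nonlinear(x, W, f, g):
    # Equation (2): y = f(g(x, W)), where g captures non-ohmic device behavior
    return [f(g(x, col)) for col in zip(*W)]

def relu(s):
    return max(0.0, s)

def g_nonohmic(x, col):
    # Hypothetical non-ohmic model: device current grows supralinearly with
    # voltage (the exponent 1.5 is illustrative, not an experimental fit)
    return sum(math.copysign(abs(xi) ** 1.5, xi) * wij for xi, wij in zip(x, col))

x = [0.2, 0.4]
W = [[1.0, -0.5], [0.5, 2.0]]   # rows correspond to inputs, columns to outputs
y_linear = forward_linear(x, W, relu)
y_nonlinear = forward_nonlinear(x, W, relu, g_nonohmic)
```

Training against `forward_nonlinear` rather than `forward_linear` is what lets stochastic gradient descent adapt the weights to the device behavior encoded in g.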
Modified Weight Implementation
As mentioned previously, synaptic weights of MNNs are typically implemented using conductances of memristive devices. While weights can usually take any real value, conductances are nonnegative—this presents design challenges. In this work, we propose training double the number of weights,[
] but constrain them to be nonnegative and associate them with individual devices. This approach creates a more natural mapping between neural network parameters and conductances, and enables weight regularization to act as a method for decreasing power consumption.
Conventional Approaches
In typical MNN implementations, inputs and outputs are mapped onto voltages[] V and from currents I, respectively, in a proportional way using scaling factors, as shown in Equation (3).

V = k_V x        y = I/(k_V k_G)    (3)

where k_G is the conductance scaling factor and k_V is determined before training.

To encode both positive and negative weights, pairs of conductances are employed. Conductances G+ and G− are introduced into “positive” and “negative” bit lines of crossbar arrays, where the output currents of the latter are subtracted from the output currents of the former; this is referred to as the differential pair architecture. Each weight is typically made proportional to the difference of G+ and G− (with k_G acting as the constant of proportionality), which makes it possible to encode any real number within a finite interval. However, infinitely many conductance combinations will produce the same difference,[] thus the network designer may have to make an arbitrary choice of how to perform this mapping. For example, to encode weight w, the two conductances may be picked symmetrically around the average value,[] as shown in Equation (4).

G± = (G_off + G_on)/2 ± k_G w/2    (4)

Although there might be advantages to using a mapping scheme like the one in Equation (4),[] the choice of mapping could be explicitly tied to certain objectives. For example, Ref. [25] points out that the differential pair architecture with aware mapping can be advantageous for mitigating the effects of stuck devices. Alternatively, a mapping scheme that optimizes some metric (like power consumption) may be employed. Indeed, such a scheme is used throughout this work for mapping the weights of conventionally trained ANNs onto conductances—we minimize power consumption by ensuring that at least one device in each pair is set to G_off, as demonstrated in Equation (5).

G+ = G_off + k_G w and G− = G_off for w ≥ 0;    G+ = G_off and G− = G_off + k_G |w| for w < 0    (5)

However, choosing the optimal scheme manually is a low‐level approach that requires understanding the physical characteristics of MNNs. Thus, even if training is done ex situ, the network designer has to make choices not only about the conventional training hyperparameters (like the learning rate), but also about how the system will be implemented physically, because that will affect MNN performance. We see this as an additional obstacle to making memristive implementations of ANNs feasible in the real world.
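The two mapping schemes discussed above, symmetric (Equation (4)) and power-minimizing (Equation (5)), can be sketched as follows. The numerical values of k_G, G_off, and G_on are hypothetical and serve only to illustrate that both schemes encode the same effective weight:

```python
def map_symmetric(w, k_g, g_off, g_on):
    # Equation (4): pick the conductance pair symmetrically around the average
    g_avg = (g_off + g_on) / 2
    return g_avg + k_g * w / 2, g_avg - k_g * w / 2

def map_power_efficient(w, k_g, g_off):
    # Equation (5): keep at least one device of the pair at G_off,
    # minimizing static power consumption
    if w >= 0:
        return g_off + k_g * w, g_off
    return g_off, g_off + k_g * abs(w)

# Both schemes produce the same difference G+ - G- = k_G * w
k_g, g_off, g_on = 2e-5, 1e-6, 2e-5   # hypothetical values, in siemens
for w in (-0.5, 0.5):
    gp1, gm1 = map_symmetric(w, k_g, g_off, g_on)
    gp2, gm2 = map_power_efficient(w, k_g, g_off)
    assert abs((gp1 - gm1) - (gp2 - gm2)) < 1e-18
```

The choice between such schemes is exactly the kind of low-level design decision that the double-weight approach of the next section aims to remove.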
Our Approach—Double Weights
In this work, instead of tweaking the mapping function, we decided to change the characteristics of the weights that are being trained. Specifically, we train two sets of nonnegative weights, W+ and W−, which we refer to as double weights; a similar approach has been explored in Ref. [40]. We map double weights onto conductances in the aforementioned “positive” and “negative” bit lines, respectively. Although all the weight parameters are nonnegative, the negative contribution of the ith input to the jth output can still be encoded because of the differential pair architecture in the physical system. Only the nonlinearity‐aware ex situ training function in Equation (2) has to be adjusted, leading to the form in Equation (6).

y = f(g(x, W+) − g(x, W−))    (6)

The adoption of double weights makes it possible to relate every weight in W± to the corresponding conductance in the same way, i.e., weights w± ∈ [0, w_max] are linearly mapped onto G± ∈ [G_off, G_on], as shown in Equation (7).

G± = G_off + (G_on − G_off) w±/w_max    (7)

A clear advantage of our approach is that double weights allow for more direct optimization of MNN behavior. Exposing raw device characteristics—i.e., conductances—to the training algorithm enables it to select the combination that has both the optimal performance (as defined by some metric like loss) and high robustness. For example, if a certain nonideality manifests itself to a greater degree at low conductances, the training algorithm would be able to push double weight pairs (and by extension, conductances) toward higher values. Because of the differential pair architecture, setting G+ and G− to 1.0 and 2.0 mS, respectively, will—at least in the case of linearity‐preserving nonidealities—have the same effect as setting them to 3.0 and 4.0 mS, respectively. Therefore, with double weights, the training algorithm should be able to choose conductance combinations that minimize the negative influence of nonidealities.

Additionally, double weights allow regularization to act as a high‐level tool for controlling the importance of power consumption. We propose training with the ℓ1 sparsification regularizer,[] which can not only improve training,[] but also promote lower conductances because they are linearly related to the weight parameters, as demonstrated in Equation (7). During backward propagation, the regularizer influences the training loss, inducing conductance pairs to descend toward G_off. Instead of manually tweaking the mapping function, the network designer can decide to what extent energy efficiency should be prioritized by simply adjusting, say, the regularization factor in ℓ1 regularization. This can be incorporated into the typical hyperparameter tuning process that is performed before deploying ANNs in practice.
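A minimal sketch of the double-weight idea, omitting the activation function and assuming an ideal linear g so that the differential-pair subtraction and the effect of the ℓ1 penalty are easy to see (the regularization factor is hypothetical):

```python
def crossbar_output(x, W_plus, W_minus):
    # Differential pair architecture: output currents of the "negative"
    # bit lines are subtracted from those of the "positive" bit lines
    dot = lambda W: [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]
    return [p - n for p, n in zip(dot(W_plus), dot(W_minus))]

def l1_penalty(W_plus, W_minus, factor):
    # Since conductances are linearly related to the nonnegative double
    # weights (Equation (7)), this regularizer biases devices toward G_off
    return factor * sum(w for W in (W_plus, W_minus) for row in W for w in row)

x = [1.0]
# Two weight pairs encoding the same effective weight of -1.0
low_pair = ([[1.0]], [[2.0]])
high_pair = ([[3.0]], [[4.0]])
assert crossbar_output(x, *low_pair) == crossbar_output(x, *high_pair)
# ...but the l1 penalty favors the less conductive (lower-power) pair
assert l1_penalty(*low_pair, factor=0.01) < l1_penalty(*high_pair, factor=0.01)
```

This mirrors the 1.0/2.0 mS versus 3.0/4.0 mS example from the text: both pairs are functionally equivalent, and the regularizer is what breaks the tie in favor of lower power.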
Modified Validation
We also propose a modified model validation scheme better suited to MNNs. To determine when to stop the training (or which version of the network to use when the training takes a predetermined number of epochs), a validation dataset is commonly employed—a metric (like error or loss) is computed for this dataset at certain epochs and is used for picking the optimal version of the network.[] However, many memristor nonidealities are at least partly stochastic in nature, thus, say, validation accuracy at any given epoch may not be entirely representative of the model's quality purely due to random chance. Because of this, we suggest computing the validation metric multiple times and using an aggregated value for higher reliability. Choices can be made about

- the aggregate value that is used
- how frequently validation is performed
- how many times the validation metric is computed at each checkpoint

In the simulations of this work, we computed the validation error 20 times every 20 epochs and saved the model whenever the median validation error decreased.
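The validation procedure described above can be sketched as follows. The stochastic evaluation is simulated here with Gaussian noise around a fixed underlying error, a stand-in for re-sampling nonidealities, not the actual simulation pipeline:

```python
import random
import statistics

def validate_once(true_error, noise_sd, rng):
    # Stand-in for a single stochastic evaluation of the MNN: nonidealities
    # are sampled anew every time, so the measured error fluctuates
    return max(0.0, true_error + rng.gauss(0.0, noise_sd))

def median_validation_error(true_error, noise_sd, n_repeats, rng):
    # Aggregate several noisy evaluations into one robust estimate
    return statistics.median(
        validate_once(true_error, noise_sd, rng) for _ in range(n_repeats)
    )

rng = random.Random(0)
best_error = float("inf")
for epoch in range(20, 201, 20):                        # validate every 20 epochs
    err = median_validation_error(0.10, 0.02, 20, rng)  # 20 repeats, take median
    if err < best_error:
        best_error = err                                # checkpoint this version
```

A single noisy evaluation can easily deviate by a couple of standard deviations from the true error; the median of 20 repeats is far less likely to trigger a checkpoint by chance.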
Nonidealities
In this work, we explore a wide range of memristor nonidealities and utilize experimental data. We use two different memristor technologies—SiOx‐ and Ta/HfO2‐based resistive random‐access memory (RRAM). More details on the two technologies can be found in Section 5 and our previous publications.[
,
]
I‐V Nonlinearity
One of the most common ways to characterize deviations from ohmic behavior in memristive devices has been by considering two points on an I‐V curve.[
,
] For example, one may define nonlinearity at voltage V_ref as the ratio of the conductance at that voltage to the conductance at half that voltage,[] as shown in Equation (8).

nonlinearity(V_ref) = G(V_ref)/G(V_ref/2)    (8)

A nonlinearity of 1 can then be characterized as indicative of ohmic behavior; similarly, any deviation from that value indicates I‐V nonlinearity. This metric can be useful in describing non‐ohmic behavior at different voltages, but it is more challenging to utilize it for modeling purposes.

In this work, we utilized silicon oxide devices to investigate the effects of current–voltage nonlinearity. SiOx devices can undergo resistance switching—a typical I‐V switching curve is shown in Figure S1 (Supporting Information), while a more detailed analysis of resistance switching performance can be found in our previous study.[] For the purposes of this work, to achieve a wide range of resistance states and to analyze I‐V nonlinearity, incremental positive sweeps were used to gradually reset the device from the low‐resistance state to the high‐resistance state. I‐V curves of two subsets of all achieved states are shown in Figure 2. Low‐resistance discrete states (in Figure 2a) exhibit more linear behavior and experience little variability in nonlinearity. On the other hand, high‐resistance states (in Figure 2b) are more nonlinear and the nonlinearity is less predictable.
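The two-point nonlinearity metric of Equation (8) can be computed directly from an I–V characteristic; the device models below are illustrative only:

```python
def nonlinearity(current, v_ref):
    # Equation (8): conductance at V_ref divided by conductance at V_ref / 2
    conductance = lambda v: current(v) / v
    return conductance(v_ref) / conductance(v_ref / 2)

ohmic = lambda v: 1e-3 * v            # I = GV: purely ohmic device
supralinear = lambda v: 1e-3 * v * v  # hypothetical non-ohmic device

assert abs(nonlinearity(ohmic, 0.5) - 1.0) < 1e-12        # indicative of ohmic behavior
assert abs(nonlinearity(supralinear, 0.5) - 2.0) < 1e-12  # deviates from 1
```

As the text notes, this metric characterizes a curve at a single reference voltage, which is why a full conduction model is needed for simulation purposes.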
Figure 2
SiOx data utilized in this work. I–V sweeps of a SiOx device are shown for a) a subset of the low‐resistance region (with resistance at 0.1 V ranging from 289.8 Ω to 1169 Ω), and b) a subset of the high‐resistance region (with resistance at 0.1 V ranging from 445.2 kΩ to 1.905 MΩ). For all curves, only the range of voltages from 0.0 V to 0.5 V was considered. Nonlinearity was computed by dividing the conductance at voltage V by the conductance at voltage 0.5V. Poole–Frenkel fits in low‐ (left) and high‐resistance (right) regions for c) c and d) dε. In both panels, to ensure dimensionless inputs to logarithms, R, c, and dε are the amounts of the corresponding quantities in SI units, e.g., R is not the resistance, but rather the number of ohms of resistance measured at 0.1 V. Marker colors represent the mean nonlinearity of the fits, rather than the experimental I‐V curves.
Multiple conduction mechanisms have been proposed for modeling I‐V behavior of memristors. The models may incorporate principles like Fowler‐Nordheim tunneling, thermionic emission, and Poole–Frenkel emission.[
] Data in Figure 2a,b were best fit using the Poole–Frenkel model in Equation (9).

I = cV exp[(q/(k_B T)) √(qV/(π d ε))]    (9)

where I is the current, V is the voltage, c is a constant (which has units of conductance), q is the elementary charge, k_B is the Boltzmann constant, T is the temperature, d is assumed to be the oxide thickness or the effective thickness for a partially oxidized filament, and ε is the permittivity.

T is room temperature, thus the following parameters were fit: c and the product of d and ε. Different sets of fits were produced above and below the conductance quantum G_0 = 2e²/h because states below this value experienced different trends and/or amounts of variability. This is not surprising, as memristive devices have been reported to exhibit different behavior in different resistance states, and that is often tied to the conductance quantum.[
,
] The fits for the two sets of parameters are shown in Figure 2c,d. Slopes close to −1 and intercepts close to 0 in Figure 2c demonstrate that c does indeed act similarly to conductance, i.e., the reciprocal of resistance. In the high‐resistance states on the right, however, the variability is noticeably higher. In Figure 2d, the product dε is indicative of the extent to which a given curve is nonlinear—the smaller this product, the more nonlinear the I‐V curve is. For the low‐resistance states, dε decreases with increasing resistance; however, for the first few states, this product is large—Equation (9) can approximate linear behavior (ohmic conduction); however, the values of dε are physically plausible only for the high‐resistance states. The simulations were performed by considering two different resistance ranges from our experimental data and by subsequently using the fits from Figure 2c,d. The low‐resistance group was constructed by interpolating between the lowest achieved resistance state of 289.8 Ω and five times that resistance. Similarly, to ensure the same dynamic range, the high‐resistance group was constructed by interpolating between the highest achieved resistance state of 1.905 MΩ and one fifth that resistance.[] As hinted earlier, it would be naïve to simply use the aforementioned linear fits without taking significant deviations into account. In fact, retaining this uncertainty in the model is one of our main goals because we wish to demonstrate that nonideality‐aware training can adapt not only to deviations from linear behavior but also to stochastic behavior. This makes the nonideality‐aware training approach generalizable because it does not require exact knowledge of device behavior—it improves the performance even when different hardware is used, as will be demonstrated in Section 3.6.

During simulations involving nonlinear I‐V behavior, the output current of any given device was determined in the following way:
1. c and dε were interpolated from the fits in Figure 2c,d using the resistance parameter R
2. c and dε were disturbed using a multivariate normal distribution with the covariance matrix determined from the residuals of the fits
3. the current I was computed using Equation (9)

We note that such treatment, where only the linear relationship between the two sets of residuals is considered, likely overestimates, rather than underestimates, the amount of uncertainty for a given device. Additional information on heteroscedasticity, the correlation between the residuals of the two sets of parameters, and the justification for using the normal distribution can be found in Section 5.

Linearity‐nonpreserving nonidealities like I‐V nonlinearity cannot be simulated using conventional noise injection methods, which simply disturb the conductance values. Instead, a forward propagation function must be defined reflecting the nonlinear relationship between inputs and outputs. One can express the procedure described earlier in the form of the aforementioned function g representing nonlinear behavior; this can be done by combining Equations (3), (7), and (9), leading to the form in Equation (10), which we implement using TensorFlow. There, the slopes and intercepts of the corresponding fits in Figure 2c,d are used together with the covariance matrix of the residuals, and all inputs to logarithms or exponents are the amounts of quantities in SI units.
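The three-step procedure for a single device can be sketched as follows. The fit coefficients and residual covariance below are hypothetical placeholders (the real values come from the fits in Figure 2c,d), and the correlated disturbance is drawn via a Cholesky factorization of the 2 × 2 covariance matrix:

```python
import math
import random

Q = 1.602176634e-19   # elementary charge (C)
KB = 1.380649e-23     # Boltzmann constant (J/K)
T = 293.0             # room temperature (K)

def pf_current(v, c, d_eps):
    # Equation (9), Poole-Frenkel emission with E = V / d, so that only
    # the product d * eps enters the expression
    return c * v * math.exp((Q / (KB * T)) * math.sqrt(Q * v / (math.pi * d_eps)))

def sample_device_params(resistance, fits, cov, rng):
    # Steps 1 and 2: interpolate ln(c) and ln(d*eps) from the log-log fits
    # using resistance R, then disturb them with correlated normal residuals
    (slope_c, icept_c), (slope_de, icept_de) = fits
    ln_c = slope_c * math.log(resistance) + icept_c
    ln_de = slope_de * math.log(resistance) + icept_de
    # Correlated draw via Cholesky factorization of the covariance matrix
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return math.exp(ln_c + l11 * z1), math.exp(ln_de + l21 * z1 + l22 * z2)

rng = random.Random(0)
# Hypothetical (slope, intercept) pairs for ln(c) and ln(d*eps), and a
# hypothetical residual covariance matrix -- for illustration only
fits = ((-1.0, 0.0), (0.3, -45.6))
cov = [[0.01, 0.002], [0.002, 0.04]]
c, d_eps = sample_device_params(1e6, fits, cov, rng)   # R = 1 MOhm
i_out = pf_current(0.25, c, d_eps)                     # Step 3
```

Re-running `sample_device_params` for every inference pass is what injects the stochasticity that nonideality-aware training is meant to absorb.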
Stuck Devices
Additionally, we explore linearity‐preserving nonidealities, which can be simulated using noise injection into the conductances. One such nonideality is devices stuck in a particular state, which is a very common issue in memristors. The effect of stuck devices can be explored in isolation, but it also easily lends itself to being simulated alongside other nonidealities, thus allowing us to investigate the effectiveness of nonideality‐aware training in more complex scenarios. In the modeling of this nonideality, we consider both real experimental data (where we draw the state in which devices may get stuck from a probability distribution) and a simplified model (where we assume devices can get stuck in only one state).

Data from a 128 × 64 Ta/HfO2 memristor crossbar array were analyzed for the purposes of modeling stuck devices' behavior. Figure 3 shows 11 potentiation and depression cycles (each consisting of 100 voltage pulses) for a fraction of the devices. By considering the minimum and maximum conductance values achieved by each of the 8192 devices over the 11 cycles, G_off and G_on were chosen as the medians of these minimum and maximum values, respectively. Devices whose maximum range was less than half the median range (where the median range had been defined as G_on − G_off) were classified as “stuck.”
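The stuck-device criterion described above can be sketched as follows; the conductance ranges (in siemens) are synthetic stand-ins for the experimental data:

```python
import statistics

def classify_stuck(device_ranges):
    # device_ranges: one (g_min, g_max) pair per device, taken over all cycles.
    # G_off / G_on are the medians of the per-device minima / maxima; a device
    # is "stuck" if its own range is less than half the median range.
    g_off = statistics.median(g_min for g_min, _ in device_ranges)
    g_on = statistics.median(g_max for _, g_max in device_ranges)
    median_range = g_on - g_off
    return [(g_max - g_min) < 0.5 * median_range for g_min, g_max in device_ranges]

ranges = [(1e-4, 9e-4), (1e-4, 8e-4), (4e-4, 4.5e-4)]  # third device barely moves
flags = classify_stuck(ranges)
```

With these synthetic numbers the median range is 7 × 10⁻⁴ S, so only the third device falls below the half-range threshold and is flagged.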
Figure 3
Experimental HfO2 data utilized in this work. a) Eleven potentiation and depression cycles of 82 Ta/HfO2 devices—data are shown for 1% of the devices in a 128 × 64 crossbar array. Any device whose maximum conductance range was less than half the median range was classified as “stuck.” The conductance of Ta/HfO2 crossbar devices was modulated by controlling the voltage pulses applied to the gate of the selector transistor; more details are provided in Section 5 and Ref. [21]. b) Mean conductances of all stuck devices together with their estimated probability density function constructed using Gaussian kernel density estimation.
Stuck devices were simulated using a probabilistic model. Using the aforementioned definition of stuck devices, 10.1% of the devices were classified as such. To simplify modeling, these devices were simulated to be fully stuck, meaning their conductance could not be changed even by a small amount.[
] The average values of all stuck devices are denoted by markers on the y axis of Figure 3b. The probability density function of these average values was constructed using kernel density estimation with truncated normal distributions. Scott's rule[
] was used for bandwidth estimation, while mirror reflections of the distributions were employed to correct for bias at the 0 S clipping boundary.[
] When simulating the effect of the nonideality, each device could be set to any conductance state between Goff and Gon; however, every device also had a 10.1% probability of getting stuck. When a device was classified as stuck, its conductance would be drawn from the probability distribution constructed using kernel density estimation in Figure 3b.

Additionally, a simple model of devices getting stuck at Goff or Gon was considered. This made it possible to combine nonidealities, specifically with SiOx I-V nonlinearities. For both states, devices were simulated to have a 5% probability of getting stuck. As with the experimental stuck-device model, this was simulated as noise injection into the conductances. Both models of stuck-device behavior were used to test the robustness of nonideality-aware training.
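A minimal sketch of the density-estimation step is given below. It uses SciPy's Gaussian KDE with mirror reflection about 0 S rather than the truncated-normal kernels described above, and the stuck-device means are synthetic, so this is an approximation of the procedure, not a reproduction of it.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical mean conductances (in S) of devices classified as stuck
rng = np.random.default_rng(1)
stuck_means = rng.uniform(1e-5, 3e-4, size=50)

# Mirror the samples about the 0 S clipping boundary to correct for edge bias,
# then build a Gaussian KDE (SciPy uses Scott's rule for bandwidth by default)
mirrored = np.concatenate([stuck_means, -stuck_means])
kde = gaussian_kde(mirrored)

def sample_stuck_conductance(n):
    """Draw stuck-device conductances; fold negative draws back above 0 S."""
    draws = kde.resample(n).ravel()
    return np.abs(draws)

samples = sample_stuck_conductance(1000)
```

During simulated inference, each device that is flagged as stuck (with 10.1% probability) would have its conductance drawn by `sample_stuck_conductance`.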
Device‐to‐Device Variability
We also consider D2D variability arising from inaccuracies during device programming. During mapping onto conductances, one may end up with different values than intended; in some memristors, these (resistance) deviations are modeled using a lognormal distribution.[
] As with stuck devices, this can be incorporated into training by disturbing the values in each iteration, in this case by drawing from a lognormal distribution. We use this nonideality mostly to explore 1) double weights (see Section 3.4) and 2) the effects of highly stochastic nonidealities (see Section 3.6). For the lognormal modeling, we linearly interpolate the standard deviation of the natural logarithm of resistances from the following values, meant to represent different device behaviors:

- 0.25 for Roff and Ron (more uniform D2D variability)
- 0.5 for Roff and 0.05 for Ron (less uniform D2D variability)
- 0.5 for Roff and Ron (high-magnitude D2D variability)
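The lognormal disturbance with an interpolated log-standard deviation can be sketched as follows; the resistance bounds and the linear interpolation coordinate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

R_ON, R_OFF = 1e4, 1e6          # hypothetical resistance bounds, in ohms
SIGMA_ON, SIGMA_OFF = 0.05, 0.5  # "less uniform" D2D variability values

def disturb_resistances(r, rng):
    """Disturb programmed resistances with lognormal deviations whose
    log-standard deviation is linearly interpolated between Ron and Roff."""
    t = (r - R_ON) / (R_OFF - R_ON)          # 0 at Ron, 1 at Roff
    sigma = SIGMA_ON + t * (SIGMA_OFF - SIGMA_ON)
    # Lognormal deviation with median equal to the intended resistance
    return r * np.exp(rng.normal(0.0, sigma))

r_intended = rng.uniform(R_ON, R_OFF, size=1000)
r_actual = disturb_resistances(r_intended, rng)
```

During nonideality-aware training, such a disturbance would be applied afresh in every iteration before computing the forward pass.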
Results and Discussion
Nature of Training
Figure 4 contains training curves for MNNs trained on the MNIST dataset and exposed to I-V nonlinearities. Although not used to affect any aspect of the training process, error curves for the test subset[
] are included as well. Testing on the nonideal configuration during training can help to understand the differences between standard and nonideality-aware training.
Figure 4
Training results for standard and nonideality‐aware schemes when exposed to I–V nonlinearities. a–c) Low I‐V nonlinearity (data from Figure 2a), d–f) high I–V nonlinearity (data from Figure 2b); a, d) standard training (same training and validation curves), b, e) nonideality‐aware training, c, f) nonideality‐aware training with regularization. The panels show curves for one of five sets of trained networks. Networks were trained on MNIST dataset. Where error was computed multiple times, the curves show the median value, as well as the region bounded by the minimum and maximum values.
Figure 4a–c explores the effect of low I-V nonlinearity. In Figure 4a, the validation curve (which is computed assuming a digital implementation of the ANN) is closely coupled with the test curve (which assumes that nonideal effects are present); this suggests a negligible effect of low I-V nonlinearity. Consequently, nonideality-aware training produces similar results on the test set both without (Figure 4b) and with (Figure 4c) regularization.

Figure 4d–f explores the effect of high I-V nonlinearity. In Figure 4d, where the results of standard training are presented, we notice that the validation and test curves are detached from one another. Not only that, but the global minimum of the (nonideal) test curve occurs very early in the training, while the (ideal) validation error keeps decreasing. This indicates that without taking nonidealities into account during training, a highly suboptimal version of the ANN may be chosen for the inference stage with nonidealities. Nonideality-aware training without (Figure 4e) and with (Figure 4f) regularization is much more effective: the validation and test curves are closely coupled together and the test error decreases to lower values.
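The core mechanism of nonideality-aware training can be sketched as injecting a fresh realization of the nonideality into the conductances on every batch. The multiplicative-noise model, layer sizes, and names below are illustrative assumptions, not the exact procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(3)
G_OFF, G_ON = 1e-6, 5e-6  # hypothetical conductance range, in S

def nonideal_forward(x, g_pos, g_neg, rng, log_std=0.25):
    """One nonideality-aware forward pass: disturb the conductances with a
    fresh multiplicative (lognormal) deviation, then compute the crossbar
    output of a differential pair, where the weight is proportional to G+ - G-."""
    noisy_pos = g_pos * np.exp(rng.normal(0.0, log_std, g_pos.shape))
    noisy_neg = g_neg * np.exp(rng.normal(0.0, log_std, g_neg.shape))
    return x @ (noisy_pos - noisy_neg)

x = rng.uniform(0.0, 1.0, size=(64, 784))         # one MNIST-sized batch
g_pos = rng.uniform(G_OFF, G_ON, size=(784, 25))  # positive-line conductances
g_neg = rng.uniform(G_OFF, G_ON, size=(784, 25))  # negative-line conductances
out = nonideal_forward(x, g_pos, g_neg, rng)
```

Because the disturbance is redrawn each batch, the network is optimized for the distribution of nonideal behaviors rather than for a single ideal weight configuration, which is what couples the validation and test curves in Figure 4e,f.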
Performance Improvement
Inference results for I-V nonlinearity simulations are summarized in Figure 5. Because the simulated memristors are nonlinear, power consumption was computed using P = IV, instead of P = I²R, for each of the individual devices. Apart from crossbar arrays, MNN implementations require additional circuitry,[
] which suggests power consumption due to passive elements dominates in the µS range and above (i.e., the conductance range of the devices investigated in this work); only at lower conductances does the relative impact of other energy components become significant.
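The per-device P = IV accounting can be sketched as below. The crossbar dimensions and the ohmic current model are placeholder assumptions; for the nonlinear devices studied here, the currents would come from the fitted I-V model instead.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical crossbar: word-line voltages and device conductances
v_in = rng.uniform(0.0, 0.5, size=128)        # input voltages, in V
g = rng.uniform(1e-6, 5e-6, size=(128, 64))   # conductances, in S

# Placeholder ohmic currents I = G*V; a nonlinear device model would go here
i = g * v_in[:, None]

# Total crossbar power: P = I*V summed over all devices
# (bit lines assumed held at virtual ground)
power = np.sum(i * v_in[:, None])
```

Summing I·V per device rather than using I²R keeps the accounting valid even when the current is not proportional to the voltage.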
Figure 5
Inference results of using nonideality‐aware training to deal with I–V nonlinearities. The three box plots on the right refer to memristive neural networks that used data from Figure 2a, and the three on the left—to memristive neural networks that used data from Figure 2b. Networks were trained on MNIST dataset. The variability in power consumption for the data points of each group is small thus the average power consumption is used for the horizontal position of each box plot.
Box plots representing low-resistance (and, by extension, low-nonlinearity) devices are presented on the right side of Figure 5. With such devices, nonideality-aware training without regularization achieves a median error of 4.3%. This is close to the 4.6% error of (digital) ANNs of the same size trained using the standard procedure; the small difference is indicative of the relatively small effect of low I-V nonlinearity (and its uncertainty). However, regularization not only decreases the power consumption by more than half, but also helps to achieve a median error rate of 4.1%.

As shown on the left side of Figure 5, MNNs implemented using high-resistance devices achieve almost three orders of magnitude lower power consumption. However, MNNs trained using the standard procedure have a median error of 41.6%, which would be unacceptable in most scenarios. Fortunately, adjusted training results in a much lower error rate, while maintaining low power consumption (compared to low-nonlinearity devices). In the nonregularized case, the median error is 9.1%, and in regularized MNNs it is 7.1%. Thus, nonideality-aware training makes it feasible to use orders of magnitude more power-efficient high-resistance devices while maintaining error rates similar to those achieved with low-resistance devices.

One may also compare the estimated absolute energy efficiency in both cases. By assuming the values used in[
] (read pulses of 50 ns, and two operations per synaptic weight: multiplication and accumulation), one can calculate the energy efficiency in OP s−1 W−1 using Equation (11), where n is the number of synaptic weights and Pavg is the average power consumption.

Using these assumptions, standard training using low-resistance devices achieves an energy efficiency of 0.715 TOP s−1 W−1, while nonideality-aware training using high-resistance devices achieves an energy efficiency of 234 TOP s−1 W−1 in the nonregularized case and of 381 TOP s−1 W−1 in the regularized case. As explained earlier, these estimates incorporate power consumption only on crossbar arrays.
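Equation (11) itself is not reproduced in this excerpt; under the stated assumptions (two operations per synaptic weight, one read per 50 ns pulse), a plausible sketch of the calculation is:

```python
READ_PULSE_S = 50e-9   # read pulse duration, as assumed in the text
OPS_PER_WEIGHT = 2     # one multiplication and one accumulation per weight

def energy_efficiency(num_weights, p_avg_watts):
    """Operations per second per watt for a crossbar of num_weights devices.
    This is an assumed form of Equation (11), reconstructed from the stated
    quantities, not the equation as published."""
    ops_per_second = OPS_PER_WEIGHT * num_weights / READ_PULSE_S
    return ops_per_second / p_avg_watts

# Example with a hypothetical average power of 1 W for ~20k synaptic weights
eff = energy_efficiency(20_000, 1.0)  # in OP s^-1 W^-1
```

Under this form, lowering the average crossbar power by roughly three orders of magnitude translates directly into the roughly three-orders-of-magnitude efficiency gain reported above.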
More Complex Architectures and Datasets
To understand how well nonideality-aware training performs on more complex tasks, we employed the CIFAR-10 dataset. For this, we trained CNNs assuming that their convolutional layers would be implemented digitally, and their fully connected layers using memristive crossbar arrays suffering from high I-V nonlinearity. Standard training is explored in Figure 6a, and nonideality-aware training in Figure 6b. As with MNNs trained on MNIST, there is a much greater coupling between the validation and test curves when nonidealities are taken into account. As shown in Figure 6c, the nonideality-aware approach reduces the median inference error from 43.0% to 18.9%.
Figure 6
Results for standard and nonideality‐aware schemes when employed by convolutional neural networks. a) Standard training, b) nonideality‐aware training, and c) inference error comparing the approaches in (a) and (b). Networks were trained on CIFAR‐10 dataset and their fully connected layers were exposed to high I‐V nonlinearity (data from Figure 2b) during inference. Panels (a) and (b) show curves for one of five sets of trained networks.
Importance of Weight Implementation
Double weights can be advantageous because they expose conductances to the training process in a more direct way. To investigate the utility of this modified weight implementation, we utilized lognormal D2D variability of two kinds: 1) more uniform variability, where the relative magnitude of deviations is the same throughout [Goff, Gon], and 2) less uniform variability, where the relative magnitude of deviations is much greater near Goff compared to Gon, similar to what is experienced in real devices when trying to program them.[
] This allowed us to 1) evaluate the performance of the double weight implementation when the severity of the nonideality does not depend on the conductance value and 2) test whether the double weight implementation would outperform standard weights when exposed to nonidealities whose severity depended on the conductance value. Both standard weights with different mapping schemes and double weights without and with regularization are investigated in Figure 7.
Figure 7
Comparison of weight implementations. Conductance distributions in the first synaptic layer of a memristive neural network that deals with more uniform device‐to‐device variability and utilizes a) conventional weights mapped symmetrically around average conductance, b) conventional weights mapped onto devices by preferring lowest total conductance, c) double weights, d) double weights with regularization; and e–h) corresponding conductance distributions for a memristive neural network dealing with less uniform device‐to‐device variability. Scatter plots contain conductance data from 10% of the devices for one of five sets of networks. Inference error for different weight implementations of memristive neural networks that deal with i) more and j) less uniform device‐to‐device variability, where the box plot colors correspond to different weight implementations in a–d and e–h, respectively
First, we consider weight implementations in the context of D2D variability with relatively high uniformity across the conductance range. Figure 7a,b shows conductances resulting from mapping conventional weights using the rules in Equations (4) and (5), respectively. In both cases, the conductances are all mapped along either one or two line segments. In contrast, Figure 7c,d shows conductances obtained from the double weight implementation and training without and with regularization, respectively. In the nonregularized case (Figure 7c), the conductances are distributed mostly around the diagonal, though they are spread out, unlike in Figure 7a. Regularization results in more data points in the bottom left corner of the diagram, representing low conductance values, as shown in Figure 7d.

Figure 7e–h demonstrates the utility of double weights. Here, the MNNs were trained to deal with less uniform D2D variability, where the disturbances were much greater at low conductance values compared to high conductance values.
As demonstrated in Figure 7g, the training naturally results in most of the pairs concentrated in the top right corner, representing high conductance values. Like before, regularization in Figure 7h results in conductances with lower values.

Figure 7i shows the inference error for various weight implementations of MNNs encountering more uniform D2D variability. Conventional weights with the mapping rule in Equation (5) result in the lowest median error of 6.3%; in comparison, double weights achieve a median error of 7.2% without regularization and of 6.7% with regularization. Conventional weights with the mapping rule in Equation (4) achieve the highest median error of 7.9%, indicating the unpredictability of the performance of mapping methods. Even so, it is important to point out that double weights do not achieve optimal performance in this case. We hypothesize that in scenarios where the severity of the nonideality does not depend strongly on the conductance values, double weights might struggle to find an optimal configuration: infinitely many pairs may result in the same behavior, making it a more computationally difficult problem.

Figure 7j shows the inference error for equivalent weight implementations of MNNs dealing with less uniform D2D variability. In this case, the advantage of double weights is much more apparent: the median error in the nonregularized case is 5.4%, compared to 5.9% and 6.6% resulting from conventional weights with the mapping rules in Equations (4) and (5), respectively. When double weights are trained by employing regularization, they take on lower values, thus decreasing power consumption but also increasing the error; in the case of this specific nonideality, there is a trade-off between energy efficiency and accuracy. However, double weights together with regularization provide a straightforward way of specifying to what extent low power consumption should be prioritized at the expense of accuracy.
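Equations (4) and (5) are not reproduced in this excerpt; one common form of such differential mapping rules, consistent with the two behaviors described above (pairs symmetric around the average conductance vs. pairs preferring the lowest total conductance), can be sketched as follows. The scaling factor and function names are illustrative assumptions.

```python
G_OFF, G_ON = 1e-6, 5e-6  # hypothetical conductance bounds, in S

def map_symmetric(w, max_abs_w):
    """Map weight w onto a (G+, G-) pair symmetric around the average conductance."""
    g_avg = (G_OFF + G_ON) / 2
    half_swing = w / max_abs_w * (G_ON - G_OFF) / 2
    return g_avg + half_swing, g_avg - half_swing

def map_lowest_total(w, max_abs_w):
    """Map weight w by keeping one device at G_off (lowest total conductance)."""
    k = (G_ON - G_OFF) / max_abs_w
    if w >= 0:
        return G_OFF + k * w, G_OFF
    return G_OFF, G_OFF - k * w

g_pos, g_neg = map_lowest_total(0.5, 1.0)
```

In both rules, the effective weight is proportional to G+ − G−, but only the first places every pair on a single line through the average conductance, matching the line-segment structure visible in Figure 7a,b. Double weights, by contrast, let training place each (G+, G−) pair freely.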
Memristive Validation
As explained in Section 2.3, many memristive nonidealities are nondeterministic; therefore, it might be advantageous to compute an aggregate metric for use in validation. During training, with each batch, we simulate nonidealities separately, e.g., parameters for I-V nonlinearity are drawn from a probability distribution or the exact devices that get stuck are picked randomly each time. As a result, we believe that memristive validation can provide more reliable estimates of performance during training. Although we hypothesize that in aggregate this method will achieve only marginally better performance, it should help to avoid choosing a highly suboptimal version of the weights, which might yield higher error in a small number of cases.

Of course, memristive validation parameters may have to be optimized individually for each training configuration. For example, in the memristive CNN training in Figure 6b, the variability of the validation error is usually lower at any given checkpoint than between checkpoints. In that case, one may increase the frequency of checkpoints by either decreasing the number of repeats at each checkpoint (and thus increasing uncertainty) or keeping it the same (and thus increasing computation time). At the extreme, if one can afford additional training time, the validation error may be computed multiple times every epoch.
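The idea of memristive validation can be sketched as repeating the (nondeterministic) validation pass with fresh nonideality realizations and aggregating, e.g., with the median. The stand-in error model and the repeat count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def noisy_validation_error(rng):
    """Stand-in for one validation pass of a nondeterministic MNN:
    a hypothetical error rate disturbed by simulated nonidealities."""
    return 0.07 + rng.normal(0.0, 0.01)

def memristive_validation(rng, num_repeats=20):
    """Aggregate metric: repeat validation with fresh nonideality realizations
    and report the median, a more reliable estimate than a single pass."""
    errors = [noisy_validation_error(rng) for _ in range(num_repeats)]
    return float(np.median(errors))

checkpoint_error = memristive_validation(rng)
```

Checkpoint selection based on `checkpoint_error` is then far less likely to pick weights that merely got a lucky single draw of the nonidealities.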
Nonideality Agnosticism
Accurate modeling of nonidealities for nonideality-aware ex situ training is a significant challenge. First, the nature of the nonidealities encountered in practice may be different from what was modeled for the purposes of training. For example, the existence of D2D variability in SiOx memristors means that the behavior of any individual device is not perfectly representative of the nature of other devices. Therefore, to hedge against fitting the model to the behavior of any specific device, we assumed that Poole-Frenkel parameters are inferred using a linear fit (determined by a trend in the experimental data) and disturbed by drawing random deviations from a probability distribution. Even so, in different devices, these trends and amounts of deviation may differ. Second, in the real world, one may encounter completely different types of nonidealities. If the training takes into account the effects of only, say, I-V nonlinearities, MNNs could still suffer from, for example, stuck devices when deployed. Therefore, it is important to find out how robust MNNs employing nonideality-aware training are.

To investigate this, we utilized networks trained either by assuming no nonidealities or by being exposed to one of eight different combinations of nonidealities. In the case of both types of I-V nonlinearity and two types of D2D variability, networks were additionally trained using regularization. During inference, each group of networks was then exposed to 1) the setup that it was trained on and 2) the setups of the other groups of networks. In total, this produced (1 + 8 + 2 + 2) × (1 + 8) = 117 scenarios; the median inference error for each is presented in the heatmap in Figure 8.
Figure 8
Extent of nonideality agnosticism of nonideality‐aware training. Median inference error in percent is shown for various training setups. Compared to standard (ideal) training, nonideality‐aware approach usually results in lower error rate during inference, even if the nonideality encountered is different from the one the networks were exposed to during training. This is especially the case with more severe nonidealities. Networks were trained on MNIST dataset. Each row in the heatmap uses a separate instance of the logarithmically scaled colormap. Additional information on training and inference setups can be found in Table S4 (Supporting Information).
The heatmap shows that, in most cases, the lowest error for a given nonideality during inference is achieved by the network that was exposed to it during training. A few exceptions exist, including regularized networks that were exposed to more uniform D2D variability during training: when exposed to high I-V nonlinearity during inference, they achieve a lower error of 8.5% compared to the 9.1% error of networks that were trained (without regularization) on this exact nonideality. This may suggest that the nature of some nonidealities might often overlap and that, when additional techniques like regularization are employed, the performance might be increased beyond even what can be achieved with knowledge of the particular nonideality.

Importantly, Figure 8 demonstrates that nonideality-aware training makes networks robust to the effects of a wide array of nonidealities. Except for the ideal case and low-severity nonidealities (low I-V nonlinearity and devices stuck at Goff), nonideality-aware training results in a lower median error compared to conventional training for all MNN groups, even when the networks encounter different nonidealities during inference. The heatmap also suggests that the effect of regularization on robustness is not always the same. In the case of high I-V nonlinearity, regularization usually results in higher error (compared to the nonregularized case) when encountering different nonidealities. On the other hand, regularization in networks dealing with D2D variability produces more robust behavior: when encountering different nonidealities, they usually achieve lower error.
Conclusion
In this work, we design a novel nonideality-aware training scheme that can improve the performance of MNNs. We demonstrate the importance of taking nonideal device behavior into account during training; without that, training performance may not be indicative of how well the networks will perform during inference. Importantly, we show the utility of nonideality-aware training in dealing with linearity-nonpreserving nonidealities, specifically I-V nonlinearity. Our simulations show that the dichotomy between stable device behavior and power efficiency can become less relevant if the training stage is adjusted. Indeed, if nonideality-aware training is used, high-nonlinearity devices may achieve a similar error rate, while also having almost three orders of magnitude better energy efficiency of 381 TOP s−1 W−1 (with regularization) compared to 0.715 TOP s−1 W−1 with the conventional training scheme and low-resistance devices.

We additionally explore a number of adjacent factors that are worth considering when dealing with MNN training. For example, we explore the use of double weights as a way to control the conductance values of MNNs more directly through methods like ℓ1 regularization. We explore the need to consider how validation is performed during training when nonidealities are stochastic and the outputs of an MNN are thus nondeterministic. In contrast to many previous works, we also investigate how robust nonideality-aware training is; we find that, compared to conventional training, it can usually achieve much lower error when encountering nonidealities that it was not trained to deal with. Moreover, nonideality-aware training can be applied not just to a single type of nonideality; it can deal with multiple nonidealities at once, for example, I-V nonlinearity and stuck devices, as shown in Figure 8.

Nonideality-aware training schemes are critical to making memristor-based ANNs feasible. Such schemes are key to ensuring low error rates and high energy efficiency.
Using experimental data and several novel optimization techniques, we demonstrate that our training design deals with a wide range of nonidealities, importantly linearity‐nonpreserving nonidealities, which have not been addressed during ex situ training before. Further, we demonstrate that high‐resistance operational ranges can be used to reduce power consumption by almost three orders of magnitude without significant accuracy loss, despite the high level of associated nonidealities.
Experimental Section
Fabrication of SiOx Devices
SiOx RRAM devices were fabricated on a Si substrate with 1 µm of thermal oxide on top. A 100 nm layer of Mo, which served as the bottom electrode, was deposited on top of the Si/SiO2 substrate. The SiOx layer was sandwiched between the bottom and top electrodes and was deposited by reactive sputtering. The top electrodes consisted of a 5 nm Ti wetting layer followed by 100 nm of Au; their pattern was defined using a shadow mask. The device sizes ranged from 200 µm × 200 µm to 800 µm × 800 µm.
Characterization of SiOx Devices
Electrical characterization of a 400 µm × 400 µm device was performed using a Keithley 4200A-SCS. The signals were applied to the top electrode (Au), while the bottom electrode (Mo) was connected to ground. The device required an initial electroforming step before stable resistive switching could be achieved. The forming process was carried out with a negative voltage sweep, which was stopped when the current reached the limit of 3 mA. Subsequently, 18 voltage sweeps were performed to guarantee proper device performance: the voltage was ramped from 0.0 V, to ±2.5 V, and back to 0.0 V using a 3 mA current compliance.

After this, to achieve a wide range of resistances, incremental positive sweeps were applied to the sample, starting from 0.5 V and increasing by 0.05 V in each run. This was repeated until there was no further resistance change, i.e., the filament had returned to its initial (post-forming) state. The obtained I-V curves are shown in Figure S2 (Supporting Information), while a subset of the curves utilized in this work is shown in Figure 2a,b.
Fabrication of Ta/HfO2
The Ta/HfO2 1T1R array contained NMOS transistors (with a feature size of 2 nm) and Pt/HfO2/Ta RRAM devices. The bottom electrode was deposited by evaporating a 20 nm Pt layer on top of a 2 nm Ta adhesive layer. A 5 nm HfO2 switching layer was deposited by atomic layer deposition using water and tetrakis(dimethylamido)hafnium as precursors at 250 °C. A 50 nm sputtered Ta layer followed by 10 nm of Pd served as the top electrode.[
] The fabrication process is described in more detail in Ref. [56].
Characterization of Ta/HfO2
Device conductance was increased using SET pulses (500 µs @ 2.5 V, with the gate voltage linearly increasing from 0.6 V to 1.6 V). After each 100-pulse cycle, RESET pulses (5 µs @ 0.9 V linearly increasing to 2.2 V, with a gate voltage of 5 V) were used to reduce the conductance. More information can be found in Ref. [21].
Simulations
The following architectures were employed:

- fully connected ANNs (trained on MNIST[
]) containing
  - a fully connected layer with 25 hidden neurons and logistic activation function
  - a fully connected layer with 10 output neurons and softmax activation function
- CNNs (trained on CIFAR-10[
]) containing
  - a convolutional layer with 32 output filters, 3 × 3 kernel size, and ReLU activation function
  - a pooling layer with 2 × 2 pool size
  - a convolutional layer with 64 output filters, 3 × 3 kernel size, and ReLU activation function
  - a pooling layer with 2 × 2 pool size
  - a convolutional layer with 64 output filters, 3 × 3 kernel size, and ReLU activation function with a maximum value of 1
  - a fully connected layer with 25 hidden neurons and logistic activation function
  - a fully connected layer with 10 output neurons and softmax activation function

To account for the high variability of the nonidealities (which were nondeterministic), five networks were trained for each configuration. Each trained network went through 25 inference runs, totaling 5 × 25 = 125 runs for each configuration.
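As a quick sanity check on the scale of the fully connected MNN, the number of synaptic weights can be tallied from the layer sizes listed above; whether biases are also mapped onto devices is an assumption here, so both counts are shown.

```python
# MNIST inputs are 28 x 28 = 784 pixels; layers as (inputs, outputs)
LAYERS = [(784, 25), (25, 10)]

# Weights only
num_weights = sum(i * o for i, o in LAYERS)
# Including one bias per output neuron
num_with_biases = sum(i * o + o for i, o in LAYERS)

print(num_weights)      # 784*25 + 25*10 = 19850
print(num_with_biases)  # 19850 + 35 = 19885
```

This is the order of magnitude of n that enters the energy-efficiency estimate in Section 3.2.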
Networks used a 4:1 training-validation split. All networks were trained for 1000 epochs with a batch size of 64. Where ℓ1 regularization was employed, a regularization factor of 10−4 was used. To ensure double weights stayed nonnegative, the NonNeg weight constraint provided by the Keras machine learning library was utilized. For any given batch (whose size was 64 during training, as mentioned before, and 100 during inference), conductances were disturbed once in the case of linearity-preserving nonidealities, and Poole-Frenkel parameters associated with individual devices were drawn from a probability distribution once in the case of I-V nonlinearity.
Statistical Analysis
In all box plots, the maximum whisker length was set to 1.5 × IQR.

In Figure 4 and equivalent plots, curves with semitransparent regions consist of two parts summarizing 20 inference repeats at certain epochs: an opaque curve representing the median values and a semitransparent region bounded by the minimum and maximum values.

To avoid large file sizes, only a subset of all data points is presented in Figures 3 and 7; these subsets were chosen randomly using NumPy.

Fifty-three SiOx resistance states were achieved using the procedure described earlier, but four of them were excluded from the analysis. Specifically, states where there were abrupt changes in current were not considered. This was done by excluding the curves where the maximum ratio of the second derivative of current (with respect to voltage) to the average current exceeded a threshold of 0.1 V−2.

The analysis of the residuals of ln(c) and ln(dε) is provided in Figure 9. One evident issue was that these two sets of residuals correlated to some extent, especially at higher resistance states, as can be seen in Figure 9b,d. If these deviations were simulated independently, the amount of uncertainty would be significantly overestimated, which is why the covariance matrix of the residuals is used in Equation (10). The rationale for simulating the deviations using a normal distribution is provided in the normal probability plots in Figure 9e–h.
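Drawing the two sets of deviations jointly, rather than independently, can be sketched with a multivariate normal parameterized by the empirical covariance of the residuals. The residuals below are synthetic and deliberately correlated; the experimental residuals of ln(c) and ln(dε) would take their place.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical residuals of ln(c) and ln(d*eps) across fitted resistance states
res_ln_c = rng.normal(0.0, 0.2, size=49)
res_ln_de = 0.8 * res_ln_c + rng.normal(0.0, 0.1, size=49)  # correlated on purpose

# Empirical covariance matrix of the two residual sets
cov = np.cov(np.stack([res_ln_c, res_ln_de]))

# Joint draws preserve the correlation; independent draws would
# overestimate the uncertainty, as noted in the text
deviations = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)
```

Each row of `deviations` is then one correlated pair of perturbations to apply to the fitted Poole-Frenkel parameters of a simulated device.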
Figure 9
Residuals from the trends of Poole–Frenkel parameters. Residuals of ln(c) for a) low- and b) high-resistance states. Residuals of ln(dε) for c) low- and d) high-resistance states. Normal probability plots of the residuals of ln(c) for e) low- and f) high-resistance states. Normal probability plots of the residuals of ln(dε) for g) low- and h) high-resistance states. In all panels, the inputs to logarithms are made dimensionless by using the amounts of the corresponding quantities in SI units.
Conflict of Interest
A.M. and A.J.K. are co‐founders and W.H.N. is an employee of Intrinsic, a company developing memristor technology.
Author Contributions
D.J. and E.W. contributed equally to this work, as did A.M. and G.A.C.