Alexandre Bittar, Philip N Garner.
Abstract
Artificial neural networks (ANNs) are the basis of recent advances in artificial intelligence (AI); they typically use real-valued neuron responses. By contrast, biological neurons are known to operate using spike trains. In principle, spiking neural networks (SNNs) may have a greater representational capability than ANNs, especially for time series such as speech; however, their adoption has been held back by both a lack of stable training algorithms and a lack of compatible baselines. We begin with a fairly thorough review of literature around the conjunction of ANNs and SNNs. Focusing on surrogate gradient approaches, we proceed to define a simple but relevant evaluation based on recent speech command tasks. After evaluating a representative selection of architectures, we show that a combination of adaptation, recurrence and surrogate gradients can yield light spiking architectures that are not only able to compete with ANN solutions, but also retain a high degree of compatibility with them in modern deep learning frameworks. We conclude tangibly that SNNs are appropriate for future research in AI, in particular for speech processing applications, and more speculatively that they may also assist in inference about biological function.
Keywords: artificial intelligence; deep learning; physiologically plausible models; signal processing; speech recognition; spiking neurons; surrogate gradient learning
Year: 2022; PMID: 36117617; PMCID: PMC9479696; DOI: 10.3389/fnins.2022.865897
Source DB: PubMed; Journal: Front Neurosci; ISSN: 1662-453X; Impact factor: 5.152
Figure 1. Different surrogate gradient functions to approximate the derivative of the step function responsible for spike generation.
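The idea in this caption can be sketched in a few lines: the forward pass keeps the hard threshold, while the backward pass replaces its derivative (zero almost everywhere) with a better-behaved stand-in. The function names, the threshold of 1.0, the boxcar width and height, and the fast-sigmoid slope below are illustrative assumptions, not values taken from the paper, and the fast-sigmoid shape is shown only as one commonly used alternative.

```python
def spike(u, threshold=1.0):
    """Forward pass: Heaviside step, 1.0 when the membrane potential reaches threshold."""
    return 1.0 if u >= threshold else 0.0


def boxcar_surrogate(u, threshold=1.0, half_width=0.5, height=0.5):
    """Backward-pass stand-in: constant inside a window around threshold, zero outside.

    half_width and height are assumed values chosen for illustration.
    """
    return height if abs(u - threshold) <= half_width else 0.0


def fast_sigmoid_surrogate(u, threshold=1.0, slope=10.0):
    """Alternative stand-in: derivative of a fast sigmoid, peaked at threshold.

    slope is an assumed steepness parameter.
    """
    return 1.0 / (1.0 + slope * abs(u - threshold)) ** 2


if __name__ == "__main__":
    for u in (0.2, 0.8, 1.0, 1.4, 2.0):
        print(u, spike(u), boxcar_surrogate(u), round(fast_sigmoid_surrogate(u), 4))
```

In a deep learning framework, the same split would normally be expressed as a custom differentiable operation whose backward rule returns the surrogate instead of the true derivative.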
Results on the SHD dataset for SNNs with different types of readout layer.
| Network | | | | |
|---|---|---|---|---|
| LIF 3 × 128 | 87.27 | | 62.45 | 77.99 |
| LIF 3 × 512 | | 87.27 | 67.74 | 79.46 |
| adLIF 3 × 128 | | 90.35 | 90.07 | 88.56 |
| adLIF 3 × 512 | | 93.52 | 90.53 | 89.61 |
Bold values indicate the best performing type of readout layer.
Figure 2. SNN model architecture with F input features, two recurrent layers with H hidden units each, and a final readout layer for the C classes. Feedforward and recurrent connections are represented with solid and dotted arrows, respectively. An input unit is simply the value of that feature at the given time step. Each hidden unit integrates the incoming stimuli from feedforward and recurrent connections at time step t with Equation (12). The membrane potential is updated and a spike is emitted if the threshold is reached. A readout unit integrates the purely feedforward incoming stimuli, updates its membrane potential (without spiking) and accumulates it over time using the softmax function. After passing the whole sequence through the network, the outputs then go into a cross-entropy loss function. Backpropagation is made possible by the use of a boxcar surrogate gradient.
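The forward pass described in this caption can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: Equation (12) is not reproduced in this record, so the leaky-integrate-and-fire update used here (decay `alpha`, threshold `theta`, reset by subtraction) is an assumed, commonly used form; a single hidden layer is shown instead of two; and the readout is interpreted as summing the softmax of the readout potentials over time steps. All function and parameter names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def hidden_step(x, u_prev, s_prev, w_ff, w_rec, alpha=0.9, theta=1.0):
    """One time step for H spiking hidden units (assumed LIF form)."""
    u_new, s_new = [], []
    for j in range(len(u_prev)):
        # Feedforward plus recurrent stimulus arriving at unit j.
        stim = sum(w * xi for w, xi in zip(w_ff[j], x))
        stim += sum(w * sk for w, sk in zip(w_rec[j], s_prev))
        # Leaky integration with reset by subtraction after a spike.
        u = alpha * (u_prev[j] - theta * s_prev[j]) + (1.0 - alpha) * stim
        u_new.append(u)
        s_new.append(1.0 if u >= theta else 0.0)
    return u_new, s_new


def readout_step(spikes, u_prev, w_out, alpha=0.9):
    """One time step for C non-spiking readout units (feedforward only)."""
    return [
        alpha * u_prev[c] + (1.0 - alpha) * sum(w * s for w, s in zip(w_out[c], spikes))
        for c in range(len(u_prev))
    ]


def forward(sequence, w_ff, w_rec, w_out):
    """Run a whole sequence and return the per-class scores accumulated over time."""
    n_hidden, n_classes = len(w_ff), len(w_out)
    u, s = [0.0] * n_hidden, [0.0] * n_hidden
    u_out = [0.0] * n_classes
    scores = [0.0] * n_classes
    for x in sequence:
        u, s = hidden_step(x, u, s, w_ff, w_rec)
        u_out = readout_step(s, u_out, w_out)
        scores = [acc + p for acc, p in zip(scores, softmax(u_out))]
    return scores
```

The accumulated scores would then be passed to a cross-entropy loss; training additionally requires replacing the derivative of the threshold comparison with a surrogate, which plain Python cannot express and is left out of this sketch.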
State-of-the-art on the SHD dataset.
| Method | Accuracy (%) |
|---|---|
| Attention (Yao et al.) | 91.1 |
| Recurrent + adaptation (Yin et al.) | 90.4 |
| Recurrent + adaptation (Yin et al.) | 84.4 |
| Recurrent + data augm. (Cramer et al.) | 83.2 |
| Recurrent + heter. time const. (Perez-Nieves et al.) | 82.7 |
| Recurrent (Cramer et al.) | 71.4 |
| Non-recurrent (Cramer et al.) | 47.5 |
| CNN (Cramer et al.) | 92.4 |
| LSTM (Cramer et al.) | 89 |
Bold values indicate the best performing network of its category (ANN or SNN).
State-of-the-art on the SSC dataset.
| Method | Accuracy (%) |
|---|---|
| Recurrent + adaptation (Yin et al.) | 74.2 |
| Recurrent + heter. time const. (Perez-Nieves et al.) | 57.3 |
| Recurrent (Cramer et al.) | 50.9 |
| Non-recurrent (Cramer et al.) | 41.0 |
| CNN (Cramer et al.) | 77.7 |
| LSTM (Cramer et al.) | 73 |
Bold values indicate the best performing network of its category (ANN or SNN).
Figure 3. Standard representation via filterbank features (A) and spike train representation via LAUSCHER (B) of the same spoken digit (seven in English) from the SHD dataset.
State-of-the-art on the SC dataset (version 2 with 35 labels).
| Method | Accuracy (%) |
|---|---|
| Recurrent + adaptation (Salaj et al.) | 91.21 |
| Recurrent + adaptation (Shaban et al.) | 91 |
| Transformers (Gong et al.) | 98.11 |
| Attention RNN (De Andrade et al.) | 93.9 |
Bold values indicate the best performing network of its category (ANN or SNN).
Results on the SHD dataset.
| Model | Recurrent | Layers | Hidden size | Test accuracy (%) |
|---|---|---|---|---|
| Tandem | No | 3 | 128 | 62.64 |
| Tandem | No | 3 | 1024 | 68.01 |
| LIF | No | 3 | 128 | 87.04 |
| LIF | No | 3 | 1024 | 89.29 |
| adLIF | No | 3 | 128 | 93.06 |
| adLIF | No | 3 | 1024 | 93.57 |
| RLIF | Yes | 3 | 128 | 89.75 |
| RLIF | Yes | 3 | 1024 | 92.51 |
| RadLIF | Yes | 3 | 128 | 92.88 |
| RadLIF | Yes | 3 | 1024 | 94.62 |
| MLP | No | 3 | 128 | 61.63 |
| RNN | Yes | 3 | 128 | 73.48 |
| liBRU | Yes | 3 | 128 | 89.61 |
| GRU | Yes | 3 | 128 | |
| SNN SOTA (Yao et al.) | | | | 91.1 |
| ANN SOTA (Cramer et al.) | | | | 92.4 |
The number of successes or positives in an experiment with binary outcomes can be modeled as a binomial distribution. The posterior of the binomial parameter is beta distributed for trivial priors. Here, the equal tailed 95% credible intervals for a flat prior are between ±2.1 and ±0.9% for test set accuracies between 61.63 and 94.62%. Note that larger ANNs were also tested but only obtained slight improvements and remained under the performance of SNNs of the same size. Bold values indicate the best performing network of its category (ANN or SNN).
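The credible intervals mentioned in this note can be approximated with a few lines of standard-library Python. With a flat Beta(1, 1) prior and k correct predictions out of n test examples, the posterior on the accuracy is Beta(k + 1, n - k + 1); the equal-tailed 95% interval cuts 2.5% of posterior mass from each side. The sketch below estimates the quantiles by sampling rather than by inverting the beta distribution exactly, and the test set size `n = 2000` used in the demonstration is a hypothetical round number, not the actual size of any dataset in the paper.

```python
import random


def credible_interval(successes, trials, level=0.95, draws=100_000, seed=0):
    """Equal-tailed credible interval for a binomial proportion under a flat prior.

    The posterior is Beta(successes + 1, trials - successes + 1); its quantiles
    are estimated here by Monte Carlo sampling (an approximation).
    """
    rng = random.Random(seed)
    a = successes + 1
    b = trials - successes + 1
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


if __name__ == "__main__":
    n = 2000  # hypothetical test set size, for illustration only
    for accuracy in (0.6163, 0.9462):
        k = round(accuracy * n)
        lo, hi = credible_interval(k, n)
        print(f"accuracy {accuracy:.4f}: interval [{lo:.4f}, {hi:.4f}], "
              f"half-width {100 * (hi - lo) / 2:.2f} points")
```

The interval narrows as the accuracy approaches 100% and as the test set grows, which matches the pattern of half-widths quoted across the result tables.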
Results on the HD dataset.
| Model | Recurrent | Layers | Hidden size | Test accuracy (%) |
|---|---|---|---|---|
| LIF | No | 3 | 128 | 98.40 |
| RLIF | Yes | 3 | 128 | |
| MLP | No | 3 | 128 | 96.99 |
| RNN | Yes | 3 | 128 | 99.13 |
| liBRU | Yes | 3 | 128 | |
| GRU | Yes | 3 | 128 | 99.91 |
Here the equal tailed 95% credible intervals are between ±0.5 and ±0.1% for test set accuracies between 96.99 and 99.96%. Bold values indicate the best performing network of its category (ANN or SNN).
Results on the SSC dataset.
| Model | Recurrent | Layers | Hidden size | Test accuracy (%) |
|---|---|---|---|---|
| LIF | No | 3 | 128 | 66.67 |
| LIF | No | 3 | 512 | 68.14 |
| adLIF | No | 3 | 128 | 71.66 |
| adLIF | No | 3 | 512 | 73.58 |
| RLIF | Yes | 3 | 128 | 73.87 |
| RLIF | Yes | 3 | 512 | 75.91 |
| RadLIF | Yes | 3 | 128 | 73.25 |
| RadLIF | Yes | 3 | 512 | 76.21 |
| RadLIF | Yes | 3 | 1024 | |
| MLP | No | 3 | 128 | 29.27 |
| RNN | Yes | 3 | 128 | 70.01 |
| liBRU | Yes | 3 | 512 | 78.70 |
| GRU | Yes | 3 | 512 | |
| SNN SOTA (Yin et al.) | | | | 74.2 |
| ANN SOTA (Cramer et al.) | | | | 77.7 |
Here the equal tailed 95% credible intervals on the accuracies are all about ±0.6% due to the large size of the test set. Bold values indicate the best performing network of its category (ANN or SNN).
Results on the SC dataset.
| Model | Recurrent | Layers | Hidden size | Test accuracy (%) |
|---|---|---|---|---|
| LIF | No | 3 | 128 | 82.12 |
| LIF | No | 3 | 512 | 83.03 |
| adLIF | No | 3 | 128 | 90.46 |
| adLIF | No | 3 | 512 | 93.12 |
| RLIF | Yes | 3 | 128 | 90.71 |
| RLIF | Yes | 3 | 512 | 93.58 |
| RadLIF | Yes | 3 | 128 | 92.48 |
| RadLIF | Yes | 3 | 512 | |
| MLP | No | 3 | 128 | 48.80 |
| MLP | No | 3 | 512 | 53.16 |
| RNN | Yes | 3 | 128 | 90.61 |
| RNN | Yes | 3 | 512 | 92.09 |
| liBRU | Yes | 3 | 128 | 94.55 |
| liBRU | Yes | 3 | 512 | |
| GRU | Yes | 3 | 128 | 93.65 |
| GRU | Yes | 3 | 512 | 94.32 |
| SNN SOTA (Salaj et al.) | | | | 91.21 |
| ANN SOTA (Gong et al.) | | | | 98.11 |
Here the equal tailed 95% credible intervals are between ±0.9 and ±0.4% for test set accuracies between 48.80 and 95.06%. Bold values indicate the best performing network of its category (ANN or SNN).
Results on the SHD dataset for a non-recurrent adLIF network of size 3 × 128, with a sparser connectivity in the first layer.
| Sparsity p (%) | Number of parameters | Test accuracy (%) |
|---|---|---|
| 0 | 109,864 | 93.06 |
| 10 | 100,904 | 92.83 |
| 50 | 65,064 | 92.14 |
| 90 | 29,224 | 92.33 |
| 95 | 24,744 | 92.14 |
| 99 | 21,160 | 91.59 |
Here a sparsity proportion p corresponds to randomly removing p% of the connecting weights in the first layer.
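The parameter counts in this table are consistent with simple arithmetic: the count at sparsity p equals the dense total minus p% of the first-layer weights. The short check below infers the number of first-layer weights from the first two rows (an inference from the table, not a figure stated in the text) and confirms that the remaining rows follow. The resulting 89,600 weights would correspond to 700 inputs feeding 128 hidden units; the input dimension is not given in this record, so treat that reading as an assumption.

```python
# Values copied from the table above.
DENSE_TOTAL = 109_864
REPORTED = {10: 100_904, 50: 65_064, 90: 29_224, 95: 24_744, 99: 21_160}

# Inferred: removing 10% of the first-layer weights removed 8,960 parameters,
# so the first layer holds ten times that many weights.
FIRST_LAYER_WEIGHTS = (DENSE_TOTAL - REPORTED[10]) * 10


def remaining_parameters(sparsity_percent):
    """Parameter count after removing sparsity_percent % of first-layer weights."""
    removed = FIRST_LAYER_WEIGHTS * sparsity_percent // 100
    return DENSE_TOTAL - removed


if __name__ == "__main__":
    print("first-layer weights:", FIRST_LAYER_WEIGHTS)
    for p, reported in REPORTED.items():
        print(p, remaining_parameters(p), reported)
```

Read this way, the table shows that at 99% sparsity the network keeps under a fifth of its parameters while losing about 1.5 points of accuracy.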
Results on the SHD dataset for a recurrent RadLIF network of size 3 × 1,024, with a sparser connectivity in all hidden layers.
| Sparsity p (%) | Number of parameters | Test accuracy (%) |
|---|---|---|
| 0 | 3,893,288 | 94.62 |
| 10 | 3,507,035 | 93.80 |
| 50 | 1,962,024 | 93.57 |
| 90 | 417,013 | 92.51 |
| 95 | 223,886 | 93.20 |
| 99 | 69,385 | 90.95 |
Here, a sparsity proportion p corresponds to randomly removing p% of the connecting weights (both feedforward and recurrent) in all hidden layers.
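Random removal of a fixed proportion of weights can be sketched with a binary mask, as below. This is an illustration under assumptions rather than the procedure used in the paper: it removes exactly `round(p% × N)` entries chosen uniformly without replacement, represents a weight matrix as a list of lists, and treats a removed weight as a zero. The function name `sparsify` is hypothetical.

```python
import random


def sparsify(weights, sparsity_percent, seed=0):
    """Return a copy of `weights` with sparsity_percent % of entries set to zero.

    Entries to remove are drawn uniformly at random without replacement.
    """
    rows, cols = len(weights), len(weights[0])
    n_remove = round(rows * cols * sparsity_percent / 100)
    rng = random.Random(seed)
    removed = set(rng.sample(range(rows * cols), n_remove))
    return [
        [0.0 if r * cols + c in removed else weights[r][c] for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    dense = [[1.0] * 10 for _ in range(10)]
    sparse = sparsify(dense, 90)
    kept = sum(1 for row in sparse for w in row if w != 0.0)
    print("weights kept:", kept, "of", 100)
```

For the sparsity to persist during training, the same mask would need to be reapplied after every weight update, or the masked entries excluded from the optimizer.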
Comparison between our adaptation scheme and an alternative moving threshold on the SHD and SC datasets.
| Dataset | Network | | |
|---|---|---|---|
| SHD | RadLIF 3 × 128 | 88.79 | |
| SHD | RadLIF 3 × 512 | 90.21 | |
| SHD | RadLIF 3 × 1,024 | 89.25 | |
| SC | RadLIF 3 × 128 | 90.95 | |
| SC | RadLIF 3 × 512 | 93.61 | |
Bold values indicate the best performing type of adaptation.