Jibin Wu1, Emre Yılmaz1, Malu Zhang1, Haizhou Li1,2, Kay Chen Tan3.
Abstract
Artificial neural networks (ANN) have become the mainstream acoustic modeling technique for large vocabulary automatic speech recognition (ASR). A conventional ANN features a multi-layer architecture that requires massive amounts of computation. Brain-inspired spiking neural networks (SNN) closely mimic biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation. Motivated by their unprecedented energy efficiency and rapid information-processing capability, we explore the use of SNNs for speech recognition. In this work, we use SNNs for acoustic modeling and evaluate their performance on several large vocabulary recognition scenarios. The experimental results demonstrate ASR accuracies competitive with their ANN counterparts, while requiring only 10 algorithmic time steps and as few as 0.68 times the total synaptic operations to classify each audio frame. Integrating the algorithmic power of deep SNNs with energy-efficient neuromorphic hardware therefore offers an attractive solution for ASR applications running locally on mobile and embedded devices.
Keywords: acoustic modeling; automatic speech recognition; deep spiking neural networks; neuromorphic computing; tandem learning
Year: 2020 PMID: 32256308 PMCID: PMC7090229 DOI: 10.3389/fnins.2020.00199
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
Figure 1: Comparison of the synchronous and asynchronous computational paradigms adopted by (A) ANNs and (B) SNNs, respectively (revised from Pfeiffer and Pfeil, 2018).
Figure 2: Block diagram of a conventional ASR system. The acoustic and linguistic components are incorporated to jointly determine the most likely hypothesis.
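The joint determination of the most likely hypothesis combines acoustic and language model scores in the log domain, Ŵ = argmax_W [log P(X|W) + λ log P(W)]. A minimal sketch of this combination, using hypothetical per-hypothesis scores that are not from the paper:

```python
def best_hypothesis(hypotheses, lm_weight=1.0):
    """Pick argmax_W [ log P(X|W) + lm_weight * log P(W) ]."""
    return max(
        hypotheses,
        key=lambda h: h["acoustic_logp"] + lm_weight * h["lm_logp"],
    )

# Illustrative competing hypotheses with made-up log-scores.
hyps = [
    {"text": "recognize speech", "acoustic_logp": -12.0, "lm_logp": -3.0},
    {"text": "wreck a nice beach", "acoustic_logp": -11.5, "lm_logp": -6.0},
]
print(best_hypothesis(hyps, lm_weight=1.0)["text"])  # -> recognize speech
```

The language-model weight `lm_weight` plays the same balancing role as the LM scale tuned in conventional ASR decoders.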
Figure 3: The neuronal dynamics of an integrate-and-fire neuron (red). In this example, three pre-synaptic neurons send asynchronous spike trains to this neuron. Output spikes are generated when the membrane potential V crosses the firing threshold (top right corner).
Figure 4: System flowchart for SNN training within a tandem neural network, wherein the SNN layers are used in the forward pass to determine the spike counts and spike trains, and the ANN layers are used for error back-propagation to approximate the gradients of the coupled SNN layers.
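The integrate-and-fire dynamics described in Figure 3 can be sketched in a few lines. The weights, threshold, and reset-by-subtraction rule below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def integrate_and_fire(spike_trains, weights, threshold=1.0):
    """spike_trains: (n_pre, T) binary array; returns a (T,) output spike train."""
    v = 0.0                                   # membrane potential
    out = np.zeros(spike_trains.shape[1], dtype=int)
    for t in range(spike_trains.shape[1]):
        v += weights @ spike_trains[:, t]     # integrate weighted input spikes
        if v >= threshold:                    # fire when the threshold is crossed
            out[t] = 1
            v -= threshold                    # reset by subtraction
    return out

# Three pre-synaptic neurons sending asynchronous spikes over 4 time steps.
pre = np.array([[1, 0, 1, 1],
                [0, 1, 1, 0],
                [1, 1, 0, 1]])
w = np.array([0.4, 0.3, 0.5])
print(integrate_and_fire(pre, w))             # -> [0 1 1 1]
```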
PER (%) on the TIMIT development and test sets.

Upper panel (results from the literature):

| Feature | Dev | Test |
|---|---|---|
| MFCC (13-dim.) | 16.7 | 26.4 |
| FBANK (13-dim.) | 15.8 | – |
| FMLLR | 14.9 | – |

Lower panel (this work):

| Feature | Dev (ANN) | Dev (SNN) | Test (ANN) | Test (SNN) |
|---|---|---|---|---|
| MFCC (13-dim.) | 17.1 | 17.8 | 18.5 | 19.1 |
| FBANK (13-dim.) | 18.5 | | | |
| MFCC (40-dim.) | 17.3 | 18.2 | 18.7 | 19.8 |
| FBANK (40-dim.) | 17.8 | 19.1 | | |
| FMLLR | 15.8 | 16.5 | 17.2 | 17.4 |

The upper panel reports the results of various ANN and SNN architectures from the literature, and the lower panel presents the results achieved by the ANN and SNN models in this work (AM, acoustic model; †: the best result to date). The best results given by the speaker-independent features in each column are marked in bold.
WERs (%) achieved on the monolingual and mixed segments of the FAME test set.

| Feature | AM | Frisian (fy) | Dutch (nl) | Mixed (fy-nl) | All |
|---|---|---|---|---|---|
| FBANK (40-dim.) | Kaldi-ANN (Yılmaz et al.) | 32.4 | 39.7 | 49.9 | 36.2 |
| MFCC (40-dim.) | TDNN-LSTM (Yılmaz et al.) | 31.5 | 39.5 | 47.9 | 35.2 |
| MFCC (13-dim.) | ANN | 34.6 | 50.0 | 49.9 | 39.9 |
| MFCC (13-dim.) | SNN | 33.8 | 45.3 | 47.9 | 38.2 |
| FBANK (13-dim.) | ANN | 34.3 | 47.5 | 48.1 | 39.0 |
| FBANK (13-dim.) | SNN | 33.1 | 44.3 | 46.5 | 37.3 |
| MFCC (40-dim.) | ANN | 35.2 | 48.4 | 51.7 | 40.2 |
| MFCC (40-dim.) | SNN | 33.7 | 44.2 | 46.9 | 37.7 |
| FBANK (40-dim.) | ANN | 34.4 | 46.3 | 49.8 | 39.0 |
| FBANK (40-dim.) | SNN | | | | |
| FMLLR | ANN | 31.2 | 42.1 | 47.2 | 35.7 |
| FMLLR | SNN | 31.5 | 39.5 | 46.6 | 35.2 |

The upper panel summarizes the number of words in each language subset. The middle panel provides the results of state-of-the-art ANN architectures (Yılmaz et al.), and the lower panel presents the results achieved by the ANN and SNN models in this work.
WER (%) achieved on the LibriSpeech development and test sets.

Upper panel (toolkit baselines):

| Toolkit | AM | Dev | Test |
|---|---|---|---|
| Kaldi | p-norm ANN | 9.2 | 9.7 |
| PyTorch-Kaldi | Li-GRU | – | 8.6 |

Middle panel (this work, 100-h training data):

| Feature | Dev (ANN) | Dev (SNN) | Test (ANN) | Test (SNN) |
|---|---|---|---|---|
| MFCC | 10.3 | 10.5 | 10.6 | 10.9 |
| FBANK | 9.6 | 10.2 | 10.6 | |
| MFCC (40-dim.) | 10.1 | 10.6 | | |
| FBANK (40-dim.) | 9.6 | 10.2 | 10.1 | |
| FMLLR | 9.2 | 9.3 | 9.7 | 9.9 |

Lower panel (this work, 360-h training data):

| Feature | Dev (ANN) | Dev (SNN) | Test (ANN) | Test (SNN) |
|---|---|---|---|---|
| MFCC | 9.2 | 9.9 | 9.6 | 10.3 |
| FBANK | 8.6 | 9.7 | 9.1 | 10.0 |
| MFCC (40-dim.) | 8.6 | | | |
| FBANK (40-dim.) | 9.4 | 9.7 | | |
| FMLLR | 8.4 | 9.2 | 8.8 | 9.7 |

The upper panel gives the results, with 100 h of training data, reported in the GitHub repositories of Kaldi and PyTorch-Kaldi. The middle and lower panels present the results achieved by the ANN and SNN models in this work using 100 h and 360 h of training data, respectively. The best results given by the speaker-independent features in the middle and lower panels are marked in bold (AM, acoustic model; †: reported in the GitHub repo).
Comparison of the computational costs between SNN and ANN.

| | Utt. 1 | Utt. 2 | Utt. 3 | Utt. 4 | Utt. 5 | Avg. |
|---|---|---|---|---|---|---|
| Num. of frames | 474 | 287 | 274 | 268 | 223 | |
| MFCC (40-dim.) | 1.71 | 1.73 | 1.76 | 1.71 | 1.68 | 1.72 |
| FBANK (40-dim.) | 1.08 | 1.08 | 1.14 | 1.09 | 1.10 | 1.10 |
| FMLLR | 0.67 | 0.68 | 0.71 | 0.66 | 0.67 | 0.68 |

The ratio of the required total synaptic operations [SynOps(SNN) / SynOps(ANN)] is reported for five test utterances. It is worth mentioning that ANNs use the costlier multiply-accumulate (MAC) operations, whereas SNNs use only accumulate (AC) operations.
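The bookkeeping behind the SynOps ratio can be reproduced in spirit as follows. The layer sizes and per-layer spike counts below are hypothetical, chosen only to illustrate the counting; they are not the paper's network configuration:

```python
def ann_synops(layer_sizes):
    """MAC count for one frame through a fully connected ANN."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def snn_synops(spike_counts, fan_outs):
    """AC count: every emitted spike triggers one accumulate per outgoing synapse."""
    return sum(s * f for s, f in zip(spike_counts, fan_outs))

sizes = [40, 512, 512, 2000]   # input, two hidden layers, output units (hypothetical)
spikes = [150, 300, 200]       # total spikes per layer over all time steps (hypothetical)
ratio = snn_synops(spikes, sizes[1:]) / ann_synops(sizes)
print(round(ratio, 2))         # -> 0.48
```

With sparse enough spiking activity, the SNN's accumulate count falls below the ANN's multiply-accumulate count, which is how ratios under 1.0 (such as the 0.68 reported for FMLLR) arise.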
Figure 5: Average spike count per neuron of different SNN layers on the TIMIT corpus. The results of different input features are color-coded. Sparse neuronal activities can be observed in this bar chart.
Pseudo Codes For The Tandem Learning Rule
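The tandem rule couples each SNN layer with an ANN layer that shares its weights: the SNN produces exact spike trains and spike counts in the forward pass, while the coupled ANN layer provides a differentiable surrogate for error back-propagation. A minimal NumPy sketch of one such layer pair; the layer sizes, the ReLU surrogate form, and the reset-by-subtraction rule are illustrative assumptions, not the paper's exact pseudo code:

```python
import numpy as np

rng = np.random.default_rng(0)

def snn_layer_forward(in_spikes, W, threshold=1.0):
    """IF-layer forward pass: (n_in, T) spikes -> (n_out, T) spikes plus counts."""
    n_out, T = W.shape[0], in_spikes.shape[1]
    v = np.zeros(n_out)                      # membrane potentials
    out = np.zeros((n_out, T))
    for t in range(T):
        v += W @ in_spikes[:, t]             # integrate synaptic current
        fired = v >= threshold
        out[fired, t] = 1.0
        v[fired] -= threshold                # reset by subtraction
    return out, out.sum(axis=1)              # exact spike train and spike counts

def ann_surrogate(in_counts, W):
    """Coupled ANN layer, used only in the backward pass: a differentiable
    stand-in for the SNN layer, sharing the same weights W."""
    return np.maximum(W @ in_counts, 0.0)    # ReLU on aggregated input counts

in_spikes = (rng.random((5, 10)) < 0.3).astype(float)   # random input spike trains
W = rng.normal(0.0, 0.5, size=(4, 5))                   # shared weight matrix
spikes, counts = snn_layer_forward(in_spikes, W)        # SNN forward pass
surrogate = ann_surrogate(in_spikes.sum(axis=1), W)     # ANN backward-path output
print(counts, surrogate)
```

In training, the gradient of the loss with respect to `surrogate` would update `W`, and the updated weights are then reused by the spiking forward pass in the next iteration.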