Ayesha Pervaiz, Fawad Hussain, Huma Israr, Muhammad Ali Tahir, Fawad Riasat Raja, Naveed Khan Baloch, Farruh Ishmanov, Yousaf Bin Zikria.
Abstract
The advent of new devices, machine learning techniques, and freely available large speech corpora has enabled rapid and accurate speech recognition. Over the last two decades, researchers and organizations have experimented extensively with new techniques and their applications in speech processing systems. Speech-command-based applications are found in robotics, IoT, ubiquitous computing, and various human-computer interfaces. Many researchers have worked on improving the efficiency of speech-command-based systems using the Speech Commands dataset; however, none of them addressed noise in it. Noise is one of the major challenges for any speech recognition system, as real-world noise is a versatile and unavoidable factor that degrades performance, particularly for systems that have not learned the noise efficiently. We thoroughly analyse the latest trends in speech recognition and evaluate the Speech Commands dataset with different machine-learning-based and deep-learning-based techniques. A novel technique for noise robustness is proposed that augments the training data with noise. The proposed technique is tested on clean data, noisy data, and a locally generated dataset, and achieves much better results than existing state-of-the-art techniques, setting a new benchmark.
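The core noise-augmentation idea, mixing recorded background noise into clean training utterances at a chosen signal-to-noise ratio, can be sketched as follows. The function name and SNR handling below are illustrative, not the paper's exact procedure:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise signal into a clean signal at the requested SNR (dB).

    Both inputs are 1-D float arrays at the same sample rate; the noise
    clip is tiled or truncated to match the clean signal's length.
    """
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Toy example: a 1 s, 440 Hz tone at 16 kHz mixed with white noise at 10 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)
noisy = mix_at_snr(clean, noise, snr_db=10)
```

In practice one would loop this over every training utterance, drawing the noise clip (e.g. `white_noise`, `running_tap`) and the SNR at random so the model sees many noise conditions.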
Keywords: acoustic modelling; automatic speech recognition; data science; deep learning; deep neural networks; kaldi; language modelling; speech command set; voice recognition; word error rate
Year: 2020 PMID: 32325814 PMCID: PMC7219662 DOI: 10.3390/s20082326
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Generic model of an automatic speech recognition (ASR) system.
Language data files.
| File Name | Pattern of File | Pattern Example |
|---|---|---|
| lexicon.txt | <word> <phone 1> <phone 2> | backward B AE K W ER D |
| nonsilence_phones.txt | <phone> | AA, AE, AH, AO |
| silence_phones.txt | <phone> | SIL |
| optional_silence.txt | <phone> | SIL |
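A minimal sketch of generating the four dictionary files above for a toy lexicon (the word list and pronunciations are illustrative, and the output directory is a temporary path rather than a real Kaldi `data/local/dict` directory):

```python
from pathlib import Path
import tempfile

# Toy lexicon: word -> phone sequence, following the patterns in the table.
lexicon = {
    "backward": ["B", "AE", "K", "W", "ER", "D"],
    "yes": ["Y", "EH", "S"],
}
outdir = Path(tempfile.mkdtemp())

# lexicon.txt: <word> <phone 1> <phone 2> ...
with open(outdir / "lexicon.txt", "w") as f:
    for word in sorted(lexicon):
        f.write(f"{word} {' '.join(lexicon[word])}\n")

# nonsilence_phones.txt: one phone per line, deduplicated from the lexicon.
phones = sorted({p for prons in lexicon.values() for p in prons})
(outdir / "nonsilence_phones.txt").write_text("\n".join(phones) + "\n")

# silence_phones.txt and optional_silence.txt both list the silence phone.
(outdir / "silence_phones.txt").write_text("SIL\n")
(outdir / "optional_silence.txt").write_text("SIL\n")
```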
Figure 2. Steps involved in MFCC (mel-frequency cepstral coefficient) calculation.
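The standard MFCC pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT) can be sketched in NumPy as below; the frame length, hop, and filter counts are common defaults, not necessarily the exact settings used in the paper:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Sketch of the standard MFCC pipeline."""
    # 1. Pre-emphasis boosts high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Frame into overlapping windows and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Mel filterbank: triangular filters equally spaced on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

# One second of a 300 Hz tone -> a (frames x 13) feature matrix.
feats = mfcc(np.sin(2 * np.pi * 300 * np.arange(16000) / 16000))
```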
Figure 3. Model for the training of GMM-HMM. CMVN, cepstral mean and variance normalization; Tri, triphone; MLLT, maximum likelihood linear transform.
Figure 4. Model for the training of DNN.
Figure 5. Model for the training of LSTM.
Figure 6. Models of CNN.
Figure 7. Model for decoding. WER, word error rate; SER, sentence error rate.
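WER, the metric reported in all the tables below, is the word-level edit distance (substitutions + deletions + insertions) between the reference and the decoded hypothesis, divided by the number of reference words. A minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate via standard Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution out of three reference words -> WER = 1/3.
example = wer("go left stop", "go right stop")
```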
Dataset details.
| Command Group | Words |
|---|---|
| core words | down, eight, five, four, go, left, nine, no, off, on, one, right, seven, six, stop, three, two, up, yes, zero |
| auxiliary words | bed, bird, cat, dog, happy, house, marvin, sheila, tree, wow |
| additional 5 commands in V2 | backward, follow, forward, learn, visual |

| Noise File | Duration |
|---|---|
| doing_the_dishes | 1 min 35 s |
| dude_miaowing | 1 min 1 s |
| exercise_bike | 1 min 1 s |
| pink_noise | 1 min 0 s |
| running_tap | 1 min 0 s |
| white_noise | 1 min 1 s |
WER of models trained on clean data. MMI, maximum mutual information; MPE, minimum phone error.
| Model | WER (%) Clean (V1) | WER (%) Clean (V2) | WER (%) Noisy (V1) | WER (%) Noisy (V2) | WER (%) (Local Dataset) |
|---|---|---|---|---|---|
| mono | 26.79 | 34.63 | 80.23 | 76.91 | 52.96 |
| tri1 | 7.53 | 11.03 | 65.27 | 62.40 | 23.36 |
| tri2a | 7.92 | 11.25 | 65.30 | 63.17 | 22.50 |
| tri2b | 6.75 | 9.88 | 64.14 | 59.69 | 21.29 |
| tri2_MMI | 6.37 | 8.28 | 63.39 | 57.50 | 20.14 |
| tri2b_MMI_b | 6.17 | 8.05 | 63.31 | 57.31 | 20.14 |
| tri2b_MPE | 6.44 | 9.24 | 64.30 | 58.89 | 22.32 |
| tri3b | 5.77 | 8.72 | 60.41 | 57.72 | 15.14 |
| tri3b_MMI | 5.24 | 7.29 | 57.38 | 53.75 | 14.29 |
| tri3b_fMMI + MMI | 5.15 | 6.43 | 56.95 | 52.76 | 13.82 |
WER of models trained on noisy data.
| Model | WER (%) Clean (V1) | WER (%) Clean (V2) | WER (%) Noisy (V1) | WER (%) Noisy (V2) |
|---|---|---|---|---|
| mono | 94.20 | 92.61 | 84.47 | 84.40 |
| tri1 | 43.32 | 54.90 | 53.65 | 55.32 |
| tri2a | 44.56 | 46.97 | 52.43 | 54.14 |
| tri2b | 47.47 | 45.22 | 42.11 | 54.18 |
| tri2b_MMI | 41.81 | 43.95 | 39.40 | 49.11 |
| tri2b_MMI_b | 40.79 | 43.43 | 39.65 | 49.26 |
| tri2b_MPE | 38.65 | 36.72 | 42.80 | 52.42 |
| tri3b | 32.56 | 37.87 | 33.40 | 43.11 |
| tri3b_MMI | 24.23 | 32.81 | 31.55 | 40.25 |
| tri3b_fMMI + MMI | 20.85 | 28.42 | 32.11 | 38.58 |
WER of deep models trained and tested on clean data.
| Model | WER (%) V1 | WER (%) V2 | Model | WER (%) V1 | WER (%) V2 |
|---|---|---|---|---|---|
| LSTM | 4.60 | 6.06 | Base | 14.6 | 11.8 |
| CNN-max | 3.82 | 3.39 | Caps-inputvec4 | 11.6 | - |
| CNN-avg | 4.41 | 3.93 | Caps-channel32 | 11.3 | - |
| CNN-max-sameconv | 4.10 | 3.50 | Caps-outputvec4 | 10.5 | - |
| CNN-max-addconv | 3.69 | 3.15 | LSTM baseline | 9.24 | - |
| CNN-max-avg | 3.89 | 3.88 | Direct enhancement | 8.53 | - |
| DNN (L = 1) | 2.43 | 1.64 | T-F masking enhancement | 7.1 | - |
| DNN (L = 2) | 3.00 | 1.78 | Neural attention | 5.7 | 6.1 |
WER of deep models trained on clean data and tested on noisy data.
| Model | WER (%) V1 | WER (%) V2 |
|---|---|---|
| Caps-inputvec4 | 47.3 | - |
| Caps-channel32 | 47.4 | - |
| Caps-outputvec4 | 44.7 | - |
| LSTM | 69.09 | 62.79 |
| DNN (L = 1) | 38.63 | 25.07 |
| DNN (L = 2) | 49.13 | 45.09 |
WER of models tested on locally generated data.
| Model | WER (%) (Local Dataset) |
|---|---|
| LSTM | 15.50 |
| DNN (L = 1) | 16.89 |
| DNN (L = 2) | 10.89 |
WER of deep models trained on noisy data.
| Model | WER (%) Clean (V1) | WER (%) Clean (V2) | WER (%) Noisy (V1) | WER (%) Noisy (V2) |
|---|---|---|---|---|
| TF-Masking | - | - | - | 16.08 |
| LSTM | 22.09 | 62.04 | 21.08 | 28.26 |
| DNN (L = 1) | 5.24 | 55.88 | 14.31 | 12.55 |
| DNN (L = 2) | 10.91 | 40.29 | 15.91 | 14.44 |
WER of models with different training options and the optimization function trained and tested on clean data.
| Model | WER (%) V1 | WER (%) V2 |
|---|---|---|
| CNN-max-Optimized | 3.59 | 12.67 |
| CNN-max-sgdm | 4.82 | 4.58 |
| CNN-max-rmsprop | 3.84 | 3.41 |
| CNN-max-Adam | 3.82 | 3.39 |
WER of models with different beam parameters trained and tested on clean data.
| Model | WER (%) |
|---|---|
| LSTM (beam = 5) | 6.16 |
| LSTM (beam = 15) | 6.06 |
| LSTM (beam = 30) | 5.95 |
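The beam parameter controls how many partial hypotheses survive pruning at each decoding step: a wider beam costs more computation but can only match or improve the best-scoring hypothesis, consistent with the small WER gains above. A toy sketch (the symbols and probabilities below are invented for illustration):

```python
import math

def beam_search(step_logprobs, beam):
    """Keep only the `beam` highest-scoring partial hypotheses per step.

    `step_logprobs` is a list of {symbol: log-prob} dicts, one per time step.
    Returns the best (sequence, total log-prob) pair after the final step.
    """
    beams = [((), 0.0)]  # one empty hypothesis with score 0
    for dist in step_logprobs:
        # Extend every surviving hypothesis by every candidate symbol...
        candidates = [(seq + (sym,), score + lp)
                      for seq, score in beams for sym, lp in dist.items()]
        # ...then prune back down to the `beam` best.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam]
    return beams[0]

steps = [{"go": math.log(0.6), "no": math.log(0.4)},
         {"left": math.log(0.7), "right": math.log(0.3)}]
best_seq, best_score = beam_search(steps, beam=2)
```

Real lattice decoding in Kaldi also folds in acoustic/language-model weights and transition structure, but the prune-per-step mechanism is the same.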