Abstract
The convolution and pooling operations in a convolutional subsampling network cause a loss of effective information and incomplete feature extraction, so the shallow features of speech signals are not fully extracted; this limits the accuracy and speed of current speech processing architectures based on the conformer model. To solve these problems, in this study, we investigated a method that uses a capsule network to improve the accuracy of feature extraction in a conformer-based model, and we proposed a new end-to-end model architecture for speech recognition. First, to improve the accuracy of speech feature extraction, a capsule network with a dynamic routing mechanism was introduced into the conformer model; thus, the structural information in speech was preserved and passed to the conformer blocks as capsule vectors, and the learning ability of the conformer-based model was significantly enhanced through dynamic weight updating. Second, a residual network was added to the capsule blocks; thus, the mapping ability of our model was improved and the training difficulty was reduced. Furthermore, a bi-transformer model was adopted in the decoding network to promote the consistency of hypotheses in different directions through bidirectional modeling. Finally, the effectiveness and robustness of the proposed model were verified against different types of recognition models in multiple sets of experiments. The experimental results demonstrated that our speech recognition model achieved a lower word error rate without a language model, owing to the higher accuracy of speech feature extraction and learning provided by the capsule network. Our architecture thus benefits from the combined advantages of the capsule network and the conformer encoder, and it also has potential for other speech-related applications.
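The dynamic routing mechanism described above can be sketched as routing-by-agreement between two capsule layers. This is a minimal NumPy sketch, not the paper's implementation: the function names, the tensor shapes, and the assumption that the prediction vectors `u_hat` have already been produced by learned transformation matrices are all illustrative.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Squash nonlinearity: keeps a capsule's direction, maps its norm into [0, 1)
    sq = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iter=2):
    """Routing-by-agreement.
    u_hat: (n_in, n_out, d) prediction vectors from the lower capsule layer."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                               # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted sum per output capsule
        v = squash(s)                                         # output capsule vectors
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # reward agreement
    return v
```

The `n_iter=2` default mirrors the routing-iteration ablation reported in the tables below, where two iterations gave the lowest WER.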
Keywords: Chinese speech recognition; bi-transformer; capsule network; conformer; end-to-end
Year: 2022 PMID: 35885089 PMCID: PMC9324068 DOI: 10.3390/e24070866
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. Conformer encoder model architecture [24].
Figure 2. Architecture of the capsule networks.
Figure 3. Architecture of our proposed speech recognition model.
Figure 4. Combination of the capsule and residual networks.
Figure 5. Diagram of the conformer blocks.
Figure 6. Structure of the U2 model.
Figure 7. Framewise and CTC networks classifying a speech signal [31].
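Figure 7 contrasts framewise classification, which must label every frame, with CTC, which adds a blank symbol and collapses the frame-level output into a transcript. A minimal sketch of CTC's collapsing rule under greedy decoding (the helper name and the `-` blank symbol are illustrative assumptions):

```python
def ctc_collapse(frame_labels, blank="-"):
    # CTC decoding rule: merge consecutive repeats, then drop blank symbols
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# e.g. ctc_collapse("hh-e-ll-ll-oo") -> "hello"
```

The blank lets CTC keep repeated characters ("ll") distinct from one character held across several frames.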
Comparison of speech recognition models with different encoders and decoders.
| Models | WER (%) | Parameter Quantity (M) |
|---|---|---|
| CNN × 2 + ResconvLSTM × 8 + DNN + CTC | 17.59 | - |
| LAS | 13.20 | - |
| CB | 6.21 | 50.0 |
| CT | 6.82 | 47.8 |
| TT | 7.50 | 33.8 |
| TB | 6.55 | 35.8 |
| Caps-TB | 6.29 | 37.7 |
| Caps-TT | 7.31 | 35.5 |
| Caps-CT | 6.36 | 49.7 |
| Caps-CB | 5.97 | 51.9 |
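The WER values above are edit-distance based: the Levenshtein distance between the hypothesis and reference token sequences, normalized by the reference length (for Chinese, the same computation over characters gives the character error rate). A minimal sketch; the function name is an assumption, not from the paper:

```python
def wer(ref, hyp):
    # Word error rate: Levenshtein distance over words / reference length,
    # computed with a single rolling DP row.
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (rw != hw))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)
```

For character-level scoring, tokenize into characters (e.g. `" ".join(sentence)`) before calling the same function.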
Figure 8. Loss function curves of the different models during training.
Performance comparisons of the recognition models based on the performer encoder.
| Model | Enc-Dec-Dm-Head | WER (%) | Parameter Quantity (M) |
|---|---|---|---|
| Caps-PB | 12-6-256-4 | 8.01 | 36.1 |
| Caps-CB | 12-6-256-4 | 5.97 | 51.9 |
| PB | 12-6-256-4 | 8.24 | 34.2 |
| CB | 12-6-256-4 | 6.21 | 50.0 |
Performance improvement after parameter balancing, and analysis.
| | CB | CB_enlarged | Caps-CB |
|---|---|---|---|
| WER (%) | 6.21 | 6.18 | 5.97 |
| Parameter quantity (M) | 50.00 | 51.95 | 51.94 |
Performance achieved with different router iterations in the capsule network.
| Routing iterations | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| WER (%) | 5.97 | 6.25 | 6.23 | 6.27 | 6.28 |
| Parameter quantity (M) | 51.9 | 51.9 | 51.9 | 51.9 | 51.9 |
Figure 9. WERs of different models with variable beam sizes.
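The beam size in Figure 9 is the number of partial hypotheses the decoder keeps at each step. A toy beam search over per-step token log-probabilities illustrates the mechanism (illustrative only: the real decoder scores hypotheses with the bi-transformer and CTC, not with independent per-frame distributions):

```python
import math

def beam_search(step_log_probs, beam_size):
    # Keep the `beam_size` highest-scoring partial hypotheses at every step.
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for step in step_log_probs:
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in step.items()]
        candidates.sort(key=lambda c: -c[1])
        beams = candidates[:beam_size]
    return beams[0][0]  # best-scoring hypothesis

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"c": math.log(0.9), "d": math.log(0.1)}]
```

A larger beam explores more hypotheses at a higher decoding cost, which is the trade-off the figure examines.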
Transcription performance in Chinese with different models when the beam size was 20.
| Models | Transcription |
|---|---|
| Truth | 这令被贷款的员工们寝食难安 (This left the employees in whose names the loans were taken unable to eat or sleep) |
| (a) LAS | 这令被 |
| (b) CB | 这令被贷款的员工们 |
| (c) Caps-CB | 这令被贷款的员工们 |
| Truth | 按照扶优扶大扶强的原则 (In accordance with the principle of supporting the excellent, the large, and the strong) |
| (a) LAS | 按照 |
| (b) CB | 按照 |
| (c) Caps-CB | 按照 |