| Literature DB >> 32218379 |
Long Zhang, Ziping Zhao, Chunmei Ma, Linlin Shan, Huazhi Sun, Lifen Jiang, Shiwen Deng, Chang Gao.
Abstract
Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides a new opportunity to update APED algorithms. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then applied to the APED task for Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, in terms of the accuracy metric, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network-deep neural network (DNN-DNN) architecture, and it performs better on the F-measure metric, which is especially suitable for the requirements of the APED task.
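The hybrid training objective described in the abstract can be sketched as a simple interpolation of the two losses. This snippet is an illustration under stated assumptions, not the authors' code: the function name `hybrid_loss` and the weight name `lam` are ours.

```python
def hybrid_loss(loss_ctc: float, loss_att: float, lam: float) -> float:
    """Multi-task objective of the hybrid CTC/attention architecture:
    L = lam * L_CTC + (1 - lam) * L_attention, with 0 <= lam <= 1.
    lam = 1 recovers pure CTC training; lam = 0 recovers pure attention."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation weight must lie in [0, 1]")
    return lam * loss_ctc + (1.0 - lam) * loss_att
```

An adaptive scheme would adjust `lam` during training rather than fixing it; the paper's exact adaptation rule is not reproduced here.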
Keywords: ASR; CAPT; CTC; attention-based; automatic pronunciation error detection; end-to-end; seq2seq model
Year: 2020 PMID: 32218379 PMCID: PMC7180994 DOI: 10.3390/s20071809
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Framework of a typical automatic pronunciation error detection (APED) system.
Figure 2. Block diagram of the end-to-end APED system based on the hybrid connectionist temporal classification (CTC)/attention architecture.
Figure 3. Overall framework of the end-to-end automatic speech recognition (ASR) system based on the hybrid CTC/attention architecture.
A list of initials and finals in Mandarin.
| Type | Quantity | Phone Units |
|---|---|---|
| Initial | 21 | b p m f d t n l g k h j q x zh ch sh r z c s |
| Simple final | 9 | a o e i u ü -i1 -i2 er |
| Compound final | 13 | ai ei ao ou ia ie ua uo üe iao iu uai ui |
| Final with a nasal ending | 16 | an ian uan üan en in un ün ang iang uang eng ing ueng ong iong |
Note: There are 39 finals in Chinese Pinyin as defined by linguistic phoneticists. The symbols -i1 and -i2 denote the simple finals that can follow only the initials zh, ch, sh and z, c, s, respectively, but no other initials. Although ê is also a simple final in Chinese Pinyin, it does not form a syllable on its own; it always combines with i and ü to form the compound finals ie and üe, so it is not included in the simple final list. In total, this paper uses 59 phones: 21 initials and 38 finals.
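The inventory in the table above can be checked mechanically. The snippet below (an illustration, not from the paper) rebuilds the phone set from the table and verifies the counts quoted in the note:

```python
INITIALS = "b p m f d t n l g k h j q x zh ch sh r z c s".split()
SIMPLE_FINALS = "a o e i u ü -i1 -i2 er".split()
COMPOUND_FINALS = "ai ei ao ou ia ie ua uo üe iao iu uai ui".split()
NASAL_FINALS = "an ian uan üan en in un ün ang iang uang eng ing ueng ong iong".split()

FINALS = SIMPLE_FINALS + COMPOUND_FINALS + NASAL_FINALS
PHONES = INITIALS + FINALS  # the full acoustic-unit inventory

# 21 initials + (9 + 13 + 16) = 38 finals -> 59 phones in total
assert (len(INITIALS), len(FINALS), len(PHONES)) == (21, 38, 59)
```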
Number of sentences of announcers in China Central Television (CCTV) news speech corpus.
| Announcer 1 | Announcer 2 | Announcer 3 | Announcer 4 | Announcer 5 | Total |
|---|---|---|---|---|---|
| 5131 | 5468 | 4195 | 1884 | 681 | 17,359 |
| 5268 | 5349 | 4657 | 425 | 232 | 15,931 |
Phone tokens for correct and incorrect pronunciations on different datasets.
| Data Collection | Total Phones | Mispronounced Phones | Pronunciation Error Rate (%) |
|---|---|---|---|
| Training set | 408,000 | 50,616 | 12.41 |
| Development set | 35,496 | 4432 | 12.49 |
| Test set | 36,312 | 4544 | 12.51 |
| Total | 479,808 | 59,592 | 12.42 |
Experimental configuration of the hybrid CTC/attention system.
| Acoustic unit | Mono-phone (initial or final in Mandarin) |
|---|---|
| Input features | The window length is 30 ms and the frame shift is 30 ms. The input feature is a 40-dimensional filter bank with first- and second-order derivatives, plus a 3-dimensional pitch feature. |
| Model and decoding | The CTC output layer has 59 units: 58 labels for initials and finals plus one blank label. Because CTC achieves good performance without a context decision tree, the mono-phone (initial or final) is used as the acoustic unit, and the lower frame rate reduces the computational cost of decoding and greatly improves decoding speed. The attention-based model takes the same input as the CTC branch and shares its encoder; its output layer has 60 units: 58 phone labels plus `<sos>` and `<eos>`. During decoding, irregular alignments are further suppressed by combining the attention and CTC probability scores in a one-pass beam search. The CCTV, PSC-G1-112, and PSC-Train-1000 speech corpora are used for training, and performance is tested on the PSC-Test-89 corpus. |
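In the one-pass beam search mentioned above, each partial hypothesis is rescored by interpolating the CTC prefix score with the attention decoder score in the log domain. A minimal sketch of the pruning step follows; the function names, the dict layout, and the default weight 0.3 are assumptions for illustration, not the paper's implementation:

```python
def combined_score(log_p_ctc: float, log_p_att: float, lam: float = 0.3) -> float:
    # Log-domain interpolation of the CTC prefix score and the attention score.
    return lam * log_p_ctc + (1.0 - lam) * log_p_att

def prune_beam(hypotheses, beam_size: int, lam: float = 0.3):
    """Keep the beam_size best partial hypotheses under the combined score.
    Each hypothesis is a dict: {'prefix': [...], 'log_p_ctc': ..., 'log_p_att': ...}."""
    return sorted(
        hypotheses,
        key=lambda h: combined_score(h["log_p_ctc"], h["log_p_att"], lam),
        reverse=True,
    )[:beam_size]
```

Because the CTC term penalizes hypotheses whose alignment to the acoustics is implausible, this combination eliminates irregular alignments that a pure attention decoder might keep.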
Figure 4. The hierarchical evaluation structures.
A result analysis of APED based on ASR.
| Canonical phone in the reference transcription | p | p | p | p | p | p | – | – |
|---|---|---|---|---|---|---|---|---|
| Recognized phone in the ASR output | p | p | q | q | – | – | q | q |
| Result analysis of ASR | C | C | S | S | D | D | I | I |
| Result marked by experts | T | F | T | F | T | F | T | F |
| Result analysis of APED | TA | FA | FR | TR | FR | TR | FR | TR |
Note: the phones marked p and q in this table refer to two different phones in the phone set; "–" means no phone (nothing recognized for a deletion, or no canonical phone for an insertion). The ASR result has four cases: C (correct), S (substitution error), D (deletion error), and I (insertion error). The expert marking has two cases: T (correct pronunciation) and F (pronunciation error). The APED result has four cases: TA (true acceptance), FR (false rejection), FA (false acceptance), and TR (true rejection).
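From the four outcome counts defined in the note, the evaluation metrics reported in the tables that follow can be computed directly. These are the standard APED definitions (note that Recall = 1 − FAR, which matches the reported numbers); the function below is a sketch of those definitions, not the paper's code:

```python
def aped_metrics(TA: int, FA: int, FR: int, TR: int) -> dict:
    """TA: correct pronunciation accepted, FA: error accepted,
    FR: correct pronunciation rejected, TR: error rejected."""
    frr = FR / (TA + FR)                # false rejection rate
    far = FA / (FA + TR)                # false acceptance rate
    precision = TR / (TR + FR)          # rejections that were real errors
    recall = TR / (TR + FA)             # real errors that were caught (= 1 - FAR)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (TA + TR) / (TA + FA + FR + TR)
    return {"FRR": frr, "FAR": far, "precision": precision,
            "recall": recall, "F": f_measure, "accuracy": accuracy}
```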
Performance of ASR systems with different end-to-end models.
| Name | PER % |
|---|---|
| CTC | 15.36 |
| Attention | 15.25 |
| CTC_Attention ( · ) | 14.17 |
| CTC_Attention ( · ) | 13.34 |
| CTC_Attention ( · ) | 13.33 |
| CTC_Attention ( · ) | 13.45 |
| CTC_Attention ( · ) | 13.75 |
| CTC_Attention ( · ) | 13.86 |
| CTC_Attention ( · ) | 14.21 |
| CTC_Attention ( · ) | 14.28 |
| CTC_Attention ( · ) | 14.44 |
| CTC_Attention ( · ) | · |
Note: the phone segments marked as wrong pronunciations by experts in the test set are ignored when the phone error rate (PER) of ASR is calculated.
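The PER reported in these tables is the usual edit-distance error rate over phone sequences. A straightforward reference implementation (ours, not the authors' scoring script) looks like:

```python
def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over phone sequences."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[n][m] / n
```

Per the note above, phone segments that experts marked as mispronounced would be excluded from the reference sequence before this computation.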
Figure 5. Effect of the hyper-parameter in the hybrid model.
Performance (PER %) of ASR systems with different numbers of layers in the bidirectional long short-term memory projection (BLSTMP) encoder.
| Name | 2 layers | 3 layers | 4 layers | 5 layers |
|---|---|---|---|---|
| CTC | 15.36 | 14.25 | 13.34 | 14.28 |
| Attention | 15.25 | 13.79 | 13.06 | 13.87 |
| CTC_Attention | 13.01 | 11.24 | 10.25 | 11.43 |
Performance of ASR systems with different model architectures.
| Name | PER % |
|---|---|
| GMM_HMM_GOP | 28.64 |
| DNN_HMM_GOP | 12.79 |
| DNN_DNN_AGP | |
| CTC_Attention | 10.25 |
Performance evaluation of different APED systems for all initials and finals in Mandarin.
| | FRR | FAR | Precision | Recall | F-Measure | Accuracy |
|---|---|---|---|---|---|---|
| GMM_HMM_GOP | 29.10 | 31.89 | 25.07 | 68.11 | 36.65 | 70.55 |
| DNN_HMM_GOP | 13.57 | 18.86 | 46.09 | 81.14 | 58.79 | 85.77 |
| DNN_DNN_AGP | 5.85 | 35.97 | 61.01 | 64.03 | 62.48 | |
| CTC_Attention | 8.62 | 18.55 | 57.47 | 81.45 | 67.39 | 90.14 |
Performance of APED systems for initial zh.
| FRR | FAR | Precision | Recall | F-Measure | Accuracy | |
|---|---|---|---|---|---|---|
| GMM_HMM_GOP | 29.10 | 31.91 | 52.55 | 68.09 | 59.32 | 70.00 |
| DNN_HMM_GOP | 13.57 | 18.90 | 73.88 | 81.10 | 77.32 | 84.72 |
| DNN_DNN_AGP | 5.85 | 26.00 | 85.69 | 74.00 | 79.42 | 87.68 |
| CTC_Attention | 8.62 | 18.59 | 81.72 | 81.41 | 81.56 | |
Performance of APED systems for initial g.
| FRR | FAR | Precision | Recall | F-Measure | Accuracy | |
|---|---|---|---|---|---|---|
| GMM_HMM_GOP | 29.10 | 31.91 | 19.48 | 68.09 | 30.29 | 70.64 |
| DNN_HMM_GOP | 13.57 | 18.89 | 38.19 | 81.11 | 51.93 | 85.93 |
| DNN_DNN_AGP | 6.85 | 28.82 | 51.79 | 71.18 | 59.96 | |
| CTC_Attention | 8.62 | 18.57 | 49.42 | 81.43 | 61.51 | 90.45 |
Performance of APED systems for final ang.
| FRR | FAR | Precision | Recall | F-Measure | Accuracy | |
|---|---|---|---|---|---|---|
| GMM_HMM_GOP | 29.11 | 31.92 | 48.8 | 68.08 | 56.85 | 70.08 |
| DNN_HMM_GOP | 13.57 | 18.89 | 70.89 | 81.11 | 75.66 | 84.89 |
| DNN_DNN_AGP | 5.86 | 26.01 | 83.74 | 73.99 | 78.56 | |
| CTC_Attention | 8.99 | 18.58 | 78.67 | 81.42 | 80.02 | 88.23 |
Performance of APED systems for final a.
| FRR | FAR | Precision | Recall | F-Measure | Accuracy | |
|---|---|---|---|---|---|---|
| GMM_HMM_GOP | 29.10 | 31.90 | 21.49 | 68.10 | 32.67 | 70.61 |
| DNN_HMM_GOP | 13.57 | 18.91 | 41.13 | 81.09 | 54.58 | 85.87 |
| DNN_DNN_AGP | 6.85 | 29.80 | 54.53 | 70.20 | 61.38 | |
| CTC_Attention | 8.62 | 18.62 | 52.46 | 81.38 | 63.80 | 90.33 |
Figure 6. Accuracy of different models for four phones: zh, g, ang, and a.
Figure 7. F-Measure of different models for four phones: zh, g, ang, and a.
Performance Comparison of ASR systems before and after adding pitch features.
| Input Features | PER % |
|---|---|
| Filterbank | 10.26 |
| Filterbank + pitch | 10.25 |
Performance Comparison of APED systems before and after adding pitch features.
| Input Features | FRR | FAR | Precision | Recall | F-Measure | Accuracy |
|---|---|---|---|---|---|---|
| Filterbank | 8.72 | 17.99 | 57.35 | 82.01 | 67.50 | 90.12 |
| Filterbank + pitch | 8.62 | 18.55 | 57.47 | 81.45 | 67.39 | 90.14 |