Yan-Hua Long, Hong Ye.
Abstract
Although automatic speech recognition has become quite proficient at recognizing or transcribing well-prepared fluent speech, transcribing speech that contains many disfluencies, such as spontaneous conversational and lecture speech, remains problematic. Filled pauses (FPs) are the most frequently occurring disfluencies in this type of speech. Recent studies have shown that FPs increase the error rates of state-of-the-art speech transcription systems, primarily because most FPs are not well annotated in the training-data transcriptions and because FPs are acoustically similar to some common non-content words. To improve speech transcription, we propose a new automatic refinement approach for detecting FPs in British English lecture speech transcription. The approach combines the pronunciation probability of each word in the dictionary with acoustic and language model scores for FP refinement through a modified speech-recognition forced-alignment framework. We evaluate the proposed approach on the Reith Lectures speech transcription task, for which only imperfect training transcriptions are available. Successful results are achieved on both the development and evaluation datasets. Acoustic models trained on different styles of speech genres are investigated with respect to FP refinement. To further validate the effectiveness of the proposed approach, speech transcription performance is also examined using systems built on training-data transcriptions with and without FP refinement.
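The refinement decision described in the abstract — combining acoustic model scores with scaled pronunciation probabilities in a forced-alignment pass — can be sketched as follows. This is a toy illustration, not the paper's implementation: the field names, score values, and the `<FP>` token are assumptions, and `r` stands in for the pronunciation-probability scale factor mentioned for Fig 3.

```python
import math

def refine_fp(word_hyps, r=1.0):
    """For each aligned word, choose between its plain pronunciation and an
    FP-augmented variant by combining the acoustic log-likelihood with the
    scaled log pronunciation probability (illustrative sketch only)."""
    refined = []
    for hyp in word_hyps:
        # combined score = acoustic log-likelihood + r * log(pron. probability)
        plain = hyp["am_plain"] + r * math.log(hyp["pron_prob_plain"])
        with_fp = hyp["am_fp"] + r * math.log(hyp["pron_prob_fp"])
        refined.append(hyp["word"] + " <FP>" if with_fp > plain else hyp["word"])
    return refined
```

Raising `r` weights the (typically low) probability of FP-augmented variants more heavily, suppressing false FP insertions; `r = 0` falls back to a purely acoustic decision.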
Year: 2015 PMID: 25860959 PMCID: PMC4393320 DOI: 10.1371/journal.pone.0123466
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Framework of the automatic filled pause refinement.
Fig 2. FP-Dict illustration of the word “BUT”.
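An FP-augmented dictionary entry of the kind Fig 2 illustrates can be sketched as a word mapped to pronunciation variants with probabilities. A minimal sketch, assuming ARPAbet-style phones and made-up probabilities — the paper estimates these from data:

```python
# Hypothetical FP-Dict entry: each pronunciation variant carries a probability;
# variants with the <FP> marker model a filled pause attached to the word.
fp_dict = {
    "BUT": [
        (("B", "AH", "T"), 0.80),          # canonical pronunciation
        (("B", "AH", "T", "<FP>"), 0.15),  # followed by a filled pause
        (("<FP>", "B", "AH", "T"), 0.05),  # preceded by a filled pause
    ],
}

def pron_prob(word, pron):
    """Look up the probability of one pronunciation variant (0.0 if absent)."""
    for p, prob in fp_dict.get(word, []):
        if p == pron:
            return prob
    return 0.0
```

The variant probabilities for a word sum to one, so rare FP-attached variants are only chosen when the acoustics clearly support them.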
Performance (in %) comparison between two lightly supervised decoding systems using different acoustic models on bbc.dev.
| System | Sub | Del | Ins | WER |
|---|---|---|---|---|
| SWB-FP.AM | 6.0 | 2.4 | 2.0 | 10.4 |
| BN-FP.AM | 6.1 | 2.6 | 2.3 | 11.0 |
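The Sub/Del/Ins breakdown in the table is the standard Levenshtein-alignment decomposition of word error rate (note Sub + Del + Ins = WER, since all rates are relative to the reference length). A minimal sketch of how such components are computed — not the scoring tool used in the paper:

```python
def wer_components(ref, hyp):
    """Align reference and hypothesis word lists by dynamic programming and
    return (substitutions, deletions, insertions, WER in %)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = (edit cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        d[i][0] = (i, 0, i, 0)   # all deletions
    for j in range(1, m + 1):
        d[0][j] = (j, 0, 0, j)   # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # match, no cost
                continue
            sub = d[i - 1][j - 1]
            dele = d[i - 1][j]
            insr = d[i][j - 1]
            best = min(sub, dele, insr)            # lowest cost wins
            if best is sub:
                d[i][j] = (best[0] + 1, best[1] + 1, best[2], best[3])
            elif best is dele:
                d[i][j] = (best[0] + 1, best[1], best[2] + 1, best[3])
            else:
                d[i][j] = (best[0] + 1, best[1], best[2], best[3] + 1)
    cost, s, dl, ins = d[n][m]
    return s, dl, ins, 100.0 * cost / n
```

In practice this scoring is done with tools such as NIST sclite; the sketch just makes the decomposition concrete.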
Fig 3. Comparison of DET curves of FP refinement performance, varying with the scale factor r of the pronunciation probability ratio.
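A DET curve plots the miss rate against the false-alarm rate as the detection threshold is swept over the detector's scores. A minimal sketch of how the operating points behind such a curve can be computed (illustrative, not the paper's evaluation code):

```python
def det_points(scores, labels):
    """Return (false-alarm rate, miss rate) pairs for a DET curve.
    `labels` are 1 for true FPs and 0 for non-FPs; each point corresponds
    to accepting everything scoring at or above one threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)  # high score first
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    fa = 0    # negatives accepted so far (false alarms)
    hit = 0   # positives accepted so far
    for score, label in pairs:
        if label:
            hit += 1
        else:
            fa += 1
        points.append((fa / n_neg, 1 - hit / n_pos))
    return points
```

Each threshold trades false alarms against misses; varying the scale factor r shifts the whole trade-off curve, which is what Fig 3 compares.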
Comparison of performance (in %) among three different FP refinement systems; the first four metric columns are on bbc.dev and the last four on bbc.eval (FA = false alarm, MA = missed alarm).

| System |  |  | FA | MA |  |  | FA | MA |
|---|---|---|---|---|---|---|---|---|
| SWB-FP | 86.0 | 84.5 | 13.8 | 15.5 | 87.0 | 87.2 | 13.0 | 12.8 |
| BN-FP | 84.2 | 83.0 | 15.6 | 17.0 | 83.2 | 83.8 | 16.9 | 16.2 |
| RL-FP | 83.8 | 83.3 | 16.1 | 16.7 | 84.7 | 85.0 | 15.4 | 15.0 |
Performance comparison on bbc.dev of RL-FP systems with different numbers of acoustic model update iterations.
| #iteration | FA | MA |
|---|---|---|
| 1 | 16.1 | 16.7 |
| 2 | 11.6 | 10.2 |
| 3 | 7.2 | 4.1 |
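The trend in the table — FA and MA falling with each acoustic model update — suggests an iterative scheme: refine the FP annotations, retrain the acoustic model on the refined transcripts, and repeat. A structural sketch under that assumption, with `train_am` and `refine` as hypothetical stand-ins for the real training and alignment steps, which this record does not detail:

```python
def iterative_fp_refinement(audio, transcripts, train_am, refine, n_iter=3):
    """Alternate acoustic model training and FP refinement for n_iter rounds.
    `train_am(audio, transcripts)` returns an updated acoustic model;
    `refine(am, audio, transcripts)` returns transcripts with refined FPs."""
    for _ in range(n_iter):
        am = train_am(audio, transcripts)             # update acoustic model
        transcripts = refine(am, audio, transcripts)  # re-detect FPs
    return transcripts
```

Each round, the model is trained on transcripts with better FP annotations, which in turn improves the next refinement pass.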
Speech transcription performance on bbc.dev and bbc.eval using the LST and LST-FP systems.
| Metric | LST (bbc.dev) | LST-FP (bbc.dev) | LST (bbc.eval) | LST-FP (bbc.eval) |
|---|---|---|---|---|
| Sub | 15.1 | 14.6 | 16.0 | 15.1 |
| Del | 3.3 | 2.8 | 3.4 | 2.6 |
| Ins | 4.0 | 3.4 | 3.6 | 3.2 |
| WER | 22.4 | 20.8 | 23.0 | 20.9 |
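From the table, FP refinement gives an absolute WER reduction of 1.6% on bbc.dev and 2.1% on bbc.eval; the corresponding relative reductions can be checked with a one-liner:

```python
def rel_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

dev_gain = rel_reduction(22.4, 20.8)   # bbc.dev: LST -> LST-FP, ~7.1% relative
eval_gain = rel_reduction(23.0, 20.9)  # bbc.eval: LST -> LST-FP, ~9.1% relative
```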