A Paats, T Alumäe, E Meister, I Fridolin.
Abstract
The aim of this study was to analyze retrospectively the influence of different acoustic and language models in order to determine which had the greatest effect on the clinical performance of a non-commercial, radiology-oriented automatic speech recognition (ASR) system for the Estonian language. The ASR system was developed for Estonian in the radiology domain using open-source software components (Kaldi toolkit, Thrax). It was trained on real radiology text reports and dictations collected during the development phases. The final version of the ASR system was tested by 11 radiologists, who dictated 219 reports in total in a spontaneous manner in a real clinical environment. The audio files collected in this final phase were used to measure the performance of the different versions of the ASR system retrospectively. ASR system versions were evaluated by word error rate (WER) for each speaker and modality and by the WER difference between the first and the last version of the ASR system. The total average WER across all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), a relative improvement of 68.5%. WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results for all modalities and being independent of the user, the complexity of the radiology reports, user experience, and speech characteristics.
Keywords: Automatic speech recognition; Estonian language; Radiology; Spontaneous dictation; Word error rate
Year: 2018 PMID: 29713836 PMCID: PMC6148813 DOI: 10.1007/s10278-018-0085-8
Source DB: PubMed Journal: J Digit Imaging ISSN: 0897-1889 Impact factor: 4.056
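The abstract reports results as word error rate (WER) and as a relative improvement between v1 and v8. The following is a minimal sketch of both quantities, assuming the conventional WER definition (word-level edit distance divided by the number of reference words); the record does not state which scoring tool the authors used. The function names and the English example sentence are hypothetical and serve only as illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction in percent."""
    return (wer_before - wer_after) / wer_before * 100.0


# Hypothetical dictation: one substitution and one deletion over six reference words.
example = word_error_rate("no focal lesion in the liver", "no focal lesions in liver")
print(f"example WER: {example:.3f}")

# Figures from the abstract: 18.4% (v1) and 5.8% (v8).
print(f"relative improvement: {relative_improvement(18.4, 5.8):.1f}%")
```

With the abstract's figures, the second function gives (18.4 − 5.8) / 18.4 ≈ 68.5%, matching the reported relative improvement.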
Different ASR system development versions
| Version number | ASR system characteristics |
|---|---|
| v1 | GMM acoustic model, language model trained on 1-year reports |
| v2 | DNN acoustic model, language model trained on 1-year reports |
| v3 | + language model trained on 5-year reports |
| v4 | + better noise modeling in language model |
| v5 | + better modeling of sentence breaks |
| v6 | + less aggressive silence detection |
| v7 | + acoustic model adapted using in-domain data |
| v8 | + language model adapted using spoken data |
Distribution of dictated reports among radiologists (“Total no. reports”) and modalities (XR, X-ray; CT, computed tomography; MR, magnetic resonance tomography; US, ultrasound). The total number of words per radiologist is given as “Total no. words”
| Radiologist | Total no. reports | CT | MR | XR | US | Total no. words |
|---|---|---|---|---|---|---|
| No. 1 | 19 | 8 | 3 | 4 | 4 | 2006 |
| No. 2 | 19 | 7 | 4 | 4 | 4 | 1250 |
| No. 3 | 22 | 9 | 13 | 2031 | ||
| No. 4 | 22 | 10 | 4 | 4 | 4 | 2463 |
| No. 5 | 19 | 8 | 9 | 2 | 1675 | |
| No. 6 | 20 | 8 | 10 | 2 | 1875 | |
| No. 7 | 20 | 8 | 8 | 4 | 2057 | |
| No. 8 | 20 | 8 | 4 | 8 | 1693 | |
| No. 9 | 19 | 6 | 13 | 1701 | ||
| No. 10 | 19 | 8 | 7 | 4 | 1409 | |
| No. 11 | 20 | 8 | 8 | 4 | 1768 | |
| Total | 219 | 88 | 42 | 42 | 47 | 19,928 |
| Mean | 19.9 | 8.0 | 7.0 | 5.3 | 5.2 | 1811 |
| SD | 1.1 | 1.0 | 4.0 | 2.4 | 3.3 | 331 |
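The summary rows can be checked against the per-radiologist values. The sketch below does this for the three columns that are complete for all 11 radiologists: total reports, CT, and total words. It assumes that the first modality count in each row is the CT count, an assumption that reproduces the CT total of 88. The variable and function names are illustrative.

```python
from statistics import mean, stdev

# Per-radiologist values copied from the table (No. 1 to No. 11).
reports = [19, 19, 22, 22, 19, 20, 20, 20, 19, 19, 20]
ct = [8, 7, 9, 10, 8, 8, 8, 8, 6, 8, 8]  # assumed: first modality count per row
words = [2006, 1250, 2031, 2463, 1675, 1875, 2057, 1693, 1701, 1409, 1768]


def summarize(values):
    """Return total, mean, and sample standard deviation (n - 1 denominator)."""
    return sum(values), mean(values), stdev(values)


for name, column in [("reports", reports), ("CT", ct), ("words", words)]:
    total, avg, sd = summarize(column)
    print(f"{name}: total={total}, mean={avg:.1f}, SD={sd:.1f}")
```

The totals (219, 88, 19,928) and the means and SDs for reports and CT agree with the table when the sample standard deviation is used. For total words the computation gives a mean of about 1811.6 and an SD of about 331.5, which the table lists as 1811 and 331.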
Fig. 1 The total WER (mean, SD) for model versions 1 to 8
Fig. 2 Word error rates by modality for different model versions
Fig. 3 Word error rates for each individual radiologist for different model versions
Fig. 4 Median word error rate improvement with maximum, minimum, first, and third quartile between the first (v1) and the last (v8) model versions for each individual radiologist
Fig. 5 Median word error rate improvement with maximum, minimum, first, and third quartile between the first (v1) and the last (v8) model versions by modality