Marama Diaz-Asper, Terje B Holmlund, Chelsea Chandler, Catherine Diaz-Asper, Peter W Foltz, Alex S Cohen, Brita Elvevåg.
Abstract
Speech rate and quantity reflect clinical state; thus automated transcription holds potential clinical applications. We describe two datasets where recording quality and speaker characteristics affected transcription accuracy. Transcripts of low-quality recordings omitted significant portions of speech. An automated syllable counter estimated actual speech output and quantified the amount of missing information. The efficacy of this method differed by audio quality: the correlation between missing syllables and word error rate was only significant when quality was low. Automatically counting syllables could be useful to measure and flag transcription omissions in clinical contexts where speaker characteristics and recording quality are problematic.
Keywords: Automatic speech recognition; Syllables; Word error rate
Year: 2022 PMID: 35839638 PMCID: PMC9378537 DOI: 10.1016/j.psychres.2022.114712
Source DB: PubMed Journal: Psychiatry Res ISSN: 0165-1781 Impact factor: 11.225
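The missing-syllable measure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vowel-run syllable heuristic, the function names, and the 0.3 flagging threshold are all assumptions, and the acoustic syllable count is taken as given (for example, from a syllable-nuclei detector). The proportion is assumed to be one minus the ratio of transcript syllables to acoustically detected syllables, so a negative value means the transcript contains more syllables than were detected in the audio.

```python
import re


def count_syllables_in_text(text: str) -> int:
    """Rough syllable count: one syllable per run of vowel letters in a word.

    A crude English-only heuristic used purely for illustration (assumption).
    """
    total = 0
    for word in re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()):
        n = len(re.findall(r"[aeiouy]+", word))
        # A trailing silent "e" (as in "made") usually adds no syllable.
        if n > 1 and word.endswith("e") and not word.endswith(("le", "ee")):
            n -= 1
        total += max(n, 1)  # every word carries at least one syllable
    return total


def proportion_missing(transcript_syllables: int, acoustic_syllables: int) -> float:
    """One minus the ratio of transcribed to acoustically detected syllables."""
    if acoustic_syllables <= 0:
        raise ValueError("acoustic syllable count must be positive")
    return 1.0 - transcript_syllables / acoustic_syllables


def flag_transcript(asr_text: str, acoustic_syllables: int, threshold: float = 0.3):
    """Return (missing proportion, flag); the threshold is hypothetical."""
    missing = proportion_missing(count_syllables_in_text(asr_text), acoustic_syllables)
    return missing, missing > threshold
```

A transcript flagged this way would be a candidate for manual review or re-transcription; an appropriate threshold would have to be calibrated on data from the recording conditions in question.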
Summary statistics of the two data sets.
| Measure | Telephone samples, mean | Telephone samples, std | App samples (patients), mean | App samples (patients), std | App samples (healthy), mean | App samples (healthy), std |
|---|---|---|---|---|---|---|
| Duration of samples, in seconds | 152 | 67 | 29 | 14 | 25 | 8 |
| Word counts, human transcriptions | 337 | 169 | 45 | 21 | 61 | 21 |
| Word counts, machine transcriptions | 180 | 127 | 39 | 18 | 60 | 21 |
| Word Error Rates (%) | 63 | 20 | 45 | 33 | 17 | 17 |
| - Omission errors (%) | 47 | 25 | 15 | 14 | 4 | 5 |
| - Substitution errors (%) | 15 | 7 | 24 | 15 | 11 | 8 |
| - Intrusion errors (%) | 1 | 1 | 6 | 25 | 2 | 13 |
| Syllable counts: | | | | | | |
| a. From human transcriptions | 453 | 229 | 58 | 27 | 81 | 28 |
| b. From machine transcriptions | 251 | 176 | 51 | 23 | 80 | 27 |
| c. From counting syllable nuclei | 597 | 280 | 64 | 30 | 76 | 27 |
| Proportion of Missing Syllables (i.e., 1 − b/c) | 0.57 | 0.22 | 0.16 | 0.26 | −0.07 | 0.22 |
Telephone samples: from Diaz-Asper et al. (2021).
App samples: from Holmlund et al. (2020).
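The table splits the word error rate into omission, substitution, and intrusion errors. One common way to obtain such a breakdown is a word-level Levenshtein alignment between the human (reference) and machine (hypothesis) transcripts; the sketch below assumes that approach, with simple lowercase whitespace tokenization, and is not necessarily the scoring procedure used in the study. The function name `wer_breakdown` is hypothetical.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word error rate split into omissions, substitutions and intrusions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to count each error type.
    i, j = len(ref), len(hyp)
    omissions = substitutions = intrusions = 0
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            substitutions += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            omissions += 1  # reference word absent from the machine transcript
            i -= 1
        else:
            intrusions += 1  # machine transcript word with no reference match
            j -= 1

    n = len(ref)
    return {
        "wer": (omissions + substitutions + intrusions) / n,
        "omission": omissions / n,
        "substitution": substitutions / n,
        "intrusion": intrusions / n,
    }
```

When several alignments have the same minimum cost, the split between error types can vary with the tie-breaking order, so the component rates are best treated as approximate even though the overall WER is fixed.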
Fig. 1. Panel A: An illustration of how much information was missing from some transcripts. Word counts in the human and ASR transcriptions of the telephone interview samples, ranked by transcript length, show that the ASR transcriptions (black bars) had much lower word counts than the human transcriptions (blue bars), suggesting a high incidence of omission errors. Panel B: Automatic syllable detection closely, but not perfectly, matched the number of syllables detected and transcribed by humans. Panel C: The proportion of missing syllables in telephone interview transcripts (blue) showed a strong positive correlation with the word error rate (r = 0.91), supporting the notion that missing syllables can be used to estimate the number of discrepancies in ASR output from low-quality recordings with challenging speaker characteristics. The relationship was weaker in the App samples from patients with SMI (orange, r = 0.52) and absent in high-quality recordings of healthy participants (green, r = 0.01). For a color version of the figure, please see the online version. Panel D: Omission errors were likely the main source of missing syllables in transcripts, and the relationship was indeed slightly stronger with omission errors specifically than with the overall WER.
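The r values in Panel C are correlation coefficients between per-sample missing-syllable proportions and word error rates. Assuming they are Pearson correlations (the caption does not name the statistic), the computation can be sketched as below. The function name and the example values are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sample values, for illustration only (not study data).
missing_proportion = [0.10, 0.35, 0.50, 0.62, 0.80]
word_error_rate = [0.22, 0.41, 0.55, 0.60, 0.85]
r = pearson_r(missing_proportion, word_error_rate)
print(f"r = {r:.2f}")
```

Computing the coefficient separately for each recording condition, as the figure does, matters here because pooling low- and high-quality samples could mask the fact that the relationship only holds when quality is low.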