| Literature DB >> 32550644 |
Adam S Miner, Albert Haque, Jason A Fries, Scott L Fleming, Denise E Wilfley, G Terence Wilson, Arnold Milstein, Dan Jurafsky, Bruce A Arnow, W Stewart Agras, Li Fei-Fei, Nigam H Shah.
Abstract
Accurate transcription of audio recordings in psychotherapy would improve therapy effectiveness, clinician training, and safety monitoring. Although automatic speech recognition software is commercially available, its accuracy in mental health settings has not been well described. It is unclear which metrics and thresholds are appropriate for different clinical use cases, which may range from population descriptions to individual safety monitoring. Here we show that automatic speech recognition is feasible in psychotherapy, but further improvements in accuracy are needed before widespread use. Our HIPAA-compliant automatic speech recognition system demonstrated a transcription word error rate of 25%. For depression-related utterances, sensitivity was 80% and positive predictive value was 83%. For clinician-identified harm-related sentences, the word error rate was 34%. These results suggest that automatic speech recognition may support understanding of language patterns and subgroup variation in existing treatments but may not be ready for individual-level safety surveillance.
Keywords: Depression; Translational research
Year: 2020 PMID: 32550644 PMCID: PMC7270106 DOI: 10.1038/s41746-020-0285-8
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
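The headline word error rate reported in the abstract follows the standard definition: word-level Levenshtein (edit) distance between the reference transcript and the ASR output, divided by the number of reference words. A minimal sketch (the function and the example sentences below are illustrative, not taken from the paper's data):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)
```

One inserted word against a four-word reference yields a WER of 0.25, the same aggregate figure the study reports across sessions.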
Table 1. Patient demographics and therapy session information.
| Patient demographics | Average | Standard deviation | Median | Min | Max |
|---|---|---|---|---|---|
| Number of patients | 100 | – | – | – | – |
| Female (%) | 87 | – | – | – | – |
| Age (years) | 23 | 5 | 21 | 18 | 52 |
| Session information | | | | | |
| Length | | | | | |
| Minutes | 45 | 11 | 47 | 13 | 69 |
| Number of words | 6574 | 2102 | 6387 | 824 | 11,310 |
| Time talking per session (min) | | | | | |
| Patient | 25 | 9 | 26 | 2 | 46 |
| Therapist | 20 | 8 | 19 | 4 | 41 |
| Words spoken per session | | | | | |
| Patient | 3665 | 1550 | 3555 | 277 | 7043 |
| Therapist | 2909 | 1128 | 2886 | 547 | 6213 |
Table 2. Similarity between the human-transcribed reference standard and ASR-transcribed sentences.
| Group | n | Word error rate, % | Shapiro–Wilk W | p value | Semantic distance, pts | Shapiro–Wilk W | p value |
|---|---|---|---|---|---|---|---|
| Aggregate | | | | | | | |
| Total | 100 | 25% ± 12% | 0.93 | <0.001 | 1.20 ± 0.31 | 0.97 | 0.03 |
| Speaker | | | | | | | |
| Patient | 100 | 25% ± 12% | 0.86 | <0.001 | 1.19 ± 0.33 | 0.94 | <0.001 |
| Therapist | 100 | 26% ± 11% | 0.88 | <0.001 | 1.20 ± 0.29 | 0.99 | 0.57 |
| Patient gender | | | | | | | |
| Male | 13 | 24% ± 9% | 0.95 | 0.55 | 1.17 ± 0.30 | 0.95 | 0.55 |
| Female | 87 | 25% ± 13% | 0.84 | <0.001 | 1.19 ± 0.33 | 0.94 | <0.001 |
Plus/minus values denote standard deviation. Lower error rate is better. Lower semantic distance is better. Shapiro–Wilk tests were conducted to test the normality assumption (Supplementary Fig. 2). Low p values indicate the data are not normally distributed.
Fig. 1Automatic speech recognition performance, overall and by subgroup.
Evaluation of ASR transcription performance compared to the human-generated reference transcription. Each box denotes the 25th and 75th percentiles; box center-lines denote the median. Whiskers denote the minimum and maximum values, excluding outliers. Outliers, denoted by diamonds, are defined as any point further than 1.5× the interquartile range from the 25th or 75th percentile. Sample sizes are listed in Table 2. NS (not significant) indicates that the difference is not statistically significant. a Comparison of word overlap (i.e., word error rate); lower is better. b Comparison of semantic similarity (i.e., semantic distance); lower is better.
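The figure's second panel compares reference and ASR sentences by semantic distance rather than word overlap. The exact embedding model and the scale of the "pts" unit are not specified in this record; a common way to realize such a metric is a distance between sentence embedding vectors. A minimal sketch using cosine distance over toy vectors (both the vectors and the choice of cosine distance are illustrative assumptions, not the paper's method):

```python
import math

def semantic_distance(u: list[float], v: list[float]) -> float:
    """Cosine distance: 1 minus the cosine similarity of two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-d "sentence embeddings" for a reference and an ASR hypothesis
ref_vec = [0.2, 0.7, 0.1]
asr_vec = [0.25, 0.6, 0.2]
```

Under any such metric, identical sentences score 0, and the distance grows as the ASR output drifts in meaning from the reference, which is why lower is better in panel b.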
Table 3. Performance on clinically relevant utterances by patients.
| PHQ | Keywordsa | Number of positives | True positives | False negatives | False positives | Sensitivity | Positive predictive value |
|---|---|---|---|---|---|---|---|
| 1 | Interest, interested, interesting, interests, pleasure | 169 | 127 | 42 | 38 | 75% | 77% |
| 2 | Depressed, depressing, feeling down, hopeless, miserable | 74 | 63 | 11 | 12 | 85% | 84% |
| 3 | Asleep, drowsy, sleepiness, sleeping, sleepy | 114 | 85 | 29 | 19 | 75% | 82% |
| 4 | Energy, tired | 143 | 115 | 28 | 22 | 80% | 84% |
| 5 | Overeat, overeating | 5 | 3 | 2 | 0 | 60% | 100% |
| 6 | Bad, badly, poorly | 405 | 336 | 69 | 56 | 83% | 86% |
| 7 | Mindfulness | 11 | 9 | 2 | 0 | 82% | 100% |
| 8 | Fidget, fidgety, restless, slow, slowing, slowly | 39 | 28 | 11 | 13 | 72% | 68% |
| 9 | Dead, death, depression, died, suicide | 103 | 86 | 17 | 18 | 83% | 83% |
| Weighted average | | 1063 | 852 | 211 | 178 | 80% | 83% |
aFor each question of the Patient Health Questionnaire (PHQ-9), relevant keywords were identified by querying the Unified Medical Language System using each PHQ question to generate search terms. Each table row denotes a different question from the PHQ-9. The number of positives refers to how often the keywords appear in the transcribed therapy sessions and serves as the sample size. A true positive refers to a correct transcription of a keyword by the automatic speech recognition system; false negatives and false positives denote incorrect transcriptions.
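The table's summary metrics follow directly from the count columns: sensitivity is TP / (TP + FN) and positive predictive value is TP / (TP + FP). A quick check against the weighted-average row (counts taken from the table above):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true keyword utterances the ASR transcribed correctly."""
    return tp / (tp + fn)

def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of ASR keyword detections that were correct."""
    return tp / (tp + fp)

# Weighted-average counts from Table 3
tp, fn, fp = 852, 211, 178
```

With these counts, sensitivity is 852 / 1063 ≈ 80% and PPV is 852 / 1030 ≈ 83%, matching the table's bottom row.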
Table 4. Transcription errors made by the automatic speech recognition system.
| Form (words or phonetics) | Meaning (semantics) similar to reference standard | Meaning (semantics) different from reference standard |
|---|---|---|
| Similar to reference standard | 1. Tuesday, I had found out about that my grandmother 2. Came back and | 1. I have still been feeling 2. Do you have any plans to |
| Different from reference standard | 1. Depends on like what I eat or what 2. Comfortable to expressing | 1. It still stings. It 2. I’m going to try to |
Each numbered sentence is a different sentence containing both the reference standard and ASR transcription. Strikethrough denotes the human-generated reference standard. Underline denotes the speech recognition system’s erroneous output. Black text denotes agreement.