Terje B Holmlund, Chelsea Chandler, Peter W Foltz, Alex S Cohen, Jian Cheng, Jared C Bernstein, Elizabeth P Rosenfeld, Brita Elvevåg.
Abstract
Verbal memory deficits are among the most profound neurocognitive deficits associated with schizophrenia and serious mental illness in general. As yet, their measurement in clinical settings is limited to traditional tests that can be administered only a few times and require substantial resources to deploy and score. Therefore, we developed a digital ambulatory verbal memory test with automated scoring and repeated self-administration via smart devices. One hundred and four adults participated, comprising 25 patients with serious mental illness and 79 healthy volunteers. The study design was successful, with high-quality speech recordings produced in response to 92% of prompts (Patients: 86%, Healthy: 96%). The story recalls were both transcribed and scored by humans, and scores generated using natural language processing on transcriptions were comparable to human ratings (R = 0.83, within the range of human-to-human correlations of R = 0.73–0.89). A fully automated approach that scored transcripts generated by automatic speech recognition produced comparable and accurate scores (R = 0.82), with very high correlation to scores derived from human transcripts (R = 0.99). This study demonstrates the viability of leveraging speech technologies to facilitate the frequent assessment of verbal memory for clinical monitoring purposes in psychiatry.
Keywords: Biomarkers; Diagnosis; Human behaviour; Language
Year: 2020 PMID: 32195368 PMCID: PMC7066153 DOI: 10.1038/s41746-020-0241-7
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Description of participants and story recall trials.
| | Patients | | Healthy | |
|---|---|---|---|---|
| | M (SD) | Range | M (SD) | Range |
| Age, years | 49.7 (10.4) | 30.0–67.0 | 21.7 (1.4) | 18.0–26.0 |
| Education, years | 12.3 (1.4) | 7.0–16.0 | <sup>a</sup> | 12.0–13.0 |
| % female | 52.2% | | 62.0% | |
| Brief Psychiatric Rating Scale<sup>b</sup> | | | | |
| Affective | 2.1 (1.0) | 1.0–5.3 | | |
| Agitation | 1.6 (0.6) | 1.0–3.8 | | |
| Positive | 2.2 (1.2) | 1.0–5.5 | | |
| Negative | 2.1 (1.0) | 1.0–5.5 | | |
| Number of story recall trials | 354 | | 681 | |
| Responses with recognizable speech (%)<sup>c</sup> | 86.0% | | 95.9% | |
| Responses < 10 words (%) | 19.7% | | 5.4% | |
<sup>a</sup> The exact education level for healthy volunteers was not registered but was estimated to be within this range, according to their current academic progress (i.e., they were students attending an undergraduate university course).
<sup>b</sup> Presence of symptoms rated on a 1–7 scale (not present to extremely severe).
<sup>c</sup> Words detected by human transcribers and both ASR systems.
Fig. 1. A summary of the procedure for administration and analysis of verbal memory using smart devices.
(a) In this example, a story was presented about a girl and her balloons at a birthday party. The participants were asked to “Remember the balloon story, so you can retell it again later”, and both immediate and delayed recall were assessed. (b) A total of 104 participants were tested. Patients tolerated the task but had more trials in which they did not provide verbal responses. (c) Humans listened to the responses and rated them for accuracy on a scale from 0 to 6. Our ground truth measure was the average of multiple ratings, and the individual raters correlated with this ground truth between R = 0.73 and R = 0.89. (d) Humans transcribed the response recordings, and these transcriptions were compared with the original story for similarity. Two features of similarity were extracted, namely a word count procedure and a measure of distance in a semantic space. A regression model produced predicted ratings, and these correlated with average human ratings at R = 0.83, well within the range of individual human raters. (e) The same computational procedure was used on transcripts derived using generic and customized automatic speech recognition systems. The performance of the automated predictive model was still within the range of individual raters, with predicted scores correlating with the average human ratings at R = 0.82. (f) A linear model based on transcriptions from the custom ASR system predicted the human ratings well, except for a tendency to assign a higher score to some short responses.
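As a rough illustration of the word count feature and the correlation measure mentioned in the caption, the following Python sketch counts the distinct word types shared by a story and a recall, and computes Pearson's R between two score lists. The tokenization rule, the function names, and the example sentences are assumptions made for illustration; the source does not specify them. The semantic distance feature and the regression model are not reproduced here because they require word embeddings and the study data.

```python
import math
import re


def word_types(text):
    """Lower-case the text and return its set of distinct word types."""
    return set(re.findall(r"[a-z']+", text.lower()))


def common_types(story, recall):
    """Count distinct word types that occur in both the story and the recall."""
    return len(word_types(story) & word_types(recall))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical story and recall text (not the actual study material).
story = "A girl had three red balloons at her birthday party."
recall = "The girl had balloons at the party."
shared = common_types(story, recall)  # girl, had, balloons, at, party
```

In practice, a feature like `shared` would be computed for every recall and then compared against the averaged human ratings with `pearson_r`.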
Description of calculated measures, by group and transcription method.
| | Patients | | Healthy | | d | t | p |
|---|---|---|---|---|---|---|---|
| | Mean | SD | Mean | SD | | | |
| Human rating (0–6) | 3.3 | 1.3 | 4.6 | 1.1 | 1.1 | 13.4 | <0.001 |
| Word count | 48.7 | 22.4 | 65.2 | 21.4 | 0.8 | 9.1 | <0.001 |
| Common types, calculated from | | | | | | | |
| Human transcription | 16.4 | 6.8 | 26.7 | 8.1 | 1.4 | 17.8 | <0.001 |
| Generic ASR | 14.4 | 6.5 | 25.4 | 7.9 | 1.5 | 19.7 | <0.001 |
| Custom ASR | 16.5 | 6.5 | 26.7 | 7.9 | 1.4 | 18.3 | <0.001 |
| Word Mover’s Distance, calculated from | | | | | | | |
| Human transcription | 1.7 | 0.5 | 1.3 | 0.4 | −1.0 | −12.0 | <0.001 |
| Generic ASR | 1.8 | 0.5 | 1.3 | 0.4 | −1.2 | −14.3 | <0.001 |
| Custom ASR | 1.7 | 0.4 | 1.3 | 0.4 | −1.1 | −12.6 | <0.001 |
| Predicted scores, calculated from | | | | | | | |
| Human transcription | 3.4 | 0.9 | 4.6 | 0.9 | 1.3 | 15.4 | <0.001 |
| Generic ASR | 3.4 | 0.8 | 4.6 | 0.9 | 1.4 | 17.8 | <0.001 |
| Custom ASR | 3.4 | 0.9 | 4.6 | 0.9 | 1.3 | 15.8 | <0.001 |
d: Cohen’s d; t: Welch’s t-test, two-sided; p: p-value, Holm-corrected.
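The statistics named in the table note can be sketched from summary values as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the pooled standard deviation here weights both groups equally, differences are taken as healthy minus patients to match the sign of the table, and the function names are hypothetical. The per-group sample sizes that Welch's t requires are not given in the table, so `welch_t` takes them as parameters.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d (group b minus group a), pooling the SDs with equal weights.

    Assumption: equal weighting of the two variances; the source does not
    state which pooling formula was used.
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_sd


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic (unequal variances), group b minus group a."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[index] = running_max
    return adjusted


# Human rating row of the table: patients 3.3 (1.3), healthy 4.6 (1.1).
d_human_rating = cohens_d(3.3, 1.3, 4.6, 1.1)
```

With these inputs, `d_human_rating` rounds to 1.1, in line with the value reported in the table for human ratings.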
Fig. 2. Accuracy of automatic speech recognition (ASR) differed between the two ASR approaches and between the two groups, but this did not have a large effect on predicted ratings.
(a) The ASR had a lower word error rate on responses from healthy participants than on responses from patients. The word error rate was also lower for an ASR system customized to the verbal memory task than for a generic, off-the-shelf system. The custom system approached the error level of human transcribers (7.2%), indicated by the grey horizontal line. Error bars represent the 95% confidence intervals of the means. (b) Scores from a predictive model using natural language processing methods on human transcriptions were highly correlated with scores derived using transcriptions from a generic system with higher word error rates. (c) Scores derived from transcriptions using a customized ASR system with lower error rates correlated even better with scores derived using the resource-demanding human transcription procedure, arguably producing equivalent results.
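Word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The sketch below shows that standard calculation; the function name, the whitespace tokenization, and the example sentences are assumptions for illustration, and the source does not describe the exact scoring pipeline used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical transcripts: one substitution ("three" -> "tree") and one
# deletion ("red") against a six-word reference.
wer = word_error_rate(
    "the girl had three red balloons",
    "the girl had tree balloons",
)
```

Here `wer` equals 2/6, since two edits are needed against six reference words.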