Matej Martinc, Fasih Haider, Senja Pollak, Saturnino Luz.
Abstract
Background: Advances in machine learning (ML) technology have opened new avenues for the detection and monitoring of cognitive decline. In this study, a multimodal approach to Alzheimer's dementia detection based on the patient's spontaneous speech is presented. This approach was tested on a standard, publicly available Alzheimer's speech dataset for comparability. The data comprise voice samples from 156 participants (1:1 ratio of Alzheimer's to control), matched by age and gender.
Keywords: Alzheimer's dementia detection; acoustic features; language; lexical features; machine learning; natural language processing; speech; speech processing
Year: 2021 PMID: 34194313 PMCID: PMC8236853 DOI: 10.3389/fnagi.2021.642647
Source DB: PubMed Journal: Front Aging Neurosci ISSN: 1663-4365 Impact factor: 5.750
Figure 1. Average position of nouns that appear at least 20 times in the training set. AD and non-AD stand for the average position in the speech transcripts of patients with AD and control group patients, respectively. Difference denotes the absolute difference between these averages, and Freq denotes the frequency of the noun in the corpus. The nouns are sorted according to the difference column.
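To make the statistic behind Figure 1 concrete, the following is a minimal sketch of computing per-class average noun positions and their absolute difference. The transcripts and the noun list here are hypothetical toy data, and a POS tagger would normally supply the noun set; this is not the paper's exact pipeline.

```python
from collections import defaultdict
from statistics import mean

# Toy stand-in for the training transcripts: (tokens, label) pairs, where
# label is "AD" or "non-AD". Both the data and NOUNS are hypothetical.
transcripts = [
    (["the", "boy", "takes", "a", "cookie", "from", "the", "jar"], "non-AD"),
    (["water", "runs", "in", "the", "sink", "then", "the", "boy", "falls"], "AD"),
    (["the", "jar", "falls", "and", "the", "water", "hits", "the", "boy"], "AD"),
]
NOUNS = {"boy", "cookie", "jar", "water", "sink"}

positions = defaultdict(lambda: defaultdict(list))
for tokens, label in transcripts:
    for i, tok in enumerate(tokens, start=1):  # 1-based word position
        if tok in NOUNS:
            positions[tok][label].append(i)

# Average position per class and the absolute difference, as in Figure 1.
for noun, by_label in sorted(positions.items()):
    if by_label["AD"] and by_label["non-AD"]:
        ad, non_ad = mean(by_label["AD"]), mean(by_label["non-AD"])
        print(f"{noun:8s} AD={ad:5.1f} non-AD={non_ad:5.1f} diff={abs(ad - non_ad):5.1f}")
```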
Feature importance of nouns in a random forest classifier according to their position in the 1st, 2nd, or 3rd chunk of each transcript.
| Noun | 1st chunk | 2nd chunk | 3rd chunk | Sum |
| --- | --- | --- | --- | --- |
| Window | 0.09904 | 0.02905 | 0.01041 | 0.13849 |
| Sink | 0.06526 | 0.03472 | 0.01101 | 0.11099 |
| Stool | 0.06090 | 0.02709 | 0.01988 | 0.10787 |
| Action | 0.07408 | 0.00796 | 0.00591 | 0.08795 |
| Curtain | 0.03686 | 0.02560 | 0.01131 | 0.07377 |
| Mother | 0.02548 | 0.01984 | 0.01852 | 0.06384 |
| Dish | 0.02689 | 0.01874 | 0.00951 | 0.05514 |
| Cookie | 0.03305 | 0.01190 | 0.00929 | 0.05424 |
| Water | 0.03082 | 0.01380 | 0.00704 | 0.05167 |
| Hand | 0.02241 | 0.01573 | 0.00780 | 0.04594 |
| Girl | 0.01303 | 0.01129 | 0.00828 | 0.03260 |
| Boy | 0.01023 | 0.00914 | 0.00903 | 0.02840 |
| Jar | 0.01080 | 0.00957 | 0.00724 | 0.02762 |
| Plate | 0.01398 | 0.00489 | 0.00475 | 0.02362 |
| Floor | 0.00970 | 0.00700 | 0.00651 | 0.02322 |
| Kid | 0.00787 | 0.00773 | 0.00566 | 0.02126 |
| Thing | 0.00657 | 0.00624 | 0.00424 | 0.01705 |
| Sister | 0.00870 | 0.00484 | 0.00112 | 0.01465 |
| Lady | 0.00482 | 0.00364 | 0.00264 | 0.01110 |
| Kitchen | 0.00578 | 0.00263 | 0.00217 | 0.01057 |
Sum is the sum of all three scores.
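The sketch below illustrates how per-chunk importances like those in the table above could be obtained: each transcript is split into thirds, each (chunk, noun) pair becomes a count feature, and a random forest's `feature_importances_` are regrouped by noun. The feature construction (simple counts) and the toy data are assumptions for illustration, not a reproduction of the paper's feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

NOUNS = ["window", "sink", "stool", "cookie"]  # subset of the nouns above

def chunk_counts(tokens, n_chunks=3):
    """One count feature per (chunk, noun) pair: how often each target
    noun occurs in the 1st, 2nd, and 3rd third of a transcript."""
    size = max(1, len(tokens) // n_chunks)
    feats = []
    for c in range(n_chunks):
        start = c * size
        chunk = tokens[start:start + size] if c < n_chunks - 1 else tokens[start:]
        feats += [chunk.count(noun) for noun in NOUNS]
    return feats

# Hypothetical tokenized transcripts with binary labels (1 = AD).
docs = [
    ["the", "window", "is", "open", "the", "sink", "overflows", "a", "stool"],
    ["cookie", "cookie", "uh", "the", "the", "window", "um", "sink", "stool"],
] * 10
X = np.array([chunk_counts(d) for d in docs])
y = np.array([0, 1] * 10)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Regroup the flat importances by chunk (rows) and noun (columns), then
# report per-noun scores and their sum, mirroring the table's columns.
imp = rf.feature_importances_.reshape(3, len(NOUNS))
for j, noun in enumerate(NOUNS):
    scores = imp[:, j]
    row = " ".join(f"{s:.3f}" for s in scores)
    print(f"{noun:8s} {row}  sum={scores.sum():.3f}")
```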
Figure 2. Main feature engineering steps, illustrated on the example preprocessed input sentence “There are tie back curtains at the window”. Audio and word feature vectors (i.e., embeddings) are combined (the “⌢” symbol denotes concatenation) and fed into an ADR feature generation procedure. The six resulting features are used in five distinct feature configurations.
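The “⌢” operation in Figure 2 amounts to vector concatenation. A minimal sketch follows; the embedding dimensionalities (768 for text, 128 for audio) are assumptions for illustration, not the paper's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance embeddings: a 768-d text vector (e.g., from a
# pretrained language model) and a 128-d pooled acoustic vector for the
# same utterance. Both dimensionalities are assumptions.
text_emb = rng.standard_normal(768)
audio_emb = rng.standard_normal(128)

# The "⌢" operation from Figure 2: plain vector concatenation, yielding one
# multimodal representation that downstream feature generation consumes.
multimodal = np.concatenate([text_emb, audio_emb])
assert multimodal.shape == (768 + 128,)
```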
Results of the three best feature configurations in the LOOCV setting and on the test set, in terms of accuracy.
| Feature configuration | Input modality | LOOCV accuracy | Test accuracy |
| --- | --- | --- | --- |
| Temporal + char4grams | audio + text | 0.8611 | |
| New + char4grams | audio + text | 0.8750 | |
| char4grams | text | 0.8611 | 0.8958 |
| top three late fusion | / | | |
| BERT (reimplementation of Yuan et al.) | / | 0.8426 | 0.8333 |
| ERNIE, best related work (Yuan et al.) | / | / | 0.8958 |
The Feature configuration column indicates which feature configuration was used and whether char4grams were added; the Input modality column shows the modality on which ADR features were generated. The best individual methods' results in LOOCV and on the test set, as well as the late fusion of all three methods, are shown in bold. The row labelled top three late fusion presents the results of employing late/decision fusion (i.e., majority voting) over the three best approaches.
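Late/decision fusion via majority voting, as used in the top three late fusion row, reduces to a per-sample vote over the three classifiers' predictions. A minimal sketch with hypothetical predictions:

```python
import numpy as np

# Hypothetical binary predictions (1 = AD) of the three best approaches on
# a ten-item test set; in the study these come from the trained classifiers.
preds = np.array([
    [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 1],
])

# Majority voting: a sample is labelled AD when at least two of the three
# classifiers predict AD.
fused = (preds.sum(axis=0) >= 2).astype(int)
print(fused)  # [1 0 1 1 0 0 1 1 0 1]
```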
Figure 3. Boxplot summarizing the accuracy distributions of 50 classifiers on the test set for the Temporal feature configuration (text, audio, and text+audio), char4grams alone, and char4grams combined with text and audio (text+audio+char4grams).
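The char4grams features referenced throughout can be built with a standard character n-gram vectorizer. The sketch below uses scikit-learn's TF-IDF over character 4-grams with a logistic regression classifier; the exact vectorizer settings and classifier used in the paper are not reproduced here, and the transcripts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical preprocessed transcripts with binary AD labels (1 = AD).
texts = [
    "there are tie back curtains at the window",
    "the uh the uh boy is um taking the the cookie",
] * 10
labels = [0, 1] * 10

# Character 4-grams; "char_wb" builds n-grams only from characters inside
# word boundaries (padding word edges with space).
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["the boy um um takes the cookie"]))
```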
Comparison with state-of-the-art studies conducted on subsets of the Pitt dataset.
| Study | Accuracy | Modality |
| --- | --- | --- |
| Haider et al. | 78.70% | Acoustic |
| Luz | 68.00% | Acoustic |
| Fraser et al. | 81.90% | Text/acoustic |
| Yancheva and Rudzicz | 80.00% | Text/acoustic |
| Hernández-Domínguez et al. | 68.00% | Text |
| Mirheidari et al. | 75.60% | Text |
| ADReSS challenge baseline | 62.50% | Acoustic |
| ADReSS challenge baseline | 75.00% | Text |
| *This study* (char4grams) | 89.58% | Text |
| Yuan et al. | 85.40% | Text |
| Syed et al. | 85.42% | Text |
| Balagopalan et al. | 83.33% | Text |
| Sarawgi et al. | 83.33% | Text/acoustic |
| Pompili et al. | 81.25% | Text/acoustic |
| Koo et al. | 81.25% | Text/acoustic |
| Cummins et al. | 81.25% | Text/acoustic |
| Searle et al. | 81.25% | Text/acoustic |
| Edwards et al. | 79.17% | Text/acoustic |
| Rohanian et al. | 79.17% | Text/acoustic |
| Martinc and Pollak | 77.08% | Text |
| Pappagari et al. | 75.00% | Text/acoustic |
| *This study* | | Acoustic/text/temporal |
| *This study* | | Acoustic/text/temporal |
The top three results are shown in bold. Results of this study are presented in italics.
Figure 4. Test set attention scores assigned by the BERT model to 16 of the nouns presented in Figure 1. The height of each column indicates the attention given to a specific noun at a specific position in the transcript. A blue coloured column indicates that the noun at this position appeared in a transcript belonging to the non-AD class, while a red coloured column indicates that it appeared in a transcript belonging to the AD class. The positions (x-axis) range from 1 (i.e., the first word in the transcript) to 256 (i.e., the last word position considered in each transcript).
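Per-position attention scores of the kind visualized in Figure 4 can be extracted from a BERT model's attention tensors. The sketch below uses a generic pretrained `bert-base-uncased` rather than the paper's fine-tuned classifier, and the aggregation scheme (averaging over layers and heads, then summing the attention each token receives) is an assumption, not necessarily the paper's exact procedure.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = "the boy takes a cookie from the jar while the sink overflows"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one (batch, heads, seq, seq) tensor per layer.
# Average over layers and heads, then sum each column to get the attention
# a token *receives* from every other position.
att = torch.stack(outputs.attentions).mean(dim=(0, 2))[0]  # (seq, seq)
received = att.sum(dim=0)                                  # (seq,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, received):
    print(f"{tok:12s} {score.item():.3f}")
```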