Youxiang Zhu, Xiaohui Liang, John A Batsis, Robert M Roth.
Abstract
Examination of speech datasets for detecting dementia, collected via various speech tasks, has revealed links between speech and cognitive abilities. However, the speech data available for this research are extremely limited because collecting speech and baseline data from patients with dementia in clinical settings is expensive. In this paper, we study the spontaneous speech dataset from the recent ADReSS challenge, a Cookie Theft Picture (CTP) dataset whose participant groups are balanced in age, gender, and cognitive status. We explore state-of-the-art deep transfer learning techniques from the image, audio, speech, and language domains. We envision that one advantage of transfer learning is eliminating the need to design handcrafted features for each task and dataset. Transfer learning further mitigates the scarcity of dementia-relevant speech data by inheriting knowledge from similar but much larger datasets. Specifically, we built a variety of transfer learning models using the commonly employed MobileNet (image), YAMNet (audio), Mockingjay (speech), and BERT (text) models. Results indicated that the transfer learning models using text data performed significantly better than those using audio data. The performance gains of the text models may be due to the high similarity between the pre-training text dataset and the CTP text dataset. Our multi-modal transfer learning yielded a slight improvement in accuracy, demonstrating that audio and text data provide limited complementary information. Multi-task transfer learning resulted in limited improvements in classification and a negative impact on regression. By analyzing the meaning behind the AD/non-AD labels and Mini-Mental State Examination (MMSE) scores, we observed that the inconsistency between labels and scores could limit the performance of multi-task learning, especially when the outputs of the single-task models are highly consistent with the corresponding labels/scores.
In sum, we conducted a large comparative analysis of varying transfer learning models, focusing less on model customization and more on pre-trained models and pre-training datasets. We revealed insightful relations among models, data types, and data labels in this research area.
Keywords: Alzheimer’s Disease; Deep learning; Dementia; Early Detection; Speech analysis; Spontaneous speech; Transfer learning
Year: 2021 PMID: 34046588 PMCID: PMC8153512 DOI: 10.3389/fcomp.2021.624683
Source DB: PubMed Journal: Front Comput Sci ISSN: 2624-9898
Cookie Theft Picture Datasets
| Dataset | Language | Total | HC | MCI | AD |
|---|---|---|---|---|---|
| ADReSS | English | 156 | 78 | 78 | |
| Pitt Corpus | English | 312 | 104 | 208 | |
| WLS | English | 1366 | | | |
| IVA | English | 33 | 16 | 17 | |
| Hebrew CTP | Hebrew | 70 | 35 | 35 | |
| MECSD | Mandarin | 85 | 65 | 20 | |
| NTU | Mandarin | 50 | 40 | 10 | |
| Swedish CTP | Swedish | 67 | 36 | 31 | |
| French CTP | French | 58 | 25 | 33 | |
Figure 1. Supervised Classification Approach
Figure 2. Text BERT and Speech BERT
Figure 3. Multi-modal transfer learning using Text/Speech BERT (Dual BERT)
Figure 4. Multi-task learning using Text BERT
AD Classification results using audio or text and with or without pre-training. AD: Alzheimer’s Disease. Accuracy: mean and standard deviation of results of 5 rounds. Best: highest accuracy of all epochs in 5 rounds.
| Model | Pre-training dataset | Classes | Precision % | Recall % | F1 % | Accuracy % | Best % |
|---|---|---|---|---|---|---|---|
| Audio | - | AD | 60 | 75 | 67 | 62 | |
| MobileNet | - | AD | 61.40 ± 6.89 | 59.80 ± 21.07 | 57.60 ± 10.54 | 59.00 ± 5.66 | 72.91 |
| MobileNet | ImageNet | AD | 55.80 ± 2.48 | 90.40 ± 1.96 | 69.00 ± 1.67 | 58.80 ± 3.49 | 77.08 |
| YAMNet | - | AD | 53.40 ± 5.95 | 87.60 ± 9.56 | 65.80 ± 1.33 | 53.80 ± 6.88 | 79.17 |
| YAMNet | AudioSet | AD | 64.40 ± 3.93 | 73.40 ± 8.82 | 68.60 ± 4.84 | 66.20 ± 4.79 | 83.33 |
| Speech BERT | - | AD | 65.84 ± 2.43 | 69.16 ± 5.65 | 67.39 ± 3.71 | 66.67 ± 2.95 | 77.08 |
| Speech BERT | LibriSpeech | AD | 61.48 ± 2.76 | 71.67 ± 4.86 | 66.12 ± 3.08 | 63.33 ± 3.12 | 79.17 |
| Text | - | AD | 83 | 62 | 71 | 75 | - |
| BERT base | - | AD | 75.47 ± 2.08 | 79.17 ± 2.63 | 77.23 ± 1.50 | 76.67 ± 1.56 | 81.25 |
| BERT base | BooksCorpus/Wiki | AD | 83.64 ± 2.22 | 76.67 ± 2.04 | 80.00 ± 2.13 | 80.83 ± 2.04 | 85.42 |
| BERT large | BooksCorpus/Wiki | AD | 80.65 ± 2.66 | 83.33 ± 5.89 | 81.89 ± 3.64 | 81.67 ± 3.34 | 87.50 |
| Longformer | BooksCorpus/Wiki/Realnews/Stories | AD | 88.14 ± 2.09 | 74.17 ± 5.53 | 80.44 ± 3.55 | 82.08 ± 2.83 | 89.58 |
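As a quick consistency check on the table above, the F1 column can be compared with the harmonic mean of the reported precision and recall. The sketch below is illustrative only: `f1_score` is a hypothetical helper, not code from the paper, and because the table reports means over 5 rounds, the F1 computed from the mean precision and mean recall need not equal the mean F1 exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Mean AD-class precision and recall of the pre-trained BERT base model,
# copied from the table above (BooksCorpus/Wiki row).
bert_base_f1 = f1_score(83.64, 76.67)
print(f"{bert_base_f1:.2f}")  # prints 80.00, matching the reported F1
```

For rows with large standard deviations, such as MobileNet without pre-training, a visible gap between this value and the reported F1 is expected rather than a sign of an error.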
AD Classification results of multi-modal learning using both audio and text. AD: Alzheimer’s Disease. Accuracy: mean and standard deviation of results of 5 rounds. Best: highest accuracy of all epochs in 5 rounds.
| Model | Fusion / Training | Classes | Precision % | Recall % | F1 % | Accuracy % | Best % |
|---|---|---|---|---|---|---|---|
| Speech BERT | - | AD | 61.48 ± 2.76 | 71.67 ± 4.86 | 66.12 ± 3.08 | 63.33 ± 3.12 | 79.17 |
| BERT base | - | AD | 83.64 ± 2.22 | 76.67 ± 2.04 | 80.00 ± 2.13 | 80.83 ± 2.04 | 85.42 |
| Dual BERT | Add / Joint | AD | 84.41 ± 2.13 | 76.67 ± 2.04 | 80.35 ± 1.95 | 81.25 ± 1.86 | 85.42 |
| Dual BERT | Add / Separate | AD | 86.05 ± 2.60 | 76.67 ± 2.04 | 81.06 ± 1.69 | 82.08 ± 1.66 | 85.42 |
| Dual BERT | Concat / Separate | AD | 84.21 ± 2.52 | 79.17 ± 2.63 | 81.54 ± 1.01 | 82.08 ± 1.02 | 85.42 |
| Dual BERT | Concat / Joint (No pre-train speech) | AD | 84.10 ± 1.91 | 79.17 ± 2.63 | 81.53 ± 1.83 | 82.08 ± 1.66 | 87.50 |
| Dual BERT | Concat / Joint (Longformer) | AD | 86.95 ± 3.38 | 75.83 ± 6.12 | 80.79 ± 2.74 | 82.08 ± 2.12 | 89.58 |
| Dual BERT | Concat / Joint | AD | 85.48 ± 1.46 | 78.34 ± 1.67 | 81.74 ± 1.10 | 82.50 ± 1.02 | 85.42 |
| Dual BERT | Concat / Joint (BERT large) | AD | 83.04 ± 3.97 | 83.33 ± 5.89 | 82.92 ± 1.86 | 82.92 ± 1.56 | 87.50 |
| YAMNet + BERT base | Concat / Joint | AD | 82.70 ± 3.65 | 82.50 ± 5.53 | 82.45 ± 3.07 | 80.83 ± 2.43 | 89.58 |
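The "Add" and "Concat" fusion settings in the table above are not defined in this section. A minimal sketch, assuming they mean element-wise addition and concatenation of a text embedding and a speech embedding before a shared classification layer, might look as follows. All function names, vector sizes, and weights are hypothetical and chosen only for illustration; the "Joint" versus "Separate" training distinction is not modeled here.

```python
import math
from typing import List, Sequence


def fuse_add(text_vec: Sequence[float], speech_vec: Sequence[float]) -> List[float]:
    """ASSUMED 'Add' fusion: element-wise sum, so both embeddings must have equal size."""
    if len(text_vec) != len(speech_vec):
        raise ValueError("Add fusion needs embeddings of equal size")
    return [t + s for t, s in zip(text_vec, speech_vec)]


def fuse_concat(text_vec: Sequence[float], speech_vec: Sequence[float]) -> List[float]:
    """ASSUMED 'Concat' fusion: text features followed by speech features."""
    return list(text_vec) + list(speech_vec)


def ad_probability(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear classification head with a sigmoid output."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy embeddings standing in for the pooled outputs of a text and a speech encoder.
text_embedding = [0.2, -0.1, 0.4]
speech_embedding = [0.0, 0.3, -0.2]

added = fuse_add(text_embedding, speech_embedding)            # 3 features
concatenated = fuse_concat(text_embedding, speech_embedding)  # 6 features
print(len(added), len(concatenated))  # prints 3 6
```

Under this reading, addition keeps the classifier input the same size as a single-modality model, while concatenation doubles it and lets the classifier weight each modality separately.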
Figure 5. Threshold-based strategy (0–30)
Threshold-based strategy (20–30). The highest accuracy in training, the highest accuracy in testing, and the testing accuracy corresponding to the highest accuracy in training are in bold.
| Threshold | Accuracy (Training) % | Accuracy (Testing) % |
|---|---|---|
| 20 | 86.92 | 75.00 |
| 21 | 88.79 | 79.17 |
| 22 | 89.72 | 81.25 |
| 23 | 90.65 | 83.33 |
| 24 | 92.52 | 87.50 |
| 25 | 95.33 | 87.50 |
| 26 | | |
| 27 | 96.26 | 89.58 |
| 28 | 95.33 | |
| 29 | 87.85 | 83.33 |
| 30 | 71.03 | 70.83 |
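The threshold sweep summarized in the table above can be sketched as follows, assuming the strategy predicts AD whenever a participant's MMSE score is at or below the threshold and measures accuracy against the AD/non-AD labels. The direction and inclusiveness of the comparison, the helper names, and the toy data are all assumptions made for illustration and do not come from the paper.

```python
from typing import Iterable, Sequence


def threshold_accuracy(mmse_scores: Sequence[int], is_ad: Sequence[bool], threshold: int) -> float:
    """Accuracy (%) of the ASSUMED rule 'predict AD when MMSE <= threshold'."""
    correct = sum((score <= threshold) == label for score, label in zip(mmse_scores, is_ad))
    return 100.0 * correct / len(mmse_scores)


def best_threshold(mmse_scores: Sequence[int], is_ad: Sequence[bool],
                   candidates: Iterable[int] = range(0, 31)) -> int:
    """Candidate with the highest accuracy; ties go to the lowest threshold."""
    return max(candidates, key=lambda t: threshold_accuracy(mmse_scores, is_ad, t))


# Hypothetical toy data (NOT from ADReSS): one non-AD participant scores 25
# while one AD participant scores 26, so no threshold separates the groups.
toy_scores = [30, 29, 27, 25, 26, 22, 18, 12]
toy_labels = [False, False, False, False, True, True, True, True]

chosen = best_threshold(toy_scores, toy_labels)
print(chosen, threshold_accuracy(toy_scores, toy_labels, chosen))  # prints 22 87.5
```

The toy data illustrates the kind of label/score inconsistency discussed in the abstract: when labels and MMSE scores overlap, even the best threshold cannot reach 100% accuracy.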
Classification and regression results of multi-task transfer learning using CTP text. AD: Alzheimer’s Disease. Accuracy: mean and standard deviation of results of 5 rounds. Best: highest accuracy of all epochs in 5 rounds. RMSE: mean and standard deviation of Root-Mean-Square Errors of 5 rounds. Best RMSE: lowest RMSE of all epochs in 5 rounds.
| Model | Pre-training | Settings | Accuracy % | Best % | RMSE | Best RMSE |
|---|---|---|---|---|---|---|
| Text | - | Classification | 75 | - | | |
| Text | - | Regression | | | 5.20 | - |
| BERT base | No | Classification | 76.67 ± 1.56 | 81.25 | - | - |
| BERT base | No | Regression | - | - | 5.18 ± 0.04 | 4.65 |
| BERT base | No | Multi-task | 78.75 ± 1.56 | 83.33 | 4.70 ± 0.02 | 4.39 |
| BERT base | Yes | Classification | 80.83 ± 2.04 | 85.42 | - | - |
| BERT base | Yes | Regression | - | - | 4.15 ± 0.01 | 4.06 |
| BERT base | Yes | Multi-task | 80.83 ± 1.56 | 87.50 | 4.96 ± 0.01 | 4.20 |
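To make the metrics in the table above concrete, the sketch below computes RMSE for MMSE regression and combines a classification loss with a regression loss into one multi-task objective. How the paper actually combines the two objectives is not stated in this section; the weighted sum, the `regression_weight` parameter, and all function names are assumptions for illustration.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between predicted and true MMSE scores."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def binary_cross_entropy(prob_ad: Sequence[float], is_ad: Sequence[bool]) -> float:
    """Mean cross-entropy of predicted AD probabilities against AD/non-AD labels."""
    eps = 1e-12  # clamp probabilities to keep log() finite
    total = 0.0
    for p, label in zip(prob_ad, is_ad):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if label else math.log(1.0 - p)
    return total / len(prob_ad)


def multitask_loss(prob_ad: Sequence[float], is_ad: Sequence[bool],
                   mmse_pred: Sequence[float], mmse_true: Sequence[float],
                   regression_weight: float = 1.0) -> float:
    """ASSUMED objective: classification loss plus weighted mean squared error."""
    mse = rmse(mmse_pred, mmse_true) ** 2
    return binary_cross_entropy(prob_ad, is_ad) + regression_weight * mse


# Hypothetical predictions for two participants (not data from the paper).
loss = multitask_loss([0.8, 0.3], [True, False], [21.0, 28.0], [19.0, 29.0],
                      regression_weight=0.1)
print(round(rmse([21.0, 28.0], [19.0, 29.0]), 3))
```

Because MMSE errors are measured on a 0 to 30 scale, an unweighted squared error can dominate the cross-entropy term, which is why a weight on the regression term is included in this sketch.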
The best classification cases of the audio-based, text-based, and multi-modal models. AD: Alzheimer’s Disease. Accuracy: mean and standard deviation of results of 5 rounds. Best: highest accuracy of all epochs in 5 rounds.
| Input | Model (with pre-training) | Classes | Precision % | Recall % | F1 % | Accuracy % | Best % |
|---|---|---|---|---|---|---|---|
| Audio | YAMNet | AD | 64.40 ± 3.93 | 73.40 ± 8.82 | 68.60 ± 4.84 | 66.20 ± 4.79 | 83.33 |
| Text | Longformer | AD | 88.14 ± 2.09 | 74.17 ± 5.53 | 80.44 ± 3.55 | 82.08 ± 2.83 | 89.58 |
| Audio + Text | Dual BERT Concat / Joint (BERT large) | AD | 83.04 ± 3.97 | 83.33 ± 5.89 | 82.92 ± 1.86 | 82.92 ± 1.56 | 87.50 |