Loukas Ilias, Dimitris Askounis.
Abstract
Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for patients but also for their families, relatives, and society in general. Despite the significance of this phenomenon and the importance of early diagnosis, existing approaches still have limitations. The main limitation concerns the way the speech and transcript modalities are combined in a single neural network. Existing research works add or concatenate the image and text representations, employ majority voting, or average the predictions after training many textual and speech models separately. To address these limitations, in this article we present new methods to detect AD patients and predict Mini-Mental State Examination (MMSE) scores in an end-to-end trainable manner, combining BERT, Vision Transformer, Co-Attention, a Multimodal Shifting Gate, and a variant of the self-attention mechanism. Specifically, we convert audio to Log-Mel spectrograms together with their delta and delta-delta (acceleration) values. First, we pass each transcript and image through a BERT model and a Vision Transformer, respectively, adding a co-attention layer on top, which generates image and word attention simultaneously. Second, we propose an architecture that integrates multimodal information into a BERT model via a Multimodal Shifting Gate. Finally, we introduce an approach that captures both inter- and intra-modal interactions by concatenating the textual and visual representations and applying a self-attention mechanism that includes a gate model. Experiments conducted on the ADReSS Challenge dataset indicate that the introduced models offer advantages over existing research initiatives, achieving competitive results in both the AD classification and the MMSE regression task. Our best-performing model attains an accuracy of 90.00% in the AD classification task and a Root Mean Squared Error (RMSE) of 3.61 in the MMSE regression task, the latter setting a new state of the art.
Keywords: BERT; Log-Mel spectrogram; Multimodal Shifting Gate; Vision Transformer; co-attention; dementia; self-attention
Year: 2022 PMID: 35370608 PMCID: PMC8969102 DOI: 10.3389/fnagi.2022.830943
Source DB: PubMed Journal: Front Aging Neurosci ISSN: 1663-4365 Impact factor: 5.750
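The abstract describes turning each recording into a Log-Mel spectrogram plus its delta and delta-delta values, which together form a three-channel, image-like input. The sketch below illustrates only the delta step, using the common regression-based delta formula with repeated edge frames. The window width, the padding rule, and the toy input are assumptions made for illustration; the record does not state which toolkit or settings the authors used.

```python
from typing import List

Matrix = List[List[float]]  # rows = Mel bands, columns = time frames


def delta(features: Matrix, width: int = 2) -> Matrix:
    """Regression-based delta along the time axis (edge frames are repeated)."""
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out: Matrix = []
    for row in features:
        last = len(row) - 1
        deltas = []
        for t in range(len(row)):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = row[min(t + n, last)]   # clamp at the last frame
                behind = row[max(t - n, 0)]     # clamp at the first frame
                acc += n * (ahead - behind)
            deltas.append(acc / denom)
        out.append(deltas)
    return out


def three_channel_image(log_mel: Matrix, width: int = 2) -> List[Matrix]:
    """Stack log-Mel, delta, and delta-delta as three image-like channels."""
    d1 = delta(log_mel, width)
    d2 = delta(d1, width)  # delta of the delta = acceleration
    return [log_mel, d1, d2]


# Hypothetical toy input: 2 Mel bands x 12 frames, each band linear in time.
toy_log_mel = [
    [0.5 * t for t in range(12)],
    [1.0 - 0.25 * t for t in range(12)],
]
channels = three_channel_image(toy_log_mel)
```

For a band that changes linearly over time, the delta channel recovers the slope away from the edges and the delta-delta channel is zero there, which is a quick sanity check on any delta implementation.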
Figure 1. BERT + ViT + Co-Attention.
Figure 2. Multimodal BERT - eGeMAPS + ViT.
Figure 3. BERT + ViT + Gated Self-Attention.
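The Multimodal Shifting Gate named in the abstract injects non-text information into a BERT representation. This record does not give the equations, so the sketch below shows one common gate-and-shift formulation from the multimodal-transformer literature as an assumption, not the authors' exact method: a gate computed from both inputs scales a projection of the non-text vector, and the resulting displacement is capped relative to the norm of the text vector. The ReLU gate, the omission of bias terms, the `beta` value, and all weights and inputs are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def norm(x: Vector) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


def gated_shift(text: Vector, other: Vector, w_gate: Matrix, w_proj: Matrix,
                beta: float = 0.5, eps: float = 1e-6) -> Vector:
    """Shift a text embedding by a gated projection of another modality."""
    # Gate: ReLU over a linear map of the concatenated inputs (biases omitted).
    gate = [max(0.0, g) for g in matvec(w_gate, text + other)]
    # Displacement: projected non-text vector, scaled element-wise by the gate.
    shift = [g * p for g, p in zip(gate, matvec(w_proj, other))]
    # Cap the displacement so its norm stays below beta times the text norm.
    alpha = min(beta * norm(text) / (norm(shift) + eps), 1.0)
    return [t + alpha * s for t, s in zip(text, shift)]


# Hypothetical toy values: a 3-d text embedding and a 2-d acoustic/visual vector.
text = [0.2, -0.1, 0.4]
other = [3.0, -2.0]
w_gate = [[0.1, 0.0, 0.2, 0.5, 0.1],
          [0.3, -0.2, 0.0, 0.4, -0.3],
          [0.0, 0.1, 0.1, -0.2, 0.6]]
w_proj = [[1.0, 0.5], [-0.5, 2.0], [0.3, 0.3]]
shifted = gated_shift(text, other, w_gate, w_proj, beta=0.5)
```

The cap keeps the text representation dominant: however large the non-text vector is, the shift can move the embedding by at most `beta` times its own norm.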
Overview of the multimodal state-of-the-art approaches, which are later compared with our work.
| Research work | Approach | Description | Task |
|---|---|---|---|
| Cummins et al. ( | Fusion Maj./W-avg (3-best) | Bag-of-Audio-Words, zero-frequency filtered (ZFF) signals, and BiLSTM-Attention network | AD/MMSE |
| Rohanian et al. ( | LSTM with Gating (Acoustic + Lexical + Dis) | Acoustic, Linguistic Features, Bi-LSTM, gating mechanism | AD/MMSE |
| Edwards et al. ( | System 3: Phonemes and Audio | Phoneme written pronunciation using CMUDict + acoustic features | AD |
| Pompili et al. ( | Fusion of System | Fusion of x-vectors with linguistic features, train SVM | AD |
| Koo et al. ( | Bimodal Network (Ensembled Output) | Ensemble (top-5 bimodal networks) | AD/MMSE |
| Martinc and Pollak ( | GFI, NUW, Duration, Character 4-grams, Suffixes, POS tag, UD | Feature extraction, Logistic Regression Classifier | AD |
| Pappagari et al. ( | Acoustic & Transcript | Fusion of the acoustic (x-vectors) and transcript (BERT) model scores | AD |
| Pappagari et al. ( | Acoustic+silence & Transcript | Average the scores from the different models, four silence features | MMSE |
| Zhu et al. ( | Dual BERT | Concatenation of the representations obtained by BERT and Speech BERT | AD |
| Mahajan and Baths ( | Model C | Neural network consisting of CNN, BiLSTM, Attention, GRU, and Dense layers | AD |
| Shah et al. ( | Majority vote (NLP + Acoustic) | Final prediction by taking a linear weighted combination of the individual model predictions | AD |
| Shah et al. ( | Random Forest (NLP) + gradient boosting (acoustic) | Language/fluency/n-gram features, MFCC and delta coefficients, Dimensionality Reduction Techniques | MMSE |
| Syed et al. ( | Audio + Text | Majority level approach of six models, averaging-based fusion | AD/MMSE |
| Sarawgi et al. ( | Ensemble | Majority voting approach, average the predictions | AD/MMSE |
| Syed et al. ( | Attempt 4 | Label fusion from the top-5 performing models from audio and text modalities (top-5 from each modality), average value of predictions of individual models | AD/MMSE |
| Farzana and Parde ( | SELECTED-FEATURE | For selecting the features, a Random Forest regression model was trained. The authors retained only features having mean decrease impurity (MDI) values exceeding a predefined threshold | MMSE |
Overview of the unimodal state-of-the-art approaches using only speech, which are later compared with our work.
| Research work | Approach | Description | Task |
|---|---|---|---|
| Cummins et al. ( | SiameseNet | A deep Siamese neural network consisting of convolutional layers. As an input, the model used either 8-s or 16-s segments. | AD |
| Cummins et al. ( | BoAW fusion (3-best) | Mel-Frequency Cepstral Coefficients (MFCCs), log-Mel, and the ComParE acoustic feature set | MMSE |
| Rohanian et al. ( | LSTM (Acoustic) | Higher-order statistics of COVAREP features. Bi-LSTM training | AD/MMSE |
| Edwards et al. ( | System 1: Audio | LDA posterior probabilities of ComParE2016 features | AD |
| Pompili et al. ( | x-vectors_SRE | The authors use both the SRE and the VoxCeleb models for the x-vectors framework and train an SVM | AD |
| Koo et al. ( | VGGish | The authors used VGGish features and trained a neural network consisting of Attention Layer, CNN, Bi-LSTM, and Dense Layers. | AD/MMSE |
| Pappagari et al. ( | Acoustic + Silence | Silence features, x-vector PCA-transformed coefficients, Probabilistic Linear Discriminant Analysis (PLDA) for detection, and Support Vector Regression (SVR) for MMSE prediction | AD/MMSE |
| Zhu et al. ( | YAMNet | The input of YAMNet is the Mel spectrogram from audio data with dimensions of (p, t, 1) | AD |
| Mahajan and Baths ( | Model B0 (emobase) | GRU taking in audio segment features and finally combining the features from the speech segments into a common vector | AD |
| Shah et al. ( | Majority vote (Acoustic) | Acoustic feature extraction across all speech segments, weighted majority vote classification on segments | AD |
| Shah et al. ( | Gradient Boosting (Acoustic) | MFCC 1–16 features and their delta coefficients from 26 Mel-bands | MMSE |
| Syed et al. ( | Audio (fusion) | Majority level approach of three acoustic models, averaging-based fusion | AD/MMSE |
| Chlasta and Wolk ( | DemCNN | Convolutional neural network for speech classification using the raw waveform | AD |
| Meghanani et al. ( | CNN - LSTM (MFCC) | 21 models are fitted on 21 bootstrap samples, and their outputs are combined by a majority voting scheme for the final classification. | AD |
| Meghanani et al. ( | pBLSTM-CNN (log-Mel) | Bagging of 21 models by averaging the outputs. | MMSE |
| Farzana and Parde ( | acoustic-all | Mel Frequency Cepstral Coefficients (MFCCs), mean value, variance, etc. | MMSE |
| Syed et al. ( | Attempt 3 | Label fusion from the top-5 performing models from the audio modality, prediction from the BERT base uncased RangePool | AD/MMSE |
Overview of the unimodal state-of-the-art approaches using only text, which are later compared with our work.
| Research work | Approach | Description | Task |
|---|---|---|---|
| Cummins et al. ( | bi-LSTM-Att | GloVe 100d as pretrained weights, maximum word number for each transcript is 200, Bi-LSTM with attention | AD/MMSE |
| Rohanian et al. ( | LSTM (Lexical + Dis) | GloVe features of 100d, disfluency markers (self-repair), Bi-LSTM | AD/MMSE |
| Edwards et al. ( | System 2: Phonemes | The authors transcribed the segment text into phoneme written pronunciation using CMUDict. FastText was trained on the phoneme representation | AD |
| Pompili et al. ( | Sentence Embedding | Sentence embeddings are computed by averaging the second to twelfth hidden layers of each word; an SVM is then trained | AD |
| Koo et al. ( | Transformer-XL | The authors extracted textual features using Transformer-XL and trained a neural network consisting of CNN, Attention, Bi-LSTM, and Dense Layers. | AD/MMSE |
| Pappagari et al. ( | Transcript | The authors train a BERT model. | AD/MMSE |
| Zhu et al. ( | Longformer | Training of Longformer | AD |
| Mahajan and Baths ( | Model A0 | Neural network consisting of CNN, LSTM, and Dense layers | AD |
| Shah et al. ( | Logistic Regression (NLP) | Language and fluency features, n-gram features, Dimensionality Reduction Techniques | AD |
| Shah et al. ( | Random Forest (NLP) | Language and fluency features, n-gram features, Dimensionality Reduction Techniques | MMSE |
| Syed et al. ( | Text (fusion) | Fusion of top-3 performing models from the textual modality | AD/MMSE |
| Syed et al. ( | Attempt 5 | Label fusion from the top-10 performing models from text modalities, average of MMSE score predictions from the top-10 performing models | AD/MMSE |
| Balagopalan et al. ( | BERT | Training of BERT model | AD |
| Farzana and Parde ( | n-gram | All lexicosyntactic features, SVR training | MMSE |
| Meghanani et al. ( | fastText, bi+trigram | The authors fit 21 models and the outputs are combined by a majority voting scheme for final classification. In the regression task, the outputs of these bootstrap models are averaged to arrive at the final MMSE score | AD/MMSE |
AD Classification Task: Performance comparison among proposed models and state-of-the-art approaches on the ADReSS Challenge test set.
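| Architecture | Precision | Recall | F1-score | Accuracy | Specificity |
|---|---|---|---|---|---|
| Multimodal state-of-the-art approaches | | | | | |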
| Cummins et al. ( | - | - | 85.40 | 85.20 | - |
| Rohanian et al. ( | - | - | - | 79.20 | - |
| Edwards et al. ( | 81.82 | 75.00 | 78.26 | 79.17 | - |
| Pompili et al. ( | | 66.67 | 78.05 | 81.25 | - |
| Koo et al. ( | 89.47 | 70.83 | 79.07 | 81.25 | - |
| Martinc and Pollak ( | - | - | - | 77.08 | - |
| Pappagari et al. ( | 70.00 | 88.00 | 78.00 | 75.00 | - |
| Zhu et al. ( | 83.04 ± 3.97 | 83.33 ± 5.89 | 82.92 ± 1.86 | 82.92 ± 1.56 | - |
| Mahajan and Baths ( | 78.94 | 62.50 | 69.76 | 72.92 | - |
| Shah et al. ( | - | - | - | 83.00 | - |
| Syed et al. ( | - | 87.50 | - | 89.58 | 91.67 |
| Sarawgi et al. ( | 83.00 | 83.00 | 83.00 | 83.00 | - |
| Syed et al. ( | - | - | - | 79.17 | - |
| Unimodal state-of-the-art approaches (only text) | | | | | |
| Cummins et al. ( | - | - | 81.20 | 81.30 | - |
| Rohanian et al. ( | - | - | - | 72.90 | - |
| Edwards et al. ( | 80.95 | 70.83 | 75.56 | 77.08 | - |
| Pompili et al. ( | 82.35 | 58.33 | 68.29 | 72.92 | - |
| Koo et al. ( | 80.00 | 83.33 | 81.63 | 81.25 | - |
| Pappagari et al. ( | 69.00 | 83.00 | 75.00 | 72.92 | - |
| Zhu et al. ( | 88.14 ± 2.09 | 74.17 ± 5.53 | 80.44 ± 3.55 | 82.08 ± 2.83 | - |
| Mahajan and Baths ( | 76.47 | 54.16 | 63.41 | 68.75 | - |
| Shah et al. ( | - | - | - | 85.00 | - |
| Syed et al. ( | - | | - | | 91.67 |
| Syed et al. ( | - | - | - | 85.42 | - |
| Balagopalan et al. ( | 83.89 | 83.33 | 83.27 | 83.32 | 83.33 |
| Meghanani et al. ( | 86.00 | 79.00 | 83.00 | 83.33 | 88.00 |
| Unimodal state-of-the-art approaches (only speech) | | | | | |
| Cummins et al. ( | - | - | 70.80 | 70.80 | - |
| Rohanian et al. ( | - | - | - | 66.60 | - |
| Edwards et al. ( | 58.62 | 70.83 | 64.15 | 60.42 | - |
| Pompili et al. ( | 54.17 | 54.17 | 54.17 | 54.17 | - |
| Koo et al. ( | 78.95 | 62.50 | 69.77 | 72.92 | - |
| Pappagari et al. ( | 70.00 | 58.00 | 63.00 | 66.70 | - |
| Zhu et al. ( | 64.40 ± 3.93 | 73.40 ± 8.82 | 68.60 ± 4.84 | 66.20 ± 4.79 | - |
| Mahajan and Baths ( | 65.21 | 62.50 | 63.82 | 64.58 | - |
| Shah et al. ( | - | - | - | 65.00 | - |
| Syed et al. ( | - | 83.33 | - | 81.25 | 79.17 |
| Chlasta and Wolk ( | 62.50 | 62.50 | 62.50 | 62.50 | 62.50 |
| Meghanani et al. ( | 82.00 | 38.00 | 51.00 | 64.58 | 92.00 |
| Syed et al. ( | - | - | - | 64.58 | - |
| Proposed models | | | | | |
| | 92.83 ± 6.39 | 81.67 ± 2.04 | 86.81 ± 3.37 | 87.50 ± 3.49 | |
| | 74.51 ± 1.01 | 87.50 ± 6.45 | 80.35 ± 2.77 | 78.75 ± 2.04 | 70.00 ± 3.12 |
| | 73.91 ± 2.40 | | 81.79 ± 1.72 | 79.58 ± 2.04 | 67.50 ± 4.08 |
| | 76.57 ± 3.74 | 89.17 ± 5.65 | 82.28 ± 3.49 | 80.83 ± 3.58 | 72.50 ± 5.65 |
| | 90.87 ± 3.50 | 89.17 ± 2.04 | 90.00 ± 1.56 | | 90.83 ± 4.08 |
Reported values are mean ± standard deviation. Results are averaged across five runs. Best results per evaluation metric are in bold.
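As a reading aid for the table above, the following sketch shows how precision, recall, F1-score, accuracy, and specificity follow from confusion-matrix counts, and how per-run values can be summarized as mean ± standard deviation. The counts for the five runs are hypothetical, and the use of the sample standard deviation is an assumption, since the record does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Dict, List


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Percent-scaled metrics from confusion counts (AD = positive class)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * 2 * precision * recall / (precision + recall),
        "accuracy": 100 * (tp + tn) / (tp + fp + tn + fn),
        "specificity": 100 * tn / (tn + fp),
    }


def summarize(runs: List[Dict[str, float]]) -> Dict[str, str]:
    """Format each metric as 'mean ± sample standard deviation' over the runs."""
    summary = {}
    for key in runs[0]:
        values = [run[key] for run in runs]
        summary[key] = f"{mean(values):.2f} ± {stdev(values):.2f}"
    return summary


# Hypothetical (tp, fp, tn, fn) counts for five training runs.
counts = [(21, 2, 22, 3), (22, 3, 21, 2), (21, 1, 23, 3), (22, 2, 22, 2), (20, 2, 22, 4)]
runs = [classification_metrics(*c) for c in counts]
summary = summarize(runs)
```

When the two classes are equally sized, accuracy equals the mean of recall and specificity, which gives a simple consistency check on any row that reports all three.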
MMSE Regression Task: performance comparison among proposed models and state-of-the-art approaches on the ADReSS Challenge test set.
| Architecture | RMSE |
|---|---|
| Multimodal state-of-the-art approaches | |
| Cummins et al. ( | 4.65 |
| Rohanian et al. ( | 4.54 |
| Koo et al. ( | 3.77 |
| Pappagari et al. ( | 5.32 |
| Shah et al. ( | 6.01 |
| Syed et al. ( | 4.47 |
| Martinc and Pollak ( | 5.06 |
| Syed et al. ( | 4.91 |
| Farzana and Parde ( | 4.34 |
| Unimodal state-of-the-art approaches (only text) | |
| Cummins et al. ( | 4.66 |
| Rohanian et al. ( | 4.88 |
| Koo et al. ( | 4.02 |
| Pappagari et al. ( | 5.86 |
| Shah et al. ( | 5.62 |
| Syed et al. ( | 3.74 |
| Syed et al. ( | 4.30 |
| Farzana and Parde ( | 4.61 |
| Meghanani et al. ( | 4.87 |
| Unimodal state-of-the-art approaches (only speech) | |
| Cummins et al. ( | 6.45 |
| Rohanian et al. ( | 5.93 |
| Koo et al. ( | 5.08 |
| Pappagari et al. ( | 5.97 |
| Shah et al. ( | 6.67 |
| Syed et al. ( | 5.86 |
| Meghanani et al. ( | 5.90 |
| Farzana and Parde ( | 6.42 |
| Syed et al. ( | 5.18 |
| Proposed models | |
| | 4.20 ± 0.47 |
| | 5.64 ± 0.11 |
| | 5.50 ± 0.30 |
| | 5.62 ± 0.12 |
| | |
Reported values are mean ± standard deviation. Results are averaged across five runs. Best results are in bold.
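The regression results are reported as Root Mean Squared Error between predicted and true MMSE scores. A minimal sketch of that metric follows; the scores in the example are hypothetical and serve only to show the arithmetic.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Squared Error between true and predicted scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical MMSE scores (the scale runs from 0 to 30).
true_scores = [30, 20, 10]
predicted_scores = [27, 24, 10]
error = rmse(true_scores, predicted_scores)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, so a lower value indicates fewer badly mispredicted patients as well as a smaller typical error.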