Multilingual automation of transcript preprocessing in Alzheimer's disease detection.

Abstract

INTRODUCTION: Analyzing linguistic functions can improve early detection of Alzheimer's disease (AD). To date, no studies have focused on creating a universal pipeline for clinical transcript preprocessing.
METHODS: This article presents a simple and efficient method for processing linguistic and phonetic data, sequencing subproblems of cleaning, normalization, and measure extraction tasks. Because some of these tasks are language- and context-dependent, they were designed to be easily configurable, thus increasing their scalability when dealing with new corpora.
RESULTS: Results show improved performance over previous studies in this time-consuming preprocessing task. Moreover, our findings showed that some discursive markers extracted from transcripts revealed a significant correlation (>0.5) with cognitive impairment severity.
DISCUSSION: This article contributes to the literature on AD by presenting an efficient pipeline that speeds up the transcript preprocessing task. We further invite other researchers to contribute to this work to help improve the quality of this pipeline (https://github.com/LiNCS-lab/usAge).

Keywords: Alzheimer's disease; discursive markers; early detection; linguistic features; phonetic features; pipeline; transcript preprocessing

Year: 2021 PMID: 33763518 PMCID: PMC7975846 DOI: 10.1002/trc2.12147

Source DB: PubMed Journal: Alzheimers Dement (N Y) ISSN: 2352-8737

INTRODUCTION

Monitoring and detecting Alzheimer's disease (AD) at an early stage is becoming more crucial as the number of people affected by the disease increases rapidly every year. Currently, nearly 50 million people are living with AD globally, and that number is expected to reach 152 million by. Many studies have covered computer‐based approaches to evaluating and monitoring cognitive functions and detecting AD at an early stage. The Cookie Theft picture description task is widely used to monitor and detect the disease. In this study, we analyze transcripts and audio clips from the Pitt Corpus (English) and the CRIUGM (Centre de recherche de l'Institut universitaire de gériatrie de Montréal) Corpus (Quebec French), as listed in Table 1. Extracting valuable measures can be difficult when working with multilingual datasets. Therefore, this work presents a multilingual pipeline approach that preprocesses and extracts multiple linguistic and phonetic characteristics from data. To evaluate our approach, we compared the results of both datasets.

TABLE 1

Distribution of interviews used for experimentation

Corpus name	Language	Criteria	Diagnosis/type	Samples
CRIUGM	French	<40 y/o	Healthy young	26
		>50 y/o	Old	29
Pitt Corpus	English	MMSE	HC	242
			AD or MCI	300

Abbreviations: AD, Alzheimer's disease; HC, healthy control; MCI, mild cognitive impairment; MMSE, Mini‐Mental State Examination.

RESEARCH IN CONTEXT

Systematic review: While many studies on computer‐based approaches to the early detection of Alzheimer's disease (AD) in a picture description task context have shown great potential over the past few years, most of them cover a specific language. Generally, studies analyzing transcripts based on patients’ interviews tend to restrict their research to a specific cognitive task.

Interpretation: We developed a multilingual and context‐independent pipeline for transcript preprocessing (https://github.com/LiNCS-lab/usAge), which extracts a variety of linguistic and phonetic measures that can eventually be used to monitor and detect AD at an early stage.

Future directions: Transcript preprocessing plays a key role in extracting linguistic measures, as it holds valuable information for cognitive assessment. Further research could focus on understanding how linguistic functions are altered differently in patients with different spoken languages.

METHODS

In this work, we present a methodology based on a pipeline architecture for processing transcripts. This allows the division of the work into subprocesses, which makes it easier to approach the multilingualism factor. Each subprocess is seen as a single entity that can be adapted to different languages and contexts. The pipeline is divided into the following six main modules: typographic normalization and cleaning, part‐of‐speech (POS) tagging, POS adjustment, POS distribution measurement, linguistic measurement, and phonetic measurement. Multilingual modules are identified in blue, as illustrated in Figure 1. We will go through each of the modules to explain how they contribute to transcript preprocessing.

FIGURE 1

Transcript preprocessing pipeline architecture. MFCC, mel‐frequency cepstral coefficients; POS, parts of speech

Typographic normalization and cleaning

Working with transcripts carries multiple challenges because few shared norms exist. Transcripts may appear in different formats, such as plain text files or transcription files (.cha). Different norms for annotating discursive markers may also be used, as most transcripts are produced by hand, and for the same reason they may contain typographic errors. To tackle this problem, the cleaning and normalization task uses a rule‐based approach that is easily adjustable with configuration files. Specifying new rules adapts the process to different languages and different interview contexts. The process thus cleans transcripts and extracts discursive markers, which have been shown to correlate highly with the disease. In fact, they are widely used in the best performing predictive models to detect AD in English and in French, as shown in Table 2 (respectively 6/10 and 10/10).
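To make the rule‐based, configuration‐driven idea concrete, here is a minimal sketch in Python. The rule set, the marker names, and the annotation codes (loosely modelled on CHAT‐style conventions) are assumptions made for illustration; they are not the configuration format of the usAge pipeline. The sketch only counts and strips codes; it does not remove retraced words.

```python
import re
from collections import Counter

# Hypothetical rule set: in practice such rules would live in a per-language
# configuration file. Each marker rule is (marker name, regex pattern).
CONFIG = {
    "markers": [
        ("filled_pause", r"&(?:uh|um|euh)\b"),  # assumed filled-pause code
        ("false_start", r"\[//\]"),             # assumed retracing code
        ("repetition", r"\[/\]"),               # assumed repetition code
    ],
    # Cleaning rules are applied in order, after markers have been counted.
    "cleaning": [
        (r"&\w+", ""),        # drop filled pauses
        (r"\[[^\]]*\]", ""),  # drop bracketed annotation codes
        (r"\s+", " "),        # collapse runs of whitespace
    ],
}


def preprocess(utterance, config=CONFIG):
    """Return (cleaned text, discursive marker counts) for one utterance."""
    counts = Counter()
    for name, pattern in config["markers"]:
        counts[name] = len(re.findall(pattern, utterance))
    text = utterance
    for pattern, replacement in config["cleaning"]:
        text = re.sub(pattern, replacement, text)
    return text.strip(), counts


cleaned, markers = preprocess("the &uh boy [//] the girl is is [/] washing dishes .")
```

Supporting a new language or interview context would then mean supplying a different `CONFIG` dictionary rather than changing `preprocess`.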

POS tagging

POS tagging tools have proven their effectiveness in recent years and are now widely used in natural language processing tasks. In our work, we used FreeLing 4.0 to analyze and tag transcripts, because it supports many different languages, although its flexibility in tagging words and tagging norms may vary from one language to another. Authors of FreeLing have reported >95% accuracy on journalistic texts; this limitation regarding the training of POS‐taggers is addressed in Section 4. As an addition to this module, we therefore converted tags to a universal form, allowing the following modules to analyze and manipulate transcripts from various corpora. This task was tested on both English and French transcripts but may be used for numerous other languages.
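The conversion to a universal form can be pictured as a per‐language lookup from tagger‐specific tags to a shared tagset. The sketch below is only an illustration: the prefix tables resemble Penn Treebank‐style (English) and EAGLES‐style (French) tags, but they are assumptions and should be checked against the tagger's own documentation, and the universal labels are hypothetical names.

```python
# Illustrative prefix tables (assumptions, not FreeLing's documented tagsets).
# More specific prefixes are listed before shorter, more general ones.
UNIVERSAL_BY_PREFIX = {
    "en": [("NN", "NOUN"), ("VB", "VERB"), ("MD", "AUX"),
           ("JJ", "ADJ"), ("IN", "ADP"), ("CC", "CONJ")],
    "fr": [("NC", "NOUN"), ("NP", "NOUN"), ("VA", "AUX"), ("VM", "VERB"),
           ("SP", "ADP"), ("A", "ADJ"), ("C", "CONJ")],
}


def to_universal(tag, lang):
    """Map a language-specific tag to a universal tag; 'X' if unknown."""
    for prefix, universal in UNIVERSAL_BY_PREFIX[lang]:
        if tag.startswith(prefix):
            return universal
    return "X"
```

Once every corpus is expressed in the same tagset, the downstream modules no longer need to know which language a transcript came from.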

POS adjustment

Because POS tags are statistically determined, some annotation errors may be introduced into transcripts. This module consists mainly of finding the most common mistakes and correcting them programmatically. It evaluates and analyzes the tags, thus improving the quality of the results in the following modules. This module needs to be adapted only once for each language, because it depends on that language's structure and rules. In our work, we adapted it for English and French tags.
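A programmatic correction step of this kind could be sketched as two small rule tables, one keyed on the word itself and one keyed on the preceding tag. The specific rules below are hypothetical examples, not the corrections used in the pipeline.

```python
def adjust(tagged, lexical_fixes, context_fixes):
    """Return a corrected copy of a list of (word, tag) pairs."""
    result = []
    for word, tag in tagged:
        # Lexical rule: a given word carrying a given wrong tag is retagged.
        tag = lexical_fixes.get((word.lower(), tag), tag)
        # Contextual rule: retag based on the previous (already corrected) tag.
        prev_tag = result[-1][1] if result else None
        tag = context_fixes.get((prev_tag, tag), tag)
        result.append((word, tag))
    return result


# Hypothetical per-language rule tables.
LEXICAL_FIXES = {("uh", "NOUN"): "INTJ"}
CONTEXT_FIXES = {("DET", "VERB"): "NOUN"}  # e.g. "the sink" tagged as a verb

fixed = adjust(
    [("the", "DET"), ("sink", "VERB"), ("uh", "NOUN"), ("overflows", "VERB")],
    LEXICAL_FIXES,
    CONTEXT_FIXES,
)
```

Keeping the rules in data rather than in code is one way to confine the per‐language adaptation to the rule tables.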

POS distribution measurement

To measure the distribution of POS tags, the frequency and ratio of the following tags were evaluated: adjectives, conjunctions, nouns, prepositions, verbs, and auxiliary verbs. Because the POS tagging module universalizes tags, this process can be applied to different languages.
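As a rough sketch of this measurement, assume that "frequency" means the raw count of a tag and "ratio" means that count divided by the total number of tagged tokens; the paper does not spell out these definitions, and the universal tag names are hypothetical.

```python
from collections import Counter

# Universal tags for the six categories listed above (names are assumptions).
TRACKED = ("ADJ", "CONJ", "NOUN", "ADP", "VERB", "AUX")


def pos_distribution(tags):
    """Return the frequency and ratio of each tracked tag in a transcript."""
    counts = Counter(tags)
    total = len(tags)
    return {
        tag: {
            "frequency": counts[tag],
            "ratio": counts[tag] / total if total else 0.0,
        }
        for tag in TRACKED
    }


dist = pos_distribution(["DET", "NOUN", "AUX", "VERB", "DET", "NOUN"])
```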

Linguistic measurement

Linguistic characteristics were automatically extracted within this module. We used the most common linguistic measures from previous works that have shown a significant correlation with the disease (e.g., Brunet's index, type‐token ratio, Honore's statistic). Because these characteristics are based on straightforward distribution of words and lemmas, this module is language‐independent.
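The three measures named above have commonly used definitions in terms of the number of tokens N, the number of distinct types V, and the number of types occurring exactly once V1: type‐token ratio = V / N, Brunet's index W = N^(V^-0.165), and Honoré's statistic R = 100 · ln(N) / (1 - V1 / V). The sketch below uses these textbook forms; the exact variants and tokenization used in the pipeline are not stated here, so treat them as assumptions.

```python
import math
from collections import Counter


def lexical_richness(tokens):
    """Compute three lexical richness measures from a list of word tokens."""
    if not tokens:
        raise ValueError("at least one token is required")
    counts = Counter(token.lower() for token in tokens)
    n = sum(counts.values())                        # N: number of tokens
    v = len(counts)                                 # V: number of types
    v1 = sum(1 for c in counts.values() if c == 1)  # V1: types used once
    return {
        "ttr": v / n,
        "brunet": n ** (v ** -0.165),
        # Honore's statistic is undefined when every type occurs once.
        "honore": 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf"),
    }


measures = lexical_richness("the boy is on the stool and the girl is laughing".split())
```

Because only counts of word forms are involved, the same function applies unchanged to lemmas or to transcripts in another language.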

Phonetic measurement

For phonetic characteristics, we used the python_speech_features 0.6 tool to estimate the first 13 mel‐frequency cepstral coefficients (MFCCs). We then estimated the mean, kurtosis, skewness, and variance of those values. Audio files of interviews normally consist of a patient and an interviewer speaking, so we segmented the audio to keep only the patient. In future work, speaker diarization could be used to extract the patient's voice and thus increase the accuracy of the phonetic measurements.
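The summary step can be sketched without any audio library: given one row of coefficients per frame, compute the four statistics per coefficient. This sketch assumes population (biased) moments and excess kurtosis, which the paper does not specify, and it takes the frame matrix as given.

```python
def summarize(frames):
    """Summarize each coefficient across frames.

    frames: list of per-frame coefficient lists (e.g. 13 MFCCs per frame).
    With python_speech_features, a call such as
    mfcc(signal, samplerate, numcep=13) would yield one such row per frame.
    """
    n = len(frames)
    stats = []
    for values in zip(*frames):  # iterate over coefficients, not frames
        mean = sum(values) / n
        m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
        m3 = sum((x - mean) ** 3 for x in values) / n
        m4 = sum((x - mean) ** 4 for x in values) / n
        stats.append({
            "mean": mean,
            "variance": m2,
            "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
            "kurtosis": m4 / m2 ** 2 - 3 if m2 else 0.0,  # excess kurtosis
        })
    return stats


summary = summarize([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
```

With 13 coefficients and four statistics each, this step would contribute 52 phonetic features per recording.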

RESULTS

To understand all our linguistic and phonetic measures and how they interact with AD, we performed a correlation analysis. All in all, we extracted 100 features, separated into four different categories: discursive markers, POS distribution, linguistic characteristics, and phonetic characteristics. We also included information coverage measures that were presented in the work of Hernández‐Domínguez. We then ran a feature selection process to extract the most valuable features. With the selected features, we trained different predictive models and evaluated their performance with a 10‐fold cross‐validation, as presented in Table 2. Finally, we analyzed the correlation of the extracted measures with the disease.
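As an illustration of the correlation step, the sketch below computes a Pearson coefficient between one feature and a binary diagnosis label (equivalent to a point‐biserial correlation). The choice of coefficient is an assumption, since the text does not name one, and the numbers are toy values, not data from either corpus.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant feature carries no correlation information
    return cov / (spread_x * spread_y)


# Toy values only: false-start counts for six hypothetical transcripts,
# with label 1 = impaired and 0 = healthy control.
false_starts = [0, 1, 0, 3, 4, 2]
diagnosis = [0, 0, 0, 1, 1, 1]
r = pearson(false_starts, diagnosis)
```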

TABLE 2

Average AUC on 10‐fold cross‐validation models with different feature type combinations (baseline = decision tree classifier)

CRIUGM^a Corpus (French)
Feature types	Model	AUC	F‐score	F‐score baseline
Cov‐ling‐phon	SVM	0.92	0.91	0.83
Cov‐phon‐pos	SVM	0.92	0.97	0.91
Cov‐ling‐phon‐pos	SVM	0.90	0.93	0.81
Markers‐ling‐phon‐pos	SVM	0.89	0.96	0.77
Cov‐phon	SVM	0.89	0.90	0.91
Markers‐ling	SVM	0.88	0.89	0.85
Markers‐cov‐ling	SVM	0.88	0.86	0.82
Markers‐cov‐ling	RFC	0.86	0.86	0.82
Markers‐ling‐phon	RFC	0.86	0.90	0.81
Markers‐cov‐phon‐pos	SVM	0.86	0.93	0.86

Abbreviations: AUC, area under the curve; cov, information coverage features; ling, linguistic features; markers, discursive markers features; phon, phonetic features; POS, parts of speech; pos, POS distribution features.

^a Centre de recherche de l'Institut universitaire de gériatrie de Montréal

Discursive markers

Discursive markers distinguished healthy controls from AD patients remarkably well. Among these markers, one of the features most correlated with the disease is the number of false starts, in both the English and French corpora (respectively, 0.26 and 0.62). We hypothesize that patients with AD tend to forget how to describe an object or a person, which forces them to retrace their sentences. We also found an inverse correlation with the number of synonyms extracted from transcripts in both languages (respectively, –0.13 and –0.31). This could be explained by AD patients using a less varied vocabulary when describing an image. Finally, the number of repetitions detected in both corpora correlates positively with the disease (respectively, 0.35 and 0.37), which is consistent with previous studies.

POS distribution

For the POS tag distribution, auxiliary verb frequencies were not correlated in the same way in English and in French. We found that in French, the correlation was positive (0.28), while in English it was negative (–0.16). This could be because auxiliary verbs cannot necessarily be translated in the same way between the languages (e.g., Je suis allé à l’école: I went to school), and therefore measures may vary. Similarly, conjunctions and adjectives did not have the same type of correlation in English and French. On the other hand, we found that AD patients tend to use fewer nouns in both languages, which is consistent with previous findings. That being said, a POS distribution should be considered and analyzed in each language separately, because it does not necessarily have the same representation in each case.

Linguistic characteristics

For the Pitt Corpus, lexical richness correlations were mostly consistent with previous studies. With the CRIUGM dataset, most measures were inconsistent with the results obtained with the Pitt Corpus, and indeed, were sometimes highly correlated with the disease (e.g., Yule's characteristic K [0.44]). We believe that this could be due to the size of the dataset, which is very small, compared to the English dataset. Nonetheless, this module may be considered a benchmark, because the results match those of the same experiment conducted on the Pitt Corpus.

Phonetic characteristics

Considering phonetic characteristics, results with the Pitt Corpus are relatively consistent with previous studies. There may have been some differences in correlation values due to the fact that we segmented the audio to remove the interviewer's voice. For the CRIUGM dataset, some of the MFCCs’ mean, skewness, and variance values were highly correlated with the disease (>0.4). Again, those high correlations might be explained by the size of the dataset and the manual audio segmentation task, which could bias the results.

Modeling

For both corpora, we tested different combinations of feature types, which showed discursive markers to be the most common feature type found in the best predictive models overall. With the Pitt Corpus, our best model had an average area under the curve (AUC) of 76%, which is relatively consistent with previous studies. Looking at the CRIUGM Corpus, our best model had an average AUC of 92%. This notably high result may be explained by the very small dataset size and the high correlation found in multiple features.
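For readers unfamiliar with the metric, the average AUC over cross‐validation folds can be sketched as follows. The AUC is computed here with the pairwise (Mann‐Whitney) formulation, and the per‐fold labels and scores are assumed to come from whatever classifier was trained on the remaining folds; the function names are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


def average_auc(folds):
    """Mean AUC over (labels, scores) pairs, one pair per held-out fold."""
    values = [auc(labels, scores) for labels, scores in folds]
    return sum(values) / len(values)
```

With very small held‐out folds, as in a 10‐fold split of a few dozen samples, each fold's AUC rests on only a handful of positive/negative pairs, so the average can vary considerably between runs.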

DISCUSSION

This work contributes in many ways to improving the quality and efficiency of transcript and audio preprocessing to extract measures that characterize linguistic and phonetic functions. Furthermore, we expand its use by making the processing adaptable to many different languages. Results demonstrate its consistency with previous studies, as well as with a new cohort of French participants. Although we suspect that FreeLing POS‐taggers are not entirely reliable for speech data in various languages, the results were sufficiently reliable to build the pipeline. In a future version, we intend to replace this library with spaCy's library, which has been trained on a wider type of texts (including speech). Further research could focus on including languages with different structures and rules, as that could expand its usage. We would also like to include the information coverage measure extraction as part of a new module in our pipeline, as it has proven its capacity to significantly distinguish AD patients from healthy controls. Finally, we believe it would be interesting to compare results between proportionate datasets of different languages to evaluate how the disease may affect cognitive functions in patients differently.

CONFLICTS OF INTEREST

The authors have declared that no conflicts of interest exist.

3 in total

1. Linguistic Features Identify Alzheimer's Disease in Narrative Speech.

Authors: Kathleen C Fraser; Jed A Meltzer; Frank Rudzicz
Journal: J Alzheimers Dis Date: 2016 Impact factor: 4.472

2. The natural history of Alzheimer's disease. Description of study cohort and accuracy of diagnosis.

Authors: J T Becker; F Boller; O L Lopez; J Saxton; K L McGonigle
Journal: Arch Neurol Date: 1994-06

3. Computer-based evaluation of Alzheimer's disease and mild cognitive impairment patients during a picture description task.

Authors: Laura Hernández-Domínguez; Sylvie Ratté; Gerardo Sierra-Martínez; Andrés Roche-Bergua
Journal: Alzheimers Dement (Amst) Date: 2018-03-13