Literature DB >> 29780871

Computer-based evaluation of Alzheimer's disease and mild cognitive impairment patients during a picture description task.

Laura Hernández-Domínguez1, Sylvie Ratté1, Gerardo Sierra-Martínez2, Andrés Roche-Bergua3.   

Abstract

INTRODUCTION: We present a methodology to automatically evaluate the performance of patients during picture description tasks.
METHODS: Transcriptions and audio recordings of the Cookie Theft picture description task were used. With 25 healthy elderly control (HC) samples and an information coverage measure, we automatically generated a population-specific referent. We then assessed 517 transcriptions (257 Alzheimer's disease [AD], 217 HC, and 43 mild cognitively impaired samples) according to their informativeness and pertinence against this referent. We extracted linguistic and phonetic metrics which previous literature correlated to early-stage AD. We trained two learners to distinguish HCs from cognitively impaired individuals.
RESULTS: Our measures correlated significantly (P < .001) with the severity of the cognitive impairment and with the Mini-Mental State Examination score. The classification sensitivity was 81% (area under the curve of receiver operating characteristics [AUC] = 0.79) between HCs and AD, and 85% (AUC = 0.76) between HCs and cognitively impaired participants (AD and mild cognitively impaired).
DISCUSSION: An automated assessment of a picture description task could assist clinicians in the detection of early signs of cognitive impairment and AD.

Keywords:  Alzheimer's disease (AD); Automatic assessment; Information coverage; Linguistic analysis; Machine learning; Mild cognitive impairment (MCI); Phonetic features; Picture description task

Year:  2018        PMID: 29780871      PMCID: PMC5956933          DOI: 10.1016/j.dadm.2018.02.004

Source DB:  PubMed          Journal:  Alzheimers Dement (Amst)        ISSN: 2352-8729


Introduction and motivation

Multiple studies have assessed language functions as early markers of Alzheimer's disease (AD) [1]. Consequently, language is now widely accepted to be one of the first cognitive abilities affected by this dementia. Some of the most commonly used tests in clinical practice are Verbal Fluency by categories, Picture Description, the Boston Naming Test [2], and the Token Test [3], which measure expository speech, oral expression, and comprehension of commands, respectively [4]. This exploration of the changes in language functions derived from AD has attracted significant attention among scientists outside the field of medicine [5]. Researchers, especially those working in natural language processing, have proposed computer-based approaches for automatic and semiautomatic analysis of language in patients suffering from AD [6], [7], [8], [9], [10], [11], [12], [13]. In this work, we propose a methodology to automatically describe patients' performance during a picture description task [14]. We selected this type of test because it elicits spontaneous speech from patients, allowing us to describe not only patients' ability to retrieve information from a visual stimulus but also some of their linguistic characteristics. Our evaluation describes three aspects: the informativeness and pertinence of the description provided by the patient, some linguistic characteristics, such as vocabulary richness and general use of part-of-speech categories, and a phonetic overview.

Information coverage

One of the key objectives of a picture description task is to measure the amount and quality of the information that a patient can provide from a visual stimulus. Even early in the course of the disease, AD patients have been shown to provide less informative descriptions than cognitively intact elderly adults [15]. This measure is generally made by comparing the description provided by the patient to a list containing the main information content units (ICUs) of the image, namely, actors, objects, actions, and places. Over the years, several authors have come up with predefined lists of ICUs for the Cookie Theft picture description task [16], [17], [18], [19], [20], [21]. However, one of the disadvantages of using predefined lists to evaluate elderly patients is that the list author does not necessarily have a similar education level, age, focus, cultural background, and interests as the target population. Also, different authors may come up with different lists, depending on their idiosyncrasies, their own observations, and what they may consider “important” from the picture.

Related computational works

Hakkani-Tür et al. [22] used a manually predefined list as a referent to automatically compare descriptions of the Western Aphasia Battery's Picnic Picture. The authors found a high correlation between the traditional manual assessment and their automated approach. However, the computer-based evaluation had trouble handling ICUs expressed in multiple ways. Pakhomov et al. [23] used manual transcriptions of descriptions of the Cookie Theft picture to assess the performance of patients with frontotemporal lobar degeneration. They compiled a list of predefined ICUs based on Yorkston and Beukelman's study [16] and manually extended it to include lexical and morphological variants of words and phrases. One drawback of this method is that it entails the manual creation of a list that considers as many variants as possible for each ICU. Fraser et al. [24] used a semiautomatic approach to automatically classify Alzheimer's patients and healthy elderly controls (HCs) by analyzing manual transcriptions of descriptions of the Cookie Theft picture in the Pitt Corpus [25]. As a referent, the authors used the predefined list proposed by Croisile et al. [19] and evaluated the frequency of key words used to name the ICUs in different ways. As in Pakhomov et al.'s work [23], manually considering all the ICUs and their linguistic variations is a time-consuming task. Yancheva and Rudzicz [26] automatically extracted the main ICUs retrieved by elderly adults in the Pitt Corpus. The authors contrasted automatically extracted ICUs to a combination of several predefined lists of ICUs. They retrieved most of the human-selected ICUs. In addition, they found that some participants mentioned the object apron, a new ICU that none of the specialists had perceived before. They also observed that HCs were more prone than AD patients to mention this object in their descriptions. 
The appreciation of the fact that a woman is wearing an apron while doing housework could be attributed to a generational and cultural perception of what the object apron represented to elderly participants taking the test back in the 1980s. Different remarks may be attributable to cultural differences. For example, a non-Caucasian-predominant population may remark on the fact that all the subjects in the Cookie Theft picture are blond. Hence, we consider that a fairer referent for comparison in this task should be constructed by healthy participants of the target population. As such, it would be possible to create referents that are adapted to specific populations from different generations, cultures, and educational and general socioeconomic backgrounds.

The coverage measure

We identify three important tasks for performing a computer-based evaluation of a picture description task: (1) creating a population-adapted referent; (2) evaluating the informativeness of descriptions, i.e., estimating how much of the information in the referent is covered by the participant; and (3) evaluating the pertinence of utterances, i.e., determining how much of what the participant says is covered by the referent. Some participants, particularly those with AD, can drift off-topic. Although this situation is easily detected when performing a manual evaluation, it is a challenging task for an automated analysis. With these tasks in mind, we selected the information coverage measure proposed by Velazquez [27], who originally proposed the method for comparing the coverage of information in news articles, although it can be used in different contexts. Velazquez's methodology includes the creation of a referent against which information coverage is evaluated. One distinguishing feature of his measure is that it uses linguistic patterns that allow the context to be taken into account. In addition, the measure allows a two-way analysis of information coverage: from the referent by the subject of comparison and vice versa. These two directions allow the estimation of informativeness and pertinence, respectively.

Linguistic characteristics

There is extensive literature covering the analysis of the linguistic characteristics of AD patients [6], [7], [24], [28], [29], [30], [31], [32], [33], [34]. As part of our evaluation, we selected those that most authors have found to correlate significantly with the disease and that could be used in picture description tasks (Table 1). In Section 2.3, we provide further information about the methodology and tools used for extracting these characteristics.
Table 1

Linguistic characteristics selected to evaluate patients' language functions

Measure | Equation | Interpretation
Text size | N | Number of words used in a text
Vocabulary size | V | Number of different lemmas
Hapax legomena | V1 | Number of lemmas mentioned only once
Hapax dislegomena | V2 | Number of lemmas mentioned exactly twice
Brunet's W index [35] | W = N^(V^(−c)) | Rationalization of the size of the vocabulary and the length of the text. W is stable when c has values between 0.165 and 0.172 [36]. We used c = 0.172, the original value proposed by Brunet.
Honoré's R statistics [37] | R = (100 · log N) / (1 − V1/V) | A measure based on the ratio of hapax legomena, the vocabulary size, and the length of the text
Type-token ratio (TTR) | TTR = V1/V | TTR measures the ratio of hapax legomena to the size of the vocabulary. It can be sensitive to the size of the sample [38].
Sichel's S [39] | S = V2/V | Similar to TTR, but using hapax dislegomena, making it more robust against samples of different sizes [40]
Yule's characteristic K [41] | K = 10^4 · [(Σ_{i=1..N} i² · V(i,N)) / N² − 1/N] | Yule's K is a measure of lexical repetition considered to be text-length independent. The number of lemmas of frequency i, V(i,N), is used to measure the frequency distribution of the text.
Entropy | H(X) = −Σ_{x∈X} p(x) · log₂ p(x) | Entropy measures the uniformity of the vocabulary, where p(x) is the probability of a word x occurring in the text X. We measured the general entropy of the complete text and the average entropy of sentences.

Lemmas refer to words without inflections (in their canonical form).
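As an illustration, the measures in Table 1 can be computed directly from a list of lemmas. The sketch below is ours, not the authors' implementation; function and variable names are illustrative.

```python
import math
from collections import Counter

def richness_measures(lemmas, c=0.172):
    """Vocabulary-richness measures of Table 1 from a list of lemmas."""
    n = len(lemmas)                                   # text size N
    counts = Counter(lemmas)
    v = len(counts)                                   # vocabulary size V
    v1 = sum(1 for f in counts.values() if f == 1)    # hapax legomena V1
    v2 = sum(1 for f in counts.values() if f == 2)    # hapax dislegomena V2
    freq_of_freq = Counter(counts.values())           # V(i, N)
    yule_k = 1e4 * (sum(i * i * vi for i, vi in freq_of_freq.items()) / n ** 2
                    - 1.0 / n)
    entropy = -sum((f / n) * math.log2(f / n) for f in counts.values())
    return {
        "N": n, "V": v, "V1": v1, "V2": v2,
        "brunet_w": n ** (v ** -c),                   # W = N^(V^-c)
        "honore_r": (100 * math.log(n) / (1 - v1 / v)) if v1 < v else float("inf"),
        "sichel_s": v2 / v,
        "yule_k": yule_k,
        "entropy": entropy,
    }
```

For example, for the lemma sequence a, b, b, c, c, c, the function yields N = 6, V = 3, one hapax legomenon, and one hapax dislegomenon.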


Part-of-speech distribution

We made an evaluation of the frequency and ratio of adjectives, conjunctions, nouns, prepositions, and verbs per 100 words. We also evaluated the frequency of auxiliary verbs and their ratio to the total number of verbs.
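These frequencies and ratios can be sketched as below; note that the coarse tag names are illustrative placeholders, not FreeLing's actual tagset.

```python
from collections import Counter

def pos_per_100_words(tagged):
    """Frequency per 100 words of selected part-of-speech categories.

    `tagged` is a list of (token, tag) pairs; the tag names used here
    (ADJ, CONJ, ...) are illustrative, not FreeLing's real tagset.
    """
    total = len(tagged)
    counts = Counter(tag for _, tag in tagged)
    per100 = {cat: 100.0 * counts[cat] / total
              for cat in ("ADJ", "CONJ", "NOUN", "PREP", "VERB", "AUX")}
    n_verbs = counts["VERB"] + counts["AUX"]          # auxiliaries are verbs too
    per100["aux_to_verb_ratio"] = counts["AUX"] / n_verbs if n_verbs else 0.0
    return per100
```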

Vocabulary richness

Several measures have been explored to evaluate the richness of an author's language. These same measures can be used to evaluate the variability of the vocabulary of patients during a picture description task.

Phonetic analysis

Several authors [23], [24], [35], [36], [37], [38] have found significant differences in the audio signals produced by AD patients as compared to cognitively intact elderly individuals. Mel-Frequency Cepstral Coefficients (MFCCs) are among the most used features for automatic speech analysis. Only the first 12 to 13 MFCCs are usually used because most of the information about the transfer function of the vocal tract is in the lower range of frequencies.
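To make the resulting feature set concrete: assuming the MFCC frames have already been extracted (e.g., over 25-ms windows), the per-coefficient summary statistics can be computed as in this sketch of ours (not the authors' code).

```python
import math

def mfcc_summary(frames):
    """Mean, variance, skewness, and excess kurtosis for each MFCC
    coefficient across frames; `frames` is a list of equal-length
    lists, one per analysis window."""
    n = len(frames)
    feats = {}
    for j in range(len(frames[0])):
        xs = [frame[j] for frame in frames]
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / n
        sd = math.sqrt(var)
        skew = sum(((x - mean) / sd) ** 3 for x in xs) / n if sd else 0.0
        kurt = (sum(((x - mean) / sd) ** 4 for x in xs) / n - 3.0) if sd else 0.0
        feats[f"mfcc{j + 1}"] = {"mean": mean, "variance": var,
                                 "skewness": skew, "kurtosis": kurt}
    return feats
```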

Methods

Corpus

For this work, we used the Pitt Corpus [25] of the DementiaBank database. This corpus contains audio recordings and manual transcriptions of participants undertaking the standard Cookie Theft picture description task from the Boston Diagnostic Aphasia Examination [14]. This password-protected data set is available upon request for research purposes. The participants of the corpus are mainly HCs, probable and possible AD patients, and mild cognitively impaired (MCI) subjects. We excluded other diagnoses from this study due to their scarce numbers in the corpus. In this work, we did not differentiate between probable and possible diagnoses of AD. The main inclusion criterion for our study was that both the transcripts and audio files of the participant were present for each test. We studied 262 participants, with a total of 517 tests (see Table 2). Twenty-five other HC subjects and their tests were set aside for creating the referent. These subjects were not part of the experimentation sample.
Table 2

Distribution of interviews used for experimentation

Variable | All (n = 517) | AD (n = 257) | HC (n = 217) | MCI (n = 43)
Participants | 262 | 169 | 74 | 19
Gender
 Male | 189 | 87 | 75 | 27
 Female | 328 | 170 | 142 | 16
Education (years)
 6–9 | 55 | 51 | 2 | 2
 10–12 | 200 | 112 | 79 | 9
 13–16 | 209 | 76 | 111 | 22
 17+ | 53 | 18 | 25 | 10
Age (years)
 Under 50 | 6 | 0 | 5 | 1
 50–59 | 81 | 21 | 57 | 3
 60–69 | 188 | 81 | 94 | 13
 70–79 | 190 | 111 | 57 | 22
 80+ | 52 | 44 | 4 | 4

Abbreviations: AD, Alzheimer's disease; HC, healthy elderly control; MCI, mild cognitive impairment; n, number of tests.


Extraction of information coverage measures

Adaptation of the coverage measure

Velazquez's [27] measure uses duplets of linguistic patterns to find the degree to which a referent R is covered by a subject of comparison S. We selected the active voice patterns proposed by Velazquez, given the expository speech nature used during picture description tasks (see Table 3).
Table 3

Active voice linguistic patterns used for the coverage measure (reproduced from Velazquez [27])

p in R | p in S | Interpretation | Example
N-V | N-V | Subject + action | "boy stealing"
V-N | V-N | Action over an object | "stealing cookies"
P-N | P-N | Locations + indirect objects | "in kitchen"
N-V | V-N | Subject + action + object | "woman washing dishes"

Abbreviations: p, pattern; R, referent; S, subject of comparison; N, noun; V, verb; P, preposition.

Velazquez splits the text into sentences; for our study, we split it into utterances. The comparison of utterance patterns follows Velazquez's coverage equation, in which R is the referent, S is the document that is the subject of comparison, and p is a linguistic pattern. The parameters α modify each pattern's weight. For this work, all patterns were considered to weigh equally; all α parameters were thus set to 1.
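The two-way use of the measure can be sketched as follows. This is a simplified, equal-weight instantiation of the coverage idea, not Velazquez's exact formulation.

```python
def coverage(target_patterns, covering_patterns, alpha=None):
    """Weighted fraction of the pattern duplets in `target_patterns`
    that also occur in `covering_patterns`. With the patient's patterns
    covering the referent this estimates informativeness; with the
    referent covering the patient's utterances it estimates pertinence.
    All weights alpha default to 1, as in this work."""
    alpha = alpha or {}
    covering = set(covering_patterns)
    total = sum(alpha.get(p, 1.0) for p in target_patterns)
    hit = sum(alpha.get(p, 1.0) for p in target_patterns if p in covering)
    return hit / total if total else 0.0
```

For example, a referent containing the duplets ("boy", "steal") and ("water", "run") is 50% covered by a description that produces only ("boy", "steal").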

Automatic preprocessing

We cleaned the original raw text to apply the information coverage measure as follows. First, we removed all marks of repetitions, hesitations, incomplete words, and pauses. Second, we standardized the names of the most prominent ICUs; for example, all mentions of the words "brother", "lad", "kid", etc., were automatically replaced by "boy", as shown in Fig. 1.
Fig. 1

Linguistic variations of ICUs in the Cookie Theft picture description task. The standardized name of each group is shown. Abbreviation: ICUs, information content units.

We used FreeLing 4.0 [39] to tag the transcripts with their lemmas and parts of speech. Two consecutive nouns were treated as a single noun. Some authors have found differences in the use of adjectives between AD patients and HCs [8], and during the picture description task participants commonly describe objects with adjectives. To take these rich descriptions into account, we joined each adjective preceded by the verb "to be" to that verb by means of a forward slash. All part-of-speech tags that did not appear in the linguistic patterns were discarded for the comparison.
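The ICU standardization step amounts to applying a dictionary of synonym groups to the raw text. In the sketch below, the "boy" variants come from the example in the text, while the "cookie" group is a hypothetical illustration, not the paper's actual list.

```python
import re

# Synonym groups mapping ICU variants to a standardized name. The "boy"
# variants are from the paper's example; the "cookie" group is a
# hypothetical illustration.
ICU_GROUPS = {
    "boy": ("brother", "lad", "kid"),
    "cookie": ("biscuit",),
}

def standardize_icus(text):
    """Replace every linguistic variant of an ICU with its standard name."""
    for standard, variants in ICU_GROUPS.items():
        for variant in variants:
            text = re.sub(rf"\b{variant}\b", standard, text, flags=re.IGNORECASE)
    return text
```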

Creation of the referent

Using Velazquez's coverage measure, we created a referent that included the patterns extracted from tests taken by HCs from the same corpus. These entries were excluded from the evaluation sample. To create the referent, we selected all 25 HCs who were tested only once, aiming for the referent to be as diverse as possible while not significantly reducing the number of samples left for the evaluation. For each utterance, if the utterance was not already at least 80% covered by the referent, its patterns were added to the referent. Thus, we automatically created an incremental referent that captured the different ways HCs describe similar actions and situations. The following are real examples of patterns in the referent: water(N) run(V); water(N) overflow(V); water(N) spill(V); water(N) flow(V); water(N) splash(V); spill(V) water(N); kitchen/water(N) overflow(V).
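The incremental construction can be sketched as follows, assuming each HC test has already been reduced to a list of pattern duplets per utterance (an illustrative sketch of the procedure, not the authors' code):

```python
def build_referent(hc_utterances, threshold=0.8):
    """Incrementally build a referent from HC utterances: an utterance's
    patterns are added only when the referent does not already cover at
    least `threshold` (80% here, as in the text) of them."""
    referent = set()
    for patterns in hc_utterances:
        if not patterns:
            continue
        covered = sum(1 for p in patterns if p in referent) / len(patterns)
        if covered < threshold:
            referent.update(patterns)
    return referent
```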

Scoring participants' performance

Informativeness was estimated by measuring how much of the information in the referent was covered by the participant. To measure the pertinence, we estimated how much of what the participant said was covered by the referent. A low pertinence coverage may indicate that the participant was drifting off-topic. Emulating a typical clinical scoring of a picture description test, we counted the number of utterances from the referent that exceeded an informativeness threshold and a pertinence threshold. We tested three different thresholds: 60%, 80%, and 100%. We also estimated the sum of the informativeness and pertinence.
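Given one coverage value per utterance, the counting scheme above reduces to a few lines (our sketch; names are illustrative):

```python
def score_description(utterance_coverages, thresholds=(0.60, 0.80, 1.00)):
    """Count utterances whose coverage meets each threshold and sum the
    coverage values, mirroring the scoring described above."""
    scores = {f"t={int(t * 100)}%": sum(1 for c in utterance_coverages if c >= t)
              for t in thresholds}
    scores["sum"] = sum(utterance_coverages)
    return scores
```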

Extraction of linguistic and phonetic characteristics

For extracting the linguistic characteristics, we performed standard natural-language preprocessing by removing all marks of repetitions, hesitations, incomplete words, and pauses. We used FreeLing 4.0 [39] to automatically tag the transcripts with their lemmas and parts of speech, and then automatically extracted the linguistic characteristics mentioned in Section 1.2. We used python_speech_features 0.6 [40] to estimate the first 13 MFCC values of the sound waves in 25-ms segments. As in Fraser et al. [24], our features consisted of the mean, kurtosis, skewness, and variance of the values of each MFCC.

Automatic classification

To automatically discriminate between HCs and cognitively impaired individuals, we used two widely recognized machine learning algorithms: the Support Vector Machine (SVM) [41] and the Random Forests Classifier [42]. Asgari et al. [43] present a succinct and accessible explanation of both algorithms and of their use in a linguistic analysis for detecting MCI. For our evaluation, we performed two binary classification experiments: first, a classification between participants with AD and HCs; then, we added the MCI participants to the sample and classified HCs against cognitively impaired participants. The sample of MCI participants was too small to be used as a learning class on its own. We used 90% of the evaluation sample as the training set and 10% as the test set, performing a 10-fold cross-validation (see Fig. 2). We report the average of the 10 test classifications.
Fig. 2

Data set partitioning during a 10-fold cross-validation process to evaluate classifiers. The blue section indicates the part of the dataset that is being used as test, while the remaining gray area indicates the part of the dataset being used as training set in each fold.

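The partitioning in Fig. 2 corresponds to the standard k-fold scheme; a minimal stdlib sketch (our illustration, not the study's code):

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k-fold cross-validation:
    each fold holds out a contiguous ~1/k block as the test set (the
    blue block in Fig. 2) and trains on the remaining samples."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```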

Results

Feature analysis

Our evaluation of participants' picture descriptions covered a total of 105 features, divided into information coverage measures and linguistic and phonetic characteristics. We estimated their correlation with the severity of the cognitive impairment diagnosis (healthy = 0, MCI = 1, and AD = 2) and with the Mini–Mental State Examination results of participants. These correlations are reported in Table 4.
Table 4

Correlations∗ of features with the severity of cognitive impairment and with the MMSE

Correlation to cognitive impairment | Correlation to MMSE score
Variable | Corr. | Variable | Corr.
Informativeness t = 100% | −0.408 | Informativeness t = 100% | 0.443
Informativeness t = 80% | −0.388 | Informativeness t = 80% | 0.437
Informativeness score | −0.334 | Informativeness score | 0.429
Informativeness t = 60% | −0.333 | Informativeness t = 60% | 0.372
Informativeness variance | −0.257 | Informativeness variance | 0.338
Hapax legomena | −0.254 | Auxiliary verb frequency | 0.305
Pertinence t = 100% | −0.222 | Hapax legomena | 0.265
Auxiliary verb frequency | −0.216 | Auxiliary verb rate | 0.241
MFCC-12 kurtosis | 0.205 | Noun frequency | 0.226
Pertinence t = 80% | −0.201 | Preposition rate | 0.194
MFCC-8 kurtosis | 0.198 | Pertinence t = 100% | 0.192
MFCC-12 skewness | −0.185 | General entropy | 0.189
Noun frequency | −0.183 | Vocabulary size | 0.187
Honoré's R statistics | −0.180 | Pertinence t = 80% | 0.183
MFCC-10 kurtosis | 0.163 | Honoré's R statistics | 0.177
Conjunction rate | 0.163 | Preposition frequency | 0.175
Vocabulary size | −0.156 | MFCC-12 skewness | 0.173

Abbreviations: MMSE, Mini–Mental State Examination; Corr, correlation; t, threshold; MFCC, Mel-Frequency Cepstral Coefficient.

All correlations presented with P value < .001. Variables are shown in descending order with respect to the strength of their correlation.

Controlled for education, age, and gender.

We additionally analyzed the correlation of the features with respect to age, gender, and education. Our findings are reported in Table 5.
Table 5

Correlations∗ of features with socioeconomic variables

Variable | Corr.
Age
 MFCC-3 kurtosis | −0.200
 Conjunction freq. | 0.182
 Brunet's W index | 0.179
 General entropy | 0.177
 Auxiliary verb freq. | 0.175
 MFCC-1 variance | 0.172
 MFCC-6 kurtosis | −0.168
 MFCC-5 kurtosis | −0.164
 MFCC-9 kurtosis | −0.158
 Informativeness score | 0.156
Gender
 MFCC-10 kurtosis | −0.239
 MFCC-12 variance | 0.181
 MFCC-13 variance | 0.179
 MFCC-5 skewness | 0.175
 MFCC-5 variance | 0.174
 MFCC-8 skewness | −0.169
 MFCC-10 skewness | 0.163
Education
 Preposition freq. | 0.230
 Hapax legomena | 0.222
 Vocabulary size | 0.219
 Text size | 0.207
 General entropy | 0.201
 Adjective freq. | 0.200
 Conjunction freq. | 0.191
 Noun freq. | 0.190
 Informativeness t = 60% | 0.190
 Auxiliary verb freq. | 0.184
 Verb freq. | 0.172
 Informativeness score | 0.169
 Brunet's W index | 0.167

Abbreviations: Corr, correlation; MFCC, Mel-Frequency Cepstral Coefficient; freq., frequency.

All correlations presented with P value < .001. Variables are shown in descending order with respect to the strength of their correlation.

Age correlations controlled for education, gender, and cognitive status.

Gender correlations controlled for age, education, and cognitive status.

Education correlations controlled for age, gender, and cognitive status.


Binary classification

We tested the performance of the algorithms with each type of feature independently and then with combinations of them. The results of the first and second experiments are shown in Tables 6 and 7, respectively. The best model represents the performance of the algorithm that achieved the highest area under the curve of receiver operating characteristics (AUC) during the 10-fold cross-validation process.
Table 6

Performance of classifiers separating HCs from AD patients

Learner | Features | Accuracy | Sensitivity | Specificity | Precision | F-score | AUC
Average performance
 RFC | Ling | 0.72 | 0.76 | 0.67 | 0.74 | 0.75 | 0.72
 SVM | Ling | 0.75 | 0.75 | 0.74 | 0.77 | 0.76 | 0.75
 RFC | Cov | 0.73 | 0.78 | 0.67 | 0.73 | 0.75 | 0.72
 SVM | Cov | 0.74 | 0.80 | 0.67 | 0.74 | 0.77 | 0.74
 RFC | Phon | 0.59 | 0.66 | 0.52 | 0.62 | 0.64 | 0.59
 SVM | Phon | 0.62 | 0.70 | 0.52 | 0.63 | 0.66 | 0.61
 RFC | Cov + Ling | 0.78 | 0.84 | 0.72 | 0.78 | 0.81 | 0.78
 SVM | Cov + Ling | 0.79 | 0.79 | 0.78 | 0.82 | 0.80 | 0.79
 RFC | Best | 0.75 | 0.78 | 0.71 | 0.76 | 0.77 | 0.74
 SVM | Best | 0.79 | 0.81 | 0.77 | 0.81 | 0.81 | 0.79
Best model
 RFC | Ling | 0.81 | 0.77 | 0.86 | 0.87 | 0.82 | 0.82
 SVM | Ling | 0.85 | 0.85 | 0.86 | 0.88 | 0.86 | 0.85
 RFC | Cov | 0.85 | 0.88 | 0.82 | 0.85 | 0.87 | 0.85
 SVM | Cov | 0.85 | 0.88 | 0.82 | 0.85 | 0.87 | 0.85
 RFC | Phon | 0.67 | 0.65 | 0.68 | 0.71 | 0.68 | 0.67
 SVM | Phon | 0.72 | 0.84 | 0.57 | 0.70 | 0.76 | 0.71
 RFC | Cov + Ling | 0.94 | 1.00 | 0.86 | 0.90 | 0.95 | 0.93
 SVM | Cov + Ling | 0.88 | 0.81 | 0.95 | 0.95 | 0.88 | 0.88
 RFC | Best | 0.85 | 0.85 | 0.86 | 0.88 | 0.86 | 0.85
 SVM | Best | 0.87 | 0.80 | 0.95 | 0.95 | 0.87 | 0.88

Abbreviations: AD, Alzheimer's disease; AUC, area under the curve of receiver operating characteristics; HCs, healthy elderly controls; RFC, Random Forests Classifier; SVM, Support Vector Machine classifier; Ling, set of all linguistic features; Cov, set of all information coverage features; Phon, set of all phonetic features; Cov + Ling, a combination of all linguistic and information coverage features.

NOTE. The best results are indicated in bold.

Best = a combination of all features with P value < .001 when correlating with cognitive impairment.

Table 7

Performance of classifiers separating HCs from cognitively impaired patients (AD or MCI, indistinctly)

Learner | Features | Accuracy | Sensitivity | Specificity | Precision | F-score | AUC
Average performance
 RFC | Ling | 0.70 | 0.78 | 0.59 | 0.73 | 0.75 | 0.69
 SVM | Ling | 0.72 | 0.80 | 0.61 | 0.74 | 0.77 | 0.70
 RFC | Cov | 0.74 | 0.83 | 0.61 | 0.75 | 0.79 | 0.72
 SVM | Cov | 0.73 | 0.86 | 0.56 | 0.73 | 0.79 | 0.71
 RFC | Phon | 0.59 | 0.79 | 0.31 | 0.61 | 0.69 | 0.55
 SVM | Phon | 0.61 | 0.81 | 0.33 | 0.62 | 0.70 | 0.57
 RFC | Cov + Ling | 0.76 | 0.84 | 0.66 | 0.77 | 0.81 | 0.75
 SVM | Cov + Ling | 0.78 | 0.85 | 0.68 | 0.78 | 0.82 | 0.76
 RFC | Best | 0.77 | 0.82 | 0.69 | 0.78 | 0.80 | 0.75
 SVM | Best | 0.75 | 0.82 | 0.65 | 0.76 | 0.79 | 0.73
Best model
 RFC | Ling | 0.78 | 0.80 | 0.76 | 0.83 | 0.81 | 0.78
 SVM | Ling | 0.87 | 0.87 | 0.86 | 0.90 | 0.88 | 0.87
 RFC | Cov | 0.83 | 0.87 | 0.77 | 0.84 | 0.85 | 0.82
 SVM | Cov | 0.83 | 0.90 | 0.73 | 0.82 | 0.86 | 0.81
 RFC | Phon | 0.67 | 0.90 | 0.36 | 0.66 | 0.76 | 0.63
 SVM | Phon | 0.65 | 0.90 | 0.29 | 0.64 | 0.75 | 0.59
 RFC | Cov + Ling | 0.85 | 0.87 | 0.82 | 0.87 | 0.87 | 0.84
 SVM | Cov + Ling | 0.85 | 0.87 | 0.82 | 0.87 | 0.87 | 0.84
 RFC | Best | 0.87 | 0.87 | 0.86 | 0.90 | 0.88 | 0.87
 SVM | Best | 0.83 | 0.83 | 0.82 | 0.86 | 0.85 | 0.83

Abbreviations: AD, Alzheimer's disease; AUC, area under the curve of receiver operating characteristics; HCs, healthy elderly controls; RFC, Random Forests Classifier; SVM, Support Vector Machine classifier; Ling, set of all linguistic features; Cov, set of all information coverage features; Phon, set of all phonetic features; Cov + Ling, a combination of all linguistic and information coverage features; MCI, mild cognitive impairment.

NOTE. The best results are indicated in bold.

Best = a combination of all features with P value < .001 when correlating with cognitive impairment.


Discussion

We presented a methodology for a computer-based evaluation of a picture description task. This evaluation aims not only to score participants' performance during the task itself but also to analyze their linguistic and phonetic productions in a single, commonly used, noninvasive clinical test. Our objective is to provide clinicians with computational aids for the early detection of signs that may signal the presence of MCI or AD.

From the features in Table 4, the strongest correlations with the severity of the cognitive impairment were obtained with the information coverage measures: the less informative or pertinent the picture description, the more severe the impairment. These correlations were consistent with the participants' scores on the Mini–Mental State Examination and were mostly independent of age, gender, and education.

Our findings on the correlation of linguistic and phonetic features with cognitive impairment were consistent with previous literature and provided a broader evaluation of the participants' performance. In general, vocabulary richness and syntactic complexity measures were inversely correlated with the severity of the disease; these variables were also positively correlated with the number of years of education. Phonetic variables were highly correlated with age and gender. We also observed an increase in the entropy of the description, which may indicate more chaotic or disorganized descriptions. An increased rate of conjunctions was also positively correlated with cognitive impairment and age. The use of coordinating conjunctions in spoken language is not necessarily an indicator of grammatical connections: these coordinators in speech often have a "loose discoursal linking function" [44] and other pragmatic functions [45]. Hence, an increased use of conjunctions does not automatically imply a more complex discourse.
In this work, we observed that a high use of conjunctions in a picture description task may even indicate hesitation. For example, in the participant's description, "she has water on the floor and… and basically, it's kind of—uh—a distressing scene," the participant appears to repeat the conjunction "and" to gain time to further evaluate the scene. We tested the SVM and the Random Forests Classifier first with each type of feature independently, and then with combinations of them. When we experimented with all three types of features together, we carried out a preselection of the best features, choosing those with P < .001 when correlated with the severity of the cognitive impairment.

Comparison to other approaches

Contrasting our results against previous works on automated evaluation of picture description tasks can be difficult for multiple reasons. First and foremost, it is not customary in natural language processing to provide performance metrics such as AUC and specificity. Although accuracy, precision, recall, and F-score are usually illustrative in classes with similar sample sizes, these values can become misleading when the classes are skewed. An additional challenge in contrasting these methods is that not every author works with the same data distribution even when using the same data set. With machine learning algorithms, the way the samples are distributed across the data set and between the training and test sets leads to slightly different results, and authors tend to report the results obtained with the distribution in which their algorithms performed best. Finally, despite using the Pitt Corpus, previous works differ in the number of samples used during their evaluation. Fraser et al. [24] used 233 HC and 240 AD samples; Yancheva and Rudzicz [26] used 241 HC and 255 AD samples; for this work, we used 242 HC and 257 AD samples (about 10% of the HC sample was used to form the referent and was not included in the evaluation). There is no clear explanation from previous authors regarding why they did not include all the samples in their experimentation. Fraser et al. [24] reported an accuracy of 81.92%, whereas Yancheva and Rudzicz [26] reported an accuracy, precision, recall, and F-score of 80%. In both works, the authors performed a classification between HCs and AD participants, without including the MCI sample. In our work, two SVM classifiers tied with the highest AUC at 0.79 in this task (Table 6). The first learner used a combination of all the information coverage and linguistic features, whereas the second used a combination of all features with P < .001 when correlating with cognitive impairment.
The second algorithm presented a higher sensitivity (81%) and a higher F-score (81%), comparable to state-of-the-art work [24] that uses a manually made list of information content units (ICUs). When we incorporated the MCI sample into the experiment (Table 7), we observed that the SVM learner trained with information coverage and linguistic features performed at the highest AUC (0.76). There was an expected drop in specificity (68%), that is, an increase in the false-positive rate. However, the sensitivity was still high at 85%. The best model of an experiment represents the highest performance achieved by an algorithm during the cross-validation process and indicates the algorithm's highest potential in classifying new data with characteristics similar to the sample. For the first experiment (Table 6), the best model had an AUC of 0.93, with excellent sensitivity and a true-negative rate of 86% when classifying HCs and AD patients. When the MCI sample was incorporated (Table 7), the best model had an AUC of 0.87, with a sensitivity of 87% and a specificity of 86%.

Study advantages and limitations

One of the advantages of our proposed methodology is that the informativeness and pertinence measures are estimated against an automatically created referent. This referent has the particularity of being adaptable to differences in population or even to different pictures for description. Previous automated works present difficulties at considering linguistic variabilities for expressing similar notions. With our proposed approach, the referent is created from examples of descriptions from healthy age-related individuals. Hence, it incorporates different ways of expressing similar ideas and even what could be considered as normal deviations from topics. It also allows for the consideration of context through the accounting of linguistic patterns of phrases, rather than just of isolated words. In this regard, the bigger the sample set aside for creating the referent, the richer and more variate the referent. To our knowledge, this is the first time that an automatic measure of pertinence has been implemented in a picture description task. While most computational approaches focus only on the information coverage, one advantage of our measure is that it helps to detect when patients drift off-topic, a highly challenging task in automatic analysis. One disadvantage of our approach is that it sacrifices part of the HC group to create the referent, reducing the availability of HC samples for training the algorithms. Our study also presented a limitation when evaluating MCI patients, yielding a high false-negative rate.
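A referent-based coverage score of the kind described above can be sketched minimally as follows. This is a hypothetical simplification, not the authors' measure: the bigram pooling, the toy HC descriptions, and the overlap ratio are our assumptions, chosen only to show how phrase patterns from HC samples can flag off-topic speech.

```python
# Hypothetical sketch of a referent-based coverage score: pool word bigrams
# from HC descriptions into a referent, then score a transcript by the
# fraction of its bigrams found in that referent.
def bigrams(text):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

# Toy HC descriptions of the Cookie Theft picture (illustrative only)
hc_samples = [
    "the boy is stealing cookies from the jar",
    "water is overflowing in the sink",
]
referent = set().union(*(bigrams(s) for s in hc_samples))

def coverage(transcript):
    bg = bigrams(transcript)
    return len(bg & referent) / len(bg) if bg else 0.0

print(coverage("the boy is stealing cookies"))   # on-topic: high coverage
print(coverage("my garden has lovely roses"))    # off-topic: low coverage
```

In this toy version, a larger HC pool enriches the referent exactly as the text notes: more samples contribute more alternative phrasings, so legitimate variation is not penalized.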

Future work

In both experiments, we observed that our phonetic features were not sufficiently discriminative or had little to no effect on the performance of the algorithms. As previous authors have reported, more complex acoustic and rhythm features could significantly improve the automatic classification of HCs and AD patients. In future work, we propose to extend the research scope by evaluating the information coverage metrics on different picture description tasks, or even on other restricted-discourse tests. There is also potential for multilingual studies, because all the features proposed in this work are language independent or can be adapted to other languages. Finally, we intend to study the effect of different HC sample sizes on the creation of the referent.

Research in context

Systematic review: Alzheimer's disease (AD) patients and healthy elderly controls have shown significant differences in their performance on picture description tests. We present a computer-based methodology to evaluate the performance of patients during this task.

Interpretation: Using 10% of the healthy elderly controls and an information coverage measure, we created a population-specific referent. Against this referent, we automatically assessed the informativeness and pertinence of descriptions of the Cookie Theft picture and extracted linguistic and phonetic features. Applying machine learning algorithms, we classified healthy elderly controls and AD patients with results (sensitivity = 81%; AUC = 0.79) comparable to state-of-the-art work that uses a manually made referent. Incorporating mild cognitively impaired participants yielded a sensitivity of 85% (AUC = 0.76). Our findings encourage the use of computer-based procedures as an aid in clinical practice.

Future directions: Our proposed approach could be applied to other picture tasks and languages. Additional studies are needed with larger mild cognitively impaired samples and more complex phonetic features to improve the classification accuracy.
References (18 in total)

1.  The token test: A sensitive test to detect receptive disturbances in aphasics.

Authors:  E DE RENZI; L A VIGNOLO
Journal:  Brain       Date:  1962-12       Impact factor: 13.501

2.  An analysis of connected speech samples of aphasic and normal speakers.

Authors:  K M Yorkston; D R Beukelman
Journal:  J Speech Hear Disord       Date:  1980-02

3.  Analysis of speech-based measures for detecting and monitoring Alzheimer's disease.

Authors:  A Khodabakhsh; C Demiroglu
Journal:  Methods Mol Biol       Date:  2015

4.  Computerized analysis of speech and language to identify psycholinguistic correlates of frontotemporal lobar degeneration.

Authors:  Serguei V S Pakhomov; Glenn E Smith; Dustin Chacon; Yara Feliciano; Neill Graff-Radford; Richard Caselli; David S Knopman
Journal:  Cogn Behav Neurol       Date:  2010-09       Impact factor: 1.600

5.  Empty speech in Alzheimer's disease and fluent aphasia.

Authors:  M Nicholas; L K Obler; M L Albert; N Helm-Estabrooks
Journal:  J Speech Hear Res       Date:  1985-09

6.  Language disintegration in dementia: effects of etiology and severity.

Authors:  D B Hier; K Hagenlocker; A G Shindler
Journal:  Brain Lang       Date:  1985-05       Impact factor: 2.381

7.  Linguistic ability in early life and cognitive function and Alzheimer's disease in late life. Findings from the Nun Study.

Authors:  D A Snowdon; S J Kemper; J A Mortimer; L H Greiner; D R Wekstein; W R Markesbery
Journal:  JAMA       Date:  1996-02-21       Impact factor: 56.272

8. (Review) Innovative diagnostic tools for early detection of Alzheimer's disease.

Authors:  Christoph Laske; Hamid R Sohrabi; Shaun M Frost; Karmele López-de-Ipiña; Peter Garrard; Massimo Buscema; Justin Dauwels; Surjo R Soekadar; Stephan Mueller; Christoph Linnemann; Stephanie A Bridenbaugh; Yogesan Kanagasingam; Ralph N Martins; Sid E O'Bryant
Journal:  Alzheimers Dement       Date:  2014-11-15       Impact factor: 21.566

9.  Predicting mild cognitive impairment from spontaneous spoken utterances.

Authors:  Meysam Asgari; Jeffrey Kaye; Hiroko Dodge
Journal:  Alzheimers Dement (N Y)       Date:  2017-02-27

10.  Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease.

Authors:  Samrah Ahmed; Anne-Marie F Haigh; Celeste A de Jager; Peter Garrard
Journal:  Brain       Date:  2013-10-18       Impact factor: 13.501

Similar articles (15 in total)

1.  Narrative video scene description task discriminates between levels of cognitive impairment in Alzheimer's disease.

Authors:  Stephanie Reeves; Victoria Williams; Francisco M Costela; Rocco Palumbo; Olivia Umoren; Mikaila M Christopher; Deborah Blacker; Russell L Woods
Journal:  Neuropsychology       Date:  2020-01-30       Impact factor: 3.295

2.  A Review of Automated Speech and Language Features for Assessment of Cognitive and Thought Disorders.

Authors:  Rohit Voleti; Julie M Liss; Visar Berisha
Journal:  IEEE J Sel Top Signal Process       Date:  2019-11-07       Impact factor: 6.856

3.  Speech Analysis by Natural Language Processing Techniques: A Possible Tool for Very Early Detection of Cognitive Decline?

Authors:  Daniela Beltrami; Gloria Gagliardi; Rema Rossini Favretti; Enrico Ghidoni; Fabio Tamburini; Laura Calzà
Journal:  Front Aging Neurosci       Date:  2018-11-13       Impact factor: 5.750

4.  Atypical Repetition in Daily Conversation on Different Days for Detecting Alzheimer Disease: Evaluation of Phone-Call Data From Regular Monitoring Service.

Authors:  Yasunori Yamada; Kaoru Shinkawa; Keita Shimmei
Journal:  JMIR Ment Health       Date:  2020-01-12

5.  Scaling laws in natural conversations among elderly people.

Authors:  Masato S Abe; Mihoko Otake-Matsuura
Journal:  PLoS One       Date:  2021-02-19       Impact factor: 3.240

6.  Multilingual automation of transcript preprocessing in Alzheimer's disease detection.

Authors:  Frédéric Abiven; Sylvie Ratté
Journal:  Alzheimers Dement (N Y)       Date:  2021-03-19

7.  Transformer-based deep neural network language models for Alzheimer's disease risk assessment from targeted speech.

Authors:  Alireza Roshanzamir; Hamid Aghajan; Mahdieh Soleymani Baghshah
Journal:  BMC Med Inform Decis Mak       Date:  2021-03-09       Impact factor: 2.796

8.  Language Impairment in Alzheimer's Disease-Robust and Explainable Evidence for AD-Related Deterioration of Spontaneous Speech Through Multilingual Machine Learning.

Authors:  Hali Lindsay; Johannes Tröger; Alexandra König
Journal:  Front Aging Neurosci       Date:  2021-05-19       Impact factor: 5.750

9.  A systematic literature review of automatic Alzheimer's disease detection from speech and language.

Authors:  Ulla Petti; Simon Baker; Anna Korhonen
Journal:  J Am Med Inform Assoc       Date:  2020-11-01       Impact factor: 4.497

10. (Review) Speech- and Language-Based Classification of Alzheimer's Disease: A Systematic Review.

Authors:  Inês Vigo; Luis Coelho; Sara Reis
Journal:  Bioengineering (Basel)       Date:  2022-01-11
