Novel verbal fluency scores and structural brain imaging for prediction of cognitive outcome in mild cognitive impairment.

David Glenn Clark¹, Paula M McLaughlin², Ellen Woo³, Kristy Hwang⁴, Sona Hurtz⁵, Leslie Ramirez⁵, Jennifer Eastman⁶, Reshil-Marie Dukes⁷, Puneet Kapur⁸, Thomas P DeRamus⁹, Liana G Apostolova¹⁰.

Abstract

INTRODUCTION: The objective of this study was to assess the utility of novel verbal fluency scores for predicting conversion from mild cognitive impairment (MCI) to clinical Alzheimer's disease (AD).
METHOD: Verbal fluency lists (animals, vegetables, F, A, and S) from 107 MCI patients and 51 cognitively normal controls were transcribed into electronic text files and automatically scored with traditional raw scores and five types of novel scores computed using methods from machine learning and natural language processing. Additional scores were derived from structural MRI scans: region of interest measures of hippocampal and ventricular volumes and gray matter scores derived from performing ICA on measures of cortical thickness. Over 4 years of follow-up, 24 MCI patients converted to AD. Using conversion as the outcome variable, ensemble classifiers were constructed by training classifiers on the individual groups of scores and then entering predictions from the primary classifiers into regularized logistic regression models. Receiver operating characteristic curves were plotted, and the area under the curve (AUC) was measured for classifiers trained with five groups of available variables.
RESULTS: Classifiers trained with novel scores outperformed those trained with raw scores (AUC 0.872 vs 0.719; P < .05 by DeLong test). Addition of structural brain measurements did not improve on the performance achieved with novel scores alone.
CONCLUSION: The brevity and cost profile of verbal fluency tasks recommend their use for clinical decision making. The word lists generated are a rich source of information for predicting outcomes in MCI. Further work is needed to assess the utility of verbal fluency for early AD.

Keywords: Alzheimer's disease; Cognitive neuropsychology; Dementia; MCI (mild cognitive impairment); MRI (magnetic resonance imaging); Machine learning; Natural language processing

Year: 2016 PMID: 27239542 PMCID: PMC4879664 DOI: 10.1016/j.dadm.2016.02.001

Source DB: PubMed Journal: Alzheimers Dement (Amst) ISSN: 2352-8729

Introduction

Alzheimer's disease (AD) is a major socioeconomic crisis for the 21st century, with a projected 14 million cases by the year 2050 [1]. The dominant hypothesis for the pathogenesis of AD involves early deposition of beta-amyloid in the brain, but clinical trials targeting amyloid during the past decade have not met primary endpoints 2, 3, 4. There is now evidence that beta-amyloid accumulates in the brain >10 years before the onset of cognitive symptoms 5, 6. Although cognitive symptoms appear late, studies in autosomal dominant AD suggest that individuals with mutations have measurable changes in cognition several years before onset of symptoms, when compared to mutation-free individuals [5]. These discoveries raise the concern that if treatments targeting beta-amyloid are to work, it might be necessary to implement them before the onset of symptoms. Two practical challenges arise. First, what is the best way to design clinical trials to ensure that pathophysiological changes of AD are occurring, if the individuals we would like to enroll are asymptomatic and cannot be expected to seek medical attention? Second, if a beta-amyloid targeted treatment is proven to work in the asymptomatic or early symptomatic stages of the disease, how can we identify individuals in the general population who will benefit from it? There is an imminent need for new methods of detecting the earliest changes of AD, as the available biological methods are expensive, invasive, or entail exposure to radiation. Candidate methods under investigation include brief neuropsychological tests, ocular imaging, speech signal analysis, and computerized assessment of gait [7]. Structural MRI has been evaluated for this purpose, but the typical approach requires more than one image over a 6–12-month period, making it relatively expensive for a single patient [8].
Another approach is to use machine learning methods to train a classifier to discern between amyloid-positive and amyloid-negative individuals using available predictor variables, such as demographic data, structural brain imaging, cognitive tests, and blood tests. This approach has shown some success in a recent analysis of individuals with mild cognitive impairment (MCI; a condition thought to be a risk state for dementia 9, 10) who were studied in the AD Neuroimaging Initiative study [11]. The classifier achieved 0.78 area under the receiver operating characteristic curve (AUC) on a test set and 0.76 AUC when predicting conversion from MCI to AD.
The current work focuses on verbal fluency tasks, very brief cognitive measures in which the participant is given 1 minute to generate as many words as possible within a certain category, such as animals, or that start with a specific letter. The traditional method of scoring these tests is to simply count the number of unique, valid items in the list. The raw score obtained has proven clinical value 12, 13, 14, and as a result, verbal fluency tasks are performed in many research studies on AD or other cognitive disorders. However, there is strong evidence that careful examination of the words produced during the tasks may have additional clinical value, apart from or in addition to the raw score. The classic method for studying the explicit word content of these lists is to identify clusters of consecutively listed words that are related in some way (e.g., they have similar meaning, rhyme, or start with the same two letters) 15, 16, 17, 18. The average length of these clusters is termed the clustering score and is thought to relate to spreading activation in a semantic or lexical network. The number of transitions between clusters is termed the switching score and is thought to relate to an individual's ability to deliberately change the subcategory of items that is currently being searched. Investigators have found that each of these scores has value for predicting dementia in longitudinal studies 19, 20. Some investigators have made use of unsupervised learning methods either on a corpus of fluency word lists [21] or on large English-language corpora 19, 22 to improve prognostications.
In the present study, we develop new models for estimating risk of dementia conversion in MCI patients using measures derived from structural brain images and novel verbal fluency scores. In the statistical sub-discipline of machine learning, classification is often enhanced through expansion of the set of predictive features. This approach differs from the traditional inferential statistical approach, in which the primary goal is to identify statistically significant relationships between the independent and dependent variables. A potential weakness of the traditional approach is that one may develop a model with very poor predictive accuracy although it contains only statistically significant predictors. In machine learning, there is no immediate ambition to explain the relationship between the outcome and individual predictors; instead, the model is justified through the quality of predictions it yields.
With this goal in mind, we developed a large set of novel predictive scores for five fluency tasks (categories animals and vegetables, and letters F, A, and S). Some of these novel predictive scores have roots in previous work (e.g., based on clustering, switching, or independent components analysis [ICA]), whereas others were developed specifically for this project. Some of the new scores are based on fundamental lexical qualities, such as syllable counts or frequencies of the words generated. Both of these quantities have good face validity and are easy to obtain, and the machine learning approach permits us to consider several possible ways of using them, such as (for a given fluency word list) taking the average, taking the sum, or subtracting the minimum value from the maximum value (i.e., metric range). We based several novel scores on graph theory, a branch of discrete mathematics that provides techniques for analyzing networks. For this approach, we viewed the words in each list as nodes in a network and created weighted graphs by assigning numerical values to the edges or connections between the nodes. These values corresponded to the semantic, orthographic, or phonologic similarity between the two words being connected. Several scores were derived directly from these weighted graphs. The computation of other measures depended on conversion of each weighted graph into an unweighted graph by first identifying a threshold of the similarity metric and then creating a new graph containing only the edges that met the threshold. For further details and rationale behind the predictor variables, see the Supplementary Methods (available online).
In addition, we derived several cerebral measurements from structural MRI scans, including hippocampal atrophy and patterns of cortical thinning. Following trends in machine learning research [23], we took the approach of developing ensemble classifiers. In the case of this work, the ensemble classifiers were constructed by training several initial classifiers and then training a final classifier using estimates of risk from the initial classifiers. Our goal was to compare the major subsets of the predictor variables in terms of their value for predicting conversion to dementia from MCI. Meeting this goal will be an important step if detailed verbal fluency word list analyses are to contribute to future efforts for identification of early symptomatic or pre-symptomatic AD.
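
The graph construction described above (words as nodes, similarity as edge weight, then a threshold to obtain an unweighted graph) can be illustrated with a minimal sketch. The bigram-overlap similarity, the function names, and the rule of standardizing edge weights within one list before applying a cutoff are all assumptions made for illustration; the similarity measures actually used are described in the Supplementary Methods.

```python
from itertools import combinations
from statistics import mean, pstdev


def bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Toy orthographic similarity (assumption): Dice overlap of letter bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def weighted_graph(words, similarity=dice_similarity):
    """Map every unordered pair of distinct words to its similarity weight."""
    unique = list(dict.fromkeys(words))  # drop repeats, keep list order
    return {frozenset(pair): similarity(*pair) for pair in combinations(unique, 2)}


def threshold_graph(weights, z_cut=1.0):
    """Keep edges whose weight is at least z_cut SDs above the mean weight."""
    values = list(weights.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return set()  # all weights equal: no edge stands out
    return {edge for edge, w in weights.items() if (w - mu) / sd >= z_cut}


words = ["fish", "fist", "fan", "fun", "frog"]
graph = weighted_graph(words)          # 10 weighted edges
edges = threshold_graph(graph, 1.0)    # only the fish-fist edge survives
```

In this toy list only "fish" and "fist" share bigrams, so theirs is the single edge retained in the unweighted graph.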

Methods

Neurocognitive testing and consensus diagnoses

One hundred fifty-eight individuals met the inclusion/exclusion criteria set by the UCLA Imaging and Genetic Biomarkers for AD (ImaGene) study. ImaGene prospectively enrolled and followed individuals recruited from two sources: (1) referring UCLA and outside neurologists and (2) our Alzheimer's Disease Research Center (ADRC) ongoing longitudinal database study. The latter group consists of existing research participants who agreed to be contacted for future research opportunities and met ImaGene inclusion and/or exclusion criteria. All subjects provided informed consent after detailed explanation by a study clinician, and the UCLA Institutional Review Board approved the study. To be included, subjects had to be aged at least 50 years, able to independently carry out daily activities of living based on interview, and score ≥24 on the mini-mental state examination (MMSE) [24]. MCI diagnosis was based on Petersen criteria [25] and required an objective cognitive deficit of at least 1.5 SD below age-adjusted and education-adjusted neuropsychological norms on at least one neuropsychological test, global clinical dementia rating (CDR) score <1, preserved general cognitive function, and intact activities of daily living. Cognitively normal participants performed above the −1.5 SD cutoff on the neuropsychological tests (adjusting for age and education) and had a global CDR of 0. Exclusionary criteria for both groups were concurrent medical problems of sufficient severity to impact cognition, history of alcohol or drug abuse in the past 2 years, concurrent neurologic or psychiatric illnesses, contraindications to MRI, cortical strokes or significant white matter changes, and visual and hearing impairment that could interfere with cognitive testing. During each visit, ImaGene participants underwent detailed clinical and cognitive examinations, blood draw, and magnetic resonance imaging (MRI) examination. 
The neuropsychological battery and average scores by diagnostic group are listed in Table 1. Diagnosis for each subject was based on a consensus by all UCLA ADRC neurologists, neuropsychologists, and other key study personnel.

Table 1

Participant demographics and selected neuropsychological test scores

	CN (n = 51)	MCI-non (n = 83)	MCI-con (n = 24)
Age (y)	68.9 (7.9)	68.7 (8.6)	73.8 (7.9)***
Sex (M:F)	28:23	37:46	9:15
Education (y)	17.6 (2.2)****	16.0 (3.0)	16.0 (2.9)
Mini-mental state examination	28.9 (1.2)****	27.9 (1.7)	25.1 (3.1)****
Animals	22.0 (4.7)****	18.8 (5.1)	14.5 (4.9)****
Vegetables	15.0 (4.3)***	13.0 (4.1)	9.7 (3.7)***
F	16.8 (4.6)****	13.4 (5.1)	11.3 (5.5)*
A	15.6 (4.6)****	11.1 (5.1)	7.9 (4.7)***
S	16.9 (5.5)****	13.4 (5.1)	10.7 (5.0)**
Boston naming test	58.1 (1.9)****	52.1 (8.1)	47.4 (10.6)*
Digit span forward	10.8 (2.3)	10.2 (2.2)	9.5 (1.9)
Digit span backward	8.2 (2.2)****	6.6 (2.4)	5.6 (1.6)**
Trails A	24.9 (8.9)****	33.8 (14.6)	44.5 (18.1)**
Trails B	70.0 (37.2)****	103.6 (55.5)	161.8 (87.5)***
Stroop A	62.7 (12.2)**	69.0 (20.4)	87.7 (18.7)****
Stroop B	47.5 (8.4)**	51.3 (12.6)	56.5 (10.1)**
Stroop C	114.9 (28.3)****	137.5 (44.0)	181.2 (66.6)***
Wisconsin card sort (categories)	4.3 (0.9)****	3.4 (1.9)	2.4 (1.7)**
Wisconsin card sort (errors)	11.7 (8.9)****	22.7 (13.7)	35.8 (19.0)***
Logical memory I	42.9 (9.6)****	32.9 (11.1)	15.9 (8.1)****
Logical memory II	28.3 (7.1)****	18.5 (9.5)	4.6 (4.7)****
Visual recall I	82.9 (13.5)****	80.0 (16.6)	52.6 (17.4)****
Visual recall II	62.8 (25.2)****	38.1 (23.8)	15.2 (21.8)****
Rey-Osterrieth figure copy	33.4 (2.4)****	30.0 (4.7)	29.7 (4.7)
Rey-Osterrieth delayed recall	20.2 (6.7)****	12.7 (7.2)	7.4 (6.6)***

Abbreviations: CN, cognitively normal group; MCI-non, mild cognitive impairment nonconverter; MCI-con, mild cognitive impairment converter to AD. Numbers in parentheses are standard deviations. All statistical comparisons are made to the MCI-non group.

NOTE. *P < .1, **P < .05, ***P < .01, ****P < .001.

Seventy (65.4%) of the MCI participants were classified as “amnestic” because of poor performance on memory measures. Eighteen of these individuals had only memory impairment. Among the 52 amnestic individuals with impairment outside memory, 18 had impairment in only one nonmemory domain (14 executive, one language, two visuospatial, and one attention). The other 34 amnestic MCI patients had impairment in more than one additional nonmemory domain. Thirty-seven (34.6%) of the MCI participants were classified as nonamnestic. Thirty of these individuals had impairment in only one nonmemory domain (12 executive, six language, 11 visuospatial, one attention), whereas the other seven had impairment in executive function plus at least one other domain. During >4 years of follow-up, 24 of 107 individuals with MCI at baseline were determined by the consensus panel to have converted to dementia. Conversion was noted between 0.98 and 4.08 years after the baseline evaluation (M = 1.83 years, SD = 0.84 years). Converters were predominantly amnestic (21 of 24, or 87.5%). Twenty-two of the converters met clinical criteria for AD. The other two cases were clinically diagnosed as having dementia with Lewy bodies. One of these patients died and at autopsy was found to have hippocampal sclerosis without Lewy bodies. The other DLB patient has undergone positron emission tomography with an amyloid-detecting tracer and is amyloid positive.

Overview of machine learning approach

Fluency scores (traditional and novel) and brain imaging measurements were computed and used to create classifiers for discerning individuals with MCI who converted to AD from those that did not convert. All scores were placed in a single data matrix, and missing values for any given score were imputed as the mean of all the nonmissing values for that score. We performed five analyses, each using a different subset of the available scores: raw (traditional scores and counts of intrusions and repetitions), brain (measures derived from structural MRI), raw + brain (the union of the raw set and the brain set), novel (all scores derived from verbal fluency lists, including the raw scores), and novel + brain (the union of the novel and brain sets). Demographic variables of age, sex, and education were included for all analyses. The quality of each classifier was assessed using leave-one-out cross-validation. This means that a separate classifier was constructed with each MCI participant left out and the classifier was then used to make a prediction about whether the left-out participant converted to dementia. The following three analysis steps were undertaken during each cross-validation loop: variable selection, training of an ensemble of individual classifiers, and combination of the ensemble predictions through sparse logistic regression. Variable selection was performed by running the random forests algorithm [26] on the training data set one time with 400 trees and calculating importance for each variable. Importance values were converted to z scores, and separate thresholds were selected for each analysis with iterative search over thresholds between 0 and 2.0. Each ensemble consisted of classifiers with four different architectures: random forests of conditional trees, support vector machines [27], naïve Bayes [28], and multilayer perceptrons [28]. All analyses were performed in R with the following additional libraries: party, e1071, and RSNNS. 
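
As a rough illustration of two of the preprocessing steps just described, the sketch below imputes missing values with column means and selects variables whose standardized importance meets a threshold. The importance values are taken as given (in the study they came from random forests), the variable names are hypothetical, and the use of the population standard deviation for the z scores is an assumption.

```python
from statistics import mean, pstdev


def impute_column_means(rows):
    """Replace None with the mean of the non-missing values in the same column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def select_by_importance(importance, z_threshold):
    """Return the names of variables whose importance z score meets the threshold."""
    values = list(importance.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []
    return [name for name, v in importance.items() if (v - mu) / sd >= z_threshold]


data = [[1.0, None], [3.0, 4.0], [None, 6.0]]
filled = impute_column_means(data)

# Hypothetical importance values for five hypothetical variables.
importance = {"coherence_A": 10, "raw_F": 0, "raw_S": 0, "age": 0, "sex": 0}
selected = select_by_importance(importance, z_threshold=1.0)
```

In the study both steps sit inside the cross-validation loop, so the selection would be recomputed from the left-in data on every iteration.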
One classifier of each architecture was trained using the left-in data, and ten additional classifiers of each architecture were trained using bootstrap samples of the left-in data for a total of 44 classifiers. Predictions from each classifier in the ensemble were obtained on the training data itself and for the left-out data point. Predictions of the ensemble were combined linearly with sparse logistic regression to yield the final prediction for the left-out data point. The sparse logistic regression model was trained using conversion as the outcome variable and the ensemble predictions on the training data as the predictor variables. Sparse or “LASSO” (least absolute shrinkage and selection operator) regression produces a lower variance model than traditional regression while automatically performing variable selection. Because the ensemble had also generated predictions on the left-out data point, it was then possible to enter these predictions into the sparse logistic regression model to obtain the final prediction.
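
The cross-validation and bagging structure described above can be sketched as follows. This is not the authors' implementation: a nearest-centroid scorer stands in for the four classifier architectures, and a plain average stands in for the sparse logistic regression that combined the ensemble predictions. With one stand-in architecture the ensemble has 1 + 10 = 11 members; the study used four architectures, hence 4 × 11 = 44.

```python
import math
import random


def fit_centroid_scorer(X, y):
    """Stand-in base learner: score rises as x nears the converter centroid."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    if not pos or not neg:  # a bootstrap sample can miss one class entirely
        prior = len(pos) / len(X)
        return lambda x: prior

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x):
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        return d_neg / (d_pos + d_neg) if d_pos + d_neg > 0 else 0.5

    return score


def loocv_ensemble(X, y, n_boot=10, seed=0):
    """Leave-one-out predictions from one full-data member plus n_boot bagged members."""
    rng = random.Random(seed)
    predictions = []
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        members = [fit_centroid_scorer(train_X, train_y)]
        for _ in range(n_boot):
            idx = [rng.randrange(len(train_X)) for _ in train_X]
            members.append(fit_centroid_scorer([train_X[k] for k in idx],
                                               [train_y[k] for k in idx]))
        scores = [member(X[i]) for member in members]
        # Simple average used here in place of LASSO logistic regression.
        predictions.append(sum(scores) / len(scores))
    return predictions


X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
risk = loocv_ensemble(X, y)
```

Each left-out participant receives a single risk estimate, and it is the set of these held-out estimates that would be passed to the ROC analysis.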

Verbal fluency tasks and scoring

Research participants underwent five fluency tasks. During these tasks, they were given 1 minute to generate as many words as possible within certain constraints. For two of the tasks, the constraint was semantic (animals and vegetables), and for three of the tasks, the constraint was orthographic (words had to start with the letters F, A, or S). A psychometrist or neuropsychologist recorded the words generated and the lists were subsequently transcribed into electronic text files by two of the authors (D.G.C. and R.M.D.). Raw and novel scores on verbal fluency word lists were calculated using custom Python software and the NetworkX Python library [29], as described in Table 2. Similarity measurements between words followed methods previously described for analyzing verbal fluency [30].
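
A minimal sketch of how the traditional counts and the simple frequency summaries might be computed from a transcribed list is given below. The tiny lexicon, the validity check, and the convention that a repeated invalid word counts as an intrusion each time are assumptions for illustration; the authors' software is not reproduced here.

```python
def traditional_scores(responses, is_valid):
    """Count unique valid items (raw), nonvalid items, and repeated valid items."""
    seen = set()
    raw = intrusions = repetitions = 0
    for word in responses:
        token = word.strip().lower()
        if not is_valid(token):
            intrusions += 1
        elif token in seen:
            repetitions += 1
        else:
            seen.add(token)
            raw += 1
    return {"raw": raw, "intrusions": intrusions, "repetitions": repetitions}


def frequency_scores(frequencies):
    """Summaries of the lexical frequencies of the words in one list."""
    return {
        "mean": sum(frequencies) / len(frequencies),
        "sum": sum(frequencies),
        "metric_range": max(frequencies) - min(frequencies),
        "sum_reciprocal": sum(1.0 / f for f in frequencies),
    }


ANIMALS = {"dog", "cat", "horse", "zebra"}  # hypothetical lexicon
counts = traditional_scores(["dog", "cat", "Dog", "table", "zebra"],
                            lambda w: w in ANIMALS)
freq = frequency_scores([4.0, 2.0, 1.0])    # made-up frequency values
```

Here "Dog" is scored as a repetition of "dog" and "table" as an intrusion, leaving a raw score of 3.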

Table 2

Traditional and novel fluency scores

Score	Description
Traditional
Raw	Count of unique valid items
Intrusions	Count of nonvalid items
Repetitions	Count of repeated items
Classic and miscellaneous lexical
Clustering	Automatically calculated as described in Troyer, et al. (1998a) [15] and Clark, et al. (2014) [19]
Switching	Automatically calculated as with clustering
Mean frequency	Lexical frequencies for all words generated were calculated from the Google n-grams corpus and averaged
Mean number of syllables	Syllables for each word generated were quantified as the number of vowel symbols in the pronunciation listed in the Carnegie Mellon University Pronunciation Dictionary
Metric range of frequency	Calculated as the maximum frequency of words within a list minus the minimum frequency of words in the list
Sum of frequencies	Lexical frequencies were added together
Sum of reciprocal of frequencies	The reciprocal of all the lexical frequencies were added together
Independent components analysis (ICA)
20 scores	ICA was performed on proximity matrices as described in Clark et al. (2014a). Each individual received 20 scores computed as the dot product of the individual's proximity matrix and 20 extracted components
Similarity metric based
Algebraic connectivity	Second smallest eigenvalue of the Laplacian of the weighted graph
Average clustering coefficient	Given a vertex in a graph, the clustering coefficient for the vertex is the proportion of edges present among the immediate neighbors of the vertex. This value was calculated for all vertices in the thresholded graph and averaged.
Average degree	Average weight of all edges connected to each vertex in the graph
Diameter	Length of the longest geodesic in the weighted graph
Maximum betweenness centrality	For every pair of distinct vertices in the thresholded graph, the shortest path between the pair was identified. The betweenness centrality for each vertex was calculated as the number of shortest paths passing through that vertex. The score was the maximum of these values.
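Radius	Minimum eccentricity among the vertices of the weighted graph, where the eccentricity of a vertex is the length of the longest geodesic starting at that vertex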
Transitivity	Three times the number of triangles in the thresholded graph divided by the number of triads (two edges with a common vertex) in the graph
Coherence	A greedy algorithm was used to discover a short Hamiltonian path through the vertices of the weighted graph. The sum of the similarity weights on the actual path taken by the participant was divided by the sum of the similarities on the optimal path.
P&H clustering	Defined as for traditional clustering, but linkages between words were based on the edges in the thresholded graph, as described by Pakhomov & Hemmy (2014) [17]
P&H switching	Analogous to P&H clustering

NOTE. Similarity metrics included orthographic, phonologic, and semantic similarity measures like those described in Clark et al. (2014b). Thus, there were three versions of each of the similarity-metric based scores.
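
Two of the thresholded-graph scores in Table 2, the average clustering coefficient and transitivity, can be written out in a few lines of plain Python. The sketch below follows the standard graph-theoretic definitions and uses a made-up word list; NetworkX offers equivalent `average_clustering` and `transitivity` functions, which is presumably what the custom software relied on, although the source does not say so explicitly.

```python
from itertools import combinations


def adjacency(nodes, edges):
    """Build an undirected adjacency map; words with no edges stay as isolated nodes."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def local_clustering(adj, v):
    """Proportion of possible edges present among the neighbors of v."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def transitivity(adj):
    """3 * triangles / triads; each triangle is counted once per corner below."""
    closed = sum(1 for v in adj for a, b in combinations(adj[v], 2) if b in adj[a])
    triads = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return closed / triads if triads else 0.0


nodes = ["cat", "dog", "wolf", "fox", "emu"]
edges = [("cat", "dog"), ("dog", "wolf"), ("cat", "wolf"), ("wolf", "fox")]
adj = adjacency(nodes, edges)
avg_cc = average_clustering(adj)   # (1 + 1 + 1/3 + 0 + 0) / 5
trans = transitivity(adj)          # 3 closed triads out of 5
```

The two scores differ because the average clustering coefficient weights every vertex equally (including isolated words), whereas transitivity weights every triad equally.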

Brain measures

Two types of structural brain measures were incorporated into the predictive models: cortical thickness measurements and volumetric measures of several regions of interest (ROIs). MRI scans of sufficient quality were available for all but one participant from the MCI-non group (N = 157).

Cortical mapping

The cortical mapping measurements were obtained by first preprocessing high-resolution structural T1 brain images using previously described methods 31, 32, 33. For each brain image, these methods yielded a set of 65,280 three-dimensional spatial coordinates and a GM thickness measurement at each point. Independent components analysis (ICA) was used to reduce the dimensionality of the cortical surface data. To do so, a simple interpolation method was used to map the thickness measurements from each individual into a standard set of coordinates. For each three-dimensional point in the standard coordinates, the three nearest neighbors in the individual cortical surface data file were identified using a Euclidean distance measure. The cortical thickness at the standard point was set to an average of the thickness measurements at these three neighboring points, weighted by each point's proximity to the standard point. The interpolated cortical thickness measurements from the right and left hemispheres were concatenated for each research participant, and ICA was undertaken using the fastICA library for R. Thus, each component represented the entire bihemispheric cortical surface. Thirty components were extracted and reformatted for direct visual inspection. Component scores for individual research participants were computed as the dot product of the actual cortical thickness measures with each component.
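
The interpolation step can be sketched as a three-nearest-neighbor average. The text says only that neighbors were weighted by proximity, so the inverse-distance weighting used here is an assumption, as are the function and parameter names. The brute-force neighbor search is for clarity; with 65,280 points per surface a spatial index would be the practical choice.

```python
import math


def interpolate_thickness(standard_points, subject_points, subject_thickness, k=3):
    """Estimate thickness at each standard point from its k nearest subject points."""
    result = []
    for p in standard_points:
        nearest = sorted(
            (math.dist(p, q), t) for q, t in zip(subject_points, subject_thickness)
        )[:k]
        if nearest[0][0] == 0.0:  # exact coincidence: take that measurement
            result.append(nearest[0][1])
            continue
        weights = [1.0 / d for d, _ in nearest]  # inverse-distance weights (assumed)
        total = sum(w * t for w, (_, t) in zip(weights, nearest))
        result.append(total / sum(weights))
    return result


subject_points = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (10, 10, 10)]
subject_thickness = [2.0, 4.0, 3.0, 9.0]
standard_points = [(1, 0, 0), (0, 0, 0)]
thickness = interpolate_thickness(standard_points, subject_points, subject_thickness)
```

For the first standard point the two equidistant neighbors (thickness 2.0 and 4.0) and the more distant one (3.0) combine to an interpolated value of 3.0; the second point coincides with a subject point and takes its value directly.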

Region of interest measures

Additional gross measures of the cerebrum included (for each hemisphere) hippocampal volumes, average cortical thickness, and volumes of the superior, inferior, and occipital portions of the lateral ventricles. Hippocampal volumes were derived from manual tracing of the hippocampus proper, dentate gyrus, and subiculum according to a well-established protocol 34, 35. Ventricular volumes were extracted after a semiautomated ventricular segmentation approach, in which lateral ventricles of four MRI scans were initially manually traced and segmented into three partitions per hemisphere (superior horn, temporal horn, and ventricular body/occipital horn) 36, 37. These traces were then converted into three-dimensional parametric ventricular mesh models (termed “atlases”) and were used to fluidly register each unsegmented study image, yielding one segmentation per atlas. The four ventricular segmentations thus derived for each participant were averaged to minimize automated labeling errors. Volumetric measurements were made on the averaged ventricular segmentation.

Assessment of classifier performance

Receiver operating characteristic (ROC) curves were plotted for each of the five ensemble classifiers. The AUC was reported for each classifier. In addition, for each ROC curve, the optimal cut point was defined as the threshold that maximized the F-measure (the harmonic mean of sensitivity and positive predictive value). Accuracy, sensitivity, specificity, negative predictive value, and positive predictive value were measured at this cut point.
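
The performance measures defined above can be made concrete with a short sketch. AUC is computed here through its rank interpretation (the probability that a randomly chosen converter receives a higher score than a randomly chosen nonconverter), and the cut point is chosen to maximize the F-measure as defined in the text. The scores and labels are invented, and the function names are illustrative only.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of converter/nonconverter pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at(scores, labels, cut):
    """Classify score >= cut as converter and report the standard measures."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cut and y == 1)
    fp = sum(1 for s, y in pairs if s >= cut and y == 0)
    fn = sum(1 for s, y in pairs if s < cut and y == 1)
    tn = sum(1 for s, y in pairs if s < cut and y == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    f = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv, "npv": npv,
            "accuracy": (tp + tn) / len(pairs), "f": f}


def best_cut(scores, labels):
    """Return the observed score that maximizes the F-measure when used as threshold."""
    return max(sorted(set(scores)), key=lambda c: metrics_at(scores, labels, c)["f"])


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
cut = best_cut(scores, labels)
at_cut = metrics_at(scores, labels, cut)
```

Because the F-measure ignores true negatives, a cut point chosen this way can favor sensitivity and positive predictive value at some cost in specificity.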

Results

Similarity metrics

For examples of words within each fluency task that were judged to have high orthographic, phonological, or semantic similarity, see Supplementary Table 1. Percentages of edges meeting the similarity threshold of 1.0 standard deviations (for each fluency task and similarity measure) are shown in Supplementary Table 2.

Selected variables

Variables selected from the novel fluency score subset are listed in Table 3, along with the sum of the importance values assigned to each variable from all cross-validation iterations. Fig. 1 shows the gray matter independent component with highest estimated importance, which loads most heavily on points in the superior parietal lobe. Supplementary Fig. 1 shows the independent component with second highest importance. This component loads most heavily on points in the anterior mesial temporal lobe, an area likely to include the entorhinal and perirhinal cortices.

Table 3

Cumulative importance values of variables selected from novel scores

Coherence–semantic (A)	1589.10	ICA10 (S)	598.70	Raw (animal)	281.51
Coherence–ortho (veg)	1066.63	ICA17 (F)	590.24	ICA2 (veg)	191.70
Frequency–metric range (F)	939.63	Frequency–metric range (animal)	585.68	Coherence–phono (S)	156.36
Coherence–ortho (animal)	878.30	Algebraic connectivity–phono (animal)	585.45	Average degree–semantic (veg)	154.91
Algebraic connectivity–ortho (animal)	777.68	Frequency–sum (animal)	564.00	Average clustering coefficient–phono (veg)	136.50
Radius–ortho (A)	772.28	Coherence–semantic (veg)	534.27	Clustering–classic (animal)	122.58
Frequency–mean (animal)	766.87	Switching–phono (veg)	529.87	Diameter–ortho (A)	36.03
Frequency–sum reciprocal (animal)	745.18	Maximum betweenness–phono (animal)	521.75	Algebraic connectivity–ortho (A)	31.80
Transitivity–phono (veg)	690.30	Transitivity–semantic (animal)	495.77	ICA13 (A)	31.22
Coherence–ortho (A)	684.37	Algebraic connectivity–semantic (A)	475.37	Metric range of similarity–semantic (A)	18.30
Coherence–phono (A)	681.66	Average clustering coefficient–ortho (animal)	439.13	Frequency–mean (veg)	17.83
Maximum betweenness–phono (veg)	676.82	Maximum betweenness–semantic (animal)	420.87	Switching–semantic (animal)	9.14
Transitivity–semantic (S)	673.94	Switching–phono (animal)	407.99	Metric range of similarity–ortho (A)	9.09
Average clustering coefficient–semantic (S)	663.10	Frequency–sum (veg)	400.33	Age	8.86
Frequency–sum reciprocal (veg)	656.95	ICA4 (animal)	305.44	Diameter–semantic (animal)	4.57

Abbreviations: veg, Vegetable; ICA, independent components analysis.

NOTE. Each score (apart from age) originated from one of the five fluency tasks (A, animal, F, S, or veg). For scores dependent on measurements of lexical similarity, the type of similarity measure is included (orthographic, phonological, or semantic). Each importance value listed here represents the sum of the importance measurements across all cross-validation loops.

Fig. 1

Component 15 derived from independent components analysis of cortical thickness measures. The values of the component have been normalized to the interval [0,1]. Individuals with a higher gray matter thickness in the parietal lobes and lower gray matter thickness in the right mesial occipital region would achieve the highest scores for this component.

For the novel score analysis, scores from all five fluency tasks achieved sufficient importance to be selected, although S words figured less prominently. Focusing on the top 10 scores, measures of coherence occupied four of the slots (including the top 2), measures of lexical frequency occupied three slots, and the remaining three slots were occupied by the graph theoretical measures algebraic connectivity, radius, and transitivity. Scores based on ICA, clustering, and switching appeared in the list but with lower importance scores. Among raw scores, only animals achieved sufficient importance to be included. Age was the only demographic variable selected and was selected in only two cross-validation loops. For variables selected during the other four analyses, see Supplementary Tables 3 and 4. Boxplots of 10 scores are shown in Fig. 2.

Fig. 2

Boxplots of 10 selected variables. The top row includes the five top-ranked variables from the analysis including only novel scores. The bottom row includes the three raw scores selected for the “raw” analysis and the two imaging scores selected for the “brain” analysis. Differences between the MCI-non (N) and MCI-con (C) groups are apparent for all variables shown. Factors that may be relevant but cannot be readily depicted include potential interactions among several variables and nonlinear relationships between an individual variable and conversion risk. (A) semantic coherence letter A; (B) orthographic coherence vegetables; (C) metric range of frequency letter F; (D) orthographic coherence animals; (E) orthographic algebraic connectivity animals; (F) raw score for animals; (G) raw score for vegetables; (H) raw score for letter A; (I) volume of right hippocampus; (J) gray matter volume score for independent component 15.

Prediction accuracy

ROC curves for the five classifiers are shown in Fig. 3. As shown in Table 4, the ensemble classifier trained only with novel scores performed the best on all measures apart from specificity, with AUC 0.872, and was significantly better than the classifier trained only with raw scores (AUC 0.719, P < .05 by DeLong test). The raw + brain ensemble showed the highest specificity (0.916). If our goal, however, is to develop an inexpensive screening test then we may place greater emphasis on sensitivity. The novel ensemble shows a clear advantage here, providing 100% sensitivity with 67.5% specificity (Fig. 3).

Fig. 3

ROC curves for the five ensemble classifiers. Novel verbal fluency scores yield the best AUC (0.872). This classifier may be thresholded to have sensitivity 1.00 with a specificity of 0.675. Abbreviations: ROC, receiver operating characteristic curve; AUC, area under the receiver operating characteristic curve.

Table 4

Quality of predictions made by the five ensemble classifiers

	AUC	F	Sensitivity	Specificity	NPV	PPV	Accuracy
Raw	0.719	0.583	0.583	0.880	0.880	0.583	0.813
Brain	0.760	0.536	0.682	0.756	0.894	0.441	0.740
Raw + Brain	0.735	0.524	0.458	0.916	0.854	0.611	0.813
Novel	0.872*	0.667	0.708	0.880	0.913	0.630	0.841
Novel + Brain	0.814	0.625	0.625	0.892	0.892	0.625	0.832

Abbreviations: AUC, area under the receiver operating characteristic curve; F, F-measure (harmonic mean of sensitivity and positive predictive value); NPV, negative predictive value; PPV, positive predictive value.

NOTE. *P < .05 compared to AUC for Raw classifier using DeLong test.
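All of the measures in Table 4 other than AUC follow from a single confusion matrix per classifier. The Python sketch below recomputes them for the Novel ensemble as an illustration. The counts are not reported in this section; they are back-calculated here under the assumption of 24 converters and 83 nonconverters among the 107 MCI patients, and should be read as inferred values, not as the authors' data.

```python
def prediction_quality(tp, fn, tn, fp):
    """Compute the Table 4 style measures from one confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # F-measure as defined in the table note: harmonic mean of sensitivity and PPV.
    f_measure = 2 * sensitivity * ppv / (sensitivity + ppv)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Assumed counts (inferred, not reported): 17 of 24 converters flagged,
# 73 of 83 nonconverters correctly cleared.
novel = prediction_quality(tp=17, fn=7, tn=73, fp=10)
for name, value in novel.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the computed values agree with the Novel row of Table 4 to within rounding.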

Discussion

The ability to rapidly, noninvasively, and inexpensively identify individuals at high risk for AD is crucial for the application of effective disease-modifying therapies for the general population [7]. Language is an abundant and readily collectible product of human cognition that is associated with neurological function and may be assessed at multiple levels of representation. Language is often noted to be disrupted early during the course of AD [38, 39, 40], suggesting that measurements of language may play an important role in predictive models of AD. Verbal fluency tasks are a simple method for quickly obtaining a constrained linguistic sample and have proven value for differentiating causes of dementia [12, 13, 14]. In this study, we examined the value of an assortment of novel verbal fluency scoring methods for predicting subsequent cognitive and functional decline in MCI patients and evaluated the potential contribution of measures from structural brain imaging for the predictive models. We achieved good prediction results using cross-validated ensembles combined linearly with LASSO logistic regression models. These findings contribute to the literature on machine learning for predicting outcomes in MCI because they highlight the value of brief, information-dense cognitive tests from which many potential predictor variables may be extracted.
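The ensemble design described above (primary classifiers trained on separate groups of scores, with their cross-validated predictions combined by LASSO logistic regression) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the primary classifiers are stand-in unpenalized logistic regressions, the L1 penalty is applied by proximal gradient descent, the fold assignment is a simple modulo split, and the data, group names, and function names are invented.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, l1=0.0, lr=0.5, epochs=300):
    """Logistic regression by proximal gradient descent; l1 > 0 adds a LASSO penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        for j in range(d):
            step = w[j] - lr * gw[j] / n
            # Soft-thresholding shrinks weights and sets small ones exactly to zero.
            w[j] = math.copysign(max(abs(step) - lr * l1, 0.0), step)
    return w, b


def predict(model, X):
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def out_of_fold(X, y, k=5):
    """Cross-validated predictions from one primary classifier."""
    preds = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, predict(model, [X[i] for i in test])):
            preds[i] = p
    return preds


def stack(groups, y, l1=0.01):
    """Combine out-of-fold predictions from each score group with an L1-penalized model."""
    names = sorted(groups)
    columns = [out_of_fold(groups[name], y) for name in names]
    meta_X = [list(row) for row in zip(*columns)]
    return names, fit_logistic(meta_X, y, l1=l1), meta_X


# Synthetic data (assumed): 80 cases, one informative and one uninformative score group.
random.seed(0)
y = [1 if i % 4 == 0 else 0 for i in range(80)]
groups = {
    "informative": [[2.0 * yi + random.gauss(0, 1)] for yi in y],
    "noise": [[random.gauss(0, 1)] for _ in y],
}
names, meta_model, meta_X = stack(groups, y)
ensemble_pred = predict(meta_model, meta_X)
print(names, meta_model)
```

Feeding out-of-fold predictions, instead of in-sample predictions, into the second-stage model keeps the combiner from rewarding primary classifiers that merely overfit their training cases.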

Predictor variables

Despite the well-known utility of traditional raw scores for diagnosis of cognitive disorders, these scores did not make large contributions to the best classifier. Other easily quantified measures, such as counts of intrusions and repetitions, demonstrated little capacity to predict MCI conversion. Several of the novel scores based on lexical similarity metrics made significant contributions to the final classifiers. Among the various tasks and types of lexical similarity, scores derived through the measurement of coherence appeared to have good utility. We note that three of the top five predictors in Table 3 were coherence measures in which the similarity metric did not coincide with the task demands (e.g., semantic similarity in the letter A task). In each case (as shown in Fig. 2), MCI converters had higher average scores than nonconverters on these measures. This finding suggests that the converters may have been more likely to be distracted by forms of word similarity that were not relevant to the current task. Despite the strong theoretical neuropsychological basis for clustering and switching scores, they were not found to be prominent among the other novel scores introduced here. However, previous work in which switching or clustering was found to have prognostic value was conducted on data from longitudinal studies with larger sample sizes and was not focused on individuals with MCI [19, 20]. Thus, future work should continue to consider the potential value of clustering and switching scores, whether calculated by the classic method or with methods based on similarity scores.
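To make the notion of a task-irrelevant coherence score concrete, the Python sketch below computes an orthographic coherence value for an animal list. This section does not give the formula used in the study, so the sketch assumes one plausible reading: coherence as the mean pairwise similarity among all words in a list, with the similarity ratio from Python's `difflib.SequenceMatcher` standing in for an orthographic similarity metric. The function names and word lists are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def orthographic_similarity(a, b):
    """Stand-in similarity in [0, 1] based on shared character runs (assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()


def coherence(words, similarity=orthographic_similarity):
    """Mean pairwise similarity over all word pairs in the list (assumed definition)."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical animal-fluency lists: spelling is irrelevant to the task demand
# (category membership), yet the first list is orthographically tightly knit.
similar_spelling = ["cat", "rat", "bat"]
varied_spelling = ["dog", "horse", "eel"]

high = coherence(similar_spelling)
low = coherence(varied_spelling)
print(high, low)
```

On this reading, a higher orthographic coherence on a semantic task would indicate that successive retrievals were being shaped by spelling overlap, a form of similarity the task does not call for.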

Imaging

Novel verbal fluency scores outperformed structural MRI measures for predicting MCI conversion. The brain-only classifier achieved an AUC of 0.760 (Table 4). This score is on par with AUC measures reported by other investigators using structural MRI and other biological measures, which have ranged from 0.734 (MRI + CSF in [41]) up to 0.843 (MRI only in [42]). Adding imaging measures raised the AUC only slightly for the raw scores and lowered it for the novel scores (Table 4); such inconsistent changes do not justify the expense of undertaking an MRI scan solely for use in these types of classifiers. However, other MRI-based measurements, such as resting state functional MRI, diffusion kurtosis imaging, magnetic resonance spectroscopy, or arterial spin labeling, may provide better predictive features.

Limitations

A few limitations of this work should be noted, as they point the way for future research along these lines. First, owing to the small sample size, it was not possible to test the final classifiers on a held-out test set. This shortcoming was mitigated by using a rigorous cross-validation method. Second, the techniques for extracting predictor variables from MRI scans, although state-of-the-art for brain-mapping purposes, are extremely labor intensive. If MRI measures are to be used to achieve our clinical goals, it will be necessary to use an automatic imaging analysis pipeline. Third, the calculation of the predictor variables for this work rests on the availability of accurate electronic transcriptions of fluency word lists. Information about the latencies of the words generated could enhance the quality of the ICA scores and thereby the predictions from them. Moreover, the need for the transcriptions adds to the cost of the technique. Automatic transcription with speech recognition technology may make future work faster, cheaper, and more accurate.

Conclusions

Using cross-validated ensembles of classifiers trained with a variety of novel verbal fluency scores, we show generally good quality predictions of MCI conversion over approximately 4 years of follow-up, with most conversions fitting the AD phenotype. Many novel scores contribute to the quality of the final classifiers. However, lexical frequency measures and certain graph theoretical scores, especially those based on coherence, stand out as having the strongest relationships with conversion risk. Verbal fluency word lists contain a great deal of information, and detailed analysis of their contents may lead to the development of a rapid, inexpensive, and noninvasive method for detecting the earliest pathophysiological changes of AD.

Systematic review: The authors reviewed the literature using traditional sources, such as journal articles, meeting abstracts, and presentations. Recent observations regarding the natural history of Alzheimer's disease (AD) suggest that some treatments (e.g., those targeting amyloid) may be most effective if administered early. There is a need for inexpensive and noninvasive methods for detecting patients likely to benefit from new treatments.

Interpretation: Our findings point to the potential value of applying machine learning methods to natural language samples to realize the goal of early AD detection.

Future directions: Further research along these lines will seek (1) to validate these findings in larger samples of patients, (2) to integrate the method with other rapid, inexpensive tests, and (3) to apply speech recognition technology for rapid transcription and scoring of natural language samples.

36 in total

1. Normative data for clustering and switching on verbal fluency tasks.

Authors: A K Troyer
Journal: J Clin Exp Neuropsychol Date: 2000-06 Impact factor: 2.475

2. A computational linguistic measure of clustering behavior on semantic verbal fluency task predicts risk of future dementia in the nun study.

Authors: Serguei V S Pakhomov; Laura S Hemmy
Journal: Cortex Date: 2013-06-14 Impact factor: 4.027

Review 3. Computational anatomical methods as applied to ageing and dementia.

Authors: P M Thompson; L G Apostolova
Journal: Br J Radiol Date: 2007-12 Impact factor: 3.039

4. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification.

Authors: Christos Davatzikos; Priyanka Bhatt; Leslie M Shaw; Kayhan N Batmanghelich; John Q Trojanowski
Journal: Neurobiol Aging Date: 2010-07-01 Impact factor: 4.673

5. Phase 3 trials of solanezumab for mild-to-moderate Alzheimer's disease.

Authors: Rachelle S Doody; Ronald G Thomas; Martin Farlow; Takeshi Iwatsubo; Bruno Vellas; Steven Joffe; Karl Kieburtz; Rema Raman; Xiaoying Sun; Paul S Aisen; Eric Siemers; Hong Liu-Seifert; Richard Mohs
Journal: N Engl J Med Date: 2014-01-23 Impact factor: 91.245

6. Effect of tarenflurbil on cognitive decline and activities of daily living in patients with mild Alzheimer disease: a randomized controlled trial.

Authors: Robert C Green; Lon S Schneider; David A Amato; Andrew P Beelen; Gordon Wilcock; Edward A Swabb; Kenton H Zavitz
Journal: JAMA Date: 2009-12-16 Impact factor: 56.272

7. Comparisons of verbal fluency tasks in the detection of dementia of the Alzheimer type.

Authors: A U Monsch; M W Bondi; N Butters; D P Salmon; R Katzman; L J Thal
Journal: Arch Neurol Date: 1992-12

8. The effects of very early Alzheimer's disease on the characteristics of writing by a renowned author.

Authors: Peter Garrard; Lisa M Maloney; John R Hodges; Karalyn Patterson
Journal: Brain Date: 2004-12-01 Impact factor: 13.501

9. Enrichment and stratification for predementia Alzheimer disease clinical trials.

Authors: Dominic Holland; Linda K McEvoy; Rahul S Desikan; Anders M Dale
Journal: PLoS One Date: 2012-10-17 Impact factor: 3.240

10. Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease.

Authors: Samrah Ahmed; Anne-Marie F Haigh; Celeste A de Jager; Peter Garrard
Journal: Brain Date: 2013-10-18 Impact factor: 13.501
