
Automatic normative quantification of brain tissue volume to support the diagnosis of dementia: A clinical evaluation of diagnostic accuracy.

Meike W Vernooij¹, Bas Jasperse², Rebecca Steketee², Marcel Koek³, Henri Vrooman³, M Arfan Ikram⁴, Janne Papma⁵, Aad van der Lugt², Marion Smits², Wiro J Niessen⁶.

Abstract

Objectives: To assess whether automated brain image analysis with quantification of structural brain changes improves diagnostic accuracy in a memory clinic setting.
Methods: In 42 memory clinic patients, we evaluated whether automated quantification of brain tissue volumes, hippocampal volume and white matter lesion volume improves diagnostic accuracy for Alzheimer's disease (AD) and frontotemporal dementia (FTD), compared to visual interpretation. Reference data were derived from a dementia-free aging population (n = 4915, aged >45 years), and were expressed as age- and sex-specific percentiles. Experienced radiologists determined the most likely imaging-based diagnosis based on structural brain MRI using three strategies (visual assessment of MRI only, quantitative normative information only, or a combination of both). Diagnostic accuracy of each strategy was calculated with the clinical diagnosis as the reference standard.
Results: Providing radiologists with only quantitative data decreased diagnostic accuracy both for AD and FTD compared to conventional visual rating. The combination of quantitative with visual information, however, led to better diagnostic accuracy compared to only visual ratings for AD. This was not the case for FTD.
Conclusion: Quantitative assessment of structural brain MRI combined with a reference standard in addition to standard visual assessment may improve diagnostic accuracy in a memory clinic setting.

Year: 2018 PMID: 30128275 PMCID: PMC6096052 DOI: 10.1016/j.nicl.2018.08.004

Source DB: PubMed Journal: Neuroimage Clin ISSN: 2213-1582 Impact factor: 4.881

Introduction

Dementia is a clinical syndrome caused by various brain diseases, of which Alzheimer's disease (AD), vascular dementia and frontotemporal dementia (FTD) are most frequent (Ferri et al., 2005). In early onset dementia AD is the most common cause (approximately 34%) but FTD is also relatively prevalent (12%) (Van der Flier and Scheltens, 2005), which frequently causes a clinical diagnostic dilemma. Dementia is diagnosed clinically as a cognitive disorder interfering with activities of daily life according to core clinical criteria. Imaging biomarkers, cognitive profiling, genetic information, and cerebrospinal fluid (CSF) biochemistry features provide supportive evidence for differential diagnosis (Dubois et al., 2007; McKhann et al., 2011). Yet, the NIA-AA criteria (McKhann et al., 2011) for AD diagnosis only have a sensitivity of 65% for distinguishing probable AD from FTD, with considerable overlap in clinical symptoms, especially in early disease stages (Harris et al., 2015). Early and accurate identification of dementia's underlying causes is important for proper and tailored patient management as well as upcoming disease-modifying treatment options (Ballard et al., 2011; Mattila et al., 2012). By visualizing structural brain changes associated with specific pathological substrates, Magnetic Resonance imaging (MRI) plays an important role in dementia diagnosis and subtype differentiation (Vernooij and Smits, 2012). MRI interpretation in dementia diagnosis can be challenging, as early brain abnormalities may be difficult to detect visually, especially in early stages of the disease. Additionally, brain changes due to a neurodegenerative disorder may be difficult to distinguish from those related to normal aging. One way to potentially improve diagnostic accuracy and confidence is to quantify brain structures from an individual patient and compare these to age- and sex-specific reference data from a healthy population (Brewer, 2009; Ross et al., 2015). 
Although several MR brain quantification methods are now becoming available and are gradually finding their way into clinical applications (Ross et al., 2015; Brewer et al., 2009; Ross et al., 2013), there is no clear concept of how they should be implemented in radiology reading or reporting practice. It is not known whether quantitative information improves diagnostic accuracy and, if so, whether it can be used in isolation or should be considered together with other imaging information. In this study, we implemented automated quantification of brain tissue volumes, hippocampal volumes and white matter lesion volumes in our memory clinic and compared these volumes to population reference data. Our aim was to compare three different strategies, namely visual rating of brain MR scans only, quantitative normative assessment only, and a combination of both visual rating and quantitative assessment, to the reference standard of multidisciplinary clinical consensus diagnosis, and to assess diagnostic agreement of these strategies between two observers.

Materials and methods

Patient population

Between December 2009 and September 2011, all new patients who visited our memory clinic and who (1) underwent MRI as part of the clinical work-up and (2) received a clinical diagnosis of AD, FTD or mild cognitive impairment (MCI) were eligible for this retrospective study. Our memory clinic is specialized in early onset dementia, hence we see a higher proportion of rare dementias (such as FTD) and patients with early disease onset. A total of 42 patients were eligible: 21 patients with AD, 15 with FTD, and 6 with MCI. The clinical diagnosis was based on expert panel consensus using standard diagnostic criteria (McKhann et al., 2011; Rascovsky et al., 2011) and all available information, including neuropsychological information, brain MRI, CSF (if available) and neurological examination. Brain MRI scans were acquired at 3.0 T (GE Healthcare, US), according to a standardized protocol, including sagittal 3D T1-weighted (T1w) inversion recovery (IR) fast spoiled gradient recalled echo (FSPGR) scans with axial and coronal reconstructions (perpendicular to the long axis of the hippocampus); fluid attenuated inversion recovery (FLAIR) scans; and T2w scans. Supplementary Table 1 provides all relevant MRI parameters.

Reference population

Reference data were obtained from 4915 non-demented participants (mean age 64 years, range 45.7–100.0) from a population-based longitudinal study among community-dwelling subjects (Hofman et al., 2015; Ikram et al., 2015). All scans were acquired on a single 1.5 T MR imaging system (GE Healthcare, US). The imaging protocol (Supplementary Table 1) included a 3D T1w IR-FSPGR, a proton density (PD)–weighted sequence and a FLAIR sequence. The PD sequence was applied with a long TR, resulting in bright CSF as in T2w images.

Brain tissue, white matter lesion and hippocampal volume quantification

Gray matter (GM), white matter (WM), white matter lesions (WML) and CSF segmentation was performed with a fully automated method (Vrooman et al., 2007) extended with WML segmentation (de Boer et al., 2009). This involved the segmentation of CSF, GM, and WM by an atlas-based k-nearest neighbor classifier on the MRI data. The classifier was trained by registering brain atlases to the subjects (Vrooman et al., 2007). The GM classification was then used to determine a WML intensity threshold value in a FLAIR scan. Applying this threshold to the FLAIR scan yielded the WML segmentation (de Boer et al., 2009). Total brain volume was calculated by summing WM, GM and WML volumes. Intracranial volume was defined as the sum of total brain volume and CSF volumes. T1w scans were processed using FreeSurfer (4.5.0) to obtain hippocampal volumes (Dale et al., 1999; Desikan et al., 2006).
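The two derived volumes defined above are simple sums of the segmented compartments. A minimal sketch of that bookkeeping is given below; the class name, field names, and the example volumes (in ml) are illustrative assumptions and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class TissueVolumes:
    """Segmented compartment volumes in ml (hypothetical container)."""
    gm: float   # gray matter
    wm: float   # normal-appearing white matter
    wml: float  # white matter lesions
    csf: float  # cerebrospinal fluid

    @property
    def total_brain(self) -> float:
        # Total brain volume = WM + GM + WML, as defined in the text.
        return self.gm + self.wm + self.wml

    @property
    def intracranial(self) -> float:
        # Intracranial volume = total brain volume + CSF.
        return self.total_brain + self.csf


# Made-up example values, for illustration only.
example = TissueVolumes(gm=600.0, wm=450.0, wml=10.0, csf=240.0)
```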

Lobar volume quantification

To obtain lobar brain volumes, a multi-atlas approach was used (Vibha et al., 2018). Six template scans (atlases) were created in which the frontal, parietal, temporal and occipital lobes of the left and right hemisphere were outlined (Bokde et al., 2005). These atlases were non-rigidly registered (Klein et al., 2010) to a subject brain MRI and labels were assigned to each voxel using majority voting. By combining this lobar mask with the original tissue segmentation, volumes for each brain lobe were calculated (Ikram et al., 2010). Fig. 1 provides an example of the atlas and segmentation results. Intracranial volume (ICV) was used to correct for inter-individual differences in head size, by dividing each volume by ICV in each subject.
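The label fusion and head-size correction described above can be sketched as follows. This is a simplified illustration on flat lists of voxel labels, not the pipeline used in the study: the function names, the tie-breaking behaviour (first label encountered wins), and the toy data are assumptions.

```python
from collections import Counter


def majority_vote(atlas_labels):
    """Fuse per-voxel lobe labels from several registered atlases.

    atlas_labels: one label sequence per atlas, all of equal length.
    Ties go to the label seen first (an arbitrary choice made here).
    """
    n_voxels = len(atlas_labels[0])
    fused = []
    for i in range(n_voxels):
        votes = Counter(labels[i] for labels in atlas_labels)
        fused.append(votes.most_common(1)[0][0])
    return fused


def lobar_fraction_of_icv(fused, tissue, lobe, voxel_ml, icv_ml):
    """Brain-tissue volume of one lobe divided by intracranial volume."""
    brain_tissue = {"GM", "WM", "WML"}
    n = sum(1 for lab, t in zip(fused, tissue) if lab == lobe and t in brain_tissue)
    return n * voxel_ml / icv_ml


# Toy example: six atlases voting on three voxels.
atlases = [
    ["frontal", "frontal", "parietal"],
    ["frontal", "parietal", "parietal"],
    ["frontal", "frontal", "parietal"],
    ["frontal", "frontal", "temporal"],
    ["parietal", "frontal", "parietal"],
    ["frontal", "frontal", "parietal"],
]
fused = majority_vote(atlases)
tissue = ["GM", "WM", "CSF"]  # tissue class per voxel from the tissue segmentation
frontal_fraction = lobar_fraction_of_icv(fused, tissue, "frontal", voxel_ml=1.0, icv_ml=4.0)
```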

Fig. 1

Brain segmentation.

Visual inspection

All patient tissue and lobar segmentation results were visually checked for segmentation errors, revealing no substantial errors. No manual corrections were performed, as this would ultimately hamper translating the workflow to clinical practice. For the 4945 reference subjects, outliers (defined as values more than 2.0 standard deviations from the mean) were found for total brain (n = 134), white matter lesion (n = 66) and hippocampal volumes (n = 172). Outliers were visually checked and, if they were caused by segmentation errors, bad scan quality or significant structural abnormalities, the scans were excluded (n = 30), resulting in 4915 scans for creating reference curves.
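The outlier screen can be illustrated with a short sketch. It assumes that an outlier is any value lying more than 2.0 sample standard deviations from the overall mean; whether the study applied the rule to raw or age-adjusted volumes is not stated, and the function name and data are hypothetical. Flagged scans would still go to visual review rather than being excluded automatically.

```python
from statistics import mean, stdev


def flag_outliers(values, n_sd=2.0):
    """Return indices of values more than n_sd sample SDs from the mean."""
    m = mean(values)
    s = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > n_sd * s]


# Made-up total brain volumes (%ICV); the last value is implausibly low.
volumes = [80, 81, 82, 83, 84, 85, 86, 87, 88, 60]
flagged = flag_outliers(volumes)
```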

Reference curves

Age- and sex-specific percentile curves were generated for each quantitative parameter (total brain, lobar brain, hippocampal and WML volumes) using the LMS method (Cole and Green, 1992). Percentile curves (Fig. 2) were generated using the VGAM (1.0–0) package for R (3.2.3).

Fig. 2

Percentile curve for total brain volume.

For each patient, the age-appropriate percentile value, referred to as “Volume percentile” (Vperc), was calculated for each of the brain volumes and plotted on the reference curves.
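To make the Vperc concept concrete: in the LMS method, the reference distribution at a given age and sex is summarised by three parameters, the Box-Cox power L, the median M and the coefficient of variation S. A measurement y is converted to a z-score, z = ((y/M)^L − 1)/(L·S) for L ≠ 0 and z = ln(y/M)/S for L = 0, and the percentile follows from the standard normal distribution. The sketch below illustrates that conversion only; the study fitted its curves with VGAM in R, and the parameter values used here are invented for illustration.

```python
import math


def lms_zscore(y, L, M, S):
    """LMS z-score of measurement y given age- and sex-specific L, M, S."""
    if abs(L) < 1e-12:
        return math.log(y / M) / S
    return ((y / M) ** L - 1.0) / (L * S)


def percentile_from_z(z):
    """Percentile (0-100) from a z-score via the standard normal CDF."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical reference parameters for total brain volume (%ICV) at one age.
L_ref, M_ref, S_ref = 1.0, 84.0, 0.02
vperc_at_median = percentile_from_z(lms_zscore(84.0, L_ref, M_ref, S_ref))
vperc_low = percentile_from_z(lms_zscore(82.32, L_ref, M_ref, S_ref))  # about 1 SD below the median
```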

Rating strategies

Two experienced neuro-radiologists (M.S. and M.W.V., each with more than three years of experience in reading memory clinic scans), blinded to all patient characteristics except age and sex, independently provided an imaging-based diagnosis. To reflect a realistic clinical scenario, the raters selected a diagnosis from three categories: AD, FTD, or alternative diagnosis (including no dementia). They were unaware of the proportion of AD, FTD and MCI in the sample. We assessed three diagnostic strategies. Firstly, a visual interpretation of the brain MR imaging scans was performed. Raters interpreted patterns of atrophy and presence of vascular lesions using the 3D T1w FSPGR (including coronal reformats), the T2w, and the T2w-FLAIR sequences by applying standardized visual rating scales such as the global cortical atrophy scale and Koedam scale for lobar atrophy, the medial temporal atrophy scale for hippocampal atrophy, and the Fazekas scale for WML (Scheltens et al., 1998; Scheltens et al., 1995; Pasquier et al., 1996; Koedam et al., 2011). Each rater independently based their final diagnosis on the combination of these visual ratings. Secondly, both raters were provided with Vperc only and gave a diagnosis based solely on these values. Raters were free to interpret the Vperc as they saw fit (no cut-off values were prescribed). As quantitative normative assessment is an evolving concept, we made a deliberate choice not to provide the raters with directions or cut-off values, to be able to assess the effect of having quantitative information available for diagnosis. Additionally, assessing relative values, i.e. Vperc of one structure compared to other structures, is as important as applying absolute cut-offs to separate regions. Thirdly, the raters reviewed the brain MRI together with the associated Vperc to come to a diagnosis.
The above strategies were each separated by three months, with patient identification numbers altered to ensure that current assessments could not be related to previous assessments.

Statistical analysis

For all combinations of assessment strategy and rater, diagnostic accuracy for AD and FTD diagnosis was determined as the sum of the true positive and true negative cases divided by the total number of cases. Differences in accuracies between strategies for each diagnosis and for each rater were assessed with McNemar tests. Inter-rater agreement per strategy was calculated using Cohen's κ. In addition to the cross-sectional analysis, we also used follow up information (mean follow up 2.8 years, range 0–6.1 years) for possible change in clinical diagnosis and recalculated diagnostic accuracies. Finally, to assess the performance of subjective interpretation of the quantitative information by the clinicians (i.e. without specific cut-offs) in comparison to the use of absolute cut-offs, we determined optimal cut-off values to discriminate between diagnoses, based on Vperc of relevant brain regions (MCI versus AD & FTD based on hippocampal Vperc; FTD versus AD & MCI based on frontal and temporal Vperc, and AD versus FTD & MCI based on hippocampal and parietal Vperc). For each cut-off point, we calculated the distance from the maximum sensitivity and specificity as follows: $\text{distance} = \sqrt{(1 - \text{sensitivity})^2 + (1 - \text{specificity})^2}$, and subsequently located the point where distance was minimal. We compared the diagnostic accuracy at this optimal value with the performance of both raters. We separately assessed the correlation between visual rating of WML burden (Fazekas score (Scheltens et al., 1998)) and automated quantification. Spearman rank correlation was calculated between the raters' Fazekas scores and total WML volume (% of ICV) and between Fazekas scores and the WML Vperc. Statistical analyses of diagnostic accuracy (i.e. of assessment strategy; rater; and optimal cut-off values) and of inter-rater agreement were performed with SPSS version 22.
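The cut-off search described above picks the threshold whose (sensitivity, specificity) pair lies closest to the ideal point (1, 1). A minimal sketch for a single Vperc measure follows. It assumes that a patient is classified as a case when Vperc is at or below the cut-off (low percentile means atrophy); the function name and the data are hypothetical, and the study's analysis combined more than one region for some comparisons.

```python
import math


def optimal_cutoff(vperc, is_case):
    """Return (cutoff, sensitivity, specificity, distance) minimising the
    distance to perfect sensitivity and specificity.

    A subject is called positive when vperc <= cutoff (assumption).
    """
    best = None
    for c in sorted(set(vperc)):
        tp = sum(1 for v, d in zip(vperc, is_case) if d and v <= c)
        fn = sum(1 for v, d in zip(vperc, is_case) if d and v > c)
        tn = sum(1 for v, d in zip(vperc, is_case) if not d and v > c)
        fp = sum(1 for v, d in zip(vperc, is_case) if not d and v <= c)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        dist = math.hypot(1.0 - sens, 1.0 - spec)
        if best is None or dist < best[3]:
            best = (c, sens, spec, dist)
    return best


# Made-up hippocampal percentiles: five cases and four non-cases.
vperc = [1, 3, 5, 8, 12, 30, 45, 60, 80]
is_case = [True, True, True, True, False, True, False, False, False]
cutoff, sens, spec, dist = optimal_cutoff(vperc, is_case)
```

Because the cut-off is tuned on the same data it is evaluated on, the resulting accuracy is optimistic; a held-out sample or cross-validation would be needed for an unbiased estimate.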
To assess differences in accuracies between strategies for each diagnosis and for each rater, Python version 2.7.11+ and the McNemar implementation of the statsmodels python package (version 0.6.1) were used. An α of 0.05 was considered as threshold for statistical significance.
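For readers unfamiliar with the McNemar test on paired accuracies, the sketch below computes the exact (binomial) two-sided p-value from the two discordant counts. It is a standalone illustration written with the standard library rather than a reproduction of the statsmodels call used in the study, and the paper does not report the discordant counts, so the numbers in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: cases strategy 1 classified correctly and strategy 2 incorrectly.
    c: cases strategy 2 classified correctly and strategy 1 incorrectly.
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for two strategies rated on the same patients.
p_example = mcnemar_exact(1, 7)
```

Only patients on whom the two strategies disagree carry information here, which is why a modest gain in accuracy in a sample of 42 can fail to reach significance.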

Results

Table 1 shows patient characteristics. Mean age of AD patients was 66.1 +/− 8.5 years, and 43% were female; FTD patients were on average younger (60.0 +/− 6.5 years), and 40% were female.

Table 1

Patient characteristics.

	MCI (N = 6)	AD (N = 21)	FTD (N = 15)	Total (N = 42)
Age in y, mean (SD)	63.2 (8.5)	66.1 (8.5)	60.0 (6.5)	63.5 (8.1)
Gender, male:female	2:4	12:9	9:6	23:19
Duration in y, mean (SD)	1.1 (1.8–3.6)	2.1 (1.1–3.3)	2.1 (1.5–3.0)	2.1 (1.3–3.2)
MMSE, median (IQR)	25.5 (23.0–26.8)	24.0 (22.0–25.0)	25.5 (22.5–29.0)	24.5 (22.3–27.0)

MCI = mild cognitive impairment, AD = Alzheimer's disease, FTD = frontotemporal dementia, SD = standard deviation, IQR = interquartile range.

Table 2 shows regional brain and lesion volumes expressed as percentage of intracranial volume (%ICV) and as age-appropriate Vperc. Differences in regional brain volumes between patient groups were more evident when expressed as Vperc than as %ICV. These differences were evident in brain regions that are known to be affected in AD and FTD (i.e. hippocampus, frontal and temporal lobes).

Table 2

Median (IQR) of brain, lobar, hippocampal and white matter lesion volumes expressed as percentage of intracranial volume (%ICV) and as volume percentiles (Vperc) for the diagnosis groups.

Volume	%ICV			Vperc
	MCI	AD	FTD	MCI	AD	FTD
Total brain	84.8(84.1–86.6)	83.6(82.5–84.9)	82.0(80.9–84.3)	86.6(52.5–98.2)	73.8(25.6–88.5)	16.2(4.2–61.0)
WML	1.0(0.3–1.7)	1.1(0.3–2.0)	0.4(0.2–0.9)	78.0(67.9–95.2)	88.5(75.4–95.0)	82.4(54.2–96.3)
Frontal right	14.2(14.0–14.4)	13.8(13.4–14.3)	13.2(12.5–14.3)	30.4(21.8–33.1)	10.6(5.1–47.4)	0.9(0.0–46.9)
Frontal left	14.3(14.2–14.5)	14.0(13.3–14.9)	13.2(12.0–14.9)	29.8(13.7–42.4)	19.8(3.8–73.9)	0.6(0.0–68.8)
Temporal right	8.6(8.4–9.3)	8.3(7.9–8.5)	7.9(7.9–8.9)	63.6(26.0–96.5)	19.2(4.8–47.1)	4.9(2.3–74.2)
Temporal left	8.3(8.0–8.4)	7.7(7.5–7.9)	7.4(6.9–7.9)	89.3(57.2–94.1)	29.9(17.6–50.6)	6.4(0.2–39.2)
Parietal right	8.7(8.6–9.0)	8.8(8.4–9.1)	8.9(8.1–9.4)	78.8(60.1–95.5)	82.0(51.3–98.0)	93.1(25.3–99.8)
Parietal left	9.3(9.1–9.4)	9.3(9.1–9.7)	9.7(9.4–10.1)	81.8(64.2–90.4)	78.7(49.1–98.4)	96.8(83.0–99.7)
Occipital right	5.5(5.2–5.8)	5.3(5.2–5.7)	5.4(5.2–6.0)	69.1(45.4–91.3)	60.4(37.6–91.9)	59.2(32.5–98.3)
Occipital left	5.4(5.2–5.7)	5.3(5.0–5.5)	5.5(5.3–5.8)	85.9(62.0–98.6)	72.7(29.0–86.8)	88.5(72.1–98.6)
Right hippocampus	0.357(0.310–0.388)	0.282(0.258–0.322)	0.301(0.276–0.338)	28.3(7.9–69.8)	2.28(0.7–12.4)	2.7(0.6–14.6)
Left hippocampus	0.363(0.262–0.402)	0.290(0.247–0.328)	0.290(0.263–0.313)	32.3(0.7–63.5)	3.2(0.4–14.0)	1.8(0.2–6.1)

IQR = interquartile range

%ICV = percentage of intracranial volume.

Vperc = volume percentile (age appropriate percentile value calculated for each of the volumes, plotted on the reference curves).

MCI = mild cognitive impairment, AD = Alzheimer's disease, FTD = frontotemporal dementia.

WML = white matter lesion.

Table 3 shows diagnostic accuracy for AD and FTD for both observers and all three strategies. Table 4 shows the results of the comparisons of accuracy between strategies (McNemar test). For neither AD nor FTD did the use of quantitative information alone improve diagnostic accuracy compared to visual assessment only. Moreover, for FTD this strategy led to a significantly worse accuracy (p ≤ 0.01 for both raters). For AD, diagnostic accuracy improved with the combined visual/quantitative strategy compared to visual assessment only (73.8% for both raters, compared to 59.5% for rater A (p = .03) and 66.7% for rater B (p = .45) with the visual only strategy). For FTD, visual assessment only performed slightly better than the combined visual/quantitative strategy, but this difference was not statistically significant (p = .4); diagnostic accuracy was high with both strategies.

Table 3

Accuracy of the different rating scenarios per observer.

Observer	Disease	Scenario	(TP + TN)/all	Accuracy (%)
A	AD	Visual only	25/42	59.5
		Quantitative only	22/42	52.4
		Combined	31/42	73.8
	FTD	Visual only	38/42	90.5
		Quantitative only	28/42	66.7
		Combined	35/42	83.3
B	AD	Visual only	28/42	66.7
		Quantitative only	26/42	61.9
		Combined	31/42	73.8
	FTD	Visual only	36/42	85.7
		Combined	33/42	78.6

AD = Alzheimer's disease; FTD = frontotemporal dementia; TP = true positive; TN = true negative.

Accuracy was calculated as (TP + TN)/all subjects × 100%.
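As a quick arithmetic check, the accuracy column of Table 3 can be recomputed from the listed counts of correct classifications. The sketch below uses only the counts shown in the table; the function and dictionary names are illustrative.

```python
def accuracy_percent(n_correct, n_total):
    """Accuracy as (TP + TN) / all subjects x 100, rounded to one decimal."""
    return round(100.0 * n_correct / n_total, 1)


# (observer, disease, scenario) -> (TP + TN, reported accuracy), from Table 3.
table3 = {
    ("A", "AD", "Visual only"): (25, 59.5),
    ("A", "AD", "Quantitative only"): (22, 52.4),
    ("A", "AD", "Combined"): (31, 73.8),
    ("A", "FTD", "Visual only"): (38, 90.5),
    ("A", "FTD", "Quantitative only"): (28, 66.7),
    ("A", "FTD", "Combined"): (35, 83.3),
    ("B", "AD", "Visual only"): (28, 66.7),
    ("B", "AD", "Quantitative only"): (26, 61.9),
    ("B", "AD", "Combined"): (31, 73.8),
    ("B", "FTD", "Visual only"): (36, 85.7),
    ("B", "FTD", "Combined"): (33, 78.6),
}

recomputed = {key: accuracy_percent(correct, 42) for key, (correct, _) in table3.items()}
```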

Table 4

Comparison of the three different strategies.

		AD		FTD
Observer	Comparison	Higher accuracy for	P*	Higher accuracy for	P*
A	visual vs Vperc	Visual	0.55	Visual	0.01
A	visual vs combined	Combined	0.03	Visual	0.4
A	Vperc vs combined	Combined	0.02	Combined	0.04
B	visual vs Vperc	Visual	0.8	Visual	0.002
B	visual vs combined	Combined	0.45	Visual	0.4
B	Vperc vs combined	Combined	0.27	Combined	0.02

Abbreviations: Vperc = Volume percentile; * = p-value for comparison of diagnostic accuracy between two strategies (McNemar test).

During a mean follow up of 2.8 years (range 0.0–6.1 years), 7 of the 42 initial diagnoses had changed. This did not lead to a change in accuracy of any of the strategies (Supplementary Table 2). Supplementary Table 3 shows the comparison of automated classification accuracy using optimal cut-off values with rater classification. Using these optimal cut-off values improved performance, especially for FTD. The correlation between Fazekas score (0–3) and total WML volume expressed as %ICV was high, with a Spearman correlation coefficient of 0.75 (for both raters, p < .01). This correlation dropped to 0.57 (p < .01) for Fazekas score and WML Vperc, with most variation in Vperc for the Fazekas scores of 0 and 1 (Supplementary Fig. 1). Inter-rater agreement between both observers was 69% (kappa 0.55, p < .01) when using visual assessment only, 62% (kappa 0.42, p < .01) when solely using quantitative information, and 67% (kappa 0.5, p < .01) when combining visual and quantitative assessment for diagnosis.

Discussion

In the setting of a memory clinic, we evaluated how adding quantitative volumetric brain data and population reference data affect the accuracy of radiologists' MR imaging-based dementia diagnosis. Providing experienced radiologists with only quantitative data significantly decreased diagnostic accuracy compared to conventional visual rating methods. Yet, the combination of quantitative data with visual rating of brain MR imaging suggested better diagnostic accuracy of AD, but not that of FTD. Strengths of our study are the large dataset of reference subjects from the general population, enabling us to compare patient data to age and sex-specific normative volumetric data. The automated algorithms that were used can be easily implemented in a general clinical setting. We normalized for intracranial volume, as differences in head size would otherwise preclude a fair comparison between individuals. This is illustrated by our findings in WML: we found a high correlation between the visual Fazekas score and automated absolute WML volume, which decreased when WML were age- and sex-adjusted as volume percentile. Lower visual WML scores in particular showed a wide range of variation in percentile WML load. Although this needs further investigation, it suggests that quantifying WML relative to normal aging may be more sensitive for identifying subjects with a ‘higher than normal’ relative WML load. There are also limitations that need to be considered. Firstly, the sample size was modest. This is inherent to the nature of the sample, as it included only patients with early onset (<65 years) dementia, which is less prevalent than late onset dementia. We specifically selected this sample because diagnosis in early onset dementia is much more challenging than in late onset dementia. A second limitation is that subjects in the reference population were all scanned on a single scanner using the same scan parameters. 
Patients were scanned using a different scanner, with different field strength and scan parameters. These differences may hamper comparison between subjects and application of absolute volume cut-off values. However, relative comparisons of volumes between regions within one patient will still be valid. Inter-scanner effects have been studied (Cover et al., 2011; Wolz et al., 2014; Opfer et al., 2016; Abdulkadir et al., 2011; Kruggel et al., 2010) and future studies should focus on developing quantitative markers that are robust to inter-scanner differences (Puonti et al., 2013; van Opbroek et al., 2015). Finally, as reference standard we used the clinical diagnosis based on established criteria and including the full clinical picture. Although this is the most optimal diagnosis in the setting of lack of pathological confirmation, the clinical diagnosis may still be wrong, especially in the early disease stages. We specifically investigated this issue by repeating analyses with available follow up data and found that diagnostic accuracy did not change substantially. Another issue is that patients may have mixed or multiple pathologies, which is difficult to detect clinically, but may be detected better by volumetric quantification. Our current study design would not be able to show this potential advantage. Having a group of MCI patients as a control condition instead of cognitively normal subjects or ‘healthy controls’ may also have attenuated our ability to distinguish the three groups, since MCI is a heterogeneous group among which subjects may have brain changes that are in the spectrum of AD abnormalities. Yet, this composition of the patient group optimally reflects the clinical setting, as MCI is a very common alternative diagnosis in a memory clinic population. Of greater importance than the absolute diagnostic performance is the comparison between the three strategies. 
For AD the best diagnostic strategy appears to be using quantitative information combined with visual inspection, providing the highest accuracy and highest interrater agreement. The addition of quantitative information to visual inspection may provide added value to the experienced rater, either by providing clues for interpreting the quantitative information or by directing attention to brain regions that may only show subtle changes on visual inspection. The added value of quantitative information was solely present for diagnosis of AD and not for FTD. This was rather unexpected, but may be due to patterns of atrophy being more visually obvious in FTD than in AD, even in the early stage of disease, as evidenced by the high diagnostic accuracy of visual inspection alone. At present, accuracy is not yet sufficient for clinical implementation (ranging from 52.4–66.7% when clinicians subjectively interpreted quantitative information only and from 73.8–83.3% when they combined quantitative and visual information). Longitudinal imaging may further improve performance of quantitative assessment, as the accelerated rate of atrophy associated with progression of the disease will probably be more evident in the quantitative information. Relative regional decreases in volume in particular will facilitate (differential) diagnosis (Mak et al., 2015; Scahill and Fox, 2007). Still, the discrepancy between the value of quantitative information and visual information for diagnosis seems to be in contrast with other studies investigating the relationship between qualitative and quantitative assessment of MRI for dementia diagnosis. For example Harper et al. (Harper et al., 2016) found a high correlation between regional visual scales and voxel-based morphometry (VBM). Our study was however not limited to (disease specific) regions. 
Moreover, in the Harper study, VBM results were not corrected for age, which may have resulted in more exaggerated measures of volume loss, due to both aging and neurodegeneration, than in our study. Although vascular dementia patients were not included, we evaluated the agreement between a qualitative and quantitative assessment of WML, which could be used in the context of diagnosing vascular dementia. The correlation between the visual Fazekas score and automated absolute WML volume was very high, but decreased when WML were age- and sex-adjusted (as Vperc). In particular the lower visual WML scores showed a wide range of variation in percentile WML load. Although this needs further investigation, this may indicate that quantitative evaluation of WML against the background of normal aging may be more sensitive to identifying subjects with a ‘higher than normal’ relative WML load. Interestingly, dementia diagnosis based solely on quantitative information had poor accuracy as well as low concordance between raters, lower than based on visual inspection alone. We therefore also evaluated automated classification when using optimal thresholds of the quantitative image features, which did improve on rater accuracy. It should be noted that selecting optimal cut-points on the data causes overestimation of the performance. Nevertheless, it suggests that interpretation of percentile curves warrants new guidelines for interpretation, and more experience, to improve diagnostic accuracy. At present, our results suggest that quantitative image information should not be used as stand-alone information, without visual inspection of scans. Future studies should focus on providing cut-off values to determine ‘significant atrophy’ or guidelines on how to interpret the quantitative information, also to rule out training effects that may arise due to the novelty of the method. 
In the current study, we aimed to simulate the current clinical process, which uses hippocampal volume and lobar volume as the most important diagnostic imaging markers in dementia. Our objective was to investigate whether normative values for these structures improved or at least resembled the accuracy of the visual assessment. However, research literature has put forward several potentially specifically affected structures in neurodegeneration (e.g. entorhinal cortex and subcortical structures such as caudate and putamen), and providing percentiles of these structures could potentially be informative to the raters. This would however introduce additional hurdles for interpretation and an important learning curve. Nevertheless, our implementation does provide the opportunity to add more elaborate and refined features of brain volume loss, which may ultimately exceed visual rating performance. This process may be extended by exploiting many more image features or image information by employing e.g. machine learning or deep learning approaches, which have received increased interest in recent years and have the potential to improve subject classification. Future efforts could therefore be directed towards training diagnostic classifiers based on multiple imaging markers extracted from the reference data, or directly investigating which information in the reference imaging data is most informative for differential diagnosis. In a challenge comparing performance of computer-aided diagnosis algorithms to classify subjects into normal controls, MCI and AD it was shown that methods including more imaging biomarkers (e.g. hippocampal volume, shape and texture) performed best (Bron et al., 2015). Future research should focus on determining which (combination of) quantitative imaging biomarkers is most informative in computer-aided diagnosis of dementia. 
In view of our results it is to be expected that providing reference curves of such imaging biomarkers, in combination with visual assessment, will provide the most accurate diagnosis of AD. In conclusion, this study indicates that age-appropriate percentile values of automatically quantified regional brain volumes may improve accuracy and inter-rater agreement of the radiological diagnosis of AD. Further studies should focus on overcoming the present technical limitations and on developing guidelines on the interpretation of such quantitative biomarkers.

Supplementary material

Supplementary Fig. 1. Fazekas scores and Vperc for white matter lesions. Box and whisker plots showing 25th, 50th and 75th percentiles (boxes) and extremes (whiskers) for the volume of white matter lesions in cm³ (WML, top) and volume percentiles (Vperc, bottom) for each Fazekas score by observer A (left) and observer B (right).

Funding

Wiro Niessen and Meike Vernooij were partially funded by the EU FP7 framework project VPH-Dare-IT (601055) and the Horizon 2020 project EuroPOND (666992). Funding for this project was further provided by the Coolsingel Foundation (‘Stichting Coolsingel’) under project nr. 2012–86 (‘Automatic digital assessment of brain scans’).

Fig. 1. Typical segmentation result for automated lobar segmentation (left panel, showing frontal and parietal lobes) and automated tissue segmentation (right panel, with white matter lesions indicated in red).

Fig. 2. Example of percentile curve showing total brain volume for male subjects derived from the reference population. The curve shows the decrease of total brain volume (y-axis; expressed as percentage of intracranial volume) in relation to age (x-axis). The lines indicate different percentile lines (range from 5th to 95th percentile; with the green line indicating the 50th percentile line).

39 in total

1. Atlas based brain volumetry: How to distinguish regional volume changes due to biological or physiological effects from inherent noise of the methodology.

Authors: Roland Opfer; Per Suppa; Timo Kepp; Lothar Spies; Sven Schippling; Hans-Jürgen Huppertz
Journal: Magn Reson Imaging Date: 2015-12-23 Impact factor: 2.546

2. Multi-spectral brain tissue segmentation using automatically trained k-Nearest-Neighbor classification.

Authors: Henri A Vrooman; Chris A Cocosco; Fedde van der Lijn; Rik Stokking; M Arfan Ikram; Meike W Vernooij; Monique M B Breteler; Wiro J Niessen
Journal: Neuroimage Date: 2007-05-21 Impact factor: 6.556

3. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest.

Authors: Rahul S Desikan; Florent Ségonne; Bruce Fischl; Brian T Quinn; Bradford C Dickerson; Deborah Blacker; Randy L Buckner; Anders M Dale; R Paul Maguire; Bradley T Hyman; Marilyn S Albert; Ronald J Killiany
Journal: Neuroimage Date: 2006-03-10 Impact factor: 6.556

4. elastix: a toolbox for intensity-based medical image registration.

Authors: Stefan Klein; Marius Staring; Keelin Murphy; Max A Viergever; Josien P W Pluim
Journal: IEEE Trans Med Imaging Date: 2009-11-17 Impact factor: 10.048

Review 5. Alzheimer's disease.

Authors: Clive Ballard; Serge Gauthier; Anne Corbett; Carol Brayne; Dag Aarsland; Emma Jones
Journal: Lancet Date: 2011-03-01 Impact factor: 79.321

Review 6. White matter changes on CT and MRI: an overview of visual rating scales. European Task Force on Age-Related White Matter Changes.

Authors: P Scheltens; T Erkinjunti; D Leys; L O Wahlund; D Inzitari; T del Ser; F Pasquier; F Barkhof; R Mäntylä; J Bowler; A Wallin; J Ghika; F Fazekas; L Pantoni
Journal: Eur Neurol Date: 1998 Impact factor: 1.710

7. Man Versus Machine Part 2: Comparison of Radiologists' Interpretations and NeuroQuant Measures of Brain Asymmetry and Progressive Atrophy in Patients With Traumatic Brain Injury.

Authors: David E Ross; Alfred L Ochs; Megan E DeSmit; Jan M Seabaugh; Michael D Havranek
Journal: J Neuropsychiatry Clin Neurosci Date: 2015 Impact factor: 2.198

8. White matter lesion extension to automatic brain tissue segmentation on MRI.

Authors: Renske de Boer; Henri A Vrooman; Fedde van der Lijn; Meike W Vernooij; M Arfan Ikram; Aad van der Lugt; Monique M B Breteler; Wiro J Niessen
Journal: Neuroimage Date: 2009-05-01 Impact factor: 6.556

9. Smoothing reference centile curves: the LMS method and penalized likelihood.

Authors: T J Cole; P J Green
Journal: Stat Med Date: 1992-07 Impact factor: 2.373

10. Longitudinal assessment of global and regional atrophy rates in Alzheimer's disease and dementia with Lewy bodies.

Authors: Elijah Mak; Li Su; Guy B Williams; Rosie Watson; Michael Firbank; Andrew M Blamire; John T O'Brien
Journal: Neuroimage Clin Date: 2015-02-07 Impact factor: 4.881

9 in total

Review 1. The quantitative neuroradiology initiative framework: application to dementia.

Authors: Olivia Goodkin; Hugh Pemberton; Sjoerd B Vos; Ferran Prados; Carole H Sudre; James Moggridge; M Jorge Cardoso; Sebastien Ourselin; Sotirios Bisdas; Mark White; Tarek Yousry; John Thornton; Frederik Barkhof
Journal: Br J Radiol Date: 2019-08-01 Impact factor: 3.039

2. Automated quantitative MRI volumetry reports support diagnostic interpretation in dementia: a multi-rater, clinical accuracy study.

Authors: Hugh G Pemberton; Olivia Goodkin; Ferran Prados; Ravi K Das; Sjoerd B Vos; James Moggridge; William Coath; Elizabeth Gordon; Ryan Barrett; Anne Schmitt; Hefina Whiteley-Jones; Christian Burd; Mike P Wattjes; Sven Haller; Meike W Vernooij; Lorna Harper; Nick C Fox; Ross W Paterson; Jonathan M Schott; Sotirios Bisdas; Mark White; Sebastien Ourselin; John S Thornton; Tarek A Yousry; M Jorge Cardoso; Frederik Barkhof
Journal: Eur Radiol Date: 2021-01-15 Impact factor: 5.315

3. Dementia imaging in clinical practice: a European-wide survey of 193 centres and conclusions by the ESNR working group.

Authors: M W Vernooij; F B Pizzini; R Schmidt; M Smits; T A Yousry; N Bargallo; G B Frisoni; S Haller; F Barkhof
Journal: Neuroradiology Date: 2019-03-09 Impact factor: 2.804

4. An MRI-based strategy for differentiation of frontotemporal dementia and Alzheimer's disease.

Authors: Qun Yu; Yingren Mai; Yuting Ruan; Yishan Luo; Lei Zhao; Wenli Fang; Zhiyu Cao; Yi Li; Wang Liao; Songhua Xiao; Vincent C T Mok; Lin Shi; Jun Liu
Journal: Alzheimers Res Ther Date: 2021-01-12 Impact factor: 6.982

5. White matter microstructure alterations in frontotemporal dementia: Phenotype-associated signatures and single-subject interpretation.

Authors: Mary Clare McKenna; Marlene Tahedl; Aizuri Murad; Jasmin Lope; Orla Hardiman; Siobhan Hutchinson; Peter Bede
Journal: Brain Behav Date: 2022-01-24 Impact factor: 2.708

6. Comparing two artificial intelligence software packages for normative brain volumetry in memory clinic imaging.

Authors: Jacob J Visser; Rebecca M E Steketee; Lara A M Zaki; Meike W Vernooij; Marion Smits; Christine Tolman; Janne M Papma
Journal: Neuroradiology Date: 2022-01-15 Impact factor: 2.995

Review 7. Quantification of amyloid PET for future clinical use: a state-of-the-art review.

Authors: Hugh G Pemberton; Lyduine E Collij; Fiona Heeman; Ariane Bollack; Mahnaz Shekari; Gemma Salvadó; Isadora Lopes Alves; David Vallez Garcia; Mark Battle; Christopher Buckley; Andrew W Stephens; Santiago Bullich; Valentina Garibotto; Frederik Barkhof; Juan Domingo Gispert; Gill Farrar
Journal: Eur J Nucl Med Mol Imaging Date: 2022-04-07 Impact factor: 10.057

8. Radiological assessment of dementia: the Italian inter-society consensus for a practical and clinically oriented guide to image acquisition, evaluation, and reporting.

Authors: Francesca B Pizzini; Enrico Conti; Angelo Bianchetti; Alessandra Splendiani; Domenico Fusco; Ferdinando Caranci; Alessandro Bozzao; Francesco Landi; Nicoletta Gandolfo; Lisa Farina; Vittorio Miele; Marco Trabucchi; Giovanni B Frisoni; Stefano Bastianello
Journal: Radiol Med Date: 2022-09-07 Impact factor: 6.313

Review 9. Technical and clinical validation of commercial automated volumetric MRI tools for dementia diagnosis-a systematic review.

Authors: Hugh G Pemberton; Lara A M Zaki; Olivia Goodkin; Ravi K Das; Rebecca M E Steketee; Frederik Barkhof; Meike W Vernooij
Journal: Neuroradiology Date: 2021-09-03 Impact factor: 2.804
