
A deep learning toolbox for automatic segmentation of subcortical limbic structures from MRI images.

Douglas N Greve¹, Benjamin Billot², Devani Cordero³, Andrew Hoopes³, Malte Hoffmann⁴, Adrian V Dalca⁵, Bruce Fischl⁵, Juan Eugenio Iglesias⁶, Jean C Augustinack⁴.

Abstract

A tool was developed to automatically segment several subcortical limbic structures (nucleus accumbens, basal forebrain, septal nuclei, hypothalamus without mammillary bodies, the mammillary bodies, and fornix) using only a T1-weighted MRI as input. This tool fills an unmet need as there are few, if any, publicly available tools to segment these clinically relevant structures. A U-Net with spatial, intensity, contrast, and noise augmentation was trained using 39 manually labeled MRI data sets. In general, the Dice scores, true positive rates, false discovery rates, and manual-automatic volume correlation were very good relative to comparable tools for other structures. A diverse data set of 698 subjects was segmented using the tool; evaluation of the resulting labelings showed that the tool failed in less than 1% of cases. Test-retest reliability of the tool was excellent. The automatically segmented volume of all structures except mammillary bodies showed effectiveness at detecting either clinical AD effects, age effects, or both. This tool will be publicly released with FreeSurfer (surfer.nmr.mgh.harvard.edu/fswiki/ScLimbic). Together with the other cortical and subcortical limbic segmentations, this tool will allow FreeSurfer to provide a comprehensive view of the limbic system in an automated way.

Year: 2021 PMID: 34571161 PMCID: PMC8643077 DOI: 10.1016/j.neuroimage.2021.118610

Source DB: PubMed Journal: Neuroimage ISSN: 1053-8119 Impact factor: 6.556

Introduction

The limbic system is a set of brain structures that govern the interplay between subcortical regions and association cortices. The limbic system was originally defined by MacLean (MacLean, 1949), but its composition has evolved and been debated (Kotter and Stephan, 1997; LeDoux, 2012). In cortex, the limbic lobe includes the olfactory cortex (paleocortex), hippocampus (allocortex), caudal orbitofrontal, medial frontal, temporopolar, anteroventral insular, cingulate, retrosplenial, and parahippocampal gyri. Subcortically, limbic structures include, but are not limited to, hypothalamus (including the mammillary bodies), amygdala, the extended amygdala, nucleus accumbens, ventral pallidum, association thalamic nuclei, basal forebrain, septal nuclei, cerebellum, fornix, and the reticular formation of the brainstem. The limbic system supports a wide variety of functions and behaviors, including autonomic regulation (heart rate, blood pressure, hunger, thirst, sexual arousal, circadian rhythm), cognitive/attentional/emotional processing, spatial memory, long-term memory, fear, emotional memory, anxiety, aggression, reward, and addiction (Heimer and Van Hoesen, 2006; L. Heimer et al., 2008; Mesulam, 1985). Thus, understanding the role of the limbic system in health and disease is clinically relevant and significant. Brain imaging (e.g., MRI and PET) can be used to enhance this understanding. However, scientists and clinicians who use neuroimaging often do not have the anatomical expertise to properly locate these anatomical structures. Further, it is a tedious, error-prone, and time-consuming task to manually label structures in a whole brain image, especially in a large data set with many subjects. Accordingly, imaging scientists have developed methods that will automatically label structures of interest (Despotovic et al., 2015).
Typically, this starts with an expert manually labeling a set of images; a tool is then trained using the expert labels as input; this tool is then applied to a novel image to automatically predict how an expert would have labeled the image. Performance of such methods varies depending upon the structure, method, and quality of the training and test images. Many tools to automatically label the brain have already been developed using parametric methods, machine learning techniques, or by simply deforming label atlases to an individual via a nonlinear registration. Cortically, Fischl et al., 2004, developed a method to automatically segment cortical regions, including two limbic areas (cingulate and parahippocampal gyri), using a surface-based Bayesian method. With respect to the subcortical limbic system, several groups have created tools that include hippocampus, amygdala, thalamus, and nucleus accumbens (Billot et al., 2020b; Fischl et al., 2002; Henschel et al., 2020; Iglesias et al., 2015; Jog et al., 2019; Patenaude et al., 2011; Puonti et al., 2016; Saygin et al., 2017; Wenzel et al., 2018). For hypothalamus, Rodrigues et al., 2020 used a U-Net (Ronneberger et al., 2015) to segment whole hypothalamus, and Billot et al., 2020a implemented a U-Net to segment the hypothalamic subunits. Even less has been done for fornix, septal nuclei, and basal forebrain. Butler et al., 2014 and Butler et al., 2012, created a (fixed, i.e., non-probabilistic) septal nuclei label in MNI space which is then simply mapped to an individual’s brain image after non-linear registration. Teipel et al., 2014 and Cavedo et al., 2017 used a similar method to segment basal forebrain. Jin et al., 2015 developed a segmentation tool for fornix, but it requires a diffusion MRI volume. In this manuscript, we develop, test, and validate an easy-to-use tool to automatically segment several subcortical limbic structures from T1-weighted anatomical MRIs.
These structures include hypothalamus[1] (HTh), mammillary bodies (MB), basal forebrain (BF), septal nuclei (SepN), fornix (Fx), and nucleus accumbens (NA) (see Fig. 1). Despite the clinical significance of the limbic system, several of these structures (MB, BF, SepN, and Fx) have few, if any, publicly available automatic segmentation tools. Unique aspects of this tool include: MB segmentation (to our knowledge, there are no other tools to segment MB), Fx segmentation from T1-weighted images, probabilistic segmentation of BF and SepN, the combination of limbic regions, ease of use, a self-contained design (with easy integration into FreeSurfer), and extensive testing. We show the clinical utility of the tool by applying it to independent aging and Alzheimer’s disease (AD) data sets. The robustness of the tool was tested on a diverse set of 698 independent images from various scanners. This tool is publicly available with FreeSurfer (surfer.nmr.mgh.harvard.edu/fswiki/ScLimbic); these segmentations can be combined with other segmentations in FreeSurfer to provide a more complete representation of the limbic system using automated methods.

Fig. 1.

Example manual segmentations of the labels used in this study. The hypothalamus label excludes mammillary bodies, which were included as a separate label. The anterior commissure (AC) was labeled only to provide a reference for manually labeling the other structures. The upper images are sagittal slices; the bottom images are coronal slices.

Methods

Data sets

Several data sets were used for manual labeling and network training and validation as well as for testing robustness and clinical validation. In all cases, images were resampled into a 256³, 1 mm³ volume, and the intensities rescaled into an 8-bit range. These operations, known as “conforming”, are the first step in the FreeSurfer pipeline. Inputs to the tool must be 1 mm³ but do not need to be 256³ or 8-bit (the tool can reslice to 1 mm³ if needed).
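As a rough illustration of the intensity half of conforming, the sketch below linearly rescales a list of intensities into the 8-bit range [0, 255]. It is a simplified min-max stand-in, not FreeSurfer's implementation, and it omits the resampling onto the 256³ grid; the function name `rescale_to_uint8` and the sample values are hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
def rescale_to_uint8(values):
    """Linearly map a flat list of intensities onto integers in [0, 255].

    Simplified assumption: plain min-max scaling, for illustration only.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant image: nothing to scale, so map every voxel to 0.
        return [0 for _ in values]
    scale = 255.0 / (hi - lo)
    return [int(round((v - lo) * scale)) for v in values]


# Made-up raw intensities standing in for a flattened volume.
voxels = [12.0, 480.5, 1023.0, 250.0]
conformed = rescale_to_uint8(voxels)
```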

FreeSurfer maintenance (FSM)

MRI images were acquired on 29 subjects (M = 15, F = 14), mean age 44.8 years (+/− 18.5, min = 19 max = 76). This study was approved by the Massachusetts General Hospital Internal Review Board for the protection of human subjects; all subjects gave written informed consent. Scanning was performed on a Siemens 3T Prisma with a 32-channel head coil. Two acquisitions were used for this study: a multiecho MPRAGE sequence (van der Kouwe et al., 2008) and a single echo MP2RAGE sequence (Marques et al., 2010). MPRAGE parameters were 1 mm isotropic voxel size, 256 × 256 × 176, inversion time 1250 ms, TR 2530 ms, readout flip angle 7°, time between readout pulses 9.8 ms, GRAPPA acceleration factor 2, bandwidth 650 Hz/Px, four echoes (1.69, 3.55, 5.41, and 7.27 ms). The four echoes were combined by computing the root-mean-square (RMS) of the four images yielding a single T1-weighted (T1w) volume. MP2RAGE parameters were 1 mm isotropic, 256 × 256 × 176, 1st inversion time 700 ms, 2nd inversion time 2500 ms, TR 5000 ms, readout flip angle for 1st inversion 4°, readout flip angle for 2nd inversion 5°, time between readout pulses 7.1 ms, TE 2.98 ms, GRAPPA acceleration factor 3, bandwidth 240 Hz/Px. The MP2RAGE sequence automatically produces a quantitative T1 (qT1) map.
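The root-mean-square combination of the four MPRAGE echoes can be sketched as follows. This is a minimal illustration on flattened toy data, not the acquisition or reconstruction code used in the study; `rms_combine` is a hypothetical name.

```python
import math


def rms_combine(echoes):
    """Voxel-wise root-mean-square across echoes.

    `echoes` is a list of equally long flat intensity lists, one per echo.
    """
    n_echoes = len(echoes)
    return [
        math.sqrt(sum(v * v for v in voxel) / n_echoes)
        for voxel in zip(*echoes)  # iterate over voxels, gathering all echoes
    ]


# Four toy "echo images" with two voxels each.
echoes = [
    [3.0, 10.0],
    [4.0, 10.0],
    [0.0, 10.0],
    [0.0, 10.0],
]
t1w = rms_combine(echoes)
```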

Alzheimer’s disease neuroimaging initiative (ADNI, Weiner et al., 2010)

T1w images from 110 ADNI subjects were used. Ten subjects (5 M/5F, mean age 77y) were manually labeled; all these subjects had an AD diagnosis. The remaining 100 subjects were used to evaluate the effect of AD on the volume of the limbic structures (50 healthy controls (HC), 22 M/28F, age mean/std/min/max 75.0/4.8/62/90y; 50 diagnosed with AD 28 M/22F, age mean/std/min/max 74.3/7.2/56/88y); we refer to this set as the ADNI100.

Harvard aging brain study (HABS, Mormino et al., 2014)

Ninety-nine subjects were drawn from HABS, which had approval from the Massachusetts General Hospital Internal Review Board; all subjects gave written informed consent. Subjects were healthy and aged from 66 to 87 years (mean 73.9y, s.d. 5.8y), 44 males and 55 females. MPRAGEs were acquired on a Siemens 3T Trio. Greve et al., 2016 describes additional scanning parameter details of this cohort.

Minimal interval resonance imaging in Alzheimer’s disease (MIRIAD, Malone et al., 2013)

In this data set, we analyzed 40 subjects (20 AD, 20 HC; 18 F, 22 M; mean age 68y; GE 1.5T Signa scanner; partially defaced), each with two time points 14 days apart, to evaluate test-retest reliability. This data is publicly available from miriad.drc.ion.ucl.ac.uk.

Thousand Functional connectomes (FC1k)

We analyzed 499 cases from the FC1k database (fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html), a public collection of anonymized MRI data. While best known for fMRI, the 1000 Functional Connectomes collection also includes T1-weighted anatomical MRI data; the 499 subjects came from three sites: Beijing (198 subjects), Cambridge (198 subjects), and Oulu (103 subjects), all with 3T scanners. The subjects ranged in age from 18 to 30y; all images were defaced. The voxel sizes were Beijing: 1.3 × 1 × 1 mm, Cambridge: 1.2 × 1.2 × 1.2 mm, Oulu: 0.94 × 0.94 × 1 mm. Based on visual inspection, the Oulu data set was very noisy.

Manual labeling

The left and right sides of six structures (HTh, MB, BF, Fx, SepN, and NA; see Fig. 1) were manually labeled for a total of 12 distinct labels on the 29 FSM subjects and the 10 ADNI subjects. The manual labeling was overseen by an experienced neuroanatomist (JCA). Parts of the anterior commissure (AC) were labeled, but only to provide a reference for manually labeling the other structures. For the FSM, qT1 images were used for manual labeling as they provided the best contrast for the boundaries of interest; T1w images were used for the ADNI subjects. A description of the anatomical labeling protocol is given in Appendix A.

U-Net architecture and training

We used the network described in Billot et al., 2020a. This network is a simple 3D variant of the popular U-Net architecture (Ronneberger et al., 2015). The training software (Neuron (Dalca et al., 2018) and Lab2Im (Billot et al., 2020a) Python packages) is publicly available at https://github.com/BBillot/hypothalamus_seg. Billot et al., 2020a extensively tuned the hyperparameters and showed that this network out-performed state-of-the-art multi-atlas segmentation (Artaechevarria et al., 2009). For this tool, the network architecture, augmentation, and training were identical to those of Billot et al., 2020a with the exception that we include intensity noise augmentation (i.e., the adding of white Gaussian noise to the image during training). Briefly, the network has three resolution layers. Convolutions are performed with a 3 × 3 × 3 kernel. The first convolution has 24 output feature maps followed by a batch normalization and a max pooling step; the number of features is doubled after each max pooling and halved after each up-convolution. All layers, except the last, use an Exponential Linear Unit (ELU) activation function. The last layer has a softmax activation function. The input is always a T1w MRI. Augmentation consisted of spatial transformations (left-right flipping, affine and nonlinear transforms) and intensity transforms (multiplication by a bias field, noise augmentation, rescaling with min-max normalization, and contrast augmentation with nonlinear gamma (power law) distortion). The network was trained by optimizing the “soft” Dice score between the manual labels and the predicted labels. The first 50 epochs were trained without noise augmentation followed by 50 epochs with noise augmentation. Each epoch consisted of 1000 batches with a batch size of 1. The network easily reached convergence in this time.
A batch size of 1 was used due to GPU memory limitations; as pointed out by Billot et al., 2020a, this low batch size is compensated for by using a large number of voxels (160³) to compute the loss function and gradient. The network was trained with the ADAM optimizer (Kingma, 2015). In Experiment 1, the network was trained on a subset of the 39 subjects for cross validation purposes. In the rest of the experiments, the network was trained on all 39 subjects.
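To make the “soft” Dice objective concrete, the sketch below computes a soft Dice score from per-label softmax probabilities and one-hot manual labels, and turns it into a loss. The exact formulation used in training is not given here, so the variant shown (sums of probabilities in the denominator, averaged over labels, with a small `eps` for numerical stability) is an assumption, as are the function names. Plain Python is used for brevity; a real training loop would express the same arithmetic on tensors.

```python
def soft_dice(probs, onehot, eps=1e-6):
    """Mean soft Dice over labels.

    probs  : per-label lists of predicted probabilities (softmax outputs)
    onehot : per-label lists of 0/1 manual-label indicators
    eps    : small constant to avoid division by zero (assumed value)
    """
    scores = []
    for p, g in zip(probs, onehot):
        overlap = sum(pi * gi for pi, gi in zip(p, g))
        total = sum(p) + sum(g)
        scores.append((2.0 * overlap + eps) / (total + eps))
    return sum(scores) / len(scores)


def soft_dice_loss(probs, onehot):
    """Loss to minimize: 1 minus the mean soft Dice."""
    return 1.0 - soft_dice(probs, onehot)


# Two labels, three voxels (toy values).
probs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.9]]
truth = [[1, 0, 0], [0, 1, 1]]
loss = soft_dice_loss(probs, truth)
perfect = soft_dice_loss(truth, truth)
```

Unlike the hard Dice used for evaluation, this quantity is differentiable with respect to the probabilities, which is what makes it usable as a training objective.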

Experiment 1: Cross-validation

The 39 manually labeled subjects were divided into a training group (N = 21, 10 female, 6 AD, 57y mean age) and a testing/validation group (N = 18, 11 female, 4 AD, 52y mean age). The network was trained on the training group and then applied to the (independent) test group. The manual and automatic labels were then compared in terms of Dice, correlation coefficient, true positive rate (TPR), and false discovery rate (FDR); a paired t-test was used to determine whether the manual and automatic volumes systematically differed.
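The overlap metrics named above can be sketched for a single label as follows, treating each segmentation as a set of voxel indices. This is an illustrative helper (the name `overlap_metrics` is hypothetical), not the evaluation code used in the study.

```python
def overlap_metrics(manual, auto):
    """Return (Dice, TPR, FDR) for one label.

    manual, auto : sets of voxel indices in the manual and automatic labels.
    Both sets are assumed non-empty; otherwise the ratios are undefined.
    """
    tp = len(manual & auto)   # voxels in both labels
    fn = len(manual - auto)   # manual only
    fp = len(auto - manual)   # automatic only
    dice = 2 * tp / (2 * tp + fp + fn)
    tpr = tp / (tp + fn)      # fraction of the manual label recovered
    fdr = fp / (tp + fp)      # fraction of the automatic label that is wrong
    return dice, tpr, fdr


# Toy example: 10-voxel labels shifted by two voxels.
manual = set(range(0, 10))
auto = set(range(2, 12))
dice, tpr, fdr = overlap_metrics(manual, auto)
```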

Experiment 2: Robustness

The performance of machine learning tools is highly dependent on the training set and augmentation; if the tool sees an input that is somewhat different from the augmented training set, it may underlabel or fail to label at all. To evaluate the robustness of this tool, we applied it to 698 data samples from five data sets (ADNI100, HABS, Beijing, Cambridge, and Oulu) that were in neither the training nor the test sets and represent a variety of scanners and populations. To avoid visually inspecting each case, we used reverse classification accuracy (RCA, Robinson et al., 2019; Valindria et al., 2017) to flag individual data sets for visual scrutiny. In RCA, the test image was nonlinearly registered using ANTs (Avants et al., 2011) to each of the 39 manually labeled subjects; the manual labels were then mapped into the test image space, where Dice scores were computed for each subject and label. For a given label, the maximum Dice score across the 39 was used as the quality metric, where 0 was bad and 1 was perfect. The idea here is that if one of the manually labeled subjects is anatomically close to the test subject, then this procedure will produce a reasonably good overlap between the segmentations. For the purposes of non-linear registration, the images were skull stripped and bias field corrected using FreeSurfer (Fischl, 2012); the segmentation was always applied to the raw data. Images that had a quality score of less than 0.5 on any label were flagged for manual inspection by two of the authors (DNG and DC); all labels were evaluated for a case regardless of which label was flagged. The criteria for passing were whether a given label was in about the right place with about the right shape and did not appear to be under- or over-labeled by more than 25%. The quality control reviewers were able to do this in about 2 min per case.
These criteria were intentionally vague to avoid labor equivalent to the manual labeling of the flagged cases, which would have taken months or years. While a low RCA score could be an indication of a poor segmentation, it could also be the result of a poor nonlinear registration (and so not a problem with the tool per se).
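The RCA flagging rule described above can be sketched as follows. The nonlinear registration itself is not shown: the sketch assumes the manual labels have already been mapped into the test image space and are available as sets of voxel indices. The function names and data layout are hypothetical; only the rule (maximum Dice per label across the manually labeled subjects, flag if any label falls below 0.5) comes from the text.

```python
def dice(a, b):
    """Dice overlap between two sets of voxel indices."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def rca_flag(auto_labels, warped_atlases, threshold=0.5):
    """Return (per-label quality, flagged) for one test image.

    auto_labels    : dict label -> voxel set from the automatic segmentation
    warped_atlases : list of dicts label -> voxel set, one per manually
                     labeled subject, already mapped into test-image space
    """
    quality = {
        label: max(dice(auto, atlas.get(label, set())) for atlas in warped_atlases)
        for label, auto in auto_labels.items()
    }
    flagged = any(q < threshold for q in quality.values())
    return quality, flagged


# Toy case with two labels and two warped atlases.
auto = {"MB-L": set(range(0, 10)), "Fx-L": set(range(100, 120))}
atlases = [
    {"MB-L": set(range(2, 12)), "Fx-L": set(range(115, 140))},
    {"MB-L": set(range(5, 15)), "Fx-L": set(range(112, 132))},
]
quality, flagged = rca_flag(auto, atlases)
```

In this toy case the fornix label never reaches a Dice of 0.5 against any atlas, so the whole case would be sent for visual inspection.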

Experiment 3: Test-retest reliability

The two time points from the 40 MIRIAD subjects were used to evaluate test-retest reliability. Each subject/time point was segmented using the current method. The correlation coefficient and intraclass correlation (ICC, using the ICC(3,1) from Shrout and Fleiss, 1979) of segmentation volumes across time and subject were then computed as the test-retest measure. We also computed a paired t-test to determine whether the two time points were systematically different.
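For readers unfamiliar with ICC(3,1), the sketch below computes it from a subjects-by-sessions table using the standard two-way ANOVA mean squares, ICC(3,1) = (BMS − EMS) / (BMS + (k − 1)·EMS), where BMS is the between-subject mean square, EMS the residual mean square, and k the number of sessions. The function name and the example volumes are made up for illustration.

```python
def icc_3_1(table):
    """ICC(3,1) for a list of rows, one row of k measurements per subject."""
    n = len(table)         # subjects
    k = len(table[0])      # sessions (time points)
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-subject mean square
    ems = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Hypothetical volumes (mm^3) for four subjects at two time points.
volumes = [[100, 102], [150, 149], [200, 203], [120, 118]]
icc = icc_3_1(volumes)

# A constant offset between sessions does not lower ICC(3,1),
# because the session effect is removed before computing the residual.
offset_only = icc_3_1([[1, 2], [2, 3], [3, 4]])
```

Because ICC(3,1) is insensitive to a systematic shift between sessions, the paired t-test mentioned above is a useful complement.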

Experiment 4: Alzheimer’s disease and aging effects

To evaluate the effect of AD on the volume of the limbic structures, the network was applied to the ADNI100 data set. The volumes were then compared across diagnosis (AD-vs-HC) using a two-sample t-test. The volumes were corrected for estimated total intracranial volume (eTIV, Buckner et al., 2004) to account for differences in head size. To further assess clinical sensitivity, we performed an analysis quantifying the changes in the volume of these structures with age. While age itself is not a clinical condition, age does impose substantial changes on the brain similar to diseases. The T1w images of the HABS data set were segmented using the present method. The eTIV-corrected volumes of the structures were then regressed against age; the null hypothesis that there was no age effect was evaluated with a t-test.
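The aging analysis can be sketched as an ordinary least-squares regression of eTIV-corrected volume on age. The sketch assumes that "corrected" means a simple ratio to eTIV, expressed in thousandths of a percent of intracranial volume (the unit used in Table 3); the exact correction used in the study may differ. All names and numbers below are hypothetical. The function returns the slope, its t-statistic, and the degrees of freedom; converting the t-statistic to a p-value is left to a statistics package.

```python
import math


def etiv_normalize(volume_mm3, etiv_mm3):
    """Volume in thousandths of a percent of eTIV (assumed ratio correction)."""
    return volume_mm3 / etiv_mm3 * 100.0 * 1000.0


def ols_slope_t(x, y):
    """Slope, t-statistic, and degrees of freedom for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return slope, slope / se, n - 2


# Made-up data: six subjects with a shared eTIV for simplicity.
ages = [66, 70, 74, 78, 82, 86]
vols = [400, 395, 380, 378, 360, 355]
etiv = [1.5e6] * 6

corrected = [etiv_normalize(v, e) for v, e in zip(vols, etiv)]
slope_per_year, t_stat, dof = ols_slope_t(ages, corrected)
slope_per_decade = 10.0 * slope_per_year
```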

Results

Fig. 2 illustrates an example of the automatic segmentation for each of the structures in an individual subject withheld from the training; this subject was in the middle of the range of Dice scores. Green indicates that a voxel was in both the manual and automatic labels; from the standpoint of the automatic segmentation, these are true positives (TPs). Yellow indicates that the voxel was present in the manual label but not in the automatic segmentation (i.e., a false negative, FN); the full manual label consists of the green and yellow voxels. Red indicates that a voxel was in the automatic segmentation but not in the manual label (i.e., a false positive, FP). The full automatic label consists of green and red voxels.

Fig. 2.

Performance of automatic segmentation on a single test subject as compared to the manual segmentation for each of the structures. Green indicates that the voxel was in both the manual and automatic segmentations (a true positive, TP). Yellow means that the voxel was only in the manual (a false negative, FN). Red means the voxel was only in the automatic (a false positive, FP). The mean Dice score for this subject was 0.78, the middle of the range for the test subjects. (A) NA, (B) BF, (C) SepN, (D) HTh, (E) MB, (F) left Fx.

Experiment 1: Cross-validation results

Table 1 shows the cross-validation performance of the automatic segmentation. None of the automatic segmentation volumes were significantly different from those of the manual segmentations (paired t-test), indicating no systematic bias in the volume measurement. Cross-subject volume variation was also comparable. The Dice scores range from 0.69 to 0.82. Aside from MB, the correlation coefficient (CC) between the manual and automatic volumes is in the moderate to high range of 0.62 to 0.88; the MB has a relatively low CC (0.37 to 0.50).

Table 1

Cross-validation performance of the automatic segmentation. Manual Vol is the mean volume of the manual segmentation in mm³; Auto Vol is the mean volume of the automatic segmentation in mm³. CC is the Pearson correlation coefficient between Manual Vol and Auto Vol; TPR: mean true positive rate; FDR: mean false discovery rate. Numbers in parentheses indicate standard deviations. NA: nucleus accumbens, BF: basal forebrain, SepN: septal nuclei, HTh: hypothalamus without mammillary bodies, MB: mammillary bodies, Fx: fornix, L: left, R: right. The table reflects only data from the 18 independent test subjects.

Structure	Manual Vol	Auto Vol	Dice	CC	TPR	FDR
NA-L	374.9 (110.9)	404.1 (129.5)	0.82 (0.045)	0.88	0.85 (0.058)	0.20 (0.090)
NA-R	380.1 (119.8)	422.6 (130.5)	0.78 (0.084)	0.72	0.83 (0.072)	0.24 (0.140)
BF-L	328.7 (68.8)	304.2 (48.9)	0.78 (0.051)	0.63	0.76 (0.095)	0.19 (0.066)
BF-R	322.6 (70.3)	318.6 (54.2)	0.75 (0.087)	0.70	0.76 (0.114)	0.24 (0.095)
SepN-L	117.5 (30.4)	108.9 (17.5)	0.69 (0.079)	0.62	0.68 (0.093)	0.28 (0.110)
SepN-R	114.9 (31.9)	101.1 (18.7)	0.72 (0.074)	0.69	0.69 (0.077)	0.23 (0.130)
HTh-L	439.3 (88.3)	473.4 (68.6)	0.81 (0.035)	0.74	0.85 (0.051)	0.21 (0.076)
HTh-R	438.6 (91.6)	471.6 (65.4)	0.82 (0.034)	0.78	0.86 (0.057)	0.21 (0.063)
MB-L	51.6 (9.2)	50.4 (10.3)	0.78 (0.070)	0.50	0.77 (0.078)	0.19 (0.118)
MB-R	54.1 (9.9)	51.4 (7.6)	0.80 (0.061)	0.37	0.79 (0.098)	0.18 (0.094)
Fx-L	551.9 (109.6)	525.4 (88.3)	0.80 (0.043)	0.75	0.78 (0.076)	0.18 (0.054)
Fx-R	544.2 (127.8)	505.6 (88.8)	0.79 (0.040)	0.87	0.77 (0.057)	0.18 (0.064)

The True Positive Rate (TPR, number of true positives detected by the automatic segmentation divided by the number of voxels in the manual segmentation) ranged from 0.68 to 0.86. As illustrated in Fig. 2, this value represents the number of green voxels (true positives) divided by the sum of the green and yellow voxels (number of voxels in the manual label). The False Discovery Rate (FDR, number of false positives divided by the total number of voxels in the automatic segmentation) ranged from 0.18 to 0.28. As displayed in Fig. 2, this value represents the number of red voxels (false positives) divided by the sum of the green and red voxels (total number of voxels in the automatic label).

Experiment 2: Robustness results

In the robustness test, 124/698 cases were flagged by RCA for inspection. The two raters had very similar results, agreeing 94% of the time. Of the 124, the raters agreed that 91 have no issues at all, suggesting that the RCA threshold of 0.5 was quite liberal. Issues with the 33 remaining cases all had to do with underlabeling to some degree. There were 2 cases (0.3%) where a label was simply not present (NA-L in one case and MB-R in the other), both from Oulu. There were 31 other cases (4.3%) where at least one region was underlabeled; 15 of those were fornix. Of the 18 (2.6%) remaining from the 33, the SepN were suspect in 6 subjects because of an anatomical variant (an unclosed cavum septum pellucidum, width about 10 mm); to be clear, it was not evident that SepN segmentation failed because of this, we just do not have enough experience with this variant to know that it succeeded. For the remaining 12 cases (1.7%), various regions were underlabeled. Hypothalamus failed in 2 AD cases because the portion of fornix that goes through hypothalamus was not labeled; in these cases, there was simply no contrast between HTh and Fx. Table 2 shows the underlabeling rate for each label individually averaged across the two raters. Except for fornix, the rates are all less than 1%. Of the regions, fornix incurred the most failures, some on subjects with much atrophy but a portion on young subjects with very small ventricles. One subject failed because of an extreme angle in the head position; after manual rotation, the segmentation passed. Of the 33 problematic cases, 17 came from Oulu.

Table 2

Robustness and test-retest reliability. Underlabeling rate (UR) is the percent of the 698 subjects that had some mislabeling based on visual inspection. CC is Pearson correlation coefficient and ICC is intraclass correlation.

Structure	UR	CC	ICC
NA-L	0.72%	0.94	0.94
NA-R	0.29%	0.97	0.97
BF-L	0.29%	0.96	0.96
BF-R	0.29%	0.94	0.94
SepN-L	0.86%	0.91	0.91
SepN-R	0.86%	0.93	0.92
HTh-L	0.57%	0.94	0.94
HTh-R	0.43%	0.95	0.94
MB-L	0.57%	0.90	0.90
MB-R	0.72%	0.94	0.94
Fx-L	3.01%	0.94	0.94
Fx-R	3.01%	0.94	0.94

Experiment 3: Test-retest reliability results

The test-retest reliability across scans of the 40 MIRIAD subjects is shown in Table 2 using both correlation coefficient (CC) and intraclass correlation (ICC). The values are distributed closely around 0.95. While MB-L is the lowest, it is still high at 0.90; at 0.94, MB-R is similar to that of other labels. The time points were not significantly different when tested with a paired t-test, also suggesting good reliability.

Experiment 4: Effects of Alzheimer’s disease and aging results

The results for the effects of AD and aging are shown in Table 3. The change in volume with AD and age was always negative, indicating a loss of tissue (i.e., atrophy). All structures, except MB, show significance in either AD or age or both.

Table 3

Effect of AD and age on the volume of the given structure. Change and Slope show the change in volume in thousandths of percent of intracranial volume. Slope is per decade. A negative Change value indicates loss of volume in AD relative to HC. A negative Slope indicates a loss in volume with age. The p-values have been corrected for 12 comparisons; those with p < 0.05 are marked with an asterisk. See Table 1 for structure abbreviations.

Structure	AD Change	p	Age Slope	p
NA-L	−2.82	0.051142	−2.93	0.000951 *
NA-R	−2.36	0.117383	−3.15	0.000084 *
BF-L	−2.39	0.000641 *	−1.54	0.003073 *
BF-R	−2.38	0.000138 *	−1.03	0.032481 *
SepN-L	−0.47	0.289839	−0.09	0.999968
SepN-R	−0.58	0.007340 *	−0.05	1.000000
HTh-L	−1.95	0.023953 *	−2.36	0.000181 *
HTh-R	−2.09	0.001621 *	−2.04	0.000771 *
MB-L	−0.20	0.874555	−0.10	0.974076
MB-R	−0.30	0.238399	−0.09	0.998295
Fx-L	−3.13	0.007580 *	−2.87	0.000475 *
Fx-R	−2.84	0.025472 *	−2.81	0.010283 *

Tool usage

The tool and instructions are available from the FreeSurfer wiki at surfer.nmr.mgh.harvard.edu/fswiki/ScLimbic (“ScLimbic” is meant to abbreviate “subcortical limbic”); a diagram of the software workflow is also shown in Fig. 3. In the basic usage, one creates a folder containing the T1w volumes (NIfTI or mgz format) one wants to segment, then runs the Python script mri_sclimbic_seg.

Fig. 3.

Diagram of the tool workflow showing various options and outputs. Green arrows indicate output from other subjects.

The tool will find all the input images, segment them, and write out the segmentation images into the output folder; the segmentations will resemble Fig. 1. It will also create a CSV file where each row is a case, each column is a label, and each entry is the volume of that structure in mm³. On a single-threaded CPU, the program takes about 40 s to run on a single case; with 3 threads (--threads 3), the time drops to about 15 s; using a GPU (--cuda) does not reduce this significantly as much of the time is spent loading and writing. The tool uses about 20 GB of memory. If the input volume is not 1 mm³, then there is an option to reslice to this resolution (--conform); the reslicing is only internal, and the output segmentation is resliced back to the original resolution. Note that changing the resolution may affect the quality. If one is planning to perform a volumetric group study, then one will need to normalize by ICV. If one does not have an estimate of the ICV, then the tool can compute it using the FreeSurfer method (--etiv, Buckner et al., 2004); the ICV will be included as a column in the CSV file. Computing the ICV will increase the processing time to about 5 min for each case. The CSV file can be imported into a statistical program like SPSS or R for further processing or it can be processed using FreeSurfer’s mri_glmfit, which includes automatic application of ICV correction if ICV is in the CSV. The user should visually inspect the segmentation output. To assist in quality control, the tool can output two additional CSV files (--write_qa_stats). One contains a z-score[2] for the volume of each structure based on the means and standard deviations of the manual labels. In the other, the “confidence” (mean posterior probability within the label) is reported. If the z-score is very high or the confidence is very low, then the case should be visually examined.
The tool does not require knowledge of FreeSurfer; as long as FreeSurfer is installed, then the user need only understand and execute mri_sclimbic_seg.
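The z-score check described above amounts to the following arithmetic. This is a sketch of the idea rather than the tool's own code: the reference means and standard deviations are taken from the Manual Vol column of Table 1 purely for illustration (the tool's internal reference values may differ), the cutoff of 3 is an arbitrary assumption, and the sketch flags volumes that are unusually small as well as unusually large.

```python
# Illustrative reference statistics (mean, SD in mm^3) from Table 1.
MANUAL_STATS = {
    "NA-L": (374.9, 110.9),
    "MB-L": (51.6, 9.2),
}


def volume_zscores(volumes, stats=MANUAL_STATS):
    """z-score of each label's volume against the manual-label statistics."""
    return {
        label: (vol - stats[label][0]) / stats[label][1]
        for label, vol in volumes.items()
    }


def needs_review(zscores, z_limit=3.0):
    """Labels whose |z| exceeds the (assumed) review cutoff."""
    return [label for label, z in zscores.items() if abs(z) > z_limit]


# Hypothetical automatic volumes for one case.
case = {"NA-L": 390.0, "MB-L": 12.0}
z = volume_zscores(case)
to_check = needs_review(z)
```

Here the left mammillary body is far smaller than any manual label, so this case would be opened for visual inspection.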

Discussion

The goal of this study was to develop a deep learning segmentation tool for the following limbic structures: hypothalamus, mammillary bodies (part of hypothalamus), basal forebrain, septal nuclei, nucleus accumbens, and fornix. The tool was trained on manually labeled data and evaluated over 700 independent data sets; clinical efficacy was shown on AD and aging data sets.

Segmentation performance

The central goal of automatic segmentation is to replicate how an expert would have labeled a novel image. This capability was judged by comparing the automatic segmentation to the manual segmentation of images not included in the training. Average Dice scores ranged from 0.69 (SepN) to 0.82 (NA). This is well within the range of other studies. For example, Fischl et al., 2002 and Puonti et al., 2016 had Dice scores between 0.70 and 0.90 for much larger structures, which will generally perform better on Dice than smaller structures. For whole hypothalamus, Billot et al., 2020a had a Dice score of 0.84 and Rodrigues et al., 2020 had 0.77; our tool is comparable at 0.81. Billot et al., 2020a also had a Dice score of 0.81 for the posterior hypothalamus, the subunit closest to our definition of MB, which had a comparable Dice score of 0.78. While the Dice score provides a good summary measure of overall accuracy, other metrics provide more meaningful evaluations in terms of how the segmentation will perform when applied for a particular purpose. The Pearson correlation coefficient (CC, Table 1) shows how the volume of the automatic segmentation scales with that of the manual label. Ideally, the volume would accurately reflect the true value; however, this is technically not necessary in studies that compare groups or correlate diagnostic parameters as long as the volume scales with the true value. This ability to scale is measured by the CC. In our study, the CCs were generally in the range of 0.62 to 0.88, except for MB, which was 0.37 and 0.50. The CC for SepN was 0.62–0.69, which exceeds the 0.34–0.66 obtained by Butler et al., 2014. The low CC score for MB indicates that the MB volume might not be a sensitive marker of cross-subject differences. Indeed, while other labels were significant in both the AD and aging studies, MB was not significant in either. 
The Dice for MB was a relatively high 0.78; this shows a shortcoming of the Dice score as a performance metric, which is why we tested additional metrics in this report.

Appropriateness for multimodal integration

We report the True Positive Rate (TPR) and False Discovery Rate (FDR) for each structure (Table 1 and Fig. 2). These measures are pertinent to multimodal integration studies (e.g., fMRI, dMRI, PET). For example, in a task fMRI study, the amplitude of the hemodynamic response might be averaged over the label; in a diffusion study, the label may be used as a seed region for tractography. Errors in the cross-modal analysis may result if the label does not significantly overlap the true structure or if a significant number of voxels from a neighboring structure were included. TPR measures the overlap with the true structure (sensitivity), and FDR measures the contamination from neighboring structures (specificity). In this study, the TPRs were in the range of 0.68–0.86, meaning that a large fraction of the true structure will fall within the automatically segmented label. The FDRs were in the range of 0.18–0.28, meaning that a relatively small fraction of automatically segmented voxels fall outside of the true structure. Our findings indicate that all these structures, including MB, are appropriate for cross-modal applications. We have not been able to find other automatic segmentation studies that report TPR or FDR, so there is no reference for comparison.

Robustness

We evaluated the robustness of the segmentation on 698 cases. Reverse classification accuracy (RCA, Valindria et al., 2017) liberally flagged 124 cases for visual inspection. “Failures,” as indicated by noticeable underlabeling, were found in only 33 cases. For individual labels, the failure rate was quite low (Table 2), less than 1% for all structures, except for fornix, which was 3%. While this performance is quite good, it is important to evaluate where and how the underlabeling occurs. We observed several failure modes. Six cases had large (> 10 mm) unclosed cavum septum pellucidum (CSP). A CSP is a space between the left and right septa in the lateral ventricles, very close to the septal nuclei and fornix (Born et al., 2004). The septa usually fuse shortly after birth, but closure does not occur in roughly 1–5% of the population (Chen et al., 2014). This structural irregularity can cause errors in the SepN and Fx segmentations since these structures are closely bound to the septa. With 14 of the 32 failures, Fx was the most error prone of all the limbic labels; Fx had several failure modes. The first was the CSP cases mentioned above. The second was advanced atrophy in some cases (i.e., AD). The Fx is a white matter strand that connects HTh and hippocampus. In healthy subjects, it is clearly visible but still only a few millimeters in diameter. With aging and disease, it becomes thinner and darker, and the crus of the fornix becomes barely visible as it passes through the atrium of the lateral ventricle. This can cause the automatic segmentation to be hit-or-miss in this region. We emphasize that this was observed in only a handful of cases; the vast majority of atrophic cases had good Fx segmentations. The third Fx failure mode was in young subjects with very small ventricles in MRI with poor contrast. In such cases, the Fx tail was in near or direct contact with the corpus callosum and became indistinguishable; all of these cases were in the Oulu data set. 
Finally, in two ADNI cases, the body of the Fx, which is completely surrounded by the HTh, was not segmented because there was no visible gray/white contrast; presumably, this is just part of the disease process, but we counted it as an error for both Fx and HTh. We emphasize that the underlabeling described above occurred in a very small fraction of cases. This robustness test probably represents a worst-case scenario as the data sets (deliberately) included low-quality data (Oulu) or data collected many years ago (ADNI). On high-quality data such as the FSM, HABS, Beijing, and Cambridge data sets, virtually no errors occurred.

Test-Retest reliability

The test-retest performance of the tool was evaluated using 40 (20 AD and 20 HC) subjects scanned two weeks apart. This duration is probably too short for much true anatomical change to have occurred, so any differences are attributed to either scanning or inaccuracies of the tool. The CCs and ICCs were around 0.94, which is excellent. While the MB manual-automatic volume correlations were poor, the CC and ICC for MBs were very high (0.90 and 0.94) in test-retest. This indicates that all the structures, including MB, can be sensitive to longitudinal changes.
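The difference between a correlation and an intraclass correlation in a test-retest setting can be sketched as follows. The form of ICC used in this study is not stated in this section; the sketch assumes ICC(2,1) of Shrout and Fleiss (two-way random effects, absolute agreement, single measurement), and the function name `icc_2_1` and the volumes are hypothetical.

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1) of Shrout and Fleiss for an (n subjects x k sessions) array."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)  # per-subject means
    col_means = data.mean(axis=0)  # per-session means
    # Mean squares from the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = data - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

# Hypothetical volumes (arbitrary units): the retest is the test plus a
# constant offset, e.g. a systematic scanner difference between sessions.
test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
retest = test + 1.0
volumes = np.column_stack([test, retest])

cc = float(np.corrcoef(test, retest)[0, 1])  # insensitive to the offset
icc = icc_2_1(volumes)                       # penalized by the offset
```

Under this assumption the Pearson correlation stays at 1 while the ICC drops below 1, because absolute-agreement ICC also penalizes systematic differences between sessions.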

Clinical significance

The limbic system is especially vulnerable to Alzheimer’s disease (Mesulam, 1996; Braak and Braak, 1997; Braak and Del Tredici, 2012; Hyman et al., 1984; Terry and Katzman, 1983; Hopper and Vogel, 1976). The SepN and HTh are strongly connected to the hippocampus via the Fx, so hippocampal atrophy has substantial downstream effects on Fx, SepN and HTh. The hippocampus is a seminal structure in the staging of Alzheimer’s disease pathology (Braak and Braak, 1991, 1997) and a neuroimaging biomarker benchmark (Braskie and Thompson, 2014; Weiner et al., 2017). Our SepN label includes the medial septal nuclei, while our basal forebrain label includes the vertical limb (Ch2) and the horizontal limb (Ch3) of the diagonal band of Broca, and the nucleus basalis of Meynert (Ch4), the latter making up the main portion of acetylcholine input for the cerebral cortex. Acetylcholine has a large neurochemical impact on Alzheimer’s disease pathology (Ballinger et al., 2016; Geula and Mesulam, 1996; Hampel et al., 2019). In line with this thesis, we found that BF, SepN, HTh, and Fx showed atrophy when comparing ADs to age-matched controls (Table 3). These results corroborate other studies such as Teipel et al., 2014 (BF), Butler et al., 2018 (SepN), Billot et al., 2020a (HTh), and Copenhaver et al., 2006 (Fx). NA and MB did not show an effect despite effects being found in other studies (Nie et al., 2017 and Copenhaver et al., 2006, respectively); NA would have been significant in this study without corrections for multiple comparisons. While advanced age is not a clinical condition in and of itself, the aging brain undergoes many changes that are similar to clinical conditions (Salat et al., 2004). We demonstrated significant changes with age in NA, BF, HTh, and Fx, thus providing additional evidence of clinical utility.

An easy-to-use tool

The tool that performed the automatic labeling in this study is freely available via FreeSurfer along with extensive documentation for how to use it on individual and group data (see surfer.nmr.mgh.harvard.edu/fswiki/ScLimbic; for source code see github.com/freesurfer). It is self-contained and easy to use, including computing and applying corrections for ICV, if needed. The segmentation of a single case can be done on a CPU or GPU and finishes in a few minutes. The images, labels, and training code are available, so researchers can retrain the network with their own manually labeled data if desired. While we have emphasized a specific tool built on the U-Net architecture of Billot et al., 2020a, this work shows that the manual labels that we have developed are sufficient for accurately labeling these structures in the human brain, and new algorithms and architectures (e.g., Billot et al., 2020b; Isensee et al., 2021) could be employed to create new tools with even better, more robust performance.

Conclusion

A tool was developed to automatically segment several subcortical limbic structures (nucleus accumbens, basal forebrain, septal nuclei, hypothalamus without mammillary bodies, the mammillary bodies, and fornix) from a T1-weighted MRI. This tool fills an unmet need as there are few, if any, tools to segment these clinically relevant structures. A U-Net with spatial, intensity, contrast, and noise augmentation was trained using 39 manually labeled MRI data sets. In general, the Dice scores, true positive rates, false discovery rates, and manual-automatic volume correlation were very good relative to comparable tools for other structures. A diverse data set of 698 subjects was segmented using the tool; evaluation of the resulting labelings showed that the tool failed in less than 1% of cases. Test-retest reliability of the tool was excellent. The automatically segmented volume of all structures except mammillary bodies showed effectiveness at detecting either clinical AD effects, age effects, or both. This tool will be publicly released with FreeSurfer (surfer.nmr.mgh.harvard.edu). Together with the other cortical and subcortical limbic segmentations, this tool will allow FreeSurfer to provide a comprehensive view of the limbic system in an automated way.

54 in total

1. Psychosomatic disease and the visceral brain; recent developments bearing on the Papez theory of emotion.

Authors: P D MacLEAN
Journal: Psychosom Med Date: 1949 Nov-Dec Impact factor: 4.312

Review 2. Intraclass correlations: uses in assessing rater reliability.

Authors: P E Shrout; J L Fleiss
Journal: Psychol Bull Date: 1979-03 Impact factor: 17.737

3. Prevalence of cavum septum pellucidum and/or cavum Vergae in brain computed tomographies of Taiwanese.

Authors: Jiann-Jy Chen; Chi-Jen Chen; Hsin-Feng Chang; Dem-Lion Chen; Yung-Chu Hsu; Tzu-Pu Chang
Journal: Acta Neurol Taiwan Date: 2014-06

4. Thinning of the cerebral cortex in aging.

Authors: David H Salat; Randy L Buckner; Abraham Z Snyder; Douglas N Greve; Rahul S R Desikan; Evelina Busa; John C Morris; Anders M Dale; Bruce Fischl
Journal: Cereb Cortex Date: 2004-03-28 Impact factor: 5.357

Review 5. Neuropathological stageing of Alzheimer-related changes.

Authors: H Braak; E Braak
Journal: Acta Neuropathol Date: 1991 Impact factor: 17.088

6. The septum pellucidum and its variants. An MRI study.

Authors: Christine M Born; Eva M Meisenzahl; Thomas Frodl; Thomas Pfluger; Maximilian Reiser; H J Möller; Gerda L Leinsinger
Journal: Eur Arch Psychiatry Clin Neurosci Date: 2004-10 Impact factor: 5.270

7. The limbic system in Alzheimer's disease. A neuropathologic investigation.

Authors: M W Hopper; F S Vogel
Journal: Am J Pathol Date: 1976-10 Impact factor: 4.307

8. Comprehensive cellular-resolution atlas of the adult human brain.

Authors: Song-Lin Ding; Joshua J Royall; Susan M Sunkin; Lydia Ng; Benjamin A C Facer; Phil Lesnar; Angie Guillozet-Bongaarts; Bergen McMurray; Aaron Szafer; Tim A Dolbeare; Allison Stevens; Lee Tirrell; Thomas Benner; Shiella Caldejon; Rachel A Dalley; Nick Dee; Christopher Lau; Julie Nyhus; Melissa Reding; Zackery L Riley; David Sandman; Elaine Shen; Andre van der Kouwe; Ani Varjabedian; Michelle Wright; Lilla Zöllei; Chinh Dang; James A Knowles; Christof Koch; John W Phillips; Nenad Sestan; Paul Wohnoutka; H Ronald Zielke; John G Hohmann; Allan R Jones; Amy Bernard; Michael J Hawrylycz; Patrick R Hof; Bruce Fischl; Ed S Lein
Journal: J Comp Neurol Date: 2016-11-01 Impact factor: 3.215

9. FastSurfer - A fast and accurate deep learning based neuroimaging pipeline.

Authors: Leonie Henschel; Sailesh Conjeti; Santiago Estrada; Kersten Diers; Bruce Fischl; Martin Reuter
Journal: Neuroimage Date: 2020-06-08 Impact factor: 6.556

10. Automated segmentation of the hypothalamus and associated subunits in brain MRI.

Authors: Benjamin Billot; Martina Bocchetta; Emily Todd; Adrian V Dalca; Jonathan D Rohrer; Juan Eugenio Iglesias
Journal: Neuroimage Date: 2020-08-25 Impact factor: 6.556
