Multiple sclerosis cortical lesion detection with deep learning at ultra-high-field MRI.

Francesco La Rosa^1,2,3, Erin S Beck^3,4, Josefina Maranzano^5,6, Ramona-Alexandra Todea⁷, Peter van Gelderen⁸, Jacco A de Zwart⁸, Nicholas J Luciano³, Jeff H Duyn⁸, Jean-Philippe Thiran^1,2,9, Cristina Granziera^10,11, Daniel S Reich³, Pascal Sati^3,12, Meritxell Bach Cuadra^2,9.

Abstract

Manually segmenting multiple sclerosis (MS) cortical lesions (CLs) is extremely time consuming, and past studies have shown only moderate inter-rater reliability. To accelerate this task, we developed a deep-learning-based framework (CLAIMS: Cortical Lesion AI-Based Assessment in Multiple Sclerosis) for the automated detection and classification of MS CLs with 7 T MRI. Two 7 T datasets, acquired at different sites, were considered. The first consisted of 60 scans that include 0.5 mm isotropic MP2RAGE acquired four times (MP2RAGE×4), 0.7 mm MP2RAGE, 0.5 mm T2*-weighted GRE, and 0.5 mm T2*-weighted EPI. The second dataset consisted of 20 scans including only 0.75 × 0.75 × 0.9 mm³ MP2RAGE. CLAIMS was first evaluated using sixfold cross-validation with single and multi-contrast 0.5 mm MRI input. Second, the performance of the model was tested on 0.7 mm MP2RAGE images after training with either 0.5 mm MP2RAGE×4, 0.7 mm MP2RAGE, or alternating the two. Third, its generalizability was evaluated on the second external dataset and compared with a state-of-the-art technique based on partial volume estimation and topological constraints (MSLAST). CLAIMS trained only with MP2RAGE×4 achieved results comparable to those of the multi-contrast model, reaching a CL true positive rate of 74% with a false positive rate of 30%. Detection rate was high for leukocortical and subpial lesions (83% and 70%, respectively), whereas it reached 53% for intracortical lesions. The correlation between disability measures and CL count was similar for manual and CLAIMS lesion counts. Applying a domain-scanner adaptation approach and testing CLAIMS on the second dataset, the performance was superior to MSLAST when considering a minimum lesion volume of 6 μL (lesion-wise detection rate of 71% versus 48%). The proposed framework outperforms previous state-of-the-art methods for automated CL detection across scanners and protocols. In the future, CLAIMS may be useful to support clinical decisions at 7 T MRI, especially in the field of diagnosis and differential diagnosis of MS patients.

Keywords: 7 T; cortical lesions; deep learning; detection; multiple sclerosis; ultra-high-field MRI

Year: 2022 PMID: 35297114 PMCID: PMC9539569 DOI: 10.1002/nbm.4730

Source DB: PubMed Journal: NMR Biomed ISSN: 0952-3480 Impact factor: 4.478

Abbreviations: 3D, three dimensional; 9‐HPT, 9‐Hole Peg Test; 25TW, 25‐foot timed walk; CCC, Lin's concordance correlation coefficient; CL, cortical lesion; CLAIMS, Cortical Lesion AI‐Based Assessment in Multiple Sclerosis; CNN, convolutional neural network; DSC, Dice coefficient; EDSS, Expanded Disability Status Scale; GM, gray matter; k, Cohen's kappa coefficient; LFPR, lesion‐wise false positive rate; LTPR, lesion‐wise true positive rate; MP2RAGE, magnetization‐prepared 2 rapid acquisition gradient echoes; MS, multiple sclerosis; SDMT, Symbol Digit Modalities Test; SNR, signal‐to‐noise ratio; T2* EPI, T2*‐weighted segmented echo‐planar imaging; T2* GRE, T2*‐weighted multi‐echo GRE; T2*w, T2* weighted; VD, volume difference; WM, white matter; WML, white matter lesion.

INTRODUCTION

Multiple sclerosis (MS) is an inflammatory demyelinating disease affecting the central nervous system. It is characterized by focal areas of white matter (WM) demyelination. In recent decades, however, histopathological studies have shown that lesions in the cortex are also common. Moreover, increasing use of ultra‐high‐field MRI has led to the observation that cortical lesions (CLs) are extremely frequent in MS patients, persist over time, correlate with disability and progressive disease, and may help to differentiate MS from its clinical mimics. CLs have been classified into three major types with potentially different etiologies: leukocortical (Type 1, located at the interface between WM and gray matter, GM), intracortical (Type 2, involving purely the cortex and not reaching the pial surface), and subpial (including Type 3 lesions, located entirely in the GM and touching the pial surface, and Type 4 lesions, extending from the pial surface, through the cortex, into the WM). In order to maximize MS diagnostic and prognostic accuracy, it is, therefore, crucial to analyze the clinical implication of CLs and their response to current and novel MS treatments.

In contrast to white matter lesions (WMLs), CLs, and particularly intracortical and subpial lesions, are difficult to visualize with conventional sequences, although in recent years the development of advanced sequences has led to improved CL visualization. However, at 3 T, even advanced sequences such as double inversion recovery (DIR), magnetization‐prepared 2 rapid acquisition gradient echoes (MP2RAGE), and phase sensitive inversion recovery (PSIR) are relatively insensitive to CLs. Compared with lower magnetic fields, ultra‐high field MRI allows higher signal‐to‐noise ratio (SNR) and enhanced magnetic susceptibility contrast, both of which can provide important insights into MS pathophysiology.
At 7 T, T2*‐weighted (T2*w) methods and MP2RAGE dramatically improve CL, and especially subpial lesion, visualization. For example, compared with the combined use of 7 T MP2RAGE and T2*w images, 3 T double inversion recovery was 6% sensitive for subpial lesions and 3 T MP2RAGE was 5% sensitive. Thus, 7 T imaging is essential for expanding our understanding of cortical demyelination. With the recent FDA approval of 7 T scanners for clinical use, we can expect an increasing number of scans on these devices in the coming years.

However, even with the most sensitive imaging methods, identification and segmentation of CLs is very time consuming and requires significant experience. Despite the promise of 7 T for the visualization of CLs, studies using 7 T MRI have shown only modest inter‐rater reliability, when considering both CL detection and CL count, assessed with Cohen's kappa coefficient (k) and Lin's concordance correlation coefficient (CCC), respectively. In the work of Nielsen et al., for instance, two experts analyzing 7 T FLASH‐T2* images and a total of 103 CLs had a moderate inter‐rater agreement (k = 0.69). In a similar study, Harrison et al. considered 7 T magnetization‐prepared rapid acquisition gradient echo (MPRAGE) and assessed the intra‐rater and inter‐rater reliability in terms of lesion count by two experts. Strong intra‐rater agreement was found (CCC = 0.96), whereas the inter‐rater correlation was weak (CCC = 0.54). An automatic segmentation method for CL will therefore be essential to support large‐scale studies and consistent evaluation of CLs in multi‐site clinical trials.

Deep learning methods have lately shown outstanding performance in MRI segmentation, classification, and synthesis. Applied to MS, these techniques have achieved state‐of‐the‐art performance for WML segmentation as well as for the automated assessment of other novel imaging biomarkers, such as paramagnetic rim lesions and the central vein sign.
However, while some different approaches have been presented to automatically segment CLs based on 3 T MRI, to the best of our knowledge only two methods have been proposed for detection at 7 T. First, Fartaria et al. proposed MSLAST (Multiple Sclerosis Lesion Analysis at Seven Tesla), an automated method based on partial volume estimation and topological constraints that segments both WMLs and CLs with a single MP2RAGE scan as input. MSLAST was evaluated on a cohort of 25 individuals with MS imaged in two research centers using 0.7 mm isotropic MP2RAGE images and achieved a detection rate of 74% and 58% for WMLs and CLs, respectively, with a false positive rate of 40% when considering a minimum lesion size of 6 μL. Second, we previously proposed a deep‐learning‐based method to detect and classify CLs in 7 T MRI considering three different contrasts: MP2RAGE, T2*‐weighted multi‐echo GRE (T2* GRE), and T2*‐weighted segmented echo‐planar imaging (T2* EPI) (voxel size of 0.5 × 0.5 × 0.5 mm³). On a larger cohort of 60 MS patients, the performance of our method showed promise, achieving a CL detection rate of 67% and a false positive rate of 42%. Moreover, almost 200 CLs (24% of the total false positives) detected by our proposed network and initially classified as false positives were retrospectively judged as lesions by an expert.

The contribution of the current work is threefold. First, building upon our previous method, we propose “CLAIMS” (Cortical Lesion AI‐Based Assessment in Multiple Sclerosis), an improved pipeline for the automated detection and classification of CL with either single or multi‐contrast MRI at 7 T. Second, we assess the relative value of MP2RAGE and T2*w images, as well as the impact of MP2RAGE image quality, for this automated task. Third, an additional dataset from another institution is considered in order to test CLAIMS's generalizability and robustness.
For this purpose, a domain adaptation approach was performed and evaluated as well. Importantly, we evaluated CLAIMS's performance with respect to our experts' MRI‐based annotations, which throughout the manuscript are referred to as ground truth.

MATERIAL AND METHODS

Datasets

Two datasets, acquired at Institutions A and B, were analyzed in this study. At Institution A, MRI acquisitions were performed on 60 individuals with MS (43 relapsing–remitting, 17 progressive, 63% female, 49 ± 11 (mean ± standard deviation) years old, age range 29–77 years) with Expanded Disability Status Scale (EDSS) scores ranging from 0 to 7.5 (median 2.0) and disease duration of 14 ± 11 (range 0–42) years. In addition to EDSS, the clinical assessment included the following disability measures: 9‐Hole Peg Test (9‐HPT), 25‐foot timed walk (25TW), and Symbol Digit Modalities Test (SDMT). Imaging was done on a 7 T whole‐body research system (Siemens Healthcare, Erlangen, Germany) using a 32‐channel head coil. The MRI protocol included (i) three dimensional (3D) MP2RAGE (TR/TI1/TI2/TE = 6000/800/2700/5 ms, voxel size = 0.5 × 0.5 × 0.5 mm³) repeated four times (total acquisition time ~40 min), (ii) 3D‐segmented T2* EPI (TR/TE = 52/23 ms, voxel size = 0.5 × 0.5 × 0.5 mm³) acquired in two partially overlapping volumes for whole brain coverage (total acquisition time ~7 min), and (iii) 2D T2*w multi‐echo GRE (TR/TE1/TE2/TE3/TE4/TE5 = 4095/11/23/34/45/56 ms, voxel size = 0.215 × 0.215 × 1.0 mm³) acquired in three volumes for nearly full supratentorial coverage and averaged across the echo times (total acquisition time ~35 min). Moreover, for 55 out of the 60 patients, an additional single 3D MP2RAGE acquisition was performed with the following parameters: TR/TI1/TI2/TE = 5000/700/2500/2.9 ms, voxel size = 0.7 × 0.7 × 0.7 mm³ (acquisition time ~10 min).

At Institution B, 20 patients with early relapsing–remitting MS (RRMS) (75% female, 35 ± 7 (mean ± standard deviation) years old, age range 21–46 years) with EDSS scores ranging from 0 to 4 (median 1.5) and disease duration less than 5 years were imaged. A 7 T research scanner (Siemens Healthcare) was used, and the protocol included 3D MP2RAGE (TR/TI1/TI2/TE = 6000/750/2350/2.92 ms, voxel size = 0.75 × 0.75 × 0.9 mm³) and 3D T2*w multi‐echo GRE (TR/TE1/TE2/TE3/TE4/TE5/TE6/TE7/TE8/TE9 = 45/4.59/9.18/13.77/18.35/22.94/27.53/32.12/36.71/41.3 ms, voxel size = 0.75 × 0.75 × 0.9 mm³).

For both datasets, MP2RAGE images were processed on the respective scanners with the Siemens research sequence package to obtain uniform denoised images and T1 maps. Throughout the manuscript, by MP2RAGE we always refer to its uniform denoised image. The study was approved by the institutional review board of both institutions, and all patients gave written informed consent prior to participation.

Manual segmentation

In the 60 cases from Institution A, the four uniform denoised (T1w) and T1 map MP2RAGE repetitions were co‐registered and median T1w and T1 map images were generated, as described previously. Median images are referred to as MP2RAGE×4 and MP2RAGE T1map×4 hereafter. CLs were visually identified using MP2RAGE×4, MP2RAGE T1map×4, T2* GRE, and T2* EPI images and delineated on MP2RAGE×4 independently by one neurologist (E.B.) and one neuroradiologist (J.M.) (both with several years of experience identifying CLs), who subsequently reached consensus in a joint session. The experts classified the CLs as leukocortical, intracortical, or subpial according to previously described criteria. CLs were hypointense on MP2RAGE images and/or hyperintense on T2*w images and were seen on at least two consecutive axial slices. All lesions were manually segmented after consensus agreement using the image analysis software Display (http://www.bic.mni.mcgill.ca/software/Display/Display.html). In total, 2247 CLs (21.0 median lesions/case, IQR = 54) were segmented, of which 36% were leukocortical, 7% were intracortical, and 57% were subpial. This also includes 192 CLs (37/8/147 leukocortical/intracortical/subpial, respectively) that were added after a retrospective analysis by an expert of the “false positives” generated by the convolutional neural network (CNN) in our previous study. The intraclass correlation coefficient between the two raters was 0.91 (95% CI 0.85–0.94) for total CLs, 0.91 (95% CI 0.85–0.94) for subpial lesions, and 0.91 (95% CI 0.85–0.94) for leukocortical lesions. Figure 1 shows an example of each lesion type. A WML segmentation was obtained with a semi‐automated method.

FIGURE 1

Examples of the three CL types identified in Dataset A shown in the three MRI images considered

In the cases from Institution B, the manual segmentations were performed by consensus between one radiologist (A.T., 6 years of experience) and one neurologist with expertise in MS and neuroimaging (C.G., 13 years of experience). The ITK‐SNAP (http://www.itksnap.org/) tool was used for the annotations. In total, 188 CLs (1.0 median lesions/case, IQR = 10) were identified and subsequently classified into three types: leukocortical (69%), intracortical (26%), and subpial (5%). For the analysis of this dataset, intracortical and subpial lesions were grouped together.

Pre‐processing

The images of each subject were linearly registered to the same space (MP2RAGE×4) using ANTs and subsequently skull‐stripped using FSL‐Brain Extraction Tool. All images were then resampled to 0.5 × 0.5 × 0.5 mm³ using a bilinear interpolation for the MRI contrasts and a nearest neighbor interpolation for the lesion masks. Finally, all non‐zero voxels were normalized to mean 0 and standard deviation 1. Examining the lesion masks, approximately 80 CLs appeared to be outside of the T2* contrasts' FOV and were therefore excluded from the analysis that included these contrasts.
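The intensity normalization step can be illustrated with a minimal NumPy sketch. The function name `zscore_nonzero` and the toy volume are assumptions made for illustration and are not taken from the CLAIMS code; the sketch only shows one plausible way to normalize the non‐zero voxels of a skull‐stripped image to mean 0 and standard deviation 1 while leaving the background at zero.

```python
import numpy as np

def zscore_nonzero(volume: np.ndarray) -> np.ndarray:
    """Normalize non-zero voxels to mean 0 and standard deviation 1."""
    out = volume.astype(np.float32)   # astype returns a copy
    mask = volume != 0                # background left by skull-stripping stays 0
    if not mask.any():
        return out
    vals = out[mask]
    std = vals.std()
    out[mask] = (vals - vals.mean()) / (std if std > 0 else 1.0)
    return out

# Toy 3D volume (hypothetical data): zero background, arbitrary "brain" intensities.
rng = np.random.default_rng(0)
vol = np.zeros((8, 8, 8), dtype=np.float32)
vol[2:6, 2:6, 2:6] = rng.uniform(100.0, 900.0, size=(4, 4, 4))
norm = zscore_nonzero(vol)
```

Restricting the statistics to non‐zero voxels keeps the large zero background from dominating the mean and standard deviation.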

Convolutional neural network

CLAIMS relies on our previously proposed CNN architecture with several targeted modifications that boosted its performance. In particular, it is inspired by the 3D U‐Net, but we used four resolution levels instead of three, each one with an increasing number of features: 16, 32, 64, and 128, respectively. Compared with our previous work, the tissue segmentation output branch was removed, and larger patches of size 96 × 96 × 96 voxels were provided as input to the network (scheme in Figure 2). As output, two separate labels are provided, one representing leukocortical lesions and another one representing intracortical and subpial lesions.

FIGURE 2

Scheme of the proposed multi‐contrast CNN architecture inspired by the 3D U‐Net. The CNN takes as input 3D patches of size 96 × 96 × 96 of the different MRI contrasts (red frame) and provides as output a CL detection and classification into two classes. Input channel dropout is applied when the two T2* contrasts are considered (green frame).

Instead of cross‐entropy loss, used in our previous work, we trained the network with focal loss (γ = 2), as it has recently been shown to outperform cross‐entropy for a large variety of tasks. All WML voxels from the semi‐automatically obtained WML masks were considered with a weight of 0 during training. The Adam optimizer was used, and each model was trained for 400 epochs with an initial learning rate of 0.5 × 10⁻³.
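As a rough illustration of the loss described above, the following NumPy sketch computes a binary focal loss, −(1 − p_t)^γ log(p_t), with an optional per‐voxel weight so that selected voxels (for example, WML voxels) contribute nothing. It is a simplified, assumption‐laden sketch: the actual network predicts two lesion classes and is trained in PyTorch/MONAI, and the function name, the binary formulation, and the toy values are all hypothetical.

```python
import numpy as np

def focal_loss(probs, targets, weights=None, gamma=2.0, eps=1e-7):
    """Weighted mean of the binary focal loss -(1 - p_t)**gamma * log(p_t)."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=np.float64)
    # p_t is the probability assigned to the true class of each voxel.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    if weights is None:
        return float(loss.mean())
    weights = np.asarray(weights, dtype=np.float64)
    return float((loss * weights).sum() / max(weights.sum(), eps))

# Hypothetical voxel-wise predictions and labels.
probs = np.array([0.9, 0.6, 0.2, 0.7])
targets = np.array([1, 1, 0, 0])
weights = np.array([1, 1, 1, 0])  # last voxel stands in for a WML voxel (weight 0)

fl = focal_loss(probs, targets, weights)             # gamma = 2, as in the text
ce = focal_loss(probs, targets, weights, gamma=0.0)  # gamma = 0 reduces to cross-entropy
```

With γ = 2, well‐classified voxels (p_t close to 1) are strongly down‐weighted, so the rare lesion voxels that are still misclassified dominate the gradient.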

Data augmentation

Extensive data augmentation was applied to reduce the risk of overfitting. Input sequence dropout was applied as in our previous study, meaning that at each training iteration with multiple contrasts one of the two T2* images is randomly dropped (i.e. multiplied by zero). Moreover, random rotations of up to 180° in the three planes and random flipping of the axes were applied. Additionally, random intensity shift (with an offset of 0.1) and affine transformations (random rotations of up to 15° and random scaling of up to 10% of the image size) were performed. The training was done on an NVIDIA RTX3090 for 400 epochs and took approximately 44 hours. Testing takes approximately 2 min for each single case using the same machine. The code has been implemented in MONAI running on top of PyTorch and is publicly available on our research website (https://github.com/Medical-Image-Analysis-Laboratory).
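The input sequence dropout can be sketched in a few lines of NumPy. The channel order (0: MP2RAGE, 1: T2* GRE, 2: T2* EPI), the function name, and the patch size are assumptions chosen for illustration; the published implementation may organize this step differently.

```python
import numpy as np

def drop_one_t2star(patch, t2star_channels=(1, 2), rng=None):
    """Zero out one randomly chosen T2* channel of a (channels, x, y, z) patch."""
    rng = np.random.default_rng() if rng is None else rng
    out = patch.copy()
    dropped = int(rng.choice(t2star_channels))
    out[dropped] = 0.0   # "dropping" a contrast means multiplying it by zero
    return out, dropped

rng = np.random.default_rng(42)
# Hypothetical 3-channel patch; the +5 offset keeps toy intensities away from zero.
patch = rng.normal(size=(3, 4, 4, 4)) + 5.0
augmented, dropped = drop_one_t2star(patch, rng=rng)
```

Because the MP2RAGE channel is never dropped, the network always sees the contrast on which the lesions were delineated, while it cannot rely on both T2* images being present at once.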

EXPERIMENTS

Several experiments and ablation studies were performed in order to compare the different MRI contrasts and performance of CLAIMS on separate datasets. First, over the 60 cases from Institution A, a sixfold cross‐validation was done to compare the single and multi‐image inputs (50 cases for training and 10 for testing in each fold). An internal validation was performed in each fold with five randomly drawn subjects. The following image combinations were considered as input: MP2RAGE×4 + T2* GRE + T2* EPI, MP2RAGE×4 + T2* GRE, MP2RAGE×4 + T2* EPI, MP2RAGE×4 alone, T2* GRE alone, and T2* EPI alone. The lesion count of the best performing model (MP2RAGE×4) was compared with the ground truth's lesion count. Moreover, we assessed the Spearman correlation between four disability measures and both automated and manual lesion counts.

Second, an experiment was conducted to evaluate the performance of CLAIMS on MP2RAGE images obtained with a single acquisition and slightly larger voxel size (acquisition time of approximately 10 min), which would be closer to a clinical scenario compared with the 0.5 mm MP2RAGE×4 (with an acquisition time of about 40 min and therefore intended for research purposes). For this purpose, a model was trained with the 0.5 mm MP2RAGE×4, alternatively with the MP2RAGE 0.7 mm isotropic images (see Figure 3 for a visual example of the differences between these two images), and in a third approach by randomly providing at each iteration either a 0.5 mm MP2RAGE×4 or a 0.7 mm single acquisition MP2RAGE to the network. The resulting three models were then all tested on the 0.7 mm single acquisition MP2RAGE images. Importantly, averaging four acquisitions increases the SNR by a factor of 2, whereas going from a voxel size of 0.5 mm to 0.7 mm increases the SNR by a factor of 2.7. Therefore, 0.7 mm single acquisition MP2RAGE has a slightly higher SNR than the 0.5 mm MP2RAGE×4 (see Figure 3).
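The SNR comparison above follows from two standard scaling rules, which the short sketch below makes explicit: SNR grows with the square root of the number of averages and proportionally to the voxel volume. The function name is hypothetical, and the calculation assumes all other acquisition parameters are equal, which is only approximately true for the two protocols compared here.

```python
import math

def relative_snr(n_averages: int, voxel_mm: float, ref_voxel_mm: float = 0.5) -> float:
    """SNR relative to one acquisition at ref_voxel_mm (isotropic voxels assumed).

    Assumes SNR ~ sqrt(number of averages) * voxel volume, all else being equal.
    """
    return math.sqrt(n_averages) * (voxel_mm / ref_voxel_mm) ** 3

snr_x4_05 = relative_snr(n_averages=4, voxel_mm=0.5)      # four averages at 0.5 mm
snr_single_07 = relative_snr(n_averages=1, voxel_mm=0.7)  # one acquisition at 0.7 mm
print(f"MP2RAGEx4 0.5 mm: {snr_x4_05:.2f}, single 0.7 mm: {snr_single_07:.2f}")
```

Under these assumptions the gain is √4 = 2 for the averaged 0.5 mm image and (0.7/0.5)³ ≈ 2.74 for the single 0.7 mm acquisition, matching the factors quoted in the text.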

FIGURE 3

Examples of CLs identified by the experts in the MP2RAGE×4 0.5 mm and T2* GRE, and retrospectively seen in the MP2RAGE 0.7 mm. A subpial lesion is marked by a blue mask, and an intracortical lesion is marked by a yellow mask.

Third, training was performed with all cases from Institution A (using both MP2RAGE 0.7 mm single acquisition and 0.5 mm MP2RAGE×4, but no T2* images), and the resulting model was then tested on 14 cases from Institution B. Further, a domain adaptation of this model with six different cases from Institution B was also done, consisting of re‐training all CNN layers starting with the previous weights. In this case, the fine‐tuning lasted 50 epochs and had an initial learning rate of 0.5 × 10⁻⁴. The results were compared with the previous state‐of‐the‐art method using 14 subjects from the same dataset.

Evaluation and statistical analysis

Common detection metrics such as the lesion‐wise true positive rate (LTPR) and lesion‐wise false positive rate (LFPR) were considered for the evaluation on a lesion‐wise level. Due to the extremely small size of CLs (starting from 1 μL in our datasets), each lesion labeled by the expert is considered detected if it overlaps by at least one voxel with the automatically generated mask. The median detection rate and false positive rate are considered on a patient‐wise level as well. The Dice coefficient (DSC) and volume difference (VD) were computed to quantify the accuracy of lesion delineation and volumetric segmentation for the subjects with at least one CL. The classification accuracy between the two types of CL considered (leukocortical versus intracortical/subpial) was assessed as the percentage of correctly classified CLs. For all cases from Institution A a minimum lesion size of 1 μL was considered for both the automated and manual masks when computing the metrics, whereas for the subjects of Institution B we evaluated different minimum lesion volumes (1, 3, 6, 9, and 15 μL). For both datasets, the segmented connected components that overlap with WMLs are not considered as false positives, as the difference between leukocortical and juxtacortical lesions is subtle. The Wilcoxon signed‐rank test was used to compare the metrics on a patient‐wise level. The Bonferroni correction was used to adjust significance for multiple comparisons. Differences are considered significant at p < 0.05. In order to verify the correlation between CL number and disability measures, the Spearman correlation coefficient (ρ) was computed. Cohen's kappa coefficient was evaluated to verify the agreement between the CLs delineated by the experts and the CLs detected by CLAIMS. Lin's concordance coefficient (CCC) was computed to assess the level of correlation between the ground truth and CLAIMS's lesion count.
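The lesion‐wise criteria above can be sketched with connected‐component labeling. The code below is an illustrative sketch, not the authors' evaluation script: it assumes LFPR is the fraction of predicted components that overlap no ground‐truth lesion (the text does not spell out the denominator), uses the default face connectivity of `scipy.ndimage.label`, omits the exclusion of components overlapping WMLs, and takes 8 voxels as the 1 μL minimum size at 0.5 mm isotropic resolution (1 mm³ / 0.125 mm³). All function names and the toy masks are hypothetical.

```python
import numpy as np
from scipy import ndimage

def remove_small(mask, min_voxels):
    """Drop connected components smaller than min_voxels."""
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask.astype(bool)
    sizes = np.bincount(labels.ravel())  # sizes[0] counts the background
    keep = sizes >= min_voxels
    keep[0] = False
    return keep[labels]

def lesion_metrics(gt, pred, min_voxels=8):
    """Lesion-wise LTPR/LFPR (one-voxel overlap rule) and voxel-wise Dice."""
    gt = remove_small(gt.astype(bool), min_voxels)
    pred = remove_small(pred.astype(bool), min_voxels)
    gt_lab, n_gt = ndimage.label(gt)
    pr_lab, n_pr = ndimage.label(pred)
    # A ground-truth lesion is detected if any of its voxels is predicted.
    detected = np.setdiff1d(np.unique(gt_lab[pred]), [0]).size
    # A predicted component is a true positive if it touches any ground truth.
    matched = np.setdiff1d(np.unique(pr_lab[gt]), [0]).size
    ltpr = detected / n_gt if n_gt else float("nan")
    lfpr = (n_pr - matched) / n_pr if n_pr else 0.0
    denom = gt.sum() + pred.sum()
    dice = 2.0 * (gt & pred).sum() / denom if denom else float("nan")
    return {"ltpr": float(ltpr), "lfpr": float(lfpr), "dice": float(dice)}

# Toy masks (hypothetical): two ground-truth lesions, two predicted components.
gt = np.zeros((16, 16, 16), dtype=bool)
pred = np.zeros_like(gt)
gt[2:4, 2:4, 2:4] = True         # lesion A, 8 voxels
gt[10:12, 10:12, 10:12] = True   # lesion B, 8 voxels (missed)
pred[2:4, 2:4, 3:5] = True       # overlaps lesion A by 4 voxels
pred[6:8, 6:8, 6:8] = True       # false positive
metrics = lesion_metrics(gt, pred)
```

In this toy case one of two lesions is detected and one of two predicted components is a false positive, so LTPR and LFPR are both 0.5, while the voxel‐wise Dice is 0.25.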

RESULTS

Single versus multi‐image models

To assess the contribution of each MRI contrast, we performed the following ablation study. Three networks were trained, each with a single MRI contrast as input (MP2RAGE×4, T2* GRE, and T2* EPI), and then compared with two models receiving a bimodal input (MP2RAGE×4 and either T2* GRE or T2* EPI) and a model that received all three images (Figure 4, Table 1). In terms of lesion detection on a lesion‐wise level, the three image model and the MP2RAGE×4 model achieved similar results, with LTPRs of 82% and 83% for leukocortical lesions, 49% and 53% for intracortical lesions, and 70% and 69% for subpial lesions. Both of them had an LFPR of 30%. They outperformed the bimodal models (LFPRs of 36% and 38%) and considerably outperformed the single input models trained with T2* GRE and T2* EPI (LTPRs of 36% and 32%, and LFPRs of 45% and 55%, respectively; Table 1). Similar performance was observed in terms of lesion delineation on a patient‐wise level, with the 0.5 mm MP2RAGE×4 achieving the best metrics (0.49/0.80 DSC and VD versus 0.18/4.51 for the T2* GRE and 0.16/4.57 for the T2* EPI). Finally, a moderately high CL subtype classification accuracy of over 80% was observed for all six models.

FIGURE 4

TABLE 1

Median metrics and Cohen's kappa coefficient for the different input contrasts obtained with a sixfold cross‐validation over the 60 cases from Institution A. LTPR, LFPR, and classification accuracy are computed on a lesion level, whereas LTPR, LFPR, DSC, and VD are considered on a patient‐wise level. Bold, the best result for each metric

Input	Lesion‐wise			Patient‐wise				k
Input	LTPR	LFPR	Classification accuracy	LTPR	LFPR	DSC	VD	k
0.5 mm MP2RAGE×4, T ₂* GRE, T ₂* EPI	0.74	0.30	0.88	0.73	0.33	0.47	0.87	0.48
0.5 mm MP2RAGE×4, T ₂* GRE	0.73	0.36	0.85	0.72	0.38	0.36	3.01	0.46
0.5 mm MP2RAGE×4, T ₂* EPI	0.71	0.38	0.82	0.70	0.41	0.35	3.22	0.45
0.5 mm MP2RAGE×4	0.74	0.30	0.88	0.73	0.32	0.49	0.80	0.49
T ₂* GRE	0.36	0.45	0.82	0.31	0.50	0.18	4.51	0.22
T ₂* EPI	0.32	0.55	0.81	0.30	0.57	0.16	4.57	0.18

Figure 4. Visual results showing CL detection with the single input 0.5 mm MP2RAGE×4 model. Left: an intracortical lesion manually segmented (green) that was correctly detected by the automated method (red). Right: a similar example for a leukocortical lesion. GT, ground truth.

The patient‐wise analysis in Figure 5 shows the violin plots for the LTPR and LFPR for the single and three input models. The single input 0.5 mm MP2RAGE×4 model and the three image model achieved results with no statistically significant differences (p > 0.05), and both significantly outperformed the single input models trained with T2* GRE and T2* EPI (p < 0.001). The single input 0.5 mm MP2RAGE×4 model reached a Cohen's kappa coefficient of 0.49 when compared with the manually annotated CLs. Regarding the correlation between the manual and automated lesion counts, Lin's concordance coefficient reached 0.91 (see Figure 6), indicating substantial agreement. Moreover, we computed the Spearman coefficient to compare the correlation between four disability measures (EDSS, 9‐HPT, 25TW, SDMT) and both the ground truth and CLAIMS lesion counts. The manual and CLAIMS lesion counts correlated similarly with each of the disability measures, as presented in Table 3. Comparable results were observed considering individual types of CL (Suppl. Table 2). The Bland–Altman plot was computed considering the volumetric differences between manual and automated CL segmentation (see Figure 7). No particular bias was observed for CLAIMS segmentations.
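For readers less familiar with Lin's concordance correlation coefficient, the sketch below shows its usual definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), computed with population (1/n) moments. The lesion counts are invented toy values, not study data, and the function name is hypothetical.

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(2.0 * cov / (x.var() + y.var() + (mx - my) ** 2))

# Hypothetical per-patient lesion counts (illustration only).
manual = np.array([0, 3, 10, 21, 54])
automated = np.array([1, 2, 12, 18, 50])

ccc_close = lin_ccc(manual, automated)     # close agreement, slightly below 1
ccc_shifted = lin_ccc(manual, manual + 10) # perfectly correlated but biased counts
```

Unlike the Pearson or Spearman coefficients, the CCC penalizes a systematic offset: counts that are perfectly correlated but shifted by a constant still score below 1, which is why it is suited to comparing automated and manual lesion counts.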

FIGURE 5

Violin plots of the LTPR and LFPR for different input models evaluated with a sixfold cross‐validation over the 60 subjects of Institution A. Each dot represents a subject

FIGURE 6

Correlation between the manual CL count and the one provided automatically by CLAIMS (best model trained with MP2RAGE×4). The solid lines show the linear regression model between the two measures along with a confidence interval at 95%. The dashed lines indicate the expected lesion count estimates. The CCC between manual and automatic lesion counts is reported in the legend and shows a substantial agreement

TABLE 3

Spearman correlation coefficient ρ (and its relative p‐value) computed between four disability measures and the manual and automated CL count. Both counts show a moderate correlation for all four measures

	CLAIMS CL number		Ground truth CL number
	𝝆	P	𝝆	P
Ground truth CL number	0.86	<0.0001	—	—
EDSS	0.45	0.0003	0.43	0.0008
25TW	0.42	0.0008	0.43	0.0008
9HPT	0.50	<0.0001	0.40	0.0014
SDMT	−0.58	<0.0001	−0.53	<0.0001

FIGURE 7

Bland–Altman plot (reference − prediction) of the manually and automatically segmented CL volumes. The solid green line shows the mean difference, whereas the dotted red lines the ±1.96 SD limits of the mean difference

Table 2 and Figure 8 show the detection rates for the three different CL types on a lesion‐wise level. As observed in our previous work, intracortical lesions remain the most challenging, with a detection rate of 53% in the best scenario (MP2RAGE×4 model). On the other hand, leukocortical and subpial lesions have high detection rates of over 80% and about 70%, respectively, for both the MP2RAGE×4 and the three image model. Similar behavior is observed between the three CL types for all models, with a marked drop, however, in the detection of intracortical lesions for the T2* GRE and T2* EPI models (15% and 12%, respectively) (Table 2).

TABLE 2

Comparison of lesion detection rates on a patient‐wise level for the different models

	MP2RAGE×4		T2* GRE		T2* EPI		3 contrasts	
	Detected, n (%)	Per patient, median (range, IQR)	Detected, n (%)	Per patient, median (range, IQR)	Detected, n (%)	Per patient, median (range, IQR)	Detected, n (%)	Per patient, median (range, IQR)
All	1649 (74%)	14 (0–163, 36)	748 (36%)	6 (0–141, 10)	692 (32%)	5 (0–128, 11)	1648 (74%)	14 (0–151, 31)
Leukocortical	672 (83%)	5 (0–54, 13)	268 (63%)	4 (0–39, 12)	254 (59%)	4 (0–36, 11)	656 (82%)	6 (0–48, 13)
Intracortical	83 (53%)	1 (0–17, 2)	25 (15%)	0 (0–5, 1)	19 (12%)	0 (0–2, 0)	76 (49%)	1 (0–17, 2)
Subpial	894 (69%)	6 (0–96, 8)	455 (48%)	4 (0–92, 6)	421 (41%)	5 (0–81, 4)	916 (70%)	7 (0–98, 9)

NS: non‐significant.

p < 0.05.

p < 0.01.

p < 0.001.

FIGURE 8

Lesion‐wise CL detection rate for the three different CL lesion types considered over the 60 subjects of Institution A

Evaluating CLAIMS on standard MP2RAGE images

In order to evaluate CLAIMS on more commonly acquired single acquisition 0.7 mm MP2RAGE images, three models were trained with the following inputs: (1) MP2RAGE 0.7 mm, (2) MP2RAGE×4 0.5 mm, (3) alternating MP2RAGE×4 0.5 mm and 0.7 mm at each training iteration. All models were then tested on the single acquisition 0.7 mm MP2RAGE. For any given subject, care was taken to ensure that all images were in either the training or testing set and not split across them. Table 4 reports the metrics obtained when evaluating these models on the MP2RAGE 0.7 mm images with a sixfold cross‐validation. All model results were evaluated with respect to the ground truth determined using MP2RAGE×4, T2* GRE, and T2* EPI images. First, compared with the previous results for the 0.5 mm MP2RAGE×4 single input model (see Table 1), we observed that training and testing on single acquisition 0.7 mm MP2RAGE causes a drop in the detection rate from 74% to 52%. Second, among the three models, the one using both 0.5 and 0.7 mm images during training achieves the best metrics, with LTPR of 53% and LFPR of 33%. Finally, the worst‐performing model is the one trained only with MP2RAGE×4, having LTPR of 35% and LFPR of 41%.

TABLE 4

Metrics obtained for models trained with different inputs (listed in the first column) and tested on the MP2RAGE 0.7 mm images. Bold, the best result for each metric

Training images	LTPR (lesion‐wise)	LFPR (lesion‐wise)	Classification accuracy (lesion‐wise)	Dice (patient‐wise)	VD (patient‐wise)
MP2RAGE 0.7 mm	0.52	0.39	0.85	0.25	1.10
MP2RAGE×4 0.5 mm	0.35	0.41	0.81	0.16	1.29
MP2RAGE×4 0.5 and MP2RAGE 0.7 mm	0.53	0.33	0.85	0.29	1.06
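The lesion‐wise columns of Table 4 can be illustrated with a minimal sketch. It assumes, hypothetically, that a ground‐truth lesion counts as detected when at least one predicted voxel overlaps it, and that a predicted lesion counts as a false positive when it overlaps no ground‐truth voxel. The function name `lesion_wise_rates`, the overlap criterion, and the default face connectivity of `scipy.ndimage.label` are illustrative assumptions, not the evaluation code used for CLAIMS.

```python
import numpy as np
from scipy import ndimage


def lesion_wise_rates(pred, truth):
    """Return (LTPR, LFPR) for two binary 3D masks.

    Assumed definitions (illustrative, not taken from the paper):
      LTPR = ground-truth lesions hit by any predicted voxel / all ground-truth lesions
      LFPR = predicted lesions touching no ground-truth voxel / all predicted lesions
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)

    # Connected components (default structure: face connectivity in 3D).
    truth_labels, n_truth = ndimage.label(truth)
    pred_labels, n_pred = ndimage.label(pred)

    # Ids of ground-truth lesions containing at least one predicted voxel.
    detected = np.unique(truth_labels[pred])
    detected = detected[detected > 0]

    # Ids of predicted lesions containing at least one ground-truth voxel.
    matched = np.unique(pred_labels[truth])
    matched = matched[matched > 0]

    ltpr = len(detected) / n_truth if n_truth else float("nan")
    lfpr = (n_pred - len(matched)) / n_pred if n_pred else float("nan")
    return ltpr, lfpr


if __name__ == "__main__":
    truth = np.zeros((10, 10, 10), dtype=bool)
    truth[1:3, 1:3, 1:3] = True   # ground-truth lesion 1
    truth[6:8, 6:8, 6:8] = True   # ground-truth lesion 2 (missed below)

    pred = np.zeros_like(truth)
    pred[2:4, 2:4, 2:4] = True    # overlaps ground-truth lesion 1
    pred[0, 8:10, 0] = True       # isolated false positive

    print(lesion_wise_rates(pred, truth))
```

In this toy volume one of two ground‐truth lesions is detected and one of two predicted lesions is spurious, so both rates are one half. A stricter overlap criterion would change both numbers.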

Independent test set

In our last experiment, an MP2RAGE‐only model was trained with all cases from Institution A (mixing 0.5 mm MP2RAGE×4 and 0.7 mm MP2RAGE, as this was the top‐performing input in the previous experiment) and tested on the 14 subjects of Institution B, which include only 0.75 × 0.75 × 0.9 mm³ MP2RAGE images and were also used to report results with MSLAST in a previous work. Furthermore, we propose a domain‐adapted version of CLAIMS (CLAIMS_DA), obtained by fine‐tuning the same model (training all layers) on six additional cases from Institution B. Figure 9 presents the lesion‐wise true and false positive rates for CLAIMS, CLAIMS_DA, and MSLAST for different minimum lesion volumes, including volumes smaller than those proposed in the MSLAST paper (1, 3, and 9 μL versus 6 and 15 μL). CLAIMS outperforms MSLAST for all volume thresholds, and CLAIMS_DA achieves even better results on a lesion‐wise level. In particular, considering a minimum lesion volume of 6 μL, CLAIMS_DA reaches a detection rate of 71% with a false positive rate of 29% (see Supplementary Material Table 3). The CL classification accuracies for CLAIMS and CLAIMS_DA were 81% and 84%, respectively. On a patient‐wise level, no statistically significant differences were observed between the three models.

FIGURE 9

False positive and detection rate in a pure testing scenario on the Institution B dataset for CLAIMS, CLAIMS domain‐adapted (CLAIMS_DA), and MSLAST. Different minimum lesion volumes are considered. N refers to the number of CLs in the ground truth for each minimum lesion volume.
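To make the minimum lesion volumes concrete: since 1 mm³ equals 1 μL, a 0.75 × 0.75 × 0.9 mm³ voxel holds about 0.506 μL, so a 6 μL threshold corresponds to at least 12 voxels. The sketch below converts a volume threshold into a voxel count and removes smaller connected components from a binary mask. The helper names `min_voxels` and `drop_small_lesions` are hypothetical, and whether the threshold is applied to the ground truth, the prediction, or both is an assumption left to the caller.

```python
import math

import numpy as np
from scipy import ndimage


def min_voxels(min_volume_ul, voxel_size_mm):
    """Smallest voxel count whose volume reaches min_volume_ul (1 mm^3 = 1 uL)."""
    voxel_volume_ul = float(np.prod(voxel_size_mm))
    # Small tolerance so exact multiples are not pushed up by rounding error.
    return math.ceil(min_volume_ul / voxel_volume_ul - 1e-9)


def drop_small_lesions(mask, min_volume_ul, voxel_size_mm):
    """Keep only connected components at least min_volume_ul in volume."""
    mask = np.asarray(mask, dtype=bool)
    labels, n_lesions = ndimage.label(mask)
    if n_lesions == 0:
        return mask.copy()
    sizes = np.bincount(labels.ravel())   # sizes[0] counts background voxels
    keep = sizes >= min_voxels(min_volume_ul, voxel_size_mm)
    keep[0] = False                       # never keep the background
    return keep[labels]


if __name__ == "__main__":
    voxel = (0.75, 0.75, 0.9)             # Institution B voxel size in mm
    for volume in (1, 3, 6, 9, 15):
        print(volume, "uL ->", min_voxels(volume, voxel), "voxels")
```

At this voxel size the smallest thresholds amount to only a handful of voxels, which is one way to see why lowering the minimum volume makes detection harder.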

DISCUSSION AND CONCLUSION

In this work, we have explored the value of different 7 T MRI contrasts for the automated detection of CLs. For this purpose, we trained and tested a novel U‐net‐based deep learning method that we call CLAIMS. We analyzed three MRI contrasts as input (MP2RAGE, T2* GRE, and T2* EPI), different image resolutions, and CLAIMS's generalizability on an external testing dataset. CLAIMS's automated CL count was then compared with the experts' CL count, and its correlation with four disability measures was examined. Furthermore, we compared our results with the only previous automated CL segmentation method at 7 T in the literature. Finally, we reported the results obtained considering different minimum lesion volumes and analyzing the three CL subtypes.

The results of our ablation study with different input contrasts fed to the model show that the MP2RAGE contrast alone is sufficient to achieve the best performance on both a lesion‐wise and a patient‐wise level. In contrast to a past study regarding the manual segmentation of CLs, the addition of two T2*w images did not contribute significantly to improving the automated CL detection. Moreover, the single‐input models trained with the T2*w contrasts performed poorly compared with the MP2RAGE one. This is in line with a previous study, where it was shown that the MP2RAGE increases the visual detection of all CL types compared with T2*w imaging at 7 T. However, it is important to note that our ground truth was created considering all three MRI images, and some lesions might not be visible on the T2* contrasts alone. Moreover, even a lower‐resolution MP2RAGE outperformed the T2*w models, arguably establishing the MP2RAGE as the preferred contrast for CLAIMS.

Focusing on the different CL types, we observed that their detection rate varies considerably. As in our previous study, intracortical lesions were once again the most challenging type, with a detection rate of only 53% for the best model.
On the other hand, both leukocortical and subpial lesions were detected with high sensitivities of over 80% and 70%, respectively. These values approach the inter‐rater agreement of about 85% in a previous study considering two raters. In the prior study, Cohen's kappa coefficient was 0.69, showing substantial agreement, whereas ours, computed between the CLAIMS output and the manual masks, reached 0.49. However, we considered a dataset with about 20 times more CLs (2247 versus 103), and this could explain the higher variability. Similarly, the inter‐rater reliability can also be estimated by computing Lin's concordance correlation coefficient (CCC), which takes into account the correlation between lesion counts. Analyzing lesion counts from 10 sample scans and by two raters, Harrison et al. showed Lin's coefficient to be 0.54 for all CLs, meaning that the agreement was weak. In contrast, our best model achieved a value of 0.91, showing a substantial correlation between the manual and automated CL counts. Moreover, CLAIMS computes a single subject's lesion map in approximately 2 min, whereas manual raters needed anywhere between 30 min and several hours, depending on lesion burden, to perform a similar task. This suggests that CLAIMS could be useful to support and speed up experts' CL assessments.

To further support this claim, we analyzed the correlation between four disability measures and the manual and automated CL counts. Both CL counts correlated with the four measures (Spearman's coefficient between 0.40 and 0.57) in a very similar way. To the best of our knowledge, this is the first time that an automated CL count has been shown to correlate with disability scores.

Relying only on the MP2RAGE, we then evaluated CLAIMS on single‐acquisition MP2RAGE images, which are more feasible to acquire, requiring about 10 min (versus 40 min for MP2RAGE×4), and have a slightly larger voxel size. Specifically, we compared different models trained with either 0.5 mm images, 0.7 mm images, or both.
The 0.5 mm isotropic images were obtained as an average of four acquisitions and therefore had an SNR twice as high as that of a single acquisition at the same resolution, but a lower SNR compared with the 0.7 mm MP2RAGE. Importantly, the model trained with 0.5 mm images and tested on 0.7 mm images (simply interpolated to 0.5 mm) performed very poorly, showing the value of the voxel size for the detection of very small structures. It is important to keep in mind that the ground truth was labeled on the 0.5 mm images, and this might also cause a partial drop in performance. However, when mixing both 0.5 mm and 0.7 mm images during training, the metrics improve considerably, even outperforming the model trained with 0.7 mm images only. This indicates that CLAIMS can extract useful information during training from the images with the smaller voxel size and successfully use it at inference time. Overall, however, we notice that going from 0.5 mm MP2RAGE×4 to single‐acquisition 0.7 mm MP2RAGE causes a detection rate drop of about 20 percentage points, underscoring the importance of image quality and resolution even for automated methods.

Finally, we tested the performance of CLAIMS trained by mixing both 0.5 mm MP2RAGE×4 and 0.7 mm MP2RAGE on a different dataset, where MP2RAGE scans were acquired with a voxel size of 0.75 × 0.75 × 0.9 mm³. This is the same dataset on which the previous state‐of‐the‐art method (MSLAST) was evaluated, and therefore we could carry out a precise comparison including the same subjects. It is important to note that the dataset from Institution B includes subjects in the very early stages of the disease (disease duration < 5 years) who have a much lower lesion burden compared with the subjects of Institution A (median lesion count per case of 1 versus 21).
Moreover, the CL subtype distribution is different as well, with a majority of leukocortical lesions in Dataset A, whereas intracortical lesions are prevalent in Dataset B (see Supplementary Material, Supplementary Figure 1). This could be partially explained by the smaller voxel size and the presence of T2* images in Dataset A, which allow higher visual detection of subpial lesions. Nevertheless, CLAIMS proves to be robust and performs well in this multi‐center scenario, with a lesion‐wise detection rate of about 61% across the different minimum lesion volumes considered (1, 3, 6, 9, and 15 μL). It outperforms MSLAST (for minimum lesion volumes of both 6 μL and 15 μL), and it also classifies CLs into two types (leukocortical and intracortical) with an accuracy of about 80%. When decreasing the minimum lesion volume considered, the detection rate remains stable, with only a marginal increase in false positives. For instance, if a minimum lesion volume of 3 μL is considered, CLAIMS achieves a detection rate of 61% with a false positive rate of 33%. Moreover, when fine‐tuning CLAIMS with six additional subjects belonging to the same dataset, the lesion‐wise performance increases considerably, reaching a detection rate of 71% with a false positive rate of 32%. This confirms the efficacy of domain adaptation for deep learning‐based models applied to a different dataset and indicates our proposed method as the state‐of‐the‐art technique for CL detection with 7 T MRI.

In this study, results were evaluated on both lesion‐wise and patient‐wise levels. During training, our proposed CNN takes as input random 3D patches extracted from different patients. As the CNN does not see the field of view of the entire brain, a lesion‐wise evaluation is crucial to determine its performance at testing time. At the same time, however, from a clinical research perspective, when evaluating the effects of CLs on clinical outcomes, patient‐wise data are the most relevant.
Moreover, the lesion‐wise detection rate might be extremely dependent on cases with a high lesion load while not reflecting the method's performance on subjects with few lesions. For these reasons, both analyses were considered to provide a comprehensive evaluation.

A recent study has investigated the dependence between the training dataset size and the segmentation performance of a CNN. Considering a 2D U‐Net architecture, Narayana et al. concluded that 50 subjects are sufficient for an accurate MS lesion segmentation. In our study, we instead designed a 3D CNN architecture, although with a reduced number of feature maps and resolution levels compared with the original 2D U‐Net architecture. Each training fold included 50 subjects. Thus, we believe that our sample size is adequate for the task, and this is also supported by the promising results obtained on the external testing dataset (Institution B). Moreover, considering subjects from both Institutions A and B, we analyzed the largest cohort of MS patients for automated CL detection in the literature.

Our study also presents some limitations. First, the two datasets considered were acquired with scanners from the same manufacturer and with similar acquisition protocols. Additional differences in the images could arise under more general conditions, potentially causing a drop in performance. In this case, fine‐tuning the pre‐trained model on a few annotated subjects could help regain performance. Second, the detection rate of intracortical lesions remains quite low, even though it improved by more than 10% compared with our previous study. These are infrequently visible and extremely small CLs, and perhaps a larger training dataset with an increased number of intracortical lesions could mitigate this issue.
Of note, there was a difference in the relative prevalence of intracortical versus subpial lesions in the two datasets, likely related to differences in lesion appearance between the two imaging protocols (superficial cortical involvement is often more apparent on T2*w images). By grouping intracortical and subpial lesions for the purposes of the model, we were able to achieve good performance and lesion classification on both datasets; however, differentiating between intracortical and subpial lesions remains a difficult task for both manual rating and automated methods.

Third, in this work, we analyze a single CNN architecture, which proved effective for CL detection in our previous study. It was out of the scope of this study to compare several different models and tweak their parameters to optimize the performance. Rather, we selected the state‐of‐the‐art architecture and analyzed in detail its potential to tackle a clinically relevant problem such as the detection of CLs.

Fourth, although the MRI sequences used here are highly sensitive for CLs compared with 3 T techniques, their true sensitivity to CLs is unknown, and it is likely that some CLs are not well seen using these techniques. Thus, some of the false positive lesions detected by the method presented here may be true lesions, as shown in our previous study.

Future work could include exploring more advanced deep learning architectures in order to better include the information from each single MRI contrast, particularly the T2*w images. Given time constraints in the clinical setting, the use of fewer MP2RAGE acquisitions could be explored as well to determine how much each additional repetition contributes to improved lesion sensitivity. In addition, novel compressed sensing techniques could be exploited to shorten the MP2RAGE acquisition time to a clinically acceptable duration even with multiple repetitions.
Moreover, the use of larger and additional datasets could further demonstrate the generalizability of the proposed method. Finally, the use of the T1 map image type generated from MP2RAGE acquisitions could also be explored for the automated detection of CLs, although in our experience most CLs are similarly seen on both MP2RAGE uniform denoised and T1 map images.

In conclusion, we present CLAIMS, a DL framework for the detection and classification of CLs with 7 T MRI. When CLAIMS is trained only with MP2RAGE, it achieves state‐of‐the‐art performance for all three CL types, and its CL count correlates with disability measures similarly to experts' visual assessment. If fine‐tuned, it adapts extremely well to a different dataset acquired at a different site. As 7 T scanners from several manufacturers are now being approved for clinical use, CLAIMS could eventually be useful to support clinical decisions, particularly in the field of diagnosis and differential diagnosis of MS patients.
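The agreement between manual and automated lesion counts discussed above is summarized with Lin's concordance correlation coefficient. As a minimal sketch of the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (1/N) moments; the function name `lin_ccc` and the toy counts are illustrative assumptions and not data from this study.

```python
import numpy as np


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired samples.

    Uses population (1/N) variances and covariance. Unlike Pearson's r,
    the result is penalized when one rater counts systematically more.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    covariance = np.mean((x - mean_x) * (y - mean_y))
    return 2.0 * covariance / (x.var() + y.var() + (mean_x - mean_y) ** 2)


if __name__ == "__main__":
    manual = [1, 2, 3]        # hypothetical per-subject lesion counts
    automated = [2, 3, 4]     # same ranking, but one extra lesion each
    print(lin_ccc(manual, manual))     # identical counts give perfect concordance
    print(lin_ccc(manual, automated))  # constant offset lowers concordance
```

In the toy example the two count vectors are perfectly linearly related, yet the constant offset of one lesion per subject reduces the CCC to 4/7, which is why CCC is a stricter agreement measure than a plain correlation coefficient.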

FUNDING INFORMATION

This project is supported by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska‐Curie project TRABIT (Agreement 765148), the Novartis Foundation for Medical‐Biological Research (#21A032), the Centre d'Imagerie BioMédicale of the University of Lausanne, the Swiss Federal Institute of Technology Lausanne, the University of Geneva, the Centre Hospitalier Universitaire Vaudois, and the Hôpitaux Universitaires de Genève. Erin S. Beck is supported by a Career Transition Fellowship from the National Multiple Sclerosis Society. Pascal Sati, Erin S. Beck, Nicholas J. Luciano, Daniel S. Reich, Peter van Gelderen, Jacco A. de Zwart, and Jeff Duyn are supported by the Intramural Research Program of the National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.

CONFLICTS OF INTEREST

D.S.R.: research support from Abata, Sanofi‐Genzyme, and Vertex. C.G.: the University Hospital of Basel (USB), as the employer of C.G., has received the following fees which were used exclusively for research support: (i) advisory board and consultancy fees from Actelion, Genzyme‐Sanofi, Novartis, GeNeuro, and Roche; (ii) speaker fees from Genzyme‐Sanofi, Novartis, GeNeuro, and Roche; (iii) research support from Siemens, GeNeuro, and Roche. For the remaining authors, no conflicts of interest were declared.

Table S1. Metadata information of the two datasets considered.

Supp. Table 2. Spearman correlation coefficient ρ (and its relative p‐value) computed between four disability measures and the manual and automated CL count, subdivided per CL type. Both counts show a moderate correlation for all four measures.

Supp. Table 3. Lesion‐wise metrics obtained with a minimum lesion volume of 6 μL for CLAIMS, CLAIMS_DA, and MSLAST. In bold are the best results for each metric. No significant differences (p > 0.05) were observed between the three models on a patient‐wise level for both LTPR and LFPR. MSLAST did not classify CL in multiple types.

Supp. Figure 1. Fraction of leukocortical and intracortical/subpial CL in dataset A and B. N refers to the total number of CL in each dataset.

30 in total

Review 1. Gray matter involvement in multiple sclerosis.

Authors: Istvan Pirko; Claudia F Lucchinetti; Subramaniam Sriram; Rohit Bakshi
Journal: Neurology Date: 2007-02-27 Impact factor: 9.910

2. Association of Cortical Lesion Burden on 7-T Magnetic Resonance Imaging With Cognition and Disability in Multiple Sclerosis.

Authors: Daniel M Harrison; Snehashis Roy; Jiwon Oh; Izlem Izbudak; Dzung Pham; Susan Courtney; Brian Caffo; Craig K Jones; Peter van Zijl; Peter A Calabresi
Journal: JAMA Neurol Date: 2015-09 Impact factor: 18.302

3. Cortical lesions, central vein sign, and paramagnetic rim lesions in multiple sclerosis: Emerging machine learning techniques and future avenues.

Authors: Francesco La Rosa; Maxence Wynen; Omar Al-Louzi; Erin S Beck; Till Huelnhagen; Pietro Maggi; Jean-Philippe Thiran; Tobias Kober; Russell T Shinohara; Pascal Sati; Daniel S Reich; Cristina Granziera; Martina Absinta; Meritxell Bach Cuadra
Journal: Neuroimage Clin Date: 2022-09-24 Impact factor: 4.891

4. Cortical and phase rim lesions on 7 T MRI as markers of multiple sclerosis disease progression.

Authors: Constantina A Treaba; Allegra Conti; Eric C Klawiter; Valeria T Barletta; Elena Herranz; Ambica Mehndiratta; Andrew W Russo; Jacob A Sloane; Revere P Kinkel; Nicola Toschi; Caterina Mainero
Journal: Brain Commun Date: 2021-06-24

5. MP2RAGE, a self bias-field corrected sequence for improved segmentation and T1-mapping at high field.

Authors: José P Marques; Tobias Kober; Gunnar Krueger; Wietske van der Zwaag; Pierre-François Van de Moortele; Rolf Gruetter
Journal: Neuroimage Date: 2009-10-09 Impact factor: 6.556

6. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain.

Authors: B B Avants; C L Epstein; M Grossman; J C Gee
Journal: Med Image Anal Date: 2007-06-23 Impact factor: 8.545

7. Longitudinal Characterization of Cortical Lesion Development and Evolution in Multiple Sclerosis with 7.0-T MRI.

Authors: Constantina A Treaba; Tobias E Granberg; Maria Pia Sormani; Elena Herranz; Russell A Ouellette; Céline Louapre; Jacob A Sloane; Revere P Kinkel; Caterina Mainero
Journal: Radiology Date: 2019-04-09 Impact factor: 29.146

8. Inversion Recovery Susceptibility Weighted Imaging With Enhanced T2 Weighting at 3 T Improves Visualization of Subpial Cortical Multiple Sclerosis Lesions.

Authors: Erin S Beck; Neville Gai; Stefano Filippini; Josefina Maranzano; Govind Nair; Daniel S Reich
Journal: Invest Radiol Date: 2020-11 Impact factor: 10.065

9. Interrater reliability: the kappa statistic.

Authors: Mary L McHugh
Journal: Biochem Med (Zagreb) Date: 2012 Impact factor: 2.313

10. Automated detection of white matter and cortical lesions in early stages of multiple sclerosis.

Authors: Mário João Fartaria; Guillaume Bonnier; Alexis Roche; Tobias Kober; Reto Meuli; David Rotzinger; Richard Frackowiak; Myriam Schluep; Renaud Du Pasquier; Jean-Philippe Thiran; Gunnar Krueger; Meritxell Bach Cuadra; Cristina Granziera
Journal: J Magn Reson Imaging Date: 2015-11-25 Impact factor: 4.813

1 in total

1. Multiple sclerosis cortical lesion detection with deep learning at ultra-high-field MRI.

Authors: Francesco La Rosa; Erin S Beck; Josefina Maranzano; Ramona-Alexandra Todea; Peter van Gelderen; Jacco A de Zwart; Nicholas J Luciano; Jeff H Duyn; Jean-Philippe Thiran; Cristina Granziera; Daniel S Reich; Pascal Sati; Meritxell Bach Cuadra
Journal: NMR Biomed Date: 2022-03-31 Impact factor: 4.478

1 in total