Uncovering convolutional neural network decisions for diagnosing multiple sclerosis on conventional MRI using layer-wise relevance propagation.

Fabian Eitel¹, Emily Soehler¹, Judith Bellmann-Strobl², Alexander U Brandt³, Klemens Ruprecht⁴, René M Giess⁵, Joseph Kuchling⁶, Susanna Asseyer⁶, Martin Weygandt⁵, John-Dylan Haynes⁷, Michael Scheel⁸, Friedemann Paul⁹, Kerstin Ritter¹⁰.

Abstract

Machine learning-based imaging diagnostics has recently reached or even surpassed the level of clinical experts in several clinical domains. However, classification decisions of a trained machine learning system are typically non-transparent, a major hindrance for clinical integration, error tracking or knowledge discovery. In this study, we present a transparent deep learning framework relying on 3D convolutional neural networks (CNNs) and layer-wise relevance propagation (LRP) for diagnosing multiple sclerosis (MS), the most widespread autoimmune neuroinflammatory disease. MS is commonly diagnosed utilizing a combination of clinical presentation and conventional magnetic resonance imaging (MRI), specifically the occurrence and presentation of white matter lesions in T2-weighted images. We hypothesized that using LRP in a naive predictive model would enable us to uncover relevant image features that a trained CNN uses for decision-making. Since imaging markers in MS are well established, this would enable us to validate the respective CNN model. First, we pre-trained a CNN on MRI data from the Alzheimer's Disease Neuroimaging Initiative (n = 921) and then specialized the CNN to discriminate between MS patients (n = 76) and healthy controls (n = 71). Using LRP, we then produced a heatmap for each subject in the holdout set depicting the voxel-wise relevance for a particular classification decision. The resulting CNN model achieved a balanced accuracy of 87.04% and an area under the receiver operating characteristic curve of 96.08%. The subsequent LRP visualization revealed that the CNN model indeed focuses on individual lesions, but also incorporates additional information such as lesion location, non-lesional white matter or gray matter areas such as the thalamus, which are established conventional and advanced MRI markers in MS. 
We conclude that LRP and the proposed framework have the capability to make diagnostic decisions of CNN models transparent, which could serve to justify classification decisions for clinical review, verify diagnosis-relevant features and potentially gather new disease knowledge.

Keywords: Convolutional neural networks; Deep learning; Multiple sclerosis; MRI; Layer-wise relevance propagation; Visualization; Transfer learning

Year: 2019 PMID: 31634822 PMCID: PMC6807560 DOI: 10.1016/j.nicl.2019.102003

Source DB: PubMed Journal: Neuroimage Clin ISSN: 2213-1582 Impact factor: 4.881

Introduction

Multiple Sclerosis (MS) is the most widespread autoimmune neuroinflammatory disease in young adults with 2.2 million cases reported worldwide (Mitchell et al., 2019). The disease is mainly characterized by inflammation, demyelination and neurodegeneration in the central nervous system and often leads to substantial disability in patients (Reich et al., 2018). The current quasi-standard for diagnosing MS, the McDonald criteria, relies on clinical presentation and the presence of lesions visible in conventional T2-weighted brain magnetic resonance imaging (MRI) data (Thompson et al., 2018). Most common in clinical practice are fluid-suppressed T2-weighted image sequences (e.g. fluid-attenuated inversion recovery sequence [FLAIR]), which are sensitive towards MS-relevant white matter lesions, but also relatively unspecific with respect to underlying disease processes (Geraldes et al., 2018). Several other imaging markers have been described including global brain atrophy, thalamic atrophy, cortical lesions, altered structural and functional connectivity or central vein signs (Lowe et al., 2002; Azevedo et al., 2018; Absinta et al., 2016; Filippi et al., 2016; Sinnecker et al., 2019; Backner et al., 2018; Pawlitzki et al., 2017; Solomon et al., 2017), of which some are captured in conventional MRI and others require advanced MRI techniques such as diffusion weighted imaging or functional MRI. In the last decade, a lot of research effort has been put on the automatic (i.e. data-driven) detection of neurological diseases based on neuroimaging data including MRI (Orrù et al., 2012; Woo et al., 2017). Early approaches combined parameter-based machine learning algorithms, such as support vector machines, with carefully extracted features known or hypothesized to be relevant in the respective disease. 
In MS research, features ranging from T2 lesion characteristics to atrophy to local intensity patterns or multi-scale information extracted from MRI data have been used in combination with standard machine learning analyses to either diagnose MS or predict disease progression (Eshaghi et al., 2018; Nichols et al., 2012; Weygandt et al., 2011; Hackmack et al., 2012a; Hackmack et al., 2012b; Weygandt et al., 2015; Wottschel et al., 2015). While choosing features based on expert criteria reflects the current state of knowledge, it does not allow for finding new and potentially unexpected hidden data properties, which might also help in characterizing a certain disease. Deep learning techniques fill a gap here and allow for utilizing hierarchical information directly from raw or minimally processed data (Lecun et al., 2015). By being specifically tailored to image data, in particular convolutional neural networks (CNNs) have led to major breakthroughs in medical imaging (Litjens et al., 2017; Rajpurkar et al., 2017a; Rajpurkar et al., 2017b; De Fauw et al., 2018). In neuroimaging, most CNN analyses so far focused on Alzheimer's disease (Vieira et al., 2017), but there are also some recent studies in MS. Given the importance of lesions in diagnosing MS and monitoring disease progression, most efforts have been put on the task of lesion segmentation (Valverde et al., 2017; Li et al., 2016; Khastavaneh and Ebrahimpour-Komleh, 2017). Others used CNNs to diagnose MS based on 2-dimensional MRI slices (Wang et al., 2018) or to predict short-term disease activity based on binary lesion masks (Yoo et al., 2016). Despite their potential, deep learning methods are criticized for being non-transparent (such as a ‘black box’) due to the difficulty to retrace the classification decision in light of huge parameter spaces and highly non-linear interactions (Castelvecchi, 2016). 
This is especially problematic in medical applications since understanding and explaining neural network decisions is required for clinical integration, error tracking or knowledge discovery. Explaining neural network decisions is an open research area in computer science and a number of suggestions have been made in recent years. Different directions for explanations include visualizing features (Zeiler and Fergus, 2014), generating images that maximally activate a certain neuron (Olah et al., 2017) and creating heatmaps based on the input images indicating the relevance of each voxel for the final classification decision (Simonyan and Zisserman, 2014; Bach et al., 2015; Springenberg et al., 2015). Heatmaps are in particular valuable in the medical context, since they allow for an easy and intuitive investigation of what the respective classifier found to be important directly in the input data. Besides understanding diagnostic decisions for individual patients, heatmaps might be useful in validating CNN models. Recently, we have shown the potential of transparent CNN applications for knowledge discovery in Alzheimer's disease (Rieke et al., 2018; Böhle et al., 2019). The objective of the current study was to investigate whether a transparency approach can uncover decision processes in MRI-based diagnosis of MS, a disease with well-defined imaging markers, thereby supporting future clinical implementation and verification of machine learning-based diagnosis systems. We present a transparent CNN framework (see Fig. 1) for the MRI-based diagnosis of MS relying on layer-wise relevance propagation (LRP, (Bach et al., 2015; Samek et al., 2017a)) – a heatmap method that has been shown to outperform previous approaches in terms of explainability and disease-specific evidence (Böhle et al., 2019; Samek et al., 2017a). 
Since the data set was rather small (n = 147), we investigated the effect of pre-training the CNN on data from the Alzheimer's Disease Neuroimaging Initiative (ADNI, n = 921). Using LRP, individual heatmaps were generated for each subject and analyzed with respect to well-established imaging features in MS (e.g. white matter lesions or thalamic atrophy). By showing that LRP in combination with a naive CNN model (i.e. a model independent of MS-specific knowledge) indeed helps in uncovering relevant imaging features, we conclude that this framework is useful not only for justifying individual diagnostic decisions but also for validating CNN models (especially in light of small sample sizes).

Fig. 1

Illustration of the transparent CNN framework. In the training phase, the CNN model learns a non-linear relationship between the MRI data and the binary diagnostic labels (MS yes/no). Optionally, the CNN models are pre-trained on a substitute data set or lesions are filled in the MRI data. The learned CNN model is then tested on new subjects to predict the diagnostic label. By supplementing this label with a LRP heatmap, which indicates the relevance of each voxel for the respective label, this framework allows us to understand (at least to some extent) the classification decision in individual subjects. Additionally, the validity of the CNN models can be assessed by matching highlighted brain areas with domain knowledge.

Materials and methods

Subjects

In the present study, we retrospectively analyzed data collected by FP from Charité – Universitätsmedizin Berlin as part of the VIMS study: Follow-up examination of visual parameters for the creation of a database (neuro-ophthalmologic register) in patients with MS versus healthy subjects. We enrolled 76 patients with relapsing-remitting MS according to the McDonald criteria 2010 (Polman et al., 2011) and 71 healthy controls. Patients were excluded if they were outside the age range of 18–69 or did not have an MRI scan. All patients were examined under supervision of a board-certified neurologist at the NeuroCure Clinical Research Center (Charité – Universitätsmedizin Berlin) between January 2011 and July 2015. All participants provided written informed consent prior to their inclusion in the study. The study was approved by the local ethics committee and was performed in accordance with the 1964 Declaration of Helsinki in its currently applicable version. Part of this data has been used in previous studies (e.g. Kuchling et al., 2018). Demographic details of the subjects can be found in Table 1. There is a significant group difference in age (p < 0.05, obtained via a t-test), but not in sex (chi-squared test).

Table 1

Demographics of MS patients and healthy controls. Disease duration is measured in months and lesion volume in ml. EDSS, expanded disability status scale; std., standard deviation.

	MS patients	Healthy controls
Subjects [n]	76	71
Female/Male, in %	55% / 45%	65% / 35%
Age (in years), mean ± std	43.32 (± 11.99)	38.23 (± 13.10)
Disease duration, median, range	139.14 (0–522.59)	n.a.
EDSS, median, range	2.50 (0.00–6.50)	n.a.
Lesion volume, median, range	5.10 (0.12–232.47)	0.09 (0–14.98)

MRI acquisition and preprocessing

All MRI data were acquired on the same 3 T scanner (Tim Trio Siemens, Erlangen, Germany) using a volumetric high-resolution T1-weighted magnetization prepared rapid acquisition gradient echo (MPRAGE) sequence (TR = 1900 ms, TE = 2.55 ms, TI = 900 ms, FOV = 240 × 240 mm², matrix 240 × 240, 176 slices, voxel size: 1 mm isotropic) as well as a volumetric high-resolution fluid-attenuated inversion recovery sequence (FLAIR, TR = 6000 ms, TE = 388 ms, TI = 2100 ms; FOV = 256 × 256 mm², voxel size: 1 mm isotropic). All MR images were bias field corrected using non-parametric non-uniform intensity normalization (Tustison et al., 2010), changed to a robust field of view and linearly registered to MNI space using FMRIB software tools (Jenkinson and Smith, 2001). The FLAIR images were then co-registered to the MPRAGE images using a spline interpolation with FSL FLIRT (Jenkinson et al., 2002). Lesion segmentation was done semi-automatically on FLAIR using the lesion prediction algorithm (Schmidt, 2017) as implemented in the Lesion Segmentation Toolbox version 2.0.15. Lesion masks were subsequently manually corrected by two raters using ITK-SNAP (Yushkevich et al., 2006). Both raters have more than 5 years of experience in T2 lesion segmentation and were supervised by a board-certified neuroradiologist (MS). Raters were not blinded to the diagnosis. Generation of a brain mask and tissue segmentation into gray matter, white matter, and cerebrospinal fluid was achieved using the Computational Anatomy Toolbox version 11.09 (Gaser and Dahnke, 2016) implemented in SPM12 version 7219. The data were preprocessed in this way to ensure that images are in relative alignment while preserving individual structural variations. Only FLAIR data entered the subsequent analyses because this is the most sensitive sequence for lesions and is used in clinical routine for diagnosing MS and monitoring disease progression. 
For computational efficiency initial scan volumes (182 × 218 × 182) were down-sampled to 96 × 114 × 96 voxels (voxel size: 2 mm isotropic) and standardized for each subject using min-max scaling. To analyze what the classifier picks up when there are no lesions, we generated an additional MRI data set, in which the lesions in FLAIR images were filled. For this, we implemented a version of (Valverde et al., 2014), in which lesion areas (according to the manually segmented lesion masks) have been replaced by local average intensities in normal-appearing white matter. White matter maps were obtained from the SPM 12 tissue segmentation algorithm (Ashburner and Friston, 2003).
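The two voxel-level operations described above (per-subject min-max scaling and lesion filling) can be illustrated with a minimal sketch. It is not the authors' implementation: it works on a flat list of intensities rather than a 3D volume, and it fills lesions with the global mean of normal-appearing white matter instead of the local averages used in the method the text cites. The function names `min_max_scale` and `fill_lesions` and the toy values are assumptions made for illustration.

```python
def min_max_scale(voxels):
    """Rescale a flat list of intensities to the range [0, 1]."""
    lo, hi = min(voxels), max(voxels)
    if hi == lo:
        # A constant image carries no contrast; map everything to 0.
        return [0.0 for _ in voxels]
    return [(v - lo) / (hi - lo) for v in voxels]


def fill_lesions(voxels, lesion_mask, wm_mask):
    """Replace lesion voxels by the mean of normal-appearing white matter.

    Simplification (assumption): a single global mean is used here,
    whereas the text describes local average intensities.
    """
    nawm = [v for v, les, wm in zip(voxels, lesion_mask, wm_mask)
            if wm and not les]
    if not nawm:
        raise ValueError("no normal-appearing white matter voxels available")
    fill_value = sum(nawm) / len(nawm)
    return [fill_value if les else v for v, les in zip(voxels, lesion_mask)]


# Toy example: six voxels, one hyperintense lesion inside white matter.
scan = [10.0, 50.0, 60.0, 200.0, 55.0, 30.0]
lesion_mask = [0, 0, 0, 1, 0, 0]
wm_mask = [0, 1, 1, 1, 1, 0]

filled = fill_lesions(scan, lesion_mask, wm_mask)
scaled = min_max_scale(filled)
```

In this toy case the lesion voxel takes the mean of the three remaining white matter voxels, so the hyperintensity no longer dominates the intensity range after scaling.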

ADNI data for pre-training

Data used for pre-training were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We used subjects from ADNI phase 1 who were included in one of two standard MRI collections (Wyman et al., 2013). We only selected MRI data of Alzheimer's disease (AD) patients and cognitively normal subjects, in total 921 MRI scans from 389 subjects (covering one to three time points). Follow-up acquisitions can be interpreted as a form of data augmentation used to increase the variance within the training database. Demographic information can be found in Table 2. The MRI scans were acquired with 1.5 Tesla scanners at multiple sites and had already undergone gradient non-linearity, intensity inhomogeneity and phantom-based distortion correction. T1-weighted MPRAGE scans were downloaded and warped to MNI space with ANTs (Avants et al., 2011). As for the MS data, the initial scan volumes were down-sampled to 96 × 114 × 96 voxels and standardized.

Table 2

Demographics of ADNI data set.

	AD patients	Healthy controls
Subjects [n]	231	158
Female/Male, in %	42% / 58%	48% / 52%
Age (in years), mean ± std	74.98 (± 7.40)	75.93 (± 5.01)

Classification and visualization analyses

Based on the preprocessed FLAIR data, we first trained several CNN models (with and without pre-training, with and without lesion-filling) to discriminate MS patients and healthy controls and then explained the model's decisions for individual subjects in the test data using LRP. For the CNN models, we evaluated the effect of transfer learning by (1) training the model solely on MS data and (2) pre-training the model on ADNI data and fine-tuning it on MS data. To examine whether our pre-trained network can also learn from only normal-appearing brain matter (NABM), i.e. regions without hyperintense lesions, we retrained the network on lesion-filled FLAIR data. As baseline analyses, we included a support vector machine to classify based on (1) lesion volume and (2) preprocessed FLAIR data. Prior to training, the MS data set was randomly split into two sets: (1) a set for training and hyperparameter optimization (85%) and (2) a holdout set used only for final model evaluation (15%). The code for all models and also the lesion filling algorithm is available at https://github.com/derEitel/explainableMS. In the following subsections, we specify our parameter settings for CNNs, transfer learning and visualization techniques (in particular LRP).

Convolutional neural networks

In this study, we used a 3D CNN architecture consisting of four convolutional layers followed by exponential linear unit (ELU) activation functions and four max-pooling layers applied after the first, second and fourth ELU activation. For each convolutional layer, we learned 64 filters with a kernel size of 3 × 3 × 3. Finally, a linear layer with an output shape of 1 and a sigmoid activation returns the classification score. To improve generalization, the model was regularized using dropout on the outputs of each max-pooling layer (p = 0.3), L2-regularization (λ = 0.01) on the weights of the third and fourth convolutional layer, and early stopping once the validation loss had not improved for 10 (pre-training) or 15 (fine-tuning) epochs. We trained all models using the Adam optimizer (Kingma and Ba, 2014). Hyperparameters (including learning rate, L2 regularization and dropout probability) were optimized on 85% of the training data, leaving 15% for validation. After finding suitable hyperparameters, the model performance was tested out-of-sample on the holdout set. To increase robustness, all CNN experiments were repeated 10 times on the same data split, and thus reported metrics are an average over all 10 trials. We report balanced accuracy as the mean of sensitivity and specificity as well as the area under the receiver operating characteristic curve (AUC). All code was implemented using Keras (Chollet, 2015) with the TensorFlow (Abadi et al., 2015) backend.
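The two reported metrics can be made concrete with a short sketch. Balanced accuracy follows the definition given in the text (mean of sensitivity and specificity). The AUC is computed here through its rank interpretation, i.e. the probability that a randomly chosen patient receives a higher score than a randomly chosen control, which is one standard way to obtain it and not necessarily the routine used in the study. The labels, scores and the 0.5 decision threshold are invented for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = patient)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    return (sensitivity + specificity) / 2


def roc_auc(y_true, scores):
    """AUC as the fraction of (patient, control) pairs ranked correctly.

    Ties between a patient score and a control score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical sigmoid outputs for four patients and four controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

bal_acc = balanced_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Because the AUC depends only on the ranking of the scores and not on a threshold, a model can reach a high AUC while its balanced accuracy at a fixed threshold remains moderate.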

Transfer learning

Due to the small sample size of the MS data set, we employed the principle of transfer learning (Crammer et al., 2008; Duan et al., 2009; Ben-David et al., 2010), which has been shown to improve performance in medical imaging including MRI data (Gupta et al., 2013; Tajbakhsh et al., 2016; Ghafoorian et al., 2017; Hosseini-Asl et al., 2018; Basaia et al., 2019). We pre-trained our CNN model on ADNI MRI data to separate AD patients and healthy controls, and fine-tuned it on the MS data set to separate MS patients and healthy controls. Since the ADNI data set contains multiple scans for several subjects we ensured that validation and testing was done on disjoint subject sets. The average balanced accuracy over all trials was 78.47%. For further analysis, we selected a model from the 10 trials based on its performance, and then picked its training checkpoint with the best validation accuracy of 82.50%. Fine-tuning on the MS data set uses the same model architecture, which is initialized with the weights and biases of the selected pre-trained model instead of randomly distributed values. We allow all layers to re-learn because we transferred a CNN model between rather different tasks and data sets, in particular (1) across diseases (AD to MS) and (2) across MRI sequences (MPRAGE to FLAIR) exhibiting different magnetic field strengths (1.5 and 3 Tesla). Additionally, the data was augmented during fine-tuning, such that during the creation of each mini-batch each image was flipped along the sagittal axis with a probability of 50% and randomly translated between −2 and 2 pixels within the axial plane. We found optimal initial learning rates to be 0.001 in the pre-training and 0.0005 with a 0.002 decay in the fine-tuning phase.
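The augmentation step described above (random mirroring and small in-plane shifts) can be sketched as follows. This is an illustrative assumption-laden version, not the authors' code: the volume is a nested list indexed as `volume[x][y][z]`, the first axis is assumed to be left-right, the axial plane is assumed to be spanned by the first two axes, and voxels shifted in from outside the volume are filled with zeros. All function names are hypothetical.

```python
import random


def flip_left_right(volume):
    """Mirror a volume indexed as volume[x][y][z] along its first axis."""
    return [[list(row) for row in plane] for plane in reversed(volume)]


def translate_in_plane(volume, dx, dy, fill=0.0):
    """Shift the volume by (dx, dy) voxels along the first two axes."""
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    shifted = [[[fill] * nz for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            src_x, src_y = x - dx, y - dy
            if 0 <= src_x < nx and 0 <= src_y < ny:
                shifted[x][y] = list(volume[src_x][src_y])
    return shifted


def augment(volume, rng, max_shift=2, flip_probability=0.5):
    """Apply a random flip and a random in-plane translation to one image."""
    if rng.random() < flip_probability:
        volume = flip_left_right(volume)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate_in_plane(volume, dx, dy)


# Toy 3 x 3 x 1 volume whose voxel value encodes its position.
toy = [[[float(3 * x + y)] for y in range(3)] for x in range(3)]
flipped = flip_left_right(toy)
shifted = translate_in_plane(toy, 1, 0)
augmented = augment(toy, random.Random(0))
```

Drawing the transformation anew for every image in every mini-batch, as the text describes, means the network rarely sees exactly the same input twice during fine-tuning.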

Visualization

Deep learning methods are often criticized for their lack of interpretability, and in recent years much research has focused on improving the interpretability of neural networks (Castelvecchi, 2016; Montavon et al., 2018; Lapuschkin et al., 2019). While some work has focused on understanding class representations and functions of individual neurons, others have developed methods to generate heatmaps based on the input data that indicate the importance or relevance of each pixel or voxel for the final classification decision (Bach et al., 2015; Springenberg et al., 2015; Simonyan et al., 2013). The latter approach is particularly promising in the medical field since it allows individual classification decisions to be explained quickly and intuitively without the need to delve deeply into the network structure (Böhle et al., 2019). Generally, a distinction is made between local and global attribution methods (Ancona et al., 2017). Whereas local attribution methods represent how a change in a specific voxel would impact the network's output and solely rely on the network's gradient (e.g. sensitivity analysis resulting in image-specific saliency maps), global attribution methods adjust the relevance of the presence of a feature globally by weighting it with the entire input and thus are more suitable for explanation. In the present study, we used LRP, which has been shown to be a powerful global attribution method (Bach et al., 2015; Samek et al., 2017a; Lapuschkin et al., 2019). It uses the classification score f(x) directly (and not the gradient as in most other visualization methods) and propagates it through the network using the following rule:

$$R_i^{(l)} = \sum_j \frac{x_i w_{ij}}{\sum_{i'} x_{i'} w_{i'j} + \varepsilon} \, R_j^{(l+1)} \qquad (1)$$

Here, the relevance from layer $R^{(l+1)}$ is propagated to its previous layer $R^{(l)}$. The term ε is set to a small value (in this study: 0.001) to avoid division by 0. 
By using both the activation $x_i$ as well as the weights $w_{ij}$ connecting neurons i and j of consecutive layers, LRP assigns a larger share to neurons that are more strongly activated and to connections which have been reinforced during training (Samek et al., 2017b). By decomposing the classification score f(x) rather than the gradient and conserving the classification score during backpropagation, LRP overcomes the flaws of sensitivity analysis (Samek et al., 2017b) and has been shown to provide evidence for AD in individual subjects (Böhle et al., 2019). Recently, it has been shown that LRP can be formulated in the same mathematical framework as other global attribution methods including gradient*input (Shrikumar et al., 2017), integrated gradients (Sundararajan et al., 2017) and DeepLIFT (Shrikumar et al., 2017), and that these methods are equivalent under certain assumptions (Ancona et al., 2017). In this study, we produced individual LRP heatmaps for every subject in the holdout set. We used the iNNvestigate implementation of LRP (Alber et al., 2018). For comparison, we produced heatmaps using gradient*input as an alternative global attribution method.
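How relevance is redistributed layer by layer can be illustrated on a toy fully connected network. The sketch below applies the commonly used ε-stabilized LRP rule to two bias-free dense layers. It is not the iNNvestigate implementation used in the study, and it makes several assumptions: ReLU is used instead of ELU, the stabilizer ε takes the sign of the pre-activation (a frequent convention), and all weights and inputs are invented.

```python
def dense_forward(x, weights):
    """Pre-activations z_j = sum_i x_i * w_ij for a bias-free dense layer."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(n_out)]


def relu(values):
    return [max(0.0, v) for v in values]


def lrp_epsilon(x, weights, relevance_out, eps=0.001):
    """Redistribute the relevance of a layer's outputs onto its inputs.

    Each input i receives the share x_i * w_ij / (z_j + eps) of the
    relevance of output j. The sign-aware eps is an assumption.
    """
    z = dense_forward(x, weights)
    relevance_in = [0.0] * len(x)
    for j, r_j in enumerate(relevance_out):
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(len(x)):
            relevance_in[i] += x[i] * weights[i][j] / denom * r_j
    return relevance_in


# Toy network: 2 inputs -> 2 hidden units -> 1 output score.
x = [1.0, 2.0]
w1 = [[0.5, -0.5], [0.25, 1.0]]
w2 = [[1.0], [0.5]]

hidden = relu(dense_forward(x, w1))
score = dense_forward(hidden, w2)

# Start from the output score and propagate it back to the input.
r_hidden = lrp_epsilon(hidden, w2, score)
r_input = lrp_epsilon(x, w1, r_hidden)
```

The input relevances in `r_input` sum to approximately the output score, with a small loss absorbed by ε at each layer. This near-conservation is why per-voxel relevance values become very small when a single score is spread over a whole 3D volume.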

Evaluation of heatmaps

Besides qualitatively comparing individual heatmaps, we compared average heatmaps of MS patients and healthy controls. We evaluated the importance of different brain regions by computing the average relevance for each brain area in (1) the Neuromorphometrics atlas (Bakker et al., 2015), mostly containing gray matter regions, and (2) the JHU DTI-based white-matter atlas (Mori and Crain, 2005), containing white matter regions. Areas were aggregated between the left and right hemisphere and certain substructures were combined into one region. For visualization of (1) we selected the 30 areas with the highest sum of absolute relevance means across MS patients and healthy controls in the test set, yielding areas with both the highest and lowest relevance. Please note that the MRI data have only been linearly registered and thus slight deviations from the anatomical locations stated in the atlases are conceivable. To evaluate the effect of transfer learning on the heatmaps, we compared average heatmaps for MS patients before and after pre-training. To assess the relevance of normal-appearing brain areas in contrast to lesion areas, we computed relevance scores separately for the original MRI data set and the lesion-filled MRI data set. To assess the amount of relevance attributed to the lesions in the original MRI data set, we computed

$$\frac{\sum \left( lm \odot hm^{+} \right)}{\sum hm^{+}} \qquad (2)$$

where lm is the individual lesion mask and hm+ the individual positive relevance.
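One plausible reading of the lesion-relevance measure is the share of a subject's positive relevance that falls inside the lesion mask. The sketch below computes this share together with the lesion coverage for comparison. The exact form of the measure is an assumption inferred from the description of lm and hm+, and the function name `lesion_relevance_share` and the toy values are hypothetical.

```python
def lesion_relevance_share(lesion_mask, heatmap):
    """Fraction of positive relevance located inside the lesion mask.

    lesion_mask: flat list of 0/1 values (lm in the text).
    heatmap: flat list of relevance values; negatives are clipped to
    zero to obtain the positive relevance (hm+ in the text).
    """
    positive = [max(0.0, r) for r in heatmap]
    total = sum(positive)
    if total == 0:
        return 0.0
    inside = sum(p for p, m in zip(positive, lesion_mask) if m)
    return inside / total


# Toy example with five voxels, two of which belong to a lesion.
heatmap = [0.02, -0.01, 0.03, 0.0, 0.05]
lesion_mask = [0, 0, 1, 0, 1]

share = lesion_relevance_share(lesion_mask, heatmap)
coverage = sum(lesion_mask) / len(lesion_mask)
```

Comparing `share` with `coverage` indicates whether lesions attract more relevance than their volume alone would suggest.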

Baseline analyses

As a baseline we have trained a support vector machine (SVM) to classify between MS patients and healthy controls based on (1) FLAIR lesion load and (2) preprocessed FLAIR volumes. Hyperparameters were tuned on the training data set using grid search, nested within a 5-fold cross-validation (SVM kernel: linear and radial basis function [RBF], C, γ = [0.001,0.1,1,10]); for the preprocessed FLAIR volumes an optional prior dimensionality reduction step via principal component analysis was performed.

Results

Classification performance

In Table 3, we depict the performance for the different classification models. As expected FLAIR lesion load – as one of the core biomarkers in MS – in combination with a SVM led to a high balanced accuracy (88.46%) and a high AUC (94.62%). When instead of the FLAIR lesion load the entire FLAIR volume is used as input to the SVM, the AUC dropped down to 66.92%. The CNN model solely trained on the MS data set resulted in a balanced accuracy of 71.23% and an AUC of 85.46%. When the network has been pre-trained on the ADNI data set and fine-tuned to the MS data set, the balanced accuracy increased by 16 percentage points to 87.04% and is therefore comparable to the performance of the baseline FLAIR lesion load model. Moreover, the pre-trained CNN model outperformed all other classifiers in terms of AUC (96.08%) and importantly also in terms of sensitivity (93.08%). The ROC curve for all 10 trials is shown in supplementary Fig. 1. For further processing we have selected the model with the best validation balanced accuracy from the 10 training repetitions of 91.67%, which achieved a holdout balanced accuracy of 91.15%. Its training curve can be found in supplementary Fig. 2. To assess the impact of normal-appearing brain matter, we trained the same CNN model on lesion-filled FLAIR data. Still, a reasonable balanced accuracy of 70.15% and a relatively high AUC of 90.92% has been achieved.

Table 3

Performance (in %) for the different models on the holdout data set. Values are averages over 10 trials. Highest values per column are highlighted in bold. Pre-train., pre-training; Class., classifier; Bal. acc., balanced accuracy; Sens., sensitivity; Spec., specificity; AUC, area under the curve of the receiver operating characteristic; les. fill., lesions filled.

Data	Pre-train.	Class.	Bal. acc.	Sens.	Spec.	AUC
FLAIR lesion load	–	SVM	88.46%	76.92%	100.00%	94.62%
FLAIR	–	SVM	66.92%	53.85%	80.00%	66.92%
FLAIR	no	CNN	71.23%	68.46%	74.00%	85.46%
FLAIR	yes	CNN	87.04%	93.08%	81.00%	96.08%
FLAIR - les. fill.	yes	CNN	70.15%	92.31%	48.00%	90.92%
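Since balanced accuracy is defined in the Methods as the mean of sensitivity and specificity, the rows of Table 3 can be checked for internal consistency. The short sketch below recomputes the mean for each row from the values in the table. The row labels are shorthand introduced here, and the tolerance of 0.01 percentage points is an assumption that allows for rounding in the reported values.

```python
# (sensitivity, specificity, reported balanced accuracy), all in percent,
# copied from Table 3.
rows = {
    "SVM, FLAIR lesion load": (76.92, 100.00, 88.46),
    "SVM, FLAIR": (53.85, 80.00, 66.92),
    "CNN, FLAIR, no pre-training": (68.46, 74.00, 71.23),
    "CNN, FLAIR, pre-trained": (93.08, 81.00, 87.04),
    "CNN, lesion-filled FLAIR, pre-trained": (92.31, 48.00, 70.15),
}

# Absolute deviation between the recomputed mean and the reported value.
deviations = {
    name: abs((sens + spec) / 2 - reported)
    for name, (sens, spec, reported) in rows.items()
}

for name, deviation in deviations.items():
    print(f"{name}: deviation {deviation:.3f} percentage points")
```

The lesion-filled row illustrates why sensitivity and specificity are reported separately: its balanced accuracy is similar to that of the CNN without pre-training, but it arises from high sensitivity combined with low specificity.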

Visualization

After the CNN models had been trained, we used LRP to generate an individual heatmap for each subject in the holdout data set indicating the relevance of each voxel for the respective classification decision. In Fig. 2, we show the individual heatmaps overlaid on the FLAIR data for four correctly classified MS patients, who achieved the highest classification scores in terms of the sigmoid output. High classification scores generally indicate a higher confidence of the model for the respective classification decision and thus the corresponding explanations are usually more pronounced and less diffuse than for cases with lower classification scores. All four patients have in common that high positive relevance is attributed around the occipital horn of both lateral ventricles and covers periventricular lesion areas as well as the body and splenium of the corpus callosum. Even though the images were clearly classified as MS, certain regions are assigned negative relevance, meaning that these areas speak against the MS diagnosis. Negative relevance can be found around the frontal horn of both ventricles, notably even in periventricular lesion areas (see for example subject 1). Interestingly, lesions not bordering the ventricles often seem to be ignored or are assigned negative relevance. For comparison, we show and discuss individual heatmaps of two misclassified subjects in supplementary Fig. 3.

Fig. 2

Individual LRP heatmaps (overlaid on the input FLAIR data) for the four MS patients with the highest classification score in terms of the sigmoid output. Heatmap values are normalized in the range [−0.03, 0.03]. Colors indicate regions supporting (red) or rejecting (blue) the classification as an MS patient with respect to the underlying CNN model.

In Fig. 3, we show average heatmaps for all correctly classified MS patients (top) and all correctly classified healthy controls (bottom) in the holdout set. In accordance with the heatmaps of the individual subjects in Fig. 2, posterior periventricular white matter regions have a strong positive relevance for the MS diagnosis. This is true for both MS patients and healthy controls, but the effect is less pronounced for healthy controls. The reversed effect can be seen for clusters exhibiting negative relevance in white matter areas in the corpus callosum and close to the occipital and parietal lobe. Over all voxels healthy controls typically obtain a negative relevance sum (mean ± std.: −1.05e-06 ± 0.0013) as opposed to a positive relevance sum in MS patients (3.07e-06 ± 0.0014). Notably, the total relevance attributed to lesion areas was on average 5.15% (on MS patients 9.71%) compared to a lesion coverage of only 0.41% in the training data set. In Fig. 4, we show that the sum of voxels containing lesions (referred to as lesion sum) and LRP relevance sum are significantly correlated for training and holdout data.

Fig. 3

Average LRP heatmaps for all correctly classified MS patients (top) and all correctly classified healthy controls (bottom) in the holdout set. Values are normalized in the range [−0.02, 0.02]. Please note that the underlying brain map has been computed as the average of all training subjects and does not reflect the MRI data of individual subjects.

Fig. 4

Correlation between lesion sum and LRP relevance sum. The Pearson correlation coefficient is shown for both training and holdout set separately, of which both are significant (p < 0.001, p < 0.001, permutation test). The size of each data point shows the lesion-relevance similarity according to Eq. (2).

In Fig. 5, we depict the region-wise LRP relevance for MS diagnosis, separately for MS patients and healthy controls. In the Neuromorphometrics atlas (see Fig. 5a), most relevance is attributed to cerebral white matter, followed by thalamus, lateral ventricles and diencephalon. Negative relevance is strongest in the precuneus, followed by lingual gyrus, cuneus and insula. In the JHU white matter atlas (see Fig. 5b), most positive relevance is attributed to posterior corona radiata and corpus callosum, followed by posterior thalamic radiation, tapetum, internal capsule and fornix. Notably, these areas are generally characterized by a high lesion density, which is also present in this MS data set (see supplementary Figs. 4 and 5). Negative relevance has been found in the superior and anterior corona radiata. Generally, the relevance for MS patients is higher in white matter than in gray matter areas. Moreover, the differences between MS patients and healthy controls are more pronounced in white matter areas.

Fig. 5

LRP relevance distribution over (a) 30 (mainly) gray matter areas from the Neuromorphometrics atlas and (b) 22 white matter areas from the JHU ICBM-DTI atlas, separately for MS patients and healthy controls in the holdout set. The absolute values per region are rather small as LRP aims to conserve the sigmoid output and distributes it over all voxels.

The qualitative and quantitative analysis using another global attribution method, namely gradient*input, produced highly similar results as shown in supplementary Figs. 6 and 7. In Fig. 6, we show the effects of transfer learning on the average relevance heatmaps for the MS patients in the holdout set. For the untrained model with random parameters (first row), only scarcely distributed individual voxels attain tiny relevance values. For the CNN model trained on ADNI and directly applied to MS patients (without fine-tuning; second row), more voxels are attributed relevance and are diffusely clustered. For the CNN model trained only on MS data (without pre-training; third row), strong relevance is projected to the ventricles and periventricular white matter. And finally, for the pre-trained model (transfer learning from ADNI to MS; last row), distinct clusters for both positive and negative relevance can be detected, which are more delineated than for the CNN model without pre-training.
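The remark in the caption of Fig. 5 that LRP conserves the network output and distributes it over all voxels can be illustrated on a tiny network. The sketch below uses the LRP-epsilon rule on two bias-free dense layers; the specific rule, the helper `lrp_dense`, the weights and the choice to start from the pre-sigmoid output are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np

def lrp_dense(a, w, relevance_out, eps=1e-9):
    """Redistribute relevance through one bias-free dense layer (LRP-epsilon).

    a:             input activations, shape (n_in,)
    w:             weights, shape (n_in, n_out)
    relevance_out: relevance of the layer outputs, shape (n_out,)
    """
    z = a @ w                                   # pre-activations, shape (n_out,)
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # stabilizer avoids division by zero
    s = relevance_out / z
    return a * (w @ s)                          # input relevance, shape (n_in,)

# Four "voxels" of a toy input and hand-picked weights (illustration only).
x = np.array([1.0, 0.5, 0.0, 2.0])
w1 = np.array([[1.0, -1.0],
               [2.0, 0.0],
               [0.5, 3.0],
               [-0.5, 1.0]])
w2 = np.array([[2.0], [-0.5]])

hidden = np.maximum(0.0, x @ w1)   # ReLU hidden layer
output = hidden @ w2               # network output before the sigmoid, shape (1,)

# Propagate the output backwards, layer by layer, down to the input voxels.
r_hidden = lrp_dense(hidden, w2, output)
r_input = lrp_dense(x, w1, r_hidden)
```

In this toy case the input relevances sum to the network output up to the stabilizer, and a voxel with zero intensity receives zero relevance. With hundreds of thousands of voxels sharing one output value, each individual share is necessarily small.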

Fig. 6

Average heatmaps for different CNN models applied to the MS (VIMS) cohort – starting from an untrained CNN model with random parameters over a CNN trained only on either ADNI or MS data to a CNN pre-trained on ADNI and fine-tuned on MS. As can be seen, the fine-tuned model led to the most concise regions of positive and negative relevance. Please note that we averaged here the heatmaps over all (not only the correctly classified) MS patients in the holdout set and that the heatmap values here are not normalized to a fixed range but shown with respect to the minimum value of the untrained model.

To assess the contribution of normal-appearing brain matter, we compared the relevance maps between the CNN models trained on the original FLAIR data and the lesion-filled FLAIR data (for the performance see Table 3). In Fig. 7, we depict the relevance for the 10 top-scored white matter regions, separately for both models. In general, one can see that the relevance shifts from a distribution more evenly spread among multiple areas to a distribution with a prominent peak and otherwise low shares of relevance. Notably, relevance is shifted away from areas with large amounts of lesions such as posterior corona radiata, posterior thalamic radiation as well as tapetum towards mainly the corpus callosum and regions with very few lesions like fornix and external capsule (see supplementary Fig. 4 for distribution of white matter lesions).

Fig. 7

Comparison of average relevance distribution over white matter areas for a CNN model trained on original FLAIR data (left) and lesion-filled FLAIR data (right; NABM, normal-appearing brain matter). We calculated the relevance sum of both models (averaged over subjects) and show the 10 areas with the highest score.

Discussion

Summary

In the present study, we introduced a transparent framework for analyzing neuroimaging data with CNNs that is able to explain individual classification decisions. By utilizing transfer learning we could further achieve good classification results from only a small data set of task-specific data. In combination with LRP, we could demonstrate the capacity of our framework to learn significant MS-relevant information from conventional MRI data. Notably, a pre-trained CNN was able to identify MS patients with an accuracy similar to a classical machine learning analysis, in which the FLAIR lesion load was used as input. This is quite remarkable, because the CNN model was considered to be naive by not being provided with any prior information on MS-relevant features such as hyperintense lesions. The subsequent visualization analysis, using heatmaps generated by LRP, revealed that the CNN model indeed uses (posterior) white matter lesions as primary information source. In addition, other information, for example in normal-appearing white and gray matter (such as the thalamus), has been found useful by the CNN model.

Related work

Compared to other neurological diseases, in particular AD, only a few MS studies exist that employ machine learning methods outside the scope of lesion segmentation. We think that the main reasons are (1) the lack of easily accessible large open databases such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) database and (2) the focus on white matter lesion volume as primary MRI-derived outcome measure in MS. Classical machine learning methods in combination with more or less sophisticated feature extraction methods, from both conventional and advanced MRI data, have been used to (1) diagnose MS (Weygandt et al., 2011; Hackmack et al., 2012b; Zurita et al., 2018; Eshaghi et al., 2016), (2) decode symptom severity (Hackmack et al., 2012a), (3) identify clinical subtypes (Eshaghi et al., 2018; Nichols et al., 2012; Eshaghi et al., 2015) and (4) predict conversion from clinically isolated syndrome to MS (Wottschel et al., 2015; Bendfeldt et al., 2019). Deep learning architectures have so far been implemented for lesion segmentation (Valverde et al., 2017; Li et al., 2016; Khastavaneh and Ebrahimpour-Komleh, 2017), predicting MS based on binary lesion masks (Yoo et al., 2016), modelling brain and lesion variability (Brosch, 2016) and finding differences in normal-appearing brain matter based on T1-weighted and myelin images (Yoo et al., 2018). To the best of our knowledge, the present study is the first study employing CNNs and advanced visualization techniques for diagnosing MS based on the clinically most relevant MRI sequence (i.e. FLAIR). It is generally recognized that, especially in the medical field, it is very important that classification decisions are reasonably explained even in light of high accuracies (which are no guarantee for a – from a human perspective – sensible discrimination strategy (Lapuschkin et al., 2019; Lapuschkin et al., 2016)).
Although a number of methods exist that generate individual heatmaps (Zeiler and Fergus, 2014; Springenberg et al., 2015; Simonyan et al., 2013; Zintgraf et al., 2017), we focused here on the LRP method (Bach et al., 2015; Montavon et al., 2018; Lapuschkin et al., 2019), which has a solid theoretical framework and has been extensively validated (see e.g. (Samek et al., 2017a; Lapuschkin et al., 2019; Samek et al., 2017b)). Very recently, LRP has been shown to be very helpful for explaining cognitive states or AD diagnosis in deep neural networks trained on either functional or structural MRI data (Böhle et al., 2019; Thomas et al., 2018). To the best of our knowledge, these are the only applications of LRP in the neuroimaging field. In the present study, we demonstrated that LRP is capable of identifying reasonable areas supporting an MS diagnosis in addition to features needing further clinical validation. Those areas have been shown to be robust using gradient*input as a different visualization method. By this, we have shown that those heatmaps can be very valuable in explaining decisions of neural networks trained on small sample sizes and in verifying whether an algorithm has learned something meaningful (i.e. matching domain knowledge) or just spotted biases or artifacts in the data (see also (Springenberg et al., 2015; Lapuschkin et al., 2019)).

Key findings

CNNs learn to identify lesions as an important biomarker for MS

Although our pre-trained CNN model did not get any prior information about the relevance of hyperintense lesions for MS, it learned to successfully identify lesions as a primary information source. Notably, the total relevance attributed to lesion areas was on average 5.15% (on MS patients 9.71%) compared to a lesion coverage of only 0.41% in the training data set. In addition, LRP relevance sum was significantly correlated with lesion sum. We show that LRP heatmaps not only detect single lesions in individual patients but generally attribute most positive relevance to white matter areas around the posterior occipital horns. Importantly, the CNN model did not simply assign high relevance to hyperintense areas in the brain, but learned to distinguish between different lesion locations: while anterior periventricular lesions as well as lesions not bordering the lateral ventricles were assigned no or negative relevance, only posterior periventricular lesion areas were assigned positive relevance for MS. Interestingly, hyperintensities in posterior ventricular regions seem to be the main reason why the healthy control in supplementary Fig. 3 has been misclassified as an MS patient. In general, strongest positive relevance was found in posterior corona radiata, corpus callosum and thalamic radiation, which are characterized by a high lesion density in MS patients (see (Gass et al., 2012) and supplementary Figs. 4 and 5).
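Two quantities in this paragraph lend themselves to a short worked example: the share of relevance that falls inside lesion areas compared with lesion coverage, and the permutation test on the Pearson correlation between lesion sum and relevance sum. The sketch below is one plausible way to compute them and is not the authors' implementation. The function names, the restriction to positive relevance, the two-sided test, the number of permutations and all toy data are assumptions.

```python
import numpy as np

def lesion_relevance_share(heatmap, lesion_mask):
    """Fraction of positive relevance located inside the lesion mask.

    Assumption: only positive relevance is counted.
    """
    positive = np.clip(heatmap, 0.0, None)
    total = positive.sum()
    return float(positive[lesion_mask].sum() / total) if total > 0 else 0.0

def pearson_permutation_test(x, y, n_perm=2000, seed=0):
    """Pearson r with a two-sided permutation p-value (y is shuffled)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    r_null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                       for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly positive.
    p = (np.sum(np.abs(r_null) >= abs(r_obs)) + 1) / (n_perm + 1)
    return float(r_obs), float(p)

# Toy volume: lesions cover 1% of voxels but carry higher relevance.
heatmap = np.full((10, 10, 10), 0.001)
lesion_mask = np.zeros((10, 10, 10), dtype=bool)
lesion_mask[0, 0, :] = True          # 10 of 1000 voxels
heatmap[lesion_mask] = 0.01
share = lesion_relevance_share(heatmap, lesion_mask)
coverage = float(lesion_mask.mean())

# Toy per-subject sums: relevance grows with lesion load plus a wobble.
lesion_sum = np.arange(1, 21, dtype=float)
relevance_sum = 0.5 * lesion_sum + np.sin(lesion_sum)
r, p = pearson_permutation_test(lesion_sum, relevance_sum)
```

A relevance share well above the coverage, as in the toy volume, is the pattern the paragraph reports (5.15% of relevance against 0.41% coverage).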

CNNs learn to identify relevant areas beyond lesions

The CNN model primarily focuses on lesions, but relevance has also been attributed to gray matter areas such as the thalamus, which is known to be affected in MS from earliest disease stages (Azevedo et al., 2018; Azevedo et al., 2015). To further investigate what the CNN model learns beyond lesions, we repeated the analysis on lesion-filled FLAIR data. As expected, the balanced accuracy as well as AUC decreased (by almost 17 and 6 percentage points respectively) and relevance shifted away from regions which typically contain hyperintense lesions. The region that was assigned most relevance after lesion removal was the corpus callosum. While the corpus callosum is generally susceptible to demyelinating lesions (Barnard and Triggs, 1974; Garg et al., 2015; Renard et al., 2014), the literature also suggests further biomarkers such as axonal loss and diffuse atrophy (Renard et al., 2014; Evangelou et al., 2000) or narrow T2 hyperintense bands along the callosal-septal interface (Garg et al., 2015). The fornix, even though it contains a very small amount of lesions (see supplementary Fig. 4 and (Thomas et al., 2011)), is assigned positive relevance with lesions and an increased relevance without lesions. It has been shown that lower fractional anisotropy in the fornix is exhibited in MS subjects in comparison to healthy controls (Roosendaal et al., 2009; Kern et al., 2012). Additionally, external capsule and superior cerebellar peduncle receive only positive relevance after lesion removal, which were found to be affected in MS patients (Anderson et al., 2011; Zhang et al., 2017). These results are generally in line with other machine learning studies finding differences in normal-appearing brain matter in MS patients (Weygandt et al., 2011; Hackmack et al., 2012a; Yoo et al., 2018).
It would be very interesting to further investigate whether our findings correlate with underlying pathological mechanisms only demonstrable by advanced MRI sequences such as diffusion weighted imaging or magnetization transfer imaging.

Transfer learning improves learning across diseases and MRI sequences

In recent years, transfer learning has been successfully employed in brain lesion segmentation (Ghafoorian et al., 2017) and AD classification (Gupta et al., 2013; Hosseini-Asl et al., 2018; Payan and Montana, 2015). The latter studies used either autoencoders trained on MRI data or natural images (Gupta et al., 2013; Payan and Montana, 2015) or used one AD data set for pre-training and another AD data set for fine-tuning (Hosseini-Asl et al., 2018). In the present study, we have shown that transfer learning can also help in learning (1) across diseases (AD to MS) and (2) across MRI sequences (MPRAGE to FLAIR) exhibiting different magnetic field strengths (1.5 and 3 Tesla). We demonstrated that not only does the balanced accuracy increase drastically (by about 16 percentage points), but also that LRP leads to much more focused heatmaps concentrating on (posterior) periventricular lesion areas. Given that our pre-trained model performed similarly to a classical machine learning analysis using FLAIR lesion load as a classical biomarker in MS, we believe that larger data sets might allow for outperforming models based on lesion masks in the future. Additionally, we are convinced that our approach – given a reasonable data basis – might also be very useful in answering more complex questions such as predicting disease progression.
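Since the comparison above is stated in balanced accuracy, a short worked example of that metric may help. Balanced accuracy is the mean of sensitivity and specificity, so it does not reward a classifier for favoring the larger class. The labels below are invented for illustration and are unrelated to the study's data.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = patient)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    sensitivity = np.mean(y_pred[y_true])      # true positive rate
    specificity = np.mean(~y_pred[~y_true])    # true negative rate
    return float((sensitivity + specificity) / 2)

# Invented example: four patients and two controls.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

bal_acc = balanced_accuracy(y_true, y_pred)                    # (0.75 + 0.5) / 2
plain_acc = float(np.mean(np.array(y_true) == np.array(y_pred)))  # 4 of 6 correct
```

Here the plain accuracy (about 0.667) exceeds the balanced accuracy (0.625) because the better-classified class is also the larger one.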

Limitations

The main limitation of this study is the limited sample size. Although a sample size of n = 147 is comparable with other deep learning studies in the neuroimaging field (Vieira et al., 2017), it is generally considered to be too low to learn robust representations from the data and to generalize to other data sets. To partly alleviate this problem, we pre-trained our network on ADNI data (n = 921) and fine-tuned it on the MS data. By visualizing the average heatmaps for MS patients, we show in addition to a balanced accuracy of 87.04% that the CNN captures MS-relevant information by focusing on posterior ventricular regions usually characterized by a high rate of MS lesion incidences. Nevertheless, future studies should verify our results in larger data sets, preferably coming from different sites. Another limitation, related to the first one, is that we were limited in the choice of architecture used for the CNN analysis. Very deep networks with a high capacity easily overfit on data sets with fewer than hundreds or thousands of samples per class. Furthermore, since we use volumetric data, the additional dimension as compared to 2D images causes each layer to consume substantially more GPU memory, which makes it a strongly limiting factor in architecture design. However, we found a relatively simple CNN architecture to be successful together with several regularization methods (dropout, L2-regularization and early stopping). Moreover, by registering the MRI data only linearly to MNI space, the regions contained in both atlases only roughly correspond to individual anatomical locations. On the other hand, non-linear registration can lead to strong deformations, in particular in patients, and we show here that our CNN model can also operate on a more native level (in accordance with (Suk et al., 2014)). To be able to make more specific anatomical claims in individual subjects, future studies might use individual atlases.
And finally, heatmaps allow us neither to determine the underlying pathological mechanism (e.g. atrophy, demyelination or axonal loss) that leads to a voxel being assigned relevance nor to assess interactions between voxels. For this, one would have to take a deeper look into the specific filters that have been learned throughout the training process in combination with MR sequences more sensitive for certain tissue damage (e.g. diffusion weighted or myelin imaging). Nevertheless, we still believe that heatmaps can be very helpful in supplementing individual disease diagnoses by providing a simple and intuitive explanation.

Conclusion

In conclusion, we have shown that our framework helps in uncovering CNN decisions for diagnosing MS based on FLAIR data using LRP. In particular, we demonstrated that (1) CNN models pre-trained on AD data are capable of successfully separating MS patients and controls on a typically sized neuroimaging cohort and (2) LRP is not only very valuable in explaining individual network decisions, but also in generally helping to assess whether CNN models have learned significant features. Notably, our CNN models focus on hyperintense lesions as primary information source, but also incorporate information from lesion location and normal-appearing brain areas. We see a high potential in the combination of CNNs, transfer learning and LRP heatmaps and are convinced that our framework might not only be helpful in other disease decoding studies, but also for answering more complex questions such as predicting disease progression or treatment response in individual subjects.

Funding

We acknowledge support from the German Research Foundation (DFG, 389563835), the Manfred and Ursula-Müller Stiftung and Charité – Universitätsmedizin Berlin (Rahel-Hirsch scholarship and Open Access Publication Fund).

62 in total

1. Can we overcome the 'clinico-radiological paradox' in multiple sclerosis?

Authors: Kerstin Hackmack; Martin Weygandt; Jens Wuerfel; Caspar F Pfueller; Judith Bellmann-Strobl; Friedemann Paul; John-Dylan Haynes
Journal: J Neurol Date: 2012-03-24 Impact factor: 4.849

Review 2. The corpus callosum in the diagnosis of multiple sclerosis and other CNS demyelinating and inflammatory diseases.

Authors: Nidhi Garg; Stephen W Reddel; David H Miller; Jeremy Chataway; D Sean Riminton; Yael Barnett; Lynette Masters; Michael H Barnett; Todd A Hardy
Journal: J Neurol Neurosurg Psychiatry Date: 2015-04-09 Impact factor: 10.154

Review 3. The fornix in health and disease: an imaging review.

Authors: Adam G Thomas; Panos Koumellis; Robert A Dineen
Journal: Radiographics Date: 2011 Jul-Aug Impact factor: 5.333

4. A reproducible evaluation of ANTs similarity metric performance in brain image registration.

Authors: Brian B Avants; Nicholas J Tustison; Gang Song; Philip A Cook; Arno Klein; James C Gee
Journal: Neuroimage Date: 2010-09-17 Impact factor: 6.556

Review 5. Building better biomarkers: brain models in translational neuroimaging.

Authors: Choong-Wan Woo; Luke J Chang; Martin A Lindquist; Tor D Wager
Journal: Nat Neurosci Date: 2017-02-23 Impact factor: 24.884

6. Multiple sclerosis: low-frequency temporal blood oxygen level-dependent fluctuations indicate reduced functional connectivity initial results.

Authors: Mark J Lowe; Micheal D Phillips; Joseph T Lurito; David Mattson; Mario Dzemidzic; Vincent P Mathews
Journal: Radiology Date: 2002-07 Impact factor: 11.105

7. Regional DTI differences in multiple sclerosis patients.

Authors: S D Roosendaal; J J G Geurts; H Vrenken; H E Hulst; K S Cover; J A Castelijns; P J W Pouwels; F Barkhof
Journal: Neuroimage Date: 2008-11-05 Impact factor: 6.556

8. Diagnostic criteria for multiple sclerosis: 2010 revisions to the McDonald criteria.

Authors: Chris H Polman; Stephen C Reingold; Brenda Banwell; Michel Clanet; Jeffrey A Cohen; Massimo Filippi; Kazuo Fujihara; Eva Havrdova; Michael Hutchinson; Ludwig Kappos; Fred D Lublin; Xavier Montalban; Paul O'Connor; Magnhild Sandberg-Wollheim; Alan J Thompson; Emmanuelle Waubant; Brian Weinshenker; Jerry S Wolinsky
Journal: Ann Neurol Date: 2011-02 Impact factor: 10.422

9. MRI pattern recognition in multiple sclerosis normal-appearing brain areas.

Authors: Martin Weygandt; Kerstin Hackmack; Caspar Pfüller; Judith Bellmann-Strobl; Friedemann Paul; Frauke Zipp; John-Dylan Haynes
Journal: PLoS One Date: 2011-06-17 Impact factor: 3.240

10. Deep learning of joint myelin and T1w MRI features in normal-appearing brain tissue to distinguish between multiple sclerosis patients and healthy controls.

Authors: Youngjin Yoo; Lisa Y W Tang; Tom Brosch; David K B Li; Shannon Kolind; Irene Vavasour; Alexander Rauscher; Alex L MacKay; Anthony Traboulsee; Roger C Tam
Journal: Neuroimage Clin Date: 2017-10-14 Impact factor: 4.881
