Fabian Eitel, Emily Soehler, Judith Bellmann-Strobl, Alexander U Brandt, Klemens Ruprecht, René M Giess, Joseph Kuchling, Susanna Asseyer, Martin Weygandt, John-Dylan Haynes, Michael Scheel, Friedemann Paul, Kerstin Ritter.
Abstract
Machine learning-based imaging diagnostics have recently reached or even surpassed the level of clinical experts in several clinical domains. However, the classification decisions of a trained machine learning system are typically non-transparent, which is a major hindrance for clinical integration, error tracking and knowledge discovery. In this study, we present a transparent deep learning framework relying on 3D convolutional neural networks (CNNs) and layer-wise relevance propagation (LRP) for diagnosing multiple sclerosis (MS), the most widespread autoimmune neuroinflammatory disease. MS is commonly diagnosed using a combination of clinical presentation and conventional magnetic resonance imaging (MRI), specifically the occurrence and presentation of white matter lesions in T2-weighted images. We hypothesized that using LRP in a naive predictive model would enable us to uncover the relevant image features that a trained CNN uses for decision-making. Since imaging markers in MS are well established, this would in turn enable us to validate the respective CNN model. First, we pre-trained a CNN on MRI data from the Alzheimer's Disease Neuroimaging Initiative (n = 921) and then specialized the CNN to discriminate between MS patients (n = 76) and healthy controls (n = 71). Using LRP, we then produced a heatmap for each subject in the holdout set depicting the voxel-wise relevance for a particular classification decision. The resulting CNN model achieved a balanced accuracy of 87.04% and an area under the receiver operating characteristic curve of 96.08%. The subsequent LRP visualization revealed that the CNN model indeed focuses on individual lesions, but also incorporates additional information such as lesion location, non-lesional white matter and gray matter areas such as the thalamus, which are established conventional and advanced MRI markers in MS.
We conclude that LRP and the proposed framework have the capability to make diagnostic decisions of CNN models transparent, which could serve to justify classification decisions for clinical review, verify diagnosis-relevant features and potentially gather new disease knowledge.
Keywords: Convolutional neural networks; Deep learning; Multiple sclerosis; MRI; Layer-wise relevance propagation; Visualization; Transfer learning
Year: 2019 PMID: 31634822 PMCID: PMC6807560 DOI: 10.1016/j.nicl.2019.102003
Source DB: PubMed Journal: Neuroimage Clin ISSN: 2213-1582 Impact factor: 4.881
Fig. 1. Illustration of the transparent CNN framework. In the training phase, the CNN model learns a non-linear relationship between the MRI data and the binary diagnostic labels (MS yes/no). Optionally, the CNN models are pre-trained on a substitute data set, or lesions are filled in the MRI data. The learned CNN model is then tested on new subjects to predict the diagnostic label. By supplementing this label with an LRP heatmap, which indicates the relevance of each voxel for the respective label, this framework allows us to understand (at least to some extent) the classification decision in individual subjects. Additionally, the validity of the CNN models can be assessed by matching highlighted brain areas with domain knowledge.
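To make the idea of voxel-wise relevance more concrete, the following is a minimal sketch of how LRP redistributes the relevance of an output neuron onto its inputs for a single fully connected layer. It uses the epsilon rule, which is one common LRP variant; the text above does not state which rule the authors applied, so the rule, the function name `lrp_epsilon_dense` and all numbers are illustrative assumptions, not the authors' implementation. In a 3D CNN the same redistribution is applied layer by layer until the input voxels are reached.

```python
def lrp_epsilon_dense(activations, weights, bias, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of one dense layer.

    Hypothetical helper using the LRP-epsilon rule (an assumed variant).
    activations:   input activations a_i
    weights:       weights[i][j] connects input i to output j
    bias:          biases b_j
    relevance_out: relevance R_j assigned to each output neuron
    """
    n_in, n_out = len(activations), len(bias)
    # Forward pass: z_j = sum_i a_i * w_ij + b_j
    z = [sum(activations[i] * weights[i][j] for i in range(n_in)) + bias[j]
         for j in range(n_out)]
    # Stabilise the denominator while keeping its sign
    z_stab = [zj + (eps if zj >= 0 else -eps) for zj in z]
    # Backward pass: R_i = sum_j (a_i * w_ij / z_j) * R_j
    return [sum(activations[i] * weights[i][j] / z_stab[j] * relevance_out[j]
                for j in range(n_out))
            for i in range(n_in)]


# Toy example (made-up numbers): two inputs, one output neuron.
activations = [1.0, 2.0]
weights = [[0.5], [0.25]]
bias = [0.0]
relevance_in = lrp_epsilon_dense(activations, weights, bias, relevance_out=[1.0])
print(relevance_in, sum(relevance_in))
```

With a zero bias, the input relevances sum to (almost exactly) the output relevance, which is the conservation property that lets a heatmap be read as a decomposition of the network output.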
Demographics of MS patients and healthy controls. Disease duration is measured in months and lesion volume in ml. EDSS, expanded disability status scale; std., standard deviation.
| | MS patients | Healthy controls |
|---|---|---|
| Subjects [n] | 76 | 71 |
| Female/Male, in % | 55% / 45% | 65% / 35% |
| Age (in years), mean ± std | 43.32 (± 11.99) | 38.23 (± 13.10) |
| Disease duration, median, range | 139.14 (0–522.59) | n.a. |
| EDSS, median, range | 2.50 (0.00–6.50) | n.a. |
| Lesion volume, median, range | 5.10 (0.12–232.47) | 0.09 (0–14.98) |
Demographics of ADNI data set.
| | AD patients | Healthy controls |
|---|---|---|
| Subjects [n] | 231 | 158 |
| Female/Male, in % | 42% / 58% | 48% / 52% |
| Age (in years), mean ± std | 74.98 (± 7.40) | 75.93 (± 5.01) |
Performance (in %) for the different models on the holdout data set. Values are averages over 10 trials. Highest values per column are highlighted in bold. Pre-train., pre-training; Class., classifier; Bal. acc., balanced accuracy; Sens., sensitivity; Spec., specificity; AUC, area under the curve of the receiver operating characteristic; les. fill., lesions filled.
| Data | Pre-train. | Class. | Bal. acc. | Sens. | Spec. | AUC |
|---|---|---|---|---|---|---|
| FLAIR lesion load | – | SVM | | 76.92% | | 94.62% |
| FLAIR | – | SVM | 66.92% | 53.85% | 80.00% | 66.92% |
| FLAIR | no | CNN | 71.23% | 68.46% | 74.00% | 85.46% |
| FLAIR | yes | CNN | 87.04% | | 81.00% | **96.08%** |
| FLAIR - les. fill. | yes | CNN | 70.15% | 92.31% | 48.00% | 90.92% |
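For a binary classifier, balanced accuracy is the mean of sensitivity and specificity; the fully populated rows of the table are consistent with this (for example, (53.85% + 80.00%) / 2 ≈ 66.92%). The sketch below shows how these metrics can be computed from labels and predictions. The function name `binary_metrics` and the toy labels are assumptions for illustration and are not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and balanced accuracy for 0/1 labels.

    Label 1 stands for the positive class (here: MS), 0 for healthy controls.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)   # fraction of patients detected
    specificity = tn / (tn + fp)   # fraction of controls correctly rejected
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy example (made-up labels): 4 patients, 5 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Balanced accuracy is less sensitive to unequal group sizes than plain accuracy, which is why it is a natural summary when patient and control counts differ.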
Fig. 2. Individual LRP heatmaps (overlaid on the input FLAIR data) for the four MS patients with the highest classification score in terms of the sigmoid output. Heatmap values are normalized in the range [−0.03, 0.03]. Colors indicate regions supporting (red) or rejecting (blue) the classification as an MS patient with respect to the underlying CNN model.
Fig. 3. Average LRP heatmaps for all correctly classified MS patients (top) and all correctly classified healthy controls (bottom) in the holdout set. Values are normalized in the range [−0.02, 0.02]. Please note that the underlying brain map has been computed as the average of all training subjects and does not reflect the MRI data of individual subjects.
Fig. 4. Correlation between lesion sum and LRP relevance sum. The Pearson correlation coefficient is shown separately for the training and holdout sets; both correlations are significant (p < 0.001 and p < 0.001, permutation test). The size of each data point shows the lesion-relevance similarity according to Eq. (2).
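A permutation test for a Pearson correlation can be sketched as follows: the pairing between the two variables is repeatedly shuffled, and the p-value is the fraction of shuffles whose absolute correlation is at least as large as the observed one. The number of permutations, the two-sided formulation, the function names and the toy values are all assumptions; the caption does not specify how the authors configured their test.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # break the pairing between x and y
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Toy values (made up): lesion sum and relevance sum for eight subjects.
lesion_sum = [0.1, 0.5, 1.2, 2.0, 3.5, 5.1, 8.0, 12.3]
relevance_sum = [0.02, 0.05, 0.09, 0.12, 0.20, 0.26, 0.33, 0.51]
r = pearson_r(lesion_sum, relevance_sum)
p = permutation_p_value(lesion_sum, relevance_sum)
print(r, p)
```

The smallest p-value this estimator can report is 1 / (n_permutations + 1), so reporting p < 0.001 requires at least 1000 permutations.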
Fig. 5. LRP relevance distribution over (a) 30 (mainly) gray matter areas from the Neuromorphometrics atlas and (b) 22 white matter areas from the JHU ICBM-DTI atlas, separately for MS patients and healthy controls in the holdout set. The absolute values per region are rather small as LRP aims to conserve the sigmoid output and distributes it over all voxels.
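A region-wise relevance distribution of this kind can be obtained by summing the heatmap values of all voxels that carry the same atlas label. The sketch below assumes that the heatmap and the atlas are registered to the same space and flattened to equally long sequences; the function name, the label numbering and the region names other than the thalamus are hypothetical placeholders.

```python
def relevance_per_region(heatmap, atlas, region_names):
    """Sum voxel relevance within each named atlas region.

    heatmap:      flat sequence of voxel relevance values
    atlas:        flat sequence of integer region labels, same length
    region_names: mapping from label to region name; voxels whose label
                  is not in the mapping (e.g. background) are ignored
    """
    totals = {name: 0.0 for name in region_names.values()}
    for relevance, label in zip(heatmap, atlas):
        name = region_names.get(label)
        if name is not None:
            totals[name] += relevance
    return totals


# Toy example (made-up values): six voxels, three labelled regions.
heatmap = [0.01, -0.02, 0.03, 0.0, 0.005, 0.015]
atlas = [1, 1, 2, 0, 2, 3]
region_names = {1: "thalamus", 2: "region_b", 3: "region_c"}
totals = relevance_per_region(heatmap, atlas, region_names)
print(totals)
```

Because the relevance of the whole heatmap sums approximately to the network output, the per-region sums are necessarily small fractions of that output, which matches the remark in the caption.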
Fig. 6. Average heatmaps for different CNN models applied to the MS (VIMS) cohort, ranging from an untrained CNN model with random parameters, through a CNN trained only on either ADNI or MS data, to a CNN pre-trained on ADNI and fine-tuned on MS. As can be seen, the fine-tuned model led to the most concise regions of positive and negative relevance. Please note that the heatmaps here are averaged over all (not only the correctly classified) MS patients in the holdout set, and that the heatmap values are not normalized to a fixed range but shown with respect to the minimum value of the untrained model.
Fig. 7. Comparison of average relevance distribution over white matter areas for a CNN model trained on original FLAIR data (left) and lesion-filled FLAIR data (right; NABM, normal-appearing brain matter). We calculated the relevance sum of both models (averaged over subjects) and show the 10 areas with the highest score.