Ekin Yagis1, Selamawet Workalemahu Atnafu2, Alba García Seco de Herrera1, Chiara Marzi2, Riccardo Scheda2, Marco Giannelli3, Carlo Tessa4, Luca Citi1, Stefano Diciotti5.
Abstract
In recent years, 2D convolutional neural networks (CNNs) have been extensively used to diagnose neurological diseases from magnetic resonance imaging (MRI) data due to their potential to discern subtle and intricate patterns. Despite the high performance reported in numerous studies, developing CNN models with good generalization abilities is still a challenging task due to possible data leakage introduced during cross-validation (CV). In this study, we quantitatively assessed the effect of data leakage caused by splitting 3D MRI data at the 2D slice level, using three 2D CNN models to classify patients with Alzheimer's disease (AD) and Parkinson's disease (PD). Our experiments showed that slice-level CV erroneously boosted the average slice-level accuracy on the test set by 30% on Open Access Series of Imaging Studies (OASIS), 29% on Alzheimer's Disease Neuroimaging Initiative (ADNI), 48% on Parkinson's Progression Markers Initiative (PPMI) and 55% on a local de-novo PD Versilia dataset. Further tests on a randomly labeled OASIS-derived dataset yielded an (erroneous) accuracy of about 96% with the slice-level split and 50% with the subject-level split, the latter being what is expected from a randomized experiment. Overall, the effect of an erroneous slice-based CV is severe, especially for small datasets.
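The randomized-label result can be reproduced in miniature. The following is a minimal sketch and not the authors' pipeline: it assumes synthetic feature vectors in place of MRI slices and a 1-nearest-neighbour classifier in place of the CNNs. All names (`make_slices`, `one_nn_accuracy`, `cross_validate`) and parameters are hypothetical. Because slices of the same subject are nearly identical, a memorizing model scores highly under slice-level CV even though the labels carry no signal.

```python
import random

def make_slices(n_subjects=100, n_slices=4, dim=8, seed=0):
    """Simulate slices: each subject has a random 'anatomy' vector and its
    slices are small perturbations of it. Labels are random (no true signal)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_subjects)]
    rng.shuffle(labels)  # random, balanced subject labels
    data = []  # entries are (subject_id, feature_vector, label)
    for s in range(n_subjects):
        centre = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(n_slices):
            data.append((s, [c + rng.gauss(0, 0.05) for c in centre], labels[s]))
    return data

def one_nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier (a pure memorizer)."""
    correct = 0
    for _, x, y in test:
        nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[1], x)))
        correct += nearest[2] == y
    return correct / len(test)

def cross_validate(data, level, k=5, seed=1):
    """k-fold CV with folds formed over slices ('slice') or subjects ('subject')."""
    rng = random.Random(seed)
    if level == "slice":
        units = list(range(len(data)))            # one unit per slice
        unit_of = lambda i: i
    else:
        units = sorted({s for s, _, _ in data})   # one unit per subject
        unit_of = lambda i: data[i][0]
    rng.shuffle(units)
    fold_of = {u: j % k for j, u in enumerate(units)}
    accs = []
    for f in range(k):
        train = [d for i, d in enumerate(data) if fold_of[unit_of(i)] != f]
        test = [d for i, d in enumerate(data) if fold_of[unit_of(i)] == f]
        accs.append(one_nn_accuracy(train, test))
    return sum(accs) / k

data = make_slices()
slice_acc = cross_validate(data, "slice")
subject_acc = cross_validate(data, "subject")
print(f"slice-level CV accuracy:   {slice_acc:.2f}")
print(f"subject-level CV accuracy: {subject_acc:.2f}")
```

Under these assumptions the slice-level estimate should be close to 1, while the subject-level estimate should hover around chance, mirroring the qualitative pattern reported for the randomly labeled dataset.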
Year: 2021 PMID: 34799630 PMCID: PMC8604922 DOI: 10.1038/s41598-021-01681-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the previous studies performing classification of neurological disorders using MRI and with clear data leakage (see also Supplementary Table S1 for a detailed description).
| Disorder | References | Groups (number of subjects) | Machine learning model | Data split method | Type of data leakage | Accuracy (%) |
|---|---|---|---|---|---|---|
| AD/MCI | Gunawardena et al. | AD-MCI-HC (36) | 2D CNN | 4:1 train/test slice-level split | Wrong split | 96.00 |
| AD/MCI | Hon and Khan | AD-HC (200) | 2D CNN (VGG16) | 4:1 train/test slice-level split | Wrong split | 96.25 |
| AD/MCI | Jain et al. | AD-MCI-HC (150) | 2D CNN (VGG16) | 4:1 train/test slice-level split | Late and wrong split | 95.00 |
| AD/MCI | Khagi et al. | AD-HC (56) | 2D CNN (AlexNet, GoogLeNet, ResNet50, new CNN) | 6:2:2 train/validation/test slice-level split | Wrong split | 98.00 |
| AD/MCI | Sarraf et al. | AD-HC (43) | 2D CNN (LeNet-5) | 3:1:1 train/validation/test slice-level split | Wrong split | 96.85 |
| AD/MCI | Wang et al. | MCI-HC (629) | 2D CNN | Data augmentation + 10:3:3 train/validation/test split by MRI slices | Wrong split and augmentation before split | 90.60 |
| AD/MCI | Puranik et al. | AD/EMCI-HC (75) | 2D CNN | 17:3 train/test split by MRI slices | Wrong split | 98.40 |
| AD/MCI | Basheera et al. | AD-MCI-HC (1820) | 2D CNN | 4:1 train/test split by MRI slices | Wrong split | 90.47 |
| AD/MCI | Nawaz et al. | AD-MCI-HC (1726) | 2D CNN | 6:2:2 slice-level split | Wrong split | 99.89 |
AD Alzheimer’s disease, HC healthy controls, MCI mild cognitive impairment.
Summary of the previous studies performing classification of neurological disorders using MRI and suspected to have potential data leakage (see also Supplementary Table S2 for a detailed description).
| Disorder | References | Groups (number of subjects) | Machine learning model | Data split method | Type of data leakage | Accuracy (%) |
|---|---|---|---|---|---|---|
| AD/MCI | Farooq et al. | AD-MCI-LMCI-HC (355) | 2D CNN (GoogLeNet and modified ResNet) | 3:1 train/test (potential) slice-level split | Wrong split | 98.80 |
| AD/MCI | Ramzan et al. | HC-SMC-EMCI-MCI-LMCI-AD (138) | 2D CNN (ResNet-18) | 7:2:1 train/validation/test (potential) slice-level split | Wrong split | 100 |
| AD/MCI | Raza et al. | AD-HC (432) | 2D CNN (AlexNet) | 4:1 train/test (potential) slice-level split | Wrong split | 98.74 |
| AD/MCI | Pathak et al. | AD-HC (266) | 2D CNN | 3:1 (potential) slice-level split | Wrong split | 91.75 |
| ASD | Libero et al. | ASD-TD (37) | Decision tree | Unclear | Entire data set used for feature selection | 91.90 |
| ASD | Zhou et al. | ASD-TD/HC (280) | Random tree classifier | 4:1 train/test split | Entire data set used for feature selection | 100 |
| PD | Sivaranjini et al. | PD-HC (182) | 2D CNN | 4:1 train/test split by MRI slices | Wrong split | 88.90 |
| TBI | Lui et al. | TBI-HC (47) | Multilayer perceptron | Tenfold CV | Entire data set used for feature selection | 86.00 |
| Brain tumor | Hasan et al. | Tumor-HC (600) | MGLCM + 2D CNN + SVM | Tenfold CV | Wrong split and entire data set used for feature selection | 99.30 |
AD Alzheimer’s disease, ASD Autism spectrum disorder, EMCI early mild cognitive impairment, HC healthy controls, LMCI late mild cognitive impairment, MCI Mild cognitive impairment, MGLCM modified gray level co-occurrence matrix, PD Parkinson’s disease, SMC subjective memory concerns, TBI traumatic brain injury, TD typically developing.
Summary of the previous studies performing classification of neurological disorders using MRI and that provide insufficient information to assess data leakage (see also Supplementary Table S3 for a detailed description).
| Disorder | References | Groups (number of subjects) | Machine learning model | Data split method | Accuracy (%) |
|---|---|---|---|---|---|
| AD/MCI | Al-Khuzaie et al. | AD-HC (240) | 2D CNN | (Potential) slice-level split | 99.30 |
| AD/MCI | Wu et al. | AD-HC (457) | 2D CNN | Data augmentation + 2:1 train/test split by MRI slices | 97.58 |
AD Alzheimer’s disease, HC healthy controls, MCI mild cognitive impairment.
Mean slice-level accuracy on the training and test sets of the outer loop of the fivefold nested CV, reported for three 2D CNN models (see “Materials and methods” section), all datasets, and both data split methods (slice-level and subject-level).
| Dataset | Network architecture | Training set accuracy, subject-level split (%) | Training set accuracy, slice-level split (%) | Test set accuracy, subject-level split (%) | Test set accuracy, slice-level split (%) | Test set accuracy difference (percentage points) |
|---|---|---|---|---|---|---|
| OASIS-200 | VGG16-v1 | 95.93 | 99.85 | 66.0 | 94.18 | 28.18 |
| OASIS-200 | VGG16-v2 | 95.13 | 100 | 66.13 | 96.99 | 30.86 |
| OASIS-200 | ResNet-18 | 100 | 100 | 68.87 | 98.96 | 30.1 |
| OASIS-34 | VGG16-v1 | 88.94 | 100 | 54.35 | 99.19 | 44.84 |
| OASIS-34 | VGG16-v2 | 96.94 | 100 | 54.34 | 99.33 | 44.99 |
| OASIS-34 | ResNet-18 | 100 | 100 | 57.49 | 98.96 | 41.47 |
| OASIS-random | VGG16-v1 | 63.38 | 100 | 53.37 | 95.93 | 42.56 |
| OASIS-random | VGG16-v2 | 69.17 | 100 | 49.25 | 94.81 | 45.56 |
| OASIS-random | ResNet-18 | 84.49 | 99.09 | 50.8 | 93.74 | 42.94 |
| ADNI | VGG16-v1 | 91.09 | 100 | 70.12 | 95.31 | 25.19 |
| ADNI | VGG16-v2 | 80.49 | 100 | 66.49 | 95.24 | 28.75 |
| ADNI | ResNet-18 | 100 | 100 | 68.68 | 96.87 | 30.19 |
| PPMI | VGG16-v1 | 76.8 | 100 | 48.24 | 93.99 | 45.75 |
| PPMI | VGG16-v2 | 73.19 | 100 | 46.93 | 94.37 | 47.44 |
| PPMI | ResNet-18 | 100 | 100 | 48.06 | 96.12 | 44.06 |
| Versilia | VGG16-v1 | 99.72 | 100 | 53.86 | 95.97 | 42.11 |
| Versilia | VGG16-v2 | 76.89 | 100 | 42.97 | 97.8 | 54.83 |
| Versilia | ResNet-18 | 99.90 | 95.13 | 51.36 | 92.63 | 41.27 |
The difference between accuracy using slice-level and subject-level split in the test set has also been reported.
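A subject-level nested CV such as the one summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`subject_level_folds`, `nested_cv_splits`) are hypothetical, and the inner loop is assumed to be fivefold as well, which the text here does not specify. The key property is that folds are drawn over subjects, so all slices of a subject fall in exactly one of the training, validation, or test sets.

```python
import random

def subject_level_folds(subject_ids, k, seed):
    """Partition the unique subjects into k folds (lists of subject IDs)."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [subjects[i::k] for i in range(k)]

def nested_cv_splits(subject_ids, outer_k=5, inner_k=5, seed=0):
    """Yield (train_idx, val_idx, test_idx) lists of slice indices.
    The outer loop holds out test subjects; the inner loop (inner_k is an
    assumption) holds out validation subjects from the remainder."""
    outer = subject_level_folds(subject_ids, outer_k, seed)
    for o, test_subjects in enumerate(outer):
        test_set = set(test_subjects)
        remaining = [s for s in subject_ids if s not in test_set]
        for val_subjects in subject_level_folds(remaining, inner_k, seed + o + 1):
            val_set = set(val_subjects)
            train, val, test = [], [], []
            for i, s in enumerate(subject_ids):
                (test if s in test_set else val if s in val_set else train).append(i)
            yield train, val, test

# Toy data: 20 subjects with 8 slices each; subject_ids[i] is the subject of slice i.
subject_ids = [s for s in range(20) for _ in range(8)]
splits = list(nested_cv_splits(subject_ids))
print(len(splits), "train/validation/test splits")
```

Hyperparameters would be chosen on the validation subjects of the inner loop, and the accuracy reported in the table corresponds to the held-out test subjects of the outer loop.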
Demographic features of subjects belonging to OASIS-200, ADNI, PPMI, and Versilia datasets.
| Dataset | Characteristic | Patients | Healthy controls |
|---|---|---|---|
| OASIS-200 | Number of subjects | 100 | 100 |
| OASIS-200 | Age (range, years) | 62–96 | 59–94 |
| OASIS-200 | Age (mean ± SD, years) | 76.70 ± 7.10 | 75.50 ± 9.10 |
| OASIS-200 | Gender (women/men) | 59/41 | 73/27 |
| ADNI | Number of subjects | 100 | 100 |
| ADNI | Age (range, years) | 56–89 | 58–95 |
| ADNI | Age (mean ± SD, years) | 74.28 ± 7.96 | 75.04 ± 7.11 |
| ADNI | Gender (women/men) | 44/56 | 52/48 |
| PPMI | Number of subjects | 100 | 100 |
| PPMI | Age (range, years) | 34–82 | 31–83 |
| PPMI | Age (mean ± SD, years) | 61.71 ± 9.99 | 61.91 ± 11.52 |
| PPMI | Gender (women/men) | 40/60 | 36/64 |
| Versilia | Number of subjects | 17 | 17 |
| Versilia | Age (range, years) | 48–78 | 54–77 |
| Versilia | Age (mean ± SD, years) | 64 ± 7.21 | 64.00 ± 7.00 |
| Versilia | Gender (women/men) | 4/13 | 5/12 |
The same information for the OASIS-34 datasets has been reported in Supplementary Table S5.
AD Alzheimer’s disease, ADNI Alzheimer’s Disease Neuroimaging Initiative, OASIS open access series of imaging studies, PD Parkinson’s disease, PPMI Parkinson’s Progression Markers Initiative, SD standard deviation.
Figure 1. Schematic diagram of the overall T1-weighted MRI data processing and validation scheme. First, a preprocessing stage included co-registration to a standard space, skull-stripping, and slice selection based on entropy calculation. Then, CNN model training and validation were performed on each dataset in a nested CV loop using two different data split strategies: (a) subject-level split, in which all the slices of a subject were placed either in the training or in the test set, avoiding any form of data leakage; (b) slice-level split, in which all the slices were pooled together before CV and then split randomly into training and test sets.
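The entropy-based slice selection mentioned in the preprocessing stage can be illustrated with a small sketch. It assumes that entropy means the Shannon entropy of each slice's intensity histogram and that the highest-entropy slices are kept; the function names, the integer toy intensities, and the `n_keep` parameter are hypothetical, and the number of slices actually retained is not stated here.

```python
import math
from collections import Counter

def slice_entropy(slice_2d):
    """Shannon entropy (bits) of the intensity histogram of one 2D slice."""
    counts = Counter(v for row in slice_2d for v in row)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def select_slices(volume, n_keep):
    """Return indices of the n_keep highest-entropy slices, in original order."""
    ranked = sorted(range(len(volume)),
                    key=lambda i: slice_entropy(volume[i]), reverse=True)
    return sorted(ranked[:n_keep])

# Toy volume of three 2x2 slices with integer intensities.
volume = [
    [[0, 0], [0, 0]],   # blank slice: entropy 0
    [[0, 1], [2, 3]],   # four distinct values: entropy 2 bits
    [[0, 0], [1, 1]],   # two equally frequent values: entropy 1 bit
]
print(select_slices(volume, n_keep=2))  # prints [1, 2]
```

Ranking by entropy discards near-empty slices at the edges of the volume, which carry little anatomical information.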
Figure 2. The two networks based on the VGG16 architecture. Each colored block of layers illustrates a series of convolutions. (a) The first model, VGG16-v1, consists of five convolutional blocks followed by three fully connected layers; only the last three fully connected layers are fine-tuned. (b) The second model, VGG16-v2, has five convolutional blocks followed by a global average pooling layer, and all the layers are fine-tuned.
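To give a sense of how the two fine-tuning strategies differ in the number of trainable parameters, the arithmetic below counts weights and biases. It is a rough sketch under assumptions that are not stated in the caption: the standard VGG16 convolutional configuration with 3-channel 224×224 input, 4096-unit hidden fully connected layers for VGG16-v1, a single two-class dense layer after global average pooling for VGG16-v2, and a two-class output in both cases.

```python
# Standard VGG16 convolutional configuration (assumed): output channels per
# 3x3 convolution, with "M" marking a 2x2 max-pooling layer between blocks.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def conv_params(cfg, in_channels=3, kernel=3):
    """Weights plus biases summed over all convolutional layers."""
    total = 0
    for out_channels in cfg:
        if out_channels == "M":
            continue  # pooling layers have no parameters
        total += in_channels * out_channels * kernel * kernel + out_channels
        in_channels = out_channels
    return total

def dense_params(sizes):
    """Weights plus biases of a chain of fully connected layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

conv = conv_params(VGG16_CFG)
# VGG16-v1 (assumed sizes): flatten 7x7x512 -> 4096 -> 4096 -> 2, convolutions frozen.
v1_trainable = dense_params([7 * 7 * 512, 4096, 4096, 2])
# VGG16-v2 (assumed head): global average pooling -> 512 -> 2, all layers trainable.
v2_trainable = conv + dense_params([512, 2])
print(f"convolutional parameters: {conv:,}")
print(f"VGG16-v1 trainable:       {v1_trainable:,}")
print(f"VGG16-v2 trainable:       {v2_trainable:,}")
```

Under these assumptions, global average pooling removes the large flatten-to-dense weight matrix, so VGG16-v2 updates far fewer parameters than VGG16-v1 even though all of its layers are fine-tuned.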
Figure 3. A modified ResNet-18 architecture with an average pooling layer at the end. The upper box represents a residual learning block with an identity shortcut. Each layer is denoted as (filter size, # channels); layers labeled “freezed” have weights that are not updated during backpropagation, whereas layers labeled “fine-tuned” have weights that are updated. The identity shortcuts can be used directly when the input and output have the same dimensions (solid-line shortcuts) and when the dimensions increase (dotted-line shortcuts). ReLU rectified linear unit.
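The residual learning block with an identity shortcut can be written compactly as y = ReLU(F(x) + x). The sketch below is illustrative only: it replaces the convolutions with small dense matrices, omits normalization layers, and assumes that the dimension-increasing (dotted-line) shortcut zero-pads the input, which is one common option and is not confirmed by the caption. All function names are hypothetical.

```python
def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def shortcut(x, out_dim):
    """Identity shortcut; zero-pads when the block increases the dimension."""
    return list(x) + [0.0] * (out_dim - len(x))

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + shortcut(x)), with F(x) = W2 * ReLU(W1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + s for f, s in zip(fx, shortcut(x, len(fx)))])

x = [1.0, 2.0]
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
zeros_3x2 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
same_dim = residual_block(x, zeros_2x2, zeros_2x2)   # solid-line case
wider = residual_block(x, zeros_2x2, zeros_3x2)      # dotted-line case
print(same_dim, wider)
```

With all residual weights set to zero the block reduces to the shortcut itself, which shows why a residual block can fall back to the identity mapping when the learned branch contributes nothing.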