Oscar Esteban, Daniel Birman, Marie Schaer, Oluwasanmi O. Koyejo, Russell A. Poldrack, Krzysztof J. Gorgolewski.
Abstract
Quality control of MRI is essential for excluding problematic acquisitions and avoiding bias in subsequent image processing and analysis. Visual inspection is subjective and impractical for large-scale datasets. Although automated quality assessments have been demonstrated on single-site datasets, it is unclear whether such solutions can generalize to unseen data acquired at new sites. Here, we introduce the MRI Quality Control tool (MRIQC), a tool for extracting quality measures and fitting a binary (accept/exclude) classifier. Our tool can be run both locally and as a free online service via the OpenNeuro.org portal. The classifier is trained on a publicly available, multi-site dataset (17 sites, N = 1102). We perform model selection, evaluating different normalization and feature-exclusion approaches aimed at maximizing across-site generalization, and estimate an accuracy of 76%±13% on new sites using leave-one-site-out cross-validation. We confirm this result on a held-out dataset (2 sites, N = 265), also obtaining 76% accuracy. Even though the performance of the trained classifier is statistically above chance, we show that it is susceptible to site effects and unable to account for artifacts specific to new sites. MRIQC performs with high accuracy in intra-site prediction, but performance on unseen sites leaves room for improvement, which might require more labeled data and new approaches to the between-site variability. Overcoming these limitations is crucial for a more objective quality assessment of neuroimaging data, and to enable the analysis of extremely large and multi-site samples.
Year: 2017 PMID: 28945803 PMCID: PMC5612458 DOI: 10.1371/journal.pone.0184661
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1Visual assessment of MR scans.
Two images with prominent artifacts from the Autism Brain Imaging Data Exchange (ABIDE) dataset are presented on the left. An example scan (top) is shown with severe motion artifacts. The arrows point to signal spillover along the phase-encoding axis (right-to-left, RL) due to eye movements (green) and vessel pulsations (red). A second example scan (bottom) shows severe coil artifacts. On the right, the panel displays one representative image frame extracted from the animations corresponding to the subjects presented on the left, as they are inspected by the raters during the animation. This figure caption is extended in Block 1 of S1 File.
Fig 2Inter-rater variability.
The heatmap shows the overlap of the quality labels assigned by two different domain experts on 100 data points of the ABIDE dataset, using the protocol described in section Labeling protocol. We also compute Cohen's kappa for the two ratings, obtaining κ = 0.39. Using the interpretation table of Viera et al. [16], the agreement between raters is "fair" to "moderate". When the labels are binarized by mapping "doubtful" and "accept" to a single "good" label, the agreement increases to κ = 0.51 ("moderate"). This "fair" to "moderate" agreement between observers demonstrates substantial inter-rater variability. The inter- and intra-rater variabilities enter the problem as class noise, since a fair number of data points are assigned noisy labels that are inconsistent with the labels assigned to the rest of the dataset. An extended investigation of the inter- and intra-rater variabilities is presented in Block 5 of S1 File.
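The Cohen's kappa reported above can be computed directly from the raters' agreement table. The sketch below (plain Python, with toy counts rather than the study's actual ratings) implements the standard formula κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance:

```python
# Cohen's kappa from a square agreement table; toy example, not the
# paper's actual rating data.
def cohens_kappa(table):
    """table[i][j] = number of items rater A put in class i and rater B in class j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: fraction of items on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the two raters' marginal proportions.
    p_e = sum(
        (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
        for i in range(k)
    )
    return (p_o - p_e) / (1 - p_e)

# Toy 2x2 table: 35/50 agreements -> p_o = 0.7, p_e = 0.5, kappa = 0.4.
kappa = cohens_kappa([[20, 5], [10, 15]])
```

A κ around 0.4, as in this toy example, falls in the "fair" to "moderate" band of the interpretation table cited in the caption.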
Fig 3Inter-site variability renders as a batch effect on the calculated IQMs.
These plots display the features extracted by MRIQC (columns) for all participants (rows), clustered by site (17 centers from the ABIDE dataset, plus the two centers where DS030 was acquired, "BMC" and "CCN"). The plot of the original features (left panel) shows how easily participants can be clustered by the site they belong to. After site-wise normalization, centering and scaling within each site (right panel), the measures are more homogeneous across sites. Features are represented in arbitrary units. For easier comparison, the feature axis (x) is mirrored between plots.
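The site-wise normalization shown in the right panel, centering and scaling each measure within its site, can be sketched as a per-site z-score. This is an illustrative re-implementation, not MRIQC's actual code:

```python
# Per-site z-scoring: removes the per-site offset and scale that make
# sites trivially separable in the left panel of Fig 3.
import statistics as st

def normalize_within_site(values, sites):
    """z-score `values` separately inside each site's subgroup."""
    out = [0.0] * len(values)
    for site in set(sites):
        idx = [i for i, s in enumerate(sites) if s == site]
        sub = [values[i] for i in idx]
        mu, sd = st.mean(sub), st.pstdev(sub)
        for i in idx:
            # Guard against a constant feature within a site (sd == 0).
            out[i] = (values[i] - mu) / sd if sd > 0 else 0.0
    return out
```

After normalization, each site's subgroup has zero mean and unit standard deviation, so a classifier can no longer separate sites on offsets alone.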
Summary table of the training and test datasets.
The ABIDE dataset is publicly available and contains images acquired at 17 sites with a diverse set of acquisition settings and parameters. This heterogeneity makes it a good candidate for training machine-learning models that generalize well to novel samples from new sites. We selected DS030 [20] from OpenfMRI as the held-out dataset to evaluate performance on data unrelated to the training set. A comprehensive table showing the heterogeneity of parameters within the ABIDE dataset and DS030 is provided in Block 2 of S1 File.
| Dataset | Site ID | Scanner vendor & model; TR/TE/TI (s) c; flip angle (°); PE direction | Size (voxels) d | Resolution (mm) d |
|---|---|---|---|---|
| ABIDE a | CALTECH | Siemens Magnetom TrioTim, 1.59/2.73⋅10⁻³/0.8, 10, AP | 176±80×256±32×256±32 | 1.00×1.00±0.03×1.00±0.03 |
| | CMU | Siemens Magnetom Verio, 1.87/2.48⋅10⁻³/1.1, 8, AP | 176±15×256±62×256±62 | 1.00×1.00×1.00 |
| | KKI | Philips Achieva 3T, 8⋅10⁻³/3.70⋅10⁻³/0.8, 8, | 256×200±30×256±30 | 1.00×1.00×1.00 |
| | LEUVEN | Philips Intera 3T, 9.60⋅10⁻³/4.60⋅10⁻³/0.9, 8, RL | 256×182×256 | 0.98×1.20×0.98 |
| | MAX_MUN | Siemens Magnetom Verio, 1.8/3.06⋅10⁻³/0.9, 9, AP | 160±16×240±16×256±16 | 1.00×1.00±0.02×1.00±0.02 |
| | NYU | Siemens Magnetom Allegra, 2.53/3.25⋅10⁻³/1.1, 7, AP | 128×256×256 | 1.33×1.00×1.00 |
| | OHSU | Siemens Magnetom TrioTim, 2.3/3.58⋅10⁻³/0.9, 10, AP | 160×239±1×200±1 | 1.10×1.00×1.00 |
| | OLIN | Siemens Magnetom Allegra, 2.5/2.74⋅10⁻³/0.9, 8, RL | 208±32×256×176 | 1.00×1.00×1.00 |
| | PITT | Siemens Magnetom Allegra, 2.1/3.93⋅10⁻³/1.0, 7, AP | 176×256×256 | 1.05×1.05×1.05 |
| | SBL | Philips Intera 3T, 9⋅10⁻³/3.5⋅10⁻³/ | 256×256×170 | 1.00×1.00×1.00 |
| | SDSU | General Electric Discovery MR750 3T, 11.1⋅10⁻³/4.30⋅10⁻³/0.6, 8, | 172×256×256 | 1.00×1.00×1.00 |
| | STANFORD | General Electric Signa 3T, 8.4⋅10⁻³/1.80⋅10⁻³/ | 256×132×256 | 0.86×1.50×0.86 |
| | TRINITY | Philips Achieva 3T, 8.5⋅10⁻³/3.90⋅10⁻³/1.0, 8, AP | 160×256±32×256±32 | 1.00×1.00±0.07×1.00±0.07 |
| | UCLA | Siemens Magnetom TrioTim, 2.3/2.84⋅10⁻³/0.85, 9, AP | 160±16×240±26×256±26 | 1.20±0.20×1.00±0.04×1.00±0.04 |
| | UM | General Electric Signa 3T, | 256±154×256×124 | 1.02±0.38×1.02±0.16×1.20±0.16 |
| | USM | Siemens Magnetom Allegra, 2.1/3.93⋅10⁻³/1.0, 7, AP | 160±96×480±224×512±224 | 1.20±0.20×0.50±0.50×0.50±0.50 |
| | YALE | Siemens Magnetom TrioTim, 1.23/1.73⋅10⁻³/0.6, 9, AP | 160±96×256×256 | 1.00×1.00×1.00 |
| DS030 b | BMC | Siemens Magnetom TrioTim, 2.53/3.31⋅10⁻³/1.1, 7, RL | 176×256×256 | 1.00×1.00×1.00 |
a http://fcon_1000.projects.nitrc.org/indi/abide/.
b https://openfmri.org/dataset/ds000030/.
c Please note that each vendor reported a different definition for TR and TE, thus their values are not directly comparable.
d Sizes and resolutions are reported as follows: median value along each dimension ± the most extreme value from the median (either above or below).
Summary table of IQMs.
The 14 IQMs yield a vector of 64 features per anatomical image, on which the classifier is trained and tested.
| CJV | The coefficient of joint variation of GM and WM, proposed as an objective function by Ganzetti et al. Lower values are better. |
| CNR | The contrast-to-noise ratio, reflecting how well GM and WM are distinguished; higher values are better. |
| SNR | MRIQC includes the signal-to-noise ratio calculation proposed by Dietrich et al., which uses the air background as a noise reference. |
| QI2 | The second quality index of Mortamet et al. [12], computed on the air background of the image. |
| EFC | The entropy-focus criterion, based on the Shannon entropy of voxel intensities; lower values indicate less ghosting and blurring induced by head motion. |
| FBER | The foreground-background energy ratio: the mean energy of image values within the head relative to that outside the head; higher values are better. |
| INU | MRIQC measures the location and spread of the bias field estimated by the INU (intensity non-uniformity) correction. Spreads located closely around 1.0 are better. |
| QI1 | The first quality index of Mortamet et al. [12]: the proportion of "corrupted" voxels found in the air background. |
| WM2MAX | The white-matter-to-maximum intensity ratio: the median intensity within the WM mask over the 95th percentile of the full intensity distribution. It captures the long tails caused by hyper-intensities of the carotid vessels and fat. Values should fall within the interval [0.6, 0.8]. |
| FWHM | The full-width at half-maximum of the spatial distribution of image intensity values, an estimation of the blurriness of the image. |
| ICVs | Estimation of the intracranial volume (ICV) fraction occupied by each tissue, calculated on the FSL segmentation. |
| rPVE | The residual partial volume effect feature: a tissue-wise sum of the partial volume fractions that fall in the range [5%, 95%] of each voxel's total volume, computed on the partial volume maps generated by FSL. |
| SSTATs | Several summary statistics (mean, standard deviation, percentiles 5% and 95%, and kurtosis) computed within the following regions of interest: background, CSF, WM, and GM. |
| TPMs | Overlap of the tissue probability maps estimated from the image with the corresponding maps of the ICBM 2009c nonlinear-asymmetric template. |
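For illustration, a few of the IQMs above can be sketched with their commonly cited formulas (CJV after Ganzetti et al., SNR after Dietrich et al., and WM2MAX as described in the table). These are simplified reconstructions on synthetic intensity samples and may differ in detail from MRIQC's implementation:

```python
# Illustrative formulas for three IQMs; mu = mean, sigma = standard
# deviation of intensity samples drawn from the named regions.
import math
import statistics as st

def cjv(gm, wm):
    """Coefficient of joint variation of GM and WM; lower is better."""
    return (st.stdev(wm) + st.stdev(gm)) / abs(st.mean(wm) - st.mean(gm))

def snr_dietrich(foreground, air):
    """SNR referenced to the air background's noise (Dietrich et al.)."""
    # The sqrt(2/(4 - pi)) factor corrects the Rayleigh-distributed
    # background standard deviation.
    return st.mean(foreground) / (st.stdev(air) * math.sqrt(2.0 / (4.0 - math.pi)))

def wm2max(wm, all_voxels):
    """Median WM intensity over the 95th percentile of all intensities."""
    ordered = sorted(all_voxels)
    p95 = ordered[int(round(0.95 * (len(ordered) - 1)))]
    return st.median(wm) / p95
```

For instance, a WM2MAX near 0.7 (inside the nominal [0.6, 0.8] interval) indicates no heavy hyper-intense tail above the white matter.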
Fig 4MRIQC’s processing data flow.
Images undergo a minimal processing pipeline to obtain the necessary corrected images and masks required for the computation of the IQMs.
Fig 5Visual reports.
MRIQC generates one individual report per subject in the input folder and one group report including all subjects. To visually assess MRI samples, the first step (1) is opening the group report. This report shows box plots and strip plots for each of the IQMs. Looking at the distributions, it is possible to spot potentially low-quality images, since they generally appear as outliers in one or more strip plots. For instance, in (2), hovering over a suspicious sample within the coefficient of joint variation (CJV) plot reveals the subject identifier ("sub-51296"). Clicking on that sample opens the individual report for that specific subject (3). This particular individual report is available online at https://web.stanford.edu/group/poldracklab/mriqc/reports/sub-51296_T1w.html.
Selecting the appropriate split strategy for cross-validation.
The cross-validated area under the curve (AUC) and accuracy (ACC) scores calculated on the ABIDE dataset (training set) are less biased when LoSo is used to create the outer folds, as compared to the evaluation scores obtained on DS030 (held-out set).
| Outer split | Inner split | ABIDE (train) AUC | ABIDE (train) ACC (%) | DS030 (held-out) AUC | DS030 (held-out) ACC (%) | Bias Δ AUC | Bias Δ ACC |
|---|---|---|---|---|---|---|---|
| 10-Fold | 5-Fold | .87±.04 | 83.75±3.6 | .68 | 75.7 | .19 | 7.0 |
| 10-Fold | LoSo | .86±.04 | 81.93±3.5 | .71 | 77.0 | .15 | 5.0 |
| LoSo | 5-Fold | .71±.15 | 75.96±16.8 | .68 | 76.2 | | |
| LoSo | LoSo | .71±.15 | 75.21±17.8 | .68 | 76.6 | | |
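The LoSo splitting strategy compared above can be sketched in a few lines: each fold holds out every sample from one site, so evaluation always happens on a site the model never saw during training. This is equivalent in spirit to scikit-learn's `LeaveOneGroupOut`, written here in plain Python for illustration:

```python
# Leave-one-site-out (LoSo) cross-validation splits: one fold per site.
def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices) per unique site."""
    for held_out in sorted(set(sites)):
        test = [i for i, s in enumerate(sites) if s == held_out]
        train = [i for i, s in enumerate(sites) if s != held_out]
        yield held_out, train, test

# Hypothetical site labels for five samples; yields three folds.
sites = ["NYU", "NYU", "KKI", "KKI", "PITT"]
folds = list(leave_one_site_out(sites))
```

Scores averaged over these folds estimate performance on data from a new site, which is why the LoSo rows above show a much smaller bias against the held-out DS030 scores.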
Fig 6Nested cross-validation for model selection.
The plots on the left represent the scores (AUC on top, ACC below) obtained in the outer loop of nested cross-validation, using the LoSo split. The plots show how certain sites are harder to predict than others. On the right, the corresponding violin plots that summarize the overall performance. In both plots, the dashed lines represent the averaged cross-validated performance for the three models: RFC (blue line, AUC = 0.73±0.15, ACC = 76.15%±13.38%), SVC-lin (light orange, AUC = 0.68±0.18, ACC = 67.54%±20.82%), and SVC-rbf (dark orange, AUC = 0.64±0.17, ACC = 69.05%±18.90%).
Evaluation on the held-out dataset.
The model cross-validated on the ABIDE dataset achieves AUC = 0.707 and ACC = 76% on DS030. The recall column shows the classifier's insensitivity to the true "exclude" cases. The "Predicted" columns summarize the corresponding confusion matrix.
| | | precision | recall | F1 score | accept (pred.) | exclude (pred.) | Support |
|---|---|---|---|---|---|---|---|
| True | accept | 0.77 | 0.95 | 0.85 | 180 | 10 | 190 |
| | exclude | 0.68 | 0.28 | 0.40 | 54 | 21 | 75 |
| | avg / total | 0.74 | 0.76 | 0.72 | 234 | 31 | 265 |
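The per-class scores in the table can be recomputed from its own confusion matrix as a sanity check; the snippet below uses only the counts shown above (true rows × predicted columns):

```python
# Precision, recall, and F1 from confusion-matrix counts.
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# "accept" class: 180 correct, 54 excludes wrongly accepted (FP),
# 10 accepts wrongly excluded (FN).
p_acc, r_acc, f_acc = scores(tp=180, fp=54, fn=10)
# "exclude" class: 21 correct, 10 accepts wrongly excluded (FP),
# 54 excludes wrongly accepted (FN).
p_exc, r_exc, f_exc = scores(tp=21, fp=10, fn=54)
# Overall accuracy: correct predictions over all 265 samples.
accuracy = (180 + 21) / 265
```

Rounded to two decimals, these reproduce the tabulated values (0.77/0.95/0.85 for "accept", 0.68/0.28/0.40 for "exclude", 0.76 accuracy), and the 0.28 recall makes the classifier's insensitivity to true "exclude" cases explicit.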
Fig 7Evaluation on the held-out dataset.
A. A total of 50 features are selected by the preprocessing steps. The features are ordered from the highest median importance (QI2 [12]) to the lowest (the 5th percentile of intensities within the GM mask). The boxplots represent the distribution of importances of a given feature across all trees in the ensemble. B. (Left) Four examples of false negatives from the DS030 dataset. The red boxes indicate a ghosting artifact present in more than 20% of the images. Only extreme cases, where the ghost overlaps the cortical GM layer of the occipital lobes, are shown. (Right) Two examples of false positives, both borderline cases that were rated "doubtful". Due to the intra- and inter-rater variabilities, some data points of poorer overall quality are rated just "doubtful". These images demonstrate the effect of noise in the quality labels.