Literature DB >> 28286318

Spinal cord grey matter segmentation challenge.

Ferran Prados¹, John Ashburner², Claudia Blaiotta², Tom Brosch³, Julio Carballido-Gamio⁴, Manuel Jorge Cardoso⁵, Benjamin N Conrad⁶, Esha Datta⁴, Gergely Dávid⁷, Benjamin De Leener⁸, Sara M Dupont⁸, Patrick Freund⁷, Claudia A M Gandini Wheeler-Kingshott⁹, Francesco Grussu¹⁰, Roland Henry⁴, Bennett A Landman⁶, Emil Ljungberg¹¹, Bailey Lyttle¹², Sebastien Ourselin⁵, Nico Papinutto⁴, Salvatore Saporito¹³, Regina Schlaeger⁴, Seth A Smith¹⁴, Paul Summers¹⁵, Roger Tam¹⁶, Marios C Yiannakas¹⁰, Alyssa Zhu⁴, Julien Cohen-Adad¹⁷.

Abstract

An important image processing step in spinal cord magnetic resonance imaging is the ability to reliably and accurately segment grey and white matter for tissue specific analysis. There are several semi- or fully-automated segmentation methods for cervical cord cross-sectional area measurement with an excellent performance close or equal to the manual segmentation. However, grey matter segmentation is still challenging due to small cross-sectional size and shape, and active research is being conducted by several groups around the world in this field. Therefore a grey matter spinal cord segmentation challenge was organised to test different capabilities of various methods using the same multi-centre and multi-vendor dataset acquired with distinct 3D gradient-echo sequences. This challenge aimed to characterize the state-of-the-art in the field as well as identifying new opportunities for future improvements. Six different spinal cord grey matter segmentation methods developed independently by various research groups across the world and their performance were compared to manual segmentation outcomes, the present gold-standard. All algorithms provided good overall results for detecting the grey matter butterfly, albeit with variable performance in certain quality-of-segmentation metrics. The data have been made publicly available and the challenge web site remains open to new submissions. No modifications were introduced to any of the presented methods as a result of this challenge for the purposes of this publication.

Entities: Chemical

Keywords: Challenge; Evaluation metrics; Grey matter; MRI; Segmentation; Spinal cord

Mesh：

Year: 2017 PMID： 28286318 PMCID： PMC5440179 DOI： 10.1016/j.neuroimage.2017.03.010

Source DB: PubMed Journal: Neuroimage ISSN： 1053-8119 Impact factor: 6.556

Introduction

A large spectrum of (non)-traumatic neurological disorders have been linked with spinal cord grey matter (GM) and white matter (WM) tissue changes (Amukotuwa and Cook, 2015). The spinal cord is a challenging area for magnetic resonance imaging (MRI) (Wheeler-Kingshott et al., 2014, Stroman et al., 2014) due to the small cross-sectional area dimension of the spinal cord, the presence of motion, susceptibility artifacts and, in particular, the complex shape and small area fraction of GM tissue. Recently, Yiannakas et al. (2012) demonstrated the feasibility to distinguish between the WM and GM by performing manual segmentation of the cervical cord using a T1-weighted fast field echo (FFE) data acquired in a 3 T scanner with reasonable acquisition times and an in-plane resolution of 0.5×0.5 mm2. More recently, Schlaeger et al., 2014, Schlaeger et al., 2015 also demonstrated that spinal cord GM area was the strongest correlate of disability in multiple sclerosis using multivariate models that included brain GM and WM volumes, fluid-attenuated inversion recovery lesion load, T1 lesion load, spinal cord cross-sectional area (CSA), T2 lesion load, age, sex, and disease duration. Several semi- or fully-automated segmentation methods have been proposed in the last decade for cervical CSA estimation (Losseff et al., 1996, Hickman et al., 2004, Tench et al., 2005, Zivadinov et al., 2008, Horsfield et al., 2010, McIntosh et al., 2011, Bergo et al., 2012, Chen et al., 2013, de Leener et al., 2014, Asman and Bryan, 2014, Taso et al., 2015, El Mendili et al., 2015, de Leener et al., 2016). While most methods present good performance, interpretation and comparison of results between different methods is seldom possible due to the use of different imaging datasets (usually in-house data), different MRI sequences, different ways to obtain gold standard segmentations (number of raters and consensus mask) and the use of various performance scores (2D/slice-wise or 3D/volumetric). Recent cervical cord CSA segmentation methods have reached a performance close to manual segmentation (de Leener et al., 2014; Asman and Bryan, 2014, El Mendili et al., 2015, de Leener et al., 2016), but accurate GM segmentation remains a challenge. Moreover, there is a lack of publicly available datasets with GM/WM contrast and corresponding ground truth that facilitate a fair and reliable comparison across methods. A GM spinal cord segmentation challenge was organised in conjunction by four internationally recognised spinal cord imaging research groups (University College London, Polytechnique Montreal, University of Zurich and Vanderbilt University) to test the different performances of various methods, with the aim of characterizing the state-of-the-art in the field according to a pre-defined set of assessment criteria as well as identifying opportunities for future improvement. Several GM spinal cord segmentation methods developed independently by various research groups across the world were compared. These methods were used to segment the same multi-centre and multi-vendor dataset acquired with distinct 3D gradient-echo sequences, which are available to the community at http://cmictig.cs.ucl.ac.uk/niftyweb/challenge, and the obtained results were compared to the manual segmentation performed by 4 raters.

Material

Participating teams applied their automatic or semi-automatic segmentation algorithms to anatomical MR images of 40 healthy spinal cords. Challenge data was composed by 80 datasets, split in 40 training and 40 test datasets, 20 each acquired at 4 different sites (University College London, Polytechnique Montreal, University of Zurich and Vanderbilt University). See Table 1 for demographic data. Algorithms were evaluated against manual segmentations from four trained raters (one from each site who each analysed all data from all sites) in terms of segmentation accuracy and precision using several validation metrics.

Table 1

Demographic data per site, first row: number of healthy controls per site, second row: gender - female (F):male (M); third row: mean age in years. Std: standard deviation.

	Site 1 – UCL	Site 2 – Montreal	Site 3 – Zurich	Site 4 – Vanderbilt
Subjects	20	20	20	20
Gender	14F:6M	11F:9M	6F:14M	7F:13M
Mean Age (Std)	44.3 (10.4)	33.7 (17.4)	40.6 (10.4)	28.3 (8.2)

Demographic data per site, first row: number of healthy controls per site, second row: gender - female (F):male (M); third row: mean age in years. Std: standard deviation.

Data

A multi-centre, multi-vendor dataset of spinal cord anatomical images of healthy subjects was provided. Each site provided images from 20 healthy subjects along with WM/GM manual segmentation masks. The acquisition parameters for each site were the following: Site 1, University College London. Acquisition was performed using a 3 T Philips Achieva MRI system with dual-transmit technology enabled for all scans (Philips Healthcare, Best, Netherlands) and the manufacturer's product 16-channel neurovascular coil. All participants were immobilised using a MRI-compatible cervical collar (TalarMade Ltd, Chesterfield, UK). The cervical cord was imaged in the axial-oblique plane (i.e. slices perpendicular to the longitudinal axis of the cord) with the center of the imaging volume positioned at the level of C2-3 intervertebral disc. The MRI acquisition parameters were: fat-suppressed 3D slab-selective fast field echo (3D-FFE) with time of repetition (TR)=23 ms; time of echo (TE)=5 ms, flip angle α=7°, field-of-view (FOV)=240×180 mm2, voxel size=0.5×0.5×5 mm3, NEX=8, 10 axial contiguous slices, scanning time 13:34 min. A 15 mm section of the high-resolution 3D-FFE volumetric scan (i.e. 3 slices) was extracted, with the middle slice passing through the C2/C3 intervertebral disc. Site 2, Polytechnique Montreal. Acquisition was performed using a 3 T Siemens TIM Trio, with the body coil used for RF transmission and the 12 channels head coil+4 channels neck coil for RF reception. All participants were immobilised with padding. Axial 2D spoiled gradient echo, TR=539 ms, TE=5.41, 12.56 and 19.16 ms (averaged off-line to create a single image with increased SNR), flip angle α=35°, readout bandwidth (BW)=200 Hz per pixel, voxel size=0.5×0.5×5 mm3, 10 slices, matrix size of 320×320, R=2 acceleration along RL direction with GRAPPA reconstruction, phase stabilization. Scanning time 4:38 min. Site 3, University of Zurich. Scanning was performed on a 3 T Siemens Skyra MRI scanner (Siemens Healthcare, Erlangen, Germany) using a 16-channel radio-frequency receive head and neck coil and radio-frequency body transmit coil. All participants wore an MRI-compatible neck collar (Laerdal Medicals, Stavanger, Norway). A 3D high-resolution optimized T2*-weighted multi-echo sequence (multiple echo data image combination; MEDIC) was applied to acquire five high-resolution 3D volumes of the cervical cord at C2/C3 level. Each volume consisted of twenty contiguous slices acquired in the axial-oblique plane and was obtained with a resolution of 0.5×0.5×2.5 mm3 within 2:08 min for each of the five volumes. Following parameters were applied: TE=19 ms, TR=44 ms, FOV=192×162 mm2, matrix size=384×324, flip angle α=11°, and readout bandwidth=260 Hz per pixel. After data acquisition, zero-interpolation filling was used to double in-plane resolution (0.25×0.25 mm2) and the five 3D volumes were averaged in the spatial domain to create a single image with increased SNR. Site 4, Vanderbilt University. Imaging was performed on a 3 T whole body Philips scanner (Philips Achieva, Best, Netherlands). A two-channel body coil was used in multi-transmit mode for excitation and a 16-channel SENSE neurovascular coil was used for reception. All participants were immobilised using foam pads around the head between the coil and a foam neck pillow. The sequence consisted of a multi-slice, multi-echo fast field echo (mFFE) acquired in the axial plane with the following parameters: TR=700 ms, TE/δTE=7.2/8.9 ms, FOV=160×160 mm2, flip angle α=28°, voxel size=0.65×0.65×5 mm3 interpolated to 0.29×0.29×5 mm3, number of echoes=3, NSA=2, 14 axial contiguous slices, SENSE: RL=2. The resulting scan time was 5:46 min. The centre of the imaging volume was positioned at C3/C4 intervertebral disc. At all sites, the imaging volume was carefully positioned to ensure comparable results across all scans. Written informed consent was obtained from all participants and the work was approved by the respective institution's local research committee. Table 2 summarises the acquisition parameters of the 4 sequences used in this study.

Table 2

A summary of acquisition parameters from each site.

	Site 1 – UCL	Site 2 – Montreal	Site 3 – Zurich	Site 4 – Vanderbilt
Scanner	3 T Philips Achieva	3 T Siemens TIM Trio	3 T Siemens Skyra	3 T Philips Achieva
Sequence	3D Gradient echo	2D spoiled gradient multi-echo	3D multi-echo gradient-echo	3D multi-echo gradient-echo
TE (ms)	5	5.41,12.56,19.16	19	7.2,16.1,25
TR (ms)	23	539	44	700
Flip Angle (deg)	7	35	11	28
FOV (mm)	240×180	320×320	162×192	160×160
Resolution (mm)	0.5×0.5×5	0.5×0.5×5	0.25×0.25×2.5	0.3×0.3×5
NEX	8	1	5	2
Slices	10 (3 extracted)	10	20	14
Time (m:s)	13:34	4:38	10:40	5:46
Coil (channels)	16	12+4	16	16
Coil type	Neurovascular	Head+Neck	Neurovascular	Neurovascular
Acceleration	–	GRAPPA factor 2	–	SENSE RL=2

A summary of acquisition parameters from each site.

Image quality assessment

Image quality assessments were performed for each site at subject level. Signal-to-noise ratio within WM and contrast-to-noise ratio between GM and WM (Grussu et al., 2015, Yiannakas et al., 2016) were computed as: and .

GM mask delineation

Four expert raters, one per site, working independently, manually segmented the GM and WM masks using different software packages following (Yiannakas et al., 2012) guidelines. Rater 1 (MY) and rater 3 (GD), first outlined GM manually and subsequently outlined the cord CSA in all subjects using the semi-automated cord finder option available with JIM (v. 6.0, Xinapse Systems, Northants, UK; http://www.xinapse.com/). GM and cord CSA JIM masks were converted to NIFTI using the JIM masker tool. Pixels that were at least 50% within ROIs were defined to be inside the mask. Rater 2 (SMD) and 4 (BL) manually outlined both GM and WM masks. Rater 2 used FSLView (Jenkinson et al., 2012) and Rater 4 used MIPAV http://mipav.cit.nih.gov/. Additionally, in order to assess rater performance, a consensus segmentation of the four raters was calculated using majority voting. In this paper, consensus is defined as voxels receiving three or more rater votes.

Evaluation framework

An online automatic evaluation tool was made publicly available as part of NiftyWeb (Prados et al., 2016a) at http://cmictig.cs.ucl.ac.uk/niftyweb/. From this website, training and testing data were publicly available for download. The training dataset contained a total of 40 volumes (18F:22M, mean age years), 10 per site, with the WM and GM spinal cord segmentation from 4 expert raters and a text file with the vertebral level of each slice. The testing dataset contained a total of 40 volumes (20F:20M, mean age years), 10 per site, and a text file with the vertebral level. Participants were required to accept a data usage license agreement prior to downloading the data. Teams submitted their binary tissue segmentation masks and obtained the performance results automatically for both training or testing datasets. The submitted segmentations were assessed using the validation metrics described in the following section. The evaluation website will remain open to new submissions. Gold standard segmentations of the testing dataset will remain hidden.

Validation metrics

A number of quantitative scores were used to validate the quality of the submitted binary segmentations. All evaluations were performed in 3D and, in order to cover the same area/volume, only the slices that were processed by all the raters were taken into account. Manual binary segmentation masks were considered as the ground truth (GT). For each provided mask (PM) by the teams, each voxel was classified as: True positive (TP), if it was a GM voxel in GT mask and it was segmented as GM; true negative (TN), if it was a non-GM voxel in GT mask and it was segmented as non-GM; false positive (FP), if it was a non-GM voxel in GT mask and it was segmented as GM; and finally, false negative (FN), if it was a GM voxel in GT mask and it was segmented as non-GM. The evaluation scores included three overlapping metrics:Four distance based metrics:And three statistical based metrics: Dice Similarity Coefficient (DSC): a measure of the spatial overlap between two masks (Dice, 1945). Jaccard Index (JI): similarity index between two masks (Jaccard, 1912), which is related to the DSC. Conformity Coefficient (CC): measures the ratio between mis-segmented voxels and correctly segmented voxels (Chang et al., 2009).if the number of TP voxels is 0, the CC is undefined. Symmetric Mean Absolute Surface Distance (MSD): the mean of the sum of the Euclidean distance (for each voxel) between mask contours.where N and N are the total number of voxels in the contour for GT and PM respectively. The distance values are obtained through the use of a 3D Euclidean distance transform (Gerig et al., 2001). Hausdorff Surface Distance (HSD): measures the maximal contour distance between the two segmentations.where d is the Euclidean distance between voxel x and y. Skeletonized Hausdorff Distance (SHD): measures the maximum distance between the two skeletonized (Zhang and Suen, 1984) GM segmentations as an indicator of maximal local error (Dupont, 2016). Skeletonized Median distance (SMD): measures the median distance between the two skeletonized GM segmentations as an indicator of global errors (Dupont, 2016). Sensitivity or True Positive Rate (TPR): represents a methods ability to segment GM as a proportion of all correctly labelled voxels.TPR values ranges between 0 and 100, values close to 100 mean a good quality segmentation, whilst low TPR values mean that the method tends to under-segment. Specificity or True Negative Rate (TNR): measures the proportion of correctly segmented background (non-GM) voxels, i.e. the ratio between the number of correctly labeled background voxels in the automated segmentation and the total number of background voxels in the manual segmentation.TNR values range between 0 and 100. Methods with a lower number of FP voxels will have a higher TNR. Due to the small size of the GM when compared to the total image size, TNR values are naturally very high in this scenario. Precision or Positive Predictive Value, (PPV): measures the degree of compromise between true and false positive.PPV values range is between 0 and 100. A high PPV (close to 100) represents optimal segmentations with a low amount or absence of FP, while low PPV values represent over-segmented results. Skeletonized measures were calculated using the Spinal Cord Toolbox (de Leener, 2016) and the others using NiftySeg (niftyseg.sf.net). MSD, HSD, SHD and SMD are presented in millimetres, with lower scores reflecting better results. Finally, for DSC, JI, CC, TNR, TPR and PPV, higher scores reflect better results. Table 3 summarises the metrics used.

Table 3

A summary of the validation metrics.

Name	Abbr.	Range	Qualitative Interpretation	Quantitative Interpretation	Category
Dice Similarity Coefficient	DSC	0−1	Similarity between masks	Higher values are better	Overlap
Jaccard Index	PPV	0−100	Similarity between masks	Higher values are better	Overlap
Conformity Coefficient	CC	<100	Ratio between mis-segmented and correctly segmented	Higher values are better	Overlap
Symmetric Mean Absolute Surface Distance	MSD	>0	Mean euclidean distance between mask contours (mean error)	Smaller values are better	Distance
Hausdorff Surface Distance	HSD	>0	Longest euclidean distance between mask contours (absolute error)	Smaller values are better	Distance
Skeletonized Hausdorff Distance	SHD	>0	Indicator of maximal local error	Smaller values are better	Distance
Skeletonized Median Distance	SMD	>0	Indicator of global errors	Smaller values are better	Distance
True Positive Rate or Sensitivity	TPR	0−100	Low values mean that method tends to under-segment	Higher values are better	Statistical
True Negative Rate or Specificity	TNR	0−100	Quality of segmented background	Higher values are better	Statistical
Positive Predictive Value or Precision	PPV	0−100	Low values mean that method tends to over-segment	Higher values are better	Statistical

A summary of the validation metrics.

Statistical analysis

Each PM was compared to the equivalent GT mask of each rater. Then, the mean and standard deviation for all the evaluation scores were computed. Both the per rater and overall metric results were included in a report e-mail that was sent automatically to the teams immediately after the submission of the results. For each metric, a two-tailed unequal variance paired t-test was used to assess if the there were any significant differences in performance between the best result and the others. Tests were also performed for significant differences between each method and the consensus of the manual segmentations in order to assess the performance of the proposed techniques against human raters. Results are presented using box plots where the bottom and the top of the box plot are the 25% and the 75% percentiles, or Q1 and Q3 quartiles, respectively; the upper and lower whiskers represent: upper whisker=min(max(y), Q3+1.5×IQR) and lower whisker=max(min(y), Q1-1.5×IQR), where IQR stands for interquartile range that is the difference between Q3 and Q1. Additionally, each of the obtained results is represented as a black dot and the mean using a rhombus. Using STATA 14, we computed a generalized linear model to assess whether the results of any presented method and metric were biased by sequence or age. All sequence interaction coefficients (categorical variables) were jointly compared with an F-test to estimate between-sequence differences. Interactions with age (continuous variable) were obtained using an independent linear regression model without sequence interaction.

Submission guidelines

In this challenge, the participating teams were allowed unlimited submissions. Teams were also allowed to use other publicly available datasets within their algorithms. Numerical input parameters were permitted, but under the requirement that they would be kept constant for all data sets. Output GM segmentations were provided in the same space and resolution as the input data. There were no restrictions on how the algorithms were implemented with regards to platform, programming language, or software library dependencies. Algorithms were executed solely by the competing team with the segmentation results provided to the organizers. Output segmentations were saved in NIFTI format with a label of 1 assigned to spinal cord GM and 0 otherwise. Methods and results were presented during the 3rd Annual Spinal Cord MRI Workshop, held immediately following the ISMRM annual meeting in Singapore, May 2016. If human interaction was required to run an algorithm, teams were asked to provide a description of the required steps (e.g., cropping, normalization, centering, pre-segmentation, etc.).

Methods

Eleven different users requested the data and seven institutions initially entered the challenge. Finally, six teams submitted final results to the challenge and presented their method during the workshop. Team 1 – University College London, led by FP, MJC, CWK and SO. Method name: Joint collaboration for spinal cord grey matter segmentation (Prados et al., 2016b), referred to as: JCSCS. Team 2 – University of British Columbia, led by EL, TB and RT. Method name: Deepseg, referred to as: DEEPSEG. Team 3 – University of California San Francisco, led by ED, NP, RS, AZ, JCG and RH. Method name: Morphological geodesic active contours algorithm (Datta et al., 2016) -, referred to as: MGAC. Team 4– Eindhoven University of Technology and University College London, led by SS, FG, FP and CWK. Method name: Grey matter segmentation based on maximum entropy, referred to as: GSBME. Team 5 – Polytechnique Montreal, led by SMD, BDL and JCA. Method name: Multi-atlas based segmentation method for the spinal cord white and grey matter (Dupont, 2016) implemented in the Spinal Cord Toolbox (de Leener, 2016), referred to as: SCT. Team 6 – University of Zurich and University College London, led by CB, PF and JA. Method name: Semisupervised VBEM (Blaiotta et al., 2016), referred to as: VBEM. All methods are described in the Appendix A and Table 4 presents a summary of each method. No modifications were introduced to any of the presented spinal cord GM segmentation methods as a result of this challenge for the purposes of this publication.

Table 4

Setup parameters and characteristics for each presented method. Note atlas size is in number of slices and that computational time per slice is an approximation, has been obtained in different workstations and might vary depending on the resolution.

Name	Init.	Training	Atlas size	Time per slice	Available
JCSCS	Automatic	No	820	4–5 min	niftyseg.sf.net
DEEPSEG	Automatic	Yes (4 h)	160	<1 s	Soon
MGAC	Automatic	No	1	1 s	Soon
GSBME	Manual	Yes (<1 min)	No	5–80 s	Upon request
SCT	Automatic	No	447	8–10 s	spinalcordtoolbox.sf.net
VBEM	Automatic	No	No	5 s	Soon

Results

In order to assess inter-rater variability, using leave-one-out cross-validation, the quantitative analysis results of the performance of each rater segmentation using the testing dataset are presented in Table 5 and Fig. 1.

Table 5

Comparison of each rater segmentation versus the majority voting mask of all raters for the test dataset with the mean (std) Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC). In bold face, the best obtained result for each particular metric. The script * represents significant differences (paired t-test with p0.05) between the obtained result by a rater and the best result. MSD, HSD, SHD and SMD are in millimetres and lower values mean better, for all the other scores higher values mean better score.

	Rater 1	Rater 2	Rater 3	Rater 4
DSC	0.91 (0.02)*	0.89 (0.03)*	0.90 (0.03)*	0.93 (0.03)
MSD	0.20 (0.21)	0.30 (0.31)*	0.21 (0.22)	0.14 (0.15)
HSD	1.80 (0.68)*	1.75 (0.57)*	1.53 (0.44)	1.44 (0.55)
SHD	0.71 (0.28)	1.10 (0.39)*	0.70 (0.31)	0.66 (0.30)
SMD	0.37 (0.18)	0.43 (0.21)	0.36 (0.18)	0.35 (0.17)
TPR	89.27 (3.7)	81.99 (5.39)*	84.64 (3.76)*	90.19 (4.38)
TNR	99.990 (0.02)	99.995 (0.01)	99.995 (0.01)	99.994 (0.01)
PPV	92.01 (3.48)*	96.52 (1.87)	96.04 (1.92)	95.08 (2.06)*
JI	0.83 (0.04)*	0.80 (0.05)*	0.82 (0.04)*	0.86 (0.04)
CC	78.95 (5.94)*	73.80 (8.89)*	77.45 (6.40)*	83.62 (6.21)

Fig. 1

Results of the raters for the testing dataset. Boxplot, the mean value is represented by a rhombus and dots show original obtained values per mask. Each rater's results are compared to the majority voting mask. From left to right, first row: Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized median distance (SMD) and skeletonized Hausdorff distance (SHD). Second row: true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC). Comparison of each rater segmentation versus the majority voting mask of all raters for the test dataset with the mean (std) Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC). In bold face, the best obtained result for each particular metric. The script * represents significant differences (paired t-test with p0.05) between the obtained result by a rater and the best result. MSD, HSD, SHD and SMD are in millimetres and lower values mean better, for all the other scores higher values mean better score. In addition, Table 6, Figs. 2 and 3 present the results of each method also using the testing dataset against each rater independently. In Figs. 2 and 3, the results are split by site, with a boxplot drawn for each metric and site. In Fig. 2, MSD, HSD, SMD and SHD are in millimetres but are represented using a logarithmic scale in order to highlight the various results. Table 6 presents the obtained results per method, with the mean (std) and p-value for each metric estimated with respect to the best result of the same metric (marked in bold face). Using a two-tailed unequal variance paired t-test, significant differences () between a method and the best performing method have been marked with “*”. Methods that were found to not be statistically significantly different from the consensus of manual segmentations are marked with script “+” (p0.05).

Table 6

Comparison of each method segmentation versus each one of the four raters masks for the test dataset with the mean (std) Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC). In bold face, the best obtained result for each particular metric. The script * represents significant differences (paired t-test with p0.05) between the obtained result and the best result. The script + represents non-significant differences (paired t-test with p0.05) between the obtained result and the consensus of the raters. MSD, HSD, SHD and SMD are in millimetres and lower values mean better, for all the other scores higher values mean better score.

	JCSCS	DEEPSEG	MGAC	GSBME	SCT	VBEM
DSC	0.79 (0.04)	0.80 (0.06)	0.75 (0.07)*	0.76 (0.06)*	0.69 (0.07)*	0.61 (0.13)*
MSD	0.39 (0.44)	0.46 (0.48)	0.70 (0.79)*	0.62 (0.64)	0.69 (0.76)*	1.04 (1.14)*
HSD	2.65 (3.40)+	4.07 (3.27)*	3.56 (1.34)	4.92 (3.30)*	3.26 (1.35)	5.34 (15.35)+
SHD	1.00 (0.35)	1.26 (0.65)*	1.07 (0.37)	1.86 (0.85)*	1.12 (0.41)	2.77 (8.10)+
SMD	0.37 (0.18)+	0.45 (0.20)*+	0.39 (0.17)*+	0.61 (0.35)*	0.39 (0.16)+	0.54 (0.25)*
TPR	77.98 (4.88)*	78.89 (10.33)*	87.51 (6.65)+	75.69 (8.08)*	70.29 (6.76)*	65.66 (14.39)*
TNR	99.98 (0.03)	99.97 (0.04)	99.94 (0.08)*	99.97 (0.05)	99.95 (0.06)	99.93 (0.09)*
PPV	81.06 (5.97)	82.78 (5.19)	65.60 (9.01)*	76.26 (7.41)*	67.87 (8.62)*	59.07 (13.69)*
JI	0.66 (0.05)	0.68 (0.08)	0.60 (0.08)*	0.61 (0.08)*	0.53 (0.08)*	0.45 (0.13)*
CC	47.17 (11.87)	49.52 (20.29)	29.36 (29.53)*	33.69 (24.23)*	6.46 (30.59)*	−44.25 (90.61)*

Fig. 2

Fig. 3

True positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC) results of the presented methods per site using the testing dataset. Boxplot, the mean value is represented by a rhombus and dots show original obtained values per mask.

Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD) results of the presented methods per site using the testing dataset. Boxplot, the mean value is represented by a rhombus and dots show original obtained values per mask. MSD, HSD, SMD and SHD are in mm and represented using a logarithmic scale. True positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC) results of the presented methods per site using the testing dataset. Boxplot, the mean value is represented by a rhombus and dots show original obtained values per mask. Comparison of each method segmentation versus each one of the four raters masks for the test dataset with the mean (std) Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC). In bold face, the best obtained result for each particular metric. The script * represents significant differences (paired t-test with p0.05) between the obtained result and the best result. The script + represents non-significant differences (paired t-test with p0.05) between the obtained result and the consensus of the raters. MSD, HSD, SHD and SMD are in millimetres and lower values mean better, for all the other scores higher values mean better score. For a qualitative analysis, a randomly selected slice is shown from subject 11 of each site. Original image, consensus segmentation mask from the four raters, the corresponding binary segmentation result for each method and DSC value are shown (see Fig. 4).

Fig. 4

Binary grey matter segmentation results for the same single slice for subject 11 of each site. From top to bottom row: input image, majority voting segmentation from the 4 raters and the segmentation methods: JCSCS, DEEPSEG, MGAC, GSBME, SCT and VBEM. Obtained 3D DSC is overlayed. Using the WM and GM consensus masks, we computed the mean and standard deviation of and . Site 1 had a and a . Site 2 had a and a . Site 3 had a and a . Finally, Site 4 had a and a . The generalized linear model for assessing bias related to the type of sequence (see Table 7) showed that DEEPSEG results are significantly affected (p0.05) by image quality (i.e. site) for all the metrics. We also found that most of the distance metrics results obtained by the methods (MSD, HSD, SHD and SMD) are influenced by the sequences (p0.05) due to the different resolutions. Furthermore, Table 8 shows that age (atrophy) significantly influences the JCSCS and GBSME algorithms when overlap metrics are considered (DSC, JI and CC).

Table 7

	JCSCS	DEEPSEG	MGAC	GSBME	SCT	VBEM
DSC	0.233	<0.001	0.210	0.174	0.286	0.010
MSD	<0.001	<0.001	<0.001	<0.001	<0.001	<0.001
HSD	0.270	<0.001	<0.001	<0.001	<0.001	0.295
SHD	<0.001	<0.001	<0.001	<0.001	<0.001	0.345
SMD	<0.001	<0.001	<0.001	<0.001	<0.001	<0.001
TPR	0.120	<0.001	0.145	0.263	0.869	0.005
TNR	<0.001	<0.001	<0.001	<0.001	<0.001	<0.001
PPV	0.155	<0.001	0.032	0.009	0.030	0.140
JI	0.217	<0.001	0.172	0.212	0.261	0.003
CC	0.256	<0.001	0.289	0.161	0.346	0.041

Table 8

Generalized linear model results for the method's performance per each metric depending on the age of each subject expressed as regression coefficient, 95% confidence interval (CI) and p-value. DSC, TPR, TNR, PPV, JI, CC are in and SHD, SMD, MSD, HSD are in mm years−1. Values with mean that the age (atrophy) has a statistically significant influence over the performance of this metric and method. Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC).

	JCSCS	DEEPSEG	MGAC	GSBME	SCT	VBEM
DSC	9×10−4	0.001	−0.001	0.001	−3×10−4	9×10−4
	CI=[1×10−4 to 17×10−4]	CI=[−2×10−4 to 26×10−4]	CI=[−27×10−4 to 6×10−4]	CI=[−1×10−4 to 30×10−4]	CI=[−19×10−4 to 13×10−4]	CI=[−41×10−4 to 21×10−4]
	p=0.02	p=0.11	p=0.22	p=0.03	p=0.70	p=0.54

MSD	−0.002	−0.003	2×10−4	−0.005	−0.002	7×10−4
	CI=[−0.013 to 0.008]	CI=[−0.014 to 0.009]	CI=[−0.020 to 0.020]	CI=[−0.021 to 0.011]	CI=[−0.021 to 0.018]	CI=[−0.028 to 0.030]
	p=0.66	p=0.65	p=0.99	p=0.53	p=0.87	p=0.96

HSD	−0.036	−0.022	−0.013	−0.049	−0.009	0.070
	CI=[−0.121 to 0.049]	CI=[−0.104 to 0.059]	CI=[−0.045 to 0.018]	CI=[−0.130 to 0.034]	CI=[−0.040 to 0.023]	CI=[−0.321 to 0.460]
	p=0.40	p=0.58	p=0.40	p=0.24	p=0.58	p=0.72

SHD	−0.002	−0.012	5×10−4	−0.021	5×10−4	0.043
	CI=[−0.009 to 0.005]	CI=[−0.028 to 0.003]	CI=[−0.008 to 0.009]	CI=[−0.040 to −0.002]	CI=[−0.009 to 0.010]	CI=[−0.163 to 0.249]
	p=0.62	p=0.12	p=0.90	p=0.03	p=0.91	p=0.67

SMD	0.001	−8×10−4	−7×10−4	−0.004	8×10−4	0.003
	CI=[−0.003 to 0.005]	CI=[−0.006 to 0.004]	CI=[−0.003 to 0.005]	CI=[−0.012 to 0.004]	CI=[−0.003 to 0.005]	CI=[−0.003 to 0.009]
	p=0.63	p=0.75	p=0.72	p=0.31	p=0.65	p=0.32

TPR	0.074	0.085	−0.097	0.102	−0.052	−0.199
	CI=[−0.019 to 0.167]	CI=[−0.165 to 0.335]	CI=[−0.254 to 0.059]	CI=[−0.084 to 0.288]	CI=[−0.213 to 0.110]	CI=[−0.554 to 0.155]
	p=0.11	p=0.49	p=0.21	p=0.27	p=0.52	p=0.26

TNR	6×10−4	7×10−4	0.5×10−4	8×10−4	−4×10−4	12×10−4
	CI=[−2×10−4 to 14×10−4]	CI=[−3×10−4 to 17×10−4]	CI=[−19×10−4 to 20×10−4]	CI=[−2×10−4 to 20×10−4]	CI=[−10×10−4 to 19×10−4]	CI=[−9×10−4 to 33×10−4]
	p=0.13	p=0.19	p=0.96	p=0.13	p=0.57	p=0.24

PPV	0.109	0.125	−0.099	0.221	−0.013	−0.006
	CI=[−0.010 to 0.229]	CI=[0.039–0.211]	CI=[−0.307 to 0.108]	CI=[0.068–0.373]	CI=[−0.210 to 0.183]	CI=[−0.344 to 0.331]
	p=0.07	p=0.005	p=0.34	p=0.006	p=0.89	p=0.97

JI	0.001	0.002	−0.001	0.221	3×10−4	−0.001
	CI=[2×10−4 to 24×10−4]	CI=[−4×10−4 to 35×10−4]	CI=[−32×10−4 to 8×10−4]	CI=[2×10−4 to 37×10−4]	CI=[−22×10−4 to 14×10−4]	CI=[−0.004 to 0.002]
	p=0.02	p=0.12	p=0.24	p=0.03	p=0.68	p=0.46

CC	0.312	0.400	−0.461	0.623	0.104	−0.447
	CI=[0.049–0.574]	CI=[−0.081 to 0.883]	CI=[−1.170 to 0.249]	CI=[0.068–1.179]	CI=[−0.845 to 0.638]	CI=[−2.704 to 1.810]
	p=0.02	p=0.10	p=0.20	p=0.03	p=0.78	p=0.69

Generalized linear model results for the method's performance per each metric depending on the scanner sequence expressed as p-value (F-test between all site coefficients in a regression model). Values with (in bold face) mean that the image quality has an statistically significant influence over the performance of this metric and method. Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC). Generalized linear model results for the method's performance per each metric depending on the age of each subject expressed as regression coefficient, 95% confidence interval (CI) and p-value. DSC, TPR, TNR, PPV, JI, CC are in and SHD, SMD, MSD, HSD are in mm years−1. Values with mean that the age (atrophy) has a statistically significant influence over the performance of this metric and method. Dice similarity coefficient (DSC), mean surface distance (MSD), Hausdorff surface distance (HSD), skeletonized Hausdorff distance (SHD), skeletonized median distance (SMD), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Jaccard index (JI) and conformity coefficient (CC).

Discussion

Presented algorithms were found to be able to identify and segment GM on all datasets with an acceptable precision and shape (see Fig. 4). It is important to highlight the fact that the small size of the spinal cord GM makes the process of segmenting the GM algorithmically challenging, as the inclusion/exclusion of one voxel can have a substantial impact on the performance scores (see Table 7, Table 8). Raters delineated very similar masks, however in comparison to the majority voting-based consensus segmentation, rater 4 performs significantly better than the remaining raters (see Table 5). Note that significant differences in performance between raters does not necessarily mean clinically significant differences. The statistically significant difference between raters is mostly due to small differences in segmentation protocol when drawing the GM. This is further corroborated by the small standard deviations of most performance scores (see Table 5 and Fig. 1). JCSCS was found to be a method that provides similar mask contour and shape when compared to the ground truth, obtaining amongst some of the best scores for MSD, HSD, SHD and SMD (see Table 6 and Fig. 2). In terms of HSD and SMD, JCSCS was not found to be significantly different from a consensus manual rater (; see Table 6). Regarding the overlap scores between JCSCS and rater masks, the obtained results were found not to differ significantly () from the best results (see DSC, PPV, JI and CC in Table 6). The low TPR values obtained by JCSCS means that it tends to marginally undersegment the GM producing more conservative masks and consequently getting high TNR values. The lowest standard deviation obtained by JCSCS in seven out of ten validation metrics demonstrates its robustness and reliability across different vendors, independent sites, various acquisition parameters and image resolution. With the highest DSC among all presented techniques (DSC=0.8) DEEPSEG has shown the potential of deep learning for spinal cord segmentation. Furthermore, the algorithm, which was originally intended for brain lesion segmentation, was only slightly adjusted for the spinal cord, lending further support to the strengths of deep learning. The DEEPSEG algorithm performs significantly worse than the best technique on four scores: HSD, SHD, SMD, and TPR. However, the SMD, which quantifies global errors, was found not to be significantly different from human raters, suggesting that DEEPSEG captures the gold standard skeletonised structure of the GM. In some occurrences, the DEEPSEG algorithm will fail to connect the two horns of the GM, as seen in Fig. 4, potentially linked to the relatively low number of training samples commonly necessary for deep learning applications. The MGAC method scored high amongst the methods in TPR, indicating the highest level of specificity. In addition, the MGAC method scored amongst the highest in both SHD and SMD, demonstrating the methods ability to determine the underlying shape of the GM. However, MGAC did not score as highly in TNR, representing a lower level of sensitivity. This lower sensitivity is also seen in the lower MSD and PPV scores. These results suggest that the MGAC method is excellent at determining the underlying shape of the GM, but may overestimate the GM volumes compared to human raters. This overestimation in volume can be seen in Fig. 4. One strong advantage of the MGAC algorithm is its ability to work on images with different contrasts. This algorithm was developed for use on PSIR images, but has also been shown to work well on T2weighted images. The GSBME method consists of three steps: preprocessing, maximum-entropy thresholding and outlier detection. As far as its performance is concerned, GSBME provides consistent values of quality-of-segmentation scores in all sites. It ranks intermediately when scores that measure the degree of overlap between masks (i.e. DSC and JI) or the ability of rejecting false/accepting true segmentations are considered (i.e. TNR and TPR). However, its performance worsens when using scores that measure the physical distance between segmented voxels, especially in their skeletonized version (i.e. SMD, SHD). While the performance of the algorithm could be potentially improved, the current implementation appears to suffice for the characterisation of grey/white matter differences in future studies involving healthy controls. From an algorithmic point of view, the current implementation includes a number of operations to standardise data from different sites (i.e. normalisation, denoising). These could potentially be unnecessary for single-site studies, leading to a further simplification of the algorithm. The bottle-neck of the method is the initial detection of the spinal cord, a semi-automatic procedure requiring manual input. Further improvements to the technique should focus on the final step of outlier detection, which effectively reduces false positives but that can also lead to false negatives. The GM segmentation as implemented in the Spinal Cord Toolbox (SCT) is an atlas-based method. Therefore, the output segmentation is a fusion of manual segmentations that constitute the model, implying that segmentations always have a shape that resembles the GM. Moreover, unlike contour deformation or intensity based methods, SCT is very robust to artefacts or pathologies as demonstrated in Dupont (2016). The scores computed for SCT were satisfying. However, the relatively low PPV suggests that SCT has a tendency to over segment the GM, which could be addressed by adjusting the threshold of the output probabilistic segmentation. Results obtained with SCT for shape sensitive indicators (HSD, SHD and SMD) were amongst the best ones, suggesting that SCT captures properly the GM shape in the input image. Moreover, the SMD score of SCT was similar to the gold standard, suggesting that this method performs as well as human raters in capturing the overall GM shape. Finally, SCT is available as an open-source software package (http://spinalcordtoolbox.sf.net) (de Leener, 2016). Compared to some of the other competing algorithms, the semisupervised VBEM method exhibited a relatively poor performance in terms of overlap scores with the manual segmentations. On the other hand, the results were submitted for evaluation only once. Therefore no parameter tuning was performed in order to maximize the performance with respect to the selected accuracy measures. The method tries to capture the most parsimonious partitioning of the data based on the observed image intensities, therefore structures that have partially overlapping intensity distributions, such as GM and WM in the spinal cord might be particularly hard to resolve. Additionally, if the training data set is not sufficiently large, volumetric approaches also suffer from having a relatively small amount of training labels available at each anatomical cross section (especially for images with different fields of view), compared to slice based methods. Nevertheless, such a probabilistic modelling framework represents an ideal environment for performing statistical morphometric group studies, which can potentially help to unravel the mechanisms underlying neurological disorders. It should also be noted that, for this purpose, conformity with manual labelling protocols does not constitute a primary concern, as long as there is internal consistency of the results across subjects. Finally, as no single method has consistently outperformed all other methods for every site and assessment metric, no hard conclusions can be drawn with regards to the true best performing method; the choice of an optimal method would change depending on the target sequence, computational time and choice of performance metrics.

Conclusions

This paper demonstrates the feasibility of six emerging segmentation methods to fully automatically and robustly segment the butterfly shape of the GM in the spinal cord. Thus, next to established voxel-wise segmentation algorithms optimized for the brain, the spinal cord tissue is entering the field of voxel-wise analysis opening new avenues to make statistical inferences of volume and shape across the entire neuroaxis (Freund et al., 2016). We have presented the results of the first spinal cord GM segmentation challenge. Six institutions across the world have collaborated in order to compare their cutting edge methods using the same dataset from multiple vendors and sites. The challenge was successful and the presented methods provided highly promising results using different underlying principles. This variety showed that spinal cord GM segmentation remains challenging within a vibrant research field. Finally, training data and masks, and testing data without masks, will remain publicly available at http://cmictig.cs.ucl.ac.uk/niftyweb for the community to continue to evaluate their methods. Future spinal cord GM challenges will aim to include other image modalities, more vendors, neurological conditions, other spinal cord levels and attempt to harmonise the manual segmentation software and protocol.

42 in total

1. Automatic magnetic resonance spinal cord segmentation with topology constraints for variable fields of view.

Authors: Min Chen; Aaron Carass; Jiwon Oh; Govind Nair; Dzung L Pham; Daniel S Reich; Jerry L Prince
Journal: Neuroimage Date: 2013-08-06 Impact factor: 6.556

2. A morphological approach to curvature-based evolution of curves and surfaces.

Authors: Pablo Márquez-Neila; Luis Baumela; Luis Alvarez
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2014-01 Impact factor: 6.226

3. Neurite orientation dispersion and density imaging of the healthy cervical spinal cord in vivo.

Authors: Francesco Grussu; Torben Schneider; Hui Zhang; Daniel C Alexander; Claudia A M Wheeler-Kingshott
Journal: Neuroimage Date: 2015-01-31 Impact factor: 6.556

4. Robust, accurate and fast automatic segmentation of the spinal cord.

Authors: Benjamin De Leener; Samuel Kadoury; Julien Cohen-Adad
Journal: Neuroimage Date: 2014-04-27 Impact factor: 6.556

5. Fully-integrated framework for the segmentation and registration of the spinal cord white and gray matter.

Authors: Sara M Dupont; Benjamin De Leener; Manuel Taso; Arnaud Le Troter; Sylvie Nadeau; Nikola Stikov; Virginie Callot; Julien Cohen-Adad
Journal: Neuroimage Date: 2016-09-20 Impact factor: 6.556

6. Unbiased average age-appropriate atlases for pediatric studies.

Authors: Vladimir Fonov; Alan C Evans; Kelly Botteron; C Robert Almli; Robert C McKinstry; D Louis Collins
Journal: Neuroimage Date: 2010-07-23 Impact factor: 6.556

Review 7. The current state-of-the-art of spinal cord imaging: applications.

Authors: C A Wheeler-Kingshott; P W Stroman; J M Schwab; M Bacon; R Bosma; J Brooks; D W Cadotte; T Carlstedt; O Ciccarelli; J Cohen-Adad; A Curt; N Evangelou; M G Fehlings; M Filippi; B J Kelley; S Kollias; A Mackay; C A Porro; S Smith; S M Strittmatter; P Summers; A J Thompson; I Tracey
Journal: Neuroimage Date: 2013-07-14 Impact factor: 6.556

8. Spinal cord gray matter atrophy correlates with multiple sclerosis disability.

Authors: Regina Schlaeger; Nico Papinutto; Valentina Panara; Carolyn Bevan; Iryna V Lobach; Monica Bucci; Eduardo Caverzasi; Jeffrey M Gelfand; Ari J Green; Kesshi M Jordan; William A Stern; H-Christian von Büdingen; Emmanuelle Waubant; Alyssa H Zhu; Douglas S Goodin; Bruce A C Cree; Stephen L Hauser; Roland G Henry
Journal: Ann Neurol Date: 2014-08-21 Impact factor: 10.422

Review 9. FSL.

Authors: Mark Jenkinson; Christian F Beckmann; Timothy E J Behrens; Mark W Woolrich; Stephen M Smith
Journal: Neuroimage Date: 2011-09-16 Impact factor: 6.556

10. Fully automated grey and white matter spinal cord segmentation.

Authors: Ferran Prados; M Jorge Cardoso; Marios C Yiannakas; Luke R Hoy; Elisa Tebaldi; Hugh Kearney; Martina D Liechti; David H Miller; Olga Ciccarelli; Claudia A M Gandini Wheeler-Kingshott; Sebastien Ourselin
Journal: Sci Rep Date: 2016-10-27 Impact factor: 4.379

26 in total

1. Automatic Spinal Cord Gray Matter Quantification: A Novel Approach.

Authors: C Tsagkas; A Horvath; A Altermatt; S Pezold; M Weigel; T Haas; M Amann; L Kappos; T Sprenger; O Bieri; P Cattin; K Parmar
Journal: AJNR Am J Neuroradiol Date: 2019-08-22 Impact factor: 3.825

2. Correlations between cervical spinal cord magnetic resonance diffusion tensor and diffusion kurtosis imaging metrics and motor performance in patients with chronic ischemic brain lesions of the corticospinal tract.

Authors: Valentina Panara; R Navarra; P A Mattei; E Piccirilli; V Bartoletti; A Uncini; M Caulo
Journal: Neuroradiology Date: 2018-12-05 Impact factor: 2.804

Review 3. Future Brain and Spinal Cord Volumetric Imaging in the Clinic for Monitoring Treatment Response in MS.

Authors: Tim Sinnecker; Cristina Granziera; Jens Wuerfel; Regina Schlaeger
Journal: Curr Treat Options Neurol Date: 2018-04-20 Impact factor: 3.598

4. Quantification of DTI in the Pediatric Spinal Cord: Application to Clinical Evaluation in a Healthy Patient Population.

Authors: B B Reynolds; S By; Q R Weinberg; A A Witt; A T Newton; H R Feiler; B Ramkorun; D B Clayton; P Couture; J E Martus; M Adams; J C Wellons; S A Smith; A Bhatia
Journal: AJNR Am J Neuroradiol Date: 2019-06-13 Impact factor: 3.825