Stefan Winzeck1, Arsany Hakim2, Richard McKinley2, José A A D S R Pinto3, Victor Alves3, Carlos Silva3, Maxim Pisov4,5, Egor Krivov5, Mikhail Belyaev5, Miguel Monteiro6, Arlindo Oliveira6, Youngwon Choi7, Myunghee Cho Paik7, Yongchan Kwon7, Hanbyul Lee7, Beom Joon Kim8, Joong-Ho Won7, Mobarakol Islam9, Hongliang Ren9, David Robben10, Paul Suetens10, Enhao Gong11, Yilin Niu12, Junshen Xu11, John M Pauly11, Christian Lucas13, Mattias P Heinrich13, Luis C Rivera14, Laura S Castillo14, Laura A Daza14, Andrew L Beers15, Pablo Arbelaezs14, Oskar Maier13, Ken Chang15, James M Brown15, Jayashree Kalpathy-Cramer15, Greg Zaharchuk16, Roland Wiest2, Mauricio Reyes17.
Abstract
The performance of models depends strongly not only on the algorithm used but also on the data set to which it is applied. This makes it difficult to compare newly developed tools with previously published approaches. Either researchers need to implement others' algorithms first to establish an adequate benchmark on their data, or a direct comparison of new and old techniques is infeasible. The Ischemic Stroke Lesion Segmentation (ISLES) challenge, which has now run for 3 consecutive years, aims to address this problem of comparability. ISLES 2016 and 2017 focused on lesion outcome prediction after ischemic stroke: by providing a uniformly pre-processed data set, the challenge allowed researchers from all over the world to apply their algorithms directly. A total of nine teams participated in ISLES 2016, and 15 teams participated in ISLES 2017. Their performance was evaluated in a fair and transparent way to identify the state of the art among all submissions. Top-ranked teams almost always employed deep learning tools, predominantly convolutional neural networks (CNNs). Despite these great efforts, lesion outcome prediction remains challenging. The annotated data set remains publicly available, and new approaches can be compared directly via the online evaluation system, which serves as a continuing benchmark (www.isles-challenge.org).
Keywords: MRI; benchmarking; datasets; deep learning; machine learning; prediction models; stroke; stroke outcome
Year: 2018 PMID: 30271370 PMCID: PMC6146088 DOI: 10.3389/fneur.2018.00679
Source DB: PubMed Journal: Front Neurol ISSN: 1664-2295 Impact factor: 4.003
Participants of ISLES 2016 (for more details and the main features of each method, see Appendix ISLES16-A1 to ISLES16-A7).
| Team | Affiliation | Method title |
| CH-UBE | University of Bern, Switzerland | Incorporating time to reperfusion into the FASTER ( |
| DE-UZL | Institute of Medical Informatics, Universität zu Lübeck, Germany | Random forests for stroke lesion and clinical outcome prediction |
| HK-CUH | Department of Computer Science and Engineering, The Chinese University of Hong Kong | Residual Volumetric Network for Ischemic Stroke Lesion Segmentation |
| KR-SUC/KR-SUK/KR-SUL | Department of Statistics, Seoul National University, Korea | Deep Convolutional Neural Network Approach for Brain Lesion Segmentation |
| PK-PNS | Pakistan Institute of Nuclear Science and Technology, Islamabad, Pakistan | Segmentation of Ischemic Stroke Lesion using Random Forests in Multi-modal MRI Images |
| UK-CVI | CVIP, Comp. at School of Science and Eng., University of Dundee, UK | Combination of CNN and Hand-crafted feature for Ischemic Stroke Lesion Segmentation |
| US-SFT | University of Southern California, Fractal Analytics, TopicIQ | A Deep-Learning Based Approach for Ischemic Stroke Lesion Outcome Prediction |
KR-SUC, KR-SUK, and KR-SUL are variants of a single method.
Details of the ISLES 2016 & 2017 Data.
| | 2016 | 2017 |
| Number of cases | 35 training and 19 testing | 43 training and 32 testing |
| Number of expert segmentations for training and testing sets | 1 (training), 2 (testing) | 1 (training), 1 (testing) |
| MRI sequences | ADC, rBF, rBV, MTT, TMAX, TTP, Raw PWI | ADC, rBF, rBV, MTT, TMAX, TTP, Raw PWI |
Participants of ISLES 2017 (for more details and the main characteristics of each method, see Appendix ISLES17-A1 to ISLES17-A14).
| Team | Affiliation | Method title |
| AAMC | Athinoula A. Martinos Center, USA | Ensembling 3D U-Nets For Ischemic Stroke Lesion Segmentation |
| HKU-1 | Hong Kong University of Science and Technology, China | Deep Adversarial Networks for Stroke Lesion Segmentation |
| HKU-2 | Hong Kong University of Science and Technology, China | Stochastic Dense Network for Brain Lesion Segmentation |
| INESC | INESC-ID, Portugal | Fully Convolutional Neural Network for 3D Stroke Lesion Segmentation |
| KU | Korea University, Korea | Gated Two-Stage Convolutional Neural Networks for Ischemic Stroke Lesion Segmentation |
| KUL | KU Leuven, Belgium | Dual-scale Fully Convolutional Neural Network for Final Infarct Prediction |
| MIPT | Moscow Institute of Physics and Technology, Russia | Neural Networks Stroke Lesion Segmentation |
| NEU | NEUROPHET Inc. Seoul, South Korea | Combination of U-Net and Densely Connected Convolutional Networks |
| NUS | National University of Singapore, Singapore | Fully Convolutional Network with Hypercolumn Features for Brain Lesion Segmentation |
| SNU-1/SNU-2 | Seoul National University, Korea | Ischemic Stroke Lesion Segmentation with Convolutional Neural Networks for Small Data |
| SU | Stanford University, USA | Multi-scale Patch-wise 3D CNN for Ischemic Stroke Lesion Segmentation |
| UA | Universidad de los Andes, Colombia | Volumetric Multimodality Neural Network For Ischemic Stroke Segmentation |
| UL | University of Luebeck, Germany | 2D Multi-Scale Res-Net for Stroke Segmentation |
| UM | University of Minho, Portugal | Combining Clinical Information for Stroke Lesion Outcome Prediction using Deep Learning |
SNU-1 and SNU-2 are variants of a single method.
Summary of lesion characteristics for ISLES 2017 Data.
| Lesion count | mean [min, max] = 2.46 [1, 14] |
| Lesion volume | mean [min, max] = 37.83 |
| Lesion localisation in Lobes | for all 32 cases lesions were located in more than one lobe |
| Lesion localisation | |
| Involved territory | |
| Midline shift | not present for any of the 32 cases |
| Laterality | |
| White matter lesions |
n, number of cases with given feature; MCA, middle cerebral artery; PCA, posterior cerebral artery.
Fazekas Classification: 0, absent; 1, punctate; 2, beginning confluent areas; 3, large confluence.
Figure 1. Ranking scheme. For each case, the teams were sorted by each performance metric (e.g., Dice score, DC) and assigned a rank value. Each team's ranks were then averaged separately for each case. A team's final rank was calculated as the mean of all its case ranks.
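The ranking scheme described in Figure 1 can be sketched in a few lines. This is a minimal illustration, not the organizers' evaluation code: the function names, the nested-dictionary layout of `scores`, the toy values, and the choice to give tied teams the mean of their ranks are all assumptions made for the example.

```python
from statistics import mean

def average_ranks(values, higher_is_better):
    """Rank values (1 = best); tied values share the mean of their ranks (assumed tie rule)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def final_ranks(scores, higher_is_better):
    """scores[metric][team] is a list of per-case values; returns team -> final rank."""
    first_metric = next(iter(scores))
    teams = sorted(scores[first_metric])
    n_cases = len(scores[first_metric][teams[0]])
    case_ranks = {team: [] for team in teams}
    for case in range(n_cases):
        per_metric = {team: [] for team in teams}
        for metric, by_team in scores.items():
            values = [by_team[team][case] for team in teams]
            for team, rank in zip(teams, average_ranks(values, higher_is_better[metric])):
                per_metric[team].append(rank)
        for team in teams:
            # Average the metric ranks of this team for this case.
            case_ranks[team].append(mean(per_metric[team]))
    # The final rank is the mean over all case ranks.
    return {team: mean(case_ranks[team]) for team in teams}

# Hypothetical toy data: three teams, two cases, two metrics.
scores = {
    "dice": {"A": [0.6, 0.2], "B": [0.4, 0.3], "C": [0.1, 0.3]},
    "hd": {"A": [10, 40], "B": [20, 30], "C": [50, 35]},
}
result = final_ranks(scores, {"dice": True, "hd": False})
```

In this toy example team B obtains the lowest (best) final rank, although team A is best on the first case, which shows how case-wise averaging rewards consistency across cases.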
Leaderboard ISLES 2016: The rank specifies the final value to order methods relative to each other by performance.
| 1 | KR-SUL | 3.03 | 18/19 | |||
| 2 | KR-SUC | 3.57 | 3.58 | 3.71 | 3.42 | 18/19 |
| 3 | KR-SUK | 3.82 | 3.74 | 4.13 | 3.61 | |
| 4 | CH-UBE | 3.95 | 4.26 | 3.76 | 3.82 | |
| 5 | DE-UZL | 4.21 | 4.21 | 3.82 | 4.61 | |
| 6 | UK-CVI | 5.08 | 5.11 | 4.68 | 5.45 | 16/19 |
| 7 | HK-CUH | 5.59 | 5.08 | 5.53 | 6.16 | |
| 8 | PK-PNS | 6.48 | 6.34 | 7.58 | 5.55 | 12/19 |
| 9 | US-SFT | 8.07 | 8.03 | 8.03 | 8.16 | 11/19 |
Dice, HD, and ASSD rank are the average achieved ranks for each participating team per case. The last column gives the number of successfully (Dice > 0) predicted lesions. Best mean values printed in bold.
Figure 2. Significant differences between the 9 submitted methods for ISLES 2016. Each node stands for one participating team. A connection between two nodes represents a significant difference between the two lesion prediction models; the method at the tail of the arrow is superior to the one at its head. The stronger a model is, the more outgoing connections its node has; the weaker it is, the more incoming connections (#outgoing/#incoming). Additionally, the node's color saturation indicates the strength of a method (differences in Friedman test rank sum), with better methods appearing more saturated (i.e., darker blue). All methods, except for PK-PNS, are significantly better than US-SFT (post-hoc Dunn test p < 0.05).
Figure 3. Distribution of Dice scores computed between the automatic lesion predictions and each of the two ground truths (GT1 and GT2) individually for ISLES 2016. For all teams, the Dice scores computed with respect to rater 1 (GT1) were significantly lower than those computed with respect to the second ground truth (GT2).
Leaderboard ISLES 2017: The rank denotes the final value used to sort the teams' performances relative to each other.
| 1 | SNU-2 | 5.25 | 5.97 | 30/32 | |
| 2 | UL | 5.42 | 6.16 | 29/32 | |
| 3 | HKU-1 | 5.55 | 5.09 | 6.00 | 29/32 |
| 4 | INESC | 5.92 | 5.00 | 6.84 | 31/32 |
| 5 | KUL | 6.03 | 6.19 | 5.88 | 30/32 |
| 6 | SNU-1 | 6.47 | 6.25 | 6.69 | 29/32 |
| 7 | UM | 6.58 | 6.31 | 6.84 | 31/32 |
| 8 | MIPT | 6.72 | 6.34 | 7.09 | 30/32 |
| 9 | SU | 7.20 | 7.09 | 7.31 | |
| 10 | KU | 8.75 | 10.09 | 7.41 | 28/32 |
| 11 | AAMC | 9.05 | 8.63 | 9.47 | 27/32 |
| 12 | UA | 9.78 | 9.31 | 10.25 | 29/32 |
| 13 | NUS | 9.95 | 9.50 | 10.41 | 29/32 |
| 14 | NEU | 10.44 | 11.88 | 9.00 | 16/32 |
| 15 | HKU-2 | 11.80 | 12.50 | 11.09 | 14/32 |
Dice and HD rank are the average achieved ranks for each participating team. The cases column denotes the number of successfully (DC > 0) predicted lesions. Best mean values printed in bold.
Average Dice score, precision and sensitivity for individual teams across all 32 cases for ISLES 2017.
| 1 | SNU-2 | 0.31 ± 0.23 | 0.36 ± 0.27 | 0.45 ± 0.31 |
| 2 | UL | 0.29 ± 0.21 | 0.34 ± 0.26 | 0.51 ± 0.33 |
| 3 | HKU-1 | 0.34 ± 0.27 | 0.39 ± 0.28 | |
| 4 | INESC | 0.30 ± 0.22 | 0.34 ± 0.27 | 0.51 ± 0.31 |
| 5 | KUL | 0.27 ± 0.22 | 0.39 ± 0.31 | |
| 6 | SNU-1 | 0.28 ± 0.23 | 0.36 ± 0.31 | 0.41 ± 0.31 |
| 7 | UM | 0.29 ± 0.22 | 0.26 ± 0.24 | 0.61 ± 0.28 |
| 8 | MIPT | 0.27 ± 0.20 | 0.31 ± 0.28 | 0.39 ± 0.29 |
| 9 | SU | 0.26 ± 0.21 | 0.28 ± 0.25 | 0.56 ± 0.26 |
| 10 | KU | 0.17 ± 0.16 | 0.23 ± 0.28 | 0.36 ± 0.33 |
| 11 | AAMC | 0.23 ± 0.22 | 0.19 ± 0.20 | |
| 12 | UA | 0.19 ± 0.16 | 0.27 ± 0.25 | 0.21 ± 0.18 |
| 13 | NUS | 0.19 ± 0.16 | 0.29 ± 0.26 | 0.23 ± 0.22 |
| 14 | NEU | 0.11 ± 0.16 | 0.17 ± 0.25 | 0.12 ± 0.17 |
| 15 | HKU-2 | 0.05 ± 0.10 | 0.17 ± 0.28 | 0.05 ± 0.11 |
All evaluation measures are given in mean ± standard deviation. Best mean values printed in bold.
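For readers who want to reproduce the three overlap measures reported in this table, the following minimal sketch computes Dice score, precision, and sensitivity from two binary masks. The function name, the flattened-list representation of the masks, and the conventions for empty masks are assumptions for illustration and are not taken from the challenge's evaluation code.

```python
def overlap_metrics(prediction, reference):
    """Return (dice, precision, sensitivity) for two flattened binary masks."""
    pairs = list(zip(prediction, reference))
    tp = sum(1 for p, r in pairs if p and r)        # lesion voxels found
    fp = sum(1 for p, r in pairs if p and not r)    # healthy voxels labelled as lesion
    fn = sum(1 for p, r in pairs if not p and r)    # lesion voxels missed
    # Assumed conventions: two empty masks count as perfect agreement for Dice;
    # precision and sensitivity fall back to 0 when undefined.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, sensitivity

# Hypothetical over-segmenting prediction: every lesion voxel is found,
# but half of the predicted voxels are false positives.
predicted = [1, 1, 1, 1, 0, 0]
annotated = [0, 1, 1, 0, 0, 0]
dice, precision, sensitivity = overlap_metrics(predicted, annotated)
```

The toy prediction has full sensitivity but only moderate precision, the same qualitative pattern as a model that detects most lesion voxels at the cost of many false positives.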
Figure 5. Achieved Dice scores for each case across all 15 participating teams, sorted by mean value. The dashed lines show the overall mean Dice score of 0.23 (red) and the 0.5 mark (black). Note that the case numbers were assigned according to ascending mean Dice score.
Figure 6. Significant differences between the 15 submitted methods at ISLES 2017. Each node stands for one participating team. A connection between two nodes represents a significant difference between the two lesion prediction models, with the method at the tail of the arrow being superior. The stronger a model is, the more outgoing connections its node has; the weaker it is, the more incoming connections (#outgoing/#incoming). Additionally, the node color saturation indicates the strength of a method, with better methods appearing more saturated. Differences between methods were assessed via non-parametric ANOVA with repeated measurements (Friedman test) and subsequent pair-wise comparison with the Dunn test (p < 0.05).
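The first step of the statistical comparison named in the caption, the Friedman test, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and toy Dice values are hypothetical, the statistic is computed without a tie correction, and neither the p-value nor the post-hoc Dunn test is implemented here.

```python
def rank_with_ties(values):
    """Ascending ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def friedman(scores):
    """scores[case][method] -> (rank sum per method, Friedman chi-square statistic)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # Rank the methods within each case (repeated-measures design).
        for j, rank in enumerate(rank_with_ties(row)):
            rank_sums[j] += rank
    # Classical statistic without tie correction.
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return rank_sums, stat

# Hypothetical Dice scores: four cases (rows) and three methods (columns).
dice_scores = [
    [0.30, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.20, 0.25, 0.05],
    [0.60, 0.30, 0.20],
]
rank_sums, statistic = friedman(dice_scores)
```

Under the null hypothesis the statistic approximately follows a chi-square distribution with k - 1 degrees of freedom. Differences between the rank sums are the quantity a pair-wise post-hoc test then examines.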
Figure 7. Statistical comparison of the lesion prediction performance of single models vs. ensembles. Left: An ensemble of five models (E5) improved the Dice score in comparison to the two weaker models (SNU-1 p < 0.01, UL p < 0.05). This effect was, however, not observed when building an ensemble of three models (E3). Middle: The ensemble E5 gained significantly in precision compared with most of the single models (SNU-1 p < 0.01, SNU-2 p < 0.05, UL p < 0.001, INESC p < 0.01). KUL's precision was higher than or similar to that of the ensembles, showing no significant difference. Right: The ensemble E3 was found to be more sensitive in predicting lesions than SNU-1's model. Overall, the models show a fair ability to detect lesions. *p < 0.05, **p < 0.01, and ***p < 0.001.
Figure 8. Example of different softmax maps for one patient. Top row: Diffusion (ADC) and perfusion (TTP) scans, the corresponding manual lesion annotation (LABEL), and the softmax maps of the ensembles of the top five (E5) and top three (E3) ranked teams. Bottom row: Softmax maps of the top five ranked teams. Both the shape and the certainty (see color bar) of the predicted lesion vary between the participants.
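One common way to build an ensemble such as E3 or E5 from individual softmax maps is voxel-wise averaging followed by thresholding. The sketch below illustrates that idea only; the text shown here does not state how the ensembles were formed, so the averaging rule, the 0.5 threshold, the function names, and the toy probabilities are all assumptions.

```python
def ensemble_softmax(maps):
    """Voxel-wise mean of several flattened probability maps of equal length."""
    n_models = len(maps)
    return [sum(voxel) / n_models for voxel in zip(*maps)]

def binarize(prob_map, threshold=0.5):
    """Turn a probability map into a binary lesion mask (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in prob_map]

# Hypothetical softmax outputs of three models for the same three voxels.
model_maps = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.0, 0.3],
]
mean_map = ensemble_softmax(model_maps)
mask = binarize(mean_map)
```

In the toy data the third voxel is labelled lesion by one model only, and averaging removes it from the final mask. This is one mechanism by which an ensemble can gain precision over its members.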
Dependency of the Dice score on the threshold applied to the softmax maps.
| INESC | 0.28 | ||||||||
| KUL | 0.22 | 0.26 | 0.27 | 0.26 | 0.23 | 0.20 | 0.15 | 0.02 | |
| SNU-1 | 0.20 | 0.23 | 0.25 | 0.26 | 0.26 | 0.23 | 0.20 | 0.16 | |
| SNU-2 | 0.29 | 0.29 | |||||||
| UL | 0.19 | 0.24 | 0.26 | 0.27 | 0.27 | 0.25 | 0.21 | ||
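The kind of analysis behind this table can be sketched as a threshold sweep: a softmax map is binarized at several thresholds, and the Dice score is computed for each resulting mask. The threshold values, the function names, and the toy data below are hypothetical and only illustrate the procedure.

```python
def dice(pred, ref):
    """Dice score of two flattened binary masks (two empty masks count as 1.0)."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(pred) + sum(ref)
    return 2 * tp / total if total else 1.0

def dice_by_threshold(prob_map, reference, thresholds):
    """Map each threshold to the Dice score of the correspondingly binarized map."""
    return {
        t: dice([1 if p >= t else 0 for p in prob_map], reference)
        for t in thresholds
    }

# Hypothetical softmax map and manual annotation for five voxels.
softmax_map = [0.95, 0.70, 0.40, 0.20, 0.05]
annotation = [1, 1, 0, 0, 0]
sweep = dice_by_threshold(softmax_map, annotation, [0.1, 0.3, 0.5, 0.9])
```

In this example the Dice score rises and then falls again as the threshold increases: low thresholds admit false positives, high thresholds discard true lesion voxels.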
Figure 4. Performance metrics for all teams of ISLES 2017. Higher-ranking teams (e.g., 1st place SNU-2) achieved Dice scores > 0.7 for some cases; overall, however, Dice scores clustered around 0.2–0.3. The two teams ranked last (NEU and HKU-2) showed much lower Dice scores than all other teams, a consequence of their low number of successful submissions. The model of UM appeared to be the most sensitive in detecting lesions but lacked precision.
Overview of methods of participants of ISLES 2016.
| CH-UBE | Random Forest classifier integrating time to reperfusion |
| DE-UZL | Random Forests classifier |
| HK-CUH | U-Net architecture; summation instead of concatenation of different pathways |
| KR-SUC/KR-SUK/KR-SUL | Ensemble of U-Net architecture and fully convolutional neural network |
| PK-PNS | Random Forest classifier |
| UK-CVI | Combination of CNN and hand-crafted features |
| US-SFT | U-Net architecture |
Overview of methods of participants of ISLES 2017.
| AAMC | 3D CNN U-Net architecture; increased number of layers and convolutional filters, multiple down-sampling pathways; anisotropic patch size of 16 × 16 × 4; prediction of 16 overlapping patches per voxel, which are averaged. Morphological operations to reduce small clusters of erroneous predictions |
| HKU-1 | U-Net architecture, including data augmentation and batch normalization, adversarial training of two deep neural networks to avoid over-fitting |
| HKU-2 | 3D CNN U-Net architecture; long short-term memory (LSTM) to capture information in 3rd dimension of MRI scans; data augmentation |
| INESC | V-Net architecture; new loss-function: sum of standard cross-entropy loss and dice-loss |
| KU | Hierarchy of 2 CNNs. 1st CNN discriminates lesion and healthy tissue, 2nd CNN only acts upon voxels where the 1st CNN was uncertain; auto-context (use of probability maps from 1st CNN) |
| KUL | U-Net architecture; data augmentation via x-axis flip, Gaussian noise and small linear intensity transformations; ensemble of 4 networks; suppression of prediction in non-dominant hemisphere |
| MIPT | Ensemble of E-Net, DeepMedic, and two U-Nets; 2D and 3D architectures; weighted sum of models' predictions; data augmentation: rotation, flips, registration, and elastic co-registration to template |
| NEU | Combination 3D U-Net and densely connected CNN; refinement with CRF |
| NUS | PixelNet applied to lesion outcome prediction |
| SNU-1/SNU-2 | Ensemble of three CNNs: U-Net, DeepMedic, pyramid scene parsing network; negative Dice score loss |
| SU | 3D CNN with 2 scale pathways; data augmentation through rigid transformations, weighted ratios on positive and negative labels |
| UA | CNN with 4 scale pathways |
| UL | 2D U-Net with skip connections; Dice loss added to the total loss; inversely weighted loss to tackle class imbalance |
| UM | 2D U-Net in combination with clinical information |
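Several entries in this table combine a cross-entropy term with a Dice term (e.g., INESC's "sum of standard cross-entropy loss and dice-loss"). The following minimal sketch shows one way such a combined loss can be written for binary labels. The binary formulation, the soft Dice definition, the smoothing constant `eps`, and all function names are assumptions; the teams' exact formulations are not given here.

```python
import math

def cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over voxels; probabilities are clipped for stability."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

def soft_dice_loss(probs, labels, eps=1e-7):
    """1 minus the soft Dice overlap computed on probabilities instead of hard masks."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def combined_loss(probs, labels):
    """Unweighted sum of both terms (the equal weighting is an assumption)."""
    return cross_entropy(probs, labels) + soft_dice_loss(probs, labels)

# Hypothetical two-voxel example: a confident correct and an uninformative prediction.
labels = [1, 0]
loss_good = combined_loss([1.0, 0.0], labels)
loss_flat = combined_loss([0.5, 0.5], labels)
```

The Dice term depends on the overlap relative to lesion size and not on the number of background voxels. That is one reason it is often paired with cross-entropy when lesions occupy only a small fraction of the volume.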