| Literature DB >> 35851016 |
Emilia Gryska, Isabella Björkman-Burtscher, Asgeir Store Jakola, Tora Dunås, Justin Schneiderman, Rolf A Heckemann.
Abstract
OBJECTIVES: To determine the reproducibility and replicability of studies that develop and validate segmentation methods for brain tumours on MRI and that follow established reproducibility criteria; and to evaluate whether the reporting guidelines are sufficient.
Keywords: diagnostic radiology; magnetic resonance imaging; neuroradiology; statistics & research methods
Year: 2022 PMID: 35851016 PMCID: PMC9297223 DOI: 10.1136/bmjopen-2021-059000
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 3.006
Description of the two algorithms implemented in the reproducibility analysis, 3D dual-path CNN9 and 2D single-path CNN,10 according to the reproducibility categories proposed by Renard et al3
| Main category | Subcategory | 3D dual-path CNN | 2D single-path CNN |
| Algorithm/model | Description of the DL architecture | Dual-path 3D CNN with a fully connected 3D CRF. | Single-path 2D CNN; two network architectures for HGG and LGG. |
| Dataset description | Image acquisition parameters | BraTS 2015 dataset | |
| | Image size | | |
| | Data set size | | |
| | Link to the data set | | |
| Preprocessing description | Data excluded +reason | None | None |
| | Augmentation transformation | Sagittal reflection of images | Rotation with multiples of 90° angles |
| | Final sample size | Not specified | ~1 800 000 for HGG |
| Training/validation/testing split | Explanation if validation set not created | Training and testing sets provided by the BraTS challenge | |
| CV strategy +no of folds | Not specified | 5-fold CV on training set (n=274) | 1 subject in both HGG (n=220) and LGG (n=54) |
| Optimisation strategy | Optimisation algorithm +reference | RMSProp optimiser | Stochastic Gradient Descent and Nesterov’s momentum |
| | Hyperparameters (learning rate | | |
| | Hyperparameter selection strategy | CRF: 5-fold CV on a training subset HGG (n=44) and LGG (n=18) | Validation using 1 subject in both HGG (n=220) and LGG (n=54) |
| Computing infrastructure | Name, class of the architecture, and memory size | NVIDIA GTX Titan X GPU using cuDNN V.5.0, 12 GB | GPU NVIDIA GeForce GTX 980 |
| Middleware | Toolbox used/in-house code +build version | Theano | Theano V.0.7.0 |
| | Source code link +dependencies | | |
| Evaluation | Metrics average +variations | Mean of DSC, Precision, and Sensitivity (calculated by the online evaluation system) | Boxplot and mean of DSC (calculated by the online evaluation system) |
| Our implementation middleware | | | |
| | Python version | 3.8.2 | 3.7.4 |
| | DL library | Tensorflow 2.2.1 | Theano (git version eb6a412), Lasagne (git version 5d3c63c) |
| | Numpy | 1.18.5 | 1.17.3 |
| | Nibabel | 3.0.2 | 3.2.1 |
All the parameters and versions found in the first part of the table were specified in the original articles. The selection strategy of images to respective cross-validation folds was not specified. In the part ‘our implementation middleware’, we specify the Python version and libraries used for our implementations.
BraTS, Brain Tumour Segmentation Challenge; CNN, convolutional neural networks; CRF, conditional random field; CV, cross-validation; 2D, two dimensions; 3D, three dimensions; DL, deep learning; DSC, Dice similarity coefficient; FC, fully connected; HGG, high-grade glioma; LGG, low-grade glioma.
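The two augmentation transformations listed above (sagittal reflection for the 3D dual-path CNN; rotations by multiples of 90° for the 2D single-path CNN) reduce to simple array operations. Below is a minimal illustrative sketch in plain Python on nested lists; the original implementations operate on image volumes, and the function names here are our own:

```python
def sagittal_reflection(slice2d):
    """Mirror an image slice left-right, i.e. reflect across the sagittal plane."""
    return [row[::-1] for row in slice2d]

def rotate90(slice2d, k=1):
    """Rotate an image slice counter-clockwise by k * 90 degrees."""
    out = [list(row) for row in slice2d]
    for _ in range(k % 4):
        # transpose, then reverse the row order: one 90-degree CCW rotation
        out = [list(col) for col in zip(*out)][::-1]
    return out

img = [[1, 2],
       [3, 4]]
print(sagittal_reflection(img))  # [[2, 1], [4, 3]]
print(rotate90(img))             # [[2, 4], [1, 3]]
```

Both transforms are lossless label-preserving operations, which is why they can be applied identically to the image and its reference segmentation during training.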
Reproducibility results on BraTS 2015 presented in the original paper for the 3D dual-path CNN9 and for the 2D single-path CNN10 (original) and for our independent reproducibility analysis (this work)
| | Dice similarity coefficient | | | Positive predictive value | | | Sensitivity | | |
| | Whole tumour | Tumour core | CE tumour | Whole tumour | Tumour core | CE tumour | Whole tumour | Tumour core | CE tumour |
| 3D dual-path CNN | | | | | | | | | |
| Original | 0.85 | 0.67 | 0.63 | 0.85 | | | 0.88 | 0.61 | 0.66 |
| This work | 0.85 | | | 0.85 | 0.83 | 0.62 | 0.88 | | |
| 2D single-path CNN | | | | | | | | | |
| Original | | | | – | – | – | – | – | – |
| This work (HGG) | 0.36 | 0.25 | 0.17 | 0.36 | 0.21 | 0.29 | 0.54 | 0.58 | 0.17 |
| This work (LGG) | 0.25 | 0.14 | 0.13 | 0.40 | 0.51 | 0.37 | 0.25 | 0.10 | 0.10 |
Our analysis was carried out for HGG and LGG model parameters of the 2D single-path CNN. The results were congruent with the original analysis for the 3D dual-path CNN, but they show an unsuccessful attempt to reproduce the 2D single-path CNN validation. The highest score in each column is emphasised in bold. Measures of dispersion or significance of differences were not available for the original method evaluation.
BraTS, Brain Tumour Segmentation Challenge; CE, contrast-enhanced; CNN, convolutional neural network; 2D, two dimensions; 3D, three dimensions; HGG, high-grade glioma; LGG, low-grade glioma.
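The three evaluation metrics in the table (DSC, positive predictive value and sensitivity) are standard overlap measures between a predicted and a reference binary segmentation. A minimal sketch, representing each segmentation as a set of voxel coordinates; this representation and the function name are our simplification, and the scores reported above were computed by the BraTS online evaluation system:

```python
def overlap_metrics(pred, ref):
    """DSC, positive predictive value and sensitivity between two binary
    segmentations given as sets of voxel coordinates."""
    pred, ref = set(pred), set(ref)
    tp = len(pred & ref)                       # true positive voxels
    dsc = 2 * tp / (len(pred) + len(ref)) if (pred or ref) else 1.0
    ppv = tp / len(pred) if pred else 0.0      # precision
    sen = tp / len(ref) if ref else 0.0        # recall
    return dsc, ppv, sen

# toy example: 2 of 3 predicted voxels overlap a 3-voxel reference
print(overlap_metrics({(0, 0), (0, 1), (1, 0)},
                      {(0, 1), (1, 0), (1, 1)}))
```

Note that DSC, PPV and sensitivity all derive from the same true-positive count, so a method can trade PPV against sensitivity (over- vs under-segmentation) while the DSC summarises the balance of the two.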
Figure 1. Comparison of the field inhomogeneity correction with ANTs/Nipype (left) and SimpleITK (right). Distinct differences in the FLAIR signal intensity of tumour tissue are visible (red squares). ANTs, Advanced Normalization Tools; FLAIR, fluid-attenuated inversion recovery.
3D dual-path CNN9 replication analysis results on in-house data for high-grade glioma (HGG) cases and meningioma (MNG) cases evaluated on the tumour core and for low-grade glioma (LGG) cases evaluated on the whole tumour label
| ID | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | Mean | SD |
| HGG cases tumour core | | | | | | | | | | | | | | |
| DSC | 0.88 | 0.85 | 0.80 | 0.85 | 0.89 | 0.85 | 0.57 | 0.89 | 0.86 | 0.81 | 0.87 | 0.14 | | |
| PPV | 0.84 | 0.86 | 0.72 | 0.84 | 0.85 | 0.79 | 0.41 | 0.85 | 0.80 | 0.73 | 0.80 | 0.08 | | |
| Sen | 0.93 | 0.85 | 0.89 | 0.87 | 0.92 | 0.91 | 0.89 | 0.93 | 0.93 | 0.91 | 0.96 | 0.61 | | |
| MNG cases tumour core | | | | | | | | | | | | | | |
| DSC | 0.84 | 0.80 | 0.56 | 0.09 | 0.77 | n.a. | | | | | | | | |
| PPV | 0.89 | 0.72 | 0.41 | 0.60 | 0.66 | | | | | | | | | |
| Sen | 0.79 | 0.90 | 0.92 | 0.05 | 0.93 | | | | | | | | | |
| LGG cases whole tumour | | | | | | | | | | | | | | |
| DSC | 0.35 | 0.70 | 0.89 | 0.58 | 0.93 | 0.85 | 0.83 | 0.85 | 0.54 | 0.77 | n.a. | | | |
| PPV | 0.27 | 0.55 | 0.86 | 0.43 | 0.93 | 0.77 | 0.88 | 0.90 | 0.43 | 0.74 | n.a. | | | |
| Sen | 0.52 | 0.93 | 0.92 | 0.89 | 0.93 | 0.95 | 0.78 | 0.80 | 0.75 | 0.80 | n.a. | | | |
DSC, Dice similarity coefficient; n.a., not available; PPV, positive predictive value; Sen, sensitivity.
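The Mean and SD columns summarise the per-case scores, with cases marked n.a. excluded. A small sketch of such a summary using Python's statistics module (the function name is ours; whether the original SDs are sample or population SDs is not stated, and the sample SD is used here as an assumption):

```python
from statistics import mean, stdev

def summarise(scores):
    """Mean and sample SD of per-case scores, skipping 'n.a.' entries."""
    nums = [s for s in scores if isinstance(s, (int, float))]
    return round(mean(nums), 2), round(stdev(nums), 2)

# toy example with one unavailable case
print(summarise([0.84, 0.80, 0.56, "n.a."]))
```

Filtering non-numeric entries before aggregation keeps the summary consistent with the number of cases actually segmented, which matters when a failed case would otherwise be silently counted as zero.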
Comparison of the mean results of the reproducibility (BraTS 2015 test set) and replicability (in-house image set) analysis of the 3D dual-path CNN9
| Data set: | | In-house image set | | BraTS 2015 test image set |
| Cases: | | HGG | MNG | LGG+HGG |
| Tumour core | DSC | 0.77 | 0.61 | 0.68 |
| | PPV | 0.72 | 0.66 | 0.83 |
| | Sen | 0.88 | 0.71 | 0.64 |
| Cases: | | LGG | | LGG+HGG |
| Whole tumour | DSC | 0.73 | | 0.85 |
| | PPV | 0.83 | | 0.85 |
| | Sen | 0.67 | | 0.88 |
BraTS, Brain Tumour Segmentation Challenge; CNN, convolutional neural network; 3D, three dimensions; DSC, Dice similarity coefficient; HGG, high-grade glioma; LGG, low-grade glioma; MNG, meningioma; PPV, positive predictive value; Sen, sensitivity.
Figure 2. Comparison of the expert segmentation (reference) and the three-dimensional (3D) dual-path CNN tumour core segmentation in the in-house data for high-grade glioma (HGG) and meningioma (MNG) cases, overlaid on contrast-enhanced T1-weighted images. Voxels misclassified by the 3D dual-path CNN are visible in HGG cases #07 and #12 (top and middle row). The 3D dual-path CNN failed to correctly outline the tumour and included normal brain structures in the left medial temporal lobe for MNG case #04 (bottom row). CNN, convolutional neural network.
Figure 3. Comparison of the expert segmentation (reference) and the three-dimensional (3D) dual-path CNN whole tumour segmentation in the in-house data for low-grade glioma cases, overlaid on FLAIR images. Voxels misclassified by the 3D dual-path CNN are visible bilaterally in the orbit in case #01 (top row); these should have been excluded by the skull stripping procedure. In case #09 (middle row), the 3D dual-path CNN misclassified contralateral, sequence-dependent FLAIR hyperintensities. CNN, convolutional neural network; FLAIR, fluid-attenuated inversion recovery.
A suggested reproducibility and replicability checklist for automatic medical image segmentation studies
| Data set—description of the image data set used for model development and validation | □ |
| • Image acquisition parameters | |
| • Data set size | |
| • Data excluded +reason | |
| • Link to the data set (if available) | |
| Data set preprocessing—description of the processing steps applied to the raw images before they can be fed to the segmentation model | □ |
| • List of all processing steps and corresponding parameters developed for the implementation | |
| • List of processing steps not included in the implementation (when segmentation model developed and validated on partially preprocessed data) | |
| • Statement if proprietary software was used | |
| • Link to the source code +dependencies | |
| Segmentation model—description of the model’s architecture used for the segmentation | □ |
| • Description of the model (layers, nodes, functions, etc) | |
| • Trained model | |
| • Framework used to build the model +version | |
| • Statement if proprietary software was used | |
| • Link to the source code +dependencies | |
| Postprocessing—description of all processing steps and corresponding parameters applied to the output of the segmentation algorithm before evaluation | □ |
| • List of all processing steps and corresponding parameters developed for the implementation | |
| • Statement if proprietary software was used | |
| • Link to the source code +dependencies | |
| Model development—description of the training/validation and optimisation strategies | □ |
| • Augmentation transformations and corresponding parameters used for training | |
| • Training/validation/testing split | |
| • Final training sample size | |
| • CV strategy +no of folds /no of training and evaluation runs | |
| • Optimisation algorithm +reference | |
| • Hyperparameter selection strategy | |
| • Hyperparameters (learning rate | |
| • Link to the training source code +dependencies | |
| Computing infrastructure—description of the hardware used | □ |
| • Name | |
| • Class of the architecture | |
| • Memory size | |
| Model evaluation—description of the model evaluation | □ |
| • Metrics average +variations | |
| • Reference segmentation source | |
| • Failed cases: number and reasons | |
| • Training and testing runtime | |
| • Link to the evaluation source code or platform | |
The update from the established checklists3 8 includes a new category, 'Data set preprocessing', and a new item in the 'Model evaluation' category: 'Failed cases: number and reasons'. We also regrouped the items into categories that provide a clearer structure for reporting, in particular for reproducibility and replicability studies.
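A checklist like this can also be encoded as a data structure so that a study report can be screened programmatically. A hypothetical sketch (the dictionary layout and function name are our own, and only two categories are shown in abbreviated form; the full item lists appear in the table above):

```python
# Two checklist categories, abbreviated from the table above.
CHECKLIST = {
    "Computing infrastructure": [
        "Name",
        "Class of the architecture",
        "Memory size",
    ],
    "Model evaluation": [
        "Metrics average +variations",
        "Reference segmentation source",
        "Failed cases: number and reasons",
        "Training and testing runtime",
        "Link to the evaluation source code or platform",
    ],
}

def missing_items(report):
    """Return the checklist items not covered by a report
    (a dict mapping category name -> list of reported items)."""
    return {cat: [item for item in items if item not in report.get(cat, [])]
            for cat, items in CHECKLIST.items()}

report = {"Computing infrastructure": ["Name", "Memory size"]}
print(missing_items(report)["Computing infrastructure"])  # ['Class of the architecture']
```

Such an encoding makes the checkbox column of the table a mechanical output rather than a manual judgement, which is in the spirit of the reproducibility reporting the checklist is meant to support.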