Félix Renard, Soulaimane Guedria, Noel De Palma, Nicolas Vuillerme.
Abstract
Medical image segmentation is an important tool for current clinical applications. It is the backbone of numerous clinical diagnosis methods, oncological treatments and computer-integrated surgeries. A new class of machine learning algorithms, deep learning algorithms, outperforms classical segmentation methods in terms of accuracy. However, these techniques are complex and can have a high range of variability, calling the reproducibility of the results into question. In this article, through a literature review, we propose an original overview of the sources of variability to better understand the challenges and issues of reproducibility related to deep learning for medical image segmentation. Finally, we propose 3 main recommendations to address these potential issues: (1) an adequate description of the framework of deep learning, (2) a suitable analysis of the different sources of variability in the framework of deep learning, and (3) an efficient system for evaluating the segmentation results.
Year: 2020 PMID: 32792540 PMCID: PMC7426407 DOI: 10.1038/s41598-020-69920-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1 The different steps of a DL framework are displayed in solid-line boxes: the steps related to the dataset (the data augmentation and cross-validation strategies), the DL architecture design, the training step (with the optimization procedure and the estimation of its hyperparameters), and the evaluation system. The different sources of variability are highlighted in dashed-line boxes: the variability linked to (A) the dataset, (B) the DL architecture, (C) the optimization procedure, (D) the hyperparameter estimation for the optimization and (E) the implementation and infrastructure.
Segmentation metrics
| Metric | Equation | Range | Meaning |
|---|---|---|---|
| Dice coefficient (DC) | DC = 2 \|Mask ∩ Ground Truth\| / (\|Mask\| + \|Ground Truth\|) | 0–1 | Spatial overlap between masks |
| True positive rate (TPR) | TPR = TP / (TP + FN) | 0–1 | Sensitivity |
| True negative rate (TNR) | TNR = TN / (TN + FP) | 0–1 | Specificity |
| Average volume distance (AVD) | AVD = max(d(Mask, Ground Truth), d(Ground Truth, Mask)) | 0–∞ | Precision |
Mask, segmentation mask; Ground Truth, ground-truth mask; TP, true positives, voxels that are correctly segmented as the region of interest; TN, true negatives, voxels that are correctly segmented as the background; FP, false positives, voxels that are incorrectly segmented as the region of interest; FN, false negatives, voxels that are incorrectly segmented as the background. d(A, B) corresponds to the directed average Hausdorff metric, defined as d(A, B) = (1/N) Σ_{a ∈ A} min_{b ∈ B} ‖a − b‖, where N is the number of pixels or voxels considered.
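As a minimal illustration of how these metrics are computed from binary masks, here is a generic NumPy sketch (not code from any of the reviewed articles; the function names are ours):

```python
import numpy as np

def dice(mask, gt):
    """Dice coefficient: spatial overlap between two boolean masks (range 0-1)."""
    inter = np.logical_and(mask, gt).sum()
    return 2.0 * inter / (mask.sum() + gt.sum())

def tpr(mask, gt):
    """True positive rate (sensitivity): fraction of the ground truth recovered."""
    tp = np.logical_and(mask, gt).sum()
    fn = np.logical_and(np.logical_not(mask), gt).sum()
    return tp / (tp + fn)

def tnr(mask, gt):
    """True negative rate (specificity): fraction of the background recovered."""
    tn = np.logical_and(np.logical_not(mask), np.logical_not(gt)).sum()
    fp = np.logical_and(mask, np.logical_not(gt)).sum()
    return tn / (tn + fp)

def directed_avg_hausdorff(a_points, b_points):
    """Directed average Hausdorff metric d(A, B): mean, over the points of A,
    of the distance to the closest point of B (building block of the AVD)."""
    diff = a_points[:, None, :] - b_points[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1).mean()
```

Applied to two masks over the same voxel grid, `dice` returns 1 only for a perfect overlap; a symmetric AVD can then be taken as the maximum of the two directed distances.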
The training size, the kind of data augmentation (DA), the presence of the DA term and of the validation set (VS) term, the training proportion and the cross-validation (CV) strategy in each article.
| Article | Training size | DA | DA term | VS term | Training proportion | CV strategy |
|---|---|---|---|---|---|---|
| Guo (2014) | | Patches | No | No | Not clearly detailed | LOO |
| de Brebisson (2015) | | Patches | No | Yes | 43% | No |
| Choi (2016) | | Patches | No | Yes | 75% | No |
| Stollenga (2015) | | Patches | Yes | No | 50%, 25% | No |
| Zhang (2015) | | Patches | No | Yes | 87.50% | LOO |
| Andermatt (2016) | | Yes | Yes | No | 25% | No |
| Bao (2016) | | Patches | No | No | 50%, 50% | No |
| Birenbaum (2016) | | Patches | Yes | Yes | 80% | LOO |
| Brosch (2016) | | Not described | No | Yes | 46%, 95%, 80% | No/LOO/No |
| Chen (2016a) | | Not described | No | No | 25% | LOO |
| Ghafoorian (2016b) | | Patches | No | Yes | 90% | No |
| Ghafoorian (2016a) | | Patches | No | Yes | 89% | No |
| Havaei (2016b) | | Not described | No | Yes | 70% | No |
| Havaei (2016a) | | Patches | Yes | Yes | 46%, 84% | No/7 FO |
| Kamnitsas (2017) | | Patches | Yes | Yes | 80%, 72%, 44% | 5 FO |
| Kleesiek (2016) | | Patches | Yes | No | 50%, 50% | 2 FO/3 FO |
| Mansoor (2016) | | Patches | No | No | Not clearly detailed | Not clearly detailed |
| Milletari (2016a) | | Patches | No | Yes | 82%, 33% | No |
| Moeskops (2016a) | | Patches | No | No | 20%, 25%, 33% | LOO/No/No |
| Nie (2016b) | | Patches | No | No | Not clearly detailed | LOO |
| Pereira (2016) | | Patches | Yes | Yes | 46%, 84% | No |
| Shakeri (2016) | | Patches | Yes | Yes | 66%, 50% | 3 FO/2 FO |
| Zhao (2016) | | Patches | No | No | Not clearly detailed | Not clearly detailed |
The CV method can be leave-one-out (LOO) or k-fold (k FO). For example, the article by Kamnitsas et al. [47] presents 3 datasets with training sizes , and . The data augmentation is based on a patch strategy (referred to as such by the authors in the article). They also explicitly described whether they used a validation set. The training proportions of the 3 datasets are 80%, 72% and 44%. Finally, the authors used 5-fold CV.
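The two CV strategies compared in the table can be sketched with a generic split helper (pure Python, illustrative only; the function name is ours). Leave-one-out is simply the k = n case of k-fold:

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Partition sample indices into k folds; yield (train, test) index pairs.
    With k == n_samples this degenerates to leave-one-out (LOO)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed so the split is reproducible
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield sorted(train), sorted(test)
```

For a dataset of 10 subjects, `list(k_fold_splits(10, 5))` gives 5 disjoint test folds of 2 subjects each, while `k_fold_splits(10, 10)` reproduces an LOO strategy.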
The different DL models and kinds of datasets (number of datasets, denoted Nb DS, and whether each dataset is public or private), and the kind of evaluation (type, number and variability of the measures) in each article.
| Article | DL architecture | Nb DS | Dataset type | Type of Meas. | Nb of Meas. | Var. of Meas. |
|---|---|---|---|---|---|---|
| Guo (2014) | SAE | 1 | Private | DC | 1 | Values |
| de Brebisson (2015) | CNN | 1 | Public | DC | 1 | No |
| Choi (2016) | CNN | 2 | Public | DC, P, R | 3 | Values, graph |
| Stollenga (2015) | RNN | 2 | Public | DC, MHD, AVD | 3 | No |
| Zhang (2015) | CNN | 1 | Private | DC, MHD | 2 | Values, graph * |
| Andermatt (2016) | RNN | 1 | Public | DC, MHD, AVD | 3 | No |
| Bao (2016) | CNN | 2 | Public | DC, VD, SD, TPR, FPR | 5 | No |
| Birenbaum (2016) | CNN | 1 | Public | DC, Score | 2 | No |
| Brosch (2016) | CNN | 3 | 2 public & 1 private | DC, AVD, LTPR, LFPR | 4 | Graph |
| Chen (2016a) | CNN | 1 | Public | DC, MHD, AVD | 3 | No |
| Ghafoorian (2016b) | CNN | 1 | Private | DC, AUC | 2 | Graph |
| Ghafoorian (2016a) | CNN | 1 | Private | DC, AUC | 2 | Graph |
| Havaei (2016b) | CNN | 3 | Public | DC, VD, SD, TPR, FPR | 5 | No |
| Havaei (2016a) | CNN | 2 | Public | DC, Sens., Spe. | 3 | Graph |
| Kamnitsas (2017) | CNN | 3 | 1 private & 2 public | DC, P, Sens., ASSD, HD | 5 | Values, graph |
| Kleesiek (2016) | CNN | 4 | 3 public & 1 private | DC, Sens., Spe. | 3 | Values, graph |
| Mansoor (2016) | SAE | 1 | Private | DC, ALSD | 2 | Values, graph |
| Milletari (2016a) | CNN | 2 | Private | DC, MDEC, FR | 3 | Graph |
| Moeskops (2016a) | CNN | 3 | Public | DC, MSD | 2 | Values, graph |
| Nie (2016b) | CNN | 1 | Private | DC | 1 | Values * |
| Pereira (2016) | CNN | 2 | Public | DC, PPV, Sens. | 3 | Graph |
| Shakeri (2016) | CNN | 2 | 1 public & 1 private | DC, HD, CMD | 3 | Graph |
| Zhao (2016) | CNN | 1 | Public | DC | 1 | Graph |
For the types of measures, DC Dice coefficient, P precision, R recall, MHD modified Hausdorff distance, AVD average volume distance, TPR true positive rate, FPR false positive rate, AUC area under the curve, Sens. sensitivity, and Spe. specificity. The variability of a measure corresponds to the presence of the standard deviation value or a display in a graph. The (*) means that the values for all subjects are reported. For example, the article by Kamnitsas et al. [47] is based on a CNN. Their models are evaluated on 3 datasets, one private and two public. To evaluate their segmentations, they used the DC, P, Sens., ASSD and HD metrics (5 different metrics). The variability of the measures is displayed in a graph, and the corresponding values are reported in the text.
The kind of optimization, whether the hyperparameters (HPs) are handcrafted, and, for the learning rate, the batch size and the dropout regularization, the value (V.) and the presence (P.) of the term in each article.
| Article | Optimization | HP handcrafted | Learning rate (V./P.) | Batch size (V./P.) | Dropout (V./P.) |
|---|---|---|---|---|---|
| Guo (2014) | GBM | Yes | No/no | No | No |
| de Brebisson (2015) | SGD (M) | Yes | Yes (0.05)/yes | Yes/yes | No |
| Choi (2016) | SGD (M) | Yes | Yes (0.001)/yes | No/yes | Yes/yes |
| Stollenga (2015) | RMS-prop | Yes | Yes (0.01)/yes | No | Yes/yes |
| Zhang (2015) | SGD (M) | Yes | Yes (0.0001)/yes | No | Yes/yes |
| Andermatt (2016) | AdaDelta | Yes | Omitted | No | Yes/yes |
| Bao (2016) | Not described | Yes | No | No | No |
| Birenbaum (2016) | AdaDelta | Yes ** | Omitted | No | Yes/yes |
| Brosch (2016) | AdaDelta | Yes | Sensitivity ratio yes/yes | No | No |
| Chen (2016a) | Not described | Yes | No | No | No |
| Ghafoorian (2016b) | RMS-prop | Yes | No/yes | Yes/yes | Yes/yes |
| Ghafoorian (2016a) | RMS-prop | Yes | No/yes | Yes/yes | Yes/yes |
| Havaei (2016b) | SGD (M) | Yes | Yes (0.001)/yes | No | No/yes |
| Havaei (2016a) | SGD (M) | No (grid search) | Yes (0.005)/yes | No/yes | Yes/yes |
| Kamnitsas (2017) | RMS-prop | Yes | Yes (0.0001)/yes | Yes/yes | Yes/yes |
| Kleesiek (2016) | SGD | Yes | Yes (0.00001)/yes | Yes/yes | No |
| Mansoor (2016) | SGD (M) | Yes ** | No | Yes/yes | No |
| Milletari (2016a) | SGD (M) | Yes | Yes (range of values)/yes | Yes/yes | Yes/yes |
| Moeskops (2016a) | RMS-prop | No (not explained) | No/yes | No/yes | No/yes |
| Nie (2016b) | Not described | Yes | No/yes | No | No |
| Pereira (2016) | SGD (M) | Yes | Yes (range of values)/yes | Yes/yes | Yes/yes |
| Shakeri (2016) | SGD (M) | Yes | Yes (0.01)/yes | No | Yes/yes |
| Zhao (2016) | Not described | Yes | No | No | No |
The (M) in the optimization column signifies that momentum is used. The ** in the HP handcrafted column means that several DL architectures were tested. For example, the article by Kamnitsas et al. [47] used an RMS-prop strategy for optimization. The different hyperparameters are handcrafted. The learning rate, the batch size and the dropout are mentioned in the text, and their corresponding values are given.
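To make the role of these hyperparameters concrete, here is a minimal NumPy sketch of the SGD (M) update rule that most of the reviewed articles rely on; the learning rate and momentum values are illustrative defaults, not values taken from any article:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: v <- momentum*v - lr*grad ; w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy usage: minimize f(w) = w**2 (gradient 2*w), starting from w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2.0 * w, v, lr=0.1, momentum=0.9)
```

Changing the learning rate, the momentum coefficient, the batch over which `grad` is estimated, or the dropout mask all change the trajectory of `w`, which is exactly why the review asks for these values to be reported.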
In the second column, the different implementations are described (Theano, Mat-ConvNet, Caffe, Keras, Pylearn2 and Lasagne).
| Articles | Implementation | Infrastructure | Open source |
|---|---|---|---|
| Guo (2014) | Not described | Not described | No |
| de Brebisson (2015) | Theano | NVIDIA Tesla K40 GPU-12GB | No |
| Choi (2016) | Mat-ConvNet | GPU (GTX TITAN) | No |
| Stollenga (2015) | Not described | NVIDIA GTX TITAN X GPU-12GB | No |
| Zhang (2015) | Not described | Tesla K20c GPU | No |
| Andermatt (2016) | Caffe | NVIDIA GTX Titan X GPU-12GB | No |
| Bao (2016) | Not described | Not described | No |
| Birenbaum (2016) | Keras + Theano | NVIDIA GeForce GTX 980 Ti GPU | No |
| Brosch (2016) | Own implementation | GeForce GTX 780 | No |
| Chen (2016a) | Caffe | NVIDIA TITAN X GPU | Yes (*) |
| Ghafoorian (2016b) | Theano | Not described | No |
| Ghafoorian (2016a) | Not described | Titan X card | No |
| Havaei (2016b) | Keras | NVIDIA Titan X GPU | No |
| Havaei (2016a) | Pylearn2 | NVIDIA Titan Black card | No |
| Kamnitsas (2017) | Theano | NVIDIA GTX Titan X GPU-12GB | Yes |
| Kleesiek (2016) | Theano | NVIDIA Titan-3GB | No |
| Mansoor (2016) | Not described | Not described | No |
| Milletari (2016a) | Caffe | NVIDIA Tesla K40 or Titan X (12GB); tested on NVIDIA GTX 980 (4GB) | No |
| Moeskops (2016a) | Not described | NVIDIA Tesla K40 GPU (**) | No |
| Nie (2016b) | Caffe | Not described | No |
| Pereira (2016) | Theano + Lasagne | GPU NVIDIA GeForce GTX 980 | Yes |
| Shakeri (2016) | Mat-ConvNet | Described on GitHub | Yes |
| Zhao (2016) | Not described | Not described | No |
http://deeplearning.net/software/theano/.
http://www.vlfeat.org/matconvnet/.
https://caffe.berkeleyvision.org/.
https://keras.io/.
http://deeplearning.net/software/pylearn2/.
https://lasagne.readthedocs.io/en/latest/.
For the infrastructure details, the materials are described as they are referenced in the articles. If the global memory is reported in an article, it is noted. The last column, 'Open source', shows whether the source code is available. The (*) indicates that the source code is not available but a detailed prototype of the algorithm is provided. The (**) indicates that the infrastructure is detailed in the Acknowledgements section. For example, the article by Kamnitsas et al. [47] used the Theano implementation on an infrastructure based on an NVIDIA GTX Titan X GPU-12GB. Their code is released as open source.
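A first practical step toward the reproducibility discussed here is pinning every random seed of the pipeline. The sketch below is generic (the function name is ours; the framework-specific calls mentioned in the comments are the usual ones, to be adapted to the implementation actually used):

```python
import os
import random

import numpy as np

def seed_everything(seed=42):
    """Fix the common Python-side sources of randomness in a DL pipeline."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization
    random.seed(seed)                         # Python RNG (e.g. data shuffling)
    np.random.seed(seed)                      # NumPy RNG (e.g. data augmentation)
    # A DL framework adds its own generators, e.g. torch.manual_seed(seed) or
    # tf.random.set_seed(seed), plus deterministic kernels where supported.
```

Note that seeding alone does not remove all variability (GPU kernels and multi-threading can stay non-deterministic), which is why the article recommends reporting the implementation and infrastructure as well.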
Figure 2 The left side of the figure concerns the description of the dataset, and the right side the optimization. The training proportion is described in 83% of the selected articles. The data augmentation term is described in 35% of the selected articles, and the validation set term in 57%. For the optimization procedure, the name of the optimization algorithm is missing in 17% of the selected articles. Regarding the learning rate, dropout and batch size hyperparameters, their values are available in only 57%, 52% and 35% of the articles, respectively; these coefficients are mentioned in the text without any values in 19%, 9% and 13% of the articles, respectively. In the end, only 9% of the evaluated articles provide enough information to be reproducible.
Figure 3 Four different sources of variability. (A) There is a large variability in the dataset size: 68.5% of the datasets contain 50 samples or fewer. (B) In general, no cross-validation strategy is considered (more than 50% of the articles). (C) Five different optimization algorithms are introduced in the different articles; the main one is SGD with momentum (SGD (M)). The gradient-based method (GBM) and stochastic gradient descent (SGD) are only general terms. (D) There are 5 different implementations of DL frameworks. Even though the Theano implementation is used in 42.9% of the considered articles, there is no consensus among the implementations.
Figure 4 The number of evaluation measures used in each article. Note that at least 3 measures are required to correctly evaluate a segmentation result.
Figure 5 The proposals are separated into three main parts: (A) an adequate and complete description of the DL framework for reproducibility purposes, (B) an analysis of the different sources of variability, and (C) an efficient evaluation system for image segmentation.