
Variability and reproducibility in deep learning for medical image segmentation.

Félix Renard1,2, Soulaimane Guedria3,4, Noel De Palma3, Nicolas Vuillerme4,5.   

Abstract

Medical image segmentation is an important tool for current clinical applications. It is the backbone of numerous clinical diagnosis methods, oncological treatments and computer-integrated surgeries. A new class of machine learning algorithms, deep learning algorithms, outperforms classical segmentation methods in terms of accuracy. However, these techniques are complex and can have a high range of variability, calling the reproducibility of the results into question. In this article, through a literature review, we propose an original overview of the sources of variability to better understand the challenges and issues of reproducibility related to deep learning for medical image segmentation. Finally, we propose 3 main recommendations to address these potential issues: (1) an adequate description of the framework of deep learning, (2) a suitable analysis of the different sources of variability in the framework of deep learning, and (3) an efficient system for evaluating the segmentation results.


Year:  2020        PMID: 32792540      PMCID: PMC7426407          DOI: 10.1038/s41598-020-69920-0

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Medical imaging plays a central role in medicine today because it can reveal the anatomy of the patient. However, to leverage the full potential of medical images, it is necessary to analyze them via image processing. One of the main clinical tools is image segmentation[1, 2]. Medical image segmentation can be defined as an automatic (or semiautomatic) process to detect boundaries within a 2D or 3D image. It is based on information such as pixel intensity, texture and anatomical knowledge. The result of segmentation can then be used in further applications and in gaining insights[2]; examples include the quantification of tissue volumes[3, 4], diagnosis[5, 6], the localization of pathology[7, 8], the study of anatomical structure[9, 10], treatment planning[11], and computer-integrated surgery[12]. Manual medical image segmentation leads to two main issues: much time is needed for delineation, and reproducibility is called into question. First, the time needed to segment cannot be reduced below a certain point, and it grows with the number and size of the images. Since both are increasing, owing to easier access to medical imaging facilities and improvements in acquisition technologies, manual segmentation is becoming intractable. Second, reproducibility corresponds to the agreement between the results of multiple measurements of the data (here, the segmentation results) under the same methodology. In medical image segmentation, it is well known that there is inter- and intraoperator variability. The former relates to the observed differences in the segmentation results obtained by two different operators, while the latter relates to the observed differences between two segmentations performed by the same operator at two different times. Due to the crucial role of segmentation in medical diagnostics and treatments, the reproducibility of the method is fundamentally important.
These two issues lead one to consider automatic segmentation. Automatic segmentation consists in determining a prediction model and its inherent parameters relative to a given class of problems (for example, the kind of imaging performed or the organs imaged). These parameters can be divided into two classes: the hyperparameters associated with the model and the parameters estimated from the dataset. The aim of automatic segmentation is to estimate the best parameters to obtain highly accurate results on the training dataset while maintaining good generalization to other datasets of the same class of problem, also called “test datasets”. In other words, the algorithm must avoid perfectly fitting the training set while yielding poor accuracy on the testing set. This problematic phenomenon is called “overfitting” (see page 108 of the book[13]). The rapid development of new automatic segmentation algorithms since the 2000s is strongly connected to the rise of machine learning[2]. During the last decade, a specific field of machine learning and artificial neural networks, called “deep learning” (DL)[13, 14], has outperformed classical segmentation methods[15]. A neural network with several hidden layers is considered a ‘deep’ neural network, hence the term ‘deep learning’[14]. This success has several explanations, for example, unsupervised feature extraction via convolutional layers and the ability to deal with very large datasets via efficient optimization methods such as backpropagation of the gradient (see chapter 6.5 of the book by Goodfellow et al.[13]). Several DL architectures have been applied to medical image segmentation, including fully convolutional networks (FCNs)[16] and U-Net[17] (see Litjens et al.[15] for a recent review). FCNs[16] are built from locally connected layers, such as convolution, pooling and upsampling layers. An FCN is composed of two main parts: the downsampling and upsampling paths.
The downsampling path captures contextual information, whereas the upsampling path recovers spatial information. Moreover, skip connections between layers recover fine-grained spatial information that is potentially lost in the pooling and downsampling layers. U-Net[17] is built upon FCNs. The main difference is that each downsampling scale is linked to the corresponding upsampling scale with a concatenation operator. In this way, each upsampling scale has the information of the corresponding downsampling scale and the lower upsampling scale, leading to better segmentation. However, although DL algorithms perform well, they are complex. A number of factors may explain the variability in the obtained results: the intrinsic variability of the dataset, the stochastic processes during optimization, the choice of the hyperparameters of the optimization and regularization processes, and the choice of the DL architecture itself. This variability in the different parts of the framework leads to difficulties in analyzing reproducibility and in making comparisons between frameworks. In addition, this variability leads to numerous parameters and hyperparameters that must be set. Furthermore, as highlighted in Joelle Pineau’s reproducibility checklist[18], provided during NeurIPS 2019, describing the DL methods becomes its own challenge for reproducibility. Moreover, the strategy for evaluating the segmentation results, and thereby the variability of the method, is complex. There is a plethora of metrics[19] for analyzing segmentations, leading to various ways of comparing the methods. Along these lines, at least three main questions about variability and reproducibility can be formulated.
Question 1: Is there enough information in published articles in the field of medical image segmentation with DL to correctly reproduce the results?
Question 2: If the information is provided, has the variability in the several steps of the DL framework been considered?
Question 3: Does the evaluation system for the segmentation results correctly reflect this variability?
These three questions are crucial for the application and potentially the evaluation of the segmentation algorithms.
After focusing on the concept of reproducibility in medical image segmentation and on how to consider the different sources of variability in DL, we will review the literature to provide an overview of the practice of reproducibility in the field of medical image segmentation with DL, based on three main topics: (1) the description of the methods, (2) the analysis of variability and (3) the evaluation system. On the basis of this synthesis, we will propose recommendations for appreciating the results of new DL strategies.

Related work

In this section, we will broadly address the issues of the reproducibility and evaluation of segmentation in medical imaging. Then, we will outline several sources of variability in the DL framework that can lead to difficulties for reproducibility.

Reproducibility and evaluation of segmentation in medical imaging

Reproducibility is a popular topic in science[20]. Numerous articles[21, 22] reveal a potential crisis of reproducibility across the different fields of science: most scientists have experienced a failure to reproduce results[21] (more than 50% in the case of their own work in medicine, physics and engineering and more than 75% in the case of work by another person in the same fields). In the rest of the article, we will follow the definition of the report of the National Academies of Sciences, Engineering, and Medicine[20]: reproducibility means obtaining consistent results using the same input data, computational steps, methods, and conditions of analysis; it is synonymous with computational reproducibility. Moreover, this report[20] (recommendation 5-1, page 7) recommends that researchers provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. These uncertainties include stochastic uncertainties. Reproducibility can be assessed with different procedures. First, reproducibility can be analyzed with the intraclass correlation coefficient (ICC) proposed by Shrout and Fleiss[23]. The score obtained, which is between 0 and 1 (indicating poor and perfect reproducibility, respectively), enables a comparison between intra-individual and inter-individual variabilities. Another statistical tool, generalizing the ICC, is analysis of variance (ANOVA)[24]. It provides a collection of tools focusing on the variability of the means among groups; notably, ANOVA can deal with multiple factors. One of the main sources of variability in machine learning originates from the difference between the observed samples of the dataset and the real distribution of the data. The fact that the learning step of the algorithm is performed on only a part of the distribution can affect the reproducibility and particularly the replication of the results.
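As an illustration of the ICC just described, the one-way random-effects ICC(1,1) can be computed from the standard ANOVA decomposition. The following is a minimal sketch (the function name and interface are ours, not from the cited works):

```python
import numpy as np

def icc_1(scores):
    """One-way random-effects ICC(1,1) for an (n subjects x k raters) array.

    Computed from the one-way ANOVA decomposition: between-subject and
    within-subject mean squares.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    subject_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    msb = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)         # between subjects
    msw = ((scores - subject_means[:, None]) ** 2).sum() / (n * (k - 1))  # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)

# Two raters in perfect agreement on three subjects yield an ICC of 1.
print(icc_1([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

A score near 1 means that inter-subject variability dominates rater variability, i.e., the measurement is reproducible.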
A class of tools called “cross-validation” (CV)[25] is available for studying this variability; these methods receive special attention in the next section, concerning variability in a dataset. Moreover, segmentation in the specific field of medical imaging is complex in terms of reproducibility for several reasons. First, the available datasets are generally limited: the number of samples is usually less than 100. Second, each segmentation task must be considered with regard to the image modality (for example, MRI, CT or ultrasound) and the organ studied[26]. Furthermore, the masks in segmentation are usually generated manually, which leads to some intra- and inter-rater variability. Consequently, there is no certain truth but only a gold standard. Additionally, there are several metrics for evaluating segmentation, such as the Dice coefficient (DC) and the modified Hausdorff distance. Each metric focuses on a specific aspect of the segmentation[19]. For example, a metric can correctly reflect the good overlap between a segmentation mask and a gold standard but cannot highlight the smoothness of the contour. To correctly describe the quality of a segmentation, several metrics are necessary[19, 26]. An adequate system of evaluation permits accurate consideration of the variability in DL frameworks.

Variability in DL frameworks

In the next sections, five different kinds of variability are presented. The DL framework and its related sources of variability are displayed in Fig. 1.
Figure 1

The different steps of a DL framework are displayed in solid-line boxes: the steps related to the dataset (the data augmentation and cross-validation strategies), DL architecture design, the training step (with the optimization procedure), the estimation of the hyperparameters of the optimization, and the evaluation system. The different sources of variability are highlighted in dashed-line boxes: the variability linked to (A) the dataset, (B) the DL architecture, (C) the optimization procedure, (D) the hyperparameter estimation for the optimization and (E) the implementation and infrastructure.


Variability in the dataset

To infer a segmentation with a DL model (and, more globally, a supervised machine learning model), the classic method consists in splitting the dataset into three parts. The first part, the “training set”, is used to estimate the parameters of the model: it is composed of the raw data and the corresponding labels. Based on the raw data, the DL algorithm infers results that are compared to the labels, and the DL parameters are then optimized to minimize the error between the results and the labels. The second part is the “validation set”, which is more specific to the DL community. It estimates the unbiased error of the trained DL model and permits training to be stopped early to avoid overfitting. It is not mandatory and is used in practice when the dataset has enough samples. Finally, the last part, called the “testing set”, provides an unbiased evaluation of the final DL model. The proportions of the different parts depend on the initial number of samples and can significantly affect the expected degree of generalization. Consider a trivial example where only one sample is chosen for the testing set: the evaluation of the DL model then depends greatly on the selected sample. In the same way, selecting few samples for the training set leads the model to learn the training data perfectly without generalizing. To avoid bias in the data selection, strategies called “cross-validation” are performed. These strategies consist in dividing the dataset into several folds and then assigning these folds to the training, validation and testing sets. At the end of the DL model estimation and evaluation processes, the folds are reassigned for novel estimations, and so on. Cross-validation strategies permit one to address variability in the data. Note that the number of parameters to estimate in a DL model is often larger than the number of images in the dataset.
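The fold assignment described above can be sketched in a few lines. This is an illustrative, simplified version (no validation fold, equal fold sizes) with hypothetical names:

```python
def k_fold_splits(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]    # held-out fold
        train = [idx for idx in indices if idx not in test]  # remaining folds
        yield train, test

# With 10 samples and 5 folds, each sample is held out exactly once.
splits = list(k_fold_splits(10, 5))
print([len(test) for _, test in splits])  # [2, 2, 2, 2, 2]
```

Rotating the held-out fold in this way yields several estimations of the model and therefore an estimate of the variability due to the data split.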
Moreover, in medical imaging segmentation, the heterogeneous appearance of the target organ (anatomical variability) or of the lesions (size, shape or position) poses a great challenge. One solution, called “data augmentation”[27], generates new samples by applying different transformations to the dataset (e.g., rotation or flipping). In this way, unseen target organs or lesions can potentially be approximated. However, this also adds sources of variability in the general framework, since there is no consensus on which transformation to perform and the parameters of the transformation are generally randomly chosen.
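A minimal sketch of such label-preserving transformations (the choice of flips and 90-degree rotations is illustrative, not the article's protocol):

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and rotate a 2-D image by a multiple of 90 degrees.

    The same transformation must also be applied to the segmentation mask
    so that the image/label pair stays consistent.
    """
    if rng.random() < 0.5:
        image = np.fliplr(image)
    return np.rot90(image, k=int(rng.integers(0, 4)))

rng = np.random.default_rng(0)
sample = np.arange(9).reshape(3, 3)
augmented = augment(sample, rng)
print(augmented.shape)  # (3, 3): the geometry is preserved
```

Because the transformation parameters are drawn at random, two training runs with the same data but different draws see different augmented samples, which is precisely the extra source of variability noted above.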

Variability in the optimization

This section focuses specifically on the variability of the optimization with an already estimated and constant set of hyperparameters. One of the main factors of complexity is the very large number of model parameters to be estimated. Training these parameters in DL models is very challenging: estimating the weights is generally an extremely difficult optimization task involving a stochastic process. Each weight in the DL algorithm corresponds to a parameter (which can be seen as a dimension) of the cost function of the optimization. DL models often have millions of parameters, making the search space to be evaluated by the algorithm extremely high dimensional, in contrast to classic machine learning algorithms. Moreover, the addition of each new dimension dramatically increases the distance between points in this high-dimensional space; more precisely, the number of possible distinct configurations of a set of parameters increases exponentially as the number of parameters increases. This is often referred to as the “curse of dimensionality” (see page 155 of Goodfellow et al.[13]). In addition, the cost function is generally nonconvex (see page 282 of Goodfellow et al.[13]). These facts lead to several issues: the presence of local minima and flat regions, combined with the high dimensionality of the search space. The best general algorithm known for solving this problem is stochastic gradient descent (SGD) (see chapter 5.9 of the book[13]), where the model weights are updated at each iteration using the backpropagation-of-error algorithm. However, there is no guarantee that the DL estimation will converge to a good solution (or even a good local optimum), that the convergence will be fast, or that convergence will occur at all[28]. Nevertheless, recent work suggests that local minima and flat regions may be less challenging than previously believed[29-31].
From Choromanska et al.[29], it appears that almost all local minima have very similar function values to the global optimum, and hence, finding a local minimum is sufficient. These last results have been obtained for classification tasks. Furthermore, the important convolutional step of segmentation is not considered in Choromanska et al.[29] or Dauphin et al.[30]. To the best of our knowledge, only one conference article[32] addresses this issue of stochastic optimization uncertainties in medical imaging segmentation with DL. The authors show that DL models estimated several times with the same data show differences, but the results obtained on the evaluated metrics are not significantly different.
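As a toy illustration (ours, not the article's), even plain gradient descent on a simple nonconvex one-dimensional cost converges to different local minima depending on the initialization, which is one ingredient of the run-to-run variability discussed above:

```python
import numpy as np

def gradient_descent(w0, lr=0.01, steps=2000):
    """Minimize the nonconvex cost f(w) = sin(3w) + 0.1 w^2 from a start w0."""
    w = w0
    for _ in range(steps):
        grad = 3 * np.cos(3 * w) + 0.2 * w  # derivative of the cost
        w -= lr * grad
    return w

# Three starting points lead to different local minima of the same cost.
minima = sorted(round(float(gradient_descent(w0)), 2) for w0 in (-3.0, 0.0, 3.0))
print(minima)
```

In a DL framework, the initialization is random (as is the mini-batch order in SGD), so repeated trainings on the same data can end at different weight configurations.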

Variability in the hyperparameters

The hyperparameters correspond to the global settings of an algorithm, and in machine learning, each hyperparameter impacts the results differently[33]. Several hyperparameters must be fitted before the training of the DL model, for example, the learning rate for optimization and the dropout percentage for regularization[13]. There are different ways to set them. The first is manual configuration. This strategy limits the exploration space, but the computation time is relatively short compared to those of the other methods, since only a rough approximation of the best hyperparameters is expected. The second kind of strategy is based on automatic space exploration. The classic method, called “grid search”, tests every combination of hyperparameters. It will find the best set of hyperparameters within the grid, but the computational cost increases greatly with the number of hyperparameters. Another strategy, called “random search”, randomly samples the sets of hyperparameters to be evaluated. This method generally cannot reach the optimal values but approximates them in fewer iterations than grid search. A newer strategy[34], called “Bayesian optimization”, automatically infers a new combination of hyperparameters based on previous evaluations. In this case, the space exploration is intermediate and is driven by experience, and the cost of exploration is lower than that of a grid or random search.
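The difference in budget between the two automatic strategies can be made concrete with a small sketch (the hyperparameter names and values are hypothetical):

```python
import itertools
import random

learning_rates = [0.1, 0.01, 0.001]
dropout_rates = [0.0, 0.25, 0.5]

# Grid search: every combination is evaluated (3 x 3 = 9 trainings).
grid = list(itertools.product(learning_rates, dropout_rates))

# Random search: a fixed budget of sampled combinations (here, 4 trainings).
rng = random.Random(0)
randomized = [(rng.choice(learning_rates), rng.choice(dropout_rates))
              for _ in range(4)]

print(len(grid), len(randomized))  # 9 4
```

Each candidate combination would require a full training run, which is why the grid becomes intractable as the number of hyperparameters grows.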

Variability in the DL architecture

Here, only the number of nodes, the number of layers, the kinds of layers (for example, convolutional, pooling, or dense) and the connections among the layers are considered part of the architecture. Even with only these four parameters, the number of possible architectures is infinite. In practice, three main strategies are used to select the architecture. The first consists in selecting a well-known DL model that has already proved its performance in previous work[15], such as U-Net[17] for image segmentation. This strategy is used more often in clinical application fields, but it is not expected to provide the best architecture for a specific problem. Another strategy consists in manually handcrafting the DL architecture, which has led to a plethora of architectures[15]. However, it does not guarantee the best architecture, and modifications of the tested architecture are generally not considered. The final strategy, called “network architecture search”, is to automatically create a DL architecture through optimization for a specific task[35]. The drawback is that approximating the best architecture has a very high cost in time and resources: for instance, the network architecture search proposed in[36] tested 20,000 architectures in 4 days with 500 graphics processing units (GPUs). Estimating the minimal network architecture needed to achieve a certain segmentation accuracy on a given dataset would enable variability in the DL architecture to be avoided; however, as discussed in the review[37], this remains a challenge.

Variability in the middleware and the infrastructure

The previous sections focused on algorithmic sources of variability in DL; this section considers the possible variability due to the middleware and the infrastructure. There are many toolboxes for implementing a DL framework. To the best of our knowledge, no publication has addressed the problem of reproducibility in DL with regard to the middleware. Implementations differ, for example, in programming language and in their capacity to use a GPU; a review of different implementations and their characteristics can be found in[38]. The learning phase in DL can be a very long process, considering the complexity of the DL architecture and the dataset size. As previously explained, the search for hyperparameters can also be prohibitive. To improve the processing time, several solutions based on the infrastructure are considered. Different kinds of infrastructure[39] can be used, such as a central processing unit (CPU), GPU, or tensor processing unit (TPU). However, some technical characteristics, such as memory precision for different memory sizes, can affect the accuracy of the results[40]. As another example, numerical operations performed on a GPU can be nondeterministic, leading to nonreproducible results[41]. Another possibility for accelerating the processing is choosing a parallel or distributed DL model. These techniques come with their own methods that potentially impact the reproducibility of the outcome. For an overview of parallel and distributed models and their challenges, the interested reader can refer to[42, 43].
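On the software side, the controllable part of this stochasticity is usually handled by fixing the random seeds, as in the following minimal sketch (ours; GPU nondeterminism additionally requires framework-specific settings not shown here):

```python
import random

import numpy as np

def simulated_run(seed):
    """Simulate the random choices of a training run: weight initialization
    and data shuffling. With a fixed seed, the run is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    weights = np.random.randn(4).tolist()       # stands in for weight initialization
    batch_order = random.sample(range(10), 10)  # stands in for data shuffling
    return weights, batch_order

# The same seed reproduces exactly the same "run".
assert simulated_run(42) == simulated_run(42)
```

Reporting the seeds (or averaging over several of them) is one concrete way to characterize this source of variability when publishing results.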

Methods

In this section, we first introduce how the literature review was performed, and then, we briefly describe the different metrics.

Literature review

There is no standard for the reproducibility or evaluation of DL in medical image segmentation. The aim of this review is to reflect common practices for DL in medical image segmentation. To fulfill this expectation, this review focuses on three goals: (1) to inspect how the methods are described to enable the work to be reproduced, (2) to present the variety of methodologies and highlight the variability among DL frameworks and (3) to outline the kinds of evaluations used in DL. To observe the variability of the methodology and evaluations in the literature, we focus on the 23 articles presented in the review article[15] in the specific section “Tissue/anatomy/lesion/tumor segmentation”. This review article was chosen because it was the most relevant found on Google Scholar (with the mandatory keywords ‘medical image segmentation neural network’ and at least one of the keywords ‘review’ or ‘survey’) among more than 2300 hits (in December 2019). All the considered articles propose recent strategies: the oldest was published in 2014[44], and the mean year of publication is 2016. Moreover, the citation counts on Google Scholar (in December 2019) have a median of 97 (min = 20, max = 1074). To obtain a more recent overview, we also selected 3 reviews of medical image segmentation methods[37, 45, 46]. We focus specifically on how the problem of variability and reproducibility is addressed in the scientific literature: the possible variability introduced by the data itself, by the optimization strategy and associated hyperparameters, by the middleware and the infrastructure, and by the evaluation measure. For all the inspected parameters and evaluations, we determine the presence of the terms and their potential values. This consideration is important for being able to reproduce the different works. When a framework is described, we determine whether the correct terms are used appropriately.
To highlight this phenomenon, we consider the kind of algorithm used in the optimization strategy. For the data variability, we consider whether the DL algorithm is tested on several datasets, whether they are public or private, the number of datapoints available, whether data augmentation has been performed, the proportion of training, validation and testing sets and the possible application of a cross-validation method. For the optimization, we examine whether different parameters are recorded (the optimization strategy, learning rate, batch size, and presence of dropout regularization). We also investigate whether the hyperparameters of the optimization are hand-crafted or automatically optimized (and whether this information is available). For the middleware and infrastructure considerations, we report whether these details are provided. Special attention is also paid to the implementation of the DL model and the processing unit considered. We also determine whether the calculations are performed on a distributed system, which can be a large source of variability itself. For the evaluation, we consider the number and kinds of measures, and whether the variability of the results is described (the presence of standard deviations).

Metric evaluation

The different estimations of the DL models are evaluated with the DC; the true positive rate (TPR), also called the sensitivity (Sens.); the true negative rate (TNR), also called the specificity (Spef.); and the average volume distance (AVD) (linked with the Hausdorff distance). We chose these metrics because they appear often in the articles of the literature review. The different metrics are described in Table 1[19]. We consider various metrics because each metric has its drawbacks and evaluates only a part of the segmentation problem[19, 26]. Readers interested in additional metrics and the interactions among them can consult the study of Taha et al.[19].
Table 1

Segmentation metrics

Metric | Equation | Range | Meaning
Dice coefficient (DC) | 2 × |Mask ∩ Ground Truth| / (|Mask| + |Ground Truth|) | 0–1 | Spatial overlap between masks
True positive rate (TPR) | TP / (TP + FN) | 0–1 | Sensitivity
True negative rate (TNR) | TN / (TN + FP) | 0–1 | Specificity
Average volume distance (AVD) | max(d_H(Mask, Ground Truth), d_H(Ground Truth, Mask)) | ≥ 0 | Precision

Mask: segmentation mask; Ground Truth: ground-truth mask; TP: true positives, voxels correctly segmented as the region of interest; TN: true negatives, voxels correctly segmented as the background; FP: false positives, voxels incorrectly segmented as the region of interest; FN: false negatives, voxels incorrectly segmented as the background. d_H corresponds to the directed average Hausdorff metric, defined as d_H(A, B) = (1/N) Σ_{a ∈ A} min_{b ∈ B} ‖a − b‖, where N is the number of pixels or voxels considered.

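For concreteness, the three overlap-based metrics of Table 1 can be computed directly from boolean masks. This is a minimal numpy sketch (function names are ours, not from the article):

```python
import numpy as np

def dice(mask, truth):
    """Dice coefficient: spatial overlap between mask and ground truth."""
    return 2 * np.logical_and(mask, truth).sum() / (mask.sum() + truth.sum())

def tpr(mask, truth):
    """True positive rate (sensitivity)."""
    return np.logical_and(mask, truth).sum() / truth.sum()

def tnr(mask, truth):
    """True negative rate (specificity)."""
    return np.logical_and(~mask, ~truth).sum() / (~truth).sum()

mask = np.array([1, 1, 0, 0], dtype=bool)   # predicted segmentation
truth = np.array([1, 0, 0, 0], dtype=bool)  # gold standard
print(round(float(dice(mask, truth)), 3), float(tpr(mask, truth)))  # 0.667 1.0
```

As the example shows, a mask can reach a perfect sensitivity while its Dice coefficient is mediocre, which is why several metrics are reported together.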

Synthesis of the literature review

The main results are displayed in Tables 2, 3, 4 and 5. Table 2 focuses on the data variability. Table 3 focuses on the evaluation procedure. Table 4 presents the optimization strategies. Table 5 considers the middleware and the infrastructure. We are interested in the following three main points: (1) whether the DL strategy is correctly described as enabling the work to be reproduced, (2) whether the variability of the different parts of the DL framework are considered, and (3) how the evaluation is performed and the results are reported.
Table 2

The training size, the kind of data augmentation (DA), the presence of the DA term and the validation set (VS) term, the training size proportion and the cross-validation (CV) strategy in each article.

Article | Training size | DA | DA term | VS term | Training proportion | CV strategy
Guo (2014)[44] | ≤50 | Patches | No | No | Not clearly detailed | LOO
de Brebisson (2015)[67] | ≤50 | Patches | No | Yes | 43% | No
Choi (2016)[68] | ≤50 | Patches | No | Yes | 75% | No
Stollenga (2015)[69] | ≤50 | Patches | Yes | No | 50%, 25% | No
Zhang (2015)[65] | ≤10 | Patches | No | Yes | 87.5% | LOO
Andermatt (2016)[50] | ≤10 | Yes | Yes | No | 25% | No
Bao (2016)[70] | ≤10, ≤50 | Patches | No | No | 50%, 50% | No
Birenbaum (2016)[51] | ≤10 | Patches | Yes | Yes | 80% | LOO
Brosch (2016)[52] | ≤50, ≤50, ≥100 | Not described | No | Yes | 46%, 95%, 80% | No/LOO/No
Chen (2016a)[12] | ≤10 | Not described | No | No | 25% | LOO
Ghafoorian (2016b) | ≥100 | Patches | No | Yes | 90% | No
Ghafoorian (2016a) | ≥100 | Patches | No | Yes | 89% | No
Havaei (2016b)[71] | ≤50, ≥100, ≥100 | Not described | No | Yes | 70% | No
Havaei (2016a)[55] | ≤50, ≥100 | Patches | Yes | Yes | 46%, 84% | No/7 FO
Kamnitsas (2017)[47] | ≤100, ≥100, ≤50 | Patches | Yes | Yes | 80%, 72%, 44% | 5 FO
Kleesiek (2016)[72] | ≥100, ≤100 | Patches | Yes | No | 50%, 50% | 2 FO/3 FO
Mansoor (2016)[58] | ≥100 | Patches | No | No | Not clearly detailed | Not clearly detailed
Milletari (2016a)[57] | ≤100, ≤50 | Patches | No | Yes | 82%, 33% | No
Moeskops (2016a)[56] | ≤50, ≤50, ≤50 | Patches | No | No | 20%, 25%, 33% | LOO/No/No
Nie (2016b)[66] | ≤10 | Patches | No | No | Not clearly detailed | LOO
Pereira (2016)[48] | ≤50 | Patches | Yes | Yes | 46%, 84% | No
Shakeri (2016)[49] | ≤50, ≤50 | Patches | Yes | Yes | 66%, 50% | 3 FO/2 FO
Zhao (2016)[73] | ≤50 | Patches | No | No | Not clearly detailed | Not clearly detailed

The CV method can be leave-one-out (LOO) or k-fold out (k FO). For example, the article by Kamnitsas et al.[47] presents 3 datasets with training sizes ≤100, ≥100 and ≤50. The data augmentation is based on a patch strategy (explicitly referred to in the article). The authors also explicitly stated whether they used a validation set. The training proportions of the 3 datasets are 80%, 72% and 44%. Finally, the authors used a 5-fold-out (5 FO) CV strategy.
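To make the CV terminology concrete, the k-fold-out and leave-one-out splits can be sketched in a few lines; the helper name `k_fold_indices` is ours, not taken from any reviewed article.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k disjoint folds for cross-validation.
    Setting k == n_samples reduces this to leave-one-out (LOO)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    splits = []
    for i, test in enumerate(folds):
        # Training set = all folds except the i-th, which is held out.
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((sorted(train), sorted(test)))
    return splits

splits = k_fold_indices(10, k=5)   # 5 FO, as in Kamnitsas et al.
loo = k_fold_indices(10, k=10)     # LOO: each held-out fold has one sample
```

Recording the seed alongside the split is what makes the partition itself reproducible.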

Table 3

The DL model, the number of datasets (Nb DS), the dataset type (public or private) and the kind of evaluation (the type, number and variability of the measures) in each article.

Article | DL architecture | Nb DS | Dataset type | Type of Meas. | Nb of Meas. | Var. of Meas.
Guo (2014)[44] | SAE | 1 | Private | DC | 1 | Values
de Brebisson (2015)[67] | CNN | 1 | Public | DC | 1 | No
Choi (2016)[68] | CNN | 2 | Public | DC, P, R | 3 | Values, graph
Stollenga (2015)[69] | RNN | 2 | Public | DC, MHD, AVD | 3 | No
Zhang (2015)[65] | CNN | 1 | Private | DC, MHD | 2 | Values, graph *
Andermatt (2016)[50] | RNN | 1 | Public | DC, MHD, AVD | 3 | No
Bao (2016)[70] | CNN | 2 | Public | DC, VD, SD, TPR, FPR | 1 | No
Birenbaum (2016)[51] | CNN | 1 | Public | DC, Score | 2 | No
Brosch (2016)[52] | CNN | 3 | 2 Public & private | DC, AVD, LTPR, LFPR | 4 | Graph
Chen (2016a)[12] | CNN | 1 | Public | DC, MHD, AVD | 3 | No
Ghafoorian (2016b) | CNN | 1 | Private | DC, AUC | 2 | Graph
Ghafoorian (2016a) | CNN | 1 | Private | DC, AUC | 2 | Graph
Havaei (2016b)[71] | CNN | 3 | Public | DC, VD, SD, TPR, FPR | 5 | No
Havaei (2016a)[55] | CNN | 2 | Public | DC, Sens., Spe. | 3 | Graph
Kamnitsas (2017)[47] | CNN | 3 | Private & 2 public | DC, P, Sens., ASSD, HD | 5 | Values, graph
Kleesiek (2016)[72] | CNN | 4 | 3 Public & 1 private | DC, Sens., Spe. | 3 | Values, graph
Mansoor (2016)[58] | SAE | 1 | Private | DC, ALSD | 2 | Values, graph
Milletari (2016a)[57] | CNN | 2 | Private | DC, MDEC, FR | 3 | Graph
Moeskops (2016a)[56] | CNN | 3 | Public | DC, MSD | 2 | Values, graph
Nie (2016b)[66] | CNN | 1 | Private | DC | 1 | Values *
Pereira (2016)[48] | CNN | 2 | Public | DC, PPV, Sens. | 3 | Graph
Shakeri (2016)[49] | CNN | 2 | Public & private | DC, HD, CMD | 3 | Graph
Zhao (2016)[73] | CNN | 1 | Public | DC | 1 | Graph

For the types of measures: DC = Dice coefficient, P = precision, R = recall, MHD = modified Hausdorff distance, AVD = average volume distance, TPR = true positive rate, FPR = false positive rate, AUC = area under the curve, Sens. = sensitivity, and Spe. = specificity. The variability of a measure corresponds to the presence of the standard deviation value or a display in a graph. The (*) means that the values for all subjects are reported. For example, the article by Kamnitsas et al.[47] is based on a CNN. Their models are evaluated on 3 datasets, one private and two public. To evaluate their segmentations, they used the DC, P, Sens., ASSD and HD metrics (5 different metrics). The variability of the measures is displayed in a graph, and the corresponding values are reported in the text.
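As an illustration of the most common metrics in the table, the Dice coefficient, sensitivity and specificity can be computed directly from binary masks; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient (DC): 2|A∩B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def sensitivity(pred, truth):
    """Sensitivity (= recall = TPR): TP / (TP + FN)."""
    tp = np.logical_and(pred, truth).sum()
    return tp / truth.sum()

def specificity(pred, truth):
    """Specificity: TN / (TN + FP)."""
    tn = np.logical_and(~pred, ~truth).sum()
    return tn / (~truth).sum()

pred = np.array([[1, 1, 0, 0]], dtype=bool)
truth = np.array([[1, 0, 0, 0]], dtype=bool)
print(dice(pred, truth))  # 2*1 / (2+1) ≈ 0.667
```

The fact that sensitivity, recall and TPR are the same quantity under three names is visible directly in the formula, which is precisely the naming ambiguity discussed later in the review.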

Table 4

The kind of optimization, whether the hyperparameters (HPs) are handcrafted, and, for the learning rate, the batch size and the dropout regularization, the value (V.) and the presence (P.) of the term in each article.

Article | Optimization | HP handcrafted | Learning rate (V./P.) | Batch size (V./P.) | Dropout (V./P.)
Guo (2014)[44] | GBM | Yes | No/no | No | No
de Brebisson (2015)[67] | SGD (M) | Yes | Yes (0.05)/yes | Yes/yes | No
Choi (2016)[68] | SGD (M) | Yes | Yes (0.001)/yes | No/yes | Yes/yes
Stollenga (2015)[69] | RMS-prop | Yes | Yes (0.01)/yes | No | Yes/yes
Zhang (2015)[65] | SGD (M) | Yes | Yes (0.0001)/yes | No | Yes/yes
Andermatt (2016)[50] | AdaDelta | Yes | omit | No | Yes/yes
Bao (2016)[70] | Not described | Yes | No | No | No
Birenbaum (2016)[51] | AdaDelta | Yes (**) | omit | No | Yes/yes
Brosch (2016)[52] | AdaDelta | Yes | Sensitivity ratio: yes/yes | No | No
Chen (2016a)[12] | Not described | Yes | No | No | No
Ghafoorian (2016b) | RMS-prop | Yes | No/yes | Yes/yes | Yes/yes
Ghafoorian (2016a) | RMS-prop | Yes | No/yes | Yes/yes | Yes/yes
Havaei (2016b)[71] | SGD (M) | Yes | Yes (0.001)/yes | No | No/yes
Havaei (2016a)[55] | SGD (M) | No (grid search) | Yes (0.005)/yes | No/yes | Yes/yes
Kamnitsas (2017)[47] | RMS-prop | Yes | Yes (0.0001)/yes | Yes/yes | Yes/yes
Kleesiek (2016)[72] | SGD | Yes | Yes (0.00001)/yes | Yes/yes | No
Mansoor (2016)[58] | SGD (M) | Yes (**) | No | Yes/yes | No
Milletari (2016a)[57] | SGD (M) | Yes | Yes (range of values)/yes | Yes/yes | Yes/yes
Moeskops (2016a)[56] | RMS-prop | No (not explained) | No/yes | No/yes | No/yes
Nie (2016b)[66] | Not described | Yes | No/yes | No | No
Pereira (2016)[48] | SGD (M) | Yes | Yes (range of values)/yes | Yes/yes | Yes/yes
Shakeri (2016)[49] | SGD (M) | Yes | Yes (0.01)/yes | No | Yes/yes
Zhao (2016)[73] | Not described | Yes | No | No | No

The (M) in the optimization column signifies that the momentum algorithm is used. The (**) in the HP handcrafted column means that several DL architectures are tested. For example, the article by Kamnitsas et al.[47] used an RMS-prop optimization strategy. The different hyperparameters are handcrafted. The learning rate, the batch size and the dropout are mentioned in the text, and their corresponding values are given.
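For reference, the update rules behind the two most common optimizers in the table, SGD with momentum and RMS-prop, can be written explicitly in a few lines; the hyperparameter values below are illustrative defaults, not taken from any reviewed article.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.001, mu=0.9):
    """One SGD-with-momentum update: v <- mu*v - lr*grad; w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

def rmsprop_step(w, grad, s, lr=0.001, rho=0.9, eps=1e-8):
    """One RMS-prop update: s is a running mean of squared gradients
    that rescales the step per parameter."""
    s = rho * s + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

w = np.zeros(3)
v = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])
w1, v1 = sgd_momentum_step(w, g, v, lr=0.01)  # first step: w1 = -lr * g
```

Writing the rules out makes clear why the learning rate, momentum/decay coefficients and batch size must all be reported: each one changes the trajectory of the weights.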

Table 5

In the second column, the different implementations are described (Theano, Mat-ConvNet, Caffe, Keras, Pylearn2 and Lasagne).

Article | Implementation | Infrastructure | Open source
Guo (2014)[44] | Not described | Not described | No
de Brebisson (2015)[67] | Theano | NVIDIA Tesla K40 GPU, 12 GB | No
Choi (2016)[68] | Mat-ConvNet | GPU (GTX TITAN) | No
Stollenga (2015)[69] | Not described | NVIDIA GTX TITAN X GPU, 12 GB | No
Zhang (2015)[65] | Not described | Tesla K20c GPU | No
Andermatt (2016)[50] | Caffe | NVIDIA GTX Titan X GPU, 12 GB | No
Bao (2016)[70] | Not described | Not described | No
Birenbaum (2016)[51] | Keras + Theano | NVIDIA GeForce GTX 980 Ti GPU | No
Brosch (2016)[52] | Own implementation | GeForce GTX 780 | No
Chen (2016a)[12] | Caffe | NVIDIA TITAN X GPU | Yes (*)
Ghafoorian (2016b)[7] | Theano | Not described | No
Ghafoorian (2016a)[7] | Not described | Titan X card | No
Havaei (2016b)[71] | Keras | Nvidia TitanX GPU | No
Havaei (2016a)[55] | Pylearn2 | NVIDIA Titan Black card | No
Kamnitsas (2017)[47] | Theano | NVIDIA GTX Titan X GPU, 12 GB | Yes
Kleesiek (2016)[72] | Theano | NVIDIA Titan, 3 GB | No
Mansoor (2016)[58] | Not described | Not described | No
Milletari (2016a)[57] | Caffe | NVIDIA "Tesla K40" or "Titan X", 12 GB; tested on Nvidia GTX 980, 4 GB | No
Moeskops (2016a)[56] | Not described | NVIDIA Tesla K40 GPU (**) | No
Nie (2016b)[66] | Caffe | Not described | No
Pereira (2016)[48] | Theano + Lasagne | NVIDIA GeForce GTX 980 GPU | Yes
Shakeri (2016)[49] | Mat-ConvNet | Described on GitHub | Yes
Zhao (2016)[73] | Not described | Not described | No

Theano: http://deeplearning.net/software/theano/
Mat-ConvNet: http://www.vlfeat.org/matconvnet/
Caffe: https://caffe.berkeleyvision.org/
Keras: https://keras.io/
Pylearn2: http://deeplearning.net/software/pylearn2/
Lasagne: https://lasagne.readthedocs.io/en/latest/

For the infrastructure details, the materials are described as they are referenced in the articles. If the global memory is reported in the article, it is noted. The last column, 'Open source', shows whether the source code is available. The (*) indicates that the source code is not available but a detailed prototype of the algorithm is provided. The (**) indicates that the infrastructure is detailed in the Acknowledgements section. For example, the article by Kamnitsas et al.[47] used the Theano implementation on an infrastructure based on an NVIDIA GTX Titan X GPU with 12 GB of memory. Their code is released as open source.
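Given the heterogeneous infrastructure reporting observed here, part of the description could be generated programmatically. The sketch below is a hypothetical, stdlib-only helper; a real report would also query the DL toolbox for GPU model and library versions and pass them in through `extra`.

```python
import json
import platform
import sys

def environment_report(extra=None):
    """Collect a minimal, machine-readable description of the software
    environment for publication alongside the results. GPU details and
    toolbox versions go in `extra` when available."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    if extra:
        report.update(extra)
    return report

# Hypothetical usage, mirroring the best-described setups in Table 5.
print(json.dumps(environment_report({"gpu": "NVIDIA GTX Titan X, 12 GB"}),
                 indent=2))
```

Emitting such a report at the start of every run costs nothing and fixes the "no convention for reporting the infrastructure" problem at the source.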

Description of the DL strategy

In this section, we focus not on whether particular methods are applied but on whether they are clearly described: a method may well have been applied without any mention in the text. The main findings are as follows: only two articles[47, 48] (9% of the articles) describe the hyperparameters and the dataset in sufficient detail to enable the work to be reproduced. One study[49] is missing just one hyperparameter (the batch size) in the text, but the source code is available with this information included. Here, we focus on descriptions relative to the dataset and to the optimization stage. These results are detailed in Fig. 2. The left side of the figure concerns the description of the dataset (the training proportion, the data augmentation and the validation set) and the right side the description of the optimization (the optimization procedure, the learning rate, the dropout procedure and the batch size). Some criteria are well described, such as the training proportion (83% of the considered articles) or the optimization procedure (83% of the selected articles). Others are less often available, such as the data augmentation procedure (only 35% of the articles). For a study to be reproducible, all these characteristics must be described; only 9% of the selected articles provide sufficient information.
Figure 2

The left side of the figure concerns the description of the dataset and the right side the description of the optimization. The training proportion is described in 83% of the selected articles. The data augmentation term is described in 35% of the selected articles and the validation set term in 57%. For the optimization procedure, the name of the optimization algorithm is missing in 17% of the selected articles. Regarding the learning rate, dropout and batch size hyperparameters, their values are available in only 57%, 52% and 35% of the articles, respectively; these coefficients are mentioned in the text without any values in 19%, 9% and 13% of the articles, respectively. In the end, only 9% of the evaluated articles provide enough information to be reproducible.

In Table 2, the dataset management is described. All the selected articles correctly present the dataset and the number of samples, but 17% do not state the training proportion used to estimate the parameters. Only 57% of the selected articles clearly state whether they used a validation set, and only 35% whether they performed data augmentation.

Table 4 focuses on the hyperparameters of the optimization process. 17% of the articles do not describe the optimization procedure at all. One[44] cites a generic name (GBM, for gradient-based method) without any explanation. The learning rate is generally reported with its initial value (or range of values); four articles do not mention its value. For the AdaDelta optimization used in[50-52], there is no learning rate; however, some coefficients, such as the sensitivity ratio, still need to be specified, and only one article[52] of the three mentions this coefficient. More than half of the selected studies (52%) do not mention the batch size, and only 35% of all the articles specify its value. The dropout method, which relates to regularization, is mentioned in 61% of the selected articles (only 43% specify the dropout ratio). 43% of the selected articles state that they perform stochastic gradient descent (SGD). Strictly speaking, however, SGD is a generic term, and 90% of those articles actually use SGD with momentum. Moreover, SGD is often confused with mini-batch gradient descent[53], as is the case for 70% of the selected articles, which use the term batch size together with the term SGD.

In Table 5, it can be seen that 35% of the selected articles do not describe the toolbox used to implement the DL models, and 26% do not describe the infrastructure. Assuming that a correct description of a GPU requires at least the name of the manufacturer, the class and the memory size, only 30% provide this information.
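The SGD terminology issue raised above is easy to make concrete: in the strict sense, SGD updates the weights after each single sample, i.e. it is mini-batch gradient descent with a batch size of 1. A minimal least-squares sketch (function name is ours):

```python
import numpy as np

def minibatch_gd(X, y, w, lr=0.01, batch_size=1, epochs=1, seed=0):
    """Mini-batch gradient descent on a least-squares loss.
    batch_size=1 is SGD in the strict sense; batch_size=len(X) is
    plain (full-batch) gradient descent."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            err = X[batch] @ w - y[batch]
            grad = X[batch].T @ err / len(batch)
            w = w - lr * grad
    return w

# Fitting y = 2x: batch_size=1 performs strict SGD.
X = np.arange(1.0, 5.0).reshape(-1, 1)
y = 2.0 * X.ravel()
w = minibatch_gd(X, y, np.zeros(1), lr=0.01, batch_size=1, epochs=200)
```

Since the trajectory depends on the batch size, the shuffling seed and the learning rate, reporting "SGD" alone, without these values, underspecifies the procedure.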
It can also be observed in Table 5 that there is no convention for reporting the infrastructure. The best way to reproduce an algorithm and to explore the hyperparameters or the architecture of a DL model is to have access to the source code. In Table 5, we observe that only 17% of the articles release the source code. These articles[47-49] are the same as those that provide an exhaustive description of the framework for reproducibility.

In the selected published articles, we are interested in the variability in the dataset, the optimization, the hyperparameters, the architecture of the DL framework, the implementation and the infrastructure. The main results in the next section are illustrated in Fig. 3. The figure is separated into four parts describing the variability of the dataset size, cross-validation strategies, optimization algorithms and implementation. The main conclusion is that there is no consensus on these topics. The rest of the results are discussed in detail in the subsequent section.
Figure 3

Four different sources of variability. (A) There is large variability in the dataset size: 68.5% of the datasets contain 50 samples or fewer. (B) In general, no cross-validation strategy is considered (more than 50% of the articles). (C) 5 different optimization algorithms are introduced in the different articles. The main one is SGD with momentum (SGD (M)); the gradient-based method (GBM) and stochastic gradient descent (SGD) are only general terms. (D) There are 5 different implementations of DL frameworks. Even though the Theano implementation is used in 42.9% of the considered articles, there is no consensus among the implementations.

In Table 3, the results are focused on data variability. More than half of the methods are evaluated on more than one dataset and on publicly available datasets (in general, provided by data challenges such as BRATS[54]). 30% of the articles test their algorithms only on private datasets. Only 6 datasets have more than 100 samples, and 4 of these come from the same public source, BRATS. This highlights the difficulty of obtaining large datasets. Consequently, data augmentation is important for medical image segmentation. Since the segmentation of a voxel can be performed locally, data augmentation based on patches can be considered. However, 13% of the articles do not clearly describe whether there is data augmentation, whether the patch strategy is considered, or how many patches are selected. The training proportion and the CV strategy permit avoiding or limiting bias relative to the chosen dataset; 52% of the articles do not use any CV strategy. One article[47] presents an original strategy for managing the intrinsic variability in the optimization stage of DL: the results of 3 DL models are merged, leading to better results than any single model alone. The other 22 articles do not discuss this notion. We can observe in Table 4 that only one article[55] clearly explains the tuning of the hyperparameters, with a grid search strategy. Another article[56] claims to automatically tune the hyperparameters without any explanation. In the articles considered in Table 4, there are three main optimization strategies: SGD with momentum, RMS-prop and AdaDelta. One of the main hyperparameters is the learning rate, which varies greatly across the selected articles, from 0.00001 to 0.05. Two articles[48, 57] consider a range of values. As shown in Table 2, the training proportion, which can be viewed as a hyperparameter, also has wide variability (from 20% to 95% of the dataset); it is generally selected according to the size of the dataset.

These results highlight the variability in the choice of hyperparameters for data management and optimization.
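The grid search tuning strategy mentioned above amounts to exhaustively scoring hyperparameter combinations on a validation set. A generic sketch, with an illustrative scoring function standing in for actual validation performance:

```python
from itertools import product

def grid_search(score_fn, grid):
    """Return the hyperparameter combination with the best validation score.
    `grid` maps each hyperparameter name to the list of values to try."""
    names = list(grid)
    best, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score_fn(**params)          # in practice: train + validate
        if s > best_score:
            best, best_score = params, s
    return best, best_score

# Illustrative grid spanning the learning-rate range seen in Table 4.
grid = {"learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 5e-2],
        "batch_size": [16, 32]}
best, score = grid_search(
    lambda learning_rate, batch_size: -abs(learning_rate - 1e-3), grid)
```

Reporting the grid itself (the search space) alongside the selected values is what makes the tuning step reproducible, which is exactly what only one of the reviewed articles does.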

Variability in the architecture of DL frameworks

In Table 3, we can see that the main strategy is to use a convolutional neural network (CNN) or recurrent neural network (RNN) architecture (91% of the methods) for segmentation (these architectures are types of DL models[14]). Two articles[51, 58] test several different DL architectures in their frameworks (5 for[51] and 2 for[58]). Only one article[55] performed a grid search to determine the structure of the architecture (based on the kernel and max-pooling sizes for each layer and on the number of layers).

Variability in middleware and infrastructure

In Table 5, we can see that several implementations are considered. More precisely, four different toolboxes (Theano[59], Mat-ConvNet[60], Caffe[61] and Pylearn2[62]) are referenced in the articles. Only one in-house implementation was used[52]. In 13% of all the articles, a high-level API (Keras[63] or Lasagne[64]) is deployed in addition to these toolboxes. All the articles describing the infrastructure ran their algorithms on a GPU. No articles referred to a distributed system for the implementation of the DL algorithm.

Evaluation of the variability

Almost half of the articles consider fewer than 3 metrics, the number recommended by Udupa et al.[26] (see Table 3 and Fig. 4). In a quarter of the articles, no variability relative to the metrics (such as the standard deviation) is provided. In some cases, this can be explained by the context of the data challenge platform used for evaluation. In most articles, the variability is displayed with a boxplot. Only two articles report the complete results for each participant[65, 66].
Figure 4

The number of evaluation measures used in each article. Note that the number of measures required to correctly evaluate a segmentation result is 3.

For the evaluation metrics, the DC is considered in all articles. There is a large variability in the other metrics, since 22 different names of metrics can be found. Some of them are the same even if the names are different, such as the true positive rate, recall and sensitivity.
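Reporting the variability of a metric requires little effort; a minimal sketch (function name and values are illustrative) that reports the mean ± standard deviation of a metric while preserving the per-subject values:

```python
import statistics

def summarize(scores):
    """Report a metric as mean ± standard deviation across subjects,
    keeping the per-subject values available for full reproducibility."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "per_subject": list(scores),
    }

dice_scores = [0.91, 0.88, 0.93, 0.85, 0.90]   # illustrative values
print(summarize(dice_scores))
```

Keeping the per-subject list, as only two of the reviewed articles do, allows later readers to recompute any statistic, not just the ones the authors chose.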

Reproducibility in the literature reviews

To evaluate the impact of reproducibility in DL for image segmentation after 2017, we consider the 3 reviews[37, 45, 46]. All the reviews highlight the problem of correctly comparing different methods. To address this issue, the reviews suggest testing the DL frameworks on public datasets through challenges and providing the code publicly. Moreover, the study[45] suggests that the difficulty of comparing the frameworks comes from the numerous available metrics used to evaluate segmentation. Furthermore, the study[37] highlights the problem of reproducibility due to the lack of a correct description of the frameworks. Finally, all the reviews consider reproducibility as a challenge. However, none of them raises the question of the intrinsic variability of DL frameworks. They do not refer to multiple metrics to correctly evaluate segmentation or discuss the cross-validation aspect. For the reported results in these reviews[37, 45], no variability measure is provided, such as the standard deviation.

Proposals for practices conducive to reproducibility in medical image segmentation with DL

On the basis of the literature review, our recommendations focus on three main points: (1) an adequate description of the DL framework, (2) a suitable analysis of the different sources of variability in the DL framework, and (3) an efficient evaluation system for the segmentation results. The flowchart of the different proposals is displayed in Fig. 5. Although each part is independent, there is a natural order, which we follow in our recommendations.
Figure 5

The proposals are separated into three main parts: (A) an adequate and complete description of the DL framework for reproducibility purposes, (B) an analysis of the different sources of variability, and (C) an efficient evaluation system for image segmentation.


Recommendations for the description of the framework

First, to perform reproducible research, it is mandatory to correctly describe all aspects of the framework, from the DL model and its related hyperparameters to the evaluation system. The initial step consists in clearly describing the algorithm and/or the model of the DL architecture; a schema of the architecture should be provided, since the architecture is generally complex. For the optimization step, the chosen algorithm should be clearly referenced with its name and its corresponding publication, and the final hyperparameters, such as the learning rate or the batch size, should be provided; if several evaluations are performed, the number of trials should be given. For the data part, several steps are mandatory. A complete description of the dataset is required, with the kind of acquisition (e.g., MRI or CT), the size of the images and the total sample size; if the dataset is publicly available, a download link should be provided. For the preprocessing stage, the authors should state whether some data are excluded. In the case of data augmentation, the different kinds of transformation must be described and the final number of samples should be included; for the special case of images, if the augmentation consists in the selection of multiple patches, the characteristics and the final number of patches should be described. The allocation of the dataset samples into training, validation and testing sets should be clearly described; if no validation set is created, this must be clearly stated and the choice explained. The cross-validation strategy should be described along with the number of folds considered. Finally, for the selection of the hyperparameters of the optimization process or the design of the DL architecture, the method should be explained, even if it is handcrafted; more precisely, the method and the search space of the different hyperparameters should be provided.
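As a concrete aid, the description items above can be gathered into a single machine-readable record published alongside the article. The following is a minimal sketch; all field names and values are illustrative, not taken from any particular study.

```python
import json

# Hypothetical experiment description covering the recommended items:
# architecture, optimization, dataset, preprocessing, splits and HP search.
experiment = {
    "architecture": {"name": "U-Net", "reference": "Ronneberger et al., 2015"},
    "optimization": {
        "algorithm": "Adam",
        "reference": "Kingma & Ba, 2015",
        "learning_rate": 1e-4,
        "batch_size": 8,
        "n_trials": 5,          # number of repeated optimization runs
    },
    "dataset": {
        "modality": "MRI",
        "image_size": [256, 256],
        "n_samples": 100,
        "public_url": None,     # download link if publicly available
    },
    "preprocessing": {"excluded_samples": 0,
                      "augmentation": ["flip", "rotation"]},
    "splits": {"train": 0.7, "validation": 0.1, "test": 0.2, "cv_folds": 5},
    "hp_search": {"method": "random search",
                  "space": {"learning_rate": [1e-5, 1e-2]}},
}

print(json.dumps(experiment, indent=2))
```

Such a record can be versioned with the source code, which makes the framework description verifiable rather than scattered across the text.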
A description of the computing infrastructure should be given with technical specifications: at least the name of the manufacturer, the class of the architecture and the memory size should be provided. For the middleware, the kind of implementation should be described (an available toolbox or in-house code, and the build version). If the toolbox is public, a link to it should be provided. In general, the best solution is to provide a link to the downloadable source code with all dependencies included. Finally, for the evaluation, a clear description of the results should be given with the average metrics and their variations. If a figure such as a boxplot is displayed, the values of the error bars should be provided. All these recommendations have been proposed in the Machine Learning Reproducibility Checklist[18]. There are two main differences between their recommendations and ours: first, we group the different points by source of variability, whereas they group them by article section (Methods and Results); second, the recommendations proposed here focus specifically on image segmentation.
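The infrastructure and middleware details can also be captured automatically at run time rather than transcribed by hand. The sketch below assumes a Python environment; the nvidia-smi query is a hypothetical extra that only runs if the tool happens to be installed.

```python
import json
import platform
import shutil
import subprocess
import sys

def capture_environment():
    """Record the software/hardware details recommended above (sketch)."""
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }
    # GPU name and global memory via nvidia-smi, when available.
    if shutil.which("nvidia-smi"):
        query = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        env["gpus"] = query.stdout.strip().splitlines()
    return env

if __name__ == "__main__":
    print(json.dumps(capture_environment(), indent=2))
```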

Recommendations for the analysis of variability

As shown by the literature review, sources of variability occur in each part of a DL framework: the dataset, the optimization procedure, the selection of the hyperparameters, the DL architecture, and the computational infrastructure. Each kind of variability is different and should be considered with its own tools.

Variability of the dataset

In practice, the available data are always a subset (or sampling) of the true distribution. This sampling effect typically introduces a bias which, in turn, results in variability in the final results. This fact is important in medical image processing, where the number of samples is limited: a DL network learning from a particular sampling of the data can overfit. The recommended common tool is cross-validation. It must be noted that the purpose of cross-validation is different from the use of batches in optimization. The choice of the cross-validation method, i.e., the number of samples left out (as in the leave-one-out or k-fold strategies), should be made with regard to the sample size. The choice of the samples to be analyzed (patches or 2D slices extracted from 3D images) can lead to strong correlations between samples. To obtain a better sampling of the true distribution of the training dataset, data augmentation must be considered. The different transformations for the augmentation must be chosen carefully with regard to the organs studied. Furthermore, data augmentation can enhance accuracy by compensating for poor diversity of the training dataset with respect to the testing dataset.
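To illustrate the correlation issue mentioned above, the following sketch (with hypothetical sizes) builds a subject-level k-fold split so that patches extracted from the same subject never appear in both the training and testing folds.

```python
import numpy as np

# Hypothetical setup: 20 subjects, 10 patches each. Patches from one subject
# are strongly correlated, so a subject must never span the train and test
# sides of a cross-validation split (a "grouped" k-fold).
rng = np.random.default_rng(0)
n_subjects, patches_per_subject, n_folds = 20, 10, 5
subjects = np.repeat(np.arange(n_subjects), patches_per_subject)

# Assign whole subjects (not individual patches) to folds.
shuffled = rng.permutation(n_subjects)
subject_fold = np.empty(n_subjects, dtype=int)
subject_fold[shuffled] = np.arange(n_subjects) % n_folds
patch_fold = subject_fold[subjects]   # fold label for every patch

for k in range(n_folds):
    test_mask = patch_fold == k
    # No subject appears on both sides of the split.
    assert not set(subjects[test_mask]) & set(subjects[~test_mask])
    print(f"fold {k}: {test_mask.sum()} test patches, "
          f"{len(set(subjects[test_mask]))} held-out subjects")
```

A naive patch-level k-fold would scatter a subject's patches across folds and inflate the apparent accuracy.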

Variability of the optimization

For the optimization, the analysis of variability has mostly been studied for classification purposes[29-31]. Our recommendation for managing the variability of the optimization is to perform at least several trials with the same hyperparameters on the same datasets. The observed variability should be reported as the average score and its corresponding standard deviation. For a deeper analysis that also takes the dataset variability into account, the optimization should be repeated several times on each fold of a cross-validation of the dataset, and the results for each fold should be grouped. A one-way ANOVA should then be performed on the different groups to test whether there is a difference or interaction over the dataset with respect to the optimization. If the assumptions of ANOVA are violated, a strategy based on a nonparametric test over cross-validation has already been proposed[74] to better estimate the residual error and to analyze the interaction between the algorithm and the learning dataset.
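A minimal sketch of this recommendation, using simulated Dice scores in place of real training runs: several optimization trials per cross-validation fold are summarized as mean and standard deviation, and a one-way ANOVA tests for a fold (dataset) effect.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical Dice scores: 5 optimization trials (same hyperparameters)
# repeated on each of 4 cross-validation folds.
rng = np.random.default_rng(42)
scores_per_fold = [rng.normal(loc=0.85, scale=0.02, size=5) for _ in range(4)]

# Report the average score and its standard deviation over all trials.
all_scores = np.concatenate(scores_per_fold)
print(f"Dice: {all_scores.mean():.3f} +/- {all_scores.std(ddof=1):.3f}")

# One-way ANOVA: does the fold (i.e., the training data) have an effect?
f_stat, p_value = f_oneway(*scores_per_fold)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```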

Variability of the hyperparameters

Our recommendation for the selection of the hyperparameters is to avoid handcrafted selection. Even if this selection is fast, the resulting set of hyperparameters can exhibit high variability, since the hyperparameters can lie within a large range[33]. Automatic selection by a grid search, random search or Bayesian optimization algorithm yields optimum values that are potentially more robust. It should be noted that a Neural Architecture Search Best Practices Checklist[75] was published in September 2019 on this specific subject.
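As an illustration of automatic selection, the sketch below runs a random search over a hypothetical validation score surface; in practice, validation_dice would be replaced by an actual training-and-validation run.

```python
import math
import random

def validation_dice(learning_rate, dropout):
    """Stand-in for an expensive training run (hypothetical score surface)."""
    return (0.9
            - 0.01 * (math.log10(learning_rate) + 3.5) ** 2
            - (dropout - 0.3) ** 2)

random.seed(0)
best = None
for _ in range(50):
    # Sample the learning rate log-uniformly and the dropout uniformly:
    # this is the explicitly declared search space.
    lr = 10 ** random.uniform(-5, -2)
    dropout = random.uniform(0.0, 0.6)
    score = validation_dice(lr, dropout)
    if best is None or score > best[0]:
        best = (score, lr, dropout)

score, lr, dropout = best
print(f"best Dice {score:.3f} at lr={lr:.1e}, dropout={dropout:.2f}")
```

Reporting the search method and its space, as recommended above, is what makes such a selection reproducible.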

Variability of DL architectures

The main problem addressed by the evaluated articles is: how should different DL architectures be compared? The comparison should account for the variability in the dataset, the optimization and the hyperparameter selection. Our recommendation is to perform, for each evaluated DL architecture, several trials of optimization on each fold of the dataset provided by a cross-validation strategy. A two-way ANOVA can then be used to evaluate the variability of the metrics with respect to the folds of the cross-validation and the different DL architectures. If the assumptions of the ANOVA are violated, a nonparametric test can be used instead[76].
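A sketch of this comparison on simulated scores: a balanced two-way ANOVA with replication, computed by hand, tests for architecture, fold and interaction effects. All numbers are illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical Dice scores: 3 architectures x 4 CV folds x 5 trials each.
rng = np.random.default_rng(1)
a, b, n = 3, 4, 5   # architectures, folds, trials per cell
effects = np.array([0.84, 0.86, 0.85])[:, None, None]
scores = effects + rng.normal(scale=0.01, size=(a, b, n))

# Balanced two-way ANOVA with replication (sums of squares by hand).
grand = scores.mean()
ss_a = b * n * ((scores.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_b = a * n * ((scores.mean(axis=(0, 2)) - grand) ** 2).sum()
cell_means = scores.mean(axis=2)
ss_ab = n * ((cell_means
              - scores.mean(axis=(1, 2))[:, None]
              - scores.mean(axis=(0, 2))[None, :] + grand) ** 2).sum()
ss_err = ((scores - cell_means[:, :, None]) ** 2).sum()

df_a, df_b = a - 1, b - 1
df_ab, df_err = df_a * df_b, a * b * (n - 1)
for name, ss, df in [("architecture", ss_a, df_a),
                     ("fold", ss_b, df_b),
                     ("interaction", ss_ab, df_ab)]:
    f_stat = (ss / df) / (ss_err / df_err)
    p = f_dist.sf(f_stat, df, df_err)
    print(f"{name}: F = {f_stat:.2f}, p = {p:.4f}")
```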

Variability of the infrastructure

In general, it is difficult to test the variability of infrastructures, since their costs can be high. Our recommendation is to precisely record the specifics of the infrastructure to avoid side effects on reproducibility. Two important factors are the number of processing units and their characteristics (the type of compute unit used and the available random access memory (RAM)). The number of processing units deeply affects the framework (distributed or non-distributed system). The RAM can constrain the batch size during optimization. The type of compute unit can introduce quantization and numerical noise into the optimization. Regarding the middleware, an automatic deployment of the operating system and of the toolbox associated with the DL framework is recommended, based on a complete description of the system. In addition, distributed systems can be considered to complete simulations in a reasonable time. To mitigate the reproducibility problem, some recommendations for the network should be followed, such as the use of InfiniBand (to avoid latency) or of a compartmentalized network (to avoid interactions with other users).

Recommendations for the analysis of the evaluation system

In the context of image segmentation, at least three metrics should be considered[19]. Because some of them are correlated[19], it is important to carefully choose the metrics that suit the scenario at hand[26]. Even though many metrics have been defined[19] and, to the best of our knowledge, no consensus exists, we propose to evaluate segmentation methods with at least the three most common metrics: the DC, the TPR and the FNR. These metrics are described in the Methods section. Readers interested in image segmentation metrics can find more complex evaluations based on the recommendations of dedicated studies[19, 26].
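These three metrics are straightforward to compute from binary masks; a minimal sketch (assuming a non-empty ground-truth mask) follows.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Dice coefficient (DC), true positive rate (TPR) and false negative
    rate (FNR) for binary masks. Sketch: assumes a non-empty ground truth."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    dc = 2 * tp / (pred.sum() + truth.sum())
    tpr = tp / (tp + fn)
    return {"DC": dc, "TPR": tpr, "FNR": 1 - tpr}

# Toy example: the prediction covers half of the ground-truth square.
truth = np.zeros((8, 8), dtype=bool)
truth[2:6, 2:6] = True            # 16 ground-truth pixels
pred = np.zeros_like(truth)
pred[2:6, 2:4] = True             # 8 predicted pixels, all inside the truth

m = segmentation_metrics(pred, truth)
print(m)   # DC = 16/24 ~ 0.667, TPR = 0.5, FNR = 0.5
```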

Discussion

The complexity and heterogeneity of DL frameworks are responsible for multiple kinds of variability. Because of the reproducibility crisis[21, 22], researchers have highlighted multiple factors that induce variability in the results obtained, as well as important guidelines that must be respected in order to minimize, or at least quantify, these effects: (i) for other researchers to be able to replicate the obtained results, it is necessary to precisely describe the DL framework in use as well as its optimization procedure; (ii) potential sources of variability must be acknowledged and, when possible, evaluated in order to determine their importance. Last, it is crucial to consider the specifics of the field being researched: already-existing data processing methodologies and evaluation procedures must be properly incorporated within the DL framework; see for instance[19, 26] for medical image segmentation. In practice, however, assessing reproducibility and variability is a rather difficult task in the context of DL frameworks. Conversely, variability may also be seen as a blessing: for instance, merging different optimization solutions or different DL frameworks improves the segmentation[47] and, more generally, the robustness.
Interdependence of key factors. The variability due to the optimization procedure depends not only on the choice of hyperparameters but also on the input data provided, i.e., the datasets. Facing such an issue, there is a need for new mathematical tools: (i) to decorrelate the overall variability and capture the individual effects associated with given parameter subsets; and (ii) to better compare the results obtained with different DL solutions.
Heterogeneous nature of the variability. This effect often makes it difficult to relate different sources of variability. For instance, consider the variability in the input data distribution on the one hand, and the variability in the stochastic optimization process on the other hand: these cannot be addressed in the same way, which in turn leads to different mathematical tools being needed to evaluate them.
Hardware/software perturbations. Typically, variability is estimated from a large number of repeated simulations, which requires powerful and/or distributed systems. These systems also induce variability, as they may differ slightly (in terms of architecture, data quantization, rounding strategies, implementation constraints, etc.). Finally, there is no clear consensus on the meaning of reproducibility, robustness and generalizability[77]; the notion of reproducibility should be driven mainly by the kind of application.
References (27 in total)

1.  Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain.

Authors:  Bruce Fischl; David H Salat; Evelina Busa; Marilyn Albert; Megan Dieterich; Christian Haselgrove; Andre van der Kouwe; Ron Killiany; David Kennedy; Shuna Klaveness; Albert Montillo; Nikos Makris; Bruce Rosen; Anders M Dale
Journal:  Neuron       Date:  2002-01-31       Impact factor: 17.173

2.  Brain anatomical structure segmentation by hybrid discriminative/generative models.

Authors:  Z Tu; K L Narr; P Dollar; I Dinov; P M Thompson; A W Toga
Journal:  IEEE Trans Med Imaging       Date:  2008-04       Impact factor: 10.048

Review 3.  Deep learning.

Authors:  Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal:  Nature       Date:  2015-05-28       Impact factor: 49.962

4.  Automated segmentation of the optic nerve head for diagnosis of glaucoma.

Authors:  R Chrástek; M Wolf; K Donath; H Niemann; D Paulus; T Hothorn; B Lausen; R Lämmer; C Y Mardin; G Michelson
Journal:  Med Image Anal       Date:  2005-04-08       Impact factor: 8.545

5.  Tissue segmentation of head and neck CT images for treatment planning: a multiatlas approach combined with intensity modeling.

Authors:  Valerio Fortunati; René F Verhaart; Fedde van der Lijn; Wiro J Niessen; Jifke F Veenland; Margarethus M Paulides; Theo van Walsum
Journal:  Med Phys       Date:  2013-07       Impact factor: 4.071

6.  Location Sensitive Deep Convolutional Neural Networks for Segmentation of White Matter Hyperintensities.

Authors:  Mohsen Ghafoorian; Nico Karssemeijer; Tom Heskes; Inge W M van Uden; Clara I Sanchez; Geert Litjens; Frank-Erik de Leeuw; Bram van Ginneken; Elena Marchiori; Bram Platel
Journal:  Sci Rep       Date:  2017-07-11       Impact factor: 4.379

7.  Deep Learning for Fully-Automated Localization and Segmentation of Rectal Cancer on Multiparametric MR.

Authors:  Stefano Trebeschi; Joost J M van Griethuysen; Doenja M J Lambregts; Max J Lahaye; Chintan Parmar; Frans C H Bakers; Nicky H G M Peters; Regina G H Beets-Tan; Hugo J W L Aerts
Journal:  Sci Rep       Date:  2017-07-13       Impact factor: 4.379

8.  Quantifying the local tissue volume and composition in individual brains with magnetic resonance imaging.

Authors:  Aviv Mezer; Jason D Yeatman; Nikola Stikov; Kendrick N Kay; Nam-Joon Cho; Robert F Dougherty; Michael L Perry; Josef Parvizi; Le H Hua; Kim Butts-Pauly; Brian A Wandell
Journal:  Nat Med       Date:  2013-11-03       Impact factor: 53.440

9.  A semi-automatic computer-aided method for surgical template design.

Authors:  Xiaojun Chen; Lu Xu; Yue Yang; Jan Egger
Journal:  Sci Rep       Date:  2016-02-04       Impact factor: 4.379

10.  Automatic Segmentation of Kidneys using Deep Learning for Total Kidney Volume Quantification in Autosomal Dominant Polycystic Kidney Disease.

Authors:  Kanishka Sharma; Christian Rupprecht; Anna Caroli; Maria Carolina Aparicio; Andrea Remuzzi; Maximilian Baust; Nassir Navab
Journal:  Sci Rep       Date:  2017-05-17       Impact factor: 4.379

