Literature DB >> 28772163

Designing image segmentation studies: Statistical power, sample size and reference standard quality.

Eli Gibson1, Yipeng Hu2, Henkjan J Huisman3, Dean C Barratt2.   

Abstract

Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources. In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e. the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula to relate reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards. The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of less than 4 subjects and errors in the detectable accuracy difference less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study.
Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.


Keywords:  Image segmentation; Reference standard; Segmentation accuracy; Statistical power


Year:  2017        PMID: 28772163      PMCID: PMC5666910          DOI: 10.1016/j.media.2017.07.004

Source DB:  PubMed          Journal:  Med Image Anal        ISSN: 1361-8415            Impact factor:   8.545


Introduction

Demonstrating an improvement in segmentation algorithm accuracy typically involves comparison with an accepted reference standard, such as manual expert segmentations or other imaging modalities (e.g. histology). In many medical image segmentation problems, generating such reference segmentations is challenging due to the variable appearance of anatomical/pathological features, ambiguous anatomical definitions, clinical constraints, and interobserver variability. The resulting errors in the reference standards introduce errors in the performance measures used to compare segmentation algorithms, and can reduce the probability of detecting a significant difference between algorithms, referred to as the statistical power (Beiden et al., 2000). The cost and quality of a reference standard are affected by the time and effort devoted to segmentation accuracy, the sample size, and the number, background, experience and proficiency of the observers. For example, the PROMISE12 prostate MRI segmentation challenge used two reference standards (illustrated in Fig. 1): a high-quality reference standard manually segmented by one experienced clinical reader and verified by another independent clinical reader, and a low-quality reference standard segmented by a less experienced non-clinical observer. An alternative approach is to estimate a high-quality reference standard by combining independent segmentations from multiple observers using algorithms such as STAPLE (Warfield et al., 2004) and SIMPLE (Langerak et al., 2010). A third approach is to mitigate the errors in a lower-quality reference standard by increasing the sample size (Konyushkova et al., 2015; Top et al., 2011; Maier-Hein et al., 2014; Irshad et al., 2015). All three of these approaches, however, raise the cost of generating the reference standard, both logistically and economically.
Fig. 1

Left: Illustrative prostate MRI segmentations from the PROMISE12 prostate segmentation challenge (Litjens et al., 2014b) by two algorithms – A (blue) and B (yellow) – and the two manually contoured reference standards – L (red) which is of lower quality and H (green) that is of higher quality. Compared to H, L oversegmented anteriorly where image information was ambiguous, affecting accuracy measurements of A and B using L. Right: Harder apical segmentations showing regions containing voxels with different combinations of segmentation labels ABLH (overbar denotes negative classifications). The statistical model underlying the derived sample size formula for segmentation evaluation studies is derived from probability distributions of these voxel-wise segmentation labels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

There are clear trade-offs between the sample size of the study, the cost of generating the reference standard, and the reference standard quality. The optimal balance of these trade-offs depends on the relationship between the study design parameters and statistical power. However, standard power calculation formulae do not, in general, account for the quality of reference standard segmentations. Thus, there is a need for new formulae to quantify these relationships. As a first step towards this goal, this paper presents a new sample size calculation relating statistical power to the quality of a reference standard (measured with respect to a higher-quality reference standard). Such a formula can answer key questions in study design: How many validation images are needed to evaluate a segmentation algorithm? How accurate does the reference standard need to be?
In preliminary work (Gibson et al., 2015), we derived a relationship between statistical power and the quality of a reference standard for a simplified model that cannot account for correlation between voxels, and made a strong assumption that the reference and algorithm segmentation labels are conditionally independent given the high-quality reference standard. In the present paper, we build on our initial work to develop a generalized model that takes into account the correlation between voxels and the statistical dependence between algorithms and reference standards observed in segmentation studies. The remainder of this paper outlines the derivation (Section 2.3), application (Sections 3 and 6) and validation (Sections 4 and 5) of a statistical power formula for image segmentation. Insights and heuristics derived from the formula and its validation, as well as limitations of the work, are discussed in Section 7. Appendix A and Appendix B present mathematical details of the derivations.

Sample size calculations in segmentation evaluation studies

The probability of a study correctly detecting a true effect depends in part on the sample size. A study with a sample size that is too small has a higher risk of missing a meaningful underlying difference, while one with a sample size that is too large may be more expensive than necessary. Sample size calculations relate the probability of a study correctly detecting a true effect to specified and estimated parameters of the study design (Mace, 1964). The sample size depends on the probability distribution of the test statistic under the null and alternate hypotheses. This distribution, in turn, depends on the statistical analysis being performed and on an assumed statistical model of the studied population. We derive a sample size calculation for a specific analysis: comparing the mean segmentation accuracy — i.e. the proportion of voxels in an image that match the reference standard L — of two algorithms A and B that generate binary classifications of v voxels on n images using a paired Student’s t-test (Rosner, 2015) on the per-image accuracies. Specifically, this tests the null hypothesis that the mean segmentation accuracies of A and B (both measured by comparison to L) are equal against the alternative hypothesis that they are unequal. Paired t-test analyses such as this one are frequently performed in comparisons of segmentation accuracy (Caballero et al., 2014).
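As a concrete illustration of the analysis being powered, the sketch below computes per-image accuracies for two simulated algorithms against a simulated reference standard and applies a paired t-test. All arrays and match rates are invented for illustration, and numpy/scipy are assumed to be available; this is a sketch of the analysis, not the authors' code.

```python
import numpy as np
from scipy import stats

# Invented example: n images, v voxels each, with binary labels.
rng = np.random.default_rng(0)
n, v = 20, 1000
L = rng.integers(0, 2, size=(n, v))                # reference standard labels
A = np.where(rng.random((n, v)) < 0.90, L, 1 - L)  # A matches L on ~90% of voxels
B = np.where(rng.random((n, v)) < 0.85, L, 1 - L)  # B matches L on ~85% of voxels

# Per-image accuracies: proportion of voxels matching the reference standard.
acc_A = (A == L).mean(axis=1)
acc_B = (B == L).mean(axis=1)

# Paired t-test on the per-image accuracies
# (null hypothesis: equal mean accuracy for A and B).
t_stat, p_value = stats.ttest_rel(acc_A, acc_B)
```

Because the accuracies of A and B are measured on the same images, the paired (rather than two-sample) test is the appropriate analysis.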

Notation

Throughout this paper, we use the notation given in Table 1. Symbols used in this paper are summarized in Table 2.
Table 1

Notation for mathematical symbols.

Type | Notation
Segmentation algorithms | X (upper case non-italic)
Random variables and vectors | X (upper case)
Realizations of random variables and constants | x (lower case)
Vectors | x (arrow accent); ⟨x, y⟩ (angle brackets)
Estimates | x̂ (circumflex accent)
Parameterized distributions | X ∼ X(θ) (bold capital with parameters in parentheses)
Expectation of X | E[X]
Conditional expectation of X given Z | E(X|Z)
Conditional variance of X given Z | σ²_{X|Z}
Conditional covariance of X and Y given Z | cov(X, Y|Z)
Event X = 1 | x (bold lower case)
Event X = 0 | x̄ (bold lower case with bar)
Table 2

Glossary of mathematical symbols.

Symbol | Support | Description
Experimental parameters
n | ℕ | Sample size
v | ℕ | Number of voxels per image
α | ℝ | Significance threshold (acceptable Type I error)
β | ℝ | 1 − power (acceptable Type II error)
δ_MDD | [−1, 1] | Minimum difference to detect with specified power
Population parameters
p | [0, 1]³ | Population average marginal probabilities for the per-voxel accuracy difference
δ | [−1, 1] | Population accuracy difference
ψ | [0, 1] | Probability that A and B disagree on a voxel label
δ_H | [−1, 1] | Population accuracy difference measured against high-quality reference standard H
p(a), p(b), p(l), p(h) | [0, 1] | Probabilities of voxel labels being 1 for a randomly selected voxel
ρ_{i,j} | [−1, 1] | Correlation between D_{k,i} and D_{k,j} given O_k
ρ̄_{i,j} | [0, 1] | Average of ρ_{i,j} over all voxel pairs i and j
σ²_{O_1−O_{−1}} | [0, ψ − δ²] | Variance of the accuracy difference in the marginal probability prior
ω | ℝ⁺ | Precision parameter of the Dirichlet distribution controlling inter-image variability
Random variables
A_{k,i}, B_{k,i}, L_{k,i}, H_{k,i} | {0, 1} | Segmentation label for the ith voxel in the kth image
O_k | [0, 1]³ | Per-image prior on the average marginal probability
O_{k,i} | [0, 1]³ | Per-voxel prior on the marginal probability
D_k | {−1, 0, 1}^v | Vector of per-voxel accuracy differences for the kth image
D_{k,i} | {−1, 0, 1} | Difference in accuracy for the ith voxel of the kth image
D | {−1, 0, 1} | Difference in accuracy for a random voxel
D̄_k | [−1, 1] | Per-image accuracy difference
Simulation variables
Dist_{i,j} | ℝ⁺ | Distance between voxels i and j
σ_ρ | ℝ⁺ | Scaling parameter to control spatial correlation in Monte Carlo simulations
d̄_k | [−1, 1] | Per-image accuracy difference of a simulated image
d_{k,i} | {−1, 0, 1} | Per-voxel accuracy difference of a simulated voxel
Other notation
p_1, p_0, p_{−1} | [0, 1] | Elements of p for the values 1, 0 and −1
O_{k,1}, O_{k,0}, O_{k,−1} | [0, 1] | Elements of O_k for the values 1, 0 and −1
O_{k,i,1}, O_{k,i,0}, O_{k,i,−1} | [0, 1] | Elements of O_{k,i} for the values 1, 0 and −1
A, B, L, H | | Segmentation sources denoting two algorithms, a low-quality and a high-quality reference standard
f | | Design factor
t_p^{(1)}, t_p^{(2)} | ℝ | One- and two-tailed critical values at probability p from a t-distribution
σ²_0 | [0, 2] | Per-image accuracy difference variance under the null hypothesis
σ²_alt | [0, 2] | Per-image accuracy difference variance under the alternative hypothesis

[x, y] denotes the real numbers between x and y; {x, y, z} denotes a set of possible values; a superscript x denotes a vector with x elements; ℕ denotes the natural numbers; ℝ denotes the real numbers; ℝ⁺ denotes the positive real numbers.


Statistical model of segmentation

Our stochastic population model represents the joint distribution of possible segmentations by A, B, and L over a population of images. The data for one image from this population comprise the binary segmentation labels (encoded as integers 0 or 1) assigned by A, B and L to each of the v voxels, {(a_{k,i}, b_{k,i}, l_{k,i}) : i = 1, …, v}, where a_{k,i}, b_{k,i} and l_{k,i} are the labels for the ith voxel in the kth image. The data for a study comprise n randomly sampled images, which we denote with a set of random variables {(A_{k,i}, B_{k,i}, L_{k,i})}, where A_{k,i}, B_{k,i} and L_{k,i} are the random variables representing the labels for the ith voxel in the kth randomly sampled image.

Accuracy difference measures

We focus on three types of segmentation accuracy differences. First, the per-voxel segmentation accuracy difference for the ith voxel in the kth image is D_{k,i} = [A_{k,i} = L_{k,i}] − [B_{k,i} = L_{k,i}], where [·] equals 1 when the bracketed equality holds and 0 otherwise. D_{k,i} can take on three values: 1 (when A_{k,i} = L_{k,i} and B_{k,i} ≠ L_{k,i}), 0 (when A and B are both correct or both incorrect) and −1 (when A_{k,i} ≠ L_{k,i} and B_{k,i} = L_{k,i}). Random vector D_k represents all D_{k,i} for the kth image. Second, the per-image accuracy difference is the proportion of correct voxel labels from algorithm A (with respect to reference standard L) minus the proportion of correct voxel labels from algorithm B (with respect to reference standard L): D̄_k = (1/v) Σ_i D_{k,i}. Third, the population average accuracy difference δ is the expected value of D̄_k for a randomly selected image in the population or, equivalently, of the per-voxel accuracy difference D for a randomly selected voxel.
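These three quantities can be computed directly from label arrays. A minimal numpy sketch, with invented labels for two images of five voxels (rows are images, columns are voxels):

```python
import numpy as np

# Invented labels for 2 images of 5 voxels.
A = np.array([[1, 1, 0, 0, 1], [0, 1, 1, 0, 0]])
B = np.array([[1, 0, 0, 1, 1], [0, 1, 0, 0, 1]])
L = np.array([[1, 1, 0, 1, 1], [0, 0, 1, 0, 0]])

# Per-voxel accuracy differences D_{k,i} in {-1, 0, 1}.
D = (A == L).astype(int) - (B == L).astype(int)

# Per-image accuracy differences D-bar_k.
D_bar = D.mean(axis=1)

# Sample estimate of the population average accuracy difference delta.
delta_hat = D_bar.mean()
```

For the invented arrays above, the two per-image differences are 0.0 and 0.4, giving an estimated delta of 0.2.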

Model distribution

For calculating power, the model (summarized in Table 3 and illustrated in Fig. 2) must encode the distribution of the metric analysed in the statistical analysis: the per-image accuracy difference D̄_k. While D̄_k depends on all three segmentations A, B and L, it can be expressed more simply as a function of D_k. Therefore, we consider the distribution of D_k directly, modeled as a v-dimensional correlated categorical distribution. To model this distribution, we follow the common convention of breaking a complex joint distribution down into the mean and multiple simpler sources of variation about the mean.
Table 3

Model summary. These expressions summarize the nested model used in our derivations. The motivation and detailed description is given in Section 2.2.2.

O_k ∼ P(p), where E[O_k] = p
∀i: O_{k,i} ∼ O(O_k), where E(O_{k,i} | O_k) = O_k
∀i: D_{k,i} ∼ Categorical(O_{k,i})
∀i ≠ j: cov(D_{k,i}, D_{k,j} | O_k) = ρ_{i,j} √(σ²_{D_{k,i}|O_k} σ²_{D_{k,j}|O_k})
Fig. 2

The illustrated nested model shows, from left to right: (1) the prior distribution of per-image average marginal probabilities (shown on the triangular (standard 2-simplex) domain, with axes O_{k,1} and O_{k,−1} shown and O_{k,0} implicitly defined as 1 − O_{k,1} − O_{k,−1}; darkness represents the probability density); (2) three samples (i.e. three images) of per-image average marginal probabilities (shown as labelled arrows); (3) the three corresponding conditional prior distributions of per-voxel marginal probabilities for those images (shown as in (1)); (4) nine samples (i.e. nine voxels from the second image) of per-voxel marginal probabilities (shown as unlabelled arrows); and (5) the categorical distributions for the nine voxels from the second image (shown as pie charts of the relative probabilities of the three per-voxel accuracy difference values). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The mean of D_k is defined by the joint distribution of the segmentation labels. Considering the joint distribution is important because the algorithm and reference standard labels for a randomly selected voxel (A, B and L) may not be independent of each other, as they depend on the same image information and on overlapping prior knowledge. The mean of D_k therefore encodes the inter-segmentation correlation in the population average marginal probabilities p = (p_1, p_0, p_{−1}) of the per-voxel accuracy difference D (marginalized over the combinations of segmentations A, B and L yielding each difference value). For example, when A and B are highly correlated, p_0 is higher, and when A and L are highly correlated, p_1 increases while p_{−1} decreases. We consider the population average marginal probabilities p as a model parameter.
The variation of D_k about the mean is affected by three sources of variation:

(1) Intra-image inter-voxel correlation: two voxels in the same image may have correlated labels if, for example, they are adjacent or are commonly affected by the same image artifact.
(2) Inter-image variability: the expected segmentation performance may vary between images, as one image may have features that are more or less challenging for a particular algorithm or observer than another image.
(3) Inter-voxel variability: two voxels in the same image may have different marginal probabilities depending on the image content; for example, voxels that are easy to segment would likely receive the same label from any algorithm, whereas more challenging voxels are more likely to show differences.

Both the inter-image variability and the intra-image inter-voxel correlation affect the covariance matrix of D_k. While the covariance matrix could be an explicit model parameter, interpreting such a parameter is challenging because it conflates these different sources of correlation. Instead, we construct an over-parameterized nested model that represents inter-image variability and intra-image inter-voxel correlation separately. The key concept in this nested model is to introduce per-image priors (random variables O_k) on the average marginal probability of D within each image, in order to model inter-image variability. P(p) is a distribution of probability vectors (i.e. over the open standard 2-simplex) with mean p. Then, for each image, the conditional distribution of D_k given O_k models the intra-image inter-voxel correlation. Specifically, we define the conditional covariance of D_{k,i} and D_{k,j} given O_k as cov(D_{k,i}, D_{k,j} | O_k) = ρ_{i,j} √(σ²_{D_{k,i}|O_k} σ²_{D_{k,j}|O_k}), where ρ_{i,j} is a pair-wise Pearson correlation coefficient and σ²_{D_{k,i}|O_k} is the conditional variance of D_{k,i} given O_k. To model the inter-voxel variability, each D_{k,i} has a per-voxel prior (random variable O_{k,i}) defining its marginal probabilities.
The conditional distribution of O_{k,i} given O_k is an arbitrary distribution of probability vectors with mean O_k.
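A minimal Monte Carlo sketch of sampling from this nested model, assuming a Dirichlet per-image prior and, for simplicity, collapsing the per-voxel priors and setting the intra-image correlation to zero; all parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Population-average marginal probabilities p = (p_1, p_0, p_-1) for D in {1, 0, -1},
# so delta = p_1 - p_-1 = 0.02 and psi = p_1 + p_-1 = 0.10.
p = np.array([0.06, 0.90, 0.04])
omega = 50.0    # Dirichlet precision: larger means less inter-image variability
n, v = 30, 500

d_bar = np.empty(n)
for k in range(n):
    # Per-image prior O_k ~ Dirichlet(omega * p), so E[O_k] = p.
    O_k = rng.dirichlet(omega * p)
    # Per-voxel accuracy differences D_{k,i} ~ Categorical(O_k); simplification:
    # voxels are conditionally independent given O_k (i.e. rho = 0).
    d = rng.choice(np.array([1, 0, -1]), size=v, p=O_k)
    d_bar[k] = d.mean()
```

The sample mean of `d_bar` estimates delta (0.02 here), while its spread reflects both the per-voxel sampling noise and the inter-image variability controlled by omega.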

Derivation of the sample size formula for segmentation

The general form of the sample size formula (Connor, 1987),

    n = ((t_α^{(2)} σ_0 + t_β^{(1)} σ_alt) / δ_MDD)²,    (3)

relates the sample size (n) to the variances (σ²_0 and σ²_alt) of per-image accuracy differences under the null hypothesis (δ = 0) and alternate hypothesis (δ ≠ 0), the acceptable study error rates (α and β), and the minimum detectable difference (δ_MDD) in population accuracy between algorithms A and B to detect with power 1 − β. t_α^{(2)} and t_β^{(1)} are two- and one-tailed critical values taken from the inverse cumulative distribution function of the t-distribution with n − 1 degrees of freedom. Of the parameters in Eq. (3), most are selected based on experimental design choices, but the variances of the per-image accuracy difference are derived from the statistical model. The variance of the per-image accuracy difference can be derived for any prior distribution of per-image average marginal probabilities O_k in terms of moments of the prior distribution by marginalizing out O_k and O_{k,i} (see Appendix A for a detailed derivation), yielding

    σ²_{D̄} = σ²_{O_1−O_{−1}} + (1 + (v − 1) ρ̄_{i,j}) (ψ − δ² − σ²_{O_1−O_{−1}}) / v,    (4)

where ψ = p_1 + p_{−1} is the population-wide probability that algorithms A and B disagree on the labeling of a voxel (see Fig. 3), σ²_{O_1−O_{−1}} (the variance of O_{k,1} − O_{k,−1} over the priors O_k) is a linear combination of moments of the prior distribution, and ρ̄_{i,j} is the average of the intra-image inter-voxel correlation coefficients.
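Because the t critical values depend on the degrees of freedom, which themselves depend on n, the formula is solved by fixed-point iteration in practice. A sketch assuming the standard Connor (1987) form n = ((t2·σ_0 + t1·σ_alt)/δ_MDD)² with n − 1 degrees of freedom; the variance values are invented and scipy is assumed available:

```python
import math
from scipy.stats import t as t_dist

def sample_size(delta_mdd, sigma0_sq, sigma_alt_sq, alpha=0.05, beta=0.2):
    """Solve n = ((t2 * sigma_0 + t1 * sigma_alt) / delta_MDD)^2 by fixed-point
    iteration, with t critical values evaluated at n - 1 degrees of freedom."""
    n = 10  # starting guess
    for _ in range(100):
        t2 = t_dist.ppf(1 - alpha / 2, n - 1)  # two-tailed critical value
        t1 = t_dist.ppf(1 - beta, n - 1)       # one-tailed critical value
        n_new = math.ceil(((t2 * math.sqrt(sigma0_sq)
                            + t1 * math.sqrt(sigma_alt_sq)) / delta_mdd) ** 2)
        if n_new == n:
            return n
        n = max(n_new, 3)
    return n

# Invented variances for illustration:
n_required = sample_size(delta_mdd=0.02, sigma0_sq=0.0020, sigma_alt_sq=0.0019)
```

Starting from a small guess, the iteration typically converges in a handful of steps; the ceiling keeps n integral and conservative.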
Fig. 3

Illustration of the relationship between the proportion of disagreement (ψ) and the accuracy difference (δ). In these four examples, segmentation algorithms A (blue) and B (yellow) both over-contour the circular object taken as the reference standard segmentation L (red), adding different perturbations that lower accuracy. When sets of segmentations have higher ψ and lower δ (as in the lower right), it is harder to detect accuracy differences. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Substituting σ²_0 and σ²_alt (i.e. substituting δ = 0 and δ = δ_MDD into Eq. (4)) into Eq. (3) yields the segmentation sample size formula for accuracy differences with respect to reference standard L (Eq. (5)). It is interesting to note that when there is no inter-voxel correlation (i.e. ρ̄_{i,j} = 0) and no inter-image variability in the marginal probabilities (i.e. σ²_{O_1−O_{−1}} = 0), Eq. (5) approaches the sample size formula for McNemar’s two-sample paired proportion test with nv samples (Connor, 1987).

Sample size with the Dirichlet prior distribution

To gain further insight into the sample size relationship, consider the special case where the prior distribution of per-image average marginal probabilities is a Dirichlet distribution (i.e. O_k ∼ Dirichlet(ωp)), which represents inter-image variability with a single parameter: the precision ω (Minka, 2000). When ω is large, priors are likely to be near p (i.e. there is little variation between images); when ω is small, priors are distributed more diffusely (i.e. there is more variation between images). The Dirichlet prior distribution has three properties that make interpretation of the sample size relationship easier:

(1) It is well-characterised as a model for variability in categorical probabilities, because it is the conjugate prior distribution of the categorical and multinomial distributions and is thus commonly adopted in Bayesian analysis (Tu, Mosimann, 1962, Zhu, Zöllei, Wells, 2006).
(2) Representing inter-image variability with a single parameter simplifies interpretation and facilitates parameter fitting with small pilot data sets.
(3) For the Dirichlet prior distribution, σ²_{O_1−O_{−1}} = (ψ − δ²)/(ω + 1), which is proportional to ψ − δ² and simplifies the sample size formula.

Substituting this variance into Eq. (4) and simplifying algebraically gives the variance of the per-image accuracy difference under a Dirichlet prior:

    σ²_{D̄} = (ψ − δ²) f / v,  where  f = (v + ω (1 + (v − 1) ρ̄_{i,j})) / (ω + 1).    (6)

Since σ²_{D̄} is expressed in terms of δ, we can readily substitute σ²_0 (at δ = 0) and σ²_alt (at δ = δ_MDD) into Eq. (3) to get the sample size formula

    n = f (t_α^{(2)} √ψ + t_β^{(1)} √(ψ − δ_MDD²))² / (v δ_MDD²).    (7)

Several aspects of this formula link to previous work. The term f is a type of design factor (analogous to the design factor in cluster-randomized trials (Kish, 1965)), modelling the inter- vs intra-image variability in accuracy differences (i.e. each image being one correlated cluster of voxel samples). When there is no inter-voxel correlation (i.e. ρ̄_{i,j} = 0), Eq. (7) simplifies to the formula found in our preliminary analysis (Gibson et al., 2015). The term ψ/δ_MDD² is the squared coefficient of variation of D under the idealized assumption of completely independent voxels (i.e. f = 1), or equivalently, reflects the statistical efficiency of estimating δ (Everitt and Skrondal, 2002). We thus refer to ψ/δ_MDD² hereafter as the idealized efficiency.
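The design-factor view can be made concrete in code. The sketch below assumes the cluster-sampling form f = (v + ω(1 + (v − 1)ρ̄))/(ω + 1) and substitutes the Dirichlet variances into the generic iterative calculation; these forms are reconstructions consistent with the derivation here, not quoted code, and all parameter values are invented:

```python
import math
from scipy.stats import t as t_dist

def design_factor(v, omega, rho_bar):
    """Assumed design factor f: combines inter-image variability (omega)
    and intra-image correlation (rho_bar) for clusters of v voxels."""
    return (v + omega * (1.0 + (v - 1) * rho_bar)) / (omega + 1.0)

def dirichlet_sample_size(delta_mdd, psi, v, omega, rho_bar,
                          alpha=0.05, beta=0.2):
    """Sample size with sigma_0^2 = psi*f/v (null, delta = 0) and
    sigma_alt^2 = (psi - delta_mdd^2)*f/v (alternative, delta = delta_mdd)."""
    f = design_factor(v, omega, rho_bar)
    n = 10
    for _ in range(100):
        t2 = t_dist.ppf(1 - alpha / 2, n - 1)
        t1 = t_dist.ppf(1 - beta, n - 1)
        n_new = math.ceil(f * (t2 * math.sqrt(psi)
                               + t1 * math.sqrt(psi - delta_mdd ** 2)) ** 2
                          / (v * delta_mdd ** 2))
        if n_new == n:
            return n
        n = max(n_new, 3)
    return n

# Invented parameters: more intra-image correlation inflates the design
# factor and hence the required sample size.
n_uncorr = dirichlet_sample_size(0.02, 0.10, 1000, 50.0, 0.00)
n_corr = dirichlet_sample_size(0.02, 0.10, 1000, 50.0, 0.05)
```

Note the limiting behaviour: as ω grows with ρ̄ fixed, f approaches the classic cluster design factor 1 + (v − 1)ρ̄, and with ρ̄ = 0 as well, f approaches 1 (fully independent voxels).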

Incorporating reference standard quality

Conducting segmentation accuracy comparison studies using a lower-quality reference standard introduces an additional challenge: selecting the appropriate minimum detectable difference. On one hand, for the generic sample size formula (Eq. (3)) to be valid, δ_MDD must be measured with respect to the reference standard used in the study. On the other hand, the selection of δ_MDD depends on external clinical or technical requirements. Ideally, these requirements would be defined with respect to a high-quality reference standard H (with the MDD denoted δ_MDD^H), to most closely approximate the true requirement. If the high-quality reference standard can be used for the entire study, there is no conflict and δ_MDD^H can be used directly. If, however, a lower-quality reference standard is used, an appropriate δ_MDD^L needs to be selected. To resolve this dilemma, we have derived a formula to express δ_MDD^L for a low-quality reference standard as a function of δ_MDD^H, by characterizing the differences between the low- and high-quality reference standards (e.g. on a small pilot data set). The derivation, detailed in Appendix B, expresses δ^L in terms of the joint probability of the segmentation labels of A, B, L and H; isolates the terms of this expression that equate to δ^H; and simplifies the remaining terms. This yields an equation for δ^L as a function of δ^H and estimable parameters representing the deviation of δ^L from δ^H:

    δ^L = δ^H + 2 (p(a) − p(b)) (p(l) − p(h)) + 2 cov(A − B, L − H),    (8)

where p(a), p(b), p(l) and p(h) are the probabilities of the corresponding labels being 1 for a randomly selected voxel and cov(A − B, L − H) is the covariance between errors in L (with respect to H) and differences between A and B. The second term of this expression reflects error induced by over- or under-contouring by L (with respect to H): if L tends to over-contour compared to H, algorithms that assign more voxels as foreground will appear more accurate. The third term is the covariance reflecting errors in L that are biased in favour of A or B. This expression can be used to estimate the δ_MDD^L to use for a study using a low-quality reference standard.
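Once pilot estimates are in hand, the correction is simple arithmetic. The function below assumes Eq. (8) takes the reconstructed form δ^L = δ^H + 2(p(a) − p(b))(p(l) − p(h)) + 2·cov(A − B, L − H); all numbers are invented for illustration:

```python
def corrected_mdd(delta_mdd_h, p_a, p_b, p_l, p_h, cov_ab_lh):
    """Assumed form of Eq. (8): translate an MDD defined against the
    high-quality reference H into the equivalent MDD against L."""
    return delta_mdd_h + 2 * (p_a - p_b) * (p_l - p_h) + 2 * cov_ab_lh

# Invented pilot estimates: L over-contours slightly relative to H, and
# A labels slightly more foreground than B, so the MDD against L grows.
delta_l = corrected_mdd(delta_mdd_h=0.02, p_a=0.31, p_b=0.30,
                        p_l=0.26, p_h=0.25, cov_ab_lh=0.001)
# 0.02 + 2*(0.01)*(0.01) + 2*(0.001) = 0.0222 (up to float rounding)
```

When L is unbiased relative to H (equal marginals, zero covariance), the correction vanishes and δ_MDD^L equals δ_MDD^H.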

Applying the sample size formula

The sample size formula derived above supports the design of segmentation accuracy comparison studies by estimating the sample size needed to detect a specified accuracy difference with high probability. As with all sample size calculations, three types of parameters have to be determined to apply the formula: the acceptable study error rates, the minimum detectable difference, and the variance parameters. Some of these parameters are chosen based on experimental, technical or clinical requirements outside the study design, while others are estimated from the related literature or from pilot data. We denote the estimate of a parameter x as x̂.

The acceptable error rates are generally set heuristically by study designers: α = 0.05 (i.e. a 5% probability of falsely detecting a difference when there is none) and 1 − β = 0.8 (i.e. an 80% probability of detecting a true difference).

The minimum detectable difference (δ_MDD) is typically set by technical or clinical requirements outside the study design to be the smallest difference that is large enough to be important to detect with high probability. Specifically, if the true difference is δ_MDD or higher, the study should give a true positive with probability 1 − β or higher. If the study will use a sufficiently high-quality reference standard, δ_MDD can be chosen directly. If the technical or clinical requirements are expressed with respect to a high-quality reference standard, but the study uses a lower-quality reference standard, then δ_MDD^H can be chosen and the equivalent δ_MDD^L can be estimated from the low-quality correction equation (Eq. (8)), using the parameter estimation equations (Eqs. (9) and (10)) given in Section 3.1.

The variance parameters depend on the distribution of the data; they are not chosen a priori, but can be estimated using values from the related literature or using pilot data. In the moment-based sample size equation (Eq. (5)), the variance parameters are ψ, σ²_{O_1−O_{−1}} and ρ̄_{i,j}. In the Dirichlet-prior-based sample size equation (Eq. (7)), the variance parameters are ψ, ρ̄_{i,j} and ω. In general, estimating these variance parameters individually can be challenging, because the model is parameterized by multiple parameters that affect the inter-voxel covariance of per-voxel accuracy differences, and because the moments of the prior for the per-image average marginal probabilities may depend on δ. Under some assumptions, however, we can estimate the variance parameters. If we assume σ²_0 ≈ σ²_alt, which may be appropriate when δ and δ_MDD are sufficiently small, we can estimate the variance from the pilot data (using Eq. (13) in Section 3.1) and apply the generic sample size equation (Eq. (3)) directly. If we assume a parametric distribution for the per-image average marginal probabilities, it may be possible to express σ²_{D̄} in terms of δ (as shown for the Dirichlet distribution in Eq. (6)) and estimate the remaining parameters from σ̂². For the Dirichlet distribution, the resulting variance can be characterized by a design factor f modeling the combined effect of the parameters ρ̄_{i,j} and ω; an estimation equation for the design factor is given in Section 3.1 (Eq. (14)). If there is a need to estimate the effects of the variance parameters individually (e.g. to explore the effect of increased intra-image inter-voxel correlation on a planned study), and we assume that the intra-image inter-voxel correlation is spatially constrained (e.g. voxels separated by more than a specified distance are effectively uncorrelated given O_k), then we can estimate ω̂ using spatially sparse sampling and then estimate the average correlation coefficient from ω̂ and σ̂². This approach is outlined for a Dirichlet prior in Section 3.1.

The optimal size for a pilot study data set has not been well established in general, and depends on many factors (Hertzog, 2008), including the particular population being studied. In principle, the precision of the estimated sample size depends on the sensitivity of the formula to parameter estimation errors (see supplementary material) and on the variances of the parameter estimators (which decrease as the pilot data set grows), both of which vary depending on the population being studied. In practice, formal sample size calculations for such pilot studies are rarely used (Hertzog, 2008); instead, heuristics can be used, such as using 10 samples (Nieswiadomy, 2011), 12 samples (Julious, 2005) or, for larger studies, 10% of the anticipated size of the full study (Connelly, 2008; Lackey and Wingate, 1986). The risk of parameter estimation error can be mitigated by using conservative parameter estimates, as described in Section 3.1 for σ̂².

Parameter estimation equations

To estimate parameters from pilot data, a small data set of images must be collected and segmented by algorithms A and B, by the reference standard L to be used for the study, and by the high-quality reference standard H. Given a segmented pilot data set, the formula parameters can be estimated as follows. To estimate in terms of δ, we first estimate the proportion of positive voxels segmented by A across all images in the pilot data, where n′ is the number of images in the pilot data set; and can be estimated similarly. Then, from Eq. (8), . The probability of disagreement can be estimated using the sample mean, the population average accuracy difference can be estimated using the sample mean, and the variance in per-image accuracy differences can be estimated using the unbiased sample variance.

However, sample variance estimates from small pilot studies are imprecise and skewed (Browne, 1995), which inflates the probability of an underpowered study. To mitigate this effect, Browne (1995) recommended using the upper bound of a γ% confidence interval on the variance to guarantee the specified power with γ% probability. This can be estimated using a double bootstrap method (e.g. Lee and Young, 1995; implemented for Matlab as ibootci (Penn, 2015)).

When modeling the per-image marginal probability prior as a Dirichlet distribution, the design factor encoding the combined effect of parameters and ω can be estimated from Eq. (6) using sample estimates, and the idealized efficiency can be estimated as . To estimate the effects of the variance parameters individually, we can model the per-image marginal probability prior as a Dirichlet distribution and assume that the intra-image inter-voxel correlation is spatially constrained (i.e. voxels more than x voxels apart are effectively uncorrelated given ). Sampling d from voxels spaced x voxels apart gives counts from a Dirichlet-multinomial distribution, and we can estimate the precision parameter ω using an iterative approach described by Minka (2000). The average correlation coefficient can then be estimated from Eq. (6) using sample estimates.
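The basic estimators described above (sample means for the accuracy difference and the probability of disagreement, the unbiased sample variance, and Browne's conservative variance bound) can be sketched as follows. The function and argument names are hypothetical, and a single percentile bootstrap stands in for the double bootstrap (ibootci) recommended in the text.

```python
import random
from statistics import mean, variance

def estimate_parameters(acc_a, acc_b, disagreement):
    """Estimate, from a segmented pilot data set: the population average
    accuracy difference (sample mean of per-image differences), the
    probability of disagreement (sample mean) and the variance of
    per-image accuracy differences (unbiased sample variance).
    acc_a, acc_b: per-image accuracies of algorithms A and B against the
    reference standard; disagreement: per-image proportions of voxels on
    which A and B disagree. Argument names are illustrative."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    return mean(d), mean(disagreement), variance(d)

def conservative_variance(d, gamma=0.80, n_boot=5000, seed=0):
    """Upper end of a gamma-level percentile bootstrap interval on the
    sample variance, following Browne's recommendation to guard against
    underpowered studies. A single percentile bootstrap is sketched
    here; the text suggests a double bootstrap (e.g. ibootci)."""
    rng = random.Random(seed)
    boots = sorted(variance(rng.choices(d, k=len(d))) for _ in range(n_boot))
    return boots[min(int(gamma * n_boot), n_boot - 1)]
```

The conservative variance estimate would then be substituted into the sample size formula in place of the raw sample variance.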

Simulations

Three sets of Monte Carlo simulations were used to evaluate the accuracy of the sample size formulae under three different conditions: with simulated images and segmentations from the assumed statistical model, to test the validity of the model; with real-world data (the PROMISE12 prostate MRI segmentation data set described in Section 4.2.1) using a high-quality reference standard, to test the applicability of the Dirichlet-based sample size formula (Eq. (7)) to real data; and with real-world data using a low-quality reference standard while expressing the minimum detectable difference in terms of a high-quality reference standard, to test the applicability of the low-quality correction equation (Eq. (8)) to real data.

Simulations with simulated data from the assumed statistical model

In order to characterize the validity of the model described in Section 2.2, we performed sets of simulations with controlled variation of a subset of model parameters (hereafter referred to as a simulation set). Recall that Eq. (7) defines the sample size needed to detect a significant accuracy difference with probability if the underlying population difference were δ. To test this, we set δ to the specified population accuracy difference and compared the proportion of simulated studies yielding significant accuracy differences to . Note that this approach to selecting δ is appropriate for validating the sample size formula, but not for designing real segmentation comparison studies: in practice, δ should be chosen based on clinical or technical requirements. In each simulation, we repeatedly simulated a segmentation evaluation study by sampling per-voxel accuracy differences for ⌈n⌉ v-voxel segmentations and reference standards (where ⌈n⌉ denotes the smallest integer ≥ n) using the assumed model, and testing for an accuracy difference using a Student’s t-test. We compared the observed proportion of positive statistical tests with the predicted probability (i.e. the statistical power) for sample size ⌈n⌉. To clarify the impact of this error in power, we also substituted the observed power into the Dirichlet-based sample size formula (Eq. (7)) to calculate the equivalent error in the predicted sample size n and detectable difference δ. Each simulation comprised 25,000 repetitions, sufficient to estimate the probability of a positive outcome with a 95% confidence interval of width 1%.

Each per-image accuracy difference was computed by sampling the derived per-voxel accuracy differences d directly: the per-image marginal probabilities of per-voxel accuracy differences were drawn from a Dirichlet prior using the rdirichlet function (Warnes et al., 2015) in R version 3.1.1 (R Core Team, 2013); a correlation matrix was constructed in which the correlation decays with the intervoxel distance Dist in the image, with σ a scale parameter controlling the spatial extent of the correlation; and d was sampled using the ordsample function (Barbiero and Ferrari, 2015) in R. While this is equivalent to drawing samples from the algorithm and reference standard segmentations and computing d, it facilitates the direct control of the correlation matrix of d needed in these experiments. The scripts used to generate these samples are available at https://github.com/eligibson/MedIA2016.

The baseline parameter values in the simulation sets and the ranges of varied parameters are given in Table 4. Note that the simulations varying v, ω, σ and ψ were conducted at two baseline δ values. The parameter ranges for these simulations were chosen to balance the applicability of parameter values to medical image segmentation problems against practical constraints. The range of ω encompassed both highly consistent and highly variable prior distributions. The ranges of δ and ψ reflected plausible algorithm differences based on previous experience. Due to limitations of the ordsample algorithm, the ranges of v and σ were constrained: v was limited to 100 because of the computational complexity of sampling high-dimensional correlated discrete random variables, and σ was constrained to 0.7 because of algorithmic constraints. The baseline parameter values were chosen to reflect typical sample sizes in segmentation studies. Because the population parameters derived in Section 2.4 (δ(a), p(b), p(l), p(h)) are linked to statistical power through their influence on the parameter δ, simulations were run as a function of δ, instead of simulating many combinations of parameters that map to the same δ.
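The sampling procedure can be approximated outside R. The sketch below is our illustration of the simulation structure, not a reproduction of the released scripts: it uses a Gaussian copula in place of ordsample and assumes an exponential-decay correlation exp(-Dist/σ), both of which are our assumptions.

```python
import numpy as np
from math import erf, sqrt

def simulate_per_image_differences(n_img, v, delta, psi, omega, sigma, seed=0):
    """Simulate per-image average accuracy differences under (a sketch
    of) the assumed model. Per-voxel accuracy differences d take values
    +1, -1 or 0, with per-image marginal probabilities drawn from a
    Dirichlet prior with precision omega; intra-image spatial
    correlation is imposed with a Gaussian copula on a square voxel
    grid. The exponential correlation exp(-dist/sigma) and the copula
    (standing in for the paper's ordsample step) are our assumptions."""
    rng = np.random.default_rng(seed)
    side = int(round(sqrt(v)))
    coords = np.array([(i, j) for i in range(side) for j in range(side)], float)
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    corr = np.exp(-dists / sigma) if sigma > 0 else np.eye(len(coords))
    chol = np.linalg.cholesky(corr + 1e-9 * np.eye(len(coords)))
    # Dirichlet base measure for (P(d=+1), P(d=-1), P(d=0)); requires psi >= |delta|
    base = np.array([(psi + delta) / 2, (psi - delta) / 2, 1 - psi])
    d_bar = np.empty(n_img)
    for k in range(n_img):
        p = rng.dirichlet(omega * base)              # per-image marginals
        z = chol @ rng.standard_normal(len(coords))  # correlated N(0,1) latents
        u = np.array([0.5 * (1 + erf(x / sqrt(2))) for x in z])  # uniform marginals
        d = np.where(u < p[0], 1.0, np.where(u < p[0] + p[1], -1.0, 0.0))
        d_bar[k] = d.mean()                          # per-image accuracy difference
    return d_bar
```

A simulated study then applies a one-sample t-test to ⌈n⌉ such per-image differences and records whether the accuracy difference is significant; repeating this many times yields the simulated power.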
Table 4

Simulation parameters used to estimate the accuracy of the model. Note that the simulations varying v, ω, σ and ψ were conducted twice at two baseline δ values.

            # voxels   population accuracy   Dirichlet   spatial correlation   population probability
                       difference            precision   width                 of disagreement
            v          δ                     ω           σ (ρ)                 ψ
Baseline    36         3% / 6%               128         0.7                   15%
Minimum     9          2%                    64          0                     15%
Maximum     100        10%                   1024        0.7                   45%
Increment   +1         +1%                   ×2          +0.1                  +5%

Simulations with real-world data

To evaluate the applicability of sample size formula (Eq. (7)) and the low-quality correction equation (Eq. (8)) to a real-world data set, we simulated segmentation accuracy comparison studies using bootstrapped samples from the PROMISE12 data set. The PROMISE12 challenge is an ongoing resource for comparing many state-of-the-art prostate segmentation algorithms against a common reference standard. The challenge images comprise 100 T2W prostate MR images collected from 4 centres, split into 50 training images (with publicly available reference segmentations) and 30 testing images (with reference segmentations withheld). The reference segmentations were manually segmented by an experienced clinical reader, and verified by another independent clinical reader. In order to establish a standardised scoring system for multiple metrics, the challenge had a non-clinical graduate student manually segment the images and her metric scores were used to normalize the metric scores of the algorithms. Although the PROMISE12 challenge principally used the high-quality reference standard for evaluation, the second segmentation is analogous to a presumably lower-quality reference standard that could be considered as a lower cost option. Thus, the clinical manual segmentations will represent the high-quality reference standard H, the graduate student manual segmentations will represent the low-quality reference standard L, and two algorithms from the challenge will represent A and B. Using 10 algorithms from the PROMISE12 challenge, the simulations were repeated for all 45 possible pairs of algorithms. As in Section 4.1, we set δ to the population accuracy difference (treating the PROMISE12 test data set as the entire population) and compare the proportion of simulated studies yielding significant accuracy differences to .

Simulations with high-quality real-world data

To evaluate the applicability of the Dirichlet-based sample size formula (Eq. (7)) to a real-world data set, each simulated study in this experiment compared two algorithms to the high-quality reference standard. For every pair of algorithms, we estimated the population accuracy difference and variance parameters using all 30 test cases from the PROMISE12 test data set. From these estimates, we computed the predicted sample size n using Eq. (7). We then simulated 100,000 segmentation accuracy comparison studies using bootstrap sampling, sampling ⌈n⌉ images with replacement from the PROMISE12 images and testing the per-image accuracy differences using a paired Student’s t-test. We compared the proportion of positive tests to the power predicted by the model for ⌈n⌉ samples.
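The bootstrap experiment described above can be sketched as follows. This is our illustration rather than the study code: the t critical value uses a first-order normal-based approximation, t ≈ z + (z³ + z)/(4·df), rather than an exact quantile.

```python
import numpy as np
from statistics import NormalDist

def bootstrap_power(diffs, n, n_studies=10000, alpha=0.05, seed=0):
    """Estimate statistical power by bootstrap: simulate `n_studies`
    comparison studies, each drawing n per-image accuracy differences
    with replacement from `diffs` (e.g. differences observed on a
    challenge data set) and applying a two-sided one-sample t-test,
    which is equivalent to the paired t-test on per-image differences.
    The t critical value uses the approximation
    t ~ z + (z^3 + z) / (4 (n - 1)) instead of an exact quantile."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    t_crit = z + (z ** 3 + z) / (4 * (n - 1))
    samples = rng.choice(np.asarray(diffs, float), size=(n_studies, n))
    means = samples.mean(axis=1)
    ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
    # Guard against degenerate resamples with zero variance
    t = np.divide(means, ses, out=np.zeros_like(means), where=ses > 0)
    return float(np.mean(np.abs(t) > t_crit))
```

The returned proportion of positive tests is the simulated power, which can then be compared directly to the power predicted by the model for the same n.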

Simulations with low-quality real-world data

To evaluate the applicability of the low-quality correction equation (Eq. (8)) to a real-world data set, each simulated study in this experiment compared two algorithms to the low-quality reference standard, with calculated from Eq. (8) and the observed . Simulation using bootstrap sampling and evaluation proceeded as in Section 4.2.1 except that and the variance parameters were estimated with respect to low-quality reference standard L.

Results

Simulations under the statistical model

The variance of accuracy differences predicted by the model was within 2% relative error of the Monte Carlo simulations across all simulation sets (RMS relative error 0.5%). The predicted power was within 4% error (simulated − predicted power) of the Monte Carlo simulations across all simulation sets with 95% confidence. Fig. 4 shows the absolute error in the predicted power (i.e. simulated − model power) under varying model parameters. The parameter with the largest impact on the accuracy of the power prediction was δ. For simulations at the two baseline δ values, the predicted power was within 2% and 3% absolute error, respectively, of the simulations with 95% confidence. A larger positive bias in the power prediction error across all values of v, ω and σ was observed for simulations at the larger baseline δ than at the smaller, suggesting that the positive bias can be primarily attributed to the baseline accuracy difference. The largest absolute error was 4%.
Fig. 4

Model accuracy (95% confidence interval (shown in red for baseline and in cyan for baseline ) on the absolute difference between the simulated and model power) for each simulation set. For example, with the model predicted 82% power, 4% below the 86% power observed in the simulation. Each accuracy graph shows a blue line representing the expected error due to the observed skew alone (for the simulation varying δ and the baseline ) based on applying the regular t-test sample size formula to a skewed Pearson distribution. The similar shape of this curve to the observed errors suggests that the skew is a considerable contributor to the error. The histogram (lower right) shows the distribution of accuracy differences for the simulation with illustrating the slight but significant skew in the distribution, which contributes to the observed error. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

A proportion of the observed error can be attributed to skew in the distribution of per-image accuracy differences, deviating from the normality assumption of the t-test used in this work. The largest skew amongst our experiments (corresponding to the largest power prediction error) occurred when ; this is illustrated in a histogram of the accuracy differences, shown in Fig. 4. The effect of the deviation from normality is exacerbated in the simulations with large δ due to the lower sample size, for which the t-test is more sensitive to violations of its assumptions. To illustrate the expected impact of skew alone on the error in predicted power, Fig. 4 shows the error of the standard paired t-test power calculation for a correspondingly skewed population (a Pearson distribution with skew matching the simulation) overlaid in blue. The impact of these errors in predicted power on the sample size and minimum detectable difference is illustrated in Figs. 5 and 6.
Fig. 5

The equivalent error in predicted sample size (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red for baseline and in cyan for baseline ) on the absolute difference between the sample size needed to achieve the simulated power and the sample size needed to achieve the modeled power. For example, with the model would overestimate by 1 the number of subjects needed to achieve the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6

The equivalent error in predicted minimum detectable difference (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red for baseline and in cyan for baseline ) on the absolute difference between the minimum difference detectable with simulated power and the minimum difference detectable with the modeled power. For example, with the model would predict that a minimum detectable difference of 10.5% would result in the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Simulations with high-quality real-world data

When the minimum detectable difference was defined and tested relative to the high-quality reference standard in the PROMISE12 data set, the simulated power was  < 4% higher than the power specified by the model (approximately 80%) for the majority of algorithm comparisons (range 0–20%). The error was strongly correlated with the skew of per-image accuracy differences in the population (Spearman’s ; ). The model did not over-estimate the power in any comparison, suggesting that it is conservative (i.e. avoiding predictions that result in underpowered studies) in the presence of skew. The errors for each pair of algorithms are reported in Table 5.
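The two statistics used in this analysis, sample skewness and Spearman rank correlation, can be computed with small pure-Python helpers. These are generic sketches (the rank correlation below does not handle ties), not the analysis code used in the study.

```python
from statistics import mean

def sample_skewness(x):
    """Adjusted Fisher-Pearson sample skewness (requires len(x) > 2)."""
    n, m = len(x), mean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return (m3 / m2 ** 1.5) * (n * (n - 1)) ** 0.5 / (n - 2)

def spearman_rho(x, y):
    """Spearman rank correlation, computed as the Pearson correlation
    of ranks. No tie correction, so this is only valid for tie-free
    data; scipy.stats.spearmanr handles the general case."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

In this setting, x would be the per-pair skew of per-image accuracy differences and y the corresponding power prediction errors.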
Table 5

Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the high-quality reference standard. The required sample sizes predicted by the model are given in parentheses.

     B        C        D        E          F        G          H        I        J
A    3 (108)  1 (41)   12 (28)  2 (31)     14 (11)  1 (50)     2 (101)  13 (8)   7 (22)
B             10 (15)  1 (163)  1 (26418)  1 (35)   1 (1.8E6)  10 (28)  0 (14)   0 (157)
C                      12 (11)  4 (10)     14 (9)   11 (12)    0 (42)   17 (5)   9 (6)
D                               4 (102)    2 (50)   3 (115)    13 (14)  1 (15)   2 (3357)
E                                          7 (19)   1 (14084)  2 (11)   12 (8)   3 (95)
F                                                   5 (23)     12 (10)  1 (312)  5 (48)
G                                                              7 (16)   8 (10)   0 (97)
H                                                                       20 (5)   15 (8)
I                                                                                2 (17)

Simulations with low-quality real-world data

When the minimum detectable difference was defined relative to the high-quality reference standard and tested relative to the low-quality reference standard in the PROMISE12 data set, the model predicted the simulated power with a median error of 5% (simulated − predicted power; range −29% to 16%) and a median absolute error of 6% (|simulated − predicted power|). The two algorithm pairs with the smallest δ (0.1% and 0.2% accuracy differences) and the largest sample sizes (5714 and 3721) had the largest errors, overestimating power by 27% and 29%, respectively. The error was correlated with the skew of per-image accuracy differences (Spearman’s ; ), and excluding the two cases with the smallest δ, the correlation was stronger (Spearman’s ; ). The errors for each pair of algorithms are reported in Table 6.
Table 6

Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the low-quality reference standard. The required sample sizes predicted by the model are given in parentheses.

     B        C        D        E        F        G          H          I        J
A    6 (43)   2 (167)  12 (22)  5 (133)  2 (11)   5 (25)     27 (5714)  15 (7)   9 (21)
B             8 (14)   6 (403)  8 (71)   5 (67)   12 (3598)  12 (24)    5 (17)   29 (3721)
C                      11 (12)  11 (34)  8 (11)   7 (13)     2 (50)     10 (6)   13 (8)
D                               6 (31)   2 (87)   4 (165)    13 (16)    0 (17)   6 (508)
E                                        0 (15)   2 (41)     1 (76)     11 (6)   6 (34)
F                                                 0 (37)     5 (13)     4 (159)  4 (58)
G                                                            4 (17)     6 (12)   8 (466)
H                                                                       13 (6)   16 (11)
I                                                                                5 (16)

Case study

The direct application of the sample size formula is described in Section 3. The formula can also be used indirectly to guide other aspects of the design of segmentation comparison studies. In this case study, we illustrate one such application: evaluating the cost (in terms of sample size vs. cost per subject) of using a lower-quality reference standard manually segmented by a non-clinical graduate student instead of one generated by clinical collaborators. For illustration, this case study simulates the availability of a pilot data set by using two algorithms and the 30 test data sets from the PROMISE12 challenge.

To evaluate the cost of the two approaches, we can compare the sample sizes under the two reference standard strategies. The error rates and minimum detectable difference δ will be the same for both scenarios. We use commonly accepted Type I and II error rates: α = 0.05 and β = 0.2. The appropriate δ depends on the clinical or technical requirements; for example, in the context of prostate segmentation, the minimum detectable difference could represent the minimal improvement in prostate segmentation accuracy that would make an automated prostate MRI computer-aided detection (CAD) system (e.g. Litjens et al., 2014a) clinically suitable as a first reader. In this case study, we suppose that an analysis of an existing CAD system suggests that an improvement in accuracy of 5% (with respect to a high-quality reference standard) would be sufficient to make the system clinically suitable.

The variance parameters differ between the scenarios. To assess the scenario where the study uses a high-quality reference standard, we can estimate and using A, B and H. Using Eqs. (11)–(13) with h in place of l gives and . Since and δ are small relative to assuming will yield similar results to assuming a Dirichlet prior ( and ). The resulting sample size to detect a 5% difference was 9 subjects. 
To assess the scenario where the study uses a low-quality reference standard instead, we first estimate using A, B, L and H. The parameter estimation equations (Eqs. (9) and (10)) give and yielding . Using Eqs. (11)–(13) gives and . The resulting sample size to detect a 5% difference was 12 subjects. Based on this analysis, we estimate that a study using this lower-quality reference standard would require one-third more subjects (12 vs. 9) to detect a 5% improvement in accuracy than one using the high-quality reference standard. Since the cost per subject of generating the lower-quality reference standard is typically much lower, this could be a suitable approach for comparing these algorithms.
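The trade-off can be made concrete with a toy cost model. Only the sample sizes (9 and 12) come from the case study; the per-subject costs below are purely hypothetical placeholders.

```python
def study_cost(n_subjects, cost_per_reference, fixed_cost=0.0):
    """Total reference-standard cost under a simple linear cost model.
    The cost figures used below are hypothetical placeholders, not
    values from the case study; only the sample sizes (9 and 12) are."""
    return fixed_cost + n_subjects * cost_per_reference

# High-quality design: 9 subjects, each reference segmentation expensive
# (e.g. clinician time). Low-quality design: 12 subjects with a cheaper
# reference (e.g. a non-clinical graduate student).
high_quality_cost = study_cost(9, cost_per_reference=500.0)
low_quality_cost = study_cost(12, cost_per_reference=100.0)
```

Under these assumed per-subject costs the low-quality design is far cheaper despite needing one-third more subjects; more generally, the two designs break even when the high- to low-quality per-subject cost ratio equals the inverse sample size ratio (here 12/9).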

Discussion

In this work, we derived a sample size formula for studies comparing the segmentation accuracy of two algorithms, together with a relationship describing the effect of using lower-quality reference standards on the minimum detectable difference in segmentation accuracy. The accuracy of the formulae was evaluated using Monte Carlo simulations, yielding errors in predicted power of less than 4% across a range of model parameters. The applicability of the formulae to real-world data was evaluated using bootstrap sampling from the PROMISE12 prostate MRI segmentation data set, yielding median errors in predicted power of less than 6%, although the error proved sensitive to skewed distributions and small sample sizes. A case study was also analyzed to illustrate the use of the formulae in a realistic context.

Validation in segmentation comparison studies

Improvements in the methodology for the validation and comparison of segmentation algorithms span a wide variety of approaches. One avenue to improve segmentation validation is to develop improved metrics. Simple segmentation metrics such as accuracy, Dice overlap, Cohen’s Kappa, mean absolute boundary distances and Hausdorff distances compare segmentations to a single reference standard and are commonly used (Taha and Hanbury, 2015). Newer metrics allow comparisons to multiple reference standards (e.g. the validation index (Juneja et al., 2013)) or comparisons that consider application-specific utility (e.g. accuracy of quantitative measurements in segmented ROIs (Jha et al., 2012)). This latter concept can be taken further by validating segmentation through its impact on a larger system, such as the accuracy of a computer-assisted detection pipeline (Guo and Li, 2014). Model observers have also been developed to assess aspects of segmentation quality without a reference standard (Frounchi et al., 2011; Kohlberger et al., 2012), effectively creating a learned reference-standard-independent segmentation metric. Another avenue to improve segmentation validation is to improve the reference standard quality. Label fusion algorithms, such as STAPLE (Warfield et al., 2004) and SIMPLE (Langerak et al., 2010), enable the generation of higher-quality reference standards that combine information from multiple experts. Improvements in multimodal registration (Shah et al., 2009; Gibson et al., 2012) enable reference standards based on information that is less dependent on the image being segmented. A third avenue is to increase the size of reference standards by reducing the cost per image, or via data augmentation. 
Active learning (Konyushkova et al., 2015; Top et al., 2011) and other interactive annotation tools reduce the cost of generating expert segmentations by partially automating the process. Crowdsourcing non-expert segmentations (Maier-Hein et al., 2014; Irshad et al., 2015) can cheaply generate many reference standards on many images, using the large numbers to offset the potential loss in quality. For some anatomy, artificial data with reference segmentations can be generated by simulating the imaging process (Cocosco et al., 1997) or perturbing the geometry and image signal of existing images (Hamarneh et al., 2008). This work, in contrast, aims to improve validation by enabling researchers to design efficient and appropriately powered studies. Specifically, it focuses on a particular analysis used in segmentation comparison studies: comparing the proportion of voxels where each of two segmentation algorithms agrees with a single reference standard. The presented formulae can be directly applied by researchers developing new segmentation algorithms to facilitate the design of their studies. More broadly, this work has particular importance for work focused on improving reference standard quality and reference standard size, by providing a framework for understanding the trade-offs between quality and quantity in segmentation reference standards.

Accuracy and applicability of the sample size formulae

In typical study designs, the statistical power, i.e. the probability of detecting an accuracy difference of a specified size, is fixed heuristically at 80%, specifying that a 20% risk of missing a true effect is acceptable. Other study design parameters are then optimized under this constraint, balancing costs and effect sizes. A study design with statistical power substantially above this level uses resources inefficiently, while one with lower power carries an unacceptable risk of false negatives.

The largest errors observed in our model were for large accuracy differences. The variance predicted by the model matched the simulations to within 2%, suggesting that the model errors are not primarily due to an incorrect variance prediction. Rather, the distribution of the accuracy differences in these simulation sets suggests that the error can be attributed to a combination of two factors: low sample size and skewness. The accuracy difference distribution under our statistical model, when using a Dirichlet prior, generally has non-zero skew when there are accuracy differences (i.e. |δ| > 0) and inter-image variability (ω < ∞), and the simulations showed a skew as high as 0.3 in these simulation sets. The t-test, however, assumes samples are drawn from a normal distribution with zero skew. While the t-test is robust to such deviations from normality at large sample sizes, large accuracy differences are more easily detectable and thus require small sample sizes. This suggests that segmentation comparison studies should be careful in their application of the t-test when sample sizes are small; in such cases, a McNemar test adjusted for clustered sampling (Gönen, 2004; Durkalski et al., 2003) may be more appropriate.

When applied to real-world data, the errors were generally larger than observed under the statistical model. The errors were strongly correlated with the skew of the distribution of per-image accuracy differences, which is consistent with our observations on simulated data. This effect was particularly evident when the predicted sample size was low: five of the six largest observed errors (where the model underestimated power by 13–20%) corresponded to simulated studies with n < 10, which is also consistent with our observations on simulated data. In general, the model underestimated the simulated power, which could lead to inefficient resource usage but would not lead to failed studies caused by insufficient power. When using a low-quality reference standard with δ defined with respect to a high-quality reference standard, the error was also correlated with skew. However, in this context, another source of error must be considered: error in the estimation of δ. When the estimated minimum detectable difference was very small, small absolute estimation errors led to large relative estimation errors, resulting in large errors in the predicted power. When using a low-quality reference standard, the model over-estimated the simulated power for 10 of the 45 algorithm pairs, suggesting that additional subjects may be needed when using this model to avoid underpowered studies.

The proposed approach for using low-quality reference standards presumes that a high-quality data set can be obtained, if only for a small pilot data set, and that clinical or technical requirements on accuracy differences specified with respect to that reference standard are useful. In some medical segmentation tasks (such as prostate cancer delineation on MRI (Gibson et al., 2016) or mitosis detection on histology images (Chowdhury et al., 2006)), even expert segmentations are highly variable. 
For some tasks, it may be appropriate to combine segmentations from multiple experts by consensus or using a label fusion algorithm such as STAPLE to generate a high-quality reference standard on a pilot study; however, care should be taken to consider whether requirements specified with respect to the resulting reference standard will be practically useful.

Model interpretation

Although the sample size relationship is a continuous function of multiple parameters, it can be useful to break the parameters into coarse categories to identify emerging trends (see Table 7). In particular, we focus on the special case of modeling the prior as a Dirichlet random variable and examine the parameters that comprise the idealized efficiency ψ/δ² and the design factor f.
Table 7

Number of images required to detect a desired segmentation accuracy difference. When compensating for the use of a lower-quality reference standard, use Eq. (8) to estimate the minimum detectable difference (δ) first.

                                       Design factor (f)
                                       0.01   0.05   0.1
Small differences (δMDD = 2%)
  ψ = 2%    (ψ/δMDD² = 50)             6*     21     41
  ψ = 11%   (ψ/δMDD² = 275)            24     110    218
  ψ = 20%   (ψ/δMDD² = 500)            41     198    394
Medium differences (δMDD = 5%)
  ψ = 5%    (ψ/δMDD² = 20)             3*     10     17
  ψ = 12.5% (ψ/δMDD² = 50)             6*     21     41
  ψ = 20%   (ψ/δMDD² = 80)             8*     33     65
Large differences (δMDD = 10%)
  ψ = 10%   (ψ/δMDD² = 10)             3*     6*     10
  ψ = 15%   (ψ/δMDD² = 15)             3*     8*     14
  ψ = 20%   (ψ/δMDD² = 20)             3*     10     17

* Small sample sizes calculated from Eq. (7) are reported here; however, studies with such small sample sizes may be highly sensitive to violations of the assumptions of the t-test, and are not recommended.

δ can be coarsely categorized into small (δ ≤ 2%), medium (2% < δ < 10%), and large (δ ≥ 10%) differences. Detecting small differences can require large (often infeasible) sample sizes, whereas detecting large differences may be limited not by δ but by the assumptions of the statistical analysis. Within these effect size categories, the likelihood of disagreement between algorithms (ψ) plays an important role. ψ is bounded below by |δ| and above by the sum of the two algorithms' error rates. When ψ ≈ δ, most of the disagreement between the algorithms corresponds to the more accurate algorithm correcting the errors of the less accurate one, while making few new errors. When ψ ≫ δ, the more accurate algorithm is making new errors on voxels where the less accurate algorithm was correct. Table 7 shows three levels of disagreement: minimal disagreement (ψ = δ_MDD), large disagreement (ψ = 20%) and a midpoint between them. When δ is small, the level of disagreement can introduce an order of magnitude difference in required sample sizes. The idealized efficiency is modulated by the design factor. The design factor ranges from 1/v (denoting that each voxel gives an independent estimate of accuracy differences) to 1 (denoting that each image gives an independent estimate of accuracy differences, but voxel segmentations are perfectly correlated). For realistic medical image segmentation algorithms, however, either of these extremes is unlikely. Table 7 shows three levels of the design factor: low correlation (f = 0.01), medium correlation (f = 0.05) and high correlation (f = 0.1).
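The trends in Table 7 can be reproduced, to a first approximation, with a standard power calculation for a paired two-sided test. The sketch below is an assumption-laden simplification of the paper's Eq. (7): it approximates the variance of per-image accuracy differences as f·(ψ − δ²) and uses normal quantiles, whereas the published formula uses t-distribution quantiles, so the tabulated values can differ by a few subjects.

```python
from math import ceil
from statistics import NormalDist

def approx_sample_size(psi, delta, f, alpha=0.05, power=0.80):
    """Approximate number of images for a paired two-sided test of a mean
    accuracy difference delta, given voxel-wise disagreement probability psi
    and design factor f.

    Normal approximation: n ≈ (z_{1-α/2} + z_{1-β})² · f · (ψ − δ²) / δ².
    The published Eq. (7) iterates with t quantiles, so Table 7 entries can
    be slightly larger than the values returned here.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)        # ≈ 1.96 for α = 0.05
    z_beta = z(power)                 # ≈ 0.84 for 80% power
    var_d = f * (psi - delta ** 2)    # approx. variance of per-image differences
    return ceil((z_alpha + z_beta) ** 2 * var_d / delta ** 2)

# ψ = 11%, δ_MDD = 2%, f = 0.05: Table 7 lists 110; the normal approximation gives
print(approx_sample_size(0.11, 0.02, 0.05))  # → 108
```

The small gap between 108 and the tabulated 110 illustrates the effect of replacing t quantiles with normal quantiles at moderate sample sizes.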
Our derivations show that sample sizes for studies comparing the accuracy of segmentation algorithms principally depend on the idealized efficiency ψ/δ², which relates the probability of voxel-wise disagreement (ψ) between algorithms to the minimum detectable difference δ, and the design factor f, which reflects increased variability due to intervoxel correlation and inter-image variability. The sample size is approximately proportional to the idealized efficiency ψ/δ². The upper bound on ψ (the sum of the two algorithms' error rates) suggests that it is easier, in general, to detect a given accuracy difference when at least one of the algorithms is highly accurate (lowering the upper bound on ψ). Furthermore, it is easier to detect a given accuracy improvement when algorithm A principally corrects errors made by algorithm B (where ψ ≈ δ, minimizing the idealized efficiency) than when algorithm A has errors that are independent of B. Although intuition would suggest that using lower-quality reference standards should consistently increase the required sample size, our derivations and simulations suggest a more complex relationship. The impact of errors in the reference standard is reduced by using a paired analysis, which excludes variance due to factors that affect both algorithms in the same way, such as reference standard errors in voxels where the algorithms agree. Reference standard errors in regions of disagreement, however, do affect the variance of per-image accuracy differences (Eq. (6)). In the rightmost term of this equation, ψ (which does not depend on the reference standard) is generally much larger than δ² (see Table 7), suggesting that the impact of reference standard errors on variance is predominantly via changing the design factor. Reference standard errors also affect the sample size (Eq. (8)) by altering the detectable accuracy difference when the reference standard has errors that are biased in favour of one algorithm, or when it has systematic over- or under-contouring and one algorithm contours more foreground than the other. Relatively speaking, systematic over- or under-contouring will have only a small impact on the detectable accuracy difference, unless the algorithms' foreground proportions are very different: for example, if A contours 5% more foreground than B, then 10% over-contouring by the low-quality reference standard L (25× that observed in the PROMISE12 data) will change the measured accuracy difference by only 0.5%, unless the contouring errors are biased towards one algorithm. Furthermore, errors in the reference standard that are biased towards one algorithm do not necessarily decrease power: reference standard errors biased towards the more accurate algorithm will exaggerate the true difference, increasing power at the expense of increased type I error. These observations were reflected in our analysis of the PROMISE12 challenge data (see Tables 5 and 6). Comparing the low-quality to the high-quality reference standard, the root-mean-squared relative errors in the estimated model parameters were 4% and 0.3%. Because the low-quality reference standard had substantial agreement with the high-quality one (96% ± 1%, mean ± SD accuracy), the effect of sample biases in reference standard errors was observable: for 17/45 pairs of algorithms, studies designed to use the low-quality reference standard actually needed fewer subjects than studies using the high-quality reference standard; in all of these cases, there were slight sample biases in the low-quality reference standard towards the more accurate algorithm (primarily, as expected, in the covariance term in Eq. (8)). This increased |δ| measured against the low-quality reference standard relative to |δ| measured against the high-quality one (i.e. the underlying differences between the algorithms were exaggerated and thus easier to detect).
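The over-contouring arithmetic above can be made explicit. The linear bias model sketched here (the shift in the measured accuracy difference is the product of the algorithms' foreground excess and the reference's over-contouring rate) is our simplification of the worked example, not the full Eq. (8):

```python
def accuracy_difference_shift(foreground_excess_a_vs_b, over_contouring_rate):
    """Shift in the measured accuracy difference caused by systematic
    over-contouring of the reference standard, under a simple linear bias
    model: the shift scales with how much more foreground algorithm A
    contours than algorithm B."""
    return foreground_excess_a_vs_b * over_contouring_rate

# A contours 5% more foreground than B; the reference over-contours by 10%:
shift = accuracy_difference_shift(0.05, 0.10)  # ≈ 0.005, i.e. 0.5%
```

Because both factors must be large for the product to matter, systematic contouring bias is usually a second-order concern compared with errors biased towards one algorithm.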
Because the experimental design for evaluating the model on real data required estimating δ, which was very small for some comparisons (|δ| < 2% in 20/45 algorithm pairs and |δ| < 0.5% in 4 algorithm pairs), this effect was magnified. Overall, our analysis of the PROMISE12 data aligns well with our theoretical model. Based on our analysis, using reference standards that are lower quality but unbiased may be a suitable approach for comparing segmentation algorithm accuracy.
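Bootstrap power estimation of the kind used to validate the model can be sketched with standard-library tools. This is a generic sketch, not the paper's exact experimental code: the normal critical value (in place of the exact t quantile) and the toy per-image accuracy differences are simplifying assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def bootstrap_power(per_image_diffs, n, trials=2000, alpha=0.05, seed=0):
    """Estimate the power of a paired two-sided test at sample size n by
    resampling observed per-image accuracy differences with replacement.
    A normal critical value is used instead of the exact t quantile, which
    is slightly anticonservative for small n."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96
    rejections = 0
    for _ in range(trials):
        sample = rng.choices(per_image_diffs, k=n)
        s = stdev(sample)
        if s == 0:  # degenerate resample: every difference identical
            rejections += sample[0] != 0
            continue
        t_stat = mean(sample) / (s / sqrt(n))
        rejections += abs(t_stat) > crit
    return rejections / trials

# Toy per-image accuracy differences (algorithm A minus B), e.g. from a pilot set:
diffs = [0.03, 0.01, 0.04, 0.02, 0.05, 0.00, 0.03, 0.02, 0.04, 0.01]
power = bootstrap_power(diffs, n=15)
```

Comparing such a bootstrap estimate against the analytic prediction is exactly the kind of check performed on the PROMISE12 algorithm pairs above.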

Limitations

The contributions of this work should be considered in the context of its limitations. First, the sample size calculation presented in this work is specific to the statistical analysis (the paired Student's t-test) and to the accuracy metric (the proportion of voxels matching the reference standard). Further work is needed to develop these formulae for other analyses and metrics. Second, our correlation model is over-parameterized, representing inter-image variability and intra-image inter-voxel correlation separately, when their effects on the covariance of per-image accuracies are coupled. This complicates the estimation of parameters, but yields formulae expressed in concepts familiar to the image analysis community. Third, due to constraints on sampling from specified high-dimensional correlated discrete distributions, we were unable to generate Monte Carlo simulations testing the extremes of some parameter ranges (e.g. high numbers of voxels and high intervoxel correlation). Because the metric analysed in the study is a mean over voxels (which becomes more precise with higher v), and because we did not observe an increase in error as v increased from 9 to 100, we do not anticipate notable differences in model performance with larger v. Fourth, our application of the formulae to real segmentation studies was limited by the public availability of data sets with high- and low-quality reference standards; the PROMISE12 data set used in our case study is a rare example of such data. Finally, the sensitivity of the formula to violations of its underlying assumptions was not estimated; future work in this area could clarify which of these assumptions are critical to the accuracy of the formula and which could be relaxed.

Conclusions

In this work, we derived formulae to address two interrelated questions in the design of studies comparing segmentation algorithms: How many validation images are needed to evaluate a segmentation algorithm? and How accurate does the reference standard need to be? The sample size formula predicted the power of simulated segmentation studies to within 4% across a range of model parameters, and when applied to the PROMISE12 prostate segmentation challenge data, predicted the power to within a median error of 6%. In addition to their direct application in calculating sample sizes, the formulae offer several insights for study design. First, it is generally easier to detect a given accuracy difference when at least one algorithm is highly accurate, as this reduces accuracy variability. Second, it is generally easier to detect a given accuracy difference when one algorithm principally corrects the errors of another, compared to when two algorithms make independent errors. Third, systematic over- or under-contouring by a low-quality reference standard does not impact accuracy measurements substantially unless one algorithm tends to contour more voxels as foreground than the other, but correlation between reference standard errors and algorithm differences can bias accuracy measurements. These formulae, and the parameter estimation equations and guidelines that facilitate their use, hold the potential to enable researchers to make statistically motivated decisions about their study design and their choice of reference standard, and to make the most efficient use of limited research resources.
References

1. Durkalski VL, Palesch YY, Lipsitz SR, Rust PF. Analysis of clustered matched-pair data. Stat Med. 2003.
2. Hertzog MA. Considerations in determining sample size for pilot studies. Res Nurs Health. 2008.
3. Maier-Hein L, Mersmann S, Kondermann D, et al. Can masses of non-experts train highly accurate image classifiers? A crowdsourcing approach to instrument segmentation in laparoscopic images. Med Image Comput Comput Assist Interv. 2014.
4. Kohlberger T, Singh V, Alvino C, Bahlmann C, Grady L. Evaluating segmentation error without ground truth. Med Image Comput Comput Assist Interv. 2012.
5. Gibson E, Crukley C, Gaed M, et al. Registration of prostate histology images to ex vivo MR images via strand-shaped fiducials. J Magn Reson Imaging. 2012.
6. Litjens G, Toth R, van de Ven W, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med Image Anal. 2014.
7. Litjens G, Debats O, Barentsz J, Karssemeijer N, Huisman H. Computer-aided detection of prostate cancer in MRI. IEEE Trans Med Imaging. 2014.
8. Gibson E, Bauman GS, Romagnoli C, et al. Toward prostate cancer contouring guidelines on magnetic resonance imaging: dominant lesion gross and clinical target volume coverage via accurate histology fusion. Int J Radiat Oncol Biol Phys. 2016.
9. Juneja P, Evans PM, Harris EJ. The validation index: a new metric for validation of segmentation algorithms using two or more expert outlines with application to radiotherapy planning. IEEE Trans Med Imaging. 2013.
10. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015.