Eli Gibson, Yipeng Hu, Henkjan J. Huisman, Dean C. Barratt.
Abstract
Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources. In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e. the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula relating reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards. The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of fewer than 4 subjects and errors in the detectable accuracy difference of less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study.
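For intuition, the sample-size calculation described in the abstract reduces, in its simplest form, to the standard normal-approximation formula for a paired comparison of mean accuracy differences. The sketch below is that generic baseline, not the paper's Eq. (7) (which additionally models voxel-level correlation and reference-standard quality); the function name and default parameters are illustrative.

```python
import math
from statistics import NormalDist

def paired_sample_size(delta, sigma, alpha=0.05, power=0.8):
    """Minimum number of subjects for a two-sided paired test to detect a
    mean accuracy difference `delta`, when per-subject accuracy differences
    have standard deviation `sigma` (normal approximation):
        n = ((z_{1-alpha/2} + z_{power}) * sigma / delta)**2
    """
    z = NormalDist()
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sigma / delta) ** 2
    return math.ceil(n)

# A 3% accuracy difference requires far more subjects than a 10% difference
# at the same variability (sigma = 5% here, chosen purely for illustration).
n_small = paired_sample_size(0.03, 0.05)
n_large = paired_sample_size(0.10, 0.05)
```

The quadratic dependence on sigma/delta is why the paper's trade-off between reference standard quality (which inflates the effective variability) and sample size matters in study design.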
Keywords: Image segmentation; Reference standard; Segmentation accuracy; Statistical power
Year: 2017 PMID: 28772163 PMCID: PMC5666910 DOI: 10.1016/j.media.2017.07.004
Source DB: PubMed Journal: Med Image Anal ISSN: 1361-8415 Impact factor: 8.545
Fig. 1 Left: Illustrative prostate MRI segmentations from the PROMISE12 prostate segmentation challenge (Litjens et al., 2014b) by two algorithms – A (blue) and B (yellow) – and the two manually contoured reference standards – L (red), which is of lower quality, and H (green), which is of higher quality. Compared to H, L oversegmented anteriorly where image information was ambiguous, affecting accuracy measurements of A and B made using L. Right: Harder apical segmentations showing regions containing voxels with different combinations of the segmentation labels ABLH (an overbar denotes a negative classification). The statistical model underlying the derived sample size formula for segmentation evaluation studies is built on the probability distributions of these voxel-wise segmentation labels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
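The per-voxel quantities in Fig. 1 – accuracy against each reference standard and the tally of ABLH label combinations – can be computed directly from binary masks. A minimal numpy sketch with synthetic segmentations (the mask shapes and flip rates are illustrative assumptions, not PROMISE12 data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic binary segmentations: H plays the high-quality reference; L, A
# and B are derived by flipping a random fraction of H's voxel labels.
H = rng.random((64, 64)) < 0.5
L = H ^ (rng.random(H.shape) < 0.05)   # lower-quality reference
A = H ^ (rng.random(H.shape) < 0.10)   # algorithm A
B = H ^ (rng.random(H.shape) < 0.15)   # algorithm B

def accuracy(seg, ref):
    """Proportion of voxels matching the reference standard."""
    return float(np.mean(seg == ref))

# The same algorithm scores differently depending on the reference used,
# which is the effect the paper's reference-quality analysis quantifies.
acc_A_H, acc_A_L = accuracy(A, H), accuracy(A, L)

# Tally the 16 possible voxel-wise (A, B, L, H) label combinations.
codes = (A.astype(int) << 3) | (B.astype(int) << 2) \
      | (L.astype(int) << 1) | H.astype(int)
counts = np.bincount(codes.ravel(), minlength=16)
```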
Notation for mathematical symbols.
| Type | Notation |
|---|---|
| Segmentation algorithms | X (upper case non-italic) |
| Random variables and vectors | |
| Realizations of random variables and constants | |
| Vectors | |
| Estimates | |
| Parameterized distributions | |
| Expectation of | |
| Conditional expectation of | |
| Conditional variance of | |
| Conditional covariance of | |
| Event | |
| Event |
Glossary of mathematical symbols.
| Symbol | Support | Description |
|---|---|---|
| Experimental parameters | | |
| | | Sample size |
| v | | Number of voxels per image |
| | | Significance threshold (acceptable Type I error) |
| | | Minimum difference to detect with specified power |
| Population parameters | | |
| | [0, 1]³ | Population average marginal probability for the per-voxel accuracy difference |
| δ | | Population accuracy difference |
| ψ | [0, 1] | Probability that A and B disagree on a voxel label |
| | | Population accuracy difference measured against high-quality reference standard H |
| | [0, 1] | Probabilities of voxel labels being 1 for a randomly selected voxel |
| | | Correlation between |
| | [0, 1] | Average |
| | | Variance of the accuracy difference in the marginal probability prior |
| ω | | Precision parameter of Dirichlet distribution controlling inter-image variability |
| Random variables | | |
| | {0, 1} | Segmentation label for the |
| | [0, 1]³ | Per-image prior on average marginal probability |
| | [0, 1]³ | Per-voxel prior on marginal probability |
| | | Vector of per-voxel accuracies for the |
| | | Difference in accuracy for the |
| | | Difference in accuracy for a random voxel |
| | | Per-image accuracy difference |
| Simulation variables | | |
| | | Distance between voxels |
| σ | | Scaling parameter to control spatial correlation in Monte Carlo simulations |
| | | Per-image accuracy difference of a simulated image |
| | | Per-voxel accuracy difference of a simulated voxel |
| Other notation | | |
| | [0, 1] | Elements of |
| | [0, 1] | Elements of |
| | [0, 1] | Elements of |
| A, B, L, H | | Segmentation sources denoting two algorithms, a low-quality and a high-quality reference |
| | | Design factor |
| | | 1- and 2-tailed |
| | | Per-image accuracy difference variance under the null hypothesis |
| | | Per-image accuracy difference variance under the alternative hypothesis |
[x, y] denotes real numbers between x and y; {x, y, z} denotes a set of possible values; a superscript x denotes a vector with x elements; ℕ denotes natural numbers; ℝ denotes real numbers; ℝ⁺ denotes positive real numbers.
Model summary. These expressions summarize the nested model used in our derivations. The motivation and detailed description are given in Section 2.2.2.
Fig. 2 The illustrated nested model shows, from left to right: (1) the prior distribution of per-image average marginal probabilities (shown on the triangular (standard 2-simplex) domain; darkness represents the probability density), (2) three different samples (i.e. three images) of per-image average marginal probabilities (shown as labelled arrows), (3) three corresponding conditional prior distributions of per-voxel marginal probabilities for the three images (shown as in (1)), (4) nine different samples (i.e. nine voxels from the second image) of per-voxel marginal probabilities (shown as unlabelled arrows), and (5) the categorical distributions for the nine voxels from the second image (shown as pie charts of the relative probabilities of the per-voxel accuracy differences [orange, blue, and red]). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
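The nested sampling in Fig. 2 can be sketched with stdlib sampling alone. The parameterization below (Dirichlet concentration = precision × mean probabilities, reused at both the per-image and per-voxel level) and all numeric values are assumptions for illustration, not the paper's exact specification; the mean probabilities are chosen so that δ = 0.09 − 0.06 = 3% and ψ = 0.06 + 0.09 = 15%, matching the simulation baseline.

```python
import random

def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution via normalized gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_image_accuracy_diff(mean_probs, precision, n_voxels, rng):
    """One pass through a nested model in the spirit of Fig. 2:
    population mean -> per-image probabilities -> per-voxel probabilities
    -> categorical per-voxel accuracy differences in {-1, 0, +1};
    returns the per-image mean accuracy difference."""
    per_image = dirichlet([precision * p for p in mean_probs], rng)
    diffs = []
    for _ in range(n_voxels):
        per_voxel = dirichlet([precision * p for p in per_image], rng)
        diffs.append(rng.choices([-1, 0, 1], weights=per_voxel)[0])
    return sum(diffs) / n_voxels

# P(d=-1)=0.06, P(d=0)=0.85, P(d=+1)=0.09: delta = 3%, psi = 15%.
rng = random.Random(0)
d_bar = sample_image_accuracy_diff([0.06, 0.85, 0.09], 128, 100, rng)
```

Averaging `d_bar` over many simulated images recovers the population accuracy difference of about 3%, with per-image spread governed by the Dirichlet precision.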
Fig. 3Illustration of the relationship between the proportion of disagreement (ψ) and the accuracy difference (δ). In these four examples, segmentation algorithms A (blue) and B (yellow) both over-contour the circular object taken as the reference standard segmentation L (red), adding different perturbations that lower accuracy. When sets of segmentations have higher ψ and lower δ (as in the lower right), it is harder to detect accuracy differences. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
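The two quantities in Fig. 3 are cheap to compute from a pair of segmentations and a reference standard. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def delta_psi(A, B, ref):
    """Accuracy difference delta = acc(A) - acc(B), and the proportion of
    voxels on which A and B disagree, psi = P(A != B), as in Fig. 3."""
    delta = float(np.mean(A == ref) - np.mean(B == ref))
    psi = float(np.mean(A != B))
    return delta, psi

# Toy example: A and B each mislabel one of four voxels (equal accuracy),
# but they disagree on two voxels, so delta = 0 while psi = 0.5 --
# the hard-to-detect regime in the lower right of Fig. 3.
A = np.array([1, 1, 1, 0])
B = np.array([1, 0, 0, 0])
ref = np.array([1, 1, 0, 0])
```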
Simulation parameters used to estimate the accuracy of the model. Note that the simulations varying v, ω, σ and ψ were conducted twice at two baseline δ values.
| | # voxels (v) | Population accuracy difference (δ) | Dirichlet precision (ω) | Spatial correlation width (σ) | Population probability of disagreement (ψ) |
|---|---|---|---|---|---|
| Baseline | 36 | 3% / 6% | 128 | 0.7 | 15% |
| Minimum | 9 | 2% | 64 | 0 | 15% |
| Maximum | 100 | 10% | 1024 | 0.7 | 45% |
| Increment | | +1% | × 2 | +0.1 | +5% |
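The accuracy checks behind these parameter sweeps can be mimicked at a much simpler level: simulate per-image accuracy differences, run a paired test, and count rejections. The sketch below assumes a plain normal model for per-image differences (not the paper's nested Dirichlet/categorical model) and uses a z critical value in place of the t distribution's; all parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def simulate_power(n, delta, sigma, alpha=0.05, trials=2000, seed=0):
    """Monte Carlo estimate of the power of a two-sided paired test to
    detect mean accuracy difference `delta` with `n` subjects, under a
    simplified normal model of per-image accuracy differences."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        d = [rng.gauss(delta, sigma) for _ in range(n)]
        t = mean(d) / (stdev(d) / math.sqrt(n))
        hits += abs(t) > z_crit
    return hits / trials

# With delta = 3%, sigma = 5% and n = 22 subjects, the simulated power
# lands near the 80% targeted by the normal-approximation formula.
p = simulate_power(22, 0.03, 0.05)
```

Comparing such simulated power against a closed-form prediction, across sweeps of the model parameters, is the same kind of validation the paper reports in Figs. 4-6, just under a much cruder model.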
Fig. 4 Model accuracy (95% confidence interval, shown in red and cyan for the two baseline δ values of 3% and 6%, on the absolute difference between the simulated and model power) for each simulation set. For example, in one simulation set the model predicted 82% power, 4% below the 86% power observed in the simulation. Each accuracy graph shows a blue line representing the expected error due to the observed skew alone (for the simulation varying δ at the baseline values of the other parameters), based on applying the regular t-test sample size formula to a skewed Pearson distribution. The similar shape of this curve to the observed errors suggests that the skew is a considerable contributor to the error. The histogram (lower right) shows the distribution of accuracy differences for one simulation, illustrating the slight but significant skew in the distribution, which contributes to the observed error. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 5 The equivalent error in predicted sample size (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red and cyan for the two baseline δ values of 3% and 6%) on the absolute difference between the sample size needed to achieve the simulated power and the sample size needed to achieve the modeled power. For example, in one simulation the model would overestimate by one subject the sample size needed to achieve the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 6 The equivalent error in predicted minimum detectable difference (calculated from the observed error in power). Each plot shows the 95% confidence interval (shown in red and cyan for the two baseline δ values of 3% and 6%) on the absolute difference between the minimum difference detectable with the simulated power and the minimum difference detectable with the modeled power. For example, in one simulation the model predicted that a minimum detectable difference of 10.5% would yield the 84% power observed in the simulation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the high-quality reference standard. The required sample sizes predicted by the model are given in parentheses.
| | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|
| A | 3 (108) | 1 (41) | 12 (28) | 2 (31) | 14 (11) | 1 (50) | 2 (101) | 13 (8) | 7 (22) |
| B | | 10 (15) | 1 (163) | 1 (26418) | 1 (35) | 1 (1.8E6) | 10 (28) | 0 (14) | 0 (157) |
| C | | | 12 (11) | 4 (10) | 14 (9) | 11 (12) | 0 (42) | 17 (5) | 9 (6) |
| D | | | | 4 (102) | 2 (50) | 3 (115) | 13 (14) | 1 (15) | 2 (3357) |
| E | | | | | 7 (19) | 1 (14084) | 2 (11) | 12 (8) | 3 (95) |
| F | | | | | | 5 (23) | 12 (10) | 1 (312) | 5 (48) |
| G | | | | | | | 7 (16) | 8 (10) | 0 (97) |
| H | | | | | | | | 20 (5) | 15 (8) |
| I | | | | | | | | | 2 (17) |
Differences between the proportion of positive findings and the predicted power for simulated studies from the PROMISE12 data set using the low-quality reference standard. The required sample sizes predicted by the model are given in parentheses.
| | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|
| A | 6 (43) | 12 (22) | 2 (11) | 15 (7) | 9 (21) | | | | |
| B | | 8 (14) | 8 (71) | 12 (3598) | 12 (24) | | | | |
| C | | | 11 (12) | 11 (34) | 8 (11) | 7 (13) | 2 (50) | 10 (6) | 13 (8) |
| D | | | | 6 (31) | 2 (87) | 4 (165) | 13 (16) | 0 (17) | 6 (508) |
| E | | | | | 0 (15) | 2 (41) | 11 (6) | 6 (34) | |
| F | | | | | | 0 (37) | 5 (13) | 4 (159) | 4 (58) |
| G | | | | | | | 4 (17) | 6 (12) | |
| H | | | | | | | | 13 (6) | 16 (11) |
| I | | | | | | | | | 5 (16) |
Number of images required to detect a desired segmentation accuracy difference. When compensating for the use of a lower-quality reference standard, use Eq. (8) to estimate the minimum detectable difference (δ) first.
| | Design factor | | |
|---|---|---|---|
| | 0.01 | 0.05 | 0.1 |
| Small differences | | | |
| | 6* | 21 | 41 |
| | 24 | 110 | 218 |
| | 41 | 198 | 394 |
| Medium differences | | | |
| | 3* | 10 | 17 |
| | 6* | 21 | 41 |
| | 8* | 33 | 65 |
| Large differences | | | |
| | 3* | 6* | 10 |
| | 3* | 8* | 14 |
| | 3* | 10 | 17 |
* Small sample sizes calculated from Eq. (7) are reported here; however, studies with such small sample sizes may be highly sensitive to violations of the assumptions of the t-test, and are not recommended.