Lena Maier-Hein, Matthias Eisenmann, Annika Reinke, Sinan Onogur, Marko Stankovic, Patrick Scholz, Tal Arbel, Hrvoje Bogunovic, Andrew P Bradley, Aaron Carass, Carolin Feldmann, Alejandro F Frangi, Peter M Full, Bram van Ginneken, Allan Hanbury, Katrin Honauer, Michal Kozubek, Bennett A Landman, Keno März, Oskar Maier, Klaus Maier-Hein, Bjoern H Menze, Henning Müller, Peter F Neher, Wiro Niessen, Nasir Rajpoot, Gregory C Sharp, Korsuk Sirinukunwattana, Stefanie Speidel, Christian Stock, Danail Stoyanov, Abdel Aziz Taha, Fons van der Sommen, Ching-Wei Wang, Marc-André Weber, Guoyan Zheng, Pierre Jannin, Annette Kopp-Schneider.
Abstract
International challenges have become the standard for validation of biomedical image analysis methods. Given their scientific impact, it is surprising that a critical analysis of common practices related to the organization of challenges has not yet been performed. In this paper, we present a comprehensive analysis of biomedical image analysis challenges conducted up to now. We demonstrate the importance of challenges and show that the lack of quality control has critical consequences. First, reproducibility and interpretation of the results are often hampered, as only a fraction of the relevant information is typically provided. Second, the rank of an algorithm is generally not robust to a number of variables, such as the test data used for validation, the ranking scheme applied, and the observers that make the reference annotations. To overcome these problems, we recommend best practice guidelines and define open research questions to be addressed in the future.
Year: 2018 PMID: 30523263 PMCID: PMC6284017 DOI: 10.1038/s41467-018-07619-7
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of biomedical image analysis challenges. a Number of competitions (challenges and tasks) organized per year, b fields of application, c algorithm categories assessed in the challenges, d imaging techniques applied, e number of training and test cases used, f most commonly applied metrics for performance assessment (those used in at least 5 tasks), and g platforms (e.g. conferences) used to organize the challenges for the years 2008, 2012, and 2016.
List of parameters that characterize a challenge
| Parameter name | Coverage [%] | Parameter name | Coverage [%] |
|---|---|---|---|
| Challenge name^a | 100 | Operator(s) | 7 |
| Challenge website^a | 99 | Distribution of training and test cases^a | 18 |
| Organizing institutions and contact person^a | 97 | Category of training data generation method^a | 89 |
| Life cycle type^a | 100 | Number of training cases^a | 89 |
| Challenge venue or platform | 99 | Characteristics of training cases^a | 79 |
| Challenge schedule^a | 81 | Annotation policy for training cases^a | 34 |
| Ethical approval^a | 32 | Annotator(s) of training cases^a | 81 |
| Data usage agreement | 60 | Annotation aggregation method(s) for training cases^a | 30 |
| Interaction level policy^a | 62 | Category of test data generation method^a | 87 |
| Organizer participation policy^a | 6 | Number of test cases^a | 77 |
| Training data policy^a | 16 | Characteristics of test cases^a | 77 |
| Pre-evaluation method | 5 | Annotation policy for test cases^a | 34 |
| Evaluation software | 26 | Annotator(s) of test cases^a | 78 |
| Submission format^a | 91 | Annotation aggregation method(s) for test cases^a | 34 |
| Submission instructions | 91 | Data pre-processing method(s) | 24 |
| Field(s) of application^a | 97 | Potential sources of reference errors | 28 |
| Task category(ies)^a | 100 | Metric(s)^a | 96 |
| Target cohort^a | 65 | Justification of metrics^a | 23 |
| Algorithm target(s)^a | 99 | Rank computation method^a | 36 |
| Data origin^a | 98 | Interaction level handling^a | 44 |
| Assessment aim(s)^a | 38 | Missing data handling^a | 18 |
| Study cohort^a | 88 | Uncertainty handling^a | 7 |
| Context information^a | 35 | Statistical test(s)^a | 6 |
| Center(s)^a | 44 | Information on participants | 88 |
| Imaging modality(ies)^a | 99 | Results | 87 |
| Acquisition device(s) | 25 | Report document | 74 |
| Acquisition protocol(s) | 72 | | |
List of parameters identified as relevant when reporting a challenge, along with the percentage of challenge tasks for which information on the parameter was reported. Parameter definitions can be found in Supplementary Table 2.
^a Parameters used for structured challenge submission for the MICCAI 2018 challenges
Fig. 2 Robustness of rankings with respect to several challenge design choices. One data point corresponds to one segmentation task organized in 2015 (n = 56). The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest value still within 1.5 interquartile range (IQR) of the first quartile, and the highest value still within 1.5 IQR of the third quartile. a Ranking (metric-based) with the standard Hausdorff Distance (HD) vs. its 95% variant (HD95). b Mean vs. median in metric-based ranking based on the HD. c Case-based (rank per case, then aggregate with mean) vs. metric-based (aggregate with mean, then rank) ranking in single-metric ranking based on the HD. d Metric values per algorithm and rankings for reference annotations performed by two different observers. In the boxplots (a–c), descriptive statistics for Kendall's tau, which quantifies differences between rankings (1: identical ranking; −1: inverse ranking), are shown. Key examples (red circles) illustrate that slight changes in challenge design may lead to the worst algorithm (Ai: Algorithm i) becoming the winner (a) or to almost all teams changing their ranking position (d). Even for relatively high values of Kendall's tau (b: tau = 0.74; c: tau = 0.85), critical changes in the ranking may occur.
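To make the two ranking schemes compared in panels b and c concrete, the following minimal Python sketch illustrates a metric-based ranking (aggregate per-case metric values with the mean, then rank) and a case-based ranking (rank per case, then aggregate the ranks with the mean), and compares the two rankings with Kendall's tau. The HD values and the three algorithms are made up for illustration; this is not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

# Illustrative Hausdorff Distance values (lower is better):
# rows = algorithms, columns = test cases.
hd = np.array([
    [12.0,  9.5, 30.0,  8.0],   # A1 (one outlier case)
    [10.0, 11.0, 14.0, 12.5],   # A2
    [15.0, 14.0, 13.0, 16.0],   # A3
])

def metric_based_ranking(scores):
    """Aggregate per-case metric values with the mean, then rank (1 = best)."""
    return rankdata(scores.mean(axis=1))  # ascending: lower mean HD -> better rank

def case_based_ranking(scores):
    """Rank the algorithms on every test case first, then average the ranks."""
    per_case_ranks = np.apply_along_axis(rankdata, 0, scores)
    return rankdata(per_case_ranks.mean(axis=1))

r_metric = metric_based_ranking(hd)
r_case = case_based_ranking(hd)

# Kendall's tau quantifies agreement between the two rankings
# (1: identical ranking, -1: inverse ranking).
tau, _ = kendalltau(r_metric, r_case)
print("metric-based ranking:", r_metric)
print("case-based ranking:  ", r_case)
print("Kendall's tau:", round(tau, 2))
```

In this toy example the single outlier case of A1 penalizes it heavily under mean aggregation but much less under case-based ranking, which is exactly the kind of design sensitivity the figure quantifies.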
Fig. 3 The ranking scheme is a deciding factor for the ranking robustness. The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest value still within 1.5 interquartile range (IQR) of the first quartile, and the highest value still within 1.5 IQR of the third quartile. According to bootstrapping experiments with 2015 segmentation challenge data, single-metric rankings (those shown here are based on the DSC) are significantly more robust when the mean rather than the median is used for aggregation (left) and when the ranking is performed after aggregation rather than before (right). One data point represents the robustness of one task, quantified by the percentage of simulations in bootstrapping experiments in which the winner remains the winner.
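The robustness measure used in this figure can be approximated with a simple bootstrap over test cases: resample the cases with replacement, recompute the single-metric (here: aggregated DSC) ranking on each sample, and report the fraction of samples in which the original winner stays in first place. The sketch below is a hedged illustration with synthetic scores, not the study's actual analysis pipeline; it also shows how swapping the mean for the median as aggregation function can change the stability estimate.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative DSC values (higher is better): rows = algorithms, columns = test cases.
dsc = rng.uniform(0.6, 0.95, size=(5, 40))

def winner(scores, aggregate=np.mean):
    """Metric-based ranking: aggregate per-case DSC values, then pick the best algorithm."""
    return int(np.argmax(aggregate(scores, axis=1)))

def winner_stability(scores, aggregate=np.mean, n_bootstrap=1000):
    """Percentage of bootstrap samples (test cases drawn with replacement)
    in which the original winner remains the winner."""
    original = winner(scores, aggregate)
    n_cases = scores.shape[1]
    stable = 0
    for _ in range(n_bootstrap):
        sample = rng.integers(0, n_cases, size=n_cases)
        if winner(scores[:, sample], aggregate) == original:
            stable += 1
    return 100.0 * stable / n_bootstrap

print(f"mean aggregation:   winner stable in {winner_stability(dsc, np.mean):.1f}% of samples")
print(f"median aggregation: winner stable in {winner_stability(dsc, np.median):.1f}% of samples")
```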
Fig. 4 Robustness of rankings with respect to the data used. Robustness is shown for a single-metric ranking scheme based on the Dice Similarity Coefficient (DSC) (left), the Hausdorff Distance (HD) (middle), or the 95% variant of the HD (right). One data point corresponds to one segmentation task organized in 2015 (n = 56). The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest value still within 1.5 interquartile range (IQR) of the first quartile, and the highest value still within 1.5 IQR of the third quartile. Metric-based aggregation with the mean was performed in all experiments. Top: percentage of simulations in bootstrapping experiments in which the winner (according to the respective metric) remains the winner. Bottom: percentage of other participating teams that were ranked first in the simulations.
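For orientation, the three metrics compared here can be computed from binary segmentation masks roughly as in the sketch below. This is an illustrative implementation (using surface voxels extracted by binary erosion and SciPy's k-d tree for point-to-set distances), not the challenges' official evaluation code; the HD95 variant simply replaces the maximum of the surface distances with their 95th percentile, which makes it less sensitive to single outlier points.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial import cKDTree

def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two binary masks (1 = perfect overlap)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

def surface_points(mask):
    """Coordinates of the boundary voxels of a binary mask."""
    boundary = mask & ~binary_erosion(mask)
    return np.argwhere(boundary)

def hausdorff(mask_a, mask_b, percentile=100):
    """(Percentile) Hausdorff Distance between the surfaces of two binary masks.

    percentile=100 yields the standard HD; percentile=95 yields the HD95 variant.
    """
    pts_a, pts_b = surface_points(mask_a), surface_points(mask_b)
    d_ab = cKDTree(pts_b).query(pts_a)[0]  # distance of every surface point of A to B
    d_ba = cKDTree(pts_a).query(pts_b)[0]  # distance of every surface point of B to A
    return max(np.percentile(d_ab, percentile), np.percentile(d_ba, percentile))

# Toy example: two slightly shifted squares on a 2D grid.
ref = np.zeros((50, 50), dtype=bool)
seg = np.zeros((50, 50), dtype=bool)
ref[10:30, 10:30] = True
seg[12:32, 11:31] = True

print("DSC :", round(dice(ref, seg), 3))
print("HD  :", round(hausdorff(ref, seg), 2))
print("HD95:", round(hausdorff(ref, seg, percentile=95), 2))
```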
Fig. 5 Main results of the international questionnaire on biomedical challenges. Issues raised by the participants were related to the challenge data, the data annotation, the evaluation (including the choice of metrics and ranking schemes), and the documentation of challenge results.
Inclusion criteria at the challenge level
| # | Criterion | Number of affected tasks/challenges |
|---|---|---|
| 1 | If a challenge task has an on-site and an off-site part, the results of the part with more participating algorithms are used. | 1/1 |
| 2 | If multiple reference annotations are provided for a challenge task and no merged annotation is available, the results derived from the second annotator are used. (In one challenge, the first annotator produced annotations that differed radically from those of all other observers; the second observer was therefore used for all challenges.) | 2/2 |
| 3 | If multiple reference annotations are provided for a challenge task and a merged annotation is available, the results derived from the merged annotation are used. | 1/1 |
| 4 | If an algorithm produced invalid metric values in all test cases of a challenge task, this algorithm is omitted from the ranking. | 1/1 |
Inclusion criteria at the task level
| # | Criterion | Number of tasks excluded |
|---|---|---|
| 1 | Number of algorithms ≥ 3 | 42 |
| 2 | Number of test cases > 1 (required for the bootstrapping and cross-validation approaches) | 25 |
| 3 | No explicit argumentation against the use of the Hausdorff Distance as a metric | 1 |