Instrumentation bias in the use and evaluation of scientific software: recommendations for reproducible practices in the computational sciences.

Nicholas J Tustison, Hans J Johnson, Torsten Rohlfing, Arno Klein, Satrajit S Ghosh, Luis Ibanez, Brian B Avants.

Keywords: best practices; comparative evaluations; confirmation bias; open science; reproducibility

Year: 2013 PMID: 24058331 PMCID: PMC3766821 DOI: 10.3389/fnins.2013.00162

Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677

By honest I don't mean that you only tell what's true. But you make clear the entire situation. You make clear all the information that is required for somebody else who is intelligent to make up their mind.

1. Introduction

The neuroscience community benefits significantly from the proliferation of imaging-related analysis software packages. Established packages such as SPM (Ashburner, 2012), the FMRIB Software Library (FSL) (Jenkinson et al., 2012), FreeSurfer (Fischl, 2012), Slicer (Fedorov et al., 2012), and the AFNI toolkit (Cox, 2012) aid neuroimaging researchers around the world in performing complex analyses as part of ongoing neuroscience research. In addition to distributing robust software tools, neuroimaging packages continue to incorporate algorithmic innovations that improve analysis tools. As fellow scientists who actively participate in neuroscience research through our contributions to the Insight Toolkit (e.g., Johnson et al., 2007; Ibanez et al., 2009; Tustison and Avants, 2012) and other packages such as MindBoggle, Nipype (Gorgolewski et al., 2011), and the Advanced Normalization Tools (ANTs) (Avants et al., 2010, 2011), we notice an increasing number of publications that intend a fair comparison of algorithms, which, in principle, is a good thing. Our concern is the lack of detail with which these comparisons are often presented and the corresponding possibility of instrumentation bias (Sackett, 1979), where “defects in the calibration or maintenance of measurement instruments may lead to systematic deviations from true values” (considering software as a type of instrument requiring proper “calibration” and “maintenance” for accurate measurements). Based on our experience (including our own mistakes), we propose a preliminary set of guidelines that seek to minimize such bias, with the understanding that the discussion will require a more comprehensive response from the larger neuroscience community. Our intent is to raise awareness among both authors and reviewers of issues that arise when comparing quantitative algorithms.
Although we focus largely on image registration here, these recommendations are relevant for other application areas in biologically focused computational image analysis, and for reproducible computational science in general. This commentary complements recent papers that highlight statistical bias (Kriegeskorte et al., 2009; Vul and Pashler, 2012), bias induced by registration metrics (Tustison et al., 2012) and registration strategy (Yushkevich et al., 2010), and guideline papers for software development (Prlic and Procter, 2012).

2. Guidelines

A comparative analysis paper's longevity and impact on future scientific explorations are directly related to the completeness of the evaluation. A complete evaluation requires preparation (before any experiment is performed) and effort to publish its details and results. Here, we suggest general guidelines for both of these steps, most of which derive from basic scientific principles of clarity and reproducibility.

2.1. Designing the evaluation study

The very idea that one (e.g., registration) algorithm could perform better than all other algorithms on all types of data is fundamentally flawed. Indeed, the “No Free Lunch Theorem” provides bounds on solution quality. That is, it specifically demonstrates that “improvement of performance in problem-solving hinges on using prior information to match procedures to problems” (Wolpert and Macready, 1997). Therefore, the first thing that authors of new algorithms should do is identify how their methods differ with respect to other available techniques in terms of the use of prior knowledge. Furthermore, the author must consider if it is possible to incorporate prior knowledge across existing methods.

2.1.1. Demand that the algorithm developers provide default parameters for the comparative context being investigated

Expert knowledge of a specific program and/or algorithm is most likely found with the original developers who would be in a position to provide optimal parameterization. Relevant parameter files and sample scripts that detail command line calls should accompany an algorithm to aid in its proper use, evaluation, and comparison. For example, the developers of the image registration program elastix (Klein et al., 2010) provide an assortment of parameter files on a designated wiki page listed in tabular format complete with short description (including applied modality and object of interest) and any publications which used that specific parameter file. Another example is the National Alliance for Medical Image Computing registration use case inventory where each listed case comprises a test dataset, a guided step-by-step tutorial, the solution, and a custom Registration Parameter Presets file with optimized registration parameters.

2.1.2. Do not implement your own version of an algorithm, particularly if one is available from the original authors. If you must re-implement, consider making your implementation available

Much is left unstated in published manuscripts where novel algorithmic concepts are presented. Ideally, the authors provide an instantiation of the code to accompany the manuscript. As observed in Kovacevic (2006), however, this is often not the case (even in terms of pseudocode). As a result, comparative evaluations are sometimes carried out using code developed not by the original authors but by the group doing the comparison. For example, in Clarkson et al. (2011), the authors compared three algorithms for estimating cortical thickness. Two of the algorithms were coded by the authors of the study while the third was used “off the shelf.” Thus, a natural question to ask is whether the performance difference is due to the algorithm itself, implementation quality, and/or the parameter tuning. None of these is addressed by Clarkson et al. (2011), which may decrease the publication's usefulness.

2.1.3. Perform comparisons on publicly available data

For reasons of reproducibility and transparency, evaluations should be performed using publicly available data sets. Given the rather large number of such institutional efforts, including NIREP, IXI, NKI, OASIS, Kirby, LONI, and others, evaluations should include (if not consist entirely of) comparisons using such data. While evaluation on private cohorts might exclude such possibilities, such evaluations should be extensively motivated in the introduction and/or discussion. For example, if a particular algorithm with general application is found to perform better on a private cohort of Parkinson's disease subject data, reasons for the performance disparity should be offered and supplemented with analysis on public data.

2.2. Publishing the evaluation

2.2.1. Include parameters

In Klein et al. (2009), 14 non-linear registration algorithms were compared using four publicly available, labeled brain MRI data sets. As part of the study, the respective algorithms' authors were given an opportunity to tune the parameters to ensure good performance; these parameters were then distributed on Prof. Klein's website. In contrast, not specifying parameters leaves one susceptible to criticisms of confirmation and/or instrumentation bias. For example, in a recent paper (Haegelen et al., 2013), the authors compared their ANIMAL registration algorithm with SyN (Avants et al., 2011) and determined that “registration with ANIMAL was better than with SyN for the left thalamus” in a cohort of Parkinson's disease patients. The difference in the authors' experience and investment between the two algorithms could bias algorithmic performance assessment. However, inclusion of parameter settings for ANIMAL and SyN would permit independent verification by reviewers or readers of the article.
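
One lightweight way to make parameter settings available is to store them in a machine-readable file that accompanies the results. The following is a minimal sketch in Python, not a method described in the studies cited above. The algorithm name, version string, and parameter names are hypothetical placeholders.

```python
import json


def parameter_record(algorithm, version, parameters):
    """Bundle the settings used for one algorithm into a plain dictionary."""
    return {
        "algorithm": algorithm,
        "version": version,
        "parameters": dict(parameters),
    }


def write_parameter_record(path, record):
    """Save the record as sorted, indented JSON so it is easy to diff."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


def read_parameter_record(path):
    """Load a record previously written by write_parameter_record."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical values, for illustration only.
example = parameter_record(
    algorithm="algorithm_a",
    version="0.0.0-example",
    parameters={"iterations": 100, "smoothing_sigma": 2.0},
)
```

Writing one such file per algorithm and per experiment, and publishing it with the manuscript, lets a reader rerun the comparison with the same settings.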

2.2.2. Provide details as to the source of the algorithm

The origin of any code or package used during the evaluation should be provided. For example, N4 (Tustison et al., 2010) is a well-known inhomogeneity correction algorithm for MRI first made available as a tech report (Tustison and Gee, 2009). However, since its inclusion in the Insight Toolkit, several programs that provide it have been made available: N4 is available in ANTs (the only version directly maintained by the original authors), as a module in Slicer, as a wrapper of the Slicer module in Nipype, as a module in c3d, and as a plugin in the BRAINS suite. While each version is dependent on the original source code, there could exist subtle variations which can affect performance. As one specific example, the c3d implementation hard-codes certain parameter values that the user cannot modify.

2.2.3. Co-authors should verify findings

Although different journals have varying guidelines for determining co-authorship, there is at least an implied sense of responsibility for an article's contents assumed by each of the co-authors. Journal editorial boards use strategies such as requiring a list of each co-author's specific contributions to reduce undeserved authorship attribution. Additional proposals have included signed statements of responsibility for the contents of an article (Anonymous, 2007). We suggest that at least one co-author independently verify a subset of the results by running the data processing and analysis on their own computational platform. The point of this exercise is to verify not only reproducibility but also that the process can be explained in sufficient detail.
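
As an illustration of such an independent check, the sketch below compares summary values from the original run with those from a co-author's rerun and reports any that disagree beyond a tolerance. The metric names, the values, and the tolerance are hypothetical. An appropriate tolerance depends on the algorithm and the platform.

```python
import math


def compare_results(original, rerun, rel_tol=1e-6, abs_tol=0.0):
    """Return the entries that differ between two runs.

    Both arguments map a result name to a numeric value. The returned
    dictionary maps each disagreeing name to (original, rerun); a name
    missing from one run is reported with None on that side.
    """
    discrepancies = {}
    for key in sorted(set(original) | set(rerun)):
        if key not in original or key not in rerun:
            discrepancies[key] = (original.get(key), rerun.get(key))
        elif not math.isclose(
            original[key], rerun[key], rel_tol=rel_tol, abs_tol=abs_tol
        ):
            discrepancies[key] = (original[key], rerun[key])
    return discrepancies


# Hypothetical values, for illustration only.
first_run = {"subject01_overlap": 0.812, "subject02_overlap": 0.790}
second_run = {"subject01_overlap": 0.812, "subject02_overlap": 0.745}
differences = compare_results(first_run, second_run)
```

An empty result indicates agreement within the chosen tolerance. Any reported entry is a starting point for tracing differences in platform, version, or parameters.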

2.2.4. Provide computational platform details of the evaluation

A recent article (Gronenschild et al., 2012) pointed out significant differences in FreeSurfer output that varied with release version and with operating system. While the former is to be expected given the upgrades and bug fixes that occur between releases, the latter underscores both the need for consistency in study processing and the need to report computational details for reproducibility.
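
Platform details can be collected automatically at the time of the analysis rather than reconstructed afterwards. The following is a minimal sketch that uses only the Python standard library. The `tool_versions` argument and its contents are hypothetical and must be supplied by the user, because each analysis package reports its version in its own way.

```python
import json
import platform


def platform_report(tool_versions=None):
    """Collect basic details of the machine and interpreter in use."""
    return {
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        # Versions of the analysis tools, provided by the caller.
        "tool_versions": dict(tool_versions or {}),
    }


# "example_tool" and its version are placeholders, not a real package.
report = platform_report({"example_tool": "0.0.0-example"})
report_text = json.dumps(report, indent=2, sort_keys=True)
```

Saving `report_text` next to the results of each run records the release version and operating system that the reported numbers depend on.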

2.2.5. Supply pre- and post-processing steps

In addition to disclosure of all parameters associated with the methodologies to be compared, all processing steps from the raw to the final processed images in the workflow need to be specified. Tools like Nipype (Gorgolewski et al., 2011) capture this provenance information in a formal and rigorous way, but at a minimum the shell scripts or screenshots of the parameter choices should be made available. Justification for any deviation in steps between algorithms needs to be provided.
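
Where a full workflow engine is not in use, even a simple log of every command executed goes some way toward documenting the processing steps. The sketch below is plain Python and does not use or imitate the Nipype interface. The step names and commands are hypothetical stand-ins for real pre- and post-processing calls.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def run_step(name, command, log):
    """Run one processing step and append what happened to the log.

    `command` is a list of program arguments. Returns True if the
    command exited with status 0.
    """
    started = datetime.now(timezone.utc).isoformat()
    completed = subprocess.run(command, capture_output=True, text=True)
    log.append(
        {
            "step": name,
            "command": list(command),
            "started_utc": started,
            "return_code": completed.returncode,
            "stdout": completed.stdout,
            "stderr": completed.stderr,
        }
    )
    return completed.returncode == 0


# Placeholder steps that only print a message; a real pipeline would
# call the actual processing programs here.
steps_log = []
run_step("preprocess", [sys.executable, "-c", "print('preprocessing')"], steps_log)
run_step("register", [sys.executable, "-c", "print('registration')"], steps_log)
steps_text = json.dumps(steps_log, indent=2)
```

The resulting log lists the steps in execution order with their exact arguments, which also makes any deviation in steps between algorithms visible.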

2.2.6. Post the resulting data online

The current publishing paradigm limits the quantity of results that can be posted. There are only so many pages allowed for a particular publication and displaying every slice of every processed image, for example, is not feasible. This results in possible selection bias where results provided in the manuscript are selected by the authors for demonstrating the effect postulated at the onset of the study. Thus, differences in performance assessment tend to be exaggerated based strictly on visual representations in the paper. Publication simply in print (or as figures in a PDF file), with its limitations in dynamic range and spatial resolution, also severely limits the ability of reviewers and readers to perform more sophisticated evaluation beyond simple visual inspection. Alternatively (or additionally), online resources such as the LONI Segmentation Validation Engine (Shattuck et al., 2009) can be used to evaluate individual algorithms for brain segmentation on publicly available data sets and compare with previously posted results. A top ranking outcome provides significant external validation for publishing newly proposed methodologies (e.g., Eskildsen et al., 2012).
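
When result files are posted online, a checksum manifest lets readers confirm that the files they downloaded are the ones that were evaluated. The following is a minimal sketch that assumes the results sit in a single directory tree. The choice of SHA-256 and the manifest layout are illustrative and not prescribed by any of the resources mentioned above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path relative to `directory` to its digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Calling `build_manifest` on the results directory before upload, and publishing the returned mapping alongside the data, allows anyone to recompute the digests after download and compare.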

2.2.7. Put comparisons and observed performance differences into context

In addition to algorithmic and study specifics, it is important to discuss potential limitations concerning qualitative and/or quantitative assessment metrics. In Rohlfing (2012), the author pointed out deficiencies in using standard overlap measures and image similarity metrics in quantifying performance of image registration methods. Other issues, such as biological plausibility of the resulting transforms, also need to be considered. Discussion of the possible reasons for performance disparity is also important: if one algorithm outperforms another, reporting those findings is much more significant when the authors discuss possible reasons for the relative performance levels.
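
For readers less familiar with overlap measures, the sketch below computes the Dice coefficient, one widely used overlap measure, on toy one-dimensional regions. The example is illustrative and is not taken from Rohlfing (2012). It shows only that the same one-voxel shift yields very different scores for a large and a small structure, which is one reason a raw overlap value needs context.

```python
def dice(region_a, region_b):
    """Dice overlap of two regions given as collections of voxel indices.

    Defined as 2 * |A and B| / (|A| + |B|). Two empty regions are
    scored 1.0 here by convention.
    """
    a, b = set(region_a), set(region_b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# A 10-voxel structure shifted by one voxel: 9 voxels still overlap.
large_structure = dice(range(0, 10), range(1, 11))

# A 2-voxel structure shifted by one voxel: 1 voxel still overlaps.
small_structure = dice(range(0, 2), range(1, 3))

print(large_structure, small_structure)  # prints: 0.9 0.5
```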

3. Conclusion

Considering that computational sciences permeate neuroimaging research, certain safeguards should be in place to prevent (or at least minimize) potential biases and errors that can unknowingly affect study outcomes. There is no vetting agency for ensuring that analysis programs used for research are reasonably error-free. In addition, these software packages are simply “black boxes” to many researchers who are not formally trained to debug code, and who, in most cases, have only a very superficial understanding of the algorithms that they apply. And even to those of us who are trained to debug code, understanding someone else's code, perhaps implemented in an unfamiliar programming language and different coding style, is oftentimes very difficult. To this end, algorithmic comparisons are a very good way of evaluating general performance. We hope that the guidelines proposed in this editorial help the community in future comparative assessments and avoid errors in scientific computing that may otherwise lead to publication of invalid results (Merali, 2010).

26 in total

1. BEaST: brain extraction based on nonlocal segmentation technique.

Authors: Simon F Eskildsen; Pierrick Coupé; Vladimir Fonov; José V Manjón; Kelvin K Leung; Nicolas Guizard; Shafik N Wassef; Lasse Riis Østergaard; D Louis Collins
Journal: Neuroimage Date: 2011-09-16 Impact factor: 6.556

2. Who is accountable?

Authors: Anonymous
Journal: Nature Date: 2007-11-01 Impact factor: 49.962

3. A comparison of voxel and surface based cortical thickness estimation methods.

Authors: Matthew J Clarkson; M Jorge Cardoso; Gerard R Ridgway; Marc Modat; Kelvin K Leung; Jonathan D Rohrer; Nick C Fox; Sébastien Ourselin
Journal: Neuroimage Date: 2011-05-26 Impact factor: 6.556

4. Bias in analytic research.

Authors: D L Sackett
Journal: J Chronic Dis Date: 1979

5. N4ITK: improved N3 bias correction.

Authors: Nicholas J Tustison; Brian B Avants; Philip A Cook; Yuanjie Zheng; Alexander Egan; Paul A Yushkevich; James C Gee
Journal: IEEE Trans Med Imaging Date: 2010-04-08 Impact factor: 10.048

6. Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python.

Authors: Krzysztof Gorgolewski; Christopher D Burns; Cindee Madison; Dav Clark; Yaroslav O Halchenko; Michael L Waskom; Satrajit S Ghosh
Journal: Front Neuroinform Date: 2011-08-22 Impact factor: 4.081

7. The optimal template effect in hippocampus studies of diseased populations.

Authors: Brian B Avants; Paul Yushkevich; John Pluta; David Minkoff; Marc Korczykowski; John Detre; James C Gee
Journal: Neuroimage Date: 2009-10-08 Impact factor: 6.556

8. FSL.

Authors: Mark Jenkinson; Christian F Beckmann; Timothy E J Behrens; Mark W Woolrich; Stephen M Smith
Journal: Neuroimage Date: 2011-09-16 Impact factor: 6.556

9. SPM: a history.

Authors: John Ashburner
Journal: Neuroimage Date: 2011-10-17 Impact factor: 6.556

10. The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements.

Authors: Ed H B M Gronenschild; Petra Habets; Heidi I L Jacobs; Ron Mengelers; Nico Rozendaal; Jim van Os; Machteld Marcelis
Journal: PLoS One Date: 2012-06-01 Impact factor: 3.240
