| Literature DB >> 24312207 |
Daniel Garijo, Sarah Kinnings, Li Xie, Lei Xie, Yinliang Zhang, Philip E Bourne, Yolanda Gil.
Abstract
How easy is it to reproduce the results found in a typical computational biology paper? Either through experience or intuition, the reader will already know that the answer is "with difficulty" or "not at all." In this paper we attempt to quantify this difficulty by reproducing a previously published paper for different classes of users (ranging from users with little expertise to domain experts) and suggest ways in which the situation might be improved. Quantification is achieved by estimating the time required to reproduce each of the steps in the method described in the original paper and to make them part of an explicit workflow that reproduces the original results. Reproducing the method took several months of effort and required using new software versions and new tools, which posed challenges to reconstructing and validating the results. The quantification leads to "reproducibility maps" that reveal that novice researchers would only be able to reproduce a few of the steps in the method, and that only expert researchers with advanced knowledge of the domain would be able to reproduce the method in its entirety. The workflow itself is published as an online resource together with supporting software and data. The paper concludes with a brief discussion of the complexities of requiring reproducibility in terms of cost versus benefit, and a list of desiderata with our observations and guidelines for improving reproducibility. This has implications not only for reproducing the work of others from published papers, but also for reproducing work from one's own laboratory.
Year: 2013 PMID: 24312207 PMCID: PMC3842296 DOI: 10.1371/journal.pone.0080278
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. A high-level dataflow diagram of the TB drugome method.
Figure 2. The reproduced TB drugome workflow with the different subsections highlighted.
(1) Comparison of ligand binding sites using SMAP; (2) protein structure comparison using FATCAT; (3) docking using AutoDock Vina; and (4) graph network creation (visualization). We focus on the reproducibility of sections 1-3 here; a sketch of how these stages chain together follows below.
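To make the dataflow concrete, here is a minimal orchestration sketch, not the authors' published workflow. SMAP, FATCAT, and AutoDock Vina are the real tools named in the caption, but the SMAP and FATCAT command lines below are assumed placeholders; only the AutoDock Vina flags shown are part of its documented CLI, and all file names and search-box values are illustrative.

```python
# A minimal orchestration sketch of the three subsections reproduced here.
import subprocess

def run(cmd):
    """Run one workflow step, failing loudly so a broken step is visible."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

# (1) Ligand binding site comparison. SMAP is the real tool named above,
#     but this command line is a hypothetical placeholder.
run(["smap", "--query", "drug_site.pdb", "--targets", "tb_structures.list",
     "--out", "smap_hits.txt"])

# (2) Protein structure comparison. FATCAT is real; this invocation is
#     likewise assumed, not its documented interface.
run(["fatcat", "-p1", "hit_a.pdb", "-p2", "hit_b.pdb", "-o", "fatcat_out"])

# (3) Docking. These AutoDock Vina flags exist in its CLI; the file names
#     and search-box values are illustrative only.
run(["vina", "--receptor", "target.pdbqt", "--ligand", "drug.pdbqt",
     "--center_x", "10", "--center_y", "12", "--center_z", "8",
     "--size_x", "20", "--size_y", "20", "--size_z", "20",
     "--out", "docked.pdbqt"])
```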
Figure 3. Reproducibility maps of the three major subsections of the workflow.
A step is shown in red if it was not reproducible by that category of user, and in green if it was.
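A reproducibility map of this kind can be tabulated as a simple mapping from (step, user category) to a boolean. In the toy sketch below the step names echo the workflow subsections, but the True/False values are invented for illustration and are not the paper's findings.

```python
# Toy tabulation of a "reproducibility map": each (step, user category)
# cell records whether that user could reproduce the step. The boolean
# values are INVENTED for illustration, not the paper's results.
results = {
    ("SMAP comparison", "novice"): False,
    ("SMAP comparison", "expert"): True,
    ("FATCAT comparison", "novice"): False,
    ("FATCAT comparison", "expert"): True,
    ("AutoDock Vina docking", "novice"): False,
    ("AutoDock Vina docking", "expert"): True,
}

# Print each cell as red/green, mirroring the colour coding of Figure 3.
for (step, user), ok in sorted(results.items()):
    print(f"{step:22s} {user:8s} -> {'green' if ok else 'red'}")
```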
Time to reproduce the method.
| Tasks | Time (hours) |
| Familiarization with workflow and running software | 160 |
| SMAP steps | 32 |
| SMAP result sorter steps | 8 |
| Merger steps | 4 |
| Get significant results | 4 |
| FATCAT URL checker | 8 |
| FATCAT step | 4 |
| Remove significant pairs | 4 |
| Create clip files | 8 |
| Create ideal ligands | 8 |
| Ideal ligand checker | 8 |
| AutoDock Vina | 16 |
| Data visualization steps | 16 |
| TOTAL | 280 |
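As a quick arithmetic check, the per-task hours in the table do sum to the stated 280-hour total; the snippet below copies the figures verbatim and verifies the sum.

```python
# Sanity check on the effort table: per-task hours copied from the table
# above; summing them confirms the stated 280-hour total.
hours = {
    "Familiarization with workflow and running software": 160,
    "SMAP steps": 32,
    "SMAP result sorter steps": 8,
    "Merger steps": 4,
    "Get significant results": 4,
    "FATCAT URL checker": 8,
    "FATCAT step": 4,
    "Remove significant pairs": 4,
    "Create clip files": 8,
    "Create ideal ligands": 8,
    "Ideal ligand checker": 8,
    "AutoDock Vina": 16,
    "Data visualization steps": 16,
}
assert sum(hours.values()) == 280  # matches the TOTAL row
```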
Observations and desiderata for reproducibility.
| Observation |
| We found that important computational steps were either missing or ambiguous. |
| Software is often used with carefully selected parameter settings and configurations. |
| The possibility of re-running the method periodically with new versions of software tools, leading to new findings, might help entice researchers to keep their methods readily reproducible. |
| Published results may depend on third-party data sources that are not always accessible, which can make the experiments run by the original authors irreproducible. |
| To implement some steps of their methods, authors often use proprietary software or software that is not widely available. |
| Although many methods are implemented using public-domain software tools, they often contain additional steps that were implemented by the authors. |
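Several of the observations above (unrecorded parameter settings, shifting software versions, volatile third-party data) can be mitigated by recording each step's tool version, parameters, and input checksums in a machine-readable manifest. The sketch below is a generic illustration of that idea, not the paper's workflow system; all field values are illustrative, though `exhaustiveness` is a real AutoDock Vina option.

```python
# Generic provenance-manifest sketch: record a step's tool version,
# parameters, and input checksums so later re-runs can detect changed
# software or silently updated data. Field values are illustrative.
import hashlib
import json

def sha256(path):
    """Checksum an input file so a later run can detect silent data changes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

manifest = {
    "step": "docking",
    "tool": "AutoDock Vina",
    "version": "1.1.2",                      # illustrative version pin
    "parameters": {"exhaustiveness": 8},     # real Vina option, example value
    "inputs": {"target.pdbqt": sha256("target.pdbqt")},
}
print(json.dumps(manifest, indent=2))
```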
Reproducibility Guidelines for Authors.
| Guideline |
| 1. |
| 2. |
| 3. |
| 4. |
| 5. |