Ashley Shade, Tracy K Teal.
Abstract
Extremely large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. Here, we present a step-by-step guide to computing workflows with the biologist end-user in mind. Starting from a foundation of sound data management practices, we make specific recommendations on how to approach and perform computational analyses of large datasets, with a view to enabling sound, reproducible biological research.
Year: 2015; PMID: 26600012; PMCID: PMC4658184; DOI: 10.1371/journal.pbio.1002303
Source DB: PubMed; Journal: PLoS Biol; ISSN: 1544-9173; Impact factor: 8.029
Analogies between computing and “wet-bench” experiments.
| Computing task | Wet-bench analogy | Example |
|---|---|---|
| Exploring parameter space | Using multiple experimental designs to address a hypothesis | Complementing in vitro and in vivo (and in silico!) experiments |
| Comparing different computing tools or software | Comparing protocols that perform the same task | Comparing kits from different manufacturers for nucleic acid extraction |
| Changing variables (flags or options) within a computing tool | Making minor adjustments in a single protocol | Changing buffer conditions in a PCR |
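Exploring parameter space, as described in the table above, can be sketched as enumerating every combination of option values before running anything. This is a minimal illustration only: the option names `min_length` and `quality_threshold` are hypothetical and do not belong to any particular tool.

```python
from itertools import product


def parameter_grid(options):
    """Return one dict per combination of the supplied option values."""
    names = sorted(options)  # fixed order keeps run numbering stable
    value_lists = [options[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical options for a hypothetical read-trimming step.
options = {"min_length": [36, 50], "quality_threshold": [15, 20, 25]}
grid = parameter_grid(options)

for run_number, params in enumerate(grid, start=1):
    print(f"run {run_number}: {params}")
```

Listing the combinations up front makes it easier to record which settings were tried, much as a lab notebook records each buffer condition tested in a PCR.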
Fig 1. Workflow for biological computing.
The workflow begins with read-only, secure raw data and ends with final code and data, ultimately accessible in a version-controlled repository (green boxes and arrows). Default and alternative parameters are explored and compared for each tool to optimize the analysis, and best choices (red boxes/text) are informed by biological and statistical expectations of the data. Purple ellipses show reproducibility checkpoints, with self-checkpoints numbered consecutively (here, 1 through 4). Purple dashed lines show iterative steps in the workflow that occur at reproducibility checkpoints. The workflow-in-progress is edited at every step until the documentation and code are finalized.
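One way to approximate the "read-only, secure raw data" starting point is to record a checksum for each raw file and then remove write permission. The sketch below assumes a flat directory of raw files on a POSIX-style file system; the function names and the example file `reads.fastq` are hypothetical, not part of the published workflow.

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_raw_data(raw_dir):
    """Record a checksum per file, then clear all write-permission bits."""
    manifest = {}
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in sorted(Path(raw_dir).iterdir()):
        if path.is_file():
            manifest[path.name] = sha256_of(path)
            os.chmod(path, path.stat().st_mode & ~write_bits)
    return manifest


def verify_raw_data(raw_dir, manifest):
    """Return the names of files whose current checksum differs from the manifest."""
    return [
        name
        for name, expected in manifest.items()
        if sha256_of(Path(raw_dir) / name) != expected
    ]


# Small demonstration on a temporary directory with a made-up FASTQ record.
with tempfile.TemporaryDirectory() as raw_dir:
    (Path(raw_dir) / "reads.fastq").write_bytes(b"@read1\nACGT\n+\nIIII\n")
    manifest = protect_raw_data(raw_dir)
    print(verify_raw_data(raw_dir, manifest))  # an empty list means nothing changed
```

Storing the manifest alongside the workflow documentation lets anyone confirm later that the analysis started from unmodified raw data.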
Fig 2. A simplified schematic of an example workflow for bacterial or archaeal genome assembly.
These tools represent just a subset of those available, for illustration purposes. For example, Trimmomatic is one tool for trimming Illumina FASTQ data and removing adapters.
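To give a sense of what a trimming step does to FASTQ data, here is a toy stand-in that removes low-quality bases from the 3' end of a single read. It is not Trimmomatic's algorithm or interface; the function name and the default threshold are illustrative assumptions, and quality strings are assumed to use the Phred+33 encoding.

```python
def trim_trailing(sequence, quality, min_quality=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(sequence)
    # Walk backwards from the last base until a base meets the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under the Phred+33 convention.
trimmed_seq, trimmed_qual = trim_trailing("ACGTAC", "IIII##")
print(trimmed_seq, trimmed_qual)  # prints: ACGT IIII
```

Changing `min_quality` here is the kind of single-option adjustment that is worth comparing against the default before settling on a value.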
Reproducibility checkpoints during the development and refinement of a computational workflow.
| Type | On what? | By whom? |
|---|---|---|
| Self | Every parameterized step in a workflow | User(s) who develop the analysis workflow |
| Internal | Final, complete workflow | At least one colleague in the research group |
| External | Final, complete workflow | Crowdsourcing (e.g., GitHub/BitBucket/R community) |
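A self-checkpoint on a parameterized step can be approximated by rerunning the step with identical input and parameters and confirming that the outputs match. The sketch below assumes the step returns a JSON-serializable result; `self_checkpoint`, `fingerprint`, and the example step `filter_reads` are hypothetical names used only for illustration.

```python
import hashlib
import json


def fingerprint(result):
    """Hash a JSON-serializable result, independent of dictionary key order."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def self_checkpoint(step, data, params, repeats=2):
    """Run `step` repeatedly with identical input; True if every output agrees."""
    fingerprints = {fingerprint(step(data, **params)) for _ in range(repeats)}
    return len(fingerprints) == 1


# Hypothetical workflow step: keep reads at least `min_length` bases long.
def filter_reads(reads, min_length):
    return [read for read in reads if len(read) >= min_length]


reads = ["ACGT", "ACGTACGT", "AC"]
print(self_checkpoint(filter_reads, reads, {"min_length": 4}))  # prints: True
```

The same comparison can support an internal checkpoint: a colleague runs the finalized workflow independently and the two sets of fingerprints are compared.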