Sehrish Kanwal, Farah Zaib Khan, Andrew Lonie, Richard O Sinnott.
Abstract
BACKGROUND: Computational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow and remains a challenge that is not fully addressed. This is due to an incomplete understanding of reproducibility requirements and of the assumptions made by workflow definition approaches. Provenance information should be tracked and used to capture all these requirements, supporting reusability of existing workflows.
Keywords: Common Workflow Language (CWL); Cpipe; Galaxy; Provenance; Reproducibility; Workflow
Year: 2017 PMID: 28701218 PMCID: PMC5508699 DOI: 10.1186/s12859-017-1747-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Computational bioinformatics workflows are often deployed to deal with the data processing bottleneck. A typical workflow consists of a series of linked steps that transform raw input (e.g. a fastq file produced as a result of NGS data) into meaningful or interpretable output (e.g. variant calls). Typically, these steps are performed by specific tools developed to tackle a particular functional aspect of genomic sequence analysis. Workflows can have a variable number of steps depending on the type of analysis performed, and hence can be simple or complex
Fig. 2 Screenshots of the Galaxy interface showing (a) a temporary sequence dictionary file creation using CreateSequenceDictionary as part of the RealignTargetCreator and IndelRealigner steps and (b) the “Map with BWA-MEM” step combining indexing of reference data, SAM to BAM conversion and sorting of the resultant aligned (BAM) file
Fig. 3 Graphical representation of the GATK workflow representing artefacts and information necessary to be captured as part of workflow execution. The description of the main steps is depicted in the black rectangles, whereas the tools responsible for carrying out the steps are shown in grey ellipses. Input and reference files (brown rounded rectangles) are shown separately and labelled by the dataset name. The primary and secondary output files (if any) are shown in dark and light green snip diagonal corner rectangles respectively. The input and output data flow for each workflow step is demonstrated through red and green dotted arrows respectively. The connection between processes in a workflow is represented by a blue solid arrow. The yellow highlighted parts of the workflow are the pivotal processes not explicitly declared in Galaxy and Cpipe. The red flag highlights the main input and final output for the workflow
Fig. 4 The variant calling workflow representation in Galaxy
Summary of assumptions (detailed in the section “Workflow enactment using the selected systems”) and corresponding recommendations for reproducibility
| Assumptions | Recommendations |
|---|---|
| Availability of sufficient storage and compute resources to deal with processing of big genomics data | Workflow developers should provide complete documentation of compute and storage requirements along with the workflow to achieve long-term reproducibility of scientific results. |
| Availability of high performance networking infrastructure to move bulk genomics data | Considering the size and volume of genomic data, researchers reproducing any analysis should ensure that an appropriate networking infrastructure for data transfer is available. |
| The computing platform is preconfigured with the base software required by the workflow specification | Workflow developers should provide a mechanism with checkpoints to ensure compatibility of the computing platform deployed by a researcher to reproduce the original analysis. |
| Users are responsible for ensuring access to copyrighted or proprietary tools | The community should encourage work leveraging open source software and collaborative approaches, thereby avoiding use of copyrighted or proprietary tools. |
| An analysis environment with a particular directory structure and file naming conventions is set up before executing the workflow | Workflow developers should avoid hardcoding environmental parameters such as file names, absolute file paths and directory names that would otherwise render their workflow dependent on a specific environment setup and configuration. |
| Appropriate datasets are used as input to the tools incorporated in the workflow | As bioinformatics analysis tools require strict adherence to input or reference file formats, data annotations and controlled access to primary data can ultimately help reproduce the workflow precisely. |
| Users will have a comprehensive understanding of the analysis and the provided information (in the form of an incomplete workflow diagram) is sufficient to convey high level understanding of the workflow | Workflow developers should provide a complete data flow diagram serving as a blueprint containing all the artefacts including tools, input data, intermediate data products, supporting resources, processes and the connection between these artefacts. |
| Availability of specific tool versions and setting relevant parameter space | Tools should either be packaged along with the workflow or made available via public repositories to ensure accessibility to the exact same versions and parameter settings as used in the analysis being reproduced, hence supporting flexible and customizable workflows. |
| Users have proficient knowledge of the specific reference implementation | This factor might be considered out of the control of workflow developers, but detailed documentation of the underlying framework used and community support can help in overcoming the associated learning curve. |
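
Two of the recommendations above (platform compatibility checkpoints, and no hardcoded file names or paths) can be illustrated with a small preflight check. The following is a minimal sketch and not something described in the paper: the names `PreflightReport`, `preflight` and `main` are hypothetical, and the tool names `bwa` and `samtools` are illustrative assumptions rather than the workflow's actual tool list. Input paths are passed as arguments instead of being embedded in the script.

```python
import shutil
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PreflightReport:
    """Collects what the platform is missing before the workflow starts."""
    missing_tools: list = field(default_factory=list)
    missing_files: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not (self.missing_tools or self.missing_files)


def preflight(required_tools, input_files) -> PreflightReport:
    """Check that each executable is on PATH and each input file exists."""
    report = PreflightReport()
    for tool in required_tools:
        # shutil.which returns None when the executable cannot be found.
        if shutil.which(tool) is None:
            report.missing_tools.append(tool)
    for path in input_files:
        # Paths come from the caller, so nothing is hardcoded here.
        if not Path(path).is_file():
            report.missing_files.append(str(path))
    return report


def main(argv) -> int:
    """Hypothetical entry point: argv holds the input file paths."""
    # Assumption: illustrative tool names, not the paper's tool list.
    report = preflight(["bwa", "samtools"], argv)
    if not report.ok:
        print("Missing tools:", report.missing_tools)
        print("Missing inputs:", report.missing_files)
        return 1
    print("Preflight passed")
    return 0
```

A workflow wrapper could call `main(sys.argv[1:])` before launching the first step and stop early when the return value is non-zero. Checking exact tool versions, as the table also recommends, would need tool-specific logic and is left out of this sketch.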