Ola Spjuth, Erik Bongcam-Rudloff, Guillermo Carrasco Hernández, Lukas Forer, Mario Giovacchini, Roman Valls Guimera, Aleksi Kallio, Eija Korpelainen, Maciej M Kańduła, Milko Krachunov, David P Kreil, Ognyan Kulev, Paweł P Łabaj, Samuel Lampa, Luca Pireddu, Sebastian Schönherr, Alexey Siretskiy, Dimitar Vassilev.
Abstract
High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks at large scale. Workflow systems can simplify the construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present the experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The organizations are working on similar problems, but we have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences, we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.
Year: 2015 PMID: 26282399 PMCID: PMC4539931 DOI: 10.1186/s13062-015-0071-8
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Fig. 1. Visual representation of a user-made ChIP-seq data analysis workflow in the Chipster software. After detecting STAT1 binding regions in the genome, the user has filtered the resulting peaks for q-value, length and peak height. S/he has then looked for common sequence motifs in the peaks and matched them against a transcription factor binding site database. S/he has also retrieved the closest genes to the peaks and performed pathway enrichment analysis for them. Finally, s/he has checked if the enriched pathways contain the STAT signaling pathway. All these downstream analysis steps can be saved as an automatic workflow, which can be shared and executed on another dataset. In addition to analysing data and building workflows, Chipster allows users to visualize data interactively. As an example, genome browser visualization is shown (bottom right panel).
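The peak-filtering step described in this caption (filtering by q-value, length and peak height) can be illustrated with a minimal Python sketch. This is not Chipster's implementation: the `Peak` record, the `filter_peaks` function and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """A called ChIP-seq peak (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    q_value: float
    height: float

    @property
    def length(self) -> int:
        return self.end - self.start


def filter_peaks(peaks, max_q=0.01, min_length=100, min_height=10.0):
    """Keep peaks that pass all three cut-offs (thresholds are assumptions)."""
    return [
        p for p in peaks
        if p.q_value <= max_q and p.length >= min_length and p.height >= min_height
    ]


# Made-up example data: only the first peak passes every cut-off.
EXAMPLE_PEAKS = [
    Peak("chr1", 100, 400, 0.001, 25.0),    # passes all filters
    Peak("chr1", 900, 950, 0.001, 30.0),    # too short
    Peak("chr2", 500, 900, 0.2, 40.0),      # q-value too high
    Peak("chr2", 1000, 1300, 0.005, 3.0),   # too low
]

if __name__ == "__main__":
    for peak in filter_peaks(EXAMPLE_PEAKS):
        print(peak)
```

Saving such a step with fixed parameters is what makes it reusable on another dataset, which is the point the caption makes about automatic workflows.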
Fig. 2. Components in CRS4’s automation system. The system has been created by linking together freely available components with some specialized software built in-house. In addition to running preliminary processing, it records operations within OMERO.biobank, thus ensuring reproducibility.
Fig. 3. Example of a Galaxy workflow used at CRS4 to generate demultiplexed FASTQ files starting from an Illumina run directory. The BCL-to-qseq conversion and the demultiplexing operations are performed on a Hadoop cluster using the Seal toolkit.
Advantages and disadvantages of different categories of automation strategies for bioinformatics
| Strategy | Advantages | Disadvantages |
|---|---|---|
| Scripting | ∙ Simple to construct | ∙ Hard to hand over, manual tools integration and difficult HPC interaction |
| Makefile | ∙ Simple to construct once you are familiar with the programming languages and the bioinformatics command-line tools involved | ∙ Multithreaded programs and remote execution not handled well |
| | ∙ Describes data flow and takes care of dependency resolution, parallel execution and caching results from previous runs | ∙ Lack of recursion support |
| | ∙ Uses code fragments in familiar scripting languages for processing of data | ∙ Requires programming or shell experience |
| | | ∙ Can’t be automatically parsed and visualized |
| Scientific Workflow Systems | ∙ More powerful features, easier to maintain and share | ∙ Requires more effort |
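
To make the Makefile row more concrete, the following is a minimal Python sketch of make-style dependency resolution with caching of results from previous runs. It is an illustration under stated assumptions, not how `make` itself works internally: `make` compares file timestamps, whereas this sketch compares content hashes to keep the example deterministic, and it runs steps sequentially rather than in parallel. The names `build`, `rules`, `state` and the two toy processing steps are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path):
    """Content hash of a file, used here in place of make's timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build(target, rules, state, log):
    """Bring `target` up to date; `rules` maps target -> (dependencies, action)."""
    if target not in rules:
        return  # a source file: nothing to build
    deps, action = rules[target]
    for dep in deps:
        build(dep, rules, state, log)  # resolve dependencies first
    signature = [digest(dep) for dep in deps]
    # Re-run only if the target is missing or its inputs changed since last run.
    if not target.exists() or state.get(target) != signature:
        action(target, deps)
        state[target] = signature
        log.append(target.name)


def filter_reads(target, deps):
    """Toy step: drop lines starting with '#'."""
    kept = [l for l in deps[0].read_text().splitlines() if not l.startswith("#")]
    target.write_text("\n".join(kept) + "\n")


def count_reads(target, deps):
    """Toy step: write the number of lines in the input."""
    target.write_text(f"{len(deps[0].read_text().splitlines())}\n")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        raw = work / "raw.txt"
        filtered = work / "filtered.txt"
        report = work / "report.txt"
        raw.write_text("ACGT\n#comment\nTTGA\n")
        rules = {
            filtered: ([raw], filter_reads),
            report: ([filtered], count_reads),
        }
        state, log = {}, []
        build(report, rules, state, log)   # first run builds everything
        first = list(log)
        log.clear()
        build(report, rules, state, log)   # nothing changed: cached
        second = list(log)
        log.clear()
        raw.write_text("ACGT\n#comment\nTTGA\nGGCC\n")
        build(report, rules, state, log)   # input changed: rebuild the chain
        third = list(log)
        return first, second, third, report.read_text()


if __name__ == "__main__":
    print(demo())
```

The second call performs no work because no input changed, which is the caching behaviour the table credits to Makefiles. The sketch also shows one listed drawback: the rules are ordinary code fragments, so writing them requires programming experience.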