
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines.

Marcin Cieślik, Cameron Mura.

Abstract

BACKGROUND: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts.
RESULTS: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats).
CONCLUSIONS: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
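The batching trade-off described under RESULTS can be illustrated with a minimal sketch. It uses only the Python standard library and is not PaPy's API: the names `batched_map`, `reverse_complement`, `gc_fraction`, and `run_pipeline` are hypothetical, chosen for illustration. Each stage is a lazy map that pulls one batch from the stage before it, evaluates that batch in parallel on a shared worker pool, and passes the results on. A larger batch size gives more parallelism, while a smaller one keeps fewer items in memory.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched_map(func, items, batch_size, pool):
    """Lazily apply func to items, evaluating batch_size items at a time."""
    iterator = iter(items)
    while True:
        # Pull at most batch_size items from the upstream stage.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # Items within a batch run in parallel; results keep input order.
        yield from pool.map(func, batch)


def reverse_complement(seq):
    """Hypothetical first component: reverse-complement a DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_fraction(seq):
    """Hypothetical second component: fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_pipeline(sequences, batch_size=2, workers=2):
    """Chain the two components into a linear two-stage pipeline."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stage1 = batched_map(reverse_complement, sequences, batch_size, pool)
        stage2 = batched_map(gc_fraction, stage1, batch_size, pool)
        # Consuming the last stage drives the whole chain, batch by batch.
        return list(stage2)


if __name__ == "__main__":
    print(run_pipeline(["ACGT", "GGCC", "AATT"]))  # [0.5, 1.0, 0.0]
```

This sketch covers only a linear chain on local threads. The branching graph topologies and remote compute resources mentioned in the abstract are beyond what it shows.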

Year:  2011        PMID: 21352538      PMCID: PMC3051902          DOI: 10.1186/1471-2105-12-61

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.307


  2 in total

1.  An Introduction to Programming for Bioscientists: A Python-Based Primer.

Authors:  Berk Ekmekci; Charles E McAnany; Cameron Mura
Journal:  PLoS Comput Biol       Date:  2016-06-07       Impact factor: 4.475

2.  Agile parallel bioinformatics workflow management using Pwrake.

Authors:  Hiroyuki Mishima; Kensaku Sasaki; Masahiro Tanaka; Osamu Tatebe; Koh-Ichiro Yoshiura
Journal:  BMC Res Notes       Date:  2011-09-08
