Michael Kotliar, Andrey V Kartashov, Artem Barski.
Abstract
BACKGROUND: Massive growth in the amount of research data and computational analysis has led to increased use of pipeline managers in biomedical computational research. However, each of the >100 such managers describes pipelines in its own way, which makes workflows difficult to port between environments and undermines the reproducibility of computational studies. For this reason, the Common Workflow Language (CWL) was recently introduced as a specification for platform-independent workflow description, and work began to transition existing pipelines and workflow managers to CWL.
Keywords: Airflow; Common Workflow Language; pipeline manager; reproducible data analysis; workflow manager; workflow portability
Year: 2019 PMID: 31321430 PMCID: PMC6639121 DOI: 10.1093/gigascience/giz084
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: CWL-Airflow diagram. The job file contains information about the CWL workflow and inputs. CWL-Airflow creates a CWLDAG-class instance on the basis of the workflow structure and executes it in Airflow. The results are saved to the output folder.
Figure 2: Structure diagram for scaling out CWL-Airflow with a Celery cluster of 4 nodes. Node 1 runs the Airflow database to save task metadata and the Airflow scheduler with the Celery executor to submit tasks for processing to the Airflow Celery workers on Nodes 2, 3, and 4. The Airflow and Flower (Celery) web servers allow for monitoring and controlling of the task execution process. All nodes have shared access to the dags, jobs, temp, and output folders.
Figure 3: Airflow web interface. The DAGs tab shows the list of the available pipelines (a) and their latest execution dates (c) and number of active, succeeded, and failed runs (d) and workflow step statuses (b). The buttons on the right (e) allow a user to control pipeline execution and obtain additional information on the current workflow and its steps.
Figure 4: Dashboard of the Celery monitoring tool Flower. Shown are the 3 Celery workers, their current status, and load information.
Figure 5: Using CWL-Airflow for analysis of ChIP-Seq data. (a) ChIP-Seq data analysis pipeline visualized by Rabix Composer. (b) Drosophila melanogaster embryo histone 3, lysine 4 trimethylation (H3K4me3) ChIP-Seq data (SRR1198790) were processed by our pipeline and CWL-Airflow. The University of California Santa Cruz genome browser view of tag density and peaks at the trx gene is shown. The workflow can be viewed through the Common Workflow Language Viewer permalink: https://w3id.org/cwl/view/git/f28d47bd0911e5e7210c4dc83f75653a1e0297c9/biowardrobe_chipseq_se.cwl. ATDP: Average Tag Density Profile.
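The Figure 1 caption says that a DAG is built from the workflow structure. The sketch below illustrates only that general idea and is not the CWL-Airflow implementation: the `workflow_steps` dictionary, the step names, and both helper functions are hypothetical, and the "step/output" source convention is a simplified stand-in for how CWL steps reference each other. It derives step dependencies and one valid execution order using the Python standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, heavily simplified stand-in for a parsed CWL workflow.
# Each step lists the sources of its inputs: "step/output" refers to the
# output of another step, a bare name refers to a workflow-level input.
workflow_steps = {
    "extract_fastq": ["sra_file"],
    "align_reads": ["extract_fastq/fastq", "genome_index"],
    "sort_bam": ["align_reads/bam"],
    "call_peaks": ["sort_bam/sorted_bam"],
    "make_bigwig": ["sort_bam/sorted_bam"],
}


def step_dependencies(steps):
    """Map each step to the set of steps whose outputs it consumes."""
    return {
        name: {source.split("/")[0] for source in sources if "/" in source}
        for name, sources in steps.items()
    }


def execution_order(steps):
    """Return one valid order in which the steps could be executed."""
    sorter = TopologicalSorter(step_dependencies(steps))
    return list(sorter.static_order())


dependencies = step_dependencies(workflow_steps)
order = execution_order(workflow_steps)
print(order)
```

Steps with no dependency on each other (here `call_peaks` and `make_bigwig`) may appear in either order, which is what allows a scheduler to run them in parallel.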
CWL-Airflow and cwltool mean execution time (mean ± SEM in seconds, n = 3)

| Pipeline | CWL-Airflow: 1 node, 1 workflow at a time | CWL-Airflow: 3 nodes, 3 workflows at a time | cwltool: 1 node, 1 workflow at a time |
|---|---|---|---|
| BioWardrobe ChIP-Seq Workflow | 1,141 ± 18 | 1,231 ± 3 | 955 ± 1 |
| ENCODE ChIP-Seq Mapping Workflow | 3,784 ± 10 | 3,824 ± 28 | 3,245 ± 7 |
ChIP-Seq, chromatin immunoprecipitation sequencing; CWL, common workflow language; SEM, standard error of the mean.
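To put the timings in perspective, the following sketch computes the relative overhead of CWL-Airflow against cwltool from the mean values in the table. The dictionary keys and the `overhead_percent` helper are illustrative names, and the SEM values are ignored, so the result is a rough comparison rather than a statistical test.

```python
# Mean execution times in seconds, copied from the table above.
CWLTOOL = "cwltool, 1 node"
mean_seconds = {
    "BioWardrobe ChIP-Seq Workflow": {
        "CWL-Airflow, 1 node": 1141,
        "CWL-Airflow, 3 nodes": 1231,
        CWLTOOL: 955,
    },
    "ENCODE ChIP-Seq Mapping Workflow": {
        "CWL-Airflow, 1 node": 3784,
        "CWL-Airflow, 3 nodes": 3824,
        CWLTOOL: 3245,
    },
}


def overhead_percent(seconds, baseline_seconds):
    """Extra run time relative to the baseline, as a percentage."""
    return (seconds - baseline_seconds) / baseline_seconds * 100


# Overhead of each CWL-Airflow setup relative to cwltool, per pipeline.
overhead = {}
for pipeline, times in mean_seconds.items():
    baseline = times[CWLTOOL]
    overhead[pipeline] = {
        setup: round(overhead_percent(seconds, baseline), 1)
        for setup, seconds in times.items()
        if setup != CWLTOOL
    }

for pipeline, by_setup in overhead.items():
    for setup, percent in by_setup.items():
        print(f"{pipeline} | {setup}: +{percent}% vs cwltool")
```

On a single node the overhead is about 19.5% for the BioWardrobe workflow and about 16.6% for the ENCODE workflow. The 3-node column reports time per workflow while three workflows are processed at once, so the slightly higher per-workflow time there comes with a higher overall throughput.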
Comparison of the open source workflow managers and engines with existing or planned support for CWL
| Feature | Airflow and CWL-Airflow | Rabix | Toil | Cromwell | REANA | Galaxy | Arvados | CWLEXEC |
|---|---|---|---|---|---|---|---|---|
| Software installation complexity | Single Python package | JAR; Electron application | Single Python package | JAR | Group of Python packages | Group of Python packages | Multiple components for minimum 7 nodes system | JAR |
| License type | Apache License v2.0 | Apache License v2.0 | Apache License v2.0 | BSD-3-Clause | MIT License | Academic Free License v3.0 | Apache License v2.0, AGPL v3.0, CC-BY-SA v3.0 | Apache License v2.0 |
| Workflow description language | CWL v1.0; Python code | CWL v1.0 | CWL v1.0; WDL v1.0; Python code | CWL v1.0; WDL v1.0 | CWL v1.0; Serial; Yadage | XML tool file; JSON workflow file | CWL v1.0 | CWL v1.0 |

The source also lists "node.js application" as an additional installation component; its column is not recoverable from the extracted text.

The remaining rows carry per-tool marks (+, –, ∅) that were images in the source and are not reproduced here.

Rows reported per tool: Docker containerization; Singularity containerization; Cloud/cluster processing; Workflow execution load balancing [1]; Parallel workflow step execution.

Rows reported per tool and per interface (GUI, REST API, CLI): Add new workflow [2]; Set workflow inputs [3]; Start/stop workflow execution; Manage workflow execution process [4]; Get execution results of the specific workflow [5]; View workflow execution logs; View workflow execution history and statistics.
+, present; –, absent; ∅, not applicable; AGPL, Affero General Public License; BSD, Berkeley Software Distribution; CC-BY-SA, Creative Commons Attribution-ShareAlike; CLI, command line interface; GUI, graphical user interface; MIT, Massachusetts Institute of Technology; REST API, representational state transfer application programming interface; WDL, Workflow Description Language.
[1] Assign workflow steps to the different pools and queues; use other resource utilization algorithms provided by the computing environment.
[2] Load the workflow from the file; create the workflow by combining the steps in GUI.
[3] Set the path to the job file; set input values through the GUI or CLI.
[4] Pause/resume workflow execution process; manually restart workflow steps.
[5] Get output file locations by the workflow ID, step ID, execution date, or other identifiers.