Azza E Ahmed1,2,3, Joshua M Allen4, Tajesvi Bhat4,5,6, Prakruthi Burra4,5,7, Christina E Fliege4, Steven N Hart8, Jacob R Heldenbrand4, Matthew E Hudson4,9, Dave Deandre Istanto9, Michael T Kalmbach10, Gregory D Kapraun10, Katherine I Kendig4, Matthew Charles Kendzior4,10, Eric W Klee8, Nate Mattson10, Christian A Ross11, Sami M Sharif12, Ramshankar Venkatakrishnan4, Faisal M Fadlelmola13, Liudmila S Mainzer4,14.
Abstract
The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach to, and a systematic evaluation of, key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, both of which were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, and ease of development, along with adoption and usage in research labs and healthcare settings. This article seeks to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and by the technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations in tools and utilities for other purposes, like big data technologies, interoperability, and provenance.
Year: 2021 PMID: 34737383 PMCID: PMC8569008 DOI: 10.1038/s41598-021-99288-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A WfMS is middleware between the analyst and the computational environment. It encompasses the workflow language specifications to interconnect the analysis executables, and the execution engine to dispatch tasks and manage dependencies on the compute infrastructure.
Summary of language-level differences among Swift/T, Nextflow, CWL and WDL.
| Aspect | Swift/T | Nextflow | CWL | WDL |
|---|---|---|---|---|
| Parent language | C, Tcl | Ruby and Groovy | N/A | N/A |
| Compilation | Compiled | Interpreted | Compiled | Compiled |
| GUIs | – | NextflowWorkbench | Rabix composer | Pipeline Builder |
| DSL features | Complete, extensible in Tcl | Complete, extensible in Groovy and Java | Limited standard library, extensible via JavaScript | Limited standard library |
| Variables | Typed, unique within scope | Qualified, unique within scope | Typed, unique identifiers | Typed, fully qualified names |
| Loops | Sequential for and parallel foreach | Parallel queue channels | Parallel scatter via ScatterFeatureRequirement | Parallel scatter |
| Conditionals | If-else and no-fall-through switch statements | Via when declaration within a process | When and pickValue fields proposed in CWL v1.2 | If blocks producing optional output types |
| Enforcing good practices | – | nf-core | CWL guide | – |
Summary of executor-level differences among Swift/T, Nextflow, CWL and WDL.
| Language | Execution engine | Remarks |
|---|---|---|
| | | Complete WfMS, supporting conditionals, loops and nested logic |
| CWL | | The official reference implementation of an execution engine for the complete CWL standard |
| | arvados† (1.0, 1.1, 1.2) | Most feature-rich CWL runner, albeit with tedious setup |
| | | Optimized for cloud environments, less stable in batch environments (Section Scalability) |
| | cwl airflow† (1.1) | Works with Celery and Kubernetes clusters, not readily with HPC CRMs |
| | REANA† (Documentation missing) | Cloud-optimized platform. For HPC, only CERN Slurm and HTCondor are supported |
| | | Supports CWL workflows via WOM, with comparable performance in both languages (Section Scalability) |
| | cwl-tes (1.0) | Partial implementation at present, with tedious setup. GA4GH TES API compatible |
| | rabix executor (sbg:draft-2, 1.0) | Single node local executor is no longer supported by the original developer team at Seven Bridges |
| WDL | | De facto standard for executing WDL workflows. Support for nested loops is version-dependent |
| | | No support for modularity or nesting of loops and conditionals. Support for batch systems is also rudimentary |
| | miniWDL (draft-2, 1.0) | No cluster or cloud support. Includes Cromwell wrapper |
Any given feature of a workflow language can be assumed supported by the executor, unless we note otherwise. Supported language versions are in parentheses for each executor. Italics indicates engines we thoroughly examined.
†These were listed as production-ready engines on the official CWL website as of July 2021. The rest were listed as partial implementations.
Figure 2. Bioinformatics workflows with multiple levels of complexity warrant a modular construction. It is easiest to program the workflow when its logic is abstracted away (in Tasks, red) from the command line invocations (in Bash scripts, pink) of the bioinformatics tools (light pink). Individual workflows can be further used as subworkflows of a larger Master workflow (e.g., Supplementary Fig. 1). This architecture facilitates expression of additional complexity due to optional modules (dashed line), nested levels of parallelism (groups of arrows connecting red rectangles) and scatter-gather patterns (task 2 scattered across samples being merged into task 3).
Figure 4. DAGs corresponding to a simple workflow of two processes (besides output aggregation) used to assess the scalability of the executors in the “Scalability” section, as generated by the most recent version of each executor or visualization utility of each language in July 2021.
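The scatter-gather pattern named in this caption can be sketched in plain Python, independent of any WfMS. This is only an illustration of the control flow: `align`, `merge` and `scatter_gather` are hypothetical names, and the string manipulation stands in for real tool invocations.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sample: str) -> str:
    """Hypothetical per-sample task (the scattered 'task 2')."""
    return f"{sample}.aligned"


def merge(outputs: list[str]) -> str:
    """Hypothetical gather task ('task 3'): combine all per-sample outputs."""
    return "+".join(sorted(outputs))


def scatter_gather(samples: list[str], max_workers: int = 4) -> str:
    # Scatter: run the per-sample task concurrently, one call per sample.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_sample = list(pool.map(align, samples))
    # Gather: a single downstream task consumes every scattered output.
    return merge(per_sample)


if __name__ == "__main__":
    print(scatter_gather(["s2", "s1", "s3"]))
```

In a WfMS the same shape is expressed declaratively (for example with a scatter block), and the engine, not the script, decides where each scattered task runs.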
Computational testing environments.
| Language | Engine | Biocluster | AWS: Batch mode | AWS: Cloud cluster | AWS: Parallel cluster |
|---|---|---|---|---|---|
| Swift/T | | I, II | – | – | – |
| Nextflow | | I, II | – | I | II |
| CWL | Cromwell | II | – | – | – |
| | Toil | II | – | – | – |
| WDL | Cromwell | I, II | I | – | II |
I and II refer to: Use case I: variant calling pipeline and Use case II: scalability evaluation.
Figure 3. Scaling a one-step (solid line) and two-step (dashed line) workflow in Cromwell+WDL (black) and Nextflow (yellow) on AWS Parallel cluster. The thick green line in the right panel is the theoretical optimum of the number of nodes to be occupied by the tasks, computed as the ceiling of tasks/cores-per-node (96). Empty circles denote failed runs.
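The theoretical optimum described in this caption is a simple ceiling division. A minimal sketch follows, assuming each task occupies exactly one core (an assumption not stated in the caption); the function name `optimal_nodes` is illustrative.

```python
CORES_PER_NODE = 96  # cores-per-node value quoted in the caption


def optimal_nodes(n_tasks: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Return ceil(n_tasks / cores_per_node), assuming one core per task."""
    if n_tasks < 0 or cores_per_node <= 0:
        raise ValueError("n_tasks must be >= 0 and cores_per_node must be > 0")
    # Integer ceiling division avoids floating-point rounding for large counts.
    return -(-n_tasks // cores_per_node)


if __name__ == "__main__":
    for tasks in (1, 96, 97, 1000):
        print(tasks, optimal_nodes(tasks))
```

A run that occupies more nodes than this value for the same number of tasks is leaving cores idle.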
GitHub activities from each WfMS (March 4th, 2021).
| WfMS | First commit | Contributors | Closed | Open | License |
|---|---|---|---|---|---|
| Swift-t | 2011-05-11 | 16 | 109 | 81 | apache-2.0 |
| Nextflow | 2013-03-22 | 81 | 1770 | 159 | apache-2.0 |
| CWL | 2014-09-25 | 62 | 667 | 249 | apache-2.0 |
| WDL | 2012-08-01 | 44 | 376 | 50 | bsd-3-clause |
Contributors is the number of contributors to each repo; Closed and Open refer to the counts of closed and open issues and pull requests in the repo.
The WfMSs examined in this study.
| Language | Engine | Use case I: Variant calling pipeline | Use case II: Scalability evaluation |
|---|---|---|---|
| Swift/T | | GATK3; multi-sample; single-step if needed | – |
| Nextflow | | GATK4; multi-sample | Same repository for these three WfMSs |
| CWLᵃ | Cromwell, Toil† | – | |
| WDLᵃ | Cromwell† | GATK4; single sample | |
†Other engines were limited in portability, conformance to language specification, or setup (Table 2).
‡Repositories with identical code structure, which facilitated comparison of results (Supplementary note 1).
ᵃ Common Workflow Language (CWL); Workflow Description Language (WDL).
| Highlights: Which WfMS to use day-to-day |
| In light of this, a pragmatic approach to workflow choice could be the following: |
| 1. Assess: is there a need to build a new pipeline, or is there an existing, reasonable pipeline in the Nextflow, CWL, or WDL repos? |
| (a) If a workflow exists that follows good coding practices, it should be adopted and modified as per specific needs. |
| (b) If starting fresh, without restrictions by collaborators’ preferences or existing legacy code-base: |
| i. If a quick development cycle is important, Nextflow is optimal. |
| ii. If code readability is important, WDL is optimal. |
| iii. If execution environment is variable, or there is a need to work across heterogeneous hardware environments, CWL is optimal. |
| iv. Table |
| 2. Assess: what execution constraints are in place? |
| (a) For HPC environments, pay particular attention to runners supporting different CRMs. Our recommended free, production-scale runners for these are Cromwell (for both WDL and CWL) and Nextflow (for Nextflow workflows). Toil was less performant in comparison (see the Scalability section). |
| (b) For running in the cloud, pay particular attention to runners with support for different cloud APIs, and features like automatic rescaling, containerization, and security settings. Table |
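
The decision steps in this box can be written as a small lookup, which makes the branching explicit. This is a sketch only: the function name, the priority keys (`quick_development`, `readability`, `portability`) and the returned dictionary shape are assumptions made for illustration, and the cloud branch in 2(b) is omitted because the box gives no concrete runner for it.

```python
from typing import Optional

# Step 1(b): language suggested for each development priority.
LANGUAGE_BY_PRIORITY = {
    "quick_development": "Nextflow",
    "readability": "WDL",
    "portability": "CWL",  # variable or heterogeneous execution environments
}

# Step 2(a): recommended free, production-scale HPC runner per language.
HPC_RUNNER = {"Nextflow": "Nextflow", "WDL": "Cromwell", "CWL": "Cromwell"}


def recommend(existing_good_pipeline: bool,
              priority: Optional[str] = None,
              hpc: bool = False) -> dict:
    """Hypothetical helper mirroring steps 1(a), 1(b) i-iii and 2(a)."""
    if existing_good_pipeline:
        # Step 1(a): reuse beats rewriting.
        return {"action": "adopt and modify the existing workflow"}
    if priority not in LANGUAGE_BY_PRIORITY:
        raise ValueError(f"unknown priority: {priority!r}")
    language = LANGUAGE_BY_PRIORITY[priority]
    result = {"action": "build a new workflow", "language": language}
    if hpc:
        result["runner"] = HPC_RUNNER[language]
    return result


if __name__ == "__main__":
    print(recommend(False, "readability", hpc=True))
```

Collaborator preferences and legacy code, which the box treats as overriding constraints, are deliberately left outside this lookup.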