Farah Zaib Khan, Stian Soiland-Reyes, Richard O Sinnott, Andrew Lonie, Carole Goble, Michael R Crusoe.
Abstract
BACKGROUND: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.
Keywords: BagIt; CWL; Common Workflow Language; RO; Research Object; containers; interoperability; provenance; scientific workflows
Year: 2019 PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Summarized recommendations and justifications from the literature covering best practices on reproducibility, accessibility, interoperability, and portability of workflows
| R No. | Recommendations | Justifications |
|---|---|---|
| R1-parameters | Save and share all parameters used for each software tool executed in a given workflow (including the default values of parameters used). | This affects the reproducibility of results because different inputs and configurations of the software can produce different results. Different versions of a tool might change the default values of its parameters. |
| R2-automate | Avoid manual processing of data; if shims are used, include them as explicit steps of the workflow. | This ensures complete capture of the computational process without broken links, so that the analysis can be executed without manual steps. |
| R3-intermediate | Include intermediate results where possible when publishing an analysis. | Intermediate data products can be used to inspect and understand a shared analysis when re-enactment is not possible. |
| R4-sw-version | Record the exact software versions used. | This is necessary for the reproducibility of results because different software versions can produce different results. |
| R5-data-version | If using public data (reference data, variant databases), store and share the actual data versions used. | Different versions of data, e.g., the human reference genome or variant databases, can produce slightly different results for the same workflow. |
| R6-annotation | Workflows should be well described, annotated, and accompanied by associated metadata. Annotations such as user-contributed tags and versions should be assigned to workflows and shared when publishing the workflows and associated results. | Metadata and annotations improve the understandability of the workflow, facilitate independent reuse by someone skilled in the field, make workflows more accessible, and hence promote the longevity of the workflows. |
| R7-identifier | Use and store stable identifiers for all artefacts, including the workflow, the datasets, and the software components. | Identifiers play an important role in the discovery, citation, and accessibility of resources made available in open-access repositories. |
| R8-environment | Share the details of the computational environment. | Such details support analysis of requirements before any re-enactment or reproduction is attempted. |
| R9-workflow | Share the workflow specifications/descriptions used in the analysis. | The same workflow specification can be used with different datasets, thereby supporting reusability. |
| R10-software | Aggregate the software with the analysis and share it when publishing the analysis. | Making software available reduces dependence on third-party resources and as a result minimizes “workflow decay”. |
| R11-raw-data | Share the raw data used in the analysis. | When someone wants to validate published results, the availability of data supports verification of claims and hence establishes trust in the published analysis. |
| R12-attribution | Store all attributions related to the data resources and software systems used. | Accreditation supports proper citation of the resources used. |
| R13-provenance | Workflows should be preserved along with the provenance trace of the data and results. | A provenance trace provides a historical view of the workflow enactment, enabling end users to better understand the analysis retrospectively. |
| R14-diagram | Data flow diagrams of the computational analysis using workflows should be provided. | These diagrams are easy to understand and provide a human-readable view of the workflow. |
| R15-open-source | Open source licensing for methods, software, code, workflows, and data should be adopted instead of proprietary resources. | This improves the availability and legal reuse of the resources used in the original analysis, whereas restrictive licenses hinder reproducibility. |
| R16-format | Data, code, and all workflow steps should be shared in a format that others can easily understand, preferably in a system-neutral language. | System-neutral languages help achieve interoperability and make an analysis understandable. |
| R17-executable | Promote easy execution of workflows without significant changes to the underlying environment. | In addition to helping reproducibility, this enables adapting the analysis methods to other infrastructures and improves workflow portability. |
| R18-resource-use | Information about compute and storage resources should be stored and shared as part of the workflow. | Such information can help users estimate the resources needed for an analysis and thereby reduce the number of failed executions. |
| R19-example | Example input and sample output data should be preserved and published along with the workflow-based analysis. | This enables more efficient test runs of an analysis to verify and understand the methods used. |
This list is not exhaustive; other studies have identified separate issues (e.g., laboratory work provenance and data security) that are beyond the scope of this work.
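CWL directly supports R1-parameters and R19-example by separating the workflow description from its input bindings: the concrete inputs of one enactment live in a small job file that can be archived and shared alongside the workflow. A minimal sketch of such a job file (the file name, input names, and values are illustrative, not taken from the paper's case studies):

```yaml
# job.yml -- illustrative input bindings for one workflow enactment.
# Sharing this file with the workflow records every parameter used
# (R1-parameters) and doubles as a ready-made test case (R19-example).
reference:
  class: File
  path: data/hg38.fa            # illustrative reference data (cf. R5-data-version)
reads:
  class: File
  path: data/sample1.fastq.gz
threads: 4                      # pin values even when they equal the tool's default
```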
Figure 1: Recommendations from Table 1 classified into these categories.
Figure 2: Levels of provenance and resource sharing and their applications.
Figure 3: Left: A snapshot of part of a GATK workflow described using CWL. Two steps named “bwa-mem” and “samtools-view” are shown; the former links to the tool description executing the underlying tool (BWA-MEM for alignment) and provides the output used as input for samtools. Right: A snapshot of BWA-mem.cwl and the associated Docker requirements for the exact tool version used in the workflow execution.
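The two-step structure described in Figure 3 can be sketched in CWL roughly as follows; the step names match the caption, while the input/output names, the referenced tool files, and the container tag are illustrative assumptions rather than the paper's actual GATK workflow:

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  reference: File
  reads: File
outputs:
  bam:
    type: File
    outputSource: samtools-view/bam
steps:
  bwa-mem:
    run: BWA-mem.cwl        # tool description; pins the container there, e.g.
                            #   hints:
                            #     DockerRequirement:
                            #       dockerPull: biocontainers/bwa:v0.7.17  # illustrative tag
    in:
      reference: reference
      reads: reads
    out: [alignment]
  samtools-view:
    run: samtools-view.cwl
    in:
      input: bwa-mem/alignment   # output of bwa-mem feeds samtools-view
    out: [bam]
```

Because container images and tool versions are pinned inside the tool descriptions rather than in the host environment, the same workflow file supports R4-sw-version and R17-executable on any host with a CWL runner and Docker.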
Figure 4: Core concepts of the PROV Data Model. Adapted from the W3C PROV Model Primer [92].
Figure 5:Schematic representation of the aggregation and links between the components of a given workflow enactment. Layers of execution are separated for clarity. The workflow specification and command line tool specifications are described using CWL. Each individual command line tool specification can optionally interact with Docker to satisfy software dependencies. [A] The RO layer shows the structure of the RO including its content and interactions with different components in the RO and [B] the CWL layer.
Fulfilling recommendations with the CWLProv profile of W3C PROV, extended with RO Model’s wfdesc (prospective provenance) and wfprov (retrospective provenance)
| PROV type | Subtype | Relation | Range | Recommendation |
|---|---|---|---|---|
| Plan | wfdesc:Workflow | wfdesc:hasSubProcess | wfdesc:Process | R9-workflow |
| | wfdesc:Process | | | |
| Activity | wfprov:WorkflowRun | wasAssociatedWith | wfprov:WorkflowEngine | R8-environment |
| | | ↳ hadPlan | wfdesc:Workflow | R9-workflow, R17-executable |
| | | wasStartedBy | wfprov:WorkflowEngine | R8-environment |
| | | ↳ atTime | ISO8601 timestamp | R13-provenance |
| | | wasStartedBy | wfprov:WorkflowRun | R9-workflow |
| | | wasEndedBy | wfprov:WorkflowEngine | R8-environment |
| | | ↳ atTime | ISO8601 timestamp | R13-provenance |
| | wfprov:ProcessRun | wasStartedBy | wfprov:WorkflowRun | R10-software |
| | | ↳ atTime | ISO8601 timestamp | R13-provenance |
| | | used | wfprov:Artifact | R11-raw-data |
| | | ↳ role | wfdesc:InputParameter | R1-parameters |
| | | wasAssociatedWith | wfprov:WorkflowRun | R9-workflow |
| | | ↳ hadPlan | wfdesc:Process | R17-executable, R16-format |
| | | wasEndedBy | wfprov:WorkflowRun | R13-provenance |
| | | ↳ atTime | ISO8601 timestamp | R13-provenance |
| SoftwareAgent | | wasAssociatedWith | wfprov:ProcessRun | R8-environment |
| | | ↳ cwlprov:image | Docker image id | R4-sw-version |
| SoftwareAgent | wfprov:WorkflowEngine | wasStartedBy | Person ORCID | R12-attribution |
| | | label | cwltool | R4-sw-version |
| Entity | wfprov:Artifact | wasGeneratedBy | wfprov:ProcessRun | R3-intermediate, R7-identifier |
| | | ↳ role | wfdesc:OutputParameter | R1-parameters |
| Collection | wfprov:Artifact | hadMember | wfprov:Artifact | R3-intermediate |
| Dictionary | | hadDictionaryMember | wfprov:Artifact | |
| | | ↳ pairKey | filename | R7-identifier |
Indentation with ↳ indicates n-ary relationships, which are expressed differently depending on PROV syntax. Namespaces: http://www.w3.org/ns/prov# (default), http://purl.org/wf4ever/wfdesc# (wfdesc), http://purl.org/wf4ever/wfprov# (wfprov), https://w3id.org/cwl/prov# (cwlprov).
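Read together, the relations in the table chain one step execution back to its plan, inputs, outputs, and agents. The following is a schematic YAML rendering of that chain for a single process run; all identifiers are invented for illustration, and this is not one of CWLProv's actual PROV serialization formats:

```yaml
# Schematic only -- identifiers are illustrative, not a real serialization.
ex:run-bwa-mem:                       # wfprov:ProcessRun (an Activity)
  wasStartedBy: ex:workflow-run       # enclosing wfprov:WorkflowRun (R10-software)
  wasAssociatedWith:
    agent: ex:workflow-run
    hadPlan: ex:step-bwa-mem          # wfdesc:Process (R16-format, R17-executable)
  used:
    entity: ex:artifact-reads         # wfprov:Artifact (R11-raw-data)
    role: ex:step-bwa-mem/reads       # wfdesc:InputParameter (R1-parameters)
  wasEndedBy:
    agent: ex:workflow-run
    atTime: "2019-07-01T12:00:00Z"    # ISO8601 timestamp (R13-provenance)
ex:artifact-alignment:                # wfprov:Artifact (an Entity)
  wasGeneratedBy: ex:run-bwa-mem      # R3-intermediate, R7-identifier
  role: ex:step-bwa-mem/alignment     # wfdesc:OutputParameter
```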
Figure 6: High-level process flow representation of retrospective provenance capture.
Recommendations and provenance levels implemented in CWLProv
| Recommendation | Level 0 | Level 1 | Level 2 | Level 3 | Methods |
|---|---|---|---|---|---|
| R1-parameters | • | • | | | CWL, BP |
| R2-automate | • | | | | CWL, Docker |
| R3-intermediate | • | | | | PROV, RO |
| R4-sw-version | • | • | | | CWL, Docker, PROV |
| R5-data-version | • | • | | | CWL, BP |
| R6-annotation | • | * | | | CWL, RO, BP |
| R7-described | • | | | | CWL, RO |
| R7-identifier | • | • | • | | RO, CWLProv |
| R8-environment | * | * | | | GFD.204 |
| R9-workflow | • | • | • | | CWL, wfdesc |
| R10-software | • | • | | | CWL, Docker |
| R11-raw-data | • | • | | | CWLProv, BP |
| R12-attribution | • | | | | RO, CWL, BP |
| R13-provenance | • | • | | | PROV, RO |
| R14-diagram | ▓ | * | | | CWL, RO |
| R15-open-source | • | | | | CWL, BP |
| R16-format | • | • | | | CWL, BP |
| R17-executable | ▓ | • | | | CWL, Docker |
| R18-resource-use | * | * | | | CWL, GFD.204 |
| R19-example | * | ▓ | | | RO, BP |
BP: best practices need to be followed manually; CWL: Common Workflow Language and embedded annotations; CWLProv: additional attributes in PROV; PROV: W3C Provenance model; RO: RO model and BagIt; wfdesc: prospective provenance in PROV.
• Implemented.
▓ Partially implemented.
* Implementation planned/ongoing.
Figure 7: Portion of an RNA-seq workflow generated by CWL viewer [129].
Figure 8: Alignment workflow representation generated by CWL viewer.
Figure 9: Visual representation of the bcbio somatic variant calling workflow (adapted from [143]). The subworkflow images are generated by CWL viewer.
CWLProv evaluation summary and status for the 3 bioinformatics case studies
| Enactment producing the RO | Re-enactment using the RO | Status |
|---|---|---|
| cwltool on MacOS | toil-cwl-runner on MacOS | ✓ |
| cwltool on MacOS | cwltool on Linux | ✓ |
| cwltool on MacOS | toil-cwl-runner on Linux | ✓ |
| cwltool on Linux | toil-cwl-runner on Linux | ✓ |
| cwltool on Linux | cwltool on MacOS | ✓ |
| cwltool on Linux | toil-cwl-runner on MacOS | ✓ |
Runtime comparison for the workflow enactments done cross-executor and cross-platform
| Workflow | Linux: cwltool (with prov) | Linux: cwltool (without prov) | Linux: toil-cwl-runner (without prov) | MacOS: cwltool (with prov) | MacOS: cwltool (without prov) | MacOS: toil-cwl-runner (without prov) |
|---|---|---|---|---|---|---|
| RNA-Seq Analysis Workflow | 4m30.289s | 4m0.139s | 3m46.817s | 3m33.306s | 3m41.166s | 3m30.406s |
| Alignment Workflow | 28m23.792s | 24m12.404s | 15m3.539s | – | 162m35.111s | 146m27.592s |
| Somatic Variant Calling Workflow | 21m25.868s | 19m27.519s | 7m10.470s | 17m26.722s | 17m0.227s | ** |
** This could not be tested because of a Docker mount issue on MacOS: https://github.com/DataBiosphere/toil/issues/2680.
– This could not be tested because of insufficient hardware resources on the MacOS test machine; hence, step I of the evaluation activity could not be performed for this workflow.