Anna-Lena Lamprecht, Magnus Palmblad, Jon Ison, Veit Schwämmle, Mohammad Sadnan Al Manir, Ilkay Altintas, Christopher J O Baker, Ammar Ben Hadj Amor, Salvador Capella-Gutierrez, Paulos Charonyktakis, Michael R Crusoe, Yolanda Gil, Carole Goble, Timothy J Griffin, Paul Groth, Hans Ienasescu, Pratik Jagtap, Matúš Kalaš, Vedran Kasalica, Alireza Khanteymoori, Tobias Kuhn, Hailiang Mei, Hervé Ménager, Steffen Möller, Robin A Richardson, Vincent Robert, Stian Soiland-Reyes, Robert Stevens, Szoke Szaniszlo, Suzan Verberne, Aswin Verhoeven, Katherine Wolstencroft.
Abstract
Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have brought the long-standing vision of automated workflow composition back into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the "big picture" of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation.
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.
Keywords: automated workflow composition; bioinformatics; computational pipelines; life sciences; scientific workflows; semantic domain modelling; workflow benchmarking
Year: 2021 PMID: 34804501 PMCID: PMC8573700 DOI: 10.12688/f1000research.54159.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Scientific workflow life cycle.
Relevant information for static analysis of workflows.
| Technical parameters | Domain-specific considerations | Community influence |
|---|---|---|
| Basic tool information (such as license, version, recent updates); (theoretical) compatibility of tools based on their functional annotation; (practical) compatibility of tools based on their use in existing workflows; tool statistics like number of runs, number of users, speed, reliability, etc.; service monitoring information about availability, uptime, downtime, runtime, etc.; number of shims (format converters) needed in the workflow; data format flexibility (generic vs. tailored); data-flow properties (such as live and dead variables) of the workflow; control-flow properties (such as cyclomatic complexity, parallelization potential); tool and workflow FAIRness metrics | Subject-specific unique or essential features of tools that the workflow needs, and relation to a typical concept map in the domain; establishment of tools (known quality metrics, well-understood configuration); usage in commonly used workflows and known compatibilities by actual usage; novelty of tools (new functionalities, potential for novel results, adaptation to new data types); similarity to existing concrete or abstract workflows, workflow motifs; type and format of produced results; potential for direct comparison with output from other workflows; availability of common quality control, benchmarks and benchmarking data | Reception in the domain literature (citations, altmetrics, praise and criticism); reputation, someone or something being “famous”; trends, currently popular technologies; user comments and ratings (reflecting, e.g., adequacy, understandability, usability); trust in developers and/or providers |
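Some of the technical parameters in the table above, such as shim counts and control-flow properties, can be computed statically from a workflow's step graph. A minimal sketch, assuming a toy workflow represented as plain node and edge lists (all step names, function names, and the shim set here are illustrative, not from the article):

```python
# Hypothetical static analysis of a toy workflow DAG: two of the metrics
# named in the table (cyclomatic complexity, number of shims).

def cyclomatic_complexity(nodes, edges, components=1):
    """McCabe's metric E - N + 2P over the workflow's control-flow graph."""
    return len(edges) - len(nodes) + 2 * components

def count_shims(steps, shim_names):
    """Count steps that are format converters (shims) rather than analysis tools."""
    return sum(1 for s in steps if s in shim_names)

# Toy workflow: search -> convert -> (stats | plot) -> report
steps = ["search", "convert", "stats", "plot", "report"]
edges = [("search", "convert"), ("convert", "stats"),
         ("convert", "plot"), ("stats", "report"), ("plot", "report")]

print(cyclomatic_complexity(steps, edges))          # 5 - 5 + 2 = 2
print(count_shims(steps, shim_names={"convert"}))   # 1
```

Metrics like these require no execution of the workflow, which is what makes them candidates for ranking automatically composed alternatives before any of them is run.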
Figure 2. Workflow personas and use cases.
Demands for configurability are ordered from low (top) to total (bottom). All participants need a publication-ready description of the provenance of their findings for perfect reproducibility. Not on the list is Rob the Routinier, who keeps doing his stuff just the way he did it for the last ten years.
Future work on foundations.
| Action | Examples |
|---|---|
| Clarification of usage scenarios | Collect and explicate concrete user stories and scenarios, including personas (“as a <role> I want to <capability> so I can <do x>”). Elicit requirements, prioritize using the MoSCoW method. |
| Definition of lacking standards | Universal identifier for workflows, IDs for code and tools. Format to formally represent parameter sets in a general way. |
| Development of lacking methods | Systematic collection and analysis of tool usage data (for funding, sustainability, benchmarking). Alignment and similarity measures between workflows, together with methods for comparing abstract and concrete workflows. |
| Exploration of new ideas | “Knowledge acquisition by stealth” for scaling up tool annotations and provenance trace collection. The use of workflow provenance traces for heuristic improvements of automated composition. Methods for automated workflow benchmarking, possibly reusing approaches from machine learning (AutoML). |
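One of the lacking methods named above is a similarity measure between workflows. A minimal sketch of one possible measure, Jaccard similarity over the sets of tools two workflows use (the measure choice and the example tool lists are illustrative assumptions, not from the article):

```python
# Hypothetical workflow similarity: Jaccard index over tool sets.

def jaccard_similarity(tools_a, tools_b):
    """|A ∩ B| / |A ∪ B| over the tool sets of two workflows."""
    a, b = set(tools_a), set(tools_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Two toy proteomics workflows sharing two of three tools.
wf1 = ["comet", "percolator", "msstats"]
wf2 = ["msgf+", "percolator", "msstats"]
print(round(jaccard_similarity(wf1, wf2), 2))  # 0.5
```

A set-based measure like this ignores step order and data flow, so comparing abstract with concrete workflows, as the table calls for, would need alignment-aware extensions on top of it.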
Future work on tooling and infrastructure.
| Action | Examples |
|---|---|
| Provide missing functionality | Enrich bio.tools entries with additional information, e.g. annotation quality, user ratings, automated composition readiness. Enrich bio.tools with additional functionality, e.g. for finding similar tools, collection of user experience information. |
| Increase compatibility and interoperability | Support automated workflow composition in/to general workflow specification languages such as CWL. Exchange valid/benchmarkable workflows in common format (e.g. RO-Crate). Capture parameter settings as standardized items. Automatic conversion of data set metadata (data repositories) into correctly applied workflow parameters. Maintain dedicated libraries of helper services (shims). Improve information exchange between systems, e.g. OpenEBench, bio.tools, Conda and WorkflowHub. |
| Improve usability and convenience | Improve the ease and regularity of updating ontologies such as EDAM. Integrate tool recommendation, workflow exploration and user feedback features in WfMS (e.g. Galaxy), workflow repositories and registries (e.g. WorkflowHub). Use registry and workflow engine usage data for training recommendation systems. Collect tool usage data (anonymous, public) and workflow usage data (anonymous, public). Create infrastructure for (automated) workflow integration testing (in silico generated data and community-maintained test data). Support open-source community health checks (e.g. …). |
Future work on community.
| Action | Examples |
|---|---|
| Community building | Use hackathons to bring the community together, e.g. propose topics in established Hackathons (e.g. BioHackathon Japan, European Biohackathon 2020). Establish a regular dedicated hackathon on the theme of automated workflow composition. Identify opportunities to train researchers to use resources and participate in the community efforts. Identify “hot topics” and forums for community mobilisation, e.g. collecting abstract workflows for an instructive “picture book of bioinformatics”. |
| Community development | Survey stakeholder needs, including industry, publishers (e.g. Gigascience), data repositories, frameworks (e.g. bioconductor, Linux distributions) etc. Leverage ELIXIR to drive the technical & political consolidation. Establish an ELIXIR Focus Group on automated workflow composition. |
Future work on applications.
| Action | Examples |
|---|---|
| Annotation of tools | Map command lines to individual tool functions. Organize available and possible shims. Annotate possible format transformations. |
| Automated composition of workflows | Compare the benefits of alternative methods for automated composition (exploration, recommendation, elaboration) on concrete examples. (Try to) reproduce a workflow found in a paper using literature mining and automated workflow composition. |
| Benchmarking of workflows | Explore the value of automated workflow composition in combination with systematic benchmarking (“Great Bake-Off”). Work towards a fully benchmarked set of >10 automatically composed proteomics workflows as a demonstrator. Collect data sets with ground truths and benchmark metrics in the omics domains. |
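Benchmarking against data sets with ground truths, as the last row suggests, typically reduces to comparing a workflow's reported results with the known answers. A minimal sketch using precision and recall (the identifier values and function names are illustrative assumptions, not from the article):

```python
# Hypothetical benchmark of a workflow's output against a ground-truth set.

def precision_recall(predicted, truth):
    """Precision and recall of a workflow's reported identifications."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Toy ground truth and one workflow's output (e.g. protein identifications).
truth = {"P12345", "P67890", "Q11111", "Q22222"}
workflow_output = {"P12345", "P67890", "Q99999"}
p, r = precision_recall(workflow_output, truth)
print(p, r)
```

Running such a metric over many automatically composed candidate workflows on the same reference data is what would turn the envisioned “Great Bake-Off” into a ranked, reproducible comparison.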