
aCLImatise: Automated generation of tool definitions for bioinformatics workflows.

Michael Milton, Natalie Thorne.

Abstract

SUMMARY: aCLImatise is a utility for automatically generating tool definitions compatible with bioinformatics workflow languages, by parsing command-line help output. aCLImatise also has an associated database called the aCLImatise Base Camp, which provides thousands of pre-computed tool definitions.
AVAILABILITY AND IMPLEMENTATION: The latest aCLImatise source code is available within a GitHub organisation, under the GPL-3.0 license: https://github.com/aCLImatise. In particular, documentation for the aCLImatise Python package is available at https://aclimatise.github.io/CliHelpParser/, and the aCLImatise Base Camp is available at https://aclimatise.github.io/BaseCamp/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press.

Year:  2020        PMID: 33325479      PMCID: PMC8016486          DOI: 10.1093/bioinformatics/btaa1033

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


Bioinformatics workflow languages are domain-specific languages that aim to simplify the process of writing workflows for bioinformatics analysis (Larsonneur et al.). Four popular workflow languages in bioinformatics are Nextflow, Snakemake, Workflow Definition Language (WDL) and Common Workflow Language (CWL) (Bedő, 2019). In each of these languages the author defines ‘tool definitions’ (variously referred to as ‘command line tool descriptions’, ‘task definitions’, ‘process definitions’ or ‘wrappers’), which are then composed using a separate ‘workflow’ definition (Chapman et al.; Di Tommaso et al., 2017; Köster and Rahmann, 2012). Tool definitions describe the interface to a piece of software, generally a command-line interface, including all of its inputs, outputs and execution requirements. While workflow definitions must be customized according to the use-case, tool definitions simply describe a piece of software, and are therefore not coupled to a single workflow or context (Chapman et al.).
For this reason, it is common to collect tool definitions in online tool repositories that can be used by workflow designers, reducing the work involved in constructing a workflow. Such repositories exist for WDL (https://github.com/biowdl/tasks), Snakemake (https://snakemake-wrappers.readthedocs.io/en/stable/), Nextflow (https://github.com/nf-core/modules) and CWL (https://github.com/common-workflow-library/bio-cwl-tools), while registries such as Dockstore cater to multiple workflow languages simultaneously (O’Connor et al., 2017).
Despite these initiatives, most tool repositories are incomplete or out-of-date. Maintaining up-to-date tool definitions would require frequent updates to describe new software and accommodate updates to existing software, which is not feasible to perform manually. However, some automated techniques have been developed for generating these tool definitions, most notably argparse2tool (https://github.com/hexylena/argparse2tool).
This approach has shown further promise when enhanced with metadata from the bio.tools registry, but as argparse2tool is only compatible with software written in the Python language, it does not provide a general solution to this problem (Hillion et al., 2017).
Fortunately, all command-line software provides documentation in the form of the help output. This is the output that is generally printed by an application when invoked using the --help flag, as encouraged by Stallman (2015). Furthermore, this help is generally kept up-to-date, as it is the first point of reference for most users of the software, and in many cases is generated automatically by the argument parsing library (e.g. the argparse library for Python; https://docs.python.org/3/library/argparse.html). In addition, help output often follows a semi-formalised series of conventions, most notably the POSIX Utility Convention (IEEE, 2018) and more rigorously the docopt language (http://docopt.org/), making it a viable target for automated parsing.
aCLImatise is a new contribution to the bioinformatics workflow ecosystem designed to streamline the creation of new portable workflows by providing automatically generated tool definitions for any tool with a conventional command-line interface. aCLImatise is itself a command-line application written in the Python programming language. To produce a tool definition, aCLImatise first executes the command of interest by trying a variety of help flags and storing the standard output from each. The resulting help text is then parsed with a Parsing Expression Grammar (PEG) defined using the powerful PyParsing library (McGuire, 2007). This parsing process and the internal data format are briefly illustrated in Figure 1. Finally, the best intermediate data model is output as a YAML data structure, or translated into workflow formats such as CWL or WDL.
Workflow designers are able to download the Python package from PyPI, and then run aCLImatise on the software they intend to use in their workflow. To evaluate this approach, a detailed comparison between a tool definition generated automatically by aCLImatise and a manually authored tool definition is available in Supplementary Appendix SA.
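The probe-and-parse procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not aCLImatise's implementation: the list of help flags, the regular expression (used here in place of the PyParsing grammar that the tool actually relies on), the "most flags wins" rule for choosing the best result, and the function names are all hypothetical.

```python
import re
import subprocess

# Hypothetical candidate help flags; the paper does not enumerate them.
HELP_FLAGS = ["--help", "-h"]

# Simplified pattern for lines such as "  -o, --output FILE  description".
# A regular expression stands in for the PEG grammar used by the real tool.
FLAG_LINE = re.compile(
    r"^\s+(?P<names>-{1,2}[\w-]+(?:,\s*-{1,2}[\w-]+)*)"
    r"(?:[ =](?P<arg>[A-Z][A-Z_]*))?"
    r"\s{2,}(?P<desc>\S.*)$"
)


def parse_flags(help_text):
    """Extract flag names, argument placeholder and description per line."""
    flags = []
    for line in help_text.splitlines():
        match = FLAG_LINE.match(line)
        if match:
            flags.append({
                "names": re.split(r",\s*", match.group("names")),
                "arg": match.group("arg"),  # None for boolean switches
                "description": match.group("desc").strip(),
            })
    return flags


def best_definition(command):
    """Try each help flag, parse its standard output, keep the richest parse.

    'Richest' is assumed here to mean the parse with the most flags.
    """
    best = []
    for flag in HELP_FLAGS:
        try:
            result = subprocess.run(
                command + [flag],
                capture_output=True,
                text=True,
                timeout=10,
                stdin=subprocess.DEVNULL,
            )
        except (OSError, subprocess.TimeoutExpired):
            continue  # command missing or unresponsive for this flag
        parsed = parse_flags(result.stdout)
        if len(parsed) > len(best):
            best = parsed
    return best
```

For example, `best_definition(["samtools", "dict"])` would return one dictionary per recognised flag, provided `samtools` is installed. Help text that departs from this simple two-column layout would need a richer grammar than the single pattern used here.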
Fig. 1.

The help text produced by the command-line tool SAMtools dict (surrounded by a dotted line), annotated with a subset of the aCLImatise object model (the four surrounding boxes), illustrating how the source text is mapped into Python classes via the parser (coloured arrows). SAMtools dict is a subcommand of the popular SAMtools suite of bioinformatics utilities (Li et al., 2009). The object model is an adapted version of a Unified Modelling Language (UML) object diagram (Rumbaugh et al.), where each box represents the instance of an internal class, and each arrow represents an association between objects that is navigable in the direction of the arrow. The coloured arrows indicate an association with a String or array of Strings originally sourced from the help text. The Command class represents an entire command-line tool or subcommand, which has many inputs: Positionals (aka arguments) and Flags (aka options), each of which has an argument specification, such as SimpleFlagArg.
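The object model in the figure can be approximated with a few Python dataclasses. The class names Command, Positional, Flag and SimpleFlagArg come from the caption; the field names, the `to_cwl_like` helper and the sample values are hypothetical, and the emitted dictionary is only a simplified CWL-style structure rather than the output aCLImatise itself produces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleFlagArg:
    """Argument specification for a flag taking one named value (e.g. FILE)."""
    name: str


@dataclass
class Flag:
    """A named option; field names are assumptions for this sketch."""
    synonyms: List[str]
    description: str
    args: Optional[SimpleFlagArg] = None  # None models a boolean switch


@dataclass
class Positional:
    """A positional argument identified by its order on the command line."""
    name: str
    description: str
    position: int


@dataclass
class Command:
    """An entire command-line tool or subcommand with all of its inputs."""
    command: List[str]
    positional: List[Positional] = field(default_factory=list)
    named: List[Flag] = field(default_factory=list)


def to_cwl_like(cmd: Command) -> dict:
    """Translate the model into a simplified CWL-style dictionary."""
    inputs = {}
    for pos in cmd.positional:
        inputs[pos.name] = {
            "type": "string",
            "doc": pos.description,
            "inputBinding": {"position": pos.position},
        }
    for flag in cmd.named:
        # Use the longest synonym, without leading dashes, as the input id.
        longest = max(flag.synonyms, key=len)
        input_id = longest.lstrip("-").replace("-", "_")
        inputs[input_id] = {
            "type": "boolean?" if flag.args is None else "string?",
            "doc": flag.description,
            "inputBinding": {"prefix": longest},
        }
    return {
        "cwlVersion": "v1.1",
        "class": "CommandLineTool",
        "baseCommand": cmd.command,
        "inputs": inputs,
        "outputs": {},
    }


# Illustrative values only, loosely modelled on a SAMtools dict style command.
example = Command(
    command=["samtools", "dict"],
    positional=[Positional("fasta", "input FASTA file", 0)],
    named=[
        Flag(["-o", "--output"], "output file name", SimpleFlagArg("FILE")),
        Flag(["-H", "--no-header"], "do not print the header line"),
    ],
)
tool = to_cwl_like(example)
```

Keeping the parsed model separate from any one output format is what would allow the same Command instance to be rendered as YAML, CWL or WDL by swapping the final translation step.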

To simplify the writing of workflows even further, a large database of approximately 20 000 tool definitions called the aCLImatise Base Camp has been generated by running aCLImatise on the Bioconda database of bioinformatics software (Grüning et al., 2018), facilitated by BioContainers Docker images (da Veiga Leprevost et al., 2017). This database will be periodically and automatically regenerated from the latest version of Bioconda, limiting the need for manual curation and reducing the risk of outdated tool definitions being used in workflows. It is intended that workflow authors first refer to the Base Camp for up-to-date tool definitions, and only resort to installing aCLImatise when using a tool that is not available in Bioconda.
There are numerous potential directions for aCLImatise in the future. Firstly, we hope to expand the parser to support more unusual help formats that depart further from help text conventions.
Secondly, there is the potential to add manual curation to the Base Camp database, allowing authors to refine the generated tool definitions and provide simple test suites. Finally, some effort has already been made to expand the number of supported workflow languages beyond the initial two. We envisage that Galaxy (Afgan et al., 2018), Nextflow and Snakemake could be supported in the future.

Funding

This work was supported by the State Government of Victoria and the 10 member organisations of the Melbourne Genomics Health Alliance.
Conflict of Interest: none declared.
References

1.  Nextflow enables reproducible computational workflows.

Authors:  Paolo Di Tommaso; Maria Chatzou; Evan W Floden; Pablo Prieto Barja; Emilio Palumbo; Cedric Notredame
Journal:  Nat Biotechnol       Date:  2017-04-11       Impact factor: 54.908

2.  Bioconda: sustainable and comprehensive software distribution for the life sciences.

Authors:  Björn Grüning; Ryan Dale; Andreas Sjödin; Brad A Chapman; Jillian Rowe; Christopher H Tomkins-Tinch; Renan Valieris; Johannes Köster
Journal:  Nat Methods       Date:  2018-07       Impact factor: 28.547

3.  Snakemake--a scalable bioinformatics workflow engine.

Authors:  Johannes Köster; Sven Rahmann
Journal:  Bioinformatics       Date:  2012-08-20       Impact factor: 6.937

4.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

5.  Using bio.tools to generate and annotate workbench tool descriptions.

Authors:  Kenzo-Hugo Hillion; Ivan Kuzmin; Anton Khodak; Eric Rasche; Michael Crusoe; Hedi Peterson; Jon Ison; Hervé Ménager
Journal:  F1000Res       Date:  2017-11-30

6.  The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows.

Authors:  Brian D O'Connor; Denis Yuen; Vincent Chung; Andrew G Duncan; Xiang Kun Liu; Janice Patricia; Benedict Paten; Lincoln Stein; Vincent Ferretti
Journal:  F1000Res       Date:  2017-01-18

7.  The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.

Authors:  Enis Afgan; Dannon Baker; Bérénice Batut; Marius van den Beek; Dave Bouvier; Martin Cech; John Chilton; Dave Clements; Nate Coraor; Björn A Grüning; Aysam Guerler; Jennifer Hillman-Jackson; Saskia Hiltemann; Vahid Jalili; Helena Rasche; Nicola Soranzo; Jeremy Goecks; James Taylor; Anton Nekrutenko; Daniel Blankenberg
Journal:  Nucleic Acids Res       Date:  2018-07-02       Impact factor: 16.971

8.  BioShake: a Haskell EDSL for bioinformatics workflows.

Authors:  Justin Bedő
Journal:  PeerJ       Date:  2019-07-09       Impact factor: 2.984

9.  BioContainers: an open-source and community-driven framework for software standardization.

Authors:  Felipe da Veiga Leprevost; Björn A Grüning; Saulo Alves Aflitos; Hannes L Röst; Julian Uszkoreit; Harald Barsnes; Marc Vaudel; Pablo Moreno; Laurent Gatto; Jonas Weber; Mingze Bai; Rafael C Jimenez; Timo Sachsenberg; Julianus Pfeuffer; Roberto Vera Alvarez; Johannes Griss; Alexey I Nesvizhskii; Yasset Perez-Riverol
Journal:  Bioinformatics       Date:  2017-08-15       Impact factor: 6.937
