
Improving data workflow systems with cloud services and use of open data for bioinformatics research.

Md Rezaul Karim¹, Audrey Michel², Achille Zappa³, Pavel Baranov², Ratnesh Sahay¹, Dietrich Rebholz-Schuhmann³.

Abstract

Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high-throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.

Year: 2018 PMID: 28419324 PMCID: PMC6169675 DOI: 10.1093/bib/bbx039

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

Scientific workflow systems (SWFSs) efficiently support the analysis of large-scale data in transcriptome data analysis, medical genomics, bioimage informatics, drug discovery and proteomics, often using cloud infrastructures and related services [i.e. Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS)]. The workflow systems enable researchers to perform their in silico experiments as a follow-up to their classical experiments in the laboratory, hence enabling the researcher to act as a data scientist without having to become either a software developer or a scripting language expert [1]. Owing to the data-intensive nature of bioinformatics research, SWFSs are nowadays transforming into data workflow systems (DWFSs) that have to cope with the data deluge resulting from the numerous bioinformatics projects in general and the human genome projects in particular (or other data types, e.g. imaging). In addition, the transformation of the numerical data into meaningful information based on fact repositories, such as UniProtKB, and semantic sources, such as the Gene Ontology, puts additional requirements on the DWFS to enable efficient drug discovery and translational medicine based on experimental and numerical data [2]. Workflow technologies were introduced for the optimization of business processes, and specific languages [3] in combination with Web services are used to achieve flow control [4]. Subsequently, workflow systems were adapted for scientific computations (i.e. SWFS), but not necessarily for large-scale data analytics or the integration of semantic technologies. In particular, complex analyses are solved through combinations of modules [5-7], and data-intensive scientific analyses have been optimized for parallel and distributed computing infrastructures, anticipating cloud-based services for large-scale data analytics. The integration of data from public fact repositories, e.g. Semantic Web data, is yet another important step, which should enable the sharing of data and the analytics pipelines across research teams, domains and geographic locations.

Bioinformatics research based on experimental and conceptual data with DWFS

Here, we distinguish observational data (i.e. experimental data) from conceptual or symbolic data (also known as ‘semantic data’), often represented with Semantic Web technologies. The latter comprises not only concepts and labels, e.g. from ontologies, but also axioms or facts in knowledge bases (KBs), and is used to add meaning to experimental data for human consumption but also to track the provenance of findings. Both types of data are increasingly analyzed in a joint approach in bioinformatics research and thus lead to innovative contributions to core bioinformatics research as well as drug discovery and translational medicine. The human genome is composed of 3.2 billion base pairs, resulting in ∼200 GB of whole-genome sequencing data. At a larger scale, the experimental data of several individuals or the analysis of the full genome of several cells leads to terabytes of data, which should preferably be delivered to and analyzed in a central repository, ideally with tools such as DWFSs that extract useful information from massive amounts of data [8, 9]. This is in contrast to shuffling the data within a computing cluster or shipping it between different computing centers [9], which would unnecessarily extend the time needed for the analysis because of limits in bandwidth, especially in infrastructure-poor environments. Similar computational challenges for large-scale data analytics (i.e. on experimental data), which have been solved with a DWFS, cover a wide range of problems and approaches, which include, for example, large-scale next-generation sequencing (NGS) [9-11], gene-expression profiling [12], peptide and protein identification [13], the analysis of single-nucleotide polymorphisms (SNPs), phenotype association studies [14] and copy number variation (CNV) analysis [15]. The NGS platforms and their expected throughputs, error types and rates have been summarized in [16]. 
Bioinformatics for drug discovery research analyzes the properties of lead compounds and the drug–target interactions for optimal drug activities as well as reduced side effects through optimal selectivity. This research leads to new domains such as pharmacogenomics, which combines pharmacology and genomics to identify how the genotype affects a person’s response to a drug [17]. Specialized DWFS can play an important role in the productivity of such domains [18-22], which have had considerable success in developing effective and safe medications tailored to a person’s genetic conditions. Bioinformatics research for drug discovery combines different kinds of data including semantic data to identify inhibitors of a receptor, to find novel drugs affecting specific pathways [23] and to conduct cheminformatics analyses for pharmacogenomics research [24]. Biomedical approaches comprise protocol-based medical treatment [25] and neuroimaging data analysis [26, 27] among others.
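
The figure of about 200 GB per whole genome quoted above can be sanity-checked with a back-of-the-envelope calculation. The following is a minimal sketch only: the 30x sequencing depth, the two bytes per sequenced base and the 100 Mbit/s link are illustrative assumptions that do not come from the article, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate of raw whole-genome sequencing data volume.
GENOME_BASE_PAIRS = 3.2e9  # human genome size, as stated in the text
COVERAGE = 30              # ASSUMPTION: 30x sequencing depth (not from the article)
BYTES_PER_BASE = 2         # ASSUMPTION: one byte for the base call, one for its quality score


def raw_sequencing_bytes(genome_bp, coverage, bytes_per_base):
    """Approximate uncompressed read data in bytes (read headers ignored)."""
    return genome_bp * coverage * bytes_per_base


def transfer_hours(num_bytes, megabits_per_second):
    """Hours needed to move num_bytes over a link of the given speed."""
    seconds = (num_bytes * 8) / (megabits_per_second * 1e6)
    return seconds / 3600


size = raw_sequencing_bytes(GENOME_BASE_PAIRS, COVERAGE, BYTES_PER_BASE)
print(f"approx. {size / 1e9:.0f} GB per genome")
print(f"approx. {transfer_hours(size, 100):.1f} h to transfer at 100 Mbit/s")  # ASSUMPTION: link speed
```

Under these assumptions a single genome already takes hours to move over a modest link, which illustrates why the text argues for analyzing the data where it is hosted rather than shipping it between computing centers.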

DWFS for analyzing large-scale data for bioinformatics research

The DWFS provides data analysis components and an interactive working environment with a number of advantages: automation of workflows through scripting and batch processing, real-time data processing, efficient interpretation of results through data visualization and integration, along with the automated update of newly available or modified analytical results [28]. Thus, experts from heterogeneous backgrounds without special IT skills can still use the systems efficiently as a shared platform for data processing [28, 29]. Ultimately, they can publish and share their workflows over the Web, thereby increasing research collaboration and scientific openness, as well as scientific reproducibility and reusability, supported by data provenance across workflows for error backtracking and resolution. Altogether, the researcher faces the challenging task of identifying the most suitable workflow solutions, and therefore, our review will give an overview of available tools. It will assess the requirements for biomedical large-scale data (i.e. large-scale genome sequencing) and semantics-driven solutions (i.e. for drug discovery). Core questions of the analysis (Table 1) are concerned with large-scale data analysis in the cloud infrastructure, benefits from Semantic Web technologies, reproducibility of results, Web-based approaches and next-generation workflow systems. Our investigations will focus on large-scale genomics data analysis and drug discovery as the two contrasting core parts of bioinformatics research. In addition, the Appendix describes the review methodology and the filtering of the reference literature.

Table 1

Questions that arise for the DWFS for large-scale data analytics for bioinformatics research

Question	Objective	Do DWFSs reach the state of the art?	How important is the answer?
Q1	Do the current solutions enable large-scale data analysis in a cloud environment?	Yes	Important and need some special care too, for large-scale data analytics using DWFSs
Q2	Do existing solutions align well with the Semantic Web technologies for large-scale data analytics in bioinformatics research?	Mostly not	Bioinformatics research is now dependent on more data-intensive computing; therefore, existing solutions need to be aligned using the benefits of the Semantic Web technologies
Q3	Is reproducibility of a computational analysis ensured over a long period using computational resources?	Mostly not	Reproducibility is one of the most important requirements for a DWFS, so that scientific experiments are more repeatable and transparent to others based on the given infrastructures and associated technologies
Q4	Are current DWFS efficient and lightweight (workflow management and execution) enough for data analytics for bioinformatics research over the Web?	Mostly not	We need to deploy an efficient and lightweight data analytics approach on the cloud or data server without moving the data location
Q5	Can we design a next-generation DWFS with Semantic Web and cloud computing technologies based on existing DWFS?	Yes	Important and our primary objective. However, this mostly depends on the right consideration, research and technical expertise

The rest of the article is structured as follows: the ‘Semantic Web and cloud services in action’ section is focused on the ongoing trends and possible future outcomes for bioinformatics workflow systems by incorporating Semantic Web and Cloud computing services. The ‘Data workflow systems for bioinformatics research’ section discusses the use of different DWFS and their limitations based on the two use cases. The ‘Advancing DWFS through Semantic Web and cloud technologies’ section provides research and technological guidelines toward the development of a new DWFS. The ‘Conclusions’ section elaborates on anticipated future outcomes and achievements.

Semantic Web and cloud services in action

In this section, we show how the Semantic Web and cloud services improve the usability and performance of existing DWFS. Table 1 summarizes the objectives and our assessment of the relevance of the current DWFS.

Large-scale data management in the cloud for bioinformatics research

Tasks associated with bioinformatics research such as searching, downloading, visualization and analysis are mainly performed on the scientists’ desktops using DWFS. This essentially limits the potential for large-scale data analytics (e.g. for high nucleotide precision [23]) and leads to failures because of ever-increasing amounts of data, time-consuming data downloads and other constraints in terms of data volume and variety [30, 31]. The ‘4 Ms’ in data management, i.e. move, manage, merge and munge, are not sufficiently performant for large-scale data [31]. Furthermore, more complex problems in data representation and data usage have to be addressed for bioinformatics research to make use of data sharing in the cloud [32]. High-throughput technologies, such as NGS, require the bioinformaticians’ expertise to carry out data management and analytics at scale using DWFS, as well as access to high-performance computing infrastructures to mount data resources from distributed hosting infrastructures [33]. Therefore, interoperable data at a central site with efficient cloud-based processing units would form the right setup for DWFS, including advancements in data reproducibility. To this end, robust, scalable and efficient data management tools are required for large-scale scientific discoveries including visualizations [30–32, 34–36]. A number of parallel and distributed approaches to workflow creation and management have been suggested to address the above challenges [37]. Although existing DWFS can already perform in parallel and distributed environments for high-performance data analysis [31, 38–42], fewer solutions have been migrated to the Cloud as a Service [43-45]. Note that migrating into the cloud [46] requires careful planning of data management, task dependencies, job scheduling, execution and provenance tracking. However, local plug-in-based architectures (e.g. Eclipse) would offer even better options for researchers [28]. 
In addition, data provenance based on an abstract specification of workflows and its specific operations [30, 31] is a key element for transforming engineering reproducibility into scientific reproducibility, e.g. in human genomics analysis [47-51]. Specific solutions (e.g. in virtualization technologies) allow result replication step by step [5] and in particular, tools like Docker along with Semantic Web services improve the performance of DWFS in this regard [52]. Scientists may now use the DWFS in combination with cloud infrastructures [e.g. Amazon Web Services (AWS)] [53] and perform data analytics on the database server without knowing the underlying IT infrastructures.

Access to data with open data formats and Semantic technologies

Semantic Web technologies (e.g. linked data, ontologies and execution rules) and KBs connect humans with data and improve workflow systems [30, 54] by adding human-readable labels to data sets, by providing definitions for concepts (and their labels) and by formalizing facts as axiomatic statements. Bioinformatics solutions already use Semantic Web technologies when publicly available resources have to be integrated in a transparent way [28], enabling data access in distributed and heterogeneous environments: the bioinformatics domain has embraced linked data as the Life Sciences Linked Open Data (LS-LOD) [24] to deliver its benefits into bioinformatics research [24, 32, 35, 54–67]. Bioinformatics research institutes increasingly provide their data as linked data, for example UniProtKB [63, 68], EMBL-EBI [69] and the DNA Data Bank of Japan [70]. Other bioinformatics groups are also contributing, such as Bio2RDF [71, 72], which comprises the most relevant biomedical data resources such as dbSNP [73, 74], OMIM [75, 76] and pathway databases such as KEGG [77, 78], Reactome [79, 80] and the Pathway Interaction Database [81]. Correspondingly, the NCBI itself [65, 82] provides its own data repositories in linked data format. Other existing data resources (on the Web) are enriched with additional metadata and semantic knowledge for efficient reuse, for example as Linked Open Data (LOD) [54]. LOD exposes the data semantics in a machine-readable format including universal identification of data across the World Wide Web [via Uniform Resource Identifiers (URIs)]. The inclusion of semantics into data workflows provides many advantages over traditional architectures [35, 54, 57, 58]. For example, annotating the provenance of data with vocabulary languages (e.g. RDFS and OWL) ensures interpretation of data in an unambiguous way according to the original semantic context [48, 59, 83]. 
Semantic data integration leads to improved data availability through query access to federated SPARQL end points [84, 85]. More generic solutions support reuse of data among related workflows, and semantically annotated data enable workflow engines to discover the most relevant Web services at runtime, thus achieving data provenance support at low overheads [59]. However, deficiencies in the reuse of available URIs are still a barrier to the accessibility of bioinformatics data [24, 62]. Likewise, broken links hinder progress in the interfacing between various genomic data sources. Ongoing research targets further improvements in DWFS to advance efficient workflow composition and reuse of workflows, scalability of processes, provenance tracking of data, flexibility in the workflow design, performance tuning and reliability through Web services. However, the existing systems do not yet meet scientists’ more advanced expectations [35, 57], in particular for embedding the DWFS as a core part of bioinformatics research [86]. Ultimately, researchers seek large-scale data integration for biological phenomena, e.g. biological and biochemical mechanisms and disease biomarkers [28, 87]; however, access to large-scale data from distributed public sources still requires unacceptably high levels of manual data integration, e.g. in drug discovery [19, 88].
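
Query access to federated SPARQL end points, as mentioned above, can be illustrated with a short sketch. The example below only assembles the text of a federated query that uses the SERVICE keyword of SPARQL 1.1 and does not contact any server. The end point URLs, the `ex:` vocabulary, the predicate names and the helper `build_federated_query` are hypothetical placeholders chosen for illustration; they are not taken from the article or from any real repository.

```python
from textwrap import dedent


def build_federated_query(gene_symbol, protein_endpoint, pathway_endpoint):
    """Assemble a SPARQL 1.1 query that joins results from two end points.

    Each SERVICE block is evaluated by the named remote end point; the
    shared variable ?protein joins the two partial results.
    """
    return dedent(f"""\
        PREFIX ex: <http://example.org/vocab#>
        SELECT ?protein ?pathway WHERE {{
          SERVICE <{protein_endpoint}> {{
            ?protein ex:encodedBy "{gene_symbol}" .
          }}
          SERVICE <{pathway_endpoint}> {{
            ?pathway ex:hasParticipant ?protein .
          }}
        }}
        """)


# Hypothetical end points; a real deployment would substitute its own.
query = build_federated_query(
    "TP53",
    "https://example.org/proteins/sparql",
    "https://example.org/pathways/sparql",
)
print(query)
```

In such a setup the workflow engine would send the query to a single end point, which delegates the SERVICE blocks to the remote sources, so the data stay where they are hosted and only the matching results are transferred.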

Data workflow systems for bioinformatics research

This section analyzes widely used DWFS for bioinformatics research based on the literature review (see Appendix). Features and their definitions for DWFS are given in Table 2; the features are attributed to three categories, i.e. use of computational resources, human usability and access to public resources, which are again used to judge the DWFS (Table 3), and to provide research recommendations in the subsection ‘Full support for the cloud services and Semantic Web technologies’ (Table 4).

Table 2

Workflow systems, features and definitions from the scientific literature including [1, 2, 12, 15, 17, 20, 34, 48, 60, 65]

Feature	Definition	Class
Data set conversion	DWFS enables users to convert data for bioinformatics research from one format to another and helps create the corresponding mapping between different data types with ease	IT characteristics
Adaptability	DWFS enables users to adapt the workflow system to new or unknown data types or formats	IT characteristics
Automation and batch processing	DWFS enables users to configure the workflow environment, edit workflows and submit workflow jobs with ease using a script-based approach	IT characteristics
Workflow scheduling	DWFS enables users to schedule the workflow jobs before submitting them (in case the number of workflows to be submitted is enormous)	IT characteristics
Data integration	DWFS enables users to integrate and upload data sets from diverse sources to the workflow data directory	IT characteristics
Large-scale data processing	DWFS enables users to handle and process the data sets at scale	IT characteristics
System reliability	DWFS ensures that computation will be done successfully and the jobs will not be stalled in between	IT characteristics
Workflow specification	DWFS enables users to specify, develop or compose workflows with ease using standard workflow languages	IT characteristics
Portability	DWFS enables users to execute a workflow (locally or remotely in a platform-independent manner) after it has been created somewhere else	Human interface
Reproducibility	DWFS enables users to reproduce results identical to the claimed results elsewhere, given similar input and computational approaches	Human interface
Data provenance	DWFS enables users to track experimental steps, parameter settings, intermediate inputs/outputs and experimental data lineage	Human interface
Computational transparency	DWFS enables users to share the experimental steps and workflows with the research communities who will be reusing a similar approach	Human interface
Reusability	DWFS enables users to reuse useful components iteratively for similar experiments	Human interface
Ease of use	DWFS enables users to use the DWFS with little or no training overhead	Human interface
Scalability	DWFS processes data at different extents of data size and numbers of processing modules using available physical and software resources	Public resources
Extensibility	DWFS incorporates new modules or tools into the workflow system (when necessary) in the experimental steps	Public resources
Interoperability	DWFS integrates mergeable components from different DWFSs	Public resources
Platform independence	DWFS operates on any operating system or platform (e.g. Linux, Mac OS and Windows)	Public resources
Cloud integration support	DWFS migrates the whole workflow system to the cloud to be used as SaaS	Public resources
Open data and open-source design	DWFS is open to the research community so that they can configure a local copy on their machine or cloud and even contribute new modules/tools or bug fixes to the next stable release	Public resources

Table 3

Workflow systems and their scoring based on supported features

Note: Based on our extensive review of the literature, the scoring was marked 1 if the feature is supported by the workflow system and blank otherwise. ‘IT characteristics’ stand for the core processing capabilities of the DWFS, ‘human interface’ for user-friendliness and ‘public resources’ for alignment with publicly available data resources. Supported features are summarized based on our extensive review from literature [1, 2, 4, 11–13, 18, 19, 22, 23, 25, 33, 36, 37, 39, 43, 55, 56, 58, 59].

Table 4

Features, definitions and their significance to cloud computing, linked data and open data

Note: These definitions and outcomes have been summarized based on our systematic review including [1, 2, 4, 11–13, 18, 19, 22, 23, 25, 33, 36, 37, 39, 43, 55, 56, 58, 59, 61, 63, 90, 92, 95, 96, 98, 106–110]. The last column signifies that the combined use of DWFS along with Semantic Web and cloud services could help to ensure the availability of (most) the features needed in a DWFS. Based on the review outcome, if the count of yes is at least 2 (of 3), the verdict goes to yes (with green color), no (in red color) otherwise.
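
The verdict rule stated in this note (yes if at least two of the three criteria are met) can be written down directly. The following is a minimal sketch; the function and parameter names are hypothetical and simply mirror the three criteria named in the table caption.

```python
def verdict(cloud_computing, linked_data, open_data):
    """Return 'yes' when at least 2 of the 3 criteria are met, else 'no'."""
    yes_count = sum([bool(cloud_computing), bool(linked_data), bool(open_data)])
    return "yes" if yes_count >= 2 else "no"


# Illustrative inputs only; these are not values taken from Table 4.
print(verdict(True, True, False))
print(verdict(True, False, False))
```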

In principle, we distinguish solutions that have been designed for the workflow-based integration of heterogeneous data sources and processes. Examples that are linked to NGS, drug discovery and large-scale bioinformatics data analytics include Taverna [67, 89, 111], Anduril [87], Taverna2-Galaxy [106], Konstanz Information Miner (KNIME) extension [107], Tavaxy, LONI [26, 27], SNAPR [90], Graphical Pipeline for Computational Genomics (GPCG) [91], Google Genomic Cloud, Pegasus [57, 58, 112], USC Epigenome Centre collaboration [10], Galaxy [92], GG2P [12] and Unipro UGENE NGS Pipeline [9, 108]. 
Table 5 shows the DWFS for the bioinformatics area and their use cases along with limitations according to their Web site information and related literature [4, 28, 54, 93, 100, 113]. In addition to these reviews, several solutions for processing NGS data based on shell scripts or graphical workflow environments have been suggested to improve data processing tasks such as high-throughput genome sequencing, data manipulation and visualization [39, 87, 89, 92, 94, 95, 114].

Table 5

Some widely used DWFS and their potential use cases with limitations summarized from their Web site and other literature including [4, 28, 54, 98–100]

DWFS	Potential use cases	Technologies	Limitations
Tavaxy	Personalized medicine and NGS (short DNA reads, DNA segments, phylogenetic and taxonomical analysis, EMBOSS, SAMtools, etc.)	SCUFL, JSON, hierarchical workflow structure, asynchronous protocol and DAG style in workflow creation and execution	Difficulty in combining bio-pipelines between Galaxy and Taverna’s workflows using SCUFL; lack of sufficient interoperability; does not support loops in workflow creation; lack of opportunity of workflow sharing
Taverna2-Galaxy	Life Sciences (e.g. eukaryotic genome biology)	SCUFL 2 (experimental), Semantics, RDF, OWL and DAG	SCUFL 2 is still in Apache’s incubation; does not support loops in workflow; lack of opportunity in workflow sharing
Galaxy	NGS (QC and manipulation, Deep Tools, Mapping, RNA Analysis, SAMtools, BAM Tools, Picard, VCF Manipulation, Peak Calling, Variant Analysis, RNA Structure, Du Novo, Gemini, FASTA Manipulation, EMBOSS, etc.)	Python, JavaScript, shell script; OS: Linux and Mac OS X	No proper interlinking mechanism in pipeline functionalities between dependent modules; does not support loops in workflow creation; does not support control-flow operations and remote services; no workflow language available rather than RDBMS; adding new tools requires advanced IT knowledge
KNIME	Pharma and healthcare (virtual high-throughput screening, chemical library enumeration, outlier detection in BioMed data and NGS analysis with KNIME Extension [107])	Java/Eclipse, KNIME SDK and Spotfire (supports Python and Perl scripts)	JDBC mechanism to access the databases is slow; high latency time in requests and responses; not scalable for large-scale data and heavy computation; no reproducibility of the computational results
Taverna	Domain-independent (bioinformatics, cheminformatics, gravitational wave analysis)	WSDL, Java and DAG	Not scalable for large-scale data and heavy computation; slow response while creating large-scale workflow and submission, thereafter; no reproducibility of the computational results
Wings	Multi-omics analysis and cancer omics	Java, Maven, DAG, Tomcat and Graphviz; OS: Unix and Mac OS X	Not scalable for large-scale data and heavy computation; no data integration support; lack of computational transparency; lack of interoperability with other DWFS
Anduril	Cancer research and molecular biology, DNA, RNA and ChIP-seq, DNA and RNA microarrays, cytometry and image analysis	Workflows are constructed using Scala, DAG notation and AndurilScript; developed in Java; OS: Windows, Linux and Mac OS X	No data conversion support; lack of interoperability with other DWFS; cannot be configured on cloud infrastructure; not suitable for workflows containing loops
Unipro UGENE	NGS: sequencing, annotations, multiple alignments, phylogenetic trees, assemblies, RNA/ChIP-seq, raw NGS, local sequence alignment, protein sequencing, plasmid, variant calling, evolutionary biology and virology	C++, Qt, DAG style workflow creation and support (cross-platform software system)	Does not support loops in workflow creation; data provenance cannot be ensured; not scalable for large-scale data and heavy computation; lack of computational transparency; no reproducibility of the computational results
Pipeline Pilot	NGS: gene expression and sequence data analysis, imaging; Pharma: drug–chemical material analysis, cheminformatics, ADMET, polymer properties synthesis, data modeling	Visual and data flow oriented, written in C++; OS: Windows and Linux	No control flow operation; not scalable for large-scale data and heavy computation; limited data provenance support; no reproducibility of the computational results

DWFS as a platform for processing genomics data

The workflow representation in a DWFS is mostly a directed acyclic graph (DAG), which excludes cycles in the workflow execution; however, other specifications comprise BigDataScript [109], RDF pipeline [14], PilotScript [6, 24] or SCUFL 2 notations, which enable operational flow control based on decision, forking and joining nodes [84]. Often, the DWFS provides a graphical user interface (GUI) for generating workflows prior executing them and input data and processing tasks to be assigned to the physical resources by the workflow engine. As an alternative, scripting and batch processing help to automate a DWFS, thus avoiding unnecessary human interaction [43], and the Kepler workflow system [34, 95, 110, 115] is a good example of a sophisticated run-time workflow engine that offers a GUI and automatic processing. Galaxy is a comprehensive, well-established and widely used platform for interactive genomic analysis, reuse and sharing, offering an NGS computational framework for a single processing unit. It is well described with characteristics such as high usability, simplicity, accessibility and reproducibility of the computational results. It supports various sequence file formats like Text, Tabular, FASTA and FASTQ. Galaxy also provides special quality control (QC) by filtering the data sets by a quality score and solving specific gene sequence-related tasks. In addition, it provides full statistical support on user data sets showing the traits scoring and distribution functions. On the other side, Galaxy lacks the proper interlinking of pipeline functionalities from one module into subsequently dependent modules. It is often not suitable for workflows containing loops and does not support any control-flow operations or remote services [100]. Additionally, it does not use a workflow language but instead uses a relational database (i.e. PostgreSQL). The libraries for available Galaxy routines also require advanced IT knowledge for developing new tools. 
However, the XML wrappers specify the inputs and outputs for the different tools, so that, from a user perspective, only the suitable data formats are offered in the drop-down options. The LONI pipeline system is formed around a core pipeline engine for accessing distributed data sources, Web services and heterogeneous software tools focused on NGS data analysis [26]. The GPCG is also dedicated to NGS data analytics, which includes sequence alignment, SNP analysis, CNV identification, annotation, visualization and analysis of the results. Anduril is a workflow platform for analyzing large data sets, i.e. high-throughput data in biomedical research. The platform is fully extensible by third parties and supports data visualization, microarray analysis, and cytometry and image analysis. Unipro UGENE provides NGS pipelines for SAMtools, the Tuxedo pipeline for RNA sequencing (RNA-seq) data analysis and the Cistrome pipeline for chromatin immunoprecipitation sequencing (ChIP-seq) data analysis as an integrated platform in the Unipro UGENE desktop toolkit [9]. Other solutions deliver dedicated pipelines for specific data analytics tasks without the ambition to form a platform. SNAPR [90, 116] has been developed as a bioinformatics pipeline for efficient and accurate RNA-seq alignment and analysis [91]. The USC Epigenome Centre uses the Pegasus system as a computerized sequencing pipeline to conduct genome-wide epigenetic analyses [93, 100, 112]. GG2P supports seamless integration of various SNP genotype data sources like dbSNP [12, 73], and the discovery of indicative and predictive genotype-to-phenotype associations. Recently, KNIME has been extended to NGS data analysis and processes NGS data formats like FASTQ, SAM, BAM and BED.
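To make the DAG representation mentioned at the start of this section more concrete, the following minimal Python sketch derives an execution order for a small workflow and rejects graphs that contain cycles. The task names are hypothetical and are not taken from any of the systems reviewed here; the standard-library `graphlib` module (Python 3.9 or later) merely stands in for a real workflow engine.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical NGS-style workflow: each task maps to the set of tasks it depends on.
workflow = {
    "quality_filter": set(),
    "align": {"quality_filter"},
    "call_snps": {"align"},
    "annotate": {"call_snps"},
    "visualize": {"align", "annotate"},
}

def execution_order(dag):
    """Return one valid task order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as err:
        # The second element of args holds the detected cycle.
        raise ValueError(f"workflow is not a DAG: {err.args[1]}") from err

order = execution_order(workflow)
print(order)
```

A workflow with loops, which the text notes is a poor fit for DAG-based systems, would make `execution_order` raise `ValueError` instead of returning an order.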

DWFS in drug discovery based on conceptual data

In bioinformatics for drug discovery, the DWFSs combine content from distributed databases to automate the reconstruction of biological pathways and the inference of relationships, for example linking the relationships between genes, proteins and metabolites to relevant knowledge about drugs. Solutions for drug discovery research use public data from fact repositories compliant with Semantic Web technologies and KBs, which are contrasted with data from screening experiments for the profiling of chemical entities. These tools not only help in workflow generation but also support mechanisms for tracing provenance and other methodologies fostering reproducible science. The tight coupling of myExperiment [96] with Taverna enables the Taverna workflow system to access a network of shared workflows for data processing [9]. Stevens et al. [97] proposed sharing bioinformatics-related workflows through myExperiment to facilitate the drug discovery process. In this respect, Pipeline Pilot eases cheminformatics analysis in a data pipelining environment, and combining Pipeline Pilot with KNIME [98] leads to an efficient high-level GUI for bioinformatics tasks. Chem2Bio2RDF [32] is a semantic workflow framework for linking chemogenomic data to Bio2RDF and the LOD project [69]. It demonstrates its utility in investigating polypharmacology, in identifying potential multiple pathway inhibitors and in associating pathways with adverse drug reactions. A customized version of the Kepler system for drug discovery and molecular simulations was proposed by Chichester et al. [99]. However, it is not scalable for large-scale drug-related data resources.

Advancing DWFS through Semantic Web and cloud technologies

This section examines usability improvements through data sharing, uploading, processing and analyzing with a focus on cloud infrastructures and semantics technologies. Table 4 lists characteristic features of DWFS (introduced in Table 2) and their relevance for cloud computing, semantic representation and open data access, respectively.

Increasing usability, reproducibility and data provenance

Scientists are often domain experts, not IT experts, and therefore require that the DWFSs offer high usability (and good documentation). Usability advances by hosting the services in a cloud infrastructure for ready access and by using semantic technologies for improved human–machine interaction through standardized semantic labeling of data. Furthermore, scientists profit from reproducibility of scientific work (i.e. repeatability of experiments and access to open data), which is supported by capturing workflow versioning and provenance information, again achieved with Semantic Web technologies [55, 83]. The data provenance for DWFS is managed by tracking the data management infrastructure, data lineage analysis and visualization [49]. Certainly, any data conversion has to preserve the data semantics. Semantic Web technologies and KBs in this regard allow integration of LS-LOD at scale [56]. A good example is Wings [57, 58], which uses semantic representations for the design and validation of workflows, the choice of experimental parameters and the selection of dynamic models suitable for the scientific data and the scientist’s requirements. This leads toward automatic workflow generation with sufficient detail to determine the provenance of the data. As discussed before, provenance, as metadata information for data resources and workflow components, increases reproducibility and usability at a large scale [35, 103]. However, a uniform provenance standard is required to share the metadata in an explicit way [55]; the Open Provenance Model could be further improved to this end, or the next release of SCUFL 2 may bring semantics into the DWFS. Kepler Archive [115] and myExperiment are two repositories that facilitate the re-execution of workflows in a platform-independent manner by importing them into a DWFS directly [104].
The last column in Table 4 signifies that the combined use of DWFS along with Semantic Web and cloud computing could help to ensure the availability of most of the features needed in a DWFS. Based on the review outcome, the overall verdict for a feature is yes if at least 2 of the 3 responses are yes, and no otherwise.
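The verdict rule for the last column of Table 4 is a simple majority vote, which can be written down as a short Python sketch. The three responses are treated as generic booleans here because this excerpt does not reproduce Table 4; the example rows are placeholders, not values from the table.

```python
def overall_verdict(responses):
    """Majority rule: 'yes' when at least 2 of the 3 responses are yes."""
    if len(responses) != 3:
        raise ValueError("expected exactly three yes/no responses")
    yes_count = sum(bool(r) for r in responses)
    return "yes" if yes_count >= 2 else "no"

# Placeholder rows (True = yes, False = no); not taken from Table 4.
example_rows = {
    "feature_a": (True, True, False),
    "feature_b": (False, False, True),
}
verdicts = {name: overall_verdict(row) for name, row in example_rows.items()}
print(verdicts)
```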

Improving performance through data and workflow sharing in the cloud

A workflow engine has to scale according to the number of used resources, services and the volume of data, leading to a difficult dependency between scalability and performance [28]. This dependency exposes the workflow engine as the core component for solving the performance bottleneck [3]. Furthermore, computing infrastructures may restrict the deployment of workflow applications, and large data resources may only be transferred with significant overheads. Efficient policy-based data placement bolsters the performance of a DWFS [49], as known from the Swift workflow system for cloud-based computation [36, 43–45]. Other examples of DWFS in the cloud can be found in Deelman et al. [105]. The Wings DWFS enables large scientific workflows based on semantic representations that expose the provenance of scientific experimentations and the connections to other useful data. The structure and content of the data provenance record can be complex, as it has to correctly represent the data derivations, multiple source origins, multistaged processing and diverse analysis activities. Finally, platform independence is important in bioinformatics research to share workflows across available platforms. Optimally, the DWFS would provide a browser-based user interface; the Taverna suite is a prime example as an open source, domain- and platform-independent workflow system. Interpreted programming languages like Perl, Python or PHP contribute to platform independence. Moreover, workflows should be easy to exchange, evolve and reuse, and they should be open source so that everybody can contribute to producing meaningful scientific results.
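The remark that a provenance record has to represent data derivations, multiple source origins and multistaged processing can be illustrated with a small Python sketch. This is not the record format of Wings or of any other system named above; the class, its field names and the file names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """One data item plus how it was derived (all field names are illustrative)."""
    item: str
    activity: str = "source"  # processing step that produced the item
    derived_from: list = field(default_factory=list)  # parent records

def original_sources(record):
    """Walk the derivation chain back to the items that have no parents."""
    if not record.derived_from:
        return {record.item}
    sources = set()
    for parent in record.derived_from:
        sources |= original_sources(parent)
    return sources

# Hypothetical two-stage derivation with two source origins.
reads = ProvenanceRecord("reads.fastq")
reference = ProvenanceRecord("reference.fasta")
aligned = ProvenanceRecord("aligned.bam", "alignment", [reads, reference])
variants = ProvenanceRecord("variants.vcf", "variant_calling", [aligned, reference])

print(sorted(original_sources(variants)))
```

Even in this toy case the final item depends on the reference both directly and through the alignment step, which hints at why real provenance records become complex.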

Toward fully integrated DWFS for analyzing large-scale data

The analytical overhead of genome sequencing data imposes restrictions on NGS research overall [87]. Similarly, modern data-driven drug discovery requires integrated resources and pipeline solutions to support decision-making and enable new discoveries [101]. Data integration in bioinformatics requires resolving the heterogeneity of data sources when they are used on large genomics and pharmacogenomics data sets in a distributed way [41]. The workflow presented in Figure 1 computationally integrates data from four different sources. The drug-related compounds are extracted from PubChem, bioassay data from Bio2RDF, gene-related data from ClinVar and HGNC (or from the NCBIGene data set) and the pathway-related data set from Reactome and KEGG. The whole pipeline can be represented in RDF/XML, N3 or Turtle format. According to the literature [14], it is a decentralized approach with no central controller. Furthermore, it is data and programming language agnostic, where each node (rectangle) can be made live using an updater and a wrapper. The former has to be written using the same technology as the DWFS, but the latter could be written in any programming language. However, in this regard, we would argue for using the SPARQL query language.
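As a rough illustration of the integration idea described above, the following Python sketch merges triples from four kinds of sources into one graph and answers a drug-to-pathway question by joining triple patterns. It uses plain tuples rather than an RDF library or a SPARQL end point, and every identifier and predicate name is a made-up placeholder; it is not the RDF pipeline implementation of [14].

```python
# Each source contributes (subject, predicate, object) triples.
# All identifiers and predicate names below are made-up placeholders.
compound_triples = [("drug:D1", "inhibits", "protein:P1")]
bioassay_triples = [("drug:D1", "testedIn", "assay:A1")]
gene_triples = [("gene:G1", "encodes", "protein:P1")]
pathway_triples = [("gene:G1", "participatesIn", "pathway:W1"),
                   ("gene:G2", "participatesIn", "pathway:W2")]

# Integration step: merge the per-source triples into one graph.
graph = set(compound_triples + bioassay_triples + gene_triples + pathway_triples)

def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def pathways_for_drug(graph, drug):
    """Join drug -> protein -> gene -> pathway, like a three-pattern graph query."""
    found = set()
    for _, _, protein in match(graph, s=drug, p="inhibits"):
        for gene, _, _ in match(graph, p="encodes", o=protein):
            for _, _, pathway in match(graph, s=gene, p="participatesIn"):
                found.add(pathway)
    return sorted(found)

print(pathways_for_drug(graph, "drug:D1"))
```

In a SPARQL setting, the three nested loops would correspond to three triple patterns sharing the protein and gene variables.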

Figure 1

Workflow for finding the pathways affecting particular drugs by finding the number of inhibitors communicating signals from a receptor, using RDF pipeline notation [14]. This helps in data integration, processing and querying that can be used collaboratively by a number of experts (i.e. practitioners like medical doctors, pharmacologists, chemists and IT experts). This workflow is conceptually adapted from the RDF pipeline by Booth et al. [14].

Workflow systems like Galaxy and KNIME are particularly suitable to bring all the combined genomic data (i.e. numerical or sequence data) and drug-related data (i.e. facts data and KBs) to the scientist. Then, these data can be processed by a DWFS as a Service in the cloud [44]. These approaches have been applied recently for large-scale biological sequence alignments [37, 102] along with bioKepler [110]. Tavaxy serves as an interoperable workflow system for analyzing large-scale genomic sequencing data. KNIME [98] has been tailored to drug discovery but could be augmented by incorporating Semantic Web technologies and then be attached to the Open PHACTS platform to query the RDFized drug compound-related data using SPARQL (as shown in Figure 1 in RDF pipeline notation [14]). This access to structured data gives input to questions concerned with the number of drug compounds having specific effects on pathways in the DNA regulation or with the side effects of a drug known from a drug–gene pathway. However, Galaxy has emerged as the leading open-source workflow platform for data analytics (e.g. NGS data) and for the benchmarking of bioinformatics components because of its high flexibility and extensibility standards [99]. Semantic Web tools can be incorporated into the Galaxy workflow system just like any data analysis tool for processing, job monitoring, workflow creation and delivery of ready workflows to the research communities.
Beyond these, Semantic Automated Discovery and Integration (SADI)-Galaxy [66] brings semantics support through the SADI framework into the Galaxy workflow system. Moreover, SADI-Taverna has been implemented in the Taverna workflow system as well. A similar extension would be the TopFed–Galaxy integration [8] to make cancer genomic data analytics more reproducible, scalable and transparent, where TopFed distributes the data from ‘The Cancer Genome Atlas’ as LOD for access to genetic mutations responsible for cancer.

Full support for the cloud services and Semantic Web technologies

Once the semantics requirements have been met, DWFS like Galaxy or KNIME would be migrated to the cloud. The best candidates for NGS analysis are Tavaxy and Galaxy because of their high scores (16 each in Table 3). However, Galaxy would be the most suitable candidate because of its widespread distribution and its ease of use for NGS. KNIME, on the other hand, performed best against the pharmaceutical use cases. Altogether, the biomedical or pharmacogenomics researchers can draft their requirements into the workflow specification using BigDataScript, RDF pipeline notation, PilotScript or SCUFL 2 for creating platform-independent workflows with LOD technologies before submitting the jobs. Research questions can then be formalized as SPARQL queries addressing the data flow (Figure 1) between computational nodes and then can be submitted as workflow jobs (refer to Figure 2 for a generic overview) for execution. Likewise, Semantic Web tools can provide access to related data from heterogeneous sources (i.e. genomic- or drug-related compound data) via SPARQL end points as LOD with dereferenceable URIs, or Semantic Web tools can automatically transform local data sets and upload them to DWFS in RDF formats.

Figure 2

Solving bioinformatics research problems for two representative use cases (i.e. genome sequencing analysis and drug discovery) by incorporating Semantic Web technologies and cloud services into the DWFS. The predictors (in a DWFS) learn models from the training drug data and calculate predictions for all targets before combining and submitting them to the workflow engine. After the workflow jobs have been submitted, the data can be processed in a parallel and distributed way using cloud services (e.g. Amazon AWS as IaaS and PaaS). Even the DWFS itself could be used as a SaaS tool. Further improvements result from the use of semantic provenance (and reasoning) to test and validate the semantic consistency of the data model, the conciseness of results and the reproducibility. Formal ontologies and KBs may contribute in addition. Automated reasoning validates RDFized instances and their compliance with the OWL classes of the data model. To validate the results during the drug discovery or sequence analysis process, evaluation and validation could be performed on statistically significant drug data or simulated/real genome data. Moreover, validation can be done by matching the expected results with KB rules. After the results have been evaluated and validated, biomedical scientists can assess their hypotheses based on the outcome.

Conclusions

Representing and developing new workflow systems, or integrating sufficient tools into existing workflow systems with suitable scalability and extensibility, will be a key challenge for bioinformatics research in the future. DWFS in bioinformatics have to evolve toward distributed and scalable infrastructures including ubiquitous computing and integration of Web services, Semantic Web technologies and also domain-specific tools. Data provenance has to be ensured for large-scale data, and so does LOD manageability on the system level. Here are some key points of this systematic review for bioinformatics research. Bioinformatics researchers rely on a number of features such as result reproducibility, data provenance, scalability, openness, reusability, abstraction and simplicity. The suggestions provided in this manuscript should help researchers to develop more advanced DWFS. One particular focus will be on ontology-based formalisms and semantic reasoning to achieve shared data representations and knowledge integration based on existing workflow systems (e.g. Galaxy and KNIME). More specifically:
Using a graph-based approach for representing and executing workflows of pathways (e.g. what is done in KNIME).
Making efficient use of a modular approach (including parallelization) for the workflow jobs and processes (e.g. what is done in Galaxy).
Making efficient use of specification languages for the pathway (e.g. SCUFL 2) apart from the graphical approach.
Integrating the provenance information as metadata using Semantic Web technologies (e.g. exploiting the FAIR principles that were recently published in Nature).
Integrating the semantic resources (ontologies, fact repositories) and KBs, e.g. through access to SPARQL end points, BigDataScript or RDF pipeline notation.
Enabling the transformation of the experimental data into semantic information (e.g. via ML approaches) as available.
Processing large-scale data for bioinformatics research requires an infrastructure, preferably a cloud infrastructure, to enable data analytics at scale and to address emerging research problems. The data deluge in bioinformatics research drives the demand for parallel and distributed computing by imposing a need for scalability and high-throughput capabilities onto the DWFS. Emerging requirements for data sharing and access to public resources suggest that compliance of the DWFS with Semantic Web standards is needed, where the data analytics has to be done on the cloud-based infrastructure. If genome sequencing and drug discovery are considered as two of the most relevant use cases, the following requirements must be met by using Semantic Web technologies on cloud-based infrastructure to attain the above advancements: a number of capabilities need to be developed in the existing DWFS to prepare workflow creation, management and execution for parallel and distributed computing; data provenance should be supported to combine engineering and scientific reproducibility based on Semantic Web technologies; interoperable data (experimental and symbolic data) should be hosted in a secure environment with efficient cloud-based processing through semantic labeling (for scientists); and the existing DWFSs have to advance into fully integrated DWFS for big data analytics in the cloud.

Table 6

Article searching queries and related statistics for the systematic review methodology

Query	Search query	Source	Results (per source, in order)	Number of publications used	Section
Q1	(“workflows”[All Fields] AND “next generation sequencing”[All Fields]) OR (“workflows”[All Fields] AND “genomics”[All Fields]) OR (“workflows”[All Fields] AND “Bioinformatics”[All Fields])	PubMed; Google Scholar; IEEE Digital Library	688; 2420; 24	23	‘Introduction’, ‘Data workflow systems for bioinformatics research’ and ‘DWFS as a platform for processing genomics data’
Q2	(“Workflows”[All Fields] AND “Drug Discovery”[All Fields]) OR (“Workflows”[All Fields] AND “Pharmacogenomics “[All Fields])	PubMed; Google Scholar; IEEE Digital Library	91; 472; 34	22	‘Introduction’ and ‘Data workflow systems for bioinformatics research’
Q3	(“Workflows”[All Fields] AND “Big Data”[All Fields]) OR (“Workflows”[All Fields] AND “Large Scale Data”[All Fields]) OR (“Workflows”[All Fields] AND “Bioinformatics “[All Fields])	PubMed; Google Scholar; IEEE Digital Library	552; 470; 39	48	‘Semantic Web and cloud services in action’, ‘Large-scale data management in the cloud for bioinformatics research’ and ‘Access to data with open data formats and Semantic technologies’
Q4	(“Workflows”[All Fields] AND “Semantic Web “[All Fields]) OR (“Workflows”[All Fields] AND “Semantic”[All Fields]) OR (“Workflows”[All Fields] AND “Bioinformatics”[All Fields])	PubMed; Google Scholar; IEEE Digital Library	570; 2600; 3	13	‘Advancing DWFS through Semantic Web and cloud technologies’
Q5	(“Workflows”[All Fields] AND “Provenance”[All Fields])	PubMed; Google Scholar; IEEE Digital Library	25; 8100; 9896	9	‘Semantic Web and cloud services in action’, ‘Data workflow systems for bioinformatics research’ and ‘Advancing DWFS through Semantic Web and cloud technologies’

