Md Rezaul Karim1, Audrey Michel2, Achille Zappa3, Pavel Baranov2, Ratnesh Sahay1, Dietrich Rebholz-Schuhmann3.
Abstract
Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high-throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.
Year: 2018 PMID: 28419324 PMCID: PMC6169675 DOI: 10.1093/bib/bbx039
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Questions that arise for the DWFS for large-scale data analytics for bioinformatics research
| Questions | Objective | Do DWFSs reach state of the art? | How important is the answering? |
|---|---|---|---|
| Q1 | Do the current solutions enable large-scale data analysis in a cloud environment? | Yes | Important, and needs special care for large-scale data analytics using DWFSs |
| Q2 | Do existing solutions align well with the Semantic Web technologies for large-scale data analytics in bioinformatics research? | Mostly not | Bioinformatics research is now dependent on more data-intensive computing; therefore, existing solutions need to be aligned using the benefits of the Semantic Web technologies |
| Q3 | Is reproducibility of a computational analysis ensured over a long period using computational resources? | Mostly not | Reproducibility is one of the most important requirements for a DWFS, so that scientific experiments are more repeatable and transparent to others based on the given infrastructures and associated technologies |
| Q4 | Are current DWFS efficient and lightweight (workflow management and execution) enough for data analytics for bioinformatics research over the Web? | Mostly not | We need to deploy an efficient and lightweight data analytics approach on the cloud or data server without moving the data location |
| Q5 | Can we design a next-generation DWFS with Semantic Web and cloud computing technologies based on existing DWFS? | Yes | Important and our primary objective. However, this mostly depends on the right consideration, research and technical expertise |
Workflow systems, features and definitions from the scientific literature including [1, 2, 12, 15, 17, 20, 34, 48, 60, 65]
| Features | Definition | Class |
|---|---|---|
| Data set conversion | DWFS enables users to convert bioinformatics data from one format to another and helps create the corresponding mapping between different data types with ease | |
| Adaptability | DWFS enables users to adapt the workflow system to new or unknown data types or formats | |
| Automation and batch processing | DWFS enables users to configure the workflow environment, edit workflows and submit workflow jobs with ease using a script-based approach | |
| Workflow scheduling | DWFS enables users to schedule the workflow jobs before submitting them (in case the number of workflows to be submitted is enormous) | |
| Data integration | DWFS enables users to integrate and upload data sets from diverse sources to the workflow data directory | |
| Large-scale data processing | DWFS enables users to handle and process the data sets at scale | |
| System reliability | DWFS ensures that computation will be done successfully and the jobs will not be stalled in between | |
| Workflow specification | DWFS enables users to specify, develop or compose workflows with ease using standard workflow languages | |
| Portability | DWFS enables users to execute a workflow (locally or remotely, in a platform-independent manner) after it has been created somewhere else | |
| Reproducibility | DWFS enables users to reproduce, elsewhere, results identical to the claimed results for similar input and computational approaches | |
| Data provenance | DWFS enables users to track experimental steps, parameter settings, intermediate inputs/outputs and experimental data lineage | |
| Computational transparency | DWFS enables users to share the experimental steps and workflows with the research communities who will be reusing a similar approach | |
| Reusability | DWFS enables users to reuse useful components iteratively for similar experiments | |
| Ease of use | DWFS enables users to use the DWFS with little or no training overhead | |
| Scalability | DWFS processes data at different extents of data size and numbers of processing modules using available physical and software resources | |
| Extensibility | DWFS incorporates new modules or tools into the experimental steps of the workflow system (when necessary) | |
| Interoperability | DWFS integrates mergeable components from different DWFSs together | |
| Platform independence | DWFS operates on any operating system or platform (e.g. Linux, Mac OS and Windows) | |
| Cloud integration support | DWFS migrates the whole workflow system to the cloud to be used as SaaS | |
| Open data and open-source design | DWFS is open to the research community so that they can configure the local copy on their machine or cloud and even contribute by adding new modules/tools or bug fixing, etc., to the next stable release | |
Workflow systems and their scoring based on supported features
Note: Based on our extensive review of the literature, the score is 1 if the feature is supported by the workflow system and blank otherwise. ‘IT characteristics’ stands for the core processing capabilities of the DWFS, ‘human interface’ for user-friendliness and ‘public resources’ for alignment with publicly available data resources. Supported features are summarized from the literature [1, 2, 4, 11–13, 18, 19, 22, 23, 25, 33, 36, 37, 39, 43, 55, 56, 58, 59].
Features, definitions and their significance to cloud computing, linked data and open data
Note: These definitions and outcomes have been summarized based on our systematic review including [1, 2, 4, 11–13, 18, 19, 22, 23, 25, 33, 36, 37, 39, 43, 55, 56, 58, 59, 61, 63, 90, 92, 95, 96, 98, 106–110]. The last column signifies that the combined use of DWFS along with Semantic Web and cloud services could help to ensure the availability of most of the features needed in a DWFS. Based on the review outcome, if the count of yes is at least 2 (of 3), the verdict is yes (shown in green); otherwise, it is no (shown in red).
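The verdict rule in the note above amounts to a simple majority vote over three yes/no answers. A minimal sketch in Python follows; the function name `verdict`, the `threshold` parameter and the example rows are hypothetical illustrations, not values taken from the table.

```python
from typing import Sequence


def verdict(answers: Sequence[bool], threshold: int = 2) -> str:
    """Return "yes" if at least `threshold` of the answers are yes.

    `answers` holds one boolean per criterion (three in the note above);
    the default threshold of 2 mirrors the "at least 2 (of 3)" rule.
    """
    yes_count = sum(1 for answer in answers if answer)
    return "yes" if yes_count >= threshold else "no"


# Illustrative rows only; these are NOT results from the review.
example_rows = {
    "feature A": (True, True, False),
    "feature B": (True, False, False),
}

for feature, answers in example_rows.items():
    print(feature, verdict(answers))
```

Keeping the threshold as a parameter makes it easy to see how the verdicts would shift under a stricter rule, for example requiring all three answers to be yes.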
Some widely used DWFS and their potential use cases with limitations summarized from their Web site and other literature including [4, 28, 54, 98–100]
| DWFS | Potential use cases | Technologies | Limitations |
|---|---|---|---|
| | Personalized medicine and NGS (short DNA reads, DNA segments, phylogenetic and taxonomic analysis, EMBOSS, SAMtools, etc.) | SCUFL, JSON, hierarchical workflow structure, asynchronous protocol and DAG style in workflow creation and execution | Difficulty in combining bio-pipelines between Galaxy and Taverna’s workflows using SCUFL; lack of sufficient interoperability; does not support loops in workflow creation; lack of opportunity for workflow sharing |
| | Life Sciences (e.g. eukaryotic genome biology) | SCUFL 2 (experimental), Semantics, RDF, OWL and DAG | SCUFL 2 is still in Apache’s incubation; does not support loops in workflow; lack of opportunity for workflow sharing |
| | NGS (QC and manipulation, Deep Tools, Mapping, RNA Analysis, SAMtools, BAM Tools, Picard, VCF Manipulation, Peak Calling, Variant Analysis, RNA Structure, Du Novo, Gemini, FASTA Manipulation, EMBOSS, etc.) | Python, JavaScript, Shell script; OS: Linux and Mac OS X | No proper interlinking mechanism in pipeline functionalities between dependent modules; does not support loops in workflow creation; does not support control-flow operations and remote services; no workflow language available other than RDBMS; adding new tools requires advanced IT knowledge |
| | Pharma and healthcare (virtual high-throughput screening, chemical library enumeration, outlier detection in BioMed data and NGS analysis with KNIME Extension [107]) | Java/Eclipse, KNIME SDK and Spotfire (supports Python and Perl scripts) | JDBC mechanism to access the databases is slow; high latency in requests and responses; not scalable for large-scale data and heavy computation; no reproducibility of the computational results |
| | Domain-independent (bioinformatics, cheminformatics, gravitational wave analysis) | WSDL, Java and DAG | Not scalable for large-scale data and heavy computation; slow response while creating and then submitting large-scale workflows; no reproducibility of the computational results |
| | Multi-omics analysis and cancer omics | Java, Maven, DAG, Tomcat and Graphviz; OS: Unix and Mac OS X | Not scalable for large-scale data and heavy computation; no data integration support; lack of computational transparency; lack of interoperability with other DWFS |
| | Cancer research and molecular biology, DNA, RNA and ChIP-seq, DNA and RNA microarrays, cytometry and image analysis | Workflows constructed using Scala, DAG notation and AndurilScript; developed in Java; OS: Windows, Linux and Mac OS X | No data conversion support; lack of interoperability with other DWFS; cannot be configured on cloud infrastructure; not suitable for workflows containing loops |
| | NGS: sequencing, annotations, multiple alignments, phylogenetic trees, assemblies, RNA/ChIP-seq, raw NGS, local sequence alignment, protein sequencing, plasmid, variant calling, evolutionary biology and virology | C++, Qt, DAG style workflow creation and support (cross-platform software system) | Does not support loops in workflow creation; data provenance cannot be ensured; not scalable for large-scale data and heavy computation; lack of computational transparency; no reproducibility of the computational results |
| | NGS: gene expression and sequence data analysis, imaging; Pharma: drug–chemical material analysis, cheminformatics, ADMET, polymer properties synthesis, data modeling | Visual and data flow oriented, written in C++ | No control-flow operation; not scalable for large-scale data and heavy computation; limited data provenance support; no reproducibility of the computational results |
Figure 1. Workflow for finding the pathways affecting particular drugs by finding the number of inhibitors communicating signals from a receptor using RDF pipeline notation [14]. This helps in data integration, processing and querying that can be used jointly by a number of collaborating experts (i.e. practitioners such as medical doctors, pharmacologists, chemists and IT experts). This workflow is conceptually adapted from the RDF pipeline by Booth et al. [14].
Figure 2. Solving bioinformatics research problems for two representative use cases (e.g. genome sequencing analysis and drug discovery) by incorporating Semantic Web technologies and cloud services into the DWFS.
Article searching queries and related statistics for the systematic review methodology
| Query | Search query | Sources | Results (per source) | Number of publications used | Sections |
|---|---|---|---|---|---|
| Q1 | (“workflows”[All Fields] AND “next generation sequencing”[All Fields]) OR (“workflows”[All Fields] AND “genomics”[All Fields]) OR (“workflows”[All Fields] AND “Bioinformatics”[All Fields]) | PubMed; Google Scholar; IEEE Digital Library | 688; 2420; 24 | 23 | ‘Introduction’, ‘Data workflow systems for bioinformatics research’ and ‘DWFS as a platform for processing genomics data’ |
| Q2 | (“Workflows”[All Fields] AND “Drug Discovery”[All Fields]) OR (“Workflows”[All Fields] AND “Pharmacogenomics”[All Fields]) | PubMed; Google Scholar; IEEE Digital Library | 91; 472; 34 | 22 | ‘Introduction’ and ‘Data workflow systems for bioinformatics research’ |
| Q3 | (“Workflows”[All Fields] AND “Big Data”[All Fields]) OR (“Workflows”[All Fields] AND “Large Scale Data”[All Fields]) OR (“Workflows”[All Fields] AND “Bioinformatics”[All Fields]) | PubMed; Google Scholar; IEEE Digital Library | 552; 470; 39 | 48 | ‘Semantic Web and cloud services in action’, ‘Large-scale data management in the cloud for bioinformatics research’ and ‘Access to data with open data formats and Semantic technologies’ |
| Q4 | (“Workflows”[All Fields] AND “Semantic Web”[All Fields]) OR (“Workflows”[All Fields] AND “Semantic”[All Fields]) OR (“Workflows”[All Fields] AND “Bioinformatics”[All Fields]) | PubMed; Google Scholar; IEEE Digital Library | 570; 2600; 3 | 13 | ‘Advancing DWFS through Semantic Web and cloud technologies’ |
| Q5 | (“Workflows”[All Fields] AND “Provenance”[All Fields]) | PubMed; Google Scholar; IEEE Digital Library | 25; 8100; 9896 | 9 | ‘Semantic Web and cloud services in action’, ‘Data workflow systems for bioinformatics research’ and ‘Advancing DWFS through Semantic Web and cloud technologies’ |
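The search queries in this table all follow one pattern: a base term paired with each topic term via AND, with the pairs joined by OR. As a minimal sketch, the following Python assembles such a query string. The helper names `all_fields` and `build_query` are hypothetical, the code uses straight quotes instead of the typographic quotes shown in the table, and it only builds the string; it does not submit anything to a search engine.

```python
from typing import Iterable


def all_fields(term: str) -> str:
    """Wrap a term in quotes and tag it with the [All Fields] qualifier."""
    return f'"{term}"[All Fields]'


def build_query(base: str, topics: Iterable[str]) -> str:
    """Pair `base` with every topic via AND, then join the pairs with OR."""
    pairs = [f"({all_fields(base)} AND {all_fields(topic)})" for topic in topics]
    return " OR ".join(pairs)


# Rebuilds the pattern of query Q1 from the table above.
q1 = build_query(
    "workflows",
    ["next generation sequencing", "genomics", "Bioinformatics"],
)
print(q1)
```

Generating the strings this way keeps the five queries structurally consistent and avoids stray whitespace inside the quoted terms.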