Workflow management systems for gene sequence analysis and evolutionary studies - A Review.

Abstract

The post-'omic' era has resulted in the development of many primary, secondary and derived databases. Many analytical and visualization tools have been developed to manage and analyze the data available through large sequencing projects. The heterogeneity of these databases and tools makes it difficult for researchers to access information from varied sources and to run different bioinformatics tools to obtain the desired analysis. Integrating databases, tools and algorithms into unified platforms is one of the most challenging tasks facing the bioinformatics community. This article describes the workflow management systems that have been developed for gene sequence analysis and phylogeny. It will be useful for biotechnologists, molecular biologists, computer scientists and statisticians engaged in computational biology and bioinformatics research.

Keywords: Analysis; bioinformatics; databases; integration; phylogeny; workflows

Year: 2013 PMID: 23930017 PMCID: PMC3732438 DOI: 10.6026/97320630009663

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Modern biology is driven by large-scale processing of heterogeneous data that come from diverse sources, such as sequences from GenBank, EMBL, PDB, DDBJ and PROSITE, NGS data and many other secondary databases. The interfaces that give access to these data sources vary widely. Therefore, to use these resources a researcher needs expertise in very different areas of computer science, such as databases, networking and scripting languages. Furthermore, the algorithms and tools used to extract biologically relevant information are being developed at a fast pace, but in isolation: there is hardly any code sharing among data analysis tools, while code complexity keeps increasing. Gene sequence analysis and the study of evolutionary relationships among organisms are two major areas of interest to biologists. Gene sequence analysis involves identifying stretches of DNA sequence that are biologically functional, whereas evolutionary studies infer biological relationships among different organisms. In silico identification of coding regions in genomes and phylogenetic studies are important problems that have been brought into focus by advances in genomic sequencing. The variety of tools, which run on different platforms and rely on differently structured, heterogeneous databases, makes this analysis a difficult task for biologists. There is therefore an urgent need for solutions that integrate these tools and databases and relieve biologists of the burden of executing them independently on different platforms. A Workflow Management System (WMS) integrates several bioinformatics tools with multiple databases to automate the analysis and storage of genomic sequences. Several WMSs have been developed so that researchers can perform computational analyses with ease using various computational tools. These workflow systems differ in scope and in their approach to integration and execution.
Many of these WMSs are available as web-based servers that give inexperienced users access to powerful computing resources through a familiar graphical environment. This saves users the time needed to install software on their own computers before analyzing biological data. Standalone workflow systems integrate various bioinformatics tools within desktop applications using graphically specified workflows, and can also give biologists access to distributed computational resources. Although many WMSs have been developed for gene sequence analysis and evolutionary studies, no attempt has been made to compile them in one place together with the bioinformatics tools used at each stage of analysis. The objective of this article is to provide biotechnologists, molecular biologists and other researchers with comprehensive information on the available WMSs, along with their limitations and practical considerations on their usage. It also provides a review of the various tools for computational scientists who are actively involved in the development of WMSs.

Workflow Management Systems for Gene Sequence Analysis:

Gene sequence analysis involves the identification of features in a genome such as genes, transcription initiation sites, poly(A) cleavage sites, 5' and 3' untranslated regions (UTRs) and promoter regions. These features are derived by transforming raw genomic sequences into information through the integration of computational tools, auxiliary biological data and biological knowledge. In silico identification and annotation of coding and non-coding sequences from a variety of genomes has become necessary because of the exponential increase in raw sequence data. Advanced sequencing technologies generate large volumes of multi-species genomic data, and manual curation and annotation of these data is difficult and time-consuming. Automatic computational solutions that aid the manual curation process are therefore highly desirable. Genome annotation involves various tasks such as gene finding, repeat finding, Expressed Sequence Tag (EST)/cDNA alignment, homology searching and protein family searching. Various biological workflows have been developed that integrate computational tools into automatic pipelines to perform this genomic annotation. A generic form of this workflow is shown in Figure 1. Table 1 (see supplementary material) lists some of the important computational tools used for the different tasks in the process. The major workflows for gene sequence analysis are compared in Table 2 (see supplementary material), along with the important tools they integrate.

Figure 1

General gene sequence analysis workflow system
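
The generic workflow of Figure 1 can be pictured as an ordered chain of stages that pass a shared record from one tool to the next. The following is a minimal Python sketch of that idea and not the implementation of any system reviewed here. The names `run_workflow`, `mask_repeats` and `find_orfs` are hypothetical, and the two stages are toy stand-ins (regular expressions) for real repeat finders and gene predictors.

```python
import re

def mask_repeats(record):
    # Toy stand-in for a repeat finder (assumption): mask homopolymer
    # runs of six or more bases with N.
    record["sequence"] = re.sub(
        r"A{6,}|C{6,}|G{6,}|T{6,}",
        lambda m: "N" * len(m.group()),
        record["sequence"],
    )
    return record

def find_orfs(record):
    # Toy stand-in for a gene finder (assumption): report forward-strand
    # open reading frames as (start, end) with an exclusive end.
    pattern = r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))"
    record["orfs"] = [
        (m.start(), m.start() + len(m.group(1)))
        for m in re.finditer(pattern, record["sequence"])
    ]
    return record

def run_workflow(stages, record):
    # Run each named stage in order on a copy of the record and keep
    # a history of the stages that were executed.
    history = []
    for name, stage in stages:
        record = stage(dict(record))
        history.append(name)
    record["history"] = history
    return record

stages = [("mask_repeats", mask_repeats), ("find_orfs", find_orfs)]
result = run_workflow(stages, {"sequence": "CCATGAAATTTTAGCC"})
print(result["orfs"], result["history"])
```

Keeping each stage as a plain function with the same signature is one simple way to let tools be swapped or reordered, which is the kind of flexibility the reviewed systems aim for.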

ESTpass [1] is a workflow for processing and annotating sequence data from ESTs. Its major advantages are the integration of the cleansing and annotating processes, rigorous chimeric EST detection, exhaustive annotation, and email reporting that informs users about progress and sends them the results. PseudoPipe [2] is a homology-based computational pipeline that searches a mammalian genome and identifies pseudogene sequences in a comprehensive and consistent manner. Its output is the complete annotation of the pseudogenes in a genome: their chromosomal location, nucleotide sequence, the name and sequence of the parent gene, and the alignment of each pseudogene with the functional gene. TIGR Gene Indices Clustering tools (TGICL) [3] is a pipeline for the analysis of large EST and mRNA databases in which sequences are first clustered on the basis of pairwise sequence similarity and then assembled cluster by cluster to produce longer, more complete consensus sequences. TGICL is used to generate the TIGR Gene Indices, which represent independent analyses for nearly 60 species with EST collections ranging from fewer than 10,000 to more than 4,000,000 sequences. EGene [4] is a generic, flexible and modular pipeline generation system that makes pipeline construction a modular job. EGene allows third-party programs to be used and integrated according to the needs of distinct projects, without any previous experience of programming or formal languages. It also provides a series of components for building sequence processing pipelines. MAKER [5] is a portable and easy-to-configure genome annotation pipeline. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions and automatically synthesizes these data into gene annotations with evidence-based quality values.
MAKER's modular construction breaks the annotation process down into a series of five discrete, easily interoperable activities: compute, filter/cluster, polish, synthesis and annotate. Protein Annotation Toolkit (PAT) [6] is an integrated bio-computing server that provides a standardized web interface to a wide range of protein analysis tools. It is designed as a streamlined analysis environment with many features that simplify studies of protein sequences and structures and improve productivity. Pipeline for Protein Annotation (PIPA) [7] annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm, which reconciles annotations from the integrated programs and databases. Automatic and manual Functional Annotation in a Web services Environment (AFAWE) [8] simplifies manual functional annotation by running different tools and workflows for automatic function prediction and displaying the results in a way that facilitates comparison. AFAWE includes analyses for homolog detection, protein domain search and phylogenomics.
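
The idea of mapping annotations from several sources into GO terms and then reconciling them can be illustrated with a small sketch. The Python below is an assumption-laden illustration, not PIPA's published algorithm: the simple voting rule, the function name `consensus_annotation`, the `min_support` parameter and all source names and identifiers are hypothetical placeholders.

```python
from collections import Counter

def consensus_annotation(predictions, to_go, min_support=2):
    """Map source-specific labels to GO terms and keep terms that at
    least `min_support` sources agree on (simple voting, an assumption)."""
    votes = Counter()
    for source, labels in predictions.items():
        # Translate each label through the mapping table; labels
        # without a known mapping are ignored.
        mapped = {
            to_go[(source, label)]
            for label in labels
            if (source, label) in to_go
        }
        # Each source votes at most once per GO term.
        votes.update(mapped)
    return sorted(term for term, n in votes.items() if n >= min_support)

# Placeholder identifiers, not real database accessions.
predictions = {
    "tool_a": ["A1", "A2"],
    "tool_b": ["B1"],
    "tool_c": ["C9"],
}
to_go = {
    ("tool_a", "A1"): "GO:term_X",
    ("tool_a", "A2"): "GO:term_Y",
    ("tool_b", "B1"): "GO:term_X",
}
print(consensus_annotation(predictions, to_go))
```

Lowering `min_support` to 1 turns the vote into a plain union of all mapped terms, which shows how the threshold trades coverage against agreement.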

Workflow Management Systems for Phylogenetic Analysis:

Phylogenetic and evolutionary analyses of sequences are among the most frequently used methodologies in laboratories working on functional, comparative and structural genomics. Phylogenetic analysis involves tasks such as multiple sequence alignment of uploaded sequences, curation of the resulting alignment, construction of phylogenetic trees and their visualization, as shown in Figure 2.

Figure 2

General phylogenetic workflow system
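
As one concrete instance of the alignment curation step in this workflow, columns that consist mostly of gaps are often removed before tree building. The Python sketch below shows that idea under stated assumptions: the function name `trim_gappy_columns` and the `max_gap_fraction` threshold are hypothetical, and dedicated curation tools apply more elaborate criteria.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds the threshold.

    `alignment` maps sequence names to aligned strings of equal length,
    with '-' marking a gap.
    """
    if not alignment:
        return {}
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned sequences must have equal length")
    n_rows = len(rows)
    # Keep the index of every column that is not dominated by gaps.
    keep = [
        i for i in range(len(rows[0]))
        if sum(row[i] == "-" for row in rows) / n_rows <= max_gap_fraction
    ]
    return {
        name: "".join(row[i] for i in keep)
        for name, row in zip(names, rows)
    }

toy = {"s1": "AC-GT", "s2": "A--GT", "s3": "ACTG-"}
print(trim_gappy_columns(toy))
```

In the toy alignment only the third column is removed, because two of its three characters are gaps.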

Further, execution of each of these tasks requires the use of specialized bioinformatics tools. Because many tools and web servers have been developed for phylogenetic and evolutionary analysis, many workflows have been developed to automate this process. Several web sites offer phylogenetic tree reconstruction. Some offer a single tool, while others bring together many of the most popular programs for phylogenetic reconstruction. A workflow pipeline integrates these commonly used computational tools in a flexible way and allows the user to plug in custom sequence databases as well as alternative analysis tools. This section describes the important workflow management systems developed for phylogenetic analysis. Table 3 (see supplementary material) lists some of the important computational tools used for the different tasks in this process. The major workflows for phylogenetic analysis are compared in Table 4 (see supplementary material), along with the important tools they integrate. PhyloGena [9] is a user-friendly, interactive graphical user interface running on desktop computers that automatically performs a Basic Local Alignment Search Tool (BLAST) search with the query sequences, selects a representative subset of the hits, creates a multiple alignment from the selected sequences, and finally computes a phylogenetic tree. Phylemon [10] is an online platform for phylogenetic and evolutionary analyses of molecular sequence data. Phylemon also provides facilities for file format conversion, gene concatenation, tree visualization and the computation of distances between trees. Automated Simultaneous Analysis Phylogenetics (ASAP) [11] is an automated technique developed to assemble multigene/multispecies matrices and to evaluate the significance of individual genes within the context of a given phylogenetic hypothesis.
Matrix assembly at the genome scale involves acquiring hundreds to thousands of gene regions for the taxa of interest, formatting these sequences for use in an alignment program, aligning them, and finally exporting the data partitions into the formats used by phylogenetic analysis packages. Hal [12] is a command-line program that brings together a number of bioinformatics applications into an efficient pipeline, which takes unaligned protein sequences in FASTA format as input and generates species trees from super alignments containing several orthologous protein sequences in a fully automated manner. The BioExtract Server [13] has been used to create a workflow for comparing and aligning a number of nucleotide sequences to build a phylogenetic tree. The web server Phylogeny.fr [14] is designed for non-specialists and offers up-to-date programs that are usually aimed at experts. Armadillo v1.1 [15] is a novel workflow platform dedicated to designing and conducting phylogenetic studies, including comprehensive simulations. As Armadillo is an open-source project, it allows scientists to develop their own modules as well as to integrate existing computer applications. TreeDomViewer [16] is a visualization tool, available through a web-based interface, that combines a phylogenetic tree description, a multiple sequence alignment and InterProScan data for the sequences, and generates a phylogenetic tree with the corresponding protein domain information projected onto the multiple sequence alignment.
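
The matrix assembly step described above, in which per-gene alignments are joined into a multigene matrix or super alignment with recorded data partitions, can be sketched in a few lines. This Python example is a simplified illustration and not the code of ASAP or Hal: the function name `build_supermatrix`, the gap-filling rule for taxa missing a gene and the 1-based partition coordinates are assumptions made for the example.

```python
def build_supermatrix(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into one matrix per taxon.

    `gene_alignments` maps gene name -> {taxon: aligned sequence}.
    Returns (matrix, partitions), where partitions is a list of
    (gene, start, end) with 1-based inclusive column coordinates.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    parts = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {gene!r} is not a valid alignment")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with the missing symbol.
            parts[taxon].append(aln.get(taxon, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    matrix = {taxon: "".join(chunks) for taxon, chunks in parts.items()}
    return matrix, partitions

genes = {
    "geneA": {"sp1": "ACGT", "sp2": "AC-T"},
    "geneB": {"sp1": "GG", "sp3": "GA"},
}
matrix, partitions = build_supermatrix(genes)
print(matrix)
print(partitions)
```

Recording the partition boundaries alongside the matrix is what later allows each gene to be analyzed or weighted separately by a phylogenetic analysis package.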

Conclusion

Analysis of 'omics' data using integrated bioinformatics tools through workflow management systems will help increase the productivity of researchers by reducing the time and effort spent on searching for and executing each tool independently on different platforms. This article attempts to compare the features and performance of workflows developed for gene sequence analysis and evolutionary studies. Some of the important issues that these workflows must address are security, scheduling, load balancing and resource pooling. There is a need to design workflows using an object-oriented approach to improve their reusability, portability and code sharing, and ultimately to reduce development effort.

1. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets.

Authors: Geo Pertea; Xiaoqiu Huang; Feng Liang; Valentin Antonescu; Razvan Sultana; Svetlana Karamycheva; Yuandan Lee; Joseph White; Foo Cheung; Babak Parvizi; Jennifer Tsai; John Quackenbush
Journal: Bioinformatics Date: 2003-03-22 Impact factor: 6.937

2. EGene: a configurable pipeline generation system for automated sequence analysis.

3. PhyloGena--a user-friendly system for automated phylogenetic annotation of unknown sequences.

4. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes.

5. PseudoPipe: an automated pseudogene identification pipeline.

6. TreeDomViewer: a tool for the visualization of phylogeny and protein domain structure.

7. Phylemon: a suite of web tools for molecular evolution, phylogenetics and phylogenomics.

8. PAT: a protein analysis toolkit for integrated biocomputing on the web.

9. Automated simultaneous analysis phylogenetics (ASAP): an enabling tool for phlyogenomics.

10. ESTpass: a web-based server for processing and annotating expressed sequence tag (EST) sequences.