PUMA2--grid-based high-throughput analysis of genomes and metabolic pathways.

Natalia Maltsev, Elizabeth Glass, Dinanath Sulakhe, Alexis Rodriguez, Mustafa H Syed, Tanuja Bompada, Yi Zhang, Mark D'Souza.

Abstract

The PUMA2 system (available at http://compbio.mcs.anl.gov/puma2) is an interactive, integrated bioinformatics environment for high-throughput genetic sequence analysis and metabolic reconstructions from sequence data. PUMA2 provides a framework for comparative and evolutionary analysis of genomic data and metabolic networks in the context of taxonomic and phenotypic information. Grid infrastructure is used to perform computationally intensive tasks. PUMA2 currently contains precomputed analysis of 213 prokaryotic, 22 eukaryotic, 650 mitochondrial and 1493 viral genomes and automated metabolic reconstructions for >200 organisms. Genomic data is annotated with information integrated from >20 sequence, structural and metabolic databases and ontologies. PUMA2 supports both automated and interactive expert-driven annotation of genomes, using a variety of publicly available bioinformatics tools. It also contains a suite of unique PUMA2 tools for automated assignment of gene function, evolutionary analysis of protein families and comparative analysis of metabolic pathways. PUMA2 allows users to submit batch sequence data for automated functional analysis and construction of metabolic models. The results of these analyses are made available to the users in the PUMA2 environment for further interactive sequence analysis and annotation.

Year: 2006 PMID: 16381888 PMCID: PMC1347457 DOI: 10.1093/nar/gkj095

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Evolutionary analysis of a wide spectrum of diverse organisms is essential for understanding how they adapt to environments. Common ancestry of eukaryotes and prokaryotes leads to similarity of many molecular functions. However, differences in organisms' structural complexity, physiology and lifestyle result in divergent evolution and emergence of variants of molecular function, metabolic organization and phenotypic features. Recent progress in genomics, bioinformatics and physiological studies allows for systematic exploration of adaptive mechanisms that led to diversification of biological systems. Such adaptive changes usually are not limited to one component of the system; on the contrary, in the process of adaptation, organisms undergo co-adaptive changes, such as the complementary changes of protein sequences to accommodate changes in an enzyme's active site or co-evolution of properties of different steps in metabolic pathways. PUMA2, available at http://compbio.mcs.anl.gov/puma2, provides an environment for studying co-evolution of genomes, metabolic pathways and enzymes. It also supports the comparative analysis of sequence data and metabolic pathways for identification, analysis and characterization of evolutionary patterns associated with particular phylogenetic neighborhoods or phenotypes. The system enables high-throughput automated analysis of genomes, development of metabolic reconstructions from sequence data and community curation of genomic and metabolic data (Figure 1).

Figure 1

Representation of the stages of analyses of genomes in PUMA2. Users can browse data that is integrated and analyzed in PUMA2 or submit their own sequences. Precomputed results for homology, domain architecture, functional analysis and metabolic reconstructions are provided in an interactive framework.

A number of excellent resources such as KEGG (1), MetaCyc (2) and IMG (3) support high-throughput analysis of genomes and metabolic reconstructions. Although PUMA2 has much in common with these systems, it offers a number of unique features. In brief, PUMA2 (i) supports the interactive development of user models for public and user-submitted genomes; (ii) utilizes Grid technology for computationally intensive tasks; and (iii) provides new tools for comparative analysis of genomes and metabolic networks in the framework of taxonomic and phenotypic information. These last two features represent major differences between PUMA2 and its predecessor, the WIT2 system (4). In addition to genome analysis, PUMA2 supports comprehensive annotation of genomes. The PUMA2 knowledge base integrates information from >20 public databases, including sequence information and annotations [e.g. NCBI (5), PIR (6) and UniProt (7)], structural information [e.g. PDB (8), CATH (9) and SCOP (10)], metabolic information [e.g. EMP (11), KEGG and ENZYME (12)], taxonomic information from NCBI, gene ontologies from Gene Ontology (GO) (13) and physiological information [e.g. NCBI, TIGR (14) and the literature]. Extensive cross-referencing facilitates easy navigation of the data in PUMA2. The sections below describe major capabilities of the PUMA2 system in more detail.

AUTOMATED HIGH-THROUGHPUT ANALYSIS OF GENOMES IN PUMA2

Currently, the PUMA2 system contains automated precomputed analysis of 213 prokaryotic, 22 eukaryotic, 650 mitochondrial and 1493 viral genomes and automated metabolic reconstructions for >200 organisms. All data in PUMA2 are periodically updated. Sequence data in PUMA2 are obtained from the public sequence databases or provided by the users. They are analyzed by a variety of bioinformatics tools [e.g. BLAST (15), Blocks (16), Pfam (17), PepStats (18) and TMHMM (19)], as well as PUMA2 tools for prediction of gene functions and evolutionary analysis of enzymatic functions [e.g. Chisel and PhyloBlocks]. Grid technology is used to perform computationally intensive tasks. The results of these precomputed analyses are integrated into the database and presented to the users. PUMA2 tools for gene function prediction use the precomputed BLAST, InterPro, Blocks and TMHMM results to perform rules-based classification of the unannotated sequences. PUMA2 also supports interactive analysis of sequences by the users, providing access to >40 publicly available bioinformatics tools.
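The rules-based classification step can be illustrated with a minimal sketch. The source does not describe the actual PUMA2 rules, so the `Evidence` fields, the E-value threshold and the three rules below are all assumptions chosen only to show how precomputed BLAST, domain and TMHMM results might be combined.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Evidence:
    """Precomputed results for one protein (hypothetical field names)."""
    blast_hit_function: Optional[str] = None   # function of the best BLAST hit
    blast_evalue: float = 1.0                  # E-value of that hit
    domain_functions: Set[str] = field(default_factory=set)  # e.g. from InterPro/Blocks
    tm_helices: int = 0                        # helix count predicted by TMHMM


def classify(ev: Evidence, max_evalue: float = 1e-20) -> str:
    """Assign a label using illustrative rules, not the PUMA2 rule set."""
    strong_hit = ev.blast_hit_function is not None and ev.blast_evalue <= max_evalue
    # Rule 1 (assumed): a strong BLAST hit confirmed by a domain match.
    if strong_hit and ev.blast_hit_function in ev.domain_functions:
        return ev.blast_hit_function
    # Rule 2 (assumed): a strong BLAST hit without domain support.
    if strong_hit:
        return f"putative {ev.blast_hit_function}"
    # Rule 3 (assumed): no usable homology, but several predicted helices.
    if ev.tm_helices >= 2:
        return "hypothetical membrane protein"
    return "hypothetical protein"
```

Ordering the rules from strongest to weakest evidence keeps each assignment traceable to the single rule that produced it, which is convenient when an expert later curates the result.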

PUMA2 environment for data curation by users

Although automated analysis may provide useful initial annotation of the genomes, these results require rigorous curation by expert biologists. PUMA2 provides registered users with tools for user-driven metabolic model development, reassignment of functions to genes and addition of comments. User annotations in PUMA2 are persistent and are presented to the user whenever that user is logged into the system.

PUMA2 sequence analysis tools

PUMA2 contains a suite of tools to facilitate the identification of evolutionary patterns and motifs characteristic of particular biological functions and their variations. The suite includes the following:

Chisel, a web-based computational workbench for evolutionary analysis of enzymatic sequences. Chisel utilizes information from the PUMA2 knowledge base to perform rules-based clustering and classification of annotated enzymatic sequences into functional categories. The resulting clusters are used for developing a library of Hidden Markov Models (HMM profiles) for particular enzymatic functions and (when possible) their taxonomic and phenotypic variations. These profiles are used by the classification tool for prediction of functions of hypothetical proteins.

PhyloBlocks, a tool that allows a user to interactively develop high-resolution HMM profiles for a particular protein family.

Tools for comparative analysis of enzymes and metabolic networks in a phenotypic and taxonomic framework. These tools let users ask questions such as ‘What metabolic pathways are common to hyperthermophilic organisms that live in aquatic environments?’ and ‘What variations of L-lactate dehydrogenase are characteristic of Firmicutes?’.
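A question such as the one about pathways shared by hyperthermophilic aquatic organisms reduces to a filter on phenotype followed by a set intersection. The sketch below assumes a simple list of organism records; the record layout, the trait names and the `common_pathways` function are hypothetical and do not reflect the PUMA2 schema or interface.

```python
def common_pathways(organisms, **traits):
    """Return the pathways present in every organism that matches all traits.

    `organisms` is a list of dicts with a "pathways" set plus arbitrary
    trait keys (a hypothetical layout used only for this illustration).
    """
    selected = [
        org for org in organisms
        if all(org.get(key) == value for key, value in traits.items())
    ]
    if not selected:
        return set()
    # Intersect the pathway sets of all matching organisms.
    return set.intersection(*(set(org["pathways"]) for org in selected))


# Toy records with placeholder names; not real annotation data.
organisms = [
    {"name": "organism A", "temperature": "hyperthermophilic", "habitat": "aquatic",
     "pathways": {"glycolysis", "gluconeogenesis"}},
    {"name": "organism B", "temperature": "hyperthermophilic", "habitat": "aquatic",
     "pathways": {"glycolysis", "methanogenesis"}},
    {"name": "organism C", "temperature": "mesophilic", "habitat": "soil",
     "pathways": {"glycolysis", "TCA cycle"}},
]

shared = common_pathways(organisms, temperature="hyperthermophilic", habitat="aquatic")
```

With the toy records above, `shared` contains only `"glycolysis"`, the one pathway present in both matching organisms.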

Metabolic reconstructions from sequence data in PUMA2

The technology of metabolic reconstructions from sequence data (1,4,20,21) has proved useful for developing organism- and process-specific functional models. PUMA2 currently contains automated metabolic reconstructions from the sequence data for >200 completely sequenced organisms. The system supports the development of automated metabolic reconstructions, which provide an initial basis for the development of expert-curated models. The reconstructions are based on pathway data from the EMP collection of enzymes and metabolic pathways, which is being developed by EMP Project Inc. EMP contains enzymatic information and metabolic diagrams accumulated from the literature and describes >3000 metabolic pathways in a structured, indexed and searchable form. PUMA2 provides tools for comparative analysis of metabolic networks that allow identification of variations of the metabolic pathways characteristic of particular organisms or taxonomic groups of organisms, identification of ‘missing’ enzymes and viewing of the pathway data in a larger context of hierarchy of biological processes. PUMA2 metabolic reconstructions provide links to sequence data and enzymatic data. Genomic data and metabolic models in PUMA2 are annotated with the GO terms and cross-referenced with BioPAX format (22). Such representation simplifies navigation and comparative analyses of metabolic networks.
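Identifying ‘missing’ enzymes can be thought of as a set difference between the EC numbers a pathway requires and the EC numbers assigned to a genome. The following sketch assumes that both are available as plain collections of EC strings; the `pathway_report` function and the input layout are hypothetical, and the EC numbers are four well-known glycolytic enzymes used only as an example.

```python
def pathway_report(pathways, genome_ecs):
    """For each pathway, report the fraction of enzymes found and those missing.

    `pathways` maps a pathway name to the EC numbers it requires;
    `genome_ecs` is the set of EC numbers assigned to the genome.
    Both layouts are assumptions made for this illustration.
    """
    genome_ecs = set(genome_ecs)
    report = {}
    for name, ecs in pathways.items():
        required = set(ecs)
        if not required:
            continue  # nothing to check for an empty pathway definition
        found = required & genome_ecs
        report[name] = {
            "coverage": len(found) / len(required),
            "missing": sorted(required - found),
        }
    return report


pathways = {
    "glycolysis (first steps)": [
        "2.7.1.1",   # hexokinase
        "5.3.1.9",   # glucose-6-phosphate isomerase
        "2.7.1.11",  # 6-phosphofructokinase
        "4.1.2.13",  # fructose-bisphosphate aldolase
    ],
}
genome_ecs = {"2.7.1.1", "4.1.2.13"}  # toy annotation set, not real data

report = pathway_report(pathways, genome_ecs)
```

A gap reported this way is only a candidate: the enzyme may be truly absent, replaced by a non-homologous alternative, or simply not yet annotated, which is why such lists are a starting point for expert curation.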

PUMA2 use of grid technology for high-throughput analysis

High-throughput computations in PUMA2 are performed by an automated Genome Analysis and Database Update (GADU) server with a grid-based distributed computational backend (23). GADU was developed as a collaboration between the ANL Globus group and the bioinformatics group in the framework of the NSF NCSA alliance. It leverages experience and technology developed by the GriPhyN (Grid Physics Network) project. High-throughput computations in GADU are performed by using distributed heterogeneous grid computing resources such as Grid2003 (24), OSG and the DOE Science Grid. Periodic updates of the PUMA2 system and analysis of the sequence data using BLAST, Blocks, Pfam, Chisel and the like are performed in the form of scientific workflows expressed and controlled by a ‘virtual data’ model using the Chimera Virtual Data System (25), which transparently maps computational workflows to distributed grid resources. With the GADU system, the complete processing of an average bacterial genome (∼4000 protein sequences) takes <3 h. This processing comprises analysis of the genome by BLAST and Blocks, automated assignment of functions to genes, development of automated metabolic reconstructions, integration of the resulting models into the PUMA2 framework and their presentation to the user via a web interface.
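Distributing the analysis of a genome over grid resources generally starts by splitting its protein set into independent batches, each of which can be sent to a different compute node. The sketch below shows only that splitting step under simple assumptions; `read_fasta`, `batches` and the batch size are hypothetical and are not part of GADU or the Chimera Virtual Data System.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records, size):
    """Split records into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [records[i:i + size] for i in range(0, len(records), size)]


fasta_text = ">p1\nMKT\nAAL\n>p2\nMSS\n>p3\nMAA\n>p4\nMGG\n>p5\nMLL\n"
jobs = batches(read_fasta(fasta_text), size=2)
```

For a genome of about 4000 proteins, an assumed batch size of 250 would yield 16 independent jobs; the appropriate size in practice depends on the tool being run and on the resources available.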

PUMA2 support of genomes provided by users

Some users are interested in analyses of genomes that are not yet included in the PUMA2 framework. PUMA2 provides automated analyses of user-submitted sets of sequences. These sequence sets may be complete or incomplete genomes or sets of sequences of interest to the user. Small sets (up to 50 sequences) of user-submitted protein sequences may be analyzed via the PUMA2 website. Analysis of larger amounts of user-provided genomic data is available on request.

AVAILABILITY

PUMA2 is available for use via the web-based user interface at http://compbio.mcs.anl.gov/puma2. The following datasets are available on request: metabolic data from the EMP database in BioPAX (OWL) format and precomputed results from organism-specific analyses by BLAST, Blocks and Chisel. Requests may be sent via email to puma2@mcs.anl.gov.

24 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The ENZYME database in 2000.

Authors: A Bairoch
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

4. The Gene Ontology (GO) database and informatics resource.

Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. The KEGG resource for deciphering the genome.

Authors: Minoru Kanehisa; Susumu Goto; Shuichi Kawashima; Yasushi Okuno; Masahiro Hattori
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. MetaCyc: a multiorganism database of metabolic pathways and enzymes.

Authors: Cynthia J Krieger; Peifen Zhang; Lukas A Mueller; Alfred Wang; Suzanne Paley; Martha Arnaud; John Pick; Seung Y Rhee; Peter D Karp
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

7. A reconstruction of the metabolism of Methanococcus jannaschii from sequence data.

Authors: E Selkov; N Maltsev; G J Olsen; R Overbeek; W B Whitman
Journal: Gene Date: 1997-09-15 Impact factor: 3.688

8. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

9. The metabolic pathway collection from EMP: the enzymes and metabolic pathways database.

Authors: E Selkov; S Basmanova; T Gaasterland; I Goryanin; Y Gretchkin; N Maltsev; V Nenashev; R Overbeek; E Panyushkina; L Pronevitch; E Selkov; I Yunus
Journal: Nucleic Acids Res Date: 1996-01-01 Impact factor: 16.971

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
