
MIPS: analysis and annotation of proteins from whole genomes in 2005.

H W Mewes¹, D Frishman, K F X Mayer, M Münsterkötter, O Noubibou, P Pagel, T Rattei, M Oesterheld, A Ruepp, V Stümpflen.

Abstract

The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information, enabling the representation of complex information such as functional modules or networks (Genome Research Environment System, GenRE); (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework; and (iii) the compilation and manual annotation of information related to interactions such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).

Year: 2006 PMID: 16381839 PMCID: PMC1347510 DOI: 10.1093/nar/gkj148

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

FROM GENE-CENTRIC TO NETWORK-BASED REPRESENTATION OF GENOME INFORMATION

Since the creation of the very first genome databases in 1992 (1), the data structures as well as the information content of genome databases have undergone little change. Essentially, the concept of genome databases is gene centered, and the sequence-associated information does not reach beyond the individual functional properties of a gene, such as EC numbers or the annotation of certain classification types such as functional categories [e.g. FunCat (2) or GeneOntology (3)]. An important contribution of novel experimental techniques during the last few years has been the computational analysis of modules and networks based on the combination of independent types of information (4,5). To understand complex cellular processes, it is necessary to uncover the functional context of any individual gene. As a consequence, there is an urgent need to build information resources that enable the integration of different types of data as well as the quantitative evaluation of their reliability at inferring function as part of the annotation process. An essential but so far unsolved challenge for the annotation of gene properties within the functional context is the very dynamic change of the underlying data. In contrast to genome sequence data, which are complete for hundreds of species and for which robust and mathematically rigorous algorithms are available, interaction information for most organisms is highly incomplete and the transfer of information between species is not straightforward [e.g. (6)]. From the user's point of view, a number of basic requirements have to be met for genome databases to provide complex functional information. The gene-centered view has to be extended from the single gene to encompass a network perspective, and metabolic, regulatory and interactive dimensions have to be included.
In addition, to enable a comprehensible interaction-based view, the so-called ‘giant’ networks need to be subclustered to reflect their underlying modular structure, and the edges of the functional interaction graphs must be quantitatively labeled. Both views need to be accessible through browsers as well as through suitable computational interfaces. Obviously, the current state of genome databases does not fulfill these requirements. In this paper, we will describe the current state of our developments to achieve these long-term goals for a limited number of model organisms (fungi including yeast, Arabidopsis thaliana and other plant models, and the mouse genome).

GENRE, THE GENERIC MODEL FOR COMPLEX GENOME INFORMATION

Genome information is traditionally stored in databases containing entries as instances of predefined rigid data structures (i.e. formats). However, the development of concepts to cope with complex data structures within a database becomes practically unmanageable as soon as several independent data sources have to be covered. Therefore, information pointing to the same biological objects is distributed over a large number of independent and often syntactically incompatible databases (e.g. nucleic acids, proteins, protein interactions, metabolic and regulatory networks and the like). While passive integration of these databases is feasible through database indexing and integration of flat files or web resources [e.g. PubMed (7)], it does not allow for the semantic integration required for comprehensive annotation purposes (Table 1).

Table 1

URL addresses for MIPS database resources

Project description	Link
Project overview
Arabidopsis thaliana genome (MATDB)
CABiNet: Comprehensive Network Analysis
Complete Genomes (PEDANT server)
Comprehensive Yeast Genome Database (CYGD)
Database of Human cDNAs (DHGP)
FunCat: Functional Catalogue of Proteins
GABI: Genome analysis in plants
GenRE: Genome Research Environment
MIPS Neurospora crassa Database (MNCDB)
MOsDB: Rice Genome Database
MPPI: Mammalian Protein–Protein Interactions
SIMAP: Similarity Matrix of Proteins
The Lotus Genome Database (Lotus japonicus)
MPCDB: Mammalian Protein Complex Database
URMELDB: European Medicago and Legume Database (Medicago)
FGDB: Fusarium graminearum genome database
MUMDB: Ustilago maydis genome database
MPact: Representation of interaction data at MIPS

The Munich Information Center for Protein Sequences (MIPS) Genome Research Environment System (GenRE) provides a flexible technology to cope with the needs of biological data representation. It is a J2EE-based multi-tier architecture, implemented with established software design patterns. Seamless integration of distributed information resources (databases and applications) is realized with Enterprise Java Beans (EJBs) capable of retrieving information in XML format for straightforward web publishing, including expression-based queries similar to PubMed. Internally, GenRE is based on three different types of objects and components. Components of the first type are responsible for the access of applications and databases. These EJBs provide a uniform interface while hiding the data-resource-dependent access mechanisms in the data integration tier. Databases, for example, are typically accessed via Hibernate, an object-relational mapping tool, whereas applications are often accessed directly. Input and output are commonly XML documents and data objects. Data objects, which represent the second type within GenRE, abstract biological entities such as genes, proteins or even complexes at a semantic level. In this layered approach, the data object level is unambiguously separated from the underlying data sources. These objects are used for semantic integration by a third type of component. These components are realized as EJBs and are responsible for any further information processing. This allows the association of any biochemical entities (e.g. RNA, drugs, etc.) either with an entity describing binary relationships (e.g. protein interactions from yeast two-hybrid experiments) or with many-to-many relationships [e.g. functional assignments using the MIPS FunCat (2)]. Hence GenRE not only allows for the flexible creation of different object types needed to include various types of ‘omics’ data, but is also capable of incorporating relations between instances from different data sources.
Even complex data models suitable for handling biological networks together with functional annotation of the distinct nodes are realized. In combination with integrated applications like SIMAP, components for comparative proteomics can be realized. The MIPS protein–protein interaction resource (MPact) (8) is illustrative of the advantage of our approach to extend the single-protein view into a network perspective. The data model allows extensions for interactions of proteins with other biochemical entities (e.g. RNA, drugs, etc.). Interacting objects can be associated not only with each other, to represent complexes for example, but also with external information describing the corresponding experiments (e.g. yeast two-hybrid, co-immunoprecipitation or mass spectrometry data). In the same way, information about the evidence of interactions and various protein annotations such as functions, motifs and cellular localization are associated with the interacting object. Owing to the object-oriented approach, any instance of the interacting object (notably a protein) can furthermore be associated with the corresponding entry, e.g. in a genome database. Our implementation allows two different approaches to query the repository. On the one hand, a gene-centric or protein-centric query is possible where distinct interacting objects can be retrieved within a specific context. Since the proteins and interactions are ‘decorated’ with annotation information, it is possible to query for specific attributes of the proteins (e.g. functions) or the interactions (e.g. evidence). On the other hand, network-centric queries can be performed. It is possible to query both for the nodes (the proteins) and the edges (the interactions) of the graph. Based on functional annotation, a traversal of the network graph is possible.
This traversal can be used to quickly scan the network for false-positive interactions between proteins whose functions and/or localizations differ completely, or to assign new functions to proteins without known function that interact specifically within a certain functional context. Furthermore, extraction of sub-networks based on any associated context (function, localization, experiment) is possible. It is relevant to point out that our approach enables seamless context-dependent views starting from single genes and ending with complete networks. MIPS databases are implemented in the GenRE environment.
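
The two query styles described above can be illustrated with a small sketch. This is not GenRE code (GenRE is realized with EJBs in a J2EE architecture); the class names, annotation fields and the placeholder proteins P1 to P3 are assumptions made purely for illustration. The sketch shows how nodes and edges 'decorated' with annotation support a protein-centric query, the extraction of a sub-network by functional context, and a scan for interactions between proteins that share neither function nor localization.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Protein:
    """Node: a protein decorated with annotation (hypothetical fields)."""
    name: str
    functions: frozenset = frozenset()
    localizations: frozenset = frozenset()


@dataclass(frozen=True)
class Interaction:
    """Edge: a binary interaction decorated with its experimental evidence."""
    a: str
    b: str
    evidence: str


@dataclass
class Network:
    proteins: dict = field(default_factory=dict)      # name -> Protein
    interactions: list = field(default_factory=list)  # list of Interaction

    def add_protein(self, protein):
        self.proteins[protein.name] = protein

    def add_interaction(self, a, b, evidence):
        self.interactions.append(Interaction(a, b, evidence))

    def partners(self, name):
        """Protein-centric query: all interaction partners of one protein."""
        found = set()
        for edge in self.interactions:
            if edge.a == name:
                found.add(edge.b)
            elif edge.b == name:
                found.add(edge.a)
        return found

    def subnetwork(self, function):
        """Network-centric query: edges whose two proteins share a function."""
        return [
            e for e in self.interactions
            if function in self.proteins[e.a].functions
            and function in self.proteins[e.b].functions
        ]

    def suspect_edges(self):
        """Edges joining proteins with no common function or localization."""
        suspects = []
        for e in self.interactions:
            p, q = self.proteins[e.a], self.proteins[e.b]
            shares_function = bool(p.functions & q.functions)
            shares_location = bool(p.localizations & q.localizations)
            if not shares_function and not shares_location:
                suspects.append(e)
        return suspects


# Toy data: protein names and annotations are placeholders, not MPact entries.
net = Network()
net.add_protein(Protein("P1", frozenset({"protein folding"}), frozenset({"cytoplasm"})))
net.add_protein(Protein("P2", frozenset({"protein folding"}), frozenset({"cytoplasm"})))
net.add_protein(Protein("P3", frozenset({"transcription"}), frozenset({"nucleus"})))
net.add_interaction("P1", "P2", "co-immunoprecipitation")
net.add_interaction("P1", "P3", "yeast two-hybrid")
```

In this toy network, `net.suspect_edges()` reports only the P1/P3 interaction. Such a flag marks a candidate for closer inspection rather than a confirmed false positive, since annotation is itself incomplete.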

SIMAP AND CABINET: COMPUTATIONAL METHODS TO GENERATE INFORMATION FOR GENOME DATABASES

Pairwise similarity comparison of every protein against the set of all known proteins is an indispensable step in any annotation process. Many biological and evolutionary questions are related to the structure of the sequence space and its partitioning into substructures represented by the all-against-all similarity relations. However, individual searches for homologs do not allow structuring the sequence universe. The MIPS Similarity Matrix of Proteins (SIMAP) (9) currently contains a matrix of all-against-all comparisons of more than 4 million proteins from >400 organisms, including all UniProt sequences. The matrix entries have been generated by exhaustive sequence similarity searches using FASTA (10). SIMAP is continuously updated to keep up with the rapidly increasing amount of data. The linked sequence feature database, containing information on protein domains as well as predicted transmembrane regions, signal peptides and protein localization, supports efficient post-processing of homology lists into sub-clusters of homogeneous sequence properties. These steps are also essential to identify inconsistencies in genome annotation. SIMAP serves as an example of a system offering highly dynamic information in the form of a persistent database to be explored systematically at high performance. SIMAP is also used as an annotation tool and generates similarity input information for the PEDANT databases (11). In contrast to the straightforward similarity matrix, the compilation of biological interaction information requires methods dealing with mostly incomplete but also inherently heterogeneous sets of data. To provide a system for comprehensive network analysis, we have developed CABiNet (Comprehensive Analysis of Biomolecular Networks) within the framework of GenRE. CABiNet offers a set of methods for network statistics, integration, analysis and clustering applied to interaction data administered by GenRE as well as to user-submitted network data.
CABiNet allows the user to browse and query across any subset of the generated networks and clusters. As with all methods in GenRE, the software design of CABiNet allows an easy adoption of new functions into the system. To address the issue of inconsistent use of protein identifiers in different networks, a GenRE component to resolve protein identifiers and aliases in all major sequence databases is employed by CABiNet.
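
Why identifier resolution matters for network integration can be shown with a minimal sketch. It does not reproduce the GenRE component; the alias table, the identifiers (PROT0001, geneA, ACC_A and so on) and the function names are hypothetical. Edges from two networks that name the same proteins differently are mapped to canonical identifiers before being merged, so that the same interaction is counted once.

```python
def build_alias_index(alias_table):
    """Map every known alias (case-insensitive) to one canonical identifier."""
    index = {}
    for canonical, aliases in alias_table.items():
        for alias in {canonical, *aliases}:
            index[alias.upper()] = canonical
    return index


def normalize_edges(edges, index):
    """Rewrite edges to canonical identifiers; unresolved names are kept as-is."""
    normalized = set()
    for a, b in edges:
        ca = index.get(a.upper(), a)
        cb = index.get(b.upper(), b)
        normalized.add(frozenset((ca, cb)))  # undirected: order does not matter
    return normalized


# Hypothetical alias table: canonical identifier -> other names in use.
ALIASES = {
    "PROT0001": ["geneA", "ACC_A"],
    "PROT0002": ["geneB", "ACC_B"],
}

# Two toy networks that describe one shared interaction with different names.
network_one = [("geneA", "geneB")]
network_two = [("ACC_B", "PROT0001"), ("geneB", "geneC")]

index = build_alias_index(ALIASES)
merged = normalize_edges(network_one, index) | normalize_edges(network_two, index)
```

Without the normalization step the merged network would contain three edges; with it, the shared interaction collapses into one and only two edges remain. Keeping unresolved names (here `geneC`) instead of discarding them is a design choice of this sketch, not a documented behaviour of CABiNet.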

THE MAMMALIAN PROTEIN COMPLEX DATABASE

High quality data collections are needed as a reference set for testing computational methods in network analysis. In the case of protein–protein interaction data, any collection from various sources including high-throughput experiments contains a considerable number of false-positive and false-negative results (12). Thus, for the analysis of protein networks there is a need for gold standards (12) in order to validate the quality of data resources or as a reference for testing the reliability of computational methods. The cooperation and interaction of proteins is most unambiguously found in protein complexes, where several proteins simultaneously act together to perform a single reaction. For the analysis of protein networks in lower eukaryotes, the protein complexes from yeast have become a gold standard in the field (8). Analysis of the even more elaborate protein networks in mammals requires manually curated resources in their own right. Information in the Mammalian Protein Complex Database (MPCDB) is extracted from scientific literature describing individual experiments; data stemming from high-throughput experiments have not been incorporated. Information about protein complexes includes gene names of the members as well as protein names, cross-references and literature references. If a protein complex has been analysed in other mammals, the orthologous mouse proteins are presented. In addition, the type of experiment that was used to analyse the protein complex is given. Evidence is structured according to the MIPS evidence catalogue, originally developed for yeast protein complex and protein–protein interaction data and subsequently adopted for higher organisms. This is in line with the requirements for PSI-MI compliant annotation (8). Manual annotation of the respective proteins is performed in the Mouse functional Genome Database (MfunGD).
Here, protein function annotation is performed with the FunCat (2) annotation scheme, which showed the highest predictive power for the analysis of protein networks in a comparison of several individual features (13). Whereas the vast majority of protein complexes in BIND (14) contain fewer than three different proteins, the protein complexes within MPCDB contain 4.6 different proteins on average. Currently, MPCDB contains 122 protein complexes with a total of 643 proteins. High molecular weight protein complexes like proteasomes and the eukaryotic chaperonin TRiC perform central functions in cells, such as protein degradation and protein folding, respectively. These two protein complexes are examples that, so far, are not present in any manually curated, publicly available database of protein complexes. Hence, the MPCDB dataset provides scientists with a reliable data resource for the analysis of protein networks and functional modules in mammals.
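
One common way to use a curated complex collection as a reference for computational methods is to expand each complex into co-complex protein pairs and score a predicted interaction set against them. The sketch below illustrates that general idea only; it is not a procedure described for MPCDB, and the complex names, protein names and function names are placeholders.

```python
from itertools import combinations


def co_complex_pairs(complexes):
    """Expand each complex into all unordered pairs of its member proteins."""
    pairs = set()
    for members in complexes.values():
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(predicted, reference):
    """Score predicted pairs against reference pairs (stored as sorted tuples)."""
    predicted = {tuple(sorted(p)) for p in predicted}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy reference set; complex and protein names are placeholders.
GOLD = {"complexX": ["A", "B", "C"], "complexY": ["D", "E"]}
reference = co_complex_pairs(GOLD)

# Toy predictions from some hypothetical method under evaluation.
predicted = [("B", "A"), ("A", "D"), ("D", "E"), ("C", "A")]
precision, recall = precision_recall(predicted, reference)
```

A predicted pair that is absent from the reference is counted here as a false positive, which is only meaningful for proteins the reference actually covers; in practice the prediction set would usually be restricted to such proteins before scoring.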

SUMMARY

The MIPS data resource aims to extend the scope of its curated genome databases towards functional context information. These databases include model systems for mammals (Mus musculus, in progress), fungi (yeast, CYGD; Neurospora crassa, MNCDB; Ustilago maydis, MUMDB; Fusarium graminearum, FGDB), plants (A.thaliana, MATDB; Oryza sativa, MOsDB; Maize Genome Sequencing Project, MGSP; European Medicago and Legume Database, URMELDB; Lotus japonicus) and microorganisms (Chlamydiae, Listeriae), as well as the comprehensive automatic annotation of genomes in the PEDANT database (11). In addition, we have compiled manually curated reference protein–protein interaction datasets for mammals (MPPI) and yeast (MPact) as well as the database for mammalian protein complexes (MPCDB). Proteins in these databases are functionally classified using the well-established functional catalogue FunCat (2).

14 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

Review 2. Conservation of protein-protein interactions - lessons from ascomycota.

Authors: Philipp Pagel; Hans Werner Mewes; Dmitrij Frishman
Journal: Trends Genet Date: 2004-02 Impact factor: 11.639

3. STRING: a database of predicted functional associations between proteins.

Authors: Christian von Mering; Martijn Huynen; Daniel Jaeggi; Steffen Schmidt; Peer Bork; Berend Snel
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. BIND: the Biomolecular Interaction Network Database.

Authors: Gary D Bader; Doron Betel; Christopher W V Hogue
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. Integration of genomic datasets to predict protein complexes in yeast.

Authors: Ronald Jansen; Ning Lan; Jiang Qian; Mark Gerstein
Journal: J Struct Funct Genomics Date: 2002

6. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes.

Authors: Andreas Ruepp; Alfred Zollner; Dieter Maier; Kaj Albermann; Jean Hani; Martin Mokrejs; Igor Tetko; Ulrich Güldener; Gertrud Mannhaupt; Martin Münsterkötter; H Werner Mewes
Journal: Nucleic Acids Res Date: 2004-10-14 Impact factor: 16.971

7. The complete DNA sequence of yeast chromosome III.

Authors: S G Oliver; Q J van der Aart; M L Agostoni-Carbone; M Aigle; L Alberghina; D Alexandraki; G Antoine; R Anwar; J P Ballesta; P Benit
Journal: Nature Date: 1992-05-07 Impact factor: 49.962

8. Using the FASTA program to search protein and DNA sequence databases.

Authors: W R Pearson
Journal: Methods Mol Biol Date: 1994

9. Systematic interpretation of genetic interactions using protein networks.

Authors: Ryan Kelley; Trey Ideker
Journal: Nat Biotechnol Date: 2005-05 Impact factor: 54.908

10. The PEDANT genome database in 2005.

Authors: M Louise Riley; Thorsten Schmidt; Christian Wagner; Hans-Werner Mewes; Dmitrij Frishman
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971